Comparing Distributions (2023)

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers.]

When we want to see how something varies across categories, the trellis or small multiple plot is a good friend. We repeatedly draw the same graph once for each category, lining them up in a way that makes them comparable. Here’s an example from my book, using the gapminder data, which provides a cross-national time series of GDP per capita for many countries.

```r
library(tidyverse)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))

p + geom_line(color = "gray70", aes(group = country)) +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~ continent, ncol = 5) +
  labs(x = "Year",
       y = "GDP per capita",
       title = "GDP per capita on Five Continents",
       subtitle = "Individual countries shown in gray, trend in blue.")
```

[Figure 1: GDP per capita on Five Continents.]

Sometimes we’re interested in comparing distributions across categories in a similar way. In particular, I’m interested in cases where we want to compare a distribution to some reference category, as when we look at subpopulations in comparison to an overall distribution.

Generate some population and subgroup data

Say we have some number of observed units, e.g., three thousand “counties” or whatever. Each county has some population. Across all counties, the population is distributed log-normally. Within counties, the populations are divided into three groups. The particular proportions of groups A, B, and C will vary across counties but always sum to one within each county.

```r
## Keep track of labels for as_labeller() functions in plots later.
grp_names <- c(`a` = "Group A",
               `b` = "Group B",
               `c` = "Group C",
               `pop_a` = "Group A",
               `pop_b` = "Group B",
               `pop_c` = "Group C",
               `pop_total` = "Total",
               `A` = "Group A",
               `B` = "Group B",
               `C` = "Group C")

set.seed(1243098)

N <- 3e3
alphas <- c(1.5, 0.9, 2)
pop <- round(rlnorm(N, meanlog = 10.3, sdlog = 1.49), 0)

df <- as_tibble(gtools::rdirichlet(N, alphas), .name_repair = "unique") %>%
  rename_with(~ c("a", "b", "c")) %>%
  rowid_to_column("unit") %>%
  add_column(pop_total = pop) %>%
  mutate(across(a:c,
                .fns = list(pop = ~ round(.x * pop_total, 0) + 1),
                .names = "{fn}_{col}"))

df
## # A tibble: 3,000 × 8
##     unit     a      b     c pop_total pop_a pop_b  pop_c
##    <int> <dbl>  <dbl> <dbl>     <dbl> <dbl> <dbl>  <dbl>
##  1     1 0.156 0.196  0.648      6467  1008  1269   4194
##  2     2 0.288 0.154  0.558    211075 60729 32579 117771
##  3     3 0.165 0.391  0.443    128243 21186 50184  56876
##  4     4 0.294 0.124  0.582     76843 22561  9555  44730
##  5     5 0.301 0.146  0.553     12178  3671  1780   6730
##  6     6 0.397 0.148  0.455      2707  1076   401   1232
##  7     7 0.364 0.258  0.378    143261 52112 36987  54166
##  8     8 0.859 0.0375 0.103     61109 52517  2290   6305
##  9     9 0.129 0.477  0.394     61718  7968 29433  24320
## 10    10 0.185 0.182  0.632      3217   597   588   2035
## # … with 2,990 more rows
```

In the tibble we’ve just made up, unit is our county; a, b, and c are the proportions of the groups within each county; and the pop_ columns are the total populations and the subgroup populations. We make a vector of 3,000 populations using rlnorm and plausible values based on the mean and standard deviation of the logged population of actual US counties. A call to rdirichlet produces the matrix of subgroup proportions where each row sums to one. Then we multiply the populations by their respective proportions, and now we have a world of three thousand counties, each with some population that we’ve also broken out by group.
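As a quick sanity check on this setup, here is a minimal sketch in base R. It assumes you would rather not depend on gtools, so it draws Dirichlet proportions by normalizing independent gamma draws, which is a standard construction. The helper name rdirichlet_base() is made up for illustration and is not part of the original code, and the random draws will not match the tibble printed above.

```r
# Hypothetical helper (not from the original post): draw n Dirichlet vectors
# by normalizing independent Gamma(alpha_k, 1) draws within each row.
rdirichlet_base <- function(n, alpha) {
  k <- length(alpha)
  # Column j holds n draws with shape alpha[j] (matrix() fills column-wise).
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)
}

set.seed(1243098)
N <- 3e3
alphas <- c(1.5, 0.9, 2)

props <- rdirichlet_base(N, alphas)
pop <- round(rlnorm(N, meanlog = 10.3, sdlog = 1.49), 0)

# Each row of proportions sums to one, up to floating-point error.
row_sums_ok <- all(abs(rowSums(props) - 1) < 1e-12)

# Subgroup counts built the same way as above: round(share * total) + 1.
# Because of the rounding and the added 1, they sum to roughly total + 3.
pop_groups <- round(props * pop, 0) + 1
gap <- rowSums(pop_groups) - pop

row_sums_ok
range(gap)
```

The small gap between the subgroup counts and pop_total is harmless for the plots that follow, but it is worth knowing about if you try to reconcile the columns exactly.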

We can look at the distribution of the group proportions across counties:

```r
## my_oka is not defined in the code shown in this post. As an assumption,
## take three colors from base R's Okabe-Ito palette (R >= 4.0.0), skipping black.
my_oka <- unname(palette.colors(n = 4, palette = "Okabe-Ito")[2:4])

## x is the within-county proportion (columns a:c), so label the axis to match.
df %>%
  pivot_longer(a:c) %>%
  ggplot() +
  geom_area(mapping = aes(x = value,
                          y = ..count..,
                          color = name,
                          fill = name),
            stat = "bin",
            bins = 20,
            size = 0.5) +
  scale_fill_manual(values = alpha(my_oka, 0.7)) +
  scale_color_manual(values = alpha(my_oka, 1)) +
  guides(color = "none", fill = "none") +
  labs(x = "Proportion of County Population",
       y = "Count",
       title = "Subgroup distribution across units") +
  facet_wrap(~ name, nrow = 1, labeller = as_labeller(grp_names))
```

[Figure 2: Subgroup distribution across units.]

From now on let’s just work with the population counts.

```r
df <- df %>%
  select(unit, pop_a:pop_c, pop_total)

df
## # A tibble: 3,000 × 5
##     unit pop_a pop_b  pop_c pop_total
##    <int> <dbl> <dbl>  <dbl>     <dbl>
##  1     1  1008  1269   4194      6467
##  2     2 60729 32579 117771    211075
##  3     3 21186 50184  56876    128243
##  4     4 22561  9555  44730     76843
##  5     5  3671  1780   6730     12178
##  6     6  1076   401   1232      2707
##  7     7 52112 36987  54166    143261
##  8     8 52517  2290   6305     61109
##  9     9  7968 29433  24320     61718
## 10    10   597   588   2035      3217
## # … with 2,990 more rows
```

Here’s what our population totals look like across groups, including the total:

```r
df %>%
  pivot_longer(pop_a:pop_total) %>%
  group_by(name) %>%
  summarize(total = sum(value)) %>%
  ggplot(mapping = aes(x = total, y = name, fill = name)) +
  geom_col() +
  guides(fill = "none") +
  scale_fill_manual(values = alpha(c(my_oka[1:3], "gray40"), 0.9)) +
  scale_x_continuous(labels = scales::label_number_si()) +
  scale_y_discrete(labels = as_labeller(grp_names)) +
  labs(y = NULL, x = "Population")
```

[Figure 3: Population totals by group.]

Single panels

Now we can plot the group-level population distributions across counties. Again, we want to compare group distributions to one another and to the overall population distribution by county. A single-panel histogram showing all four distributions isn’t very satisfactory. Even though we’re using alpha to make the columns semi-transparent, it’s still very muddy.

```r
df %>%
  pivot_longer(cols = pop_a:pop_total) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = log(value),
                               y = ..count..,
                               color = name,
                               fill = name),
                 stat = "bin",
                 bins = 20,
                 size = 0.5,
                 alpha = 0.7,
                 position = "identity") +
  scale_color_manual(values = alpha(c(my_oka[1:3], "gray40"), 1),
                     labels = as_labeller(grp_names)) +
  scale_fill_manual(values = alpha(c(my_oka[1:3], "gray40"), 0.6),
                    labels = as_labeller(grp_names)) +
  labs(x = "Logged Population",
       y = "Count",
       color = "Group",
       fill = "Group",
       title = "Comparing Subgroups: Histograms",
       subtitle = "Overall distribution shown in gray")
```

[Figure 4: Comparing Subgroups: Histograms.]

If we use geom_density() rather than geom_histogram(), we get kernel density estimates for each group. These look a little better, but not great.

```r
df %>%
  pivot_longer(cols = pop_a:pop_total) %>%
  ggplot() +
  geom_density(mapping = aes(x = log(value),
                             color = name,
                             fill = name),
               alpha = 0.5) +
  scale_color_manual(values = alpha(c(my_oka[1:3], "gray40"), 1),
                     labels = as_labeller(grp_names)) +
  scale_fill_manual(values = alpha(c(my_oka[1:3], "gray40"), 0.6),
                    labels = as_labeller(grp_names)) +
  labs(x = "Logged Population",
       y = "Density",
       title = "Comparing Subgroups: Density",
       color = "Group",
       fill = "Group")
```

[Figure 5: Comparing Subgroups: Density.]

Better, but still not great. A very serviceable compromise that has many of the virtues of a small multiple but has the advantage of keeping things in one panel is a ridgeline plot, courtesy of geom_density_ridges() from Claus Wilke’s ggridges package:

```r
## geom_density_ridges() and theme_ridges() come from the ggridges package.
library(ggridges)

## The font named in theme_ridges() must be installed locally;
## drop the font_family argument to use the default.
df %>%
  pivot_longer(cols = pop_a:pop_total) %>%
  ggplot() +
  geom_density_ridges(mapping = aes(x = log(value + 1),
                                    y = name,
                                    fill = name),
                      color = "white") +
  scale_fill_manual(values = alpha(c(my_oka[1:3], "gray40"), 0.7)) +
  scale_y_discrete(labels = as_labeller(grp_names)) +
  guides(color = "none", fill = "none") +
  labs(x = "Logged Population",
       y = NULL,
       title = "Comparing Total and Subgroups: Ridgelines") +
  theme_ridges(font_family = "Myriad Pro SemiCondensed")
```

[Figure 6: Comparing Total and Subgroups: Ridgelines.]

Ridgeline plots look good and scale pretty well when there are larger numbers of categories to put on the vertical axis, especially if there’s a reasonable amount of structure in the data, such as a trend or sequence in the distributions. They can be slightly inefficient in terms of space with smaller numbers of categories. When the number of groups gets large they work best in a very tall and narrow aspect ratio that can be hard to integrate into a page.

Histograms with a reference distribution

As with the Gapminder plot, we can facet our plot so that every subgroup gets its own panel. But instead of having the Total population be its own panel, we will put it inside each group’s panel as a reference point. This allows us to compare the group to the overall population, and also makes eyeballing differences between the group distributions a little easier. To do this, we’re going to need to have some way to put the total population distribution into every panel. The trick is to hold on to the total population by only pivoting the subgroups to long format. That leaves us with repeated values for the total population, pop_total, like this:

```r
df %>%
  pivot_longer(cols = pop_a:pop_c)
## # A tibble: 9,000 × 4
##     unit pop_total name   value
##    <int>     <dbl> <chr>  <dbl>
##  1     1      6467 pop_a   1008
##  2     1      6467 pop_b   1269
##  3     1      6467 pop_c   4194
##  4     2    211075 pop_a  60729
##  5     2    211075 pop_b  32579
##  6     2    211075 pop_c 117771
##  7     3    128243 pop_a  21186
##  8     3    128243 pop_b  50184
##  9     3    128243 pop_c  56876
## 10     4     76843 pop_a  22561
## # … with 8,990 more rows
```

When we draw the plot, we first call on geom_histogram() to draw the distribution of the total population, setting the color to gray. Then we call it again, separately, to draw the subgroups. Finally we facet on the subgroup names. This leaves us with a faceted plot where each panel shows one subpopulation’s distribution and, for reference behind it, the overall population distribution.

```r
## scale_fill_okabe_ito() and scale_color_okabe_ito() are not part of ggplot2.
## They are assumed here to come from an add-on package such as ggokabeito,
## which must be loaded first.
df %>%
  pivot_longer(cols = pop_a:pop_c) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = log(pop_total),
                               y = ..count..),
                 bins = 20,
                 alpha = 0.7,
                 fill = "gray40",
                 size = 0.5) +
  geom_histogram(mapping = aes(x = log(value),
                               y = ..count..,
                               color = name,
                               fill = name),
                 stat = "bin",
                 bins = 20,
                 size = 0.5,
                 alpha = 0.7) +
  scale_fill_okabe_ito() +
  scale_color_okabe_ito() +
  guides(color = "none", fill = "none") +
  labs(x = "Logged Population",
       y = "Count",
       title = "Comparing Subgroups: Histograms",
       subtitle = "Overall distribution shown in gray") +
  facet_wrap(~ name, nrow = 1, labeller = as_labeller(grp_names))
```

[Figure 7: Comparing Subgroups: Histograms, with the overall distribution shown in gray.]

This is a handy trick. We’ll use it repeatedly in the remaining figures, as we look at different ways of drawing the same comparison.

While putting the reference distribution behind the subgroup distribution is nice, the way the layering works produces an overlap that some viewers find difficult to read. It seems like a third distribution (the darker color created by the overlapping area) has appeared along with the two we’re interested in. We can avoid this by taking advantage of the underused geom_step() and its direction argument. We can tell geom_step() to use a binning method (stat = "bin") that’s the same as geom_histogram(). Here we’re also using the computed ..density.. value rather than ..count.., but we could use ..count.. just fine as well.

```r
df %>%
  pivot_longer(cols = pop_a:pop_c) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = log(value),
                               y = ..density..,
                               color = name,
                               fill = name),
                 stat = "bin",
                 bins = 20,
                 size = 0.5,
                 alpha = 0.7) +
  geom_step(mapping = aes(x = log(pop_total),
                          y = ..density..),
            bins = 20,
            alpha = 0.9,
            color = "gray30",
            size = 0.6,
            stat = "bin",
            direction = "mid") +
  scale_fill_manual(values = alpha(my_oka, 0.8)) +
  scale_color_manual(values = alpha(my_oka, 1)) +
  guides(color = "none", fill = "none") +
  labs(x = "Logged Population",
       y = "Density",
       title = "Comparing Subgroups: Histograms",
       subtitle = "Overall distribution shown in outline") +
  facet_wrap(~ name, nrow = 1, labeller = as_labeller(grp_names))
```

With geom_step(), we get a histogram with just its outline drawn. This works quite well, I think. Because we’re just drawing the outline, we call it after we’ve drawn our histograms, so that it sits in a layer on top of them.

[Figure 8: Comparing Subgroups: Histograms, with the overall distribution shown in outline.]
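If you want to convince yourself that geom_step() with stat = "bin" really does bin the data the same way geom_histogram() does, one way is to pull out the computed values for each layer and compare them. The sketch below uses made-up data (the toy data frame is an assumption for illustration) and ggplot2’s layer_data() function. It uses after_stat(density), the newer spelling of ..density.., which needs ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(x = rnorm(500))   # made-up data, for illustration only

# Filled histogram, as drawn for the subgroups.
p_hist <- ggplot(toy) +
  geom_histogram(aes(x = x, y = after_stat(density)), bins = 20)

# Outline only: geom_step() fed by the same binning stat.
p_step <- ggplot(toy) +
  geom_step(aes(x = x, y = after_stat(density)),
            stat = "bin", bins = 20, direction = "mid")

# layer_data() returns the values computed for a given layer of a plot.
d_hist <- layer_data(p_hist, 1)
d_step <- layer_data(p_step, 1)

# Compare the bin centers and the bar heights across the two layers.
same_bins <- isTRUE(all.equal(d_hist$x, d_step$x))
same_heights <- isTRUE(all.equal(d_hist$density, d_step$density))
c(same_bins = same_bins, same_heights = same_heights)
```

Setting direction = "mid" is what centers each step on its bin, so the outline lines up with the columns rather than being shifted half a bin to one side.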

Frequency polygons

A final option, halfway between histograms and smoothed kernel density estimates, is to use filled and open frequency polygons. Like geom_histogram(), these use stat_bin() behind the scenes, but rather than columns they draw filled areas (geom_area()) or lines (geom_freqpoly()). The code is otherwise essentially the same as for geom_histogram(). We switch back to counts on the y-axis here.

```r
df %>%
  pivot_longer(cols = pop_a:pop_c) %>%
  ggplot() +
  geom_area(mapping = aes(x = log(value),
                          y = ..count..,
                          color = name,
                          fill = name),
            stat = "bin",
            bins = 20,
            size = 0.5) +
  geom_freqpoly(mapping = aes(x = log(pop_total),
                              y = ..count..),
                bins = 20,
                color = "gray20",
                size = 0.5) +
  scale_fill_manual(values = alpha(my_oka, 0.7)) +
  scale_color_manual(values = alpha(my_oka, 1)) +
  guides(color = "none", fill = "none") +
  labs(x = "Logged Population",
       y = "Count",
       title = "Comparing Subgroups: Frequency Polygons",
       subtitle = "Overall distribution shown in outline") +
  facet_wrap(~ name, nrow = 1, labeller = as_labeller(grp_names))
```

[Figure 9: Comparing Subgroups: Frequency Polygons.]

We can do the same thing with kernel densities, of course:

```r
df %>%
  pivot_longer(cols = pop_a:pop_c) %>%
  ggplot() +
  geom_density(mapping = aes(x = log(value),
                             color = name,
                             fill = name),
               size = 0.5) +
  geom_density(mapping = aes(x = log(pop_total)),
               color = "gray20",
               size = 0.5) +
  scale_fill_manual(values = alpha(my_oka, 0.7)) +
  scale_color_manual(values = alpha(my_oka, 1)) +
  guides(color = "none", fill = "none") +
  labs(x = "Logged Population",
       y = "Density",
       title = "Comparing Subgroups: Kernel Densities",
       subtitle = "Overall distribution shown in outline") +
  facet_wrap(~ name, nrow = 1, labeller = as_labeller(grp_names))
```

[Figure 10: Comparing Subgroups: Kernel Densities.]

While these look good, kernel densities can be a little trickier for people to interpret than straightforward bin-and-count histograms. So it’s nice to have the frequency polygon as an option to use when you just want to show counts on the y-axis.
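To make the difference between the two y-axes concrete, here is a small sketch in base R. The data are made up for illustration, drawn on the same log-normal scale as the example populations, and the area under the density curve is approximated with a simple Riemann sum rather than computed exactly.

```r
set.seed(42)
x <- log(rlnorm(3000, meanlog = 10.3, sdlog = 1.49))  # made-up logged populations

# Histogram: the bin counts add up to the number of observations.
h <- hist(x, breaks = 20, plot = FALSE)
total_count <- sum(h$counts)

# Kernel density: the curve encloses an area of about one, whatever N is.
d <- density(x)
step <- d$x[2] - d$x[1]   # density() evaluates the curve on an evenly spaced grid
area <- sum(d$y) * step   # Riemann-sum approximation of the area under the curve

total_count
round(area, 2)
```

A count can be read straight off the axis as a number of counties. A density value only means something once you multiply it by an interval width, which is part of why density plots take a bit more effort to read.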

The full code for this post is available on GitHub.


