5 Data Distributions for Data Scientists (2023)

Data distributions help us understand our random variables a bit better. This post provides a summary of, and resources for, 5 important data distributions that will help you build your statistical foundations.

Statistics has played a huge role in shaping modern societies and is one of the main tools we have for understanding the world around us. For centuries, statistics and mathematics have jointly shaped inference and classical probability theory.

One of the major concepts in statistics is that many random variables can be modeled by a shape that is described by a few parameters. This is what we normally call a data distribution.

Data distributions are one of the cornerstones of statistics. Hypothesis testing and probability theory grew out of the simple idea of plotting the many occurrences of a random variable, visualizing its shape and performing some math with it.

Basically, distributions describe how a random variable presents itself. They help us understand:

  • What’s the expected average of a random variable?
  • What’s the expected spread of a random variable?
  • What is the range of values?

Before we start, it’s also important to separate continuous from discrete distributions.

Discrete distributions model random variables that can take a countable number of values (such as the roll of a die, ages in years, etc.), typically integers. Continuous distributions model random variables that can take any value within a range, and therefore infinitely many values. Some examples are temperature and the weight of people, typically represented by float (decimal) numbers.

In the context of Data Science & Analytics, knowing the data distribution of your features and target will help you make better assumptions when building models (for instance, whether a linear model is appropriate or not). Additionally, understanding data distributions will help you model and build your target variable.

In this post, we’ll explore some of the most common data distributions — and also use Python to simulate a couple of examples.

Arguably, the most famous data distribution is the normal one. A lot of different real world phenomena revolve around the famous bell-shaped curve — for instance:

  • Heights of a specific population;
  • Birth weight;
  • Students’ test scores;

These examples are really diverse. That’s why the normal distribution is such an impressive concept: it is able to model many real world phenomena from different fields of study, ranging from economics to biology and sociology.

The most important features of the normal distribution:

  • Mean, median and mode are equal (in a sample, approximately so);
  • Standard deviation determines the spread of the distribution;
  • It is symmetric around the mean.

Let me exemplify this using Python. First, let me load a couple of libraries that will help me generate the examples on this post:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Let’s plot 7 random samples drawn from a normal distribution, using seaborn’s density plot:

# kdeplot returns a matplotlib Axes, so .set() has to be chained in the same statement
sns.kdeplot(np.random.normal(loc = 100, scale = 3, size = 7)).set(title='Density Plot of Random Normal Distribution')

Notice that our random variable is gravitating towards a “bell-shaped” curve. We have a small sample and, at the moment, the shape only approximates a bell. This is totally normal when working with smaller samples: the underlying population may truly be normally distributed, but because the sample is so small, the sample pattern may diverge from the perfect bell-shaped curve.

Going through the characteristics of the normal distribution:

  • Mean: this is the peak of the distribution and where most of the examples will land. In the normal function from numpy.random it is set by the loc parameter.
  • Standard deviation: the expected spread of the distribution around the mean. A lower value means that values vary less. It is set by the scale parameter in the function above.

What happens if we raise the number of samples? Let’s plot 10,000 samples taken from the same normal distribution by changing the size argument in the call:

sns.kdeplot(np.random.normal(loc = 100, scale = 3, size = 10000)).set(title='Density Plot of Random Normal Distribution')

Cool! Our random variable now traces an almost perfect bell-shaped curve. The standard deviation sets the width of the bell: a normally distributed variable with a lower standard deviation will end up with a thinner bell.
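
To put numbers on the two plots, here is a minimal sketch that compares the summary statistics of a 7-value sample and a 10,000-value sample. The seeded generator, the summarize helper and the variable names are my own choices for illustration; they are not part of the snippets above.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the run is repeatable


def summarize(sample):
    """Return (mean, median, sample standard deviation) of a 1-D array."""
    return sample.mean(), np.median(sample), sample.std(ddof=1)


small = rng.normal(loc=100, scale=3, size=7)
large = rng.normal(loc=100, scale=3, size=10_000)

for name, sample in (("7 samples", small), ("10,000 samples", large)):
    mean, median, std = summarize(sample)
    print(f"{name:>14}: mean={mean:.2f}  median={median:.2f}  std={std:.2f}")
```

With the larger sample, the mean and median should sit very close to loc = 100 and the standard deviation close to scale = 3, while the 7-value sample can drift noticeably.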

You will find the normal distribution in several features of a typical data science project:

  • For linear models, one of the classical assumptions is that the errors are normally distributed. If they clearly aren’t, the usual inference on the model may be unreliable, and it can be a sign that a linear model (such as regression) is a poor fit or that you need more features.
  • Another example is models that rely on p-value tests to infer the ability of a feature to predict the target. Many of these tests compare an observed effect against the probabilities expected under a normal (or closely related) distribution.
  • A more general application is to check the normality of features and apply some kind of standard deviation range to exclude outliers from the training table. If the underlying variable follows an approximately bell shaped curve, it may be a good idea to use some statistics to perform outlier detection.
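
The outlier idea in the last bullet can be sketched as follows. The feature values, the two injected extremes and the 3-standard-deviation cutoff are assumptions picked for illustration; the right cutoff depends on your data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature: 1,000 roughly normal values plus two injected extremes.
feature = np.concatenate([rng.normal(loc=100, scale=3, size=1000), [130.0, 60.0]])

mean, std = feature.mean(), feature.std(ddof=1)
z_scores = (feature - mean) / std

# Keep values within 3 standard deviations of the mean (a common rule of thumb).
mask = np.abs(z_scores) <= 3
cleaned = feature[mask]
outliers = feature[~mask]

print(f"removed {outliers.size} of {feature.size} values")
```

This only makes sense when the feature is approximately bell-shaped; on a heavily skewed variable the same rule would discard legitimate values from the long tail.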

A normal distribution has no skew, meaning it is centered symmetrically around the mean. But there are other phenomena that have a really long right tail. It’s so common that we even have a name for it! Let’s see it in the next topic.

If you want to explore the math behind the normal distribution, the Normal Distribution Wikipedia page is very good.

Another continuous probability distribution, the log-normal distribution, may seem similar to the normal distribution, but it has a few catches:

  • It’s skewed to the right, meaning it has a positive fat-tail.
  • It only contains positive values.

One way to think of the parameters of the log-normal distribution is as a mean and a standard deviation too, though not of the values themselves: they describe the logarithm of the values.

As an example, let me use Python to show the shape of this distribution with mean = 3 and standard deviation = 1 (both on the log scale). Again, I’m using numpy to generate the samples:

sns.kdeplot(np.random.lognormal(mean = 3, sigma = 1, size = 1000)).set(title='Density Plot of Log Normal Distribution')

Now, one of the cool things is that if I apply a logarithm (using np.log) to the data that was generated, something interesting will happen:

sns.kdeplot(np.log(np.random.lognormal(mean = 3, sigma = 1, size = 1000))).set(title='Density Plot of Transformed Log Normal Distribution')

Recognize this? It’s actually a normal distribution! Log-normal distributions are pretty common in real-life scenarios:

  • Most variables that contain monetary values tend to have a log-normal distribution — for example purchases of customers in a shop;
  • Income distribution;
  • Adult weight has a tendency to follow a log-normal distribution;

Mostly, phenomena that don’t have a chance of having negative values and where we may find extreme positive values can be modeled by this distribution.

If your model depends on pure linearity, transforming these variables with a log transformation may improve its performance and stability.
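
As a rough illustration of what the log transformation does to such a variable, the sketch below measures skewness before and after taking the logarithm. The amounts array stands in for a hypothetical monetary feature, and the skewness helper is a simple hand-written estimator, not a library function.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(1)

# Hypothetical purchase amounts, generated with the same parameters used above.
amounts = rng.lognormal(mean=3, sigma=1, size=10_000)
log_amounts = np.log(amounts)

print("skew before log:", round(skewness(amounts), 2))
print("skew after log: ", round(skewness(log_amounts), 2))
print("log-scale mean and std:", round(log_amounts.mean(), 2), round(log_amounts.std(), 2))
```

The raw values should show a strong positive skew, while the transformed values should have a skew near zero and a mean and standard deviation near the 3 and 1 passed to the generator.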

Again, visit this Wikipedia page if you are interested in the math behind the Log-Normal distribution.

The Bernoulli Distribution is one of the simplest ones. But don’t mistake simplicity for unimportance: in the end, estimating the value behind the Bernoulli distribution may be a really hard task.

It describes a single “trial” of an event that has only two outcomes (yes or no, 1 or 0; basically, dichotomous values), where the “success” outcome happens with probability p.

Let’s visualize it using Python. Imagine we run a randomized trial with 100 patients and give them a specific medicine. The outcome of that treatment is either cure (1) or no cure (0). Each outcome can only happen once per patient: no one can be cured twice.

Of course, the patients could be given a new treatment later, but that is outside the scope of the Bernoulli Distribution. We ran our experiment and counted the outcomes.

Unfortunately, 70 of our patients were not cured, while 30 were. We can now extract the probability of success: 30/(30+70), which yields 30%. The Bernoulli Distribution is normally represented by this probability p, called the probability of success. Simulating it with Python, using stat='probability':

bernoulli = np.array([0]*70+[1]*30)
sns.histplot(bernoulli, stat='probability').set_title('Bernoulli Trials')

Now, the individuals in the study are just the experiment we used to estimate p; they are not the distribution itself. The Bernoulli distribution consists of a single p (the 30%) that we attach to some type of outcome. This p is hard to estimate and requires further testing (particularly for non-deterministic events).

It is really hard to get this value p in real-life scenarios. For instance, in our example, we assumed that 30% was the real probability of getting cured with the medicine, but this number may have happened by chance alone. In a real-life setting, this is normally the case: we work with a value p that is an estimate of the real p for the entire population.
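
One way to express how uncertain the 30% is: a rough 95% interval around the estimate. The sketch below uses the Wald (normal-approximation) interval, which is my choice for simplicity and only one of several possible intervals; it uses the 30-out-of-100 counts from the example.

```python
import math

cured, patients = 30, 100  # counts from the example above
p_hat = cured / patients

# Wald (normal-approximation) 95% interval; other intervals exist and may behave
# better for small samples or for p close to 0 or 1.
standard_error = math.sqrt(p_hat * (1 - p_hat) / patients)
low = p_hat - 1.96 * standard_error
high = p_hat + 1.96 * standard_error

print(f"p_hat = {p_hat:.2f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
# prints: p_hat = 0.30, approx. 95% interval = (0.210, 0.390)
```

With only 100 patients, the data are compatible with a real p anywhere from roughly 21% to 39%, which is why a single experiment is a shaky basis for the estimate.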

A Bernoulli trial for a fair coin toss is easier to understand, because p is known in advance (a 50% chance for heads and 50% for tails).
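
A quick simulation of the fair coin, as a sketch: each toss is a single Bernoulli trial, generated here with a binomial draw of n = 1. The seed, the array name and the encoding of heads as 1 are my own choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# 1 = heads, 0 = tails; each toss is one Bernoulli trial with p = 0.5.
tosses = rng.binomial(n=1, p=0.5, size=10_000)

for n_tosses in (10, 100, 10_000):
    share_heads = tosses[:n_tosses].mean()
    print(f"first {n_tosses:>6} tosses: share of heads = {share_heads:.3f}")
```

The share of heads can be far from 0.5 in the first few tosses and should settle close to it as the number of tosses grows.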

In some cases, you are interested in estimating this p for real life scenarios:

  • Probability of success of a surgery;
  • Probability that a customer will default on his debt during the first 90 days of a loan.
  • Probability that a customer clicks on a specific ad.

An extension of the Bernoulli Distribution is called the Binomial — let’s check it next!

Remember that we are assuming that the value of p is 30%, meaning that we expect our medicine to cure about 30 in 100 patients. The Binomial distribution is the extension of the Bernoulli distribution to multiple trials (or experiments).

If this is true and we end up doing 1,000 trials similar to this one, we would expect our Binomial distribution to look like the following:

sns.histplot(np.random.binomial(n = 100, p = 0.3, size = 1000)).set(title='Histogram of Binomial Distribution of 1000 trials with n=100 and p=30%')

The arguments of the np.random.binomial function are:

  • n: the number of samples in each trial. In this case, one “sample” is a single individual.
  • p: the underlying probability of success. For n=1 we are talking about the p of the Bernoulli distribution.
  • size: the number of trials.
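
To see what these arguments do in practice, here is a small sketch that contrasts n = 1 (plain Bernoulli outcomes) with n = 100 (number of cured patients per 100-patient trial). The seed and variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# n=1: each draw is a single patient's outcome, 0 (no cure) or 1 (cure).
bernoulli_draws = rng.binomial(n=1, p=0.3, size=10)

# n=100: each draw is the number of cured patients in one 100-patient trial.
cures_per_trial = rng.binomial(n=100, p=0.3, size=5)

print("single-patient outcomes:", bernoulli_draws)
print("cures per 100-patient trial:", cures_per_trial)
```

The first array only ever contains 0s and 1s, while the second contains counts between 0 and 100 that should cluster around 30.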

There are certain trials where we ended up having 25, 35 and even 40 patients cured. Some spread like this is expected from chance alone, but results that sit far from 30 could be a reason to challenge our assumption that p was 30%.

But what happens if, instead of doing each trial with 100 individuals, we end up doing them with 1,000?

Look at that! Recognize this shape? A normal distribution! What we’ve witnessed is a showcase of the Central Limit Theorem.
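
The normal shape can also be checked numerically, without a plot. The sketch below simulates trials of 1,000 patients each and compares the counts with a normal curve of mean n·p and standard deviation sqrt(n·p·(1−p)), which are the standard binomial formulas. The number of simulated trials (10,000) and the seed are my own choices.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 1000, 0.3
cures = rng.binomial(n=n, p=p, size=10_000)

expected_mean = n * p                    # 300 cured patients on average
expected_std = np.sqrt(n * p * (1 - p))  # about 14.5

# For a normal curve, roughly 68% of values fall within one standard deviation.
within_one_std = np.mean(np.abs(cures - expected_mean) <= expected_std)

print("sample mean:", round(cures.mean(), 1), " sample std:", round(cures.std(), 2))
print("share within one std of the mean:", round(within_one_std, 3))
```

If the counts behave like a normal distribution, the sample mean and standard deviation should be close to 300 and 14.5, and the share within one standard deviation should land near 0.68.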

The Binomial distribution is a great way to check the real p you are estimating for a Bernoulli trial. Imagine that, for our medicine example, trials with 1,000 patients each produced a Binomial distribution centered around 320 to 325 cured patients. In this case, the real p in the Bernoulli would likely be 32% or 32.5% rather than 30%.
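
Reading p off the center of the distribution can be sketched like this. The true_p of 0.32 is a hypothetical value used only to generate the data; in practice it is unknown and the counts come from real trials.

```python
import numpy as np

rng = np.random.default_rng(11)

patients_per_trial = 1000
true_p = 0.32  # hypothetical "real" cure rate, unknown in practice

# 500 simulated trials; each value is the number of cured patients in one trial.
cures = rng.binomial(n=patients_per_trial, p=true_p, size=500)

# The center of the counts divided by n gives the estimate of p.
p_hat = cures.mean() / patients_per_trial

print(f"average cures per trial: {cures.mean():.1f}")
print(f"estimated p: {p_hat:.3f}")
```

The more trials you average over, the closer the estimate should get to the underlying rate.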

Check this Khan Academy Statistics and Probability course section to get more intuition on the Binomial.

Another famous discrete distribution is the Poisson distribution.

This distribution models the occurrence of a specific event, but instead of looking at a number of trials, we look at a time period, and there is no theoretical upper bound on the number of “successes” because there is no fixed number of individuals. While in the binomial we had a fixed number of trials, in the Poisson we have a fixed time interval.

For instance, let’s imagine a soccer team called “A-Team”. You’ve checked the past 10 games and “A-Team” scored 25 goals. A simple average gives you the number of goals scored by “A-Team” per game: 25 / 10 = 2.5.

You’ve just found your lambda parameter for the Poisson Distribution. This lambda is analogous to p in the Bernoulli distribution.

What would be the expected distribution of goals by “A-Team” over the next 100 games, assuming they score 2.5 goals per game, on average?

sns.histplot(np.random.poisson(lam = 2.5, size = 100)).set(title='Histogram of Poisson Distribution of 100 games with lambda=2.5')

With this data, we can now estimate the probability that the team will score any given number of goals in the next game. Cool, right? Reading roughly from the simulated sample, we can see:

  • The “A-Team” has a ~33% chance of scoring 0 or 1 goals in the next game.
  • The “A-Team” has a ~27% chance of scoring exactly 2 goals in the next game.
  • The “A-Team” has a ~40% chance of scoring more than 2 goals in the game.
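
The percentages above are read from one simulated sample of 100 games, so they carry sampling noise. For comparison, the exact probabilities follow from the Poisson formula P(k) = e^(−λ) · λ^k / k!. The sketch below uses only the standard library; the helper name poisson_pmf is my own.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 2.5
p_0_or_1 = poisson_pmf(0, lam) + poisson_pmf(1, lam)
p_2 = poisson_pmf(2, lam)
p_more_than_2 = 1 - p_0_or_1 - p_2

print(f"P(0 or 1 goals)     = {p_0_or_1:.3f}")       # 0.287
print(f"P(exactly 2 goals)  = {p_2:.3f}")            # 0.257
print(f"P(more than 2 goals) = {p_more_than_2:.3f}")  # 0.456
```

So the exact values are about 29%, 26% and 46%; a sample of only 100 games can easily drift a few percentage points away from them.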

The Poisson Distribution has a single parameter:

  • lambda, the expected number of occurrences of the event during the time period, represented by lam.

There’s a small catch in our example: for a process to be modeled using the Poisson distribution, the events must be independent. We can argue that this is not the case for a soccer team’s form, because one game can impact morale in the next one and influence the expected goals. Even so, it has been tested several times that goals scored by soccer teams tend to follow a Poisson distribution over a season.

Other real-life phenomena that tend to follow a Poisson distribution:

  • Number of network failures in your city per week;
  • Number of arrivals of trains at a station over the course of one hour;
  • Number of specific products sold in a store per week;

Check the Poisson Distribution Wikipedia page for more details.

And we’re done! Thank you for taking the time to read this post. I hope you enjoyed the post and I invite you to check the math behind each of these distributions. It’s also really interesting to see the huge amount of real life phenomena that tend to follow one of these distributions.

As a data scientist, there’s a really high chance that you will end up dealing with one of these distributions in the future — knowing them will help you improve your statistics game and get more familiar with the processes that generate your features and target.

I’ve set up a bootcamp on learning Data Science on Udemy where I introduce students to statistics and algorithms! The course is tailored for beginners and I would love to have you around.

Article information

Author: Domingo Moore

Last Updated: 02/28/2023
