# Foundations of Probability that Every Data Scientist Should Know

## Understanding Random Events

Let’s say there is some likelihood that a user will click on a call-to-action within your application. (I’ll call it a CTA from here on out; it covers any place where you invite the reader to buy, shop, give an email, etc.) Once they have clicked the CTA, there is a subsequent likelihood that they will press the submit button to send along their email information (in this case).

Now let’s assign some probabilities: a 50% likelihood of clicking the CTA and a 10% likelihood of pressing the submit button.

The question is: before either of these events takes place, what is the likelihood that BOTH will occur?

## Revisiting the Classic Probability Example: Dice

You may have seen this before; if so, it will at least serve as a good refresher.

To start out, let’s answer the following question: what is the probability of rolling two consecutive 6’s?

On your first roll, the die can land on any of the following: 1, 2, 3, 4, 5, or 6. In total you have six options.

Obvious? Yes, but here’s where we think a bit more critically. If you roll a 1, what are your options for the next roll? Any of the same numbers: 1, 2, 3, 4, 5, or 6. If you had rolled a 2, you’d have the same 6 options for the second roll, and so on for every first roll. In total this gives us 36 (6*6) different outcomes.

So back to the question… what is the likelihood that you might roll two 6’s in a row?

If we roll a die, the likelihood of it landing on 6 is 1/6. The second roll is not affected by the first, so once that has happened we have a likelihood of 1/6 again. Of the 36 different combinations of two rolls, there is only one in which a 6 is rolled twice in a row. The math is simply to multiply the two probabilities together, giving a 1/36 likelihood of rolling two 6s in a row.
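If you’d like to check the counting rather than take it on faith, here is a minimal sketch that lists every pair of rolls. It assumes a fair six-sided die and uses base R’s `expand.grid`; the names `rolls` and `both_six` are just illustrative.

```r
# Enumerate every ordered pair of rolls for two fair six-sided dice
rolls <- expand.grid(first = 1:6, second = 1:6)

nrow(rolls)  # total number of equally likely outcomes

# Flag the outcomes where both rolls are a 6
both_six <- rolls$first == 6 & rolls$second == 6

sum(both_six)   # number of outcomes that are double sixes
mean(both_six)  # share of all outcomes that are double sixes
1/6 * 1/6       # the multiplication shortcut, for comparison
```

The last two lines should agree, since the share of double sixes among all equally likely outcomes is exactly the product of the two individual probabilities.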

## Back to the App Example

Let’s apply what we just learned to our original example. If there is a 1/2 chance of a user hitting the CTA and a 1/10 chance of them hitting the submit button, then we can multiply the probabilities together to say that there is a 1/20, or 5%, chance that they will hit both buttons.

## Theory Aside, Let’s Write Some Code

As a preface: if you haven’t seen my Medium post on statistical inference, I detail the `rbinom` function there: https://towardsdatascience.com/an-introduction-to-binomials-inference-56394956e1a4. The quick version is that you use this function to simulate randomly occurring binomial events.

## Let’s Require Both Using… And

We take 5000 draws of each event, using its corresponding probability. Every draw is either 1 or 0. When we specify `&`, we are effectively saying that both must be true. Taking the average of the cases where both are true gives us the probability that both would occur.

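```r
# 5000 simulated users: 1 = the event happened, 0 = it did not
CTA <- rbinom(5000, 1, .5)
SUBMIT <- rbinom(5000, 1, .1)

# Share of draws in which both events occurred
mean(CTA & SUBMIT)
```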

We can repeat this process with as many steps as we want. Above, we can see that both actions occurred together roughly 5% of the time; the exact figure varies a little from run to run because the draws are random.
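To illustrate adding more steps, here is a rough sketch with a third event. The `CONFIRM` step and its 50% probability are made up for this example and are not part of the app scenario above; the larger number of draws and the fixed seed are also my own choices, to make the result stable and reproducible.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

n <- 100000  # more draws than above, to reduce simulation noise

CTA     <- rbinom(n, 1, .5)  # clicked the CTA
SUBMIT  <- rbinom(n, 1, .1)  # pressed the submit button
CONFIRM <- rbinom(n, 1, .5)  # hypothetical third step, assumed for illustration

# Share of draws in which all three events occurred
simulated <- mean(CTA & SUBMIT & CONFIRM)

# Product of the three probabilities
expected <- .5 * .1 * .5

simulated
expected
```

The simulated share should land close to the product of the three probabilities, just as it did with two events.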

## From And to Or

Let’s go back to the dice example. Say that rather than the likelihood that both rolls are 6’s, we want to calculate the probability that either of the two rolls is a 6.

### Conceptual Approach

The way to think about this is to start with each roll’s individual likelihood. Either roll has a 1/6 chance of being a 6. Let’s take those two 1/6 likelihoods and add them together. Nearly there, but there’s one issue: implicit in that 1/6 + 1/6 is the case where both rolls are 6’s, and it gets counted twice, once for each roll. Both is the keyword there. One of those counts has to come out!

This is where we bring together what we have learned so far.

We will add the probabilities together, then subtract the probability that both occur, which is 1/6*1/6.

### Mathematical Approach

This gives us a formula of 1/6 + 1/6 - 1/6*1/6, which works out to 11/36, or about 31%.
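As a quick sanity check on the formula, here is a small sketch that counts the outcomes directly instead of simulating them. It assumes a fair die and uses base R’s `expand.grid`; `rolls` and `any_six` are illustrative names.

```r
# Every ordered pair of rolls for two fair six-sided dice
rolls <- expand.grid(first = 1:6, second = 1:6)

# Outcomes where at least one of the two rolls is a 6
any_six <- rolls$first == 6 | rolls$second == 6

sum(any_six)   # number of outcomes with at least one 6
mean(any_six)  # share of all 36 outcomes

# The formula: add the two probabilities, subtract the overlap
1/6 + 1/6 - 1/6 * 1/6
```

Counting gives 6 outcomes with a 6 on the first roll plus 6 with a 6 on the second, minus the one double-six outcome that would otherwise be counted twice.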

### Programmatic Approach

Let’s replicate this like we did earlier.

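```r
# 1 = the roll was a 6, 0 = it was not; a fair die gives a 1/6 chance
roll_1 <- rbinom(5000, 1, 1/6)
roll_2 <- rbinom(5000, 1, 1/6)

# Simulated share of draws where either roll was a 6
round(mean(roll_1 | roll_2), 2)

# The formula, for comparison
1/6 + 1/6 - 1/6 * 1/6
```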

Above you’ll see something very similar to what we had the first time; the only difference is the OR operator, `|`. When we generated many random occurrences, about 31% of them had at least one 6. We were able to replicate that number with the simple equation we just came up with, which confirms that the two approaches are consistent.

## From Or to Conditionals

Let’s say you are going to simulate five individuals coming to your site, each either clicking on the CTA or not with a 50% likelihood of either outcome. Similar to what we did before, we are going to simulate these groups of 5 a full 50,000 times.

```r
CTA <- rbinom(50000, 5, .5)
mean(CTA)
```

`CTA` now holds the number of times the CTA was clicked in each of the trials or simulations. When we take the mean of `CTA`, we see an average of about 2.5 clicks per trial, which represents 50% of the 5 users in each trial.

Now that we’ve come this far, let’s say you want to understand the probability that at least 2 of the 5 users will click the CTA. Then let’s do the same thing for at least 3 of the 5, and at least 4 of the 5.

```r
mean(CTA >= 2)
mean(CTA >= 3)
mean(CTA >= 4)
```

Each of these conditional statements evaluates to TRUE or FALSE for every trial. The `mean` function treats TRUE as 1 and FALSE as 0, so the average is the share of trials where the statement was true.
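If you want to compare the simulated shares against exact values, one option is base R’s `pbinom`, which gives cumulative binomial probabilities. This is a sketch rather than part of the original walkthrough; the fixed seed and the names `thresholds`, `simulated`, and `exact` are my own additions.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# 50,000 trials of 5 users, each with a 50% chance of clicking
CTA <- rbinom(50000, 5, .5)

thresholds <- 2:4

# Simulated share of trials with at least k clicks
simulated <- sapply(thresholds, function(k) mean(CTA >= k))

# Exact P(X >= k), written as P(X > k - 1) using the upper tail
exact <- pbinom(thresholds - 1, size = 5, prob = .5, lower.tail = FALSE)

data.frame(at_least = thresholds,
           simulated = round(simulated, 3),
           exact = exact)
```

The exact values are 26/32, 16/32, and 6/32, and the simulated column should sit within a percentage point or so of each.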

We can now leverage this idea with OR and AND as well.
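For instance, here is a small sketch that combines conditions on the same simulated trials. The specific questions asked (between 2 and 4 clicks, and none or all clicking) are just examples I picked, and the seed is fixed only for reproducibility.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# 50,000 trials of 5 users, each with a 50% chance of clicking
CTA <- rbinom(50000, 5, .5)

# AND: at least 2 but no more than 4 of the 5 users clicked
between <- mean(CTA >= 2 & CTA <= 4)

# OR: either nobody clicked or everybody clicked
extremes <- mean(CTA == 0 | CTA == 5)

between
extremes

# Exact values from the binomial distribution, for comparison
sum(dbinom(2:4, size = 5, prob = .5))
sum(dbinom(c(0, 5), size = 5, prob = .5))
```

As with the earlier examples, the simulated shares should come out close to the exact probabilities from `dbinom`.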

## Conclusion

We visited a lot of ideas in a very short time. I hope this breakdown of probability foundations was helpful! Be sure to check out my blog at datasciencelessons.com to learn more!

As always, happy data science-ing!