Basic Probability

We will go over a few examples, to illustrate how you can use R to help you with probabilities.
In this lab, I want you to work from a script file (if you don’t know what I mean, then please ask!).

1 prob Package Info

In R, there is a package called prob that we can use to help us answer questions about probability. So, let’s install and load the prob package.

#to install prob package and dependencies remove the # from the next 3 lines of code
#install.packages("devtools", dependencies = TRUE)
#install.packages("combinat")
#devtools::install_github("mzinkgraf/prob")

require("prob")

Athough there are a bunch of functions in the package, we will focus on a few.

iidspace()
Creates a sample space with the associated probabilities.
subset()
Creates a new set from another set.
union()
Creates a new set that is the union of two sets.
intersect()
Creates a new set that is the intersection of two sets.
setdiff()
Creates a new set that includes everything in the first set but not the second set.
Prob()
Calculates the probability for a set or a conditional probability of one set given another set.

2 Crocodile Eggs

Recall from lecture the questions about the probability of crocodile eggs hatching. Below we will work through a few examples that are in the lecture.

To quickly create a sample space and associated probablities, we can use the function iidspace(). We can include the outcomes, the number of trials, and the probabilities of each outcome.

Let’s create the sample space for the situtation in which we have 3 eggs. Assume that each egg can only hatch or not hatch, and the probability of hatching is 0.8 and the probability of not hatching is 0.2. You are welcome, and encouraged, to modify this example.

egg.ss <- iidspace(c("Hatched", "Not Hatched"), ntrials = 3, probs = c(0.8, 0.2))
egg.ss
##            X1          X2          X3 probs
## 1     Hatched     Hatched     Hatched 0.512
## 2 Not Hatched     Hatched     Hatched 0.128
## 3     Hatched Not Hatched     Hatched 0.128
## 4 Not Hatched Not Hatched     Hatched 0.032
## 5     Hatched     Hatched Not Hatched 0.128
## 6 Not Hatched     Hatched Not Hatched 0.032
## 7     Hatched Not Hatched Not Hatched 0.032
## 8 Not Hatched Not Hatched Not Hatched 0.008

Wow, that was easy!!!! Of course you need to understand how to calculate each probability by hand. So, now calculate by hand each of these proabilities. If you don’t recall how, then look at your notes or ask me for help.

You can now subset the sample space (i.e., create new sets).

The first new set, which I will call f, is all the outcomes in which the first egg hatched. So, that is rows 1, 3, 5, 7. The second set, which I will call t, is all the outcomes in which two or more egg hatched. So, that is rows 1, 2, 3, 5. Let’s create those subsets now.

f <- egg.ss[c(1,3,5,7), ] #Alternatively you could use the function subset, see below.
f
##        X1          X2          X3 probs
## 1 Hatched     Hatched     Hatched 0.512
## 3 Hatched Not Hatched     Hatched 0.128
## 5 Hatched     Hatched Not Hatched 0.128
## 7 Hatched Not Hatched Not Hatched 0.032
t <- egg.ss[c(1,2,3,5), ] #Alternatively you could use the function subset, see below.

For the rest of this example, we will use the two sets we just created. However, below are some examples of selecting sets using the function subset(). In particular, we are going to use “or” and “and” statements to create new sets. In R, you indicate “or” with a vertical bar | and “and” with an ampersand &.

#Select all the cases in which the first egg hatched
#Same as the set f we created above
subset(egg.ss, X1 == "Hatched")
##        X1          X2          X3 probs
## 1 Hatched     Hatched     Hatched 0.512
## 3 Hatched Not Hatched     Hatched 0.128
## 5 Hatched     Hatched Not Hatched 0.128
## 7 Hatched Not Hatched Not Hatched 0.032
#Select all the cases in which two or more eggs hatched
#Same as the set t we created above
subset(egg.ss, 
            X1 == "Hatched" & X2 == "Hatched" | 
            X2 == "Hatched" & X3 == "Hatched" |
            X1 == "Hatched" & X3 == "Hatched")
##            X1          X2          X3 probs
## 1     Hatched     Hatched     Hatched 0.512
## 2 Not Hatched     Hatched     Hatched 0.128
## 3     Hatched Not Hatched     Hatched 0.128
## 5     Hatched     Hatched Not Hatched 0.128
#Select all the cases in which the second and third eggs did not hatch
subset(egg.ss,X2 == "Not Hatched" & X3 == "Not Hatched")
##            X1          X2          X3 probs
## 7     Hatched Not Hatched Not Hatched 0.032
## 8 Not Hatched Not Hatched Not Hatched 0.008
#Select all the cases in which at least one egg hached
subset(egg.ss, X1 == "Hatched" | X2 == "Hatched" | X3 == "Hatched")
##            X1          X2          X3 probs
## 1     Hatched     Hatched     Hatched 0.512
## 2 Not Hatched     Hatched     Hatched 0.128
## 3     Hatched Not Hatched     Hatched 0.128
## 4 Not Hatched Not Hatched     Hatched 0.032
## 5     Hatched     Hatched Not Hatched 0.128
## 6 Not Hatched     Hatched Not Hatched 0.032
## 7     Hatched Not Hatched Not Hatched 0.032

Here is a more advanced example to illustrate how you can create more flexible ways to create sets. You are welcome to ignore the select_set example, but you are doing well if you understand it. We first create a function with the arguments set (the set you want to selection from), selection_number (the number of successful trials that you want to select), selection_name (the name of a success in the set), and , n_trials (the number of trials in the set). I have included default values so the function will return 2 or more eggs that hatched. The function uses a for loop to repeat an operation, in this case determine whether there are two or more eggs that hatched in each row. It then returns all the rows from the set the meet the our criteria. Please ask if you have questions.

#Our homemade function to create a subset
select_set <- function(set, selection_number = 2, selection_name = "Hatched", n_trials = 3) {
  in.t <- c()
  for (i in 1:dim(set)[1]) {
    in.t[i] <- sum(set[i, 1:n_trials] == selection_name) >= selection_number
  }
  return(subset(set, in.t))
}

#Select all the cases in which two or more eggs hatched
#Same as the set t we created above
select_set(egg.ss)
##            X1          X2          X3 probs
## 1     Hatched     Hatched     Hatched 0.512
## 2 Not Hatched     Hatched     Hatched 0.128
## 3     Hatched Not Hatched     Hatched 0.128
## 5     Hatched     Hatched Not Hatched 0.128
#Select all the cases in which 1 or more eggs did not hatched
select_set(egg.ss, 1, "Not Hatched")
##            X1          X2          X3 probs
## 2 Not Hatched     Hatched     Hatched 0.128
## 3     Hatched Not Hatched     Hatched 0.128
## 4 Not Hatched Not Hatched     Hatched 0.032
## 5     Hatched     Hatched Not Hatched 0.128
## 6 Not Hatched     Hatched Not Hatched 0.032
## 7     Hatched Not Hatched Not Hatched 0.032
## 8 Not Hatched Not Hatched Not Hatched 0.008

We can now determine the union and intersection of two sets, f and t. Remember to think about the answer before you run each function. You should not ignore this.

u <- union(f, t)
u
##            X1          X2          X3 probs
## 1     Hatched     Hatched     Hatched 0.512
## 2 Not Hatched     Hatched     Hatched 0.128
## 3     Hatched Not Hatched     Hatched 0.128
## 5     Hatched     Hatched Not Hatched 0.128
## 7     Hatched Not Hatched Not Hatched 0.032
i <- intersect(f, t)
i
##        X1          X2          X3 probs
## 1 Hatched     Hatched     Hatched 0.512
## 3 Hatched Not Hatched     Hatched 0.128
## 5 Hatched     Hatched Not Hatched 0.128

There is also a function setdiff() which gives the outcomes in the first set but not the second. Notice that which set you put first matters.

setdiff(f, t)
##        X1          X2          X3 probs
## 7 Hatched Not Hatched Not Hatched 0.032
setdiff(t, f)
##            X1      X2      X3 probs
## 2 Not Hatched Hatched Hatched 0.128
not.t <- setdiff(egg.ss, t) #Gives t^c
not.f <- setdiff(egg.ss, f) #Gives f^c

You can use the function Prob() (notice the P is capitalized) to calculate the probabilities of these sets.

Prob(f)
## [1] 0.8
Prob(t)
## [1] 0.896
Prob(u)
## [1] 0.928
Prob(i)
## [1] 0.768
Prob(not.t)
## [1] 0.104
Prob(not.f)
## [1] 0.2

The function Prob() can also calculate conditional probabilities.

Prob(i)/Prob(f) #Formula for conditional probs
## [1] 0.96
Prob(t, given = f)
## [1] 0.96
Prob(f, given = t)
## [1] 0.8571429

3 Genetics Example

Now let’s do a question that Dr. Trent might ask you in Genetics. There is a population of rabbits with three alleles for a particular gene. Let’s call the alleles a, b, and c. In the population there are 560 copies of a, 105 copies of b, and 35 copies of c. 

  • From these frequencies, calculate the allelic proportions for each allele (you learned this in Biol 204).

Now assume Hardy-Weinberg equilibrum and answer the following questions.

  • What is the expected proportions of each genotype.
  • Assuming that a is dominant over the other two alleles, what is the probability that an individual randomly pulled from the population will have an “a” phenotype?

Try to graph the data with a bargraph or pie chart.

4 Ladybird Beetle Example

In this example, we are going to import data from a csv file. You can find the file, named “ladybirddata.csv”, on Canvas. Save it, and remember the location. Now import the data into R. This is a standard format for representing categorical data. Each row represents an observation.

#set your working directory
#use dir() to see files in the working directory

data <- read.csv("../inst/extdata/ladybirddata.csv")
str(data)
## 'data.frame':    49 obs. of  4 variables:
##  $ Species   : chr  "Chilocorus stigma" "Chilocorus stigma" "Chilocorus stigma" "Chilocorus stigma" ...
##  $ BG_Color  : chr  "Black" "Black" "Black" "Black" ...
##  $ Spot_Color: chr  "Red" "Red" "Red" "Red" ...
##  $ Spots     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

We can use the functon table() to quickly calculate the frequencies. Let’s calculate how many individuals had spots for each species.

table(data$Species, data$Spots)
##                            
##                             FALSE TRUE
##   Chilocorus stigma             0   13
##   Coccinella septempunctata     0   13
##   Harmonia axyridis             8   15

We can use the with() function if we don’t want to type the name of the data.frame multiple times. The first argument is a data.frame, and the second is the code that want to execute with that data.frame. Notice I only have to type Species and not data$Species. The following code produces the same result as above.

with(data, table(Species, Spots))
##                            Spots
## Species                     FALSE TRUE
##   Chilocorus stigma             0   13
##   Coccinella septempunctata     0   13
##   Harmonia axyridis             8   15

How would you determine how many individuals have a given background color for each species?

We can also ask R to give us more specific information. For example, we can use the code below to determine the number of individuals with a given color of spots and a given background for each species.

with(data, table(BG_Color, Spot_Color, Species))
## , , Species = Chilocorus stigma
## 
##          Spot_Color
## BG_Color  Black Red
##   Black       0  13
##   DarkRed     0   0
##   Orange      0   0
##   Red         0   0
## 
## , , Species = Coccinella septempunctata
## 
##          Spot_Color
## BG_Color  Black Red
##   Black       0   0
##   DarkRed     0   0
##   Orange      0   0
##   Red        13   0
## 
## , , Species = Harmonia axyridis
## 
##          Spot_Color
## BG_Color  Black Red
##   Black       0   0
##   DarkRed     1   0
##   Orange      9   0
##   Red         5   0

Change the order of the column names, and see what happens.

We can use the function prop.table() to calculate the proportions, which are also the probilities, from the frequencies.

prop.table(with(data, table(Species, Spots)))
##                            Spots
## Species                         FALSE      TRUE
##   Chilocorus stigma         0.0000000 0.2653061
##   Coccinella septempunctata 0.0000000 0.2653061
##   Harmonia axyridis         0.1632653 0.3061224

Practice calculating the frequencies and proportions for other combinations of characters, like background color and spot color. You should also practice making barplots and pie charts to represent these data.

So far, all the functions we have used in the Ladybird Beetle example are loaded when you start R. In other words, they are not in a special package. There is also another useful function empirical() in the prob package that will calculate the proportions for each of the combination of characters in the data.

empirical(data)
##                     Species BG_Color Spot_Color Spots      probs
## 1         Harmonia axyridis  DarkRed      Black  TRUE 0.02040816
## 2         Harmonia axyridis   Orange      Black  TRUE 0.18367347
## 3 Coccinella septempunctata      Red      Black  TRUE 0.26530612
## 4         Harmonia axyridis      Red      Black  TRUE 0.10204082
## 5         Chilocorus stigma    Black        Red  TRUE 0.26530612

5 Playing Around

Here are some examples of functions in the prob package that you might find helpful for practicing R and learning probability.

There are several functions that will quickly generate some common sample spaces.

tosscoin
Creates a sample space from flipping a coin.
rolldie
Creates a sample space from rolling dice.
cards
Creates a sample space representing a deck of cards.
roulette
Creates a sample space representing a roulette table.
urnsamples
Creates a sample space from pulling items from a jar.

Let’s create a few sample spaces now.

Create the sample space for flipping a coin 6 times. For all the examples below, I use the head function so I don’t print out the entire sample space (which are large in a few examples). These functions have an argument makespace that will also add the probabilities of each outcome.

coin.ss <- tosscoin(6)
head(coin.ss)
##   toss1 toss2 toss3 toss4 toss5 toss6
## 1     H     H     H     H     H     H
## 2     T     H     H     H     H     H
## 3     H     T     H     H     H     H
## 4     T     T     H     H     H     H
## 5     H     H     T     H     H     H
## 6     T     H     T     H     H     H
coin.ssp <- tosscoin(6, makespace = T)
head(coin.ssp)
##   toss1 toss2 toss3 toss4 toss5 toss6    probs
## 1     H     H     H     H     H     H 0.015625
## 2     T     H     H     H     H     H 0.015625
## 3     H     T     H     H     H     H 0.015625
## 4     T     T     H     H     H     H 0.015625
## 5     H     H     T     H     H     H 0.015625
## 6     T     H     T     H     H     H 0.015625

Create the sample space for rolling 2 dice. This sample space is used to model the game craps.

dice.ss <- rolldie(2)
head(dice.ss)
##   X1 X2
## 1  1  1
## 2  2  1
## 3  3  1
## 4  4  1
## 5  5  1
## 6  6  1
dice.ssp <- rolldie(2, makespace = T)
head(dice.ssp)
##   X1 X2      probs
## 1  1  1 0.02777778
## 2  2  1 0.02777778
## 3  3  1 0.02777778
## 4  4  1 0.02777778
## 5  5  1 0.02777778
## 6  6  1 0.02777778

Create the sample space for a deck of card that includes the jokers.

cardDeck <- cards(jokers = T)
head(cardDeck)
##   rank suit
## 1    2 Club
## 2    3 Club
## 3    4 Club
## 4    5 Club
## 5    6 Club
## 6    7 Club
cardDeckp <- cards(jokers = T, makespace = T)
head(cardDeckp)
##   rank suit      probs
## 1    2 Club 0.01851852
## 2    3 Club 0.01851852
## 3    4 Club 0.01851852
## 4    5 Club 0.01851852
## 5    6 Club 0.01851852
## 6    7 Club 0.01851852

Create the sample space for a roulette table.

roulTable <- roulette()
head(roulTable)
##   num color
## 1  27   Red
## 2  10 Black
## 3  25   Red
## 4  29 Black
## 5  12   Red
## 6   8 Black
roulTablep <- roulette(makespace = T)
head(roulTablep)
##   num color      probs
## 1  27   Red 0.02631579
## 2  10 Black 0.02631579
## 3  25   Red 0.02631579
## 4  29 Black 0.02631579
## 5  12   Red 0.02631579
## 6   8 Black 0.02631579

We can use the function urnsamples() to create additional sample spaces.

Create the sample space for pulling M&Ms from a jar.

mms <- urnsamples(c("red", "orange", "yellow", "brown", "light brown", "blue", "green"), 2, replace = T, order = T)
head(mms)
##            X1  X2
## 1         red red
## 2      orange red
## 3      yellow red
## 4       brown red
## 5 light brown red
## 6        blue red

You can also create the sample space in which the order does not matter.

mms2 <- urnsamples(c("red", "orange", "yellow", "brown", "light brown", "blue", "green"), 2, replace = T)
head(mms2)
##    X1          X2
## 1 red         red
## 2 red      orange
## 3 red      yellow
## 4 red       brown
## 5 red light brown
## 6 red        blue

You can create the hands in a poker game with 2 card hands. You might notice that there are 1,326 different hands with only two cards! A 5 card hand will take a long time because there are so many different possible hands.

pokerHand <- urnsamples(cards(), 2)
head(pokerHand)
## [[1]]
##   rank suit
## 1    2 Club
## 2    3 Club
## 
## [[2]]
##   rank suit
## 1    2 Club
## 3    4 Club
## 
## [[3]]
##   rank suit
## 1    2 Club
## 4    5 Club
## 
## [[4]]
##   rank suit
## 1    2 Club
## 5    6 Club
## 
## [[5]]
##   rank suit
## 1    2 Club
## 6    7 Club
## 
## [[6]]
##   rank suit
## 1    2 Club
## 7    8 Club

Now let’s learn why it is a bad idea to play roulette. Roulette tables pay 1 to 1 odds, which means that if you bet 5 dollars on red and red hits, then the dealer gives you your 5 dollars back plus another 5 dollars. The probability of winning with 1 to 1 odds is 0.5, or 50% of the time. So, if you won 50% of the time, then this game would be a fair one. Pretty sweet, right? No, and let’s see why.

We created our sample space of the roulette table above. So, let’s create the set of hitting red.

red <- subset(roulTablep, color == "Red")
Prob(red) #Prob of winning
## [1] 0.4736842
1 - Prob(red) #Prob of losing
## [1] 0.5263158

Notice that the probablity of winning is actually less than 0.5, which means that on average you will lose more than you win–yet they pay you as if your chances were 50%.

Now let’s become a serious gambler. We are going to spend all day at the roulette table. Fortunatley, we can simulate this without spending a dime, which is good thing because we are likely to lose! Let’s play the game 2,000 times. You will get a different set of results than what is below because each spin of the table is random.

playingRoulette <- sim(roulTablep, ntrials = 2000)
head(playingRoulette)
##   num color
## 1  26 Black
## 2  18   Red
## 3  00 Green
## 4  32   Red
## 5   3   Red
## 6  15 Black

Now let’s figure out how we did. Let’s assume we played 5 dollars on red each time.

wagered <- 5*2000
#nrow() returns the number of rows
earnings <- 10 * nrow(subset(playingRoulette, color == "Red")) #10 because $5 for bet and $5 for winnings
(winnings <- earnings - wagered)
## [1] -760

How did we do? And that is why you should not play roulette.