We will go over a few examples, to illustrate how you can use R to
help you with probabilities.
In this lab, I want you to work from a script file (if you don’t know
what I mean, then please ask!).
In R, there is a package called prob that we can use to help us answer questions about probability. So, let’s install and load the prob package.
You can find information about installing and loading packages in the Intro to R: Packages link.
#to install prob package and dependencies remove the # from the next 3 lines of code
#install.packages("devtools", dependencies = TRUE)
#install.packages("combinat")
#devtools::install_github("mzinkgraf/prob")
require("prob")
Athough there are a bunch of functions in the package, we will focus on a few.
You can find more information about the prob package at the following link web page.
Recall from lecture the questions about the probability of crocodile eggs hatching. Below we will work through a few examples that are in the lecture.
To quickly create a sample space and associated probablities, we can use the function iidspace(). We can include the outcomes, the number of trials, and the probabilities of each outcome.
Let’s create the sample space for the situtation in which we have 3 eggs. Assume that each egg can only hatch or not hatch, and the probability of hatching is 0.8 and the probability of not hatching is 0.2. You are welcome, and encouraged, to modify this example.
egg.ss <- iidspace(c("Hatched", "Not Hatched"), ntrials = 3, probs = c(0.8, 0.2))
egg.ss
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 2 Not Hatched Hatched Hatched 0.128
## 3 Hatched Not Hatched Hatched 0.128
## 4 Not Hatched Not Hatched Hatched 0.032
## 5 Hatched Hatched Not Hatched 0.128
## 6 Not Hatched Hatched Not Hatched 0.032
## 7 Hatched Not Hatched Not Hatched 0.032
## 8 Not Hatched Not Hatched Not Hatched 0.008
Wow, that was easy!!!! Of course you need to understand how to calculate each probability by hand. So, now calculate by hand each of these proabilities. If you don’t recall how, then look at your notes or ask me for help.
You can now subset the sample space (i.e., create new sets).
The first new set, which I will call f, is all the outcomes in which the first egg hatched. So, that is rows 1, 3, 5, 7. The second set, which I will call t, is all the outcomes in which two or more egg hatched. So, that is rows 1, 2, 3, 5. Let’s create those subsets now.
f <- egg.ss[c(1,3,5,7), ] #Alternatively you could use the function subset, see below.
f
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 3 Hatched Not Hatched Hatched 0.128
## 5 Hatched Hatched Not Hatched 0.128
## 7 Hatched Not Hatched Not Hatched 0.032
t <- egg.ss[c(1,2,3,5), ] #Alternatively you could use the function subset, see below.
For the rest of this example, we will use the two sets we just created. However, below are some examples of selecting sets using the function subset(). In particular, we are going to use “or” and “and” statements to create new sets. In R, you indicate “or” with a vertical bar | and “and” with an ampersand &.
#Select all the cases in which the first egg hatched
#Same as the set f we created above
subset(egg.ss, X1 == "Hatched")
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 3 Hatched Not Hatched Hatched 0.128
## 5 Hatched Hatched Not Hatched 0.128
## 7 Hatched Not Hatched Not Hatched 0.032
#Select all the cases in which two or more eggs hatched
#Same as the set t we created above
subset(egg.ss,
X1 == "Hatched" & X2 == "Hatched" |
X2 == "Hatched" & X3 == "Hatched" |
X1 == "Hatched" & X3 == "Hatched")
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 2 Not Hatched Hatched Hatched 0.128
## 3 Hatched Not Hatched Hatched 0.128
## 5 Hatched Hatched Not Hatched 0.128
#Select all the cases in which the second and third eggs did not hatch
subset(egg.ss,X2 == "Not Hatched" & X3 == "Not Hatched")
## X1 X2 X3 probs
## 7 Hatched Not Hatched Not Hatched 0.032
## 8 Not Hatched Not Hatched Not Hatched 0.008
#Select all the cases in which at least one egg hached
subset(egg.ss, X1 == "Hatched" | X2 == "Hatched" | X3 == "Hatched")
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 2 Not Hatched Hatched Hatched 0.128
## 3 Hatched Not Hatched Hatched 0.128
## 4 Not Hatched Not Hatched Hatched 0.032
## 5 Hatched Hatched Not Hatched 0.128
## 6 Not Hatched Hatched Not Hatched 0.032
## 7 Hatched Not Hatched Not Hatched 0.032
Here is a more advanced example to illustrate how you can create more flexible ways to create sets. You are welcome to ignore the select_set example, but you are doing well if you understand it. We first create a function with the arguments set (the set you want to selection from), selection_number (the number of successful trials that you want to select), selection_name (the name of a success in the set), and , n_trials (the number of trials in the set). I have included default values so the function will return 2 or more eggs that hatched. The function uses a for loop to repeat an operation, in this case determine whether there are two or more eggs that hatched in each row. It then returns all the rows from the set the meet the our criteria. Please ask if you have questions.
#Our homemade function to create a subset
select_set <- function(set, selection_number = 2, selection_name = "Hatched", n_trials = 3) {
in.t <- c()
for (i in 1:dim(set)[1]) {
in.t[i] <- sum(set[i, 1:n_trials] == selection_name) >= selection_number
}
return(subset(set, in.t))
}
#Select all the cases in which two or more eggs hatched
#Same as the set t we created above
select_set(egg.ss)
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 2 Not Hatched Hatched Hatched 0.128
## 3 Hatched Not Hatched Hatched 0.128
## 5 Hatched Hatched Not Hatched 0.128
#Select all the cases in which 1 or more eggs did not hatched
select_set(egg.ss, 1, "Not Hatched")
## X1 X2 X3 probs
## 2 Not Hatched Hatched Hatched 0.128
## 3 Hatched Not Hatched Hatched 0.128
## 4 Not Hatched Not Hatched Hatched 0.032
## 5 Hatched Hatched Not Hatched 0.128
## 6 Not Hatched Hatched Not Hatched 0.032
## 7 Hatched Not Hatched Not Hatched 0.032
## 8 Not Hatched Not Hatched Not Hatched 0.008
We can now determine the union and intersection of two sets, f and t. Remember to think about the answer before you run each function. You should not ignore this.
u <- union(f, t)
u
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 2 Not Hatched Hatched Hatched 0.128
## 3 Hatched Not Hatched Hatched 0.128
## 5 Hatched Hatched Not Hatched 0.128
## 7 Hatched Not Hatched Not Hatched 0.032
i <- intersect(f, t)
i
## X1 X2 X3 probs
## 1 Hatched Hatched Hatched 0.512
## 3 Hatched Not Hatched Hatched 0.128
## 5 Hatched Hatched Not Hatched 0.128
When you load the prob package, it loads several functions that mask existing functions in R. Two of these are union() and intersect(). Beware that the union() and intersect() functions that load when you start R are not the same as the union() and intersect() that load when you load the prob package.
There is also a function setdiff() which gives the outcomes in the first set but not the second. Notice that which set you put first matters.
setdiff(f, t)
## X1 X2 X3 probs
## 7 Hatched Not Hatched Not Hatched 0.032
setdiff(t, f)
## X1 X2 X3 probs
## 2 Not Hatched Hatched Hatched 0.128
not.t <- setdiff(egg.ss, t) #Gives t^c
not.f <- setdiff(egg.ss, f) #Gives f^c
You can use the function Prob() (notice the P is capitalized) to calculate the probabilities of these sets.
Prob(f)
## [1] 0.8
Prob(t)
## [1] 0.896
Prob(u)
## [1] 0.928
Prob(i)
## [1] 0.768
Prob(not.t)
## [1] 0.104
Prob(not.f)
## [1] 0.2
The function Prob() can also calculate conditional probabilities.
Prob(i)/Prob(f) #Formula for conditional probs
## [1] 0.96
Prob(t, given = f)
## [1] 0.96
Prob(f, given = t)
## [1] 0.8571429
Now let’s do a question that Dr. Trent might ask you in Genetics. There is a population of rabbits with three alleles for a particular gene. Let’s call the alleles a, b, and c. In the population there are 560 copies of a, 105 copies of b, and 35 copies of c.
Now assume Hardy-Weinberg equilibrum and answer the following questions.
Try to graph the data with a bargraph or pie chart.
In this example, we are going to import data from a csv file. You can find the file, named “ladybirddata.csv”, on Canvas. Save it, and remember the location. Now import the data into R. This is a standard format for representing categorical data. Each row represents an observation.
#set your working directory
#use dir() to see files in the working directory
data <- read.csv("../inst/extdata/ladybirddata.csv")
str(data)
## 'data.frame': 49 obs. of 4 variables:
## $ Species : chr "Chilocorus stigma" "Chilocorus stigma" "Chilocorus stigma" "Chilocorus stigma" ...
## $ BG_Color : chr "Black" "Black" "Black" "Black" ...
## $ Spot_Color: chr "Red" "Red" "Red" "Red" ...
## $ Spots : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
We can use the functon table() to quickly calculate the frequencies. Let’s calculate how many individuals had spots for each species.
table(data$Species, data$Spots)
##
## FALSE TRUE
## Chilocorus stigma 0 13
## Coccinella septempunctata 0 13
## Harmonia axyridis 8 15
We can use the with() function if we don’t want to type the name of the data.frame multiple times. The first argument is a data.frame, and the second is the code that want to execute with that data.frame. Notice I only have to type Species and not data$Species. The following code produces the same result as above.
with(data, table(Species, Spots))
## Spots
## Species FALSE TRUE
## Chilocorus stigma 0 13
## Coccinella septempunctata 0 13
## Harmonia axyridis 8 15
How would you determine how many individuals have a given background color for each species?
We can also ask R to give us more specific information. For example, we can use the code below to determine the number of individuals with a given color of spots and a given background for each species.
with(data, table(BG_Color, Spot_Color, Species))
## , , Species = Chilocorus stigma
##
## Spot_Color
## BG_Color Black Red
## Black 0 13
## DarkRed 0 0
## Orange 0 0
## Red 0 0
##
## , , Species = Coccinella septempunctata
##
## Spot_Color
## BG_Color Black Red
## Black 0 0
## DarkRed 0 0
## Orange 0 0
## Red 13 0
##
## , , Species = Harmonia axyridis
##
## Spot_Color
## BG_Color Black Red
## Black 0 0
## DarkRed 1 0
## Orange 9 0
## Red 5 0
Change the order of the column names, and see what happens.
We can use the function prop.table() to calculate the proportions, which are also the probilities, from the frequencies.
prop.table(with(data, table(Species, Spots)))
## Spots
## Species FALSE TRUE
## Chilocorus stigma 0.0000000 0.2653061
## Coccinella septempunctata 0.0000000 0.2653061
## Harmonia axyridis 0.1632653 0.3061224
Practice calculating the frequencies and proportions for other combinations of characters, like background color and spot color. You should also practice making barplots and pie charts to represent these data.
So far, all the functions we have used in the Ladybird Beetle example are loaded when you start R. In other words, they are not in a special package. There is also another useful function empirical() in the prob package that will calculate the proportions for each of the combination of characters in the data.
empirical(data)
## Species BG_Color Spot_Color Spots probs
## 1 Harmonia axyridis DarkRed Black TRUE 0.02040816
## 2 Harmonia axyridis Orange Black TRUE 0.18367347
## 3 Coccinella septempunctata Red Black TRUE 0.26530612
## 4 Harmonia axyridis Red Black TRUE 0.10204082
## 5 Chilocorus stigma Black Red TRUE 0.26530612
Here are some examples of functions in the prob package that you might find helpful for practicing R and learning probability.
There are several functions that will quickly generate some common sample spaces.
Let’s create a few sample spaces now.
Create the sample space for flipping a coin 6 times. For all the examples below, I use the head function so I don’t print out the entire sample space (which are large in a few examples). These functions have an argument makespace that will also add the probabilities of each outcome.
coin.ss <- tosscoin(6)
head(coin.ss)
## toss1 toss2 toss3 toss4 toss5 toss6
## 1 H H H H H H
## 2 T H H H H H
## 3 H T H H H H
## 4 T T H H H H
## 5 H H T H H H
## 6 T H T H H H
coin.ssp <- tosscoin(6, makespace = T)
head(coin.ssp)
## toss1 toss2 toss3 toss4 toss5 toss6 probs
## 1 H H H H H H 0.015625
## 2 T H H H H H 0.015625
## 3 H T H H H H 0.015625
## 4 T T H H H H 0.015625
## 5 H H T H H H 0.015625
## 6 T H T H H H 0.015625
Create the sample space for rolling 2 dice. This sample space is used to model the game craps.
dice.ss <- rolldie(2)
head(dice.ss)
## X1 X2
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
dice.ssp <- rolldie(2, makespace = T)
head(dice.ssp)
## X1 X2 probs
## 1 1 1 0.02777778
## 2 2 1 0.02777778
## 3 3 1 0.02777778
## 4 4 1 0.02777778
## 5 5 1 0.02777778
## 6 6 1 0.02777778
Create the sample space for a deck of card that includes the jokers.
cardDeck <- cards(jokers = T)
head(cardDeck)
## rank suit
## 1 2 Club
## 2 3 Club
## 3 4 Club
## 4 5 Club
## 5 6 Club
## 6 7 Club
cardDeckp <- cards(jokers = T, makespace = T)
head(cardDeckp)
## rank suit probs
## 1 2 Club 0.01851852
## 2 3 Club 0.01851852
## 3 4 Club 0.01851852
## 4 5 Club 0.01851852
## 5 6 Club 0.01851852
## 6 7 Club 0.01851852
Create the sample space for a roulette table.
roulTable <- roulette()
head(roulTable)
## num color
## 1 27 Red
## 2 10 Black
## 3 25 Red
## 4 29 Black
## 5 12 Red
## 6 8 Black
roulTablep <- roulette(makespace = T)
head(roulTablep)
## num color probs
## 1 27 Red 0.02631579
## 2 10 Black 0.02631579
## 3 25 Red 0.02631579
## 4 29 Black 0.02631579
## 5 12 Red 0.02631579
## 6 8 Black 0.02631579
We can use the function urnsamples() to create additional sample spaces.
Create the sample space for pulling M&Ms from a jar.
mms <- urnsamples(c("red", "orange", "yellow", "brown", "light brown", "blue", "green"), 2, replace = T, order = T)
head(mms)
## X1 X2
## 1 red red
## 2 orange red
## 3 yellow red
## 4 brown red
## 5 light brown red
## 6 blue red
You can also create the sample space in which the order does not matter.
mms2 <- urnsamples(c("red", "orange", "yellow", "brown", "light brown", "blue", "green"), 2, replace = T)
head(mms2)
## X1 X2
## 1 red red
## 2 red orange
## 3 red yellow
## 4 red brown
## 5 red light brown
## 6 red blue
You can create the hands in a poker game with 2 card hands. You might notice that there are 1,326 different hands with only two cards! A 5 card hand will take a long time because there are so many different possible hands.
pokerHand <- urnsamples(cards(), 2)
head(pokerHand)
## [[1]]
## rank suit
## 1 2 Club
## 2 3 Club
##
## [[2]]
## rank suit
## 1 2 Club
## 3 4 Club
##
## [[3]]
## rank suit
## 1 2 Club
## 4 5 Club
##
## [[4]]
## rank suit
## 1 2 Club
## 5 6 Club
##
## [[5]]
## rank suit
## 1 2 Club
## 6 7 Club
##
## [[6]]
## rank suit
## 1 2 Club
## 7 8 Club
Now let’s learn why it is a bad idea to play roulette. Roulette tables pay 1 to 1 odds, which means that if you bet 5 dollars on red and red hits, then the dealer gives you your 5 dollars back plus another 5 dollars. The probability of winning with 1 to 1 odds is 0.5, or 50% of the time. So, if you won 50% of the time, then this game would be a fair one. Pretty sweet, right? No, and let’s see why.
We created our sample space of the roulette table above. So, let’s create the set of hitting red.
red <- subset(roulTablep, color == "Red")
Prob(red) #Prob of winning
## [1] 0.4736842
1 - Prob(red) #Prob of losing
## [1] 0.5263158
Notice that the probablity of winning is actually less than 0.5, which means that on average you will lose more than you win–yet they pay you as if your chances were 50%.
Now let’s become a serious gambler. We are going to spend all day at the roulette table. Fortunatley, we can simulate this without spending a dime, which is good thing because we are likely to lose! Let’s play the game 2,000 times. You will get a different set of results than what is below because each spin of the table is random.
playingRoulette <- sim(roulTablep, ntrials = 2000)
head(playingRoulette)
## num color
## 1 26 Black
## 2 18 Red
## 3 00 Green
## 4 32 Red
## 5 3 Red
## 6 15 Black
Now let’s figure out how we did. Let’s assume we played 5 dollars on red each time.
wagered <- 5*2000
#nrow() returns the number of rows
earnings <- 10 * nrow(subset(playingRoulette, color == "Red")) #10 because $5 for bet and $5 for winnings
(winnings <- earnings - wagered)
## [1] -760
How did we do? And that is why you should not play roulette.