Plyr Package

The plyr package for R helps you summarize data quickly. For example, have you ever tried to calculate the means for a bunch of different groups in Excel? What a pain in the ass. We have already learned the tapply() function, but a single call is really geared toward one summary stat (e.g., the mean for each group of interest). The plyr package shines when you want to calculate several different summary stats at once.
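
As a quick reminder of that limitation, here is a minimal base R sketch using tapply() on the built-in iris data. The object names species.means and species.sds are made up for this illustration. Note that each summary stat takes its own call.

```r
# One summary stat per tapply() call: mean sepal length for each species
species.means <- tapply(iris$Sepal.Length, iris$Species, mean)
species.means

# A second stat (standard deviation) needs a second call
species.sds <- tapply(iris$Sepal.Length, iris$Species, sd)
species.sds
```

You end up with two separate named vectors that you would have to stitch together yourself if you wanted one tidy table.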

1 Install and Load plyr

Use the install.packages() function to install the package. Then use either the library() or require() function to load it. Alternatively, in RStudio click on the Packages tab in the lower-right window, and then the Install button on the left-hand side just below the tabs. Once the package is installed, just check the box next to the plyr package to load it. Remember that you only need to install the package on a computer once, but you need to load it each time you restart R. Look at the page on packages for more information.

  # If you have admin access
  install.packages("plyr", dependencies = TRUE)
  library(plyr)

  # If you don't have admin access,
  # install the package to your u drive instead
  install.packages("plyr", lib = "u:/", dependencies = TRUE)
  library(plyr, lib.loc = "u:/")
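
If you rerun the same script often, you may not want to reinstall the package every time. The following is one possible sketch, assuming the default library location is writable: it uses requireNamespace() to check whether plyr is already available and only installs it when it is missing.

```r
# Install plyr only if it is not already available, then load it
if (!requireNamespace("plyr", quietly = TRUE)) {
  install.packages("plyr", dependencies = TRUE)
}
library(plyr)
```

If you installed to your u drive instead, you would need to add the lib and lib.loc arguments as shown above.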

2 ddply

There are many functions in this package, but I want to focus on the function ddply(). This function takes a data.frame, summarizes it, and returns a data.frame to the user. There are other functions to summarize other data types, and you can even convert between data types. For example, the function laply() takes a list, summarizes it, and returns an array.
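
To make the naming pattern concrete, here is a small sketch of laply() (list in, array out). The list scores and its contents are made up purely for illustration.

```r
library(plyr)

# A small made-up list of numeric vectors (hypothetical example data)
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# laply(): list in, array out.
# With one number returned per list element, the result is a plain vector.
score.means <- laply(scores, mean)
score.means
```

The first letter of the function name tells you the input type and the second letter tells you the output type, so ddply() is data.frame in, data.frame out.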

The function ddply() requires several arguments. The first is the data.frame that you want to summarize. The second is the column or columns that you want to summarize by. The third is the function to apply to each group (here, summarise), followed by the summaries you want, written as named arguments. R comes with a bunch of data sets already installed on your computer, and we are going to just look at the iris data to see how this function works. If you want to see all the data available in R, use the function data() without any arguments. Here I load the iris data and look at the structure of it.

  data(iris)
  str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

So, let us first just ask what the mean and standard deviation of sepal length are for each species.

  sepal.length.species <- ddply(iris, .(Species), summarise,
    mean.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
    sd.Sepal.Length = sd(Sepal.Length, na.rm = TRUE)
  )
  sepal.length.species
##      Species mean.Sepal.Length sd.Sepal.Length
## 1     setosa             5.006       0.3524897
## 2 versicolor             5.936       0.5161711
## 3  virginica             6.588       0.6358796

I made up the names mean.Sepal.Length and sd.Sepal.Length; you can call these columns whatever you like. If I wanted to calculate the variance, then I could add another argument or change one of the last two arguments. Here I will just add another.

  sepal.length.species2 <- ddply(iris, .(Species), summarise,
    mean.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
    sd.Sepal.Length = sd(Sepal.Length, na.rm = TRUE),
    var.Sepal.Length = var(Sepal.Length, na.rm = TRUE)
  )
  sepal.length.species2
##      Species mean.Sepal.Length sd.Sepal.Length var.Sepal.Length
## 1     setosa             5.006       0.3524897        0.1242490
## 2 versicolor             5.936       0.5161711        0.2664327
## 3  virginica             6.588       0.6358796        0.4043429
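
The same pattern should work with any function that returns a single value per group. As a rough sketch, here is a call that counts the rows in each group and takes the median. The column names n and median.Sepal.Length, and the object name sepal.length.extra, are made up for this example.

```r
library(plyr)

# Count the observations and take the median sepal length for each species
sepal.length.extra <- ddply(iris, .(Species), summarise,
  n = length(Sepal.Length),
  median.Sepal.Length = median(Sepal.Length, na.rm = TRUE)
)
sepal.length.extra
```

A count column like n is handy to keep next to your means, because it tells you how many observations each summary is based on.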

If there were more categorical variables in this data.frame, then we could summarize by them also. To show you what I mean, let’s add another factor to the iris data.frame.

  iris2 <- data.frame(iris, NewFactor = rep(c("Big", "Small"), length.out = nrow(iris)))
  head(iris2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species NewFactor
## 1          5.1         3.5          1.4         0.2  setosa       Big
## 2          4.9         3.0          1.4         0.2  setosa     Small
## 3          4.7         3.2          1.3         0.2  setosa       Big
## 4          4.6         3.1          1.5         0.2  setosa     Small
## 5          5.0         3.6          1.4         0.2  setosa       Big
## 6          5.4         3.9          1.7         0.4  setosa     Small

Now let’s summarize by species and the new factor we just created.

  sepal.length.species.newfac <- ddply(iris2, .(Species, NewFactor), summarise,
    mean.Sepal.Length = mean(Sepal.Length, na.rm = TRUE),
    sd.Sepal.Length = sd(Sepal.Length, na.rm = TRUE)
  )
  sepal.length.species.newfac
##      Species NewFactor mean.Sepal.Length sd.Sepal.Length
## 1     setosa       Big             5.024       0.3908111
## 2     setosa     Small             4.988       0.3166491
## 3 versicolor       Big             5.992       0.5559676
## 4 versicolor     Small             5.880       0.4778424
## 5  virginica       Big             6.504       0.6031031
## 6  virginica     Small             6.672       0.6686554
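
When you group by more than one column, it is worth checking how many rows landed in each group. Here is a minimal sketch that rebuilds iris2 so it can run on its own and then counts the rows per combination. The names group.sizes and n are made up for this example.

```r
library(plyr)

# Rebuild iris2: alternate "Big" and "Small" down all the rows of iris
iris2 <- data.frame(iris, NewFactor = rep(c("Big", "Small"), length.out = nrow(iris)))

# Count the rows in every Species x NewFactor combination
group.sizes <- ddply(iris2, .(Species, NewFactor), summarise,
  n = length(Sepal.Length)
)
group.sizes
```

Because the labels simply alternate and each species has 50 rows, every combination should end up with 25 rows.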