Today a student asked me whether there was a function that would
calculate a bunch of descriptive stats for numerical data. I suggested
summary(), but they countered with “but what about the
measures of spread?”. So, let’s start with how you might create this
function. I won’t create a function that will do it all for you, but I
will provide you with the tools so you can create your own (you are
welcome).
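To see what the student was getting at, here is a quick look at what summary() returns for a numeric vector. The vector x is just made-up example data for this sketch.

```r
# Made-up example data
x <- 1:10

# summary() returns the minimum, quartiles, median, mean, and maximum
s <- summary(x)
names(s)

# The variance and standard deviation are not part of that result,
# so they have to be computed separately
var(x)
sd(x)
```

So summary() covers location and the quartiles, but a student who wants the variance or standard deviation has to call var() and sd() on their own.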
Use the function function()
As always, we can create our own custom function with function(). We
can give the function any number of arguments, from zero to a lot, and
each argument can either have a default value or be required from the
user. Let’s start by creating a function that prints something in the
console.
desc_stats <- function() {
  cat("Mean =")
}
# The function has no arguments right now
desc_stats()
## Mean =
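The function above takes no arguments, so here is a minimal sketch of the other two cases mentioned earlier: an argument the user must supply and an argument with a default value. The function show_value and its argument names are made up for this illustration.

```r
# Hypothetical function: 'x' has no default, so the caller must supply it;
# 'label' has a default value and can be left out
show_value <- function(x, label = "Value =") {
  cat(label, x, "\n")
}

show_value(5)                    # uses the default label
show_value(5, label = "Mean =")  # overrides the default label
```

Calling show_value() with no arguments at all should fail, because R has nothing to use for x once the function tries to print it.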
We can use cat() to print stuff out, without quotation
marks, in the console. Of course, we now want to add the mean of the data,
because just printing “Mean =” is a pretty lame function. So, we need to
add an argument to our new function so we can get the data from the
user.
desc_stats <- function(x) {
  cat("Mean =", mean(x))
}
# We now need to provide the function with some data
desc_stats(1:10)
## Mean = 5.5
Hopefully you noticed that we added the argument x to
our function. Of course, we could have named the argument anything, but
x (like y) is a common letter to represent data.
Let’s now add the median. We want to print the median on the next
line, so we use "\n" to tell R to go to a new line.
desc_stats <- function(x) {
  cat("Mean =", mean(x), "\n")
  cat("Median =", median(x))
}
desc_stats(1:10)
## Mean = 5.5
## Median = 5.5
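One detail about cat() that is easy to miss: it joins its arguments with a single space by default, and it never adds a newline on its own. The short sketch below, using the sep argument of cat(), shows both behaviours with hard-coded values.

```r
# cat() separates its arguments with a space by default,
# so a space is also printed just before the newline
cat("Mean =", 5.5, "\n")

# sep = "" removes the separator, so any spaces must be written explicitly
cat("Mean = ", 5.5, "\n", sep = "")

# Without "\n", the next cat() call continues on the same line
cat("Mean = 5.5")
cat(" | Median = 5.5", "\n")
```

The trailing space is harmless in the console, but it is worth remembering if you ever compare the printed text against an expected string.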
Maybe you want to include a couple metrics of spread. We just need to
add a few lines to calculate and print these metrics.
desc_stats <- function(x) {
  cat("Mean =", mean(x), "\n")
  cat("Median =", median(x), "\n")
  cat("Variance =", var(x), "\n")
  cat("Standard deviation =", sd(x))
}
desc_stats(1:10)
## Mean = 5.5
## Median = 5.5
## Variance = 9.166667
## Standard deviation = 3.02765
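The same pattern extends to other measures. As a rough sketch, here are a few more measures of spread from base R (range(), IQR(), and mad()); the function name desc_spread is made up for this example.

```r
# Hypothetical helper that prints a few more measures of spread
desc_spread <- function(x) {
  cat("Range =", range(x), "\n")                    # minimum and maximum
  cat("Interquartile range =", IQR(x), "\n")        # Q3 minus Q1
  cat("Median absolute deviation =", mad(x), "\n")  # scaled MAD by default
}

desc_spread(1:10)
```

Each new statistic is just one more cat() line, so you can pick whichever measures make sense for your data.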
Adding more arguments
I am sure you can see how to add more descriptive stats to the
function. So, I will now show you how you can add options for the user.
We will add a new argument and slightly modify the printout. The new
argument allows the user to select whether they want metrics of
“location”, “spread”, or “both”. Notice that results = "both"
provides a default value for this argument. So, if the
user only provides the data, then the function will print out measures
of location and spread by default. We use if() to execute
the code based on what the user has selected. Notice that I use "\t" to tab
the output. I also added the number of observations in the data to the
bottom of the output.
desc_stats <- function(x, results = "both") {
  if(results == "both" | results == "location") {
    cat("Measures of location", "\n")
    cat("\t", "Mean =", mean(x), "\n")
    cat("\t", "Median =", median(x), "\n \n")
  }
  if(results == "both" | results == "spread") {
    cat("Measures of spread", "\n")
    cat("\t", "Variance =", var(x), "\n")
    cat("\t", "Standard deviation =", sd(x), "\n \n")
  }
  cat("n =", length(x))
}
desc_stats(1:10)
## Measures of location
## Mean = 5.5
## Median = 5.5
##
## Measures of spread
## Variance = 9.166667
## Standard deviation = 3.02765
##
## n = 10
desc_stats(1:10, results = "both")
## Measures of location
## Mean = 5.5
## Median = 5.5
##
## Measures of spread
## Variance = 9.166667
## Standard deviation = 3.02765
##
## n = 10
desc_stats(1:10, results = "location")
## Measures of location
## Mean = 5.5
## Median = 5.5
##
## n = 10
desc_stats(1:10, results = "spread")
## Measures of spread
## Variance = 9.166667
## Standard deviation = 3.02765
##
## n = 10
This is looking pretty good. But… let’s add a little more code to
provide the user with a snarky response if they don’t provide numeric
data for the argument x, or give an invalid value for the
argument results.
desc_stats <- function(x, results = "both") {
  if(!is.numeric(x)) stop("The data in x must be numeric. How do you expect me to calculate the mean or variance of a non-numeric value?")
  if(!(results %in% c("location", "spread", "both"))) stop("Options for the results are: 'location', 'spread', or 'both' -- and not whatever you put.")
  if(results == "both" | results == "location") {
    cat("Measures of location", "\n")
    cat("\t", "Mean =", mean(x), "\n")
    cat("\t", "Median =", median(x), "\n \n")
  }
  if(results == "both" | results == "spread") {
    cat("Measures of spread", "\n")
    cat("\t", "Variance =", var(x), "\n")
    cat("\t", "Standard deviation =", sd(x), "\n \n")
  }
  cat("n =", length(x))
}
desc_stats("testing") #Give x a non-numeric object
## Error in desc_stats("testing"): The data in x must be numeric. How do you expect me to calculate the mean or variance of a non-numeric value?
desc_stats(1:10, results = "stats") #Give an invalid value for results
## Error in desc_stats(1:10, results = "stats"): Options for the results are: 'location', 'spread', or 'both' -- and not whatever you put.
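If you would rather skip the snark, base R's match.arg() can do the check on results for you. This is only a sketch of that alternative, not a replacement for the function above; the name desc_stats2 is made up, and the printout is trimmed to keep it short.

```r
# Sketch: list the allowed values in the signature; the first one is the default
desc_stats2 <- function(x, results = c("both", "location", "spread")) {
  if (!is.numeric(x)) stop("The data in x must be numeric.")
  # match.arg() returns the matched choice, or stops with an error
  results <- match.arg(results)
  if (results %in% c("both", "location")) {
    cat("Mean =", mean(x), "\n")
    cat("Median =", median(x), "\n")
  }
  if (results %in% c("both", "spread")) {
    cat("Variance =", var(x), "\n")
    cat("Standard deviation =", sd(x), "\n")
  }
  cat("n =", length(x), "\n")
}

desc_stats2(1:10)                      # no results given, so "both" is used
desc_stats2(1:10, results = "spread")

# tryCatch() lets us inspect the error message without halting the script
msg <- tryCatch(desc_stats2(1:10, results = "stats"),
                error = function(e) conditionMessage(e))
msg
```

One trade-off to be aware of: match.arg() also accepts unambiguous partial matches such as "loc", and the error message is R's own rather than yours.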