Exercises

Begin this practical exercise by setting the maximum line length in RStudio to 80 characters. Go to RStudio’s Preferences (or Global Options under Tools) > Code > Display, and tick the Show margin box. Make sure that the margin column is set to 80.

Exercise 1-5

Install package mice.

Go to Tools > Install Packages in RStudio. If you are connected to the internet, select Repository under Install From and type mice under Packages. Leave the Install to Library at default and make sure that Install Dependencies is selected. Click install. If you are not connected to the internet, select Package Archive File under “Install from” and navigate to the respective file on your drive.

Some packages depend on other packages, meaning that their functionality may be limited if their dependencies are not installed. Installing dependencies is therefore recommended, but internet connectivity is required.

If all is right, you will receive a message in the console that the package (as well as its dependencies) has been installed.

Alternatively, if you know the name of the package you would like to install (in this case mice), you can also call install.packages("mice") in the console window.
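In a script, it can be convenient to install a package only when it is missing. The following is a minimal sketch of that idea. The helper name install_if_missing() is made up for this illustration; requireNamespace() and install.packages() are regular R functions.

```r
# Hypothetical helper: install a package only if it cannot be found.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs internet access and a configured CRAN mirror
  }
  # Report (invisibly) whether the package is available afterwards.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Uncomment to run for the package used in this practical:
# install_if_missing("mice")
```

Because the check comes first, running the script again does not reinstall a package that is already there.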

Load package mice. Loading packages can be done through functions library() and require().

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

If you use require() within a function and the required package is not available, require() issues a warning and returns FALSE, so the remainder of the function is still executed, whereas library() throws an error and stops execution. For straightforward scripts, library() is preferred: require() adds a little computational overhead because it calls library() itself.
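The difference can be seen with a package that is not installed. This is a small sketch; the package name notARealPackage123 is an assumption (any name that does not exist on your machine will do).

```r
# A package name that is assumed not to exist on your machine.
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning, suppressed here); execution continues.
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() throws an error; tryCatch() catches it so this script can go on.
res <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
res
```

Here ok is FALSE and res is "error", which shows that only library() interrupts the normal flow.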

Most packages have datasets included. Open the mammalsleep dataset from package mice by typing mammalsleep in the console, and subsequently by using the function View().

Using View() is preferred for inspecting datasets that are large. View() opens the dataset in a spreadsheet-like window (similar to MS Excel or SPSS). Note that View() is a read-only viewer; to edit a dataset’s contents interactively, R offers the functions edit() and fix().

Write the mammalsleep dataset from package mice to the working directory as a tab-delimited text file with . as the decimal separator. Name the file mammalsleep.txt.

library(mice)
write.table(mammalsleep, "mammalsleep.txt", sep = "\t", dec = ".", row.names = FALSE)

The argument sep = "\t" indicates that the file is tab-delimited and the argument dec = "." indicates that a point is used as the decimal separator (instead of a comma). row.names = FALSE tells R that row names are not to be included in the exported file.

Import the mammalsleep.txt file.

sleepdata <- read.table("mammalsleep.txt", sep = "\t", dec = ".", header = TRUE, stringsAsFactors = TRUE)

The argument sep = "\t" indicates that the file is tab-delimited and the argument dec = "." indicates that a point is used as the decimal separator (instead of a comma). header = TRUE tells R that the first line of the file contains the variable names.
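A quick way to check that the export and import settings agree is a round trip through a temporary file. The sketch below uses a small made-up data frame (toy) rather than the real dataset, so it runs without package mice.

```r
# Toy data for illustration only (not the real mammalsleep data).
toy <- data.frame(
  species = factor(c("Cat", "Cow", "Man")),
  bw      = c(3.3, 465, 62),
  pi      = c(1L, 5L, 1L)
)

# Write to a temporary tab-delimited file, then read it back with matching settings.
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", dec = ".", row.names = FALSE)
back <- read.table(tmp, sep = "\t", dec = ".", header = TRUE,
                   stringsAsFactors = TRUE)

# TRUE when the imported data match the exported data.
all.equal(toy, back)
```

If sep or dec differ between the write and read step, the comparison fails, which makes this a cheap sanity check.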

All files that are present in the working directory of the current R project can be imported into the workspace (the space that contains all environments) by file name alone. Files in any other location require you to specify the path, either in full from the root of your machine or relative to the working directory. To find out what the current working directory is, you can type getwd(), and to change it you can use setwd(). The beauty of using projects in RStudio is that you rarely have to change the working directory, as it is automatically set to the project’s folder when you open the project.
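As a brief illustration of how a bare file name relates to the working directory, consider the sketch below. It only builds and inspects paths; whether mammalsleep.txt actually exists depends on your own session.

```r
# Where is R currently looking for files?
wd <- getwd()

# A bare file name is resolved relative to the working directory,
# so these two paths point to the same location.
relative_path <- "mammalsleep.txt"
full_path <- file.path(wd, relative_path)

# Check whether the file is there before trying to import it.
file.exists(relative_path)
```

Using file.path() instead of pasting strings together keeps the path separators consistent across operating systems.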

There are many packages that facilitate importing datasets from other statistical software packages, such as SPSS (e.g. function read_spss() from package haven), Mplus (package MplusAutomation), Stata (read.dta() in foreign), SAS (sasxport.get() from package Hmisc) and from spreadsheet software, such as MS Excel (function read.xlsx() from package xlsx). For a short guide to importing multiple formats into R, see e.g. http://www.statmethods.net/input/importingdata.html.

Exercise 6-10

The dataset we’ve just imported contains the sleep data by Allison & Cicchetti (1976). Inspect sleepdata and make yourself familiar with it.

If you would like to know more about this dataset, you can open the help for the mammalsleep dataset in package mice through ?mammalsleep. Don’t forget to load package mice first.

Inspecting the sleepdata could be done by

str(sleepdata) #the data structure

## 'data.frame':    62 obs. of  11 variables:
##  $ species: Factor w/ 62 levels "African elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ bw     : num  6654 1 3.38 0.92 2547 ...
##  $ brw    : num  5712 6.6 44.5 5.7 4603 ...
##  $ sws    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
##  $ ps     : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
##  $ ts     : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ mls    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
##  $ gt     : num  645 42 60 25 624 180 35 392 63 230 ...
##  $ pi     : int  3 3 1 5 3 4 1 4 1 1 ...
##  $ sei    : int  5 1 1 2 5 4 1 5 2 1 ...
##  $ odi    : int  3 3 1 3 4 4 1 4 1 1 ...

summary(sleepdata) #distributional summaries

##                       species         bw                brw         
##  African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
##  African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
##  Arctic Fox               : 1   Median :   3.342   Median :  17.25  
##  Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
##  Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
##  Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
##  (Other)                  :56                                       
##       sws               ps              ts             mls         
##  Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
##  1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
##  Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
##  Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
##  3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
##  Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
##  NA's   :14       NA's   :12      NA's   :4       NA's   :4        
##        gt               pi             sei             odi       
##  Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
##  3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4

round(cor(sleepdata[, -1], use = "pairwise.complete.obs"), 2) #bivariate correlations, variable 1 excluded.

##        bw   brw   sws    ps    ts   mls    gt    pi   sei   odi
## bw   1.00  0.93 -0.38 -0.11 -0.31  0.30  0.65  0.06  0.34  0.13
## brw  0.93  1.00 -0.37 -0.11 -0.36  0.51  0.75  0.03  0.37  0.15
## sws -0.38 -0.37  1.00  0.51  0.96 -0.38 -0.59 -0.32 -0.54 -0.48
## ps  -0.11 -0.11  0.51  1.00  0.73 -0.30 -0.45 -0.45 -0.54 -0.58
## ts  -0.31 -0.36  0.96  0.73  1.00 -0.41 -0.63 -0.40 -0.64 -0.59
## mls  0.30  0.51 -0.38 -0.30 -0.41  1.00  0.61 -0.10  0.36  0.06
## gt   0.65  0.75 -0.59 -0.45 -0.63  0.61  1.00  0.20  0.64  0.38
## pi   0.06  0.03 -0.32 -0.45 -0.40 -0.10  0.20  1.00  0.62  0.92
## sei  0.34  0.37 -0.54 -0.54 -0.64  0.36  0.64  0.62  1.00  0.79
## odi  0.13  0.15 -0.48 -0.58 -0.59  0.06  0.38  0.92  0.79  1.00

head(mammalsleep) #first six rows

##                     species       bw    brw sws  ps   ts  mls  gt pi sei odi
## 1          African elephant 6654.000 5712.0  NA  NA  3.3 38.6 645  3   5   3
## 2 African giant pouched rat    1.000    6.6 6.3 2.0  8.3  4.5  42  3   1   3
## 3                Arctic Fox    3.385   44.5  NA  NA 12.5 14.0  60  1   1   1
## 4    Arctic ground squirrel    0.920    5.7  NA  NA 16.5   NA  25  5   2   3
## 5            Asian elephant 2547.000 4603.0 2.1 1.8  3.9 69.0 624  3   5   4
## 6                    Baboon   10.550  179.5 9.1 0.7  9.8 27.0 180  4   4   4

tail(mammalsleep) #last six rows

##                  species    bw  brw  sws  ps   ts  mls  gt pi sei odi
## 57                Tenrec 0.900  2.6 11.0 2.3 13.3  4.5  60  2   1   2
## 58            Tree hyrax 2.000 12.3  4.9 0.5  5.4  7.5 200  3   1   3
## 59            Tree shrew 0.104  2.5 13.2 2.6 15.8  2.3  46  3   2   2
## 60                Vervet 4.190 58.0  9.7 0.6 10.3 24.0 210  4   3   4
## 61         Water opossum 3.500  3.9 12.8 6.6 19.4  3.0  14  2   1   1
## 62 Yellow-bellied marmot 4.050 17.0   NA  NA   NA 13.0  38  3   1   1

?mammalsleep # the help

Note that sleepdata is a data frame: read.table() always returns one. Because we specified stringsAsFactors = TRUE, the animal names are stored as a factor (categorical variable).

The functions head() and tail() are very useful for a quick look at the first and last rows of a dataset, as is str(), which gives you a quick overview of the measurement levels of the variables.

Since mammalsleep is an R-dataset, there should be a help file. Taking a look at ?mammalsleep may yield valuable insight about the measurements and origin of the variables.

One thing that may have caught your attention is the relation between ts, ps and sws. This is a deterministic relation where total sleep (ts) is the sum of paradoxical sleep (ps) and slow-wave sleep (sws). If you were to model the data, you would need to take such relations into account.
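Such a relation can be checked numerically. The sketch below copies the sws, ps and ts values of four rows from the head() output above into a small data frame (toy) and verifies the sum for the rows without missing values; it is an illustration, not a check of the full dataset.

```r
# Values copied from the head() output above (rows 1, 2, 5 and 6).
toy <- data.frame(
  sws = c(NA, 6.3, 2.1, 9.1),
  ps  = c(NA, 2.0, 1.8, 0.7),
  ts  = c(3.3, 8.3, 3.9, 9.8)
)

# Only rows with all three values observed can be checked.
complete <- complete.cases(toy)

# Absolute difference between ts and sws + ps; a small tolerance
# guards against floating-point rounding.
gap <- with(toy[complete, ], abs(ts - (sws + ps)))
all(gap < 1e-8)
```

The same idea applies to the full data by replacing toy with sleepdata.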

Save the current workspace. Name the workspace Practical_C.RData. Also, save the sleepdata file as a separate workspace called Sleepdata.RData.

Now that we have imported our data, it may be wise to save the current workspace, i.e. the current state of affairs. Saving the workspace will leave everything as is, so that we can continue from this exact state at a later time, by simply opening the workspace file. To save everything in the current workspace, type

save.image("Practical_C.RData")

To save just the dataset sleepdata, and nothing else, type

save(sleepdata, file = "Sleepdata.RData")

With the save functions, any object in the workspace can be saved.
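To see that a saved object really comes back, you can save it, remove it and load it again. This sketch uses a made-up object (toy_data) and a temporary file so that it does not touch the files of this exercise.

```r
# A small object for illustration only.
toy_data <- data.frame(id = 1:3)

# Save it to a temporary .RData file and remove it from the workspace.
tmp <- tempfile(fileext = ".RData")
save(toy_data, file = tmp)
rm(toy_data)

# load() restores the object under its original name and
# returns the names of everything it restored.
restored <- load(tmp)
restored
toy_data
```

Note that load() overwrites any object in the workspace that has the same name as a restored object.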

Some animals were not used in the calculations by Allison and Cicchetti. Exclude the following animals from the sleepdata dataset: Echidna, Lesser short-tailed shrew and Musk shrew. Save the dataset as sleepdata2. Tip: use the square brackets to indicate [rows, columns].

There are three ways to exclude the three animals from the dataset. The first approach uses the names:

exclude <- c("Echidna", "Lesser short-tailed shrew", "Musk shrew")
which <- sleepdata$species %in% exclude #Indicate the species that match the names in exclude
which

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [37] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE

sleepdata2 <- sleepdata[!which, ]

The second approach uses function filter() from package dplyr:

library(dplyr) # Data Manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

filter(sleepdata, !species %in% exclude) # ! negates: TRUE becomes FALSE and vice versa

##                      species       bw     brw  sws  ps   ts   mls  gt pi sei
## 1           African elephant 6654.000 5712.00   NA  NA  3.3  38.6 645  3   5
## 2  African giant pouched rat    1.000    6.60  6.3 2.0  8.3   4.5  42  3   1
## 3                 Arctic Fox    3.385   44.50   NA  NA 12.5  14.0  60  1   1
## 4     Arctic ground squirrel    0.920    5.70   NA  NA 16.5    NA  25  5   2
## 5             Asian elephant 2547.000 4603.00  2.1 1.8  3.9  69.0 624  3   5
## 6                     Baboon   10.550  179.50  9.1 0.7  9.8  27.0 180  4   4
## 7              Big brown bat    0.023    0.30 15.8 3.9 19.7  19.0  35  1   1
## 8            Brazilian tapir  160.000  169.00  5.2 1.0  6.2  30.4 392  4   5
## 9                        Cat    3.300   25.60 10.9 3.6 14.5  28.0  63  1   2
## 10                Chimpanzee   52.160  440.00  8.3 1.4  9.7  50.0 230  1   1
## 11                Chinchilla    0.425    6.40 11.0 1.5 12.5   7.0 112  5   4
## 12                       Cow  465.000  423.00  3.2 0.7  3.9  30.0 281  5   5
## 13           Desert hedgehog    0.550    2.40  7.6 2.7 10.3    NA  NA  2   1
## 14                    Donkey  187.100  419.00   NA  NA  3.1  40.0 365  5   5
## 15     Eastern American mole    0.075    1.20  6.3 2.1  8.4   3.5  42  1   1
## 16         European hedgehog    0.785    3.50  6.6 4.1 10.7   6.0  42  2   2
## 17                    Galago    0.200    5.00  9.5 1.2 10.7  10.4 120  2   2
## 18                     Genet    1.410   17.50  4.8 1.3  6.1  34.0  NA  1   2
## 19           Giant armadillo   60.000   81.00 12.0 6.1 18.1   7.0  NA  1   1
## 20                   Giraffe  529.000  680.00   NA 0.3   NA  28.0 400  5   5
## 21                      Goat   27.660  115.00  3.3 0.5  3.8  20.0 148  5   5
## 22            Golden hamster    0.120    1.00 11.0 3.4 14.4   3.9  16  3   1
## 23                   Gorilla  207.000  406.00   NA  NA 12.0  39.3 252  1   4
## 24                 Gray seal   85.000  325.00  4.7 1.5  6.2  41.0 310  1   3
## 25                 Gray wolf   36.330  119.50   NA  NA 13.0  16.2  63  1   1
## 26           Ground squirrel    0.101    4.00 10.4 3.4 13.8   9.0  28  5   1
## 27                Guinea pig    1.040    5.50  7.4 0.8  8.2   7.6  68  5   3
## 28                     Horse  521.000  655.00  2.1 0.8  2.9  46.0 336  5   5
## 29                    Jaguar  100.000  157.00   NA  NA 10.8  22.4 100  1   1
## 30                  Kangaroo   35.000   56.00   NA  NA   NA  16.3  33  3   5
## 31          Little brown bat    0.010    0.25 17.9 2.0 19.9  24.0  50  1   1
## 32                       Man   62.000 1320.00  6.1 1.9  8.0 100.0 267  1   1
## 33                  Mole rat    0.122    3.00  8.2 2.4 10.6    NA  30  2   1
## 34           Mountain beaver    1.350    8.10  8.4 2.8 11.2    NA  45  3   1
## 35                     Mouse    0.023    0.40 11.9 1.3 13.2   3.2  19  4   1
## 36       N. American opossum    1.700    6.30 13.8 5.6 19.4   5.0  12  2   1
## 37     Nine-banded armadillo    3.500   10.80 14.3 3.1 17.4   6.5 120  2   1
## 38                     Okapi  250.000  490.00   NA 1.0   NA  23.6 440  5   5
## 39                Owl monkey    0.480   15.50 15.2 1.8 17.0  12.0 140  2   2
## 40              Patas monkey   10.000  115.00 10.0 0.9 10.9  20.2 170  4   4
## 41                Phanlanger    1.620   11.40 11.9 1.8 13.7  13.0  17  2   1
## 42                       Pig  192.000  180.00  6.5 1.9  8.4  27.0 115  4   4
## 43                    Rabbit    2.500   12.10  7.5 0.9  8.4  18.0  31  5   5
## 44                   Raccoon    4.288   39.20   NA  NA 12.5  13.7  63  2   2
## 45                       Rat    0.280    1.90 10.6 2.6 13.2   4.7  21  3   1
## 46                   Red fox    4.235   50.40  7.4 2.4  9.8   9.8  52  1   1
## 47             Rhesus monkey    6.800  179.00  8.4 1.2  9.6  29.0 164  2   3
## 48    Rock hyrax (Hetero. b)    0.750   12.30  5.7 0.9  6.6   7.0 225  2   2
## 49 Rock hyrax (Procavia hab)    3.600   21.00  4.9 0.5  5.4   6.0 225  3   2
## 50                  Roe deer   14.830   98.20   NA  NA  2.6  17.0 150  5   5
## 51                     Sheep   55.500  175.00  3.2 0.6  3.8  20.0 151  5   5
## 52                Slow loris    1.400   12.50   NA  NA 11.0  12.7  90  2   2
## 53           Star nosed mole    0.060    1.00  8.1 2.2 10.3   3.5  NA  3   1
## 54                    Tenrec    0.900    2.60 11.0 2.3 13.3   4.5  60  2   1
## 55                Tree hyrax    2.000   12.30  4.9 0.5  5.4   7.5 200  3   1
## 56                Tree shrew    0.104    2.50 13.2 2.6 15.8   2.3  46  3   2
## 57                    Vervet    4.190   58.00  9.7 0.6 10.3  24.0 210  4   3
## 58             Water opossum    3.500    3.90 12.8 6.6 19.4   3.0  14  2   1
## 59     Yellow-bellied marmot    4.050   17.00   NA  NA   NA  13.0  38  3   1
##    odi
## 1    3
## 2    3
## 3    1
## 4    3
## 5    4
## 6    4
## 7    1
## 8    4
## 9    1
## 10   1
## 11   4
## 12   5
## 13   2
## 14   5
## 15   1
## 16   2
## 17   2
## 18   1
## 19   1
## 20   5
## 21   5
## 22   2
## 23   1
## 24   1
## 25   1
## 26   3
## 27   4
## 28   5
## 29   1
## 30   4
## 31   1
## 32   1
## 33   1
## 34   3
## 35   3
## 36   1
## 37   1
## 38   5
## 39   2
## 40   4
## 41   2
## 42   4
## 43   5
## 44   2
## 45   3
## 46   1
## 47   2
## 48   2
## 49   3
## 50   5
## 51   5
## 52   2
## 53   2
## 54   2
## 55   3
## 56   2
## 57   4
## 58   1
## 59   1

The third approach uses the row numbers directly (you would need to look up or calculate the row numbers):

sleepdata2 <- sleepdata[-c(16, 32, 38), ]

Note that the numbered option requires less code, but the named option has a much lower probability of error. As the dataset might change, or might get sorted differently, the row numbers in the numbered option may no longer point to the intended animals.
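The risk of the numbered option can be shown with a small sketch. The data frame toy holds four animals with brain weights copied from the output above; which animals are dropped here (Cow and Man) is chosen only for illustration.

```r
# Toy data: brain weights copied from the output above.
toy <- data.frame(
  species = c("Baboon", "Cat", "Cow", "Man"),
  brw     = c(179.5, 25.6, 423, 1320),
  stringsAsFactors = FALSE
)
drop_names <- c("Cow", "Man")  # exclusion by name
drop_rows  <- c(3, 4)          # exclusion by row number (valid for toy as is)

# Re-sort the data, as might happen in a later version of the dataset.
resorted <- toy[order(toy$brw, decreasing = TRUE), ]

# By name: still removes Cow and Man.
by_name <- resorted[!resorted$species %in% drop_names, ]
by_name$species

# By row number: now removes Baboon and Cat instead.
by_row <- resorted[-drop_rows, ]
by_row$species
```

After sorting, the named approach keeps Baboon and Cat as intended, whereas the numbered approach keeps Man and Cow.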

Plot brain weight as a function of species.

plot(brw ~ species, data = sleepdata2)

Some animals have much heavier brains than other animals. Find out the names of the animals that have a brain weight larger than 1 standard deviation above the mean brain weight. Replicate the plot from Question 9 with only these animals and do not plot any information about the other animals.

To find out which animals have a brain weight larger than 1 standard deviation above the mean brain weight:

sd.brw <- sd(sleepdata2$brw) #standard deviation  
mean.brw <- mean(sleepdata2$brw) #mean
which <- sleepdata2$brw > (mean.brw + (1 * sd.brw)) #which are larger?
as.character(sleepdata2$species[which]) #names of the animals with brw > mean + 1 sd

## [1] "African elephant" "Asian elephant"   "Man"
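The same selection can be phrased in terms of standardized scores: a brain weight more than 1 standard deviation above the mean has a z-score larger than 1. The sketch below uses scale() on the six brain weights from the head() output above. Because it uses only six animals, its mean and standard deviation differ from those of the full dataset.

```r
# Species and brain weights copied from the head() output above.
species <- c("African elephant", "African giant pouched rat", "Arctic Fox",
             "Arctic ground squirrel", "Asian elephant", "Baboon")
brw <- c(5712, 6.6, 44.5, 5.7, 4603, 179.5)

# scale() subtracts the mean and divides by the standard deviation.
z <- as.numeric(scale(brw))

# Animals whose brain weight lies more than 1 sd above the mean.
heavy <- species[z > 1]
heavy
```

Within this small subset, the two elephants are selected.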

To plot these animals:

plot(brw ~ species, data = sleepdata2[which, ])

The downside is that it still prints all the animals on the x-axis. This is because the factor levels for species are carried over to the smaller subset of the data, and plot() uses all levels as axis labels. For example,

sleepdata2$species[which]

## [1] African elephant Asian elephant   Man             
## 62 Levels: African elephant African giant pouched rat ... Yellow-bellied marmot

returns only 3 mammals, but still has 62 factor levels. To get rid of the unused factor levels, we can use function factor():

sleepdata3 <- sleepdata2[which, ]
sleepdata3$species <- factor(sleepdata3$species)
sleepdata3$species

## [1] African elephant Asian elephant   Man             
## Levels: African elephant Asian elephant Man
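As an alternative to calling factor() again, R also provides droplevels(), which removes unused levels from a factor. A minimal sketch with a made-up factor:

```r
# A factor with four levels, for illustration only.
species <- factor(c("African elephant", "Asian elephant", "Cat", "Man"))

# Subsetting keeps all original levels, including the unused "Cat".
big <- species[species != "Cat"]
nlevels(big)

# droplevels() keeps only the levels that still occur.
big_clean <- droplevels(big)
levels(big_clean)
```

droplevels() can also be applied to a whole data frame, in which case it drops the unused levels of every factor column.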

To plot the graph that we wanted:

plot(brw ~ species, data = sleepdata3)

If your current analysis platform is different from R, chances are that you prepare your data in the software of your choice. R has fantastic facilities for importing and exporting data, and I would specifically like to point you to package haven by Hadley Wickham. It provides wonderful functions to import and export many data types from software such as Stata, SAS and SPSS.

Useful links

Package haven for importing/exporting SPSS, SAS and Stata data.

End of exercise

Practical C

Gerko Vink

Fundamental Techniques in Data Science with R
