Begin this practical exercise by setting the maximum line length in RStudio to 80 characters. Go to RStudio's Preferences (or Global Options under Tools) -> Code -> Display, and tick the show margin box. Make sure that the margin column is set to 80.
Install package mice.
Go to Tools > Install Packages in RStudio. If you are connected to the internet, select Repository under Install From and type mice under Packages. Leave Install to Library at its default and make sure that Install Dependencies is selected. Click Install. If you are not connected to the internet, select Package Archive File under Install From and navigate to the respective file on your drive.
Some packages depend on other packages, meaning that their functionality may be limited if their dependencies are not installed. Installing dependencies is therefore recommended, but it requires an internet connection.
If all is right, you will receive a message in the console that the package (as well as its dependencies) has been installed.
Alternatively, if you know the name of the package you would like to install (in this case mice), you can also call install.packages("mice") in the console window.
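If you rerun a script often, it can be convenient to install a package only when it is missing. The following is a minimal sketch of that idea, not part of the original exercise; the helper name missing_packages() is hypothetical, and the actual install call is left commented out because it needs an internet connection.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE/FALSE without attaching the package
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

to_install <- missing_packages(c("mice"))
to_install  # empty when mice is already installed

# Uncomment to install whatever is missing (requires internet):
# if (length(to_install) > 0) install.packages(to_install)
```

Checking with requireNamespace() rather than library() avoids attaching the package just to test whether it is available.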
Load package mice.
Loading packages can be done through the functions library() and require().
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
If you use require() within a function and the required package is not available, require() yields a warning and returns FALSE, so the remainder of the function is still executed, whereas library() yields an error and terminates execution. Using library() is preferred when you are not doing anything too complicated: require() adds some computational overhead because it calls library() itself.
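The difference between the two functions can be made visible with a package that does not exist. This is a small illustrative sketch; the package name notARealPackage123 is made up and assumed not to be installed.

```r
pkg <- "notARealPackage123"  # hypothetical name, assumed not installed

# require() signals a warning and returns FALSE; execution continues
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() signals an error; here we catch it so the script keeps running
res <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "error"
)
res
```

Because require() returns a logical value, it only makes sense when you actually test that value; otherwise a missing package goes unnoticed until a later line fails.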
Inspect the mammalsleep dataset from package mice by typing mammalsleep in the console, and subsequently by using the function View().
Using View() is preferred for inspecting large datasets. View() opens the dataset in a spreadsheet-like window (similar to MS Excel or SPSS). Note that View() only displays the data; to edit a dataset's contents in a spreadsheet-like window, base R provides edit() and fix().
Write the mammalsleep dataset from package mice to the work directory as a tab-delimited text file with . as the decimal separator. Name the file mammalsleep.txt.
library(mice)
write.table(mammalsleep, "mammalsleep.txt", sep = "\t", dec = ".", row.names = FALSE)
The argument sep = "\t" indicates that the file is tab-delimited and the argument dec = "." indicates that a point is used as the decimal separator (instead of a comma). row.names = FALSE tells R that row names are not to be included in the exported file.
Import the mammalsleep.txt file.
sleepdata <- read.table("mammalsleep.txt", sep = "\t", dec = ".", header = TRUE, stringsAsFactors = TRUE)
The argument sep = "\t" indicates that the file is tab-delimited and the argument dec = "." indicates that a point is used as the decimal separator (instead of a comma). header = TRUE tells R that variable names are included in the first line of the file.
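To convince yourself that exporting and importing with matching sep and dec arguments preserves the data, you can do a round trip through a temporary file. The sketch below uses a tiny made-up data frame (toy) rather than mammalsleep, so it runs without package mice.

```r
# Small hypothetical data frame standing in for mammalsleep
toy <- data.frame(
  species = c("Cat", "Cow", "Pig"),
  bw      = c(3.3, 465, 192),
  sws     = c(10.9, 3.2, NA)
)

path <- tempfile(fileext = ".txt")  # temporary file, not the work directory

# Export: tab-delimited, point as decimal separator, no row names
write.table(toy, path, sep = "\t", dec = ".", row.names = FALSE)

# Import with the same settings; the first line holds the variable names
back <- read.table(path, sep = "\t", dec = ".", header = TRUE,
                   stringsAsFactors = TRUE)

str(back)                     # species comes back as a factor
all.equal(toy$bw, back$bw)    # numeric values survive the round trip
```

Missing values are written as NA and read back as NA, which is the default for both functions.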
All files that are present in the work directory of the current R project can be imported into the workspace (the space that contains all environments) directly by file name. Files in any other location require you to specify the path from the root of your machine. To find out what the current work directory is, type getwd(); to change the work directory, use setwd(). The beauty of using projects in RStudio is that you never have to change the work directory, as it is automatically set to the project's directory.
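The relation between a bare file name and the work directory can be sketched as follows. This is only an illustration: the file mammalsleep.txt does not need to exist for the code to run, and tempdir() is used as an arbitrary second directory.

```r
wd <- getwd()                                   # current work directory
full_path <- file.path(wd, "mammalsleep.txt")   # same file, spelled out in full

# A bare file name is resolved relative to the work directory,
# so both checks give the same answer
same_answer <- identical(file.exists("mammalsleep.txt"),
                         file.exists(full_path))
same_answer

# setwd() returns the previous work directory, which makes it easy to restore
old <- setwd(tempdir())
setwd(old)
back_home <- identical(getwd(), wd)
back_home
```

Building paths with file.path() instead of pasting strings keeps the code portable across operating systems.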
There are many packages that facilitate importing datasets from other statistical software packages, such as SPSS (e.g. function read_spss() from package haven), Mplus (package MplusAutomation), Stata (read.dta() in foreign), SAS (sasxport.get() from package Hmisc) and from spreadsheet software, such as MS Excel (function read.xlsx() from package xlsx). For a short guide to importing multiple formats into R, see e.g. http://www.statmethods.net/input/importingdata.html.
If you would like to know more about this dataset, you can open the help for the mammalsleep dataset in package mice through ?mammalsleep. Don't forget to load package mice first.
Inspecting the sleepdata could be done by
str(sleepdata) #the data structure
## 'data.frame': 62 obs. of 11 variables:
## $ species: Factor w/ 62 levels "African elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ bw : num 6654 1 3.38 0.92 2547 ...
## $ brw : num 5712 6.6 44.5 5.7 4603 ...
## $ sws : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
## $ ps : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
## $ ts : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
## $ mls : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
## $ gt : num 645 42 60 25 624 180 35 392 63 230 ...
## $ pi : int 3 3 1 5 3 4 1 4 1 1 ...
## $ sei : int 5 1 1 2 5 4 1 5 2 1 ...
## $ odi : int 3 3 1 3 4 4 1 4 1 1 ...
summary(sleepdata) #distributional summaries
## species bw brw
## African elephant : 1 Min. : 0.005 Min. : 0.14
## African giant pouched rat: 1 1st Qu.: 0.600 1st Qu.: 4.25
## Arctic Fox : 1 Median : 3.342 Median : 17.25
## Arctic ground squirrel : 1 Mean : 198.790 Mean : 283.13
## Asian elephant : 1 3rd Qu.: 48.202 3rd Qu.: 166.00
## Baboon : 1 Max. :6654.000 Max. :5712.00
## (Other) :56
## sws ps ts mls
## Min. : 2.100 Min. :0.000 Min. : 2.60 Min. : 2.000
## 1st Qu.: 6.250 1st Qu.:0.900 1st Qu.: 8.05 1st Qu.: 6.625
## Median : 8.350 Median :1.800 Median :10.45 Median : 15.100
## Mean : 8.673 Mean :1.972 Mean :10.53 Mean : 19.878
## 3rd Qu.:11.000 3rd Qu.:2.550 3rd Qu.:13.20 3rd Qu.: 27.750
## Max. :17.900 Max. :6.600 Max. :19.90 Max. :100.000
## NA's :14 NA's :12 NA's :4 NA's :4
## gt pi sei odi
## Min. : 12.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 35.75 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median : 79.00 Median :3.000 Median :2.000 Median :2.000
## Mean :142.35 Mean :2.871 Mean :2.419 Mean :2.613
## 3rd Qu.:207.50 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :645.00 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4
round(cor(sleepdata[, -1], use = "pairwise.complete.obs"), 2) #bivariate correlations, variable 1 excluded.
## bw brw sws ps ts mls gt pi sei odi
## bw 1.00 0.93 -0.38 -0.11 -0.31 0.30 0.65 0.06 0.34 0.13
## brw 0.93 1.00 -0.37 -0.11 -0.36 0.51 0.75 0.03 0.37 0.15
## sws -0.38 -0.37 1.00 0.51 0.96 -0.38 -0.59 -0.32 -0.54 -0.48
## ps -0.11 -0.11 0.51 1.00 0.73 -0.30 -0.45 -0.45 -0.54 -0.58
## ts -0.31 -0.36 0.96 0.73 1.00 -0.41 -0.63 -0.40 -0.64 -0.59
## mls 0.30 0.51 -0.38 -0.30 -0.41 1.00 0.61 -0.10 0.36 0.06
## gt 0.65 0.75 -0.59 -0.45 -0.63 0.61 1.00 0.20 0.64 0.38
## pi 0.06 0.03 -0.32 -0.45 -0.40 -0.10 0.20 1.00 0.62 0.92
## sei 0.34 0.37 -0.54 -0.54 -0.64 0.36 0.64 0.62 1.00 0.79
## odi 0.13 0.15 -0.48 -0.58 -0.59 0.06 0.38 0.92 0.79 1.00
head(mammalsleep) #first six rows
## species bw brw sws ps ts mls gt pi sei odi
## 1 African elephant 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3
## 2 African giant pouched rat 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3
## 3 Arctic Fox 3.385 44.5 NA NA 12.5 14.0 60 1 1 1
## 4 Arctic ground squirrel 0.920 5.7 NA NA 16.5 NA 25 5 2 3
## 5 Asian elephant 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4
## 6 Baboon 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
tail(mammalsleep) #last six rows
## species bw brw sws ps ts mls gt pi sei odi
## 57 Tenrec 0.900 2.6 11.0 2.3 13.3 4.5 60 2 1 2
## 58 Tree hyrax 2.000 12.3 4.9 0.5 5.4 7.5 200 3 1 3
## 59 Tree shrew 0.104 2.5 13.2 2.6 15.8 2.3 46 3 2 2
## 60 Vervet 4.190 58.0 9.7 0.6 10.3 24.0 210 4 3 4
## 61 Water opossum 3.500 3.9 12.8 6.6 19.4 3.0 14 2 1 1
## 62 Yellow-bellied marmot 4.050 17.0 NA NA NA 13.0 38 3 1 1
?mammalsleep # the help
Note that sleepdata is automatically stored as a data frame. After all, it contains one factor (a categorical variable with the animal names) next to the numeric variables.
The functions head() and tail() are very useful, as is the function str(), which gives you a quick overview of the measurement levels in mammalsleep.
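Alongside str(), two one-liners give a compact overview of column types and missing values. The sketch below uses a small made-up data frame (toy) so that it runs without mice; with the real data you would pass sleepdata instead.

```r
# Hypothetical miniature of the sleep data
toy <- data.frame(
  species = factor(c("Cat", "Cow")),
  sws     = c(10.9, NA),
  pi      = c(1L, 5L)
)

# Class of every column: factor, numeric, integer
col_classes <- vapply(toy, class, character(1))
col_classes

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```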
Since mammalsleep is an R dataset, there should be a help file. Taking a look at ?mammalsleep may yield valuable insight into the measurements and origin of the variables.
One thing that may have caught your attention is the relation between ts, ps and sws. This is a deterministic relation where total sleep (ts) is the sum of paradoxical sleep (ps) and slow-wave sleep (sws). If you model the data, you need to take such relations into account.
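The deterministic relation can be checked numerically on the rows where all three variables are observed. The sketch below copies four rows (African elephant, African giant pouched rat, Asian elephant, Baboon) from the head() output above into a small data frame called toy; with the full data you would use sleepdata instead.

```r
# First rows of the sleep variables, taken from the head() output
toy <- data.frame(
  sws = c(NA, 6.3, 2.1, 9.1),
  ps  = c(NA, 2.0, 1.8, 0.7),
  ts  = c(3.3, 8.3, 3.9, 9.8)
)

# Only rows without missing values can be checked
complete <- complete.cases(toy)

# all.equal() compares with a small numerical tolerance
relation_holds <- all.equal(toy$sws[complete] + toy$ps[complete],
                            toy$ts[complete])
relation_holds
```

Using all.equal() rather than == avoids spurious mismatches caused by floating-point rounding.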
Save the current workspace as Practical_C.RData. Also, save the sleepdata file as a separate workspace called Sleepdata.RData.
Now that we have imported our data, it may be wise to save the current workspace, i.e. the current state of affairs. Saving the workspace will leave everything as is, so that we can continue from this exact state at a later time, by simply opening the workspace file. To save everything in the current workspace, type
save.image("Practical_C.RData")
To save just the dataset sleepdata, and nothing else, type
save(sleepdata, file = "Sleepdata.RData")
With the save functions, any object in the workspace can be saved.
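To see what a saved file contains, you can load it back with load(), which restores the objects under their original names. The following sketch uses a made-up object (toy) and a temporary file, and loads into a separate environment so that nothing in your workspace is overwritten.

```r
toy <- data.frame(x = 1:3)            # hypothetical object to save
path <- tempfile(fileext = ".RData")

save(toy, file = path)                # same call pattern as for sleepdata

# Load into a fresh environment; load() returns the restored object names
e <- new.env()
loaded <- load(path, envir = e)
loaded                                # names of the objects in the file
identical(e$toy, toy)                 # the restored copy matches the original
```

Calling load(path) without the envir argument restores the objects directly into the current workspace, silently replacing objects with the same name.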
There are three ways to exclude the three animals from the dataset. The first approach uses the names:
exclude <- c("Echidna", "Lesser short-tailed shrew", "Musk shrew")
which <- sleepdata$species %in% exclude #Indicate the species that match the names in exclude
which
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [37] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE
sleepdata2 <- sleepdata[!which, ]
The second approach uses function filter() from package dplyr:
library(dplyr) # Data Manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
filter(sleepdata, !species %in% exclude) # ! negates the logical vector: TRUE becomes FALSE and vice versa
## species bw brw sws ps ts mls gt pi sei
## 1 African elephant 6654.000 5712.00 NA NA 3.3 38.6 645 3 5
## 2 African giant pouched rat 1.000 6.60 6.3 2.0 8.3 4.5 42 3 1
## 3 Arctic Fox 3.385 44.50 NA NA 12.5 14.0 60 1 1
## 4 Arctic ground squirrel 0.920 5.70 NA NA 16.5 NA 25 5 2
## 5 Asian elephant 2547.000 4603.00 2.1 1.8 3.9 69.0 624 3 5
## 6 Baboon 10.550 179.50 9.1 0.7 9.8 27.0 180 4 4
## 7 Big brown bat 0.023 0.30 15.8 3.9 19.7 19.0 35 1 1
## 8 Brazilian tapir 160.000 169.00 5.2 1.0 6.2 30.4 392 4 5
## 9 Cat 3.300 25.60 10.9 3.6 14.5 28.0 63 1 2
## 10 Chimpanzee 52.160 440.00 8.3 1.4 9.7 50.0 230 1 1
## 11 Chinchilla 0.425 6.40 11.0 1.5 12.5 7.0 112 5 4
## 12 Cow 465.000 423.00 3.2 0.7 3.9 30.0 281 5 5
## 13 Desert hedgehog 0.550 2.40 7.6 2.7 10.3 NA NA 2 1
## 14 Donkey 187.100 419.00 NA NA 3.1 40.0 365 5 5
## 15 Eastern American mole 0.075 1.20 6.3 2.1 8.4 3.5 42 1 1
## 16 European hedgehog 0.785 3.50 6.6 4.1 10.7 6.0 42 2 2
## 17 Galago 0.200 5.00 9.5 1.2 10.7 10.4 120 2 2
## 18 Genet 1.410 17.50 4.8 1.3 6.1 34.0 NA 1 2
## 19 Giant armadillo 60.000 81.00 12.0 6.1 18.1 7.0 NA 1 1
## 20 Giraffe 529.000 680.00 NA 0.3 NA 28.0 400 5 5
## 21 Goat 27.660 115.00 3.3 0.5 3.8 20.0 148 5 5
## 22 Golden hamster 0.120 1.00 11.0 3.4 14.4 3.9 16 3 1
## 23 Gorilla 207.000 406.00 NA NA 12.0 39.3 252 1 4
## 24 Gray seal 85.000 325.00 4.7 1.5 6.2 41.0 310 1 3
## 25 Gray wolf 36.330 119.50 NA NA 13.0 16.2 63 1 1
## 26 Ground squirrel 0.101 4.00 10.4 3.4 13.8 9.0 28 5 1
## 27 Guinea pig 1.040 5.50 7.4 0.8 8.2 7.6 68 5 3
## 28 Horse 521.000 655.00 2.1 0.8 2.9 46.0 336 5 5
## 29 Jaguar 100.000 157.00 NA NA 10.8 22.4 100 1 1
## 30 Kangaroo 35.000 56.00 NA NA NA 16.3 33 3 5
## 31 Little brown bat 0.010 0.25 17.9 2.0 19.9 24.0 50 1 1
## 32 Man 62.000 1320.00 6.1 1.9 8.0 100.0 267 1 1
## 33 Mole rat 0.122 3.00 8.2 2.4 10.6 NA 30 2 1
## 34 Mountain beaver 1.350 8.10 8.4 2.8 11.2 NA 45 3 1
## 35 Mouse 0.023 0.40 11.9 1.3 13.2 3.2 19 4 1
## 36 N. American opossum 1.700 6.30 13.8 5.6 19.4 5.0 12 2 1
## 37 Nine-banded armadillo 3.500 10.80 14.3 3.1 17.4 6.5 120 2 1
## 38 Okapi 250.000 490.00 NA 1.0 NA 23.6 440 5 5
## 39 Owl monkey 0.480 15.50 15.2 1.8 17.0 12.0 140 2 2
## 40 Patas monkey 10.000 115.00 10.0 0.9 10.9 20.2 170 4 4
## 41 Phanlanger 1.620 11.40 11.9 1.8 13.7 13.0 17 2 1
## 42 Pig 192.000 180.00 6.5 1.9 8.4 27.0 115 4 4
## 43 Rabbit 2.500 12.10 7.5 0.9 8.4 18.0 31 5 5
## 44 Raccoon 4.288 39.20 NA NA 12.5 13.7 63 2 2
## 45 Rat 0.280 1.90 10.6 2.6 13.2 4.7 21 3 1
## 46 Red fox 4.235 50.40 7.4 2.4 9.8 9.8 52 1 1
## 47 Rhesus monkey 6.800 179.00 8.4 1.2 9.6 29.0 164 2 3
## 48 Rock hyrax (Hetero. b) 0.750 12.30 5.7 0.9 6.6 7.0 225 2 2
## 49 Rock hyrax (Procavia hab) 3.600 21.00 4.9 0.5 5.4 6.0 225 3 2
## 50 Roe deer 14.830 98.20 NA NA 2.6 17.0 150 5 5
## 51 Sheep 55.500 175.00 3.2 0.6 3.8 20.0 151 5 5
## 52 Slow loris 1.400 12.50 NA NA 11.0 12.7 90 2 2
## 53 Star nosed mole 0.060 1.00 8.1 2.2 10.3 3.5 NA 3 1
## 54 Tenrec 0.900 2.60 11.0 2.3 13.3 4.5 60 2 1
## 55 Tree hyrax 2.000 12.30 4.9 0.5 5.4 7.5 200 3 1
## 56 Tree shrew 0.104 2.50 13.2 2.6 15.8 2.3 46 3 2
## 57 Vervet 4.190 58.00 9.7 0.6 10.3 24.0 210 4 3
## 58 Water opossum 3.500 3.90 12.8 6.6 19.4 3.0 14 2 1
## 59 Yellow-bellied marmot 4.050 17.00 NA NA NA 13.0 38 3 1
## odi
## 1 3
## 2 3
## 3 1
## 4 3
## 5 4
## 6 4
## 7 1
## 8 4
## 9 1
## 10 1
## 11 4
## 12 5
## 13 2
## 14 5
## 15 1
## 16 2
## 17 2
## 18 1
## 19 1
## 20 5
## 21 5
## 22 2
## 23 1
## 24 1
## 25 1
## 26 3
## 27 4
## 28 5
## 29 1
## 30 4
## 31 1
## 32 1
## 33 1
## 34 3
## 35 3
## 36 1
## 37 1
## 38 5
## 39 2
## 40 4
## 41 2
## 42 4
## 43 5
## 44 2
## 45 3
## 46 1
## 47 2
## 48 2
## 49 3
## 50 5
## 51 5
## 52 2
## 53 2
## 54 2
## 55 3
## 56 2
## 57 4
## 58 1
## 59 1
The third approach uses the row numbers directly (you would first need to look up or calculate the row numbers):
sleepdata2 <- sleepdata[-c(16, 32, 38), ]
Note that the numbered option requires less code, but the named option has a much lower probability of error. As the dataset might change, or might get sorted differently, hard-coded row numbers may no longer point at the intended animals.
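One way to combine the two ideas is to calculate the row numbers from the names instead of typing them in. This is a minimal sketch on a made-up four-row data frame (toy), not on the real sleepdata; the bw values are illustrative only.

```r
# Hypothetical miniature dataset
toy <- data.frame(
  species = factor(c("Cat", "Echidna", "Cow", "Musk shrew")),
  bw      = c(3.3, 3, 465, 0.048)
)
exclude <- c("Echidna", "Musk shrew")

# Row numbers derived from the names, so they follow any re-sorting
rows <- base::which(toy$species %in% exclude)
rows

by_name <- toy[!toy$species %in% exclude, ]  # logical indexing
by_row  <- toy[-rows, ]                      # negative row numbers

all.equal(by_name, by_row)                   # both give the same subset
```

A caveat with negative indexing: if no species matches, rows is empty and toy[-rows, ] returns zero rows rather than the full dataset, so the logical version is the safer default. The call is written as base::which() because the code in this practical also uses which as a variable name.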
plot(brw ~ species, data = sleepdata2)
To find out which animals have a brain weight larger than 1 standard deviation above the mean brain weight:
sd.brw <- sd(sleepdata2$brw) #standard deviation
mean.brw <- mean(sleepdata2$brw) #mean
which <- sleepdata2$brw > (mean.brw + (1 * sd.brw)) #which are larger?
as.character(sleepdata2$species[which]) #names of the animals with brw more than 1 SD above the mean
## [1] "African elephant" "Asian elephant" "Man"
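The same selection can be phrased in terms of standardized scores: a value lies more than 1 standard deviation above the mean exactly when its z-score exceeds 1. The sketch below uses five made-up brain weights in a named vector (brw), not the real data.

```r
# Hypothetical brain weights, named by animal
brw <- c(elephant = 5712, cat = 25.6, man = 1320, mouse = 0.4, cow = 423)

# z-score: distance from the mean in units of standard deviations
z <- (brw - mean(brw)) / sd(brw)

large <- names(brw)[z > 1]
large

# Same selection as comparing against mean + 1 * sd directly
same_selection <- identical(z > 1, brw > mean(brw) + sd(brw))
same_selection
```

Note that in this toy vector only the elephant is selected: the outlier inflates both the mean and the standard deviation, which is worth keeping in mind when interpreting such cut-offs.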
To plot these animals:
plot(brw ~ species, data = sleepdata2[which, ])
The downside is that the plot still shows all the animals on the x-axis. This is because the factor levels for species are carried over to the smaller subset of the data, and plot() automatically uses these levels as labels. For example,
sleepdata2$species[which]
## [1] African elephant Asian elephant Man
## 62 Levels: African elephant African giant pouched rat ... Yellow-bellied marmot
returns only 3 mammals, but still has 62 factor levels. To get rid of the unused factor levels, we can use function factor():
sleepdata3 <- sleepdata2[which, ]
sleepdata3$species <- factor(sleepdata3$species)
sleepdata3$species
## [1] African elephant Asian elephant Man
## Levels: African elephant Asian elephant Man
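Base R also offers droplevels(), which does the same job and works on whole data frames as well as on single factors. A small sketch on a made-up factor (f) shows that subsetting keeps all levels until they are dropped explicitly.

```r
f <- factor(c("Cat", "Cow", "Man", "Pig"))  # hypothetical factor
sub <- f[c(1, 3)]                           # keep Cat and Man only

nlevels(sub)              # still 4: unused levels are kept
nlevels(factor(sub))      # 2: re-creating the factor drops them
nlevels(droplevels(sub))  # 2: droplevels() has the same effect
levels(droplevels(sub))
```

For a data frame, droplevels(sleepdata3) would drop the unused levels of every factor column in one call.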
To plot the graph that we wanted:
plot(brw ~ species, data = sleepdata3)
If your current software-analysis platform is different from R, chances are that you prepare your data in the software of your choice. In R there are fantastic facilities for importing and exporting data, and I would specifically like to point you to package haven by Hadley Wickham. It provides wonderful functions to import and export many data types from software such as Stata, SAS and SPSS.