- Simulation for Data Science with R
- Matthias Templ
- 2021-07-14 11:17:06
The R statistical environment
R was created by Ross Ihaka and Robert Gentleman in 1994/1995. It is based on S, a programming language developed by John Chambers (Bell Laboratories), and on Scheme. Since 1997, it has been internationally developed and distributed from Vienna over the Comprehensive R Archive Network (CRAN). R is nowadays one of the most popular and most widely used software environments in the statistical world. In addition, R is free and open source (under the GPL-2). R is not only statistical software; it is an environment for interactive computing with data, with facilities to produce high-quality graphics. Exchanging code with others is easy, since everybody can download R. This might also be one reason why modern methods are often exclusively developed in R. R is an object-oriented programming language and has interfaces to many other languages and software products, such as C, C++, and Java, as well as interfaces to databases.
Useful information can be found at the following links.
- Homepage: http://www.r-project.org/ and http://cran.r-project.org (CRAN)
- Frequently Asked Questions (FAQs) lists on CRAN
- Manuals and contributed manuals at CRAN
- Task views on CRAN
R can be extended by add-on packages; approximately 8,400 were available at the time of writing.
For programming, it is advisable to write the code in a well-developed editor and communicate interactively with R. An editor should allow syntax-highlighting, code-completion and interactive communication with R. For beginners but also for advanced users, RStudio is one choice (http://www.rstudio.org/). Experts might also use the combination of Eclipse plus its add-on StatET. Both editors provide a fully developed programming environment. They not only integrate R, they also integrate many other useful tools and software.
Basics in R
R can be used as an overgrown calculator. All the operations of a calculator can be used very easily in R; for example, addition is done with +, subtraction with -, division with /, the exponential function with exp(), the logarithm with log(), the square root with sqrt(), and the sine with sin(). All operations work as expected; for example, the following expression is parsed by R so that inner brackets are evaluated first, multiplication and division operators have precedence over the addition and subtraction operators, and so on:
5 + 2 * log(3 * 3)
## [1] 9.394449
Since R starts within one second, there is no need to have any other calculator at hand anymore.
Some very basic stuff about R
R is a functional and object-oriented language. Functions can be applied to objects. The syntax is as shown in the following example:
mean(rnorm(10))
## [1] -0.4241956
With the function rnorm, 10 numbers are drawn randomly from a standard normal distribution. If no seed is fixed (with the function set.seed()), the numbers differ from one call of this function to the next. Afterwards, the mean is calculated for these 10 numbers. Functions typically have function arguments that can be set. The syntax for calling a function is generally:
res1 <- name_of_function(v1)          # one input argument
res2 <- name_of_function(v1, v2)      # two input arguments
res3 <- name_of_function(v1, v2, v3)  # three input arguments
Functions often have additional function arguments with default values. You can list all arguments of a function with args().
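As a minimal sketch of the two points above, the following lines fix the seed with set.seed() so that two calls to rnorm() return identical draws, and then list the arguments of rnorm() with args(). The seed value 123 and the object names draw1 and draw2 are arbitrary choices made for this illustration.

```r
# Fixing the seed makes random draws reproducible
set.seed(123)            # 123 is an arbitrary seed chosen for this sketch
draw1 <- rnorm(5)
set.seed(123)            # reset to the same seed
draw2 <- rnorm(5)
identical(draw1, draw2)  # TRUE: same seed, same numbers

# Show the arguments of rnorm and their default values
args(rnorm)
## function (n, mean = 0, sd = 1)
## NULL
```

Arguments with defaults (here mean and sd) can be omitted in a call, which is why rnorm(5) works with a single argument.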
Assignment to objects is made with <- or =, and the generated object can be printed by typing the object name followed by ENTER, such as:
x <- rnorm(5)
x
## [1] -1.3672828 -2.0871666 0.4747871 0.4861836 0.8022188
The function options() allows you to modify default settings, such as the font or the encoding. Here, we reduce the number of printed digits (internally, R does not round to these digits; only the printed output changes):
options(digits = 4)
x
## [1] -1.3673 -2.0872 0.4748 0.4862 0.8022
Please note that R is case sensitive.
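A short sketch of the last two remarks, under the assumption of a fresh session: options(digits = ...) only affects printing, and names that differ only in case refer to different objects. The object names third, old, y, and Y are illustrative.

```r
third <- 1/3
old <- options(digits = 4)   # options() returns the previous settings invisibly
print(third)                 # printed with 4 significant digits
## [1] 0.3333
third == 1/3                 # TRUE: the stored value is unchanged
options(old)                 # restore the previous setting

y <- 1
Y <- 2
y == Y                       # FALSE: y and Y are two different objects
```

Saving the return value of options() and passing it back afterwards is a convenient way to undo a temporary change.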
Installation and updates
The recommended procedure to install the software consists of the following steps.
Install R: if R is already installed on the computer, ensure that it is the latest version. If the software is not installed, download the installer for your operating system from CRAN (http://cran.r-project.org) and follow the on-screen instructions.
To install an add-on package, say package dplyr, type:
install.packages("dplyr")
Installation is needed only once. The content of an installed package can be used after loading the package via:
library("dplyr")
When typing update.packages(), R searches for possible updates and installs new versions of packages, if any are available.
The previous information was about installing the stable CRAN version of the packages. However, the latest changes are often only available in the development version of the package. Sometimes these development versions are hosted on GitHub or similar Git repository systems.
To install the latest development version, the package devtools (Wickham and Chang, 2015) is needed. After loading the devtools package, the development version can be installed via install_github(). We show this for package dplyr:
if (!require(devtools)) install.packages("devtools")
library("devtools")
install_github("hadley/dplyr")
Help
It is crucial to have basic knowledge of how to get help. A good starting point is the following command:
help.start()
This command opens your browser, where help (and more) is available.
The browsable help index of a package can be accessed by typing the following command into R:
help(package="dplyr")
To find help for a specific function, one can use help(name) or ?name. As an example, we can look at the help file of function group_by, which is included in the package dplyr:
?group_by
Data in a package can be loaded via the data() function, for example, the Cars93 dataset from package MASS (Venables and Ripley 2002):
data(Cars93, package = "MASS")
help.search() can be used to find functions for which you don't know the exact name, for example:
help.search("histogram")
This command will search your local R installation for functions approximately matching the character string "histogram" in the (file) name, alias, title, concept, or keyword entries. With function apropos, one can find and list objects by (partial) name. For example, to list all objects with partial name match of hist, type:
apropos("hist")
To search help pages, vignettes or task views, use the search engine at the website of R and view the results of your request (for example, group by factor) in your web browser:
RSiteSearch("group by factor")
This reports all search results for the character string "group by factor".
The R workspace and the working directory
Created objects are available in the workspace of R and loaded in the memory of your computer. The collection of all created objects is called the workspace. To list the objects in the workspace, type the following:
ls()
## [1] "x"
When importing or exporting data, the working directory must be defined. To show the current working directory, the function getwd can be used:
getwd()
## [1] "/Users/templ/workspace/simulation/book"
To change the working directory, use the function setwd; see ?setwd.
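The following sketch shows how relative file paths depend on the working directory. It assumes that writing to the session's temporary directory (tempdir()) is acceptable; the file name example.csv and the object name old_wd are hypothetical.

```r
old_wd <- getwd()            # remember the current working directory
setwd(tempdir())             # switch to the session's temporary directory
# a relative path is resolved against the working directory
write.csv(data.frame(a = 1:3), "example.csv", row.names = FALSE)
file.exists("example.csv")   # TRUE while tempdir() is the working directory
setwd(old_wd)                # switch back to where we started
```

Restoring the previous directory at the end keeps the rest of a script unaffected by the temporary change.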
Data types
The objective is to know the most important data types:
- numeric
- character
- factor
- logical
The following are the important data structures:
- vector
- list
- array
- data.frame
- Special data types: missing values (NA), NULL objects, NaN, -Inf, Inf
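The special values listed above behave differently in computations. The following lines are a small sketch using only base R functions; the vector v is made up for illustration.

```r
v <- c(1, NA, 3)
is.na(v)                  # FALSE TRUE FALSE
mean(v)                   # NA: missing values propagate
mean(v, na.rm = TRUE)     # 2: missing values removed before averaging

0 / 0                     # NaN (not a number)
is.nan(0 / 0)             # TRUE
is.na(NaN)                # TRUE: is.na() also flags NaN
1 / 0                     # Inf
-1 / 0                    # -Inf
is.finite(c(1, Inf, NA))  # TRUE FALSE FALSE

length(NULL)              # 0: NULL is the empty object
length(NA)                # 1: NA is a value of length one
```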
Vectors in R
Vectors are the simplest data structure in R. A vector is a sequence of elements of the same type, such as numerical vectors, character vectors, or logical vectors. Vectors are often created with the function c(), for example:
v.num <- c(1,3,5.9,7)
v.num
## [1] 1.0 3.0 5.9 7.0
is.numeric(v.num)
## [1] TRUE
is.numeric queries whether the vector is of class numeric. Note that character strings are written within quotation marks.
Logical vectors are often created indirectly from numerical/character vectors:
v.num > 3
## [1] FALSE FALSE TRUE TRUE
Many operations on vectors are performed element-wise, for example, logical comparisons or arithmetic operations with vectors. A common error source is when the length of two or more vectors differs. Then the shorter one is repeated (recycling):
v1 <- c(1,2,3)
v2 <- c(4,5)
v1 + v2
## [1] 5 7 7
## Warning message:
## In v1 + v2 :
##   longer object length is not a multiple of shorter object length
One should also be aware that R coerces internally to meaningful data types automatically. For example:
v2 <- c(100, TRUE, "A", FALSE)
v2
## [1] "100" "TRUE" "A" "FALSE"
is.numeric(v2)
## [1] FALSE
Here, the most general common data type is character, and therefore all entries of the vector are coerced to character. Note that the functions seq and rep are very useful for creating vectors.
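A brief sketch of seq and rep, together with the related base function seq_len; the particular values are chosen only for illustration.

```r
seq(1, 10, by = 3)           # 1 4 7 10: fixed step width
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00: fixed number of values
seq_len(4)                   # 1 2 3 4
rep(c("a", "b"), times = 2)  # "a" "b" "a" "b": repeat the whole vector
rep(c("a", "b"), each = 2)   # "a" "a" "b" "b": repeat each element
```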
Often it is necessary to subset vectors. The selection is made using the [] operator. A selection can be done in three ways:
- Positive: A vector of positive integers that specifies the positions of the desired elements
- Negative: A vector of negative integers indicating the positions of the elements to be excluded
- Logical: A logical vector indicating which elements are to be selected (TRUE) and which are not (FALSE)
Let us consider the following example:
data(Cars93, package = "MASS")
# extract a subset of variable Horsepower from Cars93
hp <- Cars93[1:10, "Horsepower"]
hp
## [1] 140 200 172 172 208 110 170 180 170 200
# positive indexing:
hp[c(1,6)]
## [1] 140 110
# negative indexing:
hp[-c(2:5,7:10)]
## [1] 140 110
# logical indexing:
hp < 150
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# a logical expression can be written directly in []
hp[hp < 150]
## [1] 140 110
Factors in R
Factors in R are of special importance. They are used to represent nominal or ordinal data. More precisely, unordered factors are used for nominally scaled data, and ordered factors for ordinally scaled data. Factors can be seen as special vectors. They are internally coded as integers from 1 to n (the number of levels), each of which is associated with a name (label). So why should numeric or character variables be used as factors? Basically, factors have to be used for categorical information to get the correct number of degrees of freedom and correct design matrices in statistical modeling. In addition, the implementation of graphics for factors differs from that for numerical or character vectors. Moreover, factors can be a more efficient way of storing character data. However, factors have a more complex data structure, since they include a numerically coded data vector and labels for each level/category. Let us consider the following example:
class(Cars93)
## [1] "data.frame"
class(Cars93$Cylinders)
## [1] "factor"
levels(Cars93$Cylinders)
## [1] "3"      "4"      "5"      "6"      "8"      "rotary"
summary(Cars93$Cylinders)
##      3      4      5      6      8 rotary
##      3     49      2     31      7      1
We note that the output of summary is different for factors. Internally, R applies method dispatch for generic functions such as summary, searching in our case whether the function summary.factor exists. If it does, this function is applied; if not, summary.default is used.
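To make the distinction between unordered and ordered factors concrete, here is a small sketch with a made-up character vector size; it also shows the internal integer codes and the method dispatch of summary described above.

```r
size <- c("small", "large", "medium", "small")   # illustrative data

f <- factor(size)        # unordered factor; levels are sorted alphabetically
levels(f)                # "large" "medium" "small"
as.integer(f)            # 3 1 2 3: the internal integer codes

# ordered factor with an explicit order of the levels
o <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
o < "large"              # TRUE FALSE TRUE TRUE: comparisons respect the order

summary(f)               # counts per level (the factor method is dispatched)
summary(c(1, 2, 3))      # five-number summary plus mean (the default method)
```

Setting the levels explicitly is usually worthwhile for ordinal data, because the alphabetical default rarely matches the intended order.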
list
A list in R is an ordered collection of objects, where the data types of the individual list elements can differ (vectors, matrices, data.frames, lists, and so on). The dimension of each list item can also differ. Lists can be used to group various objects into a single object. There are (at least) four ways of accessing elements of a list: the [] operator, the [[]] operator, the $ operator, and the name of a list item. With str(), you can view the structure of a list; with names(), you get the names of the list elements:
model <- lm(Price ~ Cylinders + Type + EngineSize + Origin, data = Cars93)
## result is a list
class(model)
## [1] "lm"
## access elements from the named list with the dollar sign
model$coefficients
##     (Intercept)      Cylinders4      Cylinders5      Cylinders6
##           5.951           3.132           7.330          10.057
##      Cylinders8 Cylindersrotary       TypeLarge     TypeMidsize
##          17.835          19.828          -4.232           2.558
##       TypeSmall      TypeSporty         TypeVan      EngineSize
##          -6.086          -2.188          -5.835           2.303
##   Originnon-USA
##           5.915
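The difference between the access operators is easiest to see on a small list. The list l and its element names below are made up for this sketch.

```r
l <- list(id = 1:3, label = "cars", flags = c(TRUE, FALSE))

names(l)      # "id" "label" "flags"
str(l)        # compact overview of the structure

l["id"]       # [] returns a list of length one
l[["id"]]     # [[]] returns the element itself: 1 2 3
l$id          # $ with the name gives the same result as [["id"]]
l[[2]]        # access by position also works: "cars"
```

In short, [] keeps the list wrapper, while [[]] and $ extract the content of a single element.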
data.frame
Data frames (in R, data.frame) are the most important data type. They correspond to the rectangular data format that is well known from other software packages, with rows corresponding to observation units and columns to variables. A data.frame is like a list in which all list elements are vectors or factors, with the restriction that all list elements have the same number of elements (equal length). Data read from external sources are often stored as data frames. Data frames are usually created by reading data, but they can also be constructed with the function data.frame().
There are many ways to subset a data frame, for example with the syntax [index row, index columns]. Again, positive, negative, and logical indexing are possible, and the type of indexing may differ between the row index and the column index. Accessing individual columns is easiest using the $ operator (as for lists):
## extract cars with small number of cylinders and small power
w <- Cars93$Cylinders %in% c("3", "4") & Cars93$Horsepower < 80
str(Cars93[w, ])
## 'data.frame': 5 obs. of 27 variables:
##  $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 11 12 25 28 29
##  $ Model             : Factor w/ 93 levels "100","190E","240",..: 44 62 53 50 88
##  $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 4 4 4 4
##  $ Min.Price         : num 6.9 6.7 8.2 7.3 7.3
##  $ Price             : num 7.4 8.4 9 8.4 8.6
##  $ Max.Price         : num 7.9 10 9.9 9.5 10
##  $ MPG.city          : int 31 46 31 33 39
##  $ MPG.highway       : int 33 50 41 37 43
##  $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 3 3 3 3
##  $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 1 2
##  $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 1 2 1 1
##  $ EngineSize        : num 1.3 1 1.6 1.2 1.3
##  $ Horsepower        : int 63 55 74 73 70
##  $ RPM               : int 5000 5700 5600 5600 6000
##  $ Rev.per.mile      : int 3150 3755 3130 2875 3360
##  $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2
##  $ Fuel.tank.capacity: num 10 10.6 13.2 9.2 10.6
##  $ Passengers        : int 4 4 4 4 4
##  $ Length            : int 141 151 177 146 161
##  $ Wheelbase         : int 90 93 99 90 93
##  $ Width             : int 63 63 66 60 63
##  $ Turn.circle       : int 33 34 35 32 34
##  $ Rear.seat.room    : num 26 27.5 25.5 23.5 27.5
##  $ Luggage.room      : int 12 10 17 10 10
##  $ Weight            : int 1845 1695 2350 2045 1965
##  $ Origin            : Factor w/ 2 levels "USA","non-USA": 1 2 1 2 2
##  $ Make              : Factor w/ 93 levels "Acura Integra",..: 34 39 76 80 83
A few helpful functions that can be used in conjunction with data frames are: dim(), reporting the dimension (number of rows and columns); head(), showing the first (default 6) rows of a data frame; and colnames(), returning the column/variable names.
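As a self-contained sketch that does not depend on Cars93, the following lines construct a small data frame with data.frame() and apply the helper functions and indexing styles mentioned above. The data frame d and its values are invented for illustration.

```r
d <- data.frame(
  model      = c("A", "B", "C", "D"),
  cylinders  = factor(c("4", "6", "4", "8")),
  horsepower = c(70, 150, 95, 210)
)

dim(d)         # 4 3: four rows, three columns
colnames(d)    # "model" "cylinders" "horsepower"
head(d, 2)     # the first two rows

# logical indexing for rows, names for columns
d[d$horsepower < 100, c("model", "horsepower")]
d[-1, ]        # negative indexing: all rows except the first
d$horsepower   # a single column as a vector
```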
array
An array in R can have multiple dimensions. A vector is already a one-dimensional array. A matrix is a two-dimensional array, having rows and columns. Let us load a data set from package vcd that is stored as a four-dimensional array:
library("vcd")
## Loading required package: grid
data(PreSex)
PreSex
## , , PremaritalSex = Yes, Gender = Women
##
##              ExtramaritalSex
## MaritalStatus Yes  No
##      Divorced  17  54
##      Married    4  25
##
## , , PremaritalSex = No, Gender = Women
##
##              ExtramaritalSex
## MaritalStatus Yes  No
##      Divorced  36 214
##      Married    4 322
##
## , , PremaritalSex = Yes, Gender = Men
##
##              ExtramaritalSex
## MaritalStatus Yes  No
##      Divorced  28  60
##      Married   11  42
##
## , , PremaritalSex = No, Gender = Men
##
##              ExtramaritalSex
## MaritalStatus Yes  No
##      Divorced  17  68
##      Married    4 130
We see that the first dimension is MaritalStatus, the second is ExtramaritalSex, the third dimension is PremaritalSex, and the fourth dimension is Gender.
We can now access the elements of the array by indexing using []. If we want to extract the data where PremaritalSex is Yes and Gender is Men, we type:
PreSex[, , 1, 2]
##              ExtramaritalSex
## MaritalStatus Yes  No
##      Divorced  28  60
##      Married   11  42
This means that all values from the first and second dimensions are chosen, while only the first level (Yes) of the third dimension and the second level (Men) of the last dimension are selected. This can also be done by name:
PreSex[, , "Yes", "Men"]
##              ExtramaritalSex
## MaritalStatus Yes  No
##      Divorced  28  60
##      Married   11  42
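Arrays can also be built by hand with the base function array(). The following sketch creates a small three-dimensional array with named dimensions and shows that indexing by position and by name are equivalent; the object a and its dimension names are invented for this example.

```r
a <- array(1:8, dim = c(2, 2, 2),
           dimnames = list(row   = c("r1", "r2"),
                           col   = c("c1", "c2"),
                           layer = c("l1", "l2")))

dim(a)                            # 2 2 2
a["r2", "c1", "l2"]               # 6: a single cell, selected by name
a[, , "l2"]                       # the 2 x 2 matrix of the second layer
identical(a[, , 2], a[, , "l2"])  # TRUE: position and name select the same slice
```

The values are filled column by column, with the first index running fastest, which is why the cell in row 2, column 1, layer 2 holds the value 6.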
Missing values
Missing values are almost always present in the data. The default representation of a missing value in R is the symbol NA. A very useful function to check whether data values are missing is is.na. It returns a logical vector or a logical matrix, depending on whether the input is a vector or a data.frame, indicating "missingness". To calculate the number of missing values, we can sum the TRUE values (TRUE is interpreted as 1 and FALSE as 0).
sum(is.na(Cars93))
## [1] 13
All in all, 13 values are missing.
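Beyond the overall count, it is often useful to count missing values per variable and to flag complete rows. A minimal sketch with base R functions follows; the data frame d is made up for illustration.

```r
d <- data.frame(a = c(1, NA, 3),
                b = c("x", "y", NA),
                c = 1:3)

is.na(d)            # logical matrix with the same shape as d
sum(is.na(d))       # 2: total number of missing values
colSums(is.na(d))   # missing values per column: a = 1, b = 1, c = 0
complete.cases(d)   # TRUE FALSE FALSE: rows without any missing value
```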
To analyze the structure of any missing values, the R package VIM (Templ, Alfons, and Filzmoser, 2012) can be used. One out of many possible plots for missing values, the matrixplot (Figure 1), shows all the values of the whole data frame. Interestingly, the higher the weight of the cars, the more missing values are present in the variable Luggage.room:
require("VIM")
matrixplot(Cars93, sortby = "Weight", cex.axis=0.6)

Figure 1: matrixplot from package VIM. The darker the shade, the higher the value. Missing values are shown in red.
In package robCompositions (Templ, Hron, and Filzmoser 2011), one useful function is missPatterns, which shows the structure of missing values (we do not show the output):
m <- robCompositions::missPatterns(Cars93)