
Title: Simulation for Data Science with R
Author: Matthias Templ
Last updated: 2021-07-14 11:17:06

The R statistical environment

R was created by Ross Ihaka and Robert Gentleman in 1994/1995. It is based on S, a programming language developed by John Chambers (Bell Laboratories), and Scheme. Since 1997, it has been internationally developed and distributed from Vienna over the Comprehensive R Archive Network (CRAN). R is nowadays among the most popular and most widely used software in the statistical world. In addition, R is free and open source (under the GPL2). R is not only statistical software; it is an environment for interactive computing with data, with facilities to produce high-quality graphics. The exchange of code with others is easy since everybody can download R. This might also be one reason why modern methods are often exclusively developed in R. R is an object-oriented programming language and has interfaces to many other software products such as C, C++, Java, and interfaces to databases.

Useful information can be found at the following links:

Homepage: http://www.r-project.org/ and http://cran.r-project.org (CRAN)
Frequently Asked Questions (FAQs) lists on CRAN
Manuals and contributed manuals at CRAN
Task views on CRAN

R is extendable by approximately 8,400 add-on packages.

For programming, it is advisable to write the code in a well-developed editor and communicate interactively with R. An editor should allow syntax-highlighting, code-completion and interactive communication with R. For beginners but also for advanced users, RStudio is one choice (http://www.rstudio.org/). Experts might also use the combination of Eclipse plus its add-on StatET. Both editors provide a fully developed programming environment. They not only integrate R, they also integrate many other useful tools and software.

Basics in R

R can be used as an overgrown calculator. All the operations of a calculator are readily available in R; for example, addition is done with +, subtraction with -, multiplication with *, division with /, the exponential function with exp(), the logarithm with log(), the square root with sqrt(), and the sine with sin(). All operations work as expected; for example, in the following expression, which is parsed by R, inner brackets are evaluated first, multiplication and division operators have precedence over the addition and subtraction operators, and so on:

5 + 2 * log(3 * 3)
## [1] 9.394449

Since R starts within one second, there is no need to have any other calculator at hand anymore.

Some very basic stuff about R

R is a functional and object-oriented language. Functions can be applied to objects. The syntax is as shown in the following example:

mean(rnorm(10))
## [1] -0.4241956

With the function rnorm, 10 numbers are drawn randomly from a standard normal distribution. If no seed is fixed (with the function set.seed()), the numbers differ from one call of this function to the next. Afterwards, the mean is calculated for these 10 numbers. Functions typically have function arguments that can be set. The syntax for calling a function is generally:

res1 <- name_of_function(v1) # an input argument
res2 <- name_of_function(v1, v2) # two input arguments
res3 <- name_of_function(v1, v2, v3) # three input arguments

Functions often have additional function arguments with default values. You get access to all function arguments with args().
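
As a small sketch of fixing a seed and inspecting function arguments (the seed value 123 and the arguments passed to rnorm are arbitrary choices made for this illustration), the following shows that the same seed reproduces the same random numbers, and how args() lists the arguments of rnorm with their defaults:

```r
# Fix the seed so that the random draws are reproducible
set.seed(123)            # 123 is an arbitrary illustrative value
a <- rnorm(10)
set.seed(123)
b <- rnorm(10)
identical(a, b)          # TRUE: same seed, same numbers

# Show the arguments of rnorm together with their default values
args(rnorm)

# Arguments can be passed by position or by name
m <- mean(rnorm(10, mean = 5, sd = 0.1))
```

Passing arguments by name, as in the last call, makes the code easier to read once a function has more than one or two arguments.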

Assignments to objects are made with <- or =, and the generated object can be printed by typing the object name followed by ENTER, such as:

x <- rnorm(5)
x
## [1] -1.3672828 -2.0871666 0.4747871 0.4861836 0.8022188

The function options() allows you to modify default settings, such as the font or the encoding. Here, we reduce the number of printed digits (internally, R does not round to this number of digits; only the printed output changes):

options(digits = 4)
x
## [1] -1.3673 -2.0872 0.4748 0.4862 0.8022

Please note that R is case sensitive.

Installation and updates

The recommended procedure to install the software consists of the following steps.

Install R: if R is already installed on the computer, ensure that it is the latest version. If the software is not installed, download the installer for your operating system from CRAN and follow the on-screen instructions.

To install an add-on package, say package dplyr, type:

install.packages("dplyr")

Installation is needed only once. The content of an installed package can be used after loading the package via:

library("dplyr")

When typing update.packages(), R searches for possible updates and installs new versions of packages, if any are available.

The previous information was about installing the stable CRAN version of the packages. However, the latest changes are often only available in the development version of the package. Sometimes these development versions are hosted on GitHub or similar Git repository systems.

To install the latest development version, the installation of the package devtools (Wickham and Chang, 2015) is needed. After loading the devtools package, the development version can be installed via install_github(). We show this for package dplyr:

if(!require(devtools)) install.packages("devtools")

library("devtools")
install_github("hadley/dplyr")

Help

It is crucial to know how to get help. The help system can be started with the following command:

help.start()

This command opens your browser, where help (and more) is available.

The browsable help index of a package can be accessed by typing the following command into R:

help(package="dplyr")

To find help for a specific function, one can use help(name) or ?name. As an example, we can look at the help file of function group_by, which is included in the package dplyr:

?group_by

Data in a package can be loaded via the data() function, for example, the Cars93 dataset from package MASS (Venables and Ripley 2002):

data(Cars93, package = "MASS")

help.search() can be used to find functions for which you don't know an exact name, for example:

help.search("histogram")

This command will search your local R installation for functions approximately matching the character string "histogram" in the (file) name, alias, title, concept, or keyword entries. With function apropos, one can find and list objects by (partial) name. For example, to list all objects with partial name match of hist, type:

apropos("hist")

To search help pages, vignettes or task views, use the search engine at the website of R and view the results of your request (for example, group by factor) in your web browser:

RSiteSearch("group by factor")

This reports all search results for the character string "group by factor".

The R workspace and the working directory

Created objects are available in the workspace of R and loaded in the memory of your computer. The collection of all created objects is called the workspace. To list the objects in the workspace, type the following:

ls()
## [1] "x"

When importing or exporting data, the working directory must be defined. To show the current working directory, the function getwd can be used:

getwd()
## [1] "/Users/templ/workspace/simulation/book"

To change the working directory, use the function setwd; see ?setwd.
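
A minimal sketch of working with the working directory follows. The temporary directory and the file name example.csv are hypothetical choices used only so that the example does not touch real files. Since setwd returns the previous working directory invisibly, it can be stored and restored afterwards:

```r
# Remember the current working directory; setwd() returns it invisibly
old_wd <- setwd(tempdir())   # tempdir() is used only as a harmless target

# Write and re-read a small file using a path relative to the working directory
write.csv(data.frame(a = 1:3), "example.csv", row.names = FALSE)
file.exists("example.csv")
d <- read.csv("example.csv")

# Clean up and restore the previous working directory
unlink("example.csv")
setwd(old_wd)
```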

Data types

The most important data types are the following:

numeric
character
factor
logical

The following are the important data structures:

vector
list
array
data.frame
Special data types: missing values (NA), NULL objects, NaN, -Inf, +Inf
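
As a brief illustration of the types and structures listed above, class() reports how R classifies an object. The values used here are arbitrary examples:

```r
# Basic data types
class(3.14)                 # "numeric"
class("text")               # "character"
class(factor(c("a", "b")))  # "factor"
class(TRUE)                 # "logical"

# Basic data structures
class(c(1, 2, 3))                    # a numeric vector: "numeric"
class(list(1, "a"))                  # "list"
class(array(1:8, dim = c(2, 2, 2)))  # "array"
class(data.frame(x = 1:2))           # "data.frame"

# Special values
special <- c(NA, NaN, Inf, -Inf)
is.na(special)       # TRUE TRUE FALSE FALSE (NaN also counts as missing)
is.nan(special)      # FALSE TRUE FALSE FALSE
is.finite(special)   # FALSE FALSE FALSE FALSE
is.null(NULL)        # TRUE
```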

Vectors in R

Vectors are the simplest data structure in R. A vector is a sequence of elements of the same type, such as numerical vectors, character vectors, or logical vectors. Vectors are often created with the function c(), for example:

v.num <- c(1,3,5.9,7)
v.num
## [1] 1.0 3.0 5.9 7.0
is.numeric(v.num)
## [1] TRUE

is.numeric queries whether the vector is of class numeric. Note that characters are written with quotation marks.

Logical vectors are often created indirectly from numerical/character vectors:

v.num > 3
## [1] FALSE FALSE TRUE TRUE

Many operations on vectors are performed element-wise, for example, logical comparisons or arithmetic operations with vectors. A common error source is when the length of two or more vectors differs. Then the shorter one is repeated (recycling):

v1 <- c(1,2,3)
v2 <- c(4,5)
v1 + v2
## [1] 5 7 7
## Warning message:
## In v1 + v2 :
##   longer object length is not a multiple of shorter object length

One should also be aware that R coerces internally to meaningful data types automatically. For example:

v2 <- c(100, TRUE, "A", FALSE)
v2
## [1] "100" "TRUE" "A" "FALSE"
is.numeric(v2)
## [1] FALSE

Here, the lowest common data type is a string and therefore all entries of the vector are coerced to character. Note that the functions seq and rep are very useful for creating vectors.
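
A short sketch of seq and rep, with arbitrary example values, might look as follows:

```r
s1 <- seq(1, 10, by = 2)           # 1 3 5 7 9
s2 <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
s3 <- 1:5                          # shorthand for seq(1, 5)
r1 <- rep(c("a", "b"), times = 3)  # "a" "b" "a" "b" "a" "b"
r2 <- rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"
```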

Often it is necessary to subset vectors. The selection is made using the [] operator. A selection can be done in three ways:

Positive: A vector of positive integers that specifies the position of the desired elements
Negative: A vector with negative integers indicating the position of the non-required elements
Logical: A logical vector in which the elements to be selected are marked TRUE and those not to be selected are marked FALSE

Let us consider the following example:

data(Cars93, package = "MASS")
# extract a subset of variable Horsepower from Cars93
hp <- Cars93[1:10, "Horsepower"]
hp
## [1] 140 200 172 172 208 110 170 180 170 200
# positive indexing:
hp[c(1,6)]
## [1] 140 110
# negative indexing:
hp[-c(2:5,7:10)]
## [1] 140 110
# logical indexing:
hp < 150
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# a logical expression can be written directly in []
hp[hp < 150]
## [1] 140 110

Factors in R

Factors in R are of special importance. They are used to represent nominal or ordinal data: more precisely, unordered factors for nominally scaled data and ordered factors for ordinally scaled data. Factors can be seen as special vectors. They are internally coded as integers from 1 to n (the number of levels), each of which is associated with a name (label). So why should numeric or character variables be used as factors? Basically, factors have to be used for categorical information to get the correct number of degrees of freedom and correct design matrices in statistical modeling. In addition, the implementation of graphics for factors differs from that for numerical or character vectors. Moreover, factors can store categorical data more efficiently than character vectors. However, factors have a more complex data structure, since they include a numerically coded data vector and labels for each level/category. Let us consider the following example:

class(Cars93)
## [1] "data.frame"
class(Cars93$Cylinders)
## [1] "factor"
levels(Cars93$Cylinders)
## [1] "3" "4" "5" "6" "8" "rotary"
summary(Cars93$Cylinders)
## 3 4 5 6 8 rotary
## 3 49 2 31 7 1

We note that the output of summary is different for factors. Internally, R applies method dispatch for generic functions such as summary: in our case, it checks whether the function summary.factor exists. If so, this function is applied; if not, summary.default is used.
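
To make the internal integer coding and the difference between unordered and ordered factors more concrete, here is a minimal sketch. The vectors colour and size are hypothetical example data, not part of Cars93:

```r
# Unordered factor for nominal data
colour <- factor(c("red", "blue", "red", "green"))
levels(colour)        # "blue" "green" "red" (sorted alphabetically by default)
as.integer(colour)    # 3 1 3 2: the internal integer codes

# Ordered factor for ordinal data, with an explicit level order
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size < "large"        # TRUE FALSE TRUE TRUE
table(size)           # counts per level, in level order
```

Comparisons such as size < "large" are only meaningful for ordered factors; for unordered factors R returns NA with a warning.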

list

A list in R is an ordered collection of objects, where the data types of the individual list elements can be different (vectors, matrices, data.frames, lists, and so on). The dimension of each list item can be different. Lists can be used to group various objects in a single object. There are (at least) four ways of accessing elements of a list: (a) the [] operator, (b) the [[]] operator, (c) the $ operator, and (d) the name of a list item. With str(), you can view the structure of a list; with names(), you get the names of the list elements:

model <- lm(Price ~ Cylinders + Type + EngineSize + Origin, data = Cars93)
## result is a list
class(model)
## [1] "lm"
## access elements from the named list with the dollar sign
model$coefficients
##     (Intercept)      Cylinders4      Cylinders5      Cylinders6
##           5.951           3.132           7.330          10.057
##      Cylinders8 Cylindersrotary       TypeLarge     TypeMidsize
##          17.835          19.828          -4.232           2.558
##       TypeSmall      TypeSporty         TypeVan      EngineSize
##          -6.086          -2.188          -5.835           2.303
##   Originnon-USA
##           5.915
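
The different ways of accessing list elements can be sketched with a small, hypothetical list (the object lst and its element names are made up for this illustration):

```r
# A small list whose elements have different types and lengths
lst <- list(id = 1:3, label = "demo", flags = c(TRUE, FALSE))

names(lst)       # "id" "label" "flags"
str(lst)         # compact overview of the structure

lst["id"]        # [] returns a sub-list (here of length 1)
lst[["id"]]      # [[]] returns the element itself: 1 2 3
lst$id           # $ is shorthand for [[]] with a name
lst[[2]]         # access by position: "demo"
```

The main point to remember is that [] keeps the list structure, whereas [[]] and $ extract the element itself.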

data.frame

Data frames (in R, data.frame) are the most important data type. They correspond to the rectangular data format that is well known from other software packages, with rows corresponding to observation units and columns to variables. A data.frame is like a list in which all list elements are vectors or factors, with the restriction that all list elements have the same number of elements (equal length). For example, data read from external sources are often stored as data frames. Data frames are usually created by reading data, but they can also be constructed with the function data.frame().

There are many ways to subset a data frame, for example with the syntax [index row, index columns]. Again, positive, negative, and logical indexing are possible, and the type of indexing may differ between the row index and the column index. Accessing individual columns is easiest using the $ operator (as with lists):

## extract cars with small number of cylinders and small power
w <- Cars93$Cylinders %in% c("3", "4") & Cars93$Horsepower < 80
str(Cars93[w, ])
## 'data.frame': 5 obs. of 27 variables:
## $ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 11 12 25 28 29
## $ Model : Factor w/ 93 levels "100","190E","240",..: 44 62 53 50 88
## $ Type : Factor w/ 6 levels "Compact","Large",..: 4 4 4 4 4
## $ Min.Price : num 6.9 6.7 8.2 7.3 7.3
## $ Price : num 7.4 8.4 9 8.4 8.6
## $ Max.Price : num 7.9 10 9.9 9.5 10
## $ MPG.city : int 31 46 31 33 39
## $ MPG.highway : int 33 50 41 37 43
## $ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 3 3 3 3
## $ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 1 2
## $ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 1 2 1 1
## $ EngineSize : num 1.3 1 1.6 1.2 1.3
## $ Horsepower : int 63 55 74 73 70
## $ RPM : int 5000 5700 5600 5600 6000
## $ Rev.per.mile : int 3150 3755 3130 2875 3360
## $ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2
## $ Fuel.tank.capacity: num 10 10.6 13.2 9.2 10.6
## $ Passengers : int 4 4 4 4 4
## $ Length : int 141 151 177 146 161
## $ Wheelbase : int 90 93 99 90 93
## $ Width : int 63 63 66 60 63
## $ Turn.circle : int 33 34 35 32 34
## $ Rear.seat.room : num 26 27.5 25.5 23.5 27.5
## $ Luggage.room : int 12 10 17 10 10
## $ Weight : int 1845 1695 2350 2045 1965
## $ Origin : Factor w/ 2 levels "USA","non-USA": 1 2 1 2 2
## $ Make : Factor w/ 93 levels "Acura Integra",..: 34 39 76 80 83

A few helpful functions that can be used in conjunction with data frames are: dim(), reporting the dimension (number of rows and columns); head(), the first (default 6) rows of a data frame; and colnames(), the columns/variable names.
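
As a minimal sketch of constructing a data frame with data.frame() and applying these helper functions and the different kinds of indexing, consider the following. The object df and its columns are hypothetical example data:

```r
# Construct a small data frame; all columns have the same length
df <- data.frame(
  name  = c("a", "b", "c", "d"),
  value = c(10, 25, 3, 40),
  group = factor(c("x", "y", "x", "y"))
)

dim(df)                 # 4 3
colnames(df)            # "name" "value" "group"
head(df, 2)             # first two rows

df[df$value > 5, ]            # logical row index, all columns
df[-1, c("name", "value")]    # negative row index, columns by name
df$value                      # a single column as a vector
```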

array

An array in R can have multiple dimensions. A vector is already a one-dimensional array. A matrix is a two-dimensional array, having rows and columns. Let us load a data set from the package vcd that is stored as a four-dimensional array:

library("vcd")
## Loading required package: grid
data(PreSex)
PreSex
## , , PremaritalSex = Yes, Gender = Women
##
## ExtramaritalSex
## MaritalStatus Yes No
## Divorced 17 54
## Married 4 25
##
## , , PremaritalSex = No, Gender = Women
##
## ExtramaritalSex
## MaritalStatus Yes No
## Divorced 36 214
## Married 4 322
##
## , , PremaritalSex = Yes, Gender = Men
##
## ExtramaritalSex
## MaritalStatus Yes No
## Divorced 28 60
## Married 11 42
##
## , , PremaritalSex = No, Gender = Men
##
## ExtramaritalSex
## MaritalStatus Yes No
## Divorced 17 68
## Married 4 130

We see that the first dimension is MaritalStatus, the second is ExtramaritalSex, the third dimension is PremaritalSex, and the fourth dimension is Gender.

We can now access the elements of the array by indexing using []. If we want to extract the data where PremaritalSex is Yes and Gender is Men, we type:

PreSex[, , 1, 2]
## ExtramaritalSex
## MaritalStatus Yes No
## Divorced 28 60
## Married 11 42

This means that all values from the first and second dimensions are chosen, while only the first level (Yes) of the third dimension and the second level (Men) of the last dimension are selected. This can also be done by name:

PreSex[, , "Yes", "Men"]
## ExtramaritalSex
## MaritalStatus Yes No
## Divorced 28 60
## Married 11 42
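
Indexing by position and by name works the same way for any array with dimnames. A small self-contained sketch, using a hypothetical 2 x 3 x 2 array rather than PreSex, could look like this:

```r
# A 2 x 3 x 2 array with named dimensions (hypothetical example data)
arr <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(row   = c("r1", "r2"),
                             col   = c("c1", "c2", "c3"),
                             layer = c("A", "B")))

dim(arr)               # 2 3 2
arr[, , 2]             # the second layer as a 2 x 3 matrix
arr[, , "B"]           # the same selection by name
arr["r1", "c2", "B"]   # a single element: 9
```

Indexing by name is usually more robust than indexing by position, since it does not depend on the order of the levels.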

Missing values

Missing values are almost always present in the data. The default representation of a missing value in R is the symbol NA. A very useful function to check whether data values are missing is is.na. It returns a logical vector or a logical matrix, depending on whether the input is a vector or a data.frame, indicating "missingness". To calculate the number of missing values, we can sum the TRUE values (TRUE is interpreted as 1 and FALSE as 0):

sum(is.na(Cars93))
## [1] 13

All in all, 13 values are missing.
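
Counting missing values per column and handling them in computations can be sketched as follows. The data frame d is a hypothetical example, and na.rm and complete.cases are standard base R tools that the text above does not discuss:

```r
# A small data frame with missing values (hypothetical example data)
d <- data.frame(x = c(1, NA, 3, 4),
                y = c("a", "b", NA, NA))

is.na(d$x)               # FALSE TRUE FALSE FALSE
sum(is.na(d))            # 3: total number of missing values
colSums(is.na(d))        # per column: x = 1, y = 2

mean(d$x)                # NA: missing values propagate by default
mean(d$x, na.rm = TRUE)  # 2.666667 after removing the NA
complete.cases(d)        # TRUE FALSE FALSE FALSE
```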

To analyze the structure of any missing values, the R package VIM (Templ, Alfons, and Filzmoser, 2012) can be used. One out of many possible plots for missing values, the matrixplot (Figure 1), shows all the values of the whole data frame. Interestingly, the higher the weight of the cars, the more missing values are present in the variable Luggage.room:

require("VIM")
matrixplot(Cars93, sortby = "Weight", cex.axis=0.6)

Figure 1: matrixplot from package VIM. The darker the shade, the higher the value. Missing values are shown in red.

In package robCompositions (Templ, Hron, and Filzmoser 2011), one useful function is missPatterns, which shows the structure of missing values (we do not show the output):

m <- robCompositions::missPatterns(Cars93)
