Variables
In this lesson you will learn:
- What a variable is.
- What a data set is.
Variable
A variable is some feature which can be measured on the elements of a sample. It has a range of possible values which vary from one element to another.
Data and Data Sets
Information collected from elements in a sample is called data.
A data set contains information about a set of variables measured or observed on all sample elements.
A data set typically has the following format:
- Each row pertains to a single sample element.
- Each column pertains to a single variable.
So the number of rows, excluding the header row that contains the variable names, equals the number of elements in the sample. The number of columns, excluding any column that holds an identification number or name for each element, equals the number of variables on which data are collected.
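The row and column counts described above can be checked directly in #\mathrm{R}#. The following is a minimal sketch; the data frame #\mathtt{survey}# and its values are made up for illustration.

```r
# Hypothetical sample: 4 elements, 2 measured variables, plus an id column.
survey = data.frame(
  id     = c(101, 102, 103, 104),      # identification column, not a measured variable
  age    = c(23, 35, 41, 29),
  height = c(1.70, 1.82, 1.65, 1.77)
)

n_elements  = nrow(survey)       # one row per sample element
n_variables = ncol(survey) - 1   # leave out the identification column

n_elements    # 4
n_variables   # 2
```

The header row is not counted by #\mathtt{nrow()}#, because the variable names are stored separately from the data.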
Using R
Importing data from a text file
A data set in #\mathrm{R}# is in the form of a data frame. If you use #\mathtt{read.table()}# to import data from a text file (in your #\mathrm{R}# working directory) into the #\mathrm{R}# workspace, the result is a data frame.
> MyData = read.table("somedata.txt",header=TRUE)
> class(MyData)
[1] "data.frame"
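If you have no text file at hand, the import can be tried out as follows. This is a sketch that first writes a small made-up data set to a temporary file; the column names and values are assumptions chosen for illustration.

```r
# Write a small, made-up data set to a temporary text file.
path = tempfile(fileext = ".txt")
writeLines(c("day temperature", "1 3.5", "2 4.1", "3 2.8"), path)

# header = TRUE tells R that the first line holds the variable names.
MyData = read.table(path, header = TRUE)

class(MyData)   # "data.frame"
names(MyData)   # "day" "temperature"
nrow(MyData)    # 3
```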
Loading .RData files
If you use the #\mathtt{load()}# command to load an #\mathtt{.RData}# file (in your #\mathrm{R}# working directory) into the #\mathrm{R}# workspace, you can use the #\mathtt{class()}# command as above to check whether a loaded object is a data frame.
In this course, you will be given many #\mathtt{.RData}# files which, when loaded, place an object in your workspace that is already in the #\mathtt{data.frame}# format. Note that you do not assign the result of #\mathtt{load()}# to a name, as you do with #\mathtt{read.table()}#; the loaded objects keep the names under which they were saved.
To see which objects were contained in the #\mathtt{.RData}# file after you loaded it, list the contents of your current #\mathrm{R}# workspace using the #\mathtt{ls()}# command.
> load("my_research.RData")
> ls()
[1] "Data_January_8" "Data_February_4" "Data_March_5"
> class(Data_February_4)
[1] "data.frame"
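The round trip can be sketched without an existing #\mathtt{.RData}# file by saving a made-up data frame first. The name #\mathtt{Example\_Data}# and the temporary file are assumptions used only for this illustration.

```r
# Create a made-up data frame and save it to a temporary .RData file.
Example_Data = data.frame(temperature = c(3.5, 4.1, 2.8))
path = tempfile(fileext = ".RData")
save(Example_Data, file = path)
rm(Example_Data)              # remove the object from the workspace

# load() restores the object under its original name and invisibly
# returns the names of the objects it created.
loaded_names = load(path)

loaded_names                  # "Example_Data"
class(Example_Data)           # "data.frame"
```

Capturing the return value of #\mathtt{load()}# is optional; it gives the same information as comparing #\mathtt{ls()}# before and after loading.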
Subsetting data
Each column of a data frame in #\mathrm{R}# is a variable and has a name. Either the name comes from the imported file (when you specify #\mathtt{header=TRUE}#) or from the loaded file, or #\mathrm{R}# assigns a default name such as #\mathtt{V1}#, #\mathtt{V2}#, etc.
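The default names can be seen in a short sketch. The input below is made up and passed through the #\mathtt{text}# argument of #\mathtt{read.table()}# so that no file is needed.

```r
# The made-up input has no header line, so R supplies the column names.
NoHeader = read.table(text = "1 3.5\n2 4.1\n3 2.8", header = FALSE)
default_names = names(NoHeader)
default_names                               # "V1" "V2"

# You can replace the default names with more descriptive ones.
names(NoHeader) = c("day", "temperature")
names(NoHeader)                             # "day" "temperature"
```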
If you want to pass the data for a variable to another #\mathrm{R}# function, you reference the column by first giving the name of the data frame, then the name of the column, joined by a #\mathtt{$}#.
> Data_February_4$temperature
Alternatively, if the variable you want is in column #3#, you can reference it using the column reference:
> Data_February_4[,3]
The blank space before the comma means that you want all the rows in column #3#.
If you only want rows #10# through #23#, you would use:
> Data_February_4$temperature[10:23]
or
> Data_February_4[10:23,3]
If you want rows #7#, #14#, #21#, and #30# through #33#, you can use:
> Data_February_4$temperature[c(7,14,21,30:33)]
or
> Data_February_4[c(7,14,21,30:33),3]
If you want all rows except row #8#, you can use:
> Data_February_4$temperature[-8]
or
> Data_February_4[-8,3]
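The two ways of referencing a column can be compared on a small example. This is a sketch using a made-up data frame #\mathtt{Example}# with 40 rows, in which the temperature happens to be column #3#.

```r
# Made-up data frame with 40 rows; temperature is the third column.
Example = data.frame(day = 1:40, station = "A",
                     temperature = seq(0.5, 20, by = 0.5))

# Rows 10 through 23, by column name and by column number.
by_name  = Example$temperature[10:23]
by_index = Example[10:23, 3]
identical(by_name, by_index)                # TRUE

# A selection of rows, and all rows except row 8.
picked  = Example[c(7, 14, 21, 30:33), 3]
dropped = Example[-8, 3]
length(picked)                              # 7
length(dropped)                             # 39
```

Referencing by name is usually more robust than referencing by number, since the code keeps working if the column order changes.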
Extracting variables
You can also extract a variable from the data frame so that it exists both within the data frame and also outside the data frame as a vector, directly accessible within the #\mathrm{R}# workspace. You have to choose a name for the vector when you extract it.
> Temps = Data_February_4$temperature
Then even if you alter the data contained in #\mathtt{Temps}#, the original data will still be preserved inside the data frame #\mathtt{Data\_February\_4}#.
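That the extracted vector is an independent copy can be verified with a short sketch; the data frame #\mathtt{Example}# and its values are made up for illustration.

```r
# Made-up data frame with a single variable.
Example = data.frame(temperature = c(3.5, 4.1, 2.8))

Temps = Example$temperature   # independent copy, stored as a plain vector
Temps[1] = 99                 # change the copy only

Temps[1]                      # 99
Example$temperature[1]        # 3.5, the data frame is unchanged
```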
Sorting vectors
You can sort the vector in order from smallest to largest using the #\mathtt{sort()}# function.
> Temps = sort(Temps)
If you want it sorted from largest to smallest, use:
> Temps = sort(Temps, decreasing=TRUE)
or
> Temps = sort(Temps, d=T)
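The effect of both calls can be seen on a small made-up vector. The results are stored under new, hypothetical names here so that the original order remains available for comparison.

```r
# Made-up temperatures, including a repeated value.
Temps = c(3.5, 4.1, 2.8, 4.1)

ascending  = sort(Temps)                      # smallest to largest
descending = sort(Temps, decreasing = TRUE)   # largest to smallest

ascending    # 2.8 3.5 4.1 4.1
descending   # 4.1 4.1 3.5 2.8
```

Writing out #\mathtt{decreasing=TRUE}# in full is the safer habit: the abbreviation #\mathtt{T}# is an ordinary variable that can be reassigned, whereas #\mathtt{TRUE}# cannot.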
Reordering rows of a data frame
You can sort the entire data frame by one of its columns using the #\mathtt{order()}# function.
For instance, to sort the data frame #\mathtt{Data\_February\_4}# by the temperature column, use:
> Data_February_4 = Data_February_4[order(Data_February_4$temperature),]
Then the rows of the data frame will be re-ordered from the case with the lowest temperature to the case with the highest temperature.
To sort from highest to lowest instead, use:
> Data_February_4 = Data_February_4[order(Data_February_4$temperature, decreasing=TRUE),]
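How #\mathtt{order()}# works can be sketched on a made-up data frame with four rows. It returns the row numbers in sorted order, and those row numbers are then used to rearrange the rows. The names #\mathtt{Example}#, #\mathtt{Ascending}# and #\mathtt{Descending}# are assumptions for this illustration.

```r
# Made-up data frame with four rows.
Example = data.frame(day = 1:4, temperature = c(3.5, 4.1, 2.8, 5.0))

# order() gives the row numbers from lowest to highest temperature.
idx = order(Example$temperature)
idx                                   # 3 1 2 4

Ascending  = Example[idx, ]
Descending = Example[order(Example$temperature, decreasing = TRUE), ]
Ascending$day                         # 3 1 2 4
Descending$day                        # 4 2 1 3

# Putting a minus sign in front of the row numbers does not reverse the
# order: negative indices mean "drop these rows", so no rows remain.
nrow(Example[-idx, ])                 # 0
```

For a numeric column, negating the values themselves, as in #\mathtt{order(-Example\$temperature)}#, also sorts from highest to lowest.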
See the #\mathrm{R}# manual for further information.