output

html_document

html_notebook

pdf_document

toc
true

default

toc
true

Data import and export

::: {.infobox .download data-latex="{download}"} You can download the corresponding R-Code here :::

Before you can start your analysis in R, you first need to import the data you wish to perform the analysis on. You will often be faced with different types of data formats (usually produced by some other statistical software like SPSS or Excel or a text editor). Fortunately, R is fairly flexible with respect to the sources from which data may be imported and you can import the most common data formats into R with the help of a few packages. R can, among others, handle data from the following sources:

In the previous chapter, we saw how we may use the keyboard to input data in R. In the following sections, we will learn how to import data from text files and other statistical software packages.

Getting data for this course

Most of the data sets we will be working with in this course will be stored in text files (i.e., .dat, .txt, .csv). All data sets we will be working with are stored in a repository on GitHub (similar to other cloud storage services such as Dropbox). You can directly import these data sets from GitHub without having to copy data sets from one place to another. If you know the location, where the files are stored, you may conveniently load the data directly from GitHub into R using the read.table() function. The header=TRUE argument indicates that the first line of data represents the header, i.e., it contains the names of the columns. The sep="\t"-argument specifies the delimiter (the character used to separate the columns), which is a TAB in this case.

test_data <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/test_data.dat",
    sep = "\t", header = TRUE)
head(test_data)

Note that it is also possible to download the data from the respective folder on the "Learn@WU" platform, placing it in the working directory and importing it from there. However, this requires an additional step to download the file manually first. If you chose this option, please remember to put the data file in the working directory first. If the import is not working, check your working directory setting using getwd(). Once you placed the file in the working directory, you can import it using the same command as above. Note that the file must be given as a character string (i.e., in quotation marks) and has to end with the file extension (e.g., .csv, .tsv, etc.).

test_data <- read.table("test_data.csv", header = TRUE,
    sep = ",")
head(test_data)

Import data created by other software packages

Sometimes, you may need to import data files created by other software packages, such as Excel or SPSS. In this section we will use the readxl and haven packages to do this. To import a certain file you should first make sure that the file is stored in your current working directory. You can list all file names in your working directory using the list.files() function. If the file is not there, either copy it to your current working directory, or set your working directory to the folder where the file is located using setwd("/path/to/file"). This tells R the folder you are working in. Remember that you have to use / instead of \ to specify the path (if you use Windows paths copied from the explorer they will not work). When your file is in your working directory you can simply enter the filename into the respective import command. The import commands offer various options. For more details enter ?read_excel, ?read_spss after loading the packages.

# import excel files
library(readxl)  #load package to import Excel files
excel_sheets("test_data.xlsx")
survey_data_xlsx <- read_excel("test_data.xlsx", sheet = "mrda_2016_survey")  # 'sheet=x'' specifies which sheet to import
head(survey_data_xlsx)

library(haven)  #load package to import SPSS files
# import SPSS files
survey_data_spss <- read_sav("test_data.sav")
head(survey_data_spss)

The import of other file formats works in a very similar way (e.g., Stata, SAS). Please refer to the respective help-files (e.g., ?read_dta, ?read_sas ...) if you wish to import data created by other software packages.

Import data from Qualtrics

There is also a dedicated package 'qualtRics' which lets you conveniently import data from surveys you conducted via Qualtrics. Simply export your data from Qualtrics as a .csv file (standard option) and you can read it into R as follows:

library(qualtRics)
qualtrics <- read_survey("qualtrics_survey.csv")
head(qualtrics)

When you inspect the data frame in R after you imported the data, you will find that it has some additional information compared to a standard .csv file. For example, each question (column) has the question number that you assigned in Qualtrics but also the Question text as an additional label.

Export data

Exporting to different formats is also easy, as you can just replace "read" with "write" in many of the previously discussed functions (e.g. write.table(object, "file_name")). This will save the data file to the working directory. To check what the current working directory is you can use getwd(). By default, the write.table(object, "file_name")function includes the row number as the first variable. By specifying row.names = FALSE, you may exclude this variable since it doesn't contain any useful information.

write.table(music_data, "musicData.dat", row.names = FALSE,
    sep = "\t")  #writes to a tab-delimited text file
write.table(music_data, "musicData.csv", row.names = FALSE,
    sep = ",")  #writes to a comma-separated value file 
write_sav(music_data, "my_file.sav")

Import data from the Web

Scraping data from websites

Sometimes you may come across interesting data on websites that you would like to analyze. Reading data from websites is possible in R, e.g., using the rvest package. Let's assume you would like to read a table that lists the population of different countries from this Wikipedia page. It helps to first inspect the structure of the website (e.g., using tools like SelectorGadget), so you know which elements you would like to extract. In this case it is fairly obvious that the data are stored in a table for which the associated html-tag is <table>. So let's read the entire website using read_html(url) and filter all tables using read_html(html_nodes(...,"table")).

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
population <- read_html(url)
population <- html_nodes(population, "table.wikitable")
print(population)

## {xml_nodeset (1)}
## [1] <table class="wikitable sortable"><tbody>\n<tr>\n<th>Rank</th>\n<th><a hr ...

The output shows that there are two tables on the website and the first one appears to contain the relevant information. So let's read the first table using the html_table() function. Note that population is of class "list". A list is a vector that has other R objects (e.g., other vectors, data frames, matrices, etc.) as its elements. If we want to access the data of one of the elements, we have to use two square brackets on each side instead of just one (e.g., population[[1]] gets us the first table from the list of tables on the website; the argument fill = TRUE ensures that empty cells are replaced with missing values when reading the table).

population <- population[[1]] %>%
    html_table(fill = TRUE)
head(population)  #checks if we scraped the desired data

You can see that population is read as a character variable because of the commas.

class(population$Population)

## [1] "character"

If we wanted to use this variable for some kind of analysis, we would first need to convert it to numeric format using the as.numeric() function. However, before we can do this, we can use the str_replace_all() function from the stringr package, which replaces all matches of a string. In our case, we would like to replace the commas (",") with nothing ("").

library(stringr)
population$Population <- as.numeric(str_replace_all(population$Population,
    pattern = ",", replacement = ""))  #convert to numeric
head(population)  #checks if we scraped the desired data

Now the variable is of type "numeric" and could be used for analysis.

class(population$Population)

## [1] "numeric"

Scraping data from APIs

Scraping data from APIs directly

Reading data from websites can be tricky since you need to analyze the page structure first. Many web-services (e.g., Facebook, Twitter, YouTube) actually have application programming interfaces (API's), which you can use to obtain data in a pre-structured format. JSON (JavaScript Object Notation) is a popular lightweight data-interchange format in which data can be obtained. The process of obtaining data is visualized in the following graphic:

The process of obtaining data from APIs consists of the following steps:

Identify an API that has enough data to be relevant and reliable (e.g., www.programmableweb.com has >12,000 open web APIs in 63 categories).
Request information by calling (or, more technically speaking, creating a request to) the API (e.g., R, python, php or JavaScript).
Receive response messages, which is usually in JavaScript Object Notation (JSON) or Extensible Markup Language (XML) format.
Write a parser to pull out the elements you want and put them into a of simpler format
Store, process or analyze data according the marketing research question.

Let's assume that you would like to obtain population data again. The World Bank has an API that allows you to easily obtain this kind of data. The details are usually provided in the API reference, e.g., here. You simply "call" the API for the desired information and get a structured JSON file with the desired key-value pairs in return. For example, the population for Austria from 1960 to 2019 can be obtained using this call. The file can be easily read into R using the fromJSON()-function from the jsonlite-package. Again, the result is a list and the second element ctrydata[[2]] contains the desired data, from which we select the "value" and "data" columns using the square brackets as usual [,c("value","date")]

library(jsonlite)
url <- "http://api.worldbank.org/v2/countries/AT/indicators/SP.POP.TOTL/?date=1960:2019&format=json&per_page=100"  #specifies url
ctrydata <- fromJSON(url)  #parses the data 
str(ctrydata)

## List of 2
##  $ :List of 7
##   ..$ page       : int 1
##   ..$ pages      : int 1
##   ..$ per_page   : int 100
##   ..$ total      : int 60
##   ..$ sourceid   : chr "2"
##   ..$ sourcename : chr "World Development Indicators"
##   ..$ lastupdated: chr "2021-07-30"
##  $ :'data.frame':	60 obs. of  8 variables:
##   ..$ indicator      :'data.frame':	60 obs. of  2 variables:
##   .. ..$ id   : chr [1:60] "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
##   .. ..$ value: chr [1:60] "Population, total" "Population, total" "Population, total" "Population, total" ...
##   ..$ country        :'data.frame':	60 obs. of  2 variables:
##   .. ..$ id   : chr [1:60] "AT" "AT" "AT" "AT" ...
##   .. ..$ value: chr [1:60] "Austria" "Austria" "Austria" "Austria" ...
##   ..$ countryiso3code: chr [1:60] "AUT" "AUT" "AUT" "AUT" ...
##   ..$ date           : chr [1:60] "2019" "2018" "2017" "2016" ...
##   ..$ value          : int [1:60] 8879920 8840521 8797566 8736668 8642699 8546356 8479823 8429991 8391643 8363404 ...
##   ..$ unit           : chr [1:60] "" "" "" "" ...
##   ..$ obs_status     : chr [1:60] "" "" "" "" ...
##   ..$ decimal        : int [1:60] 0 0 0 0 0 0 0 0 0 0 ...

head(ctrydata[[2]][, c("value", "date")])  #checks if we scraped the desired data

Scraping data from APIs via R packages

An even more convenient way to obtain data from web APIs is to use existing R packages that someone else has already created. There are R packages available for various web-services. For example, the gtrendsR package can be used to conveniently obtain data from the Google Trends page. The gtrends() function is easy to use and returns a list of elements (e.g., "interest over time", "interest by city", "related topics"), which can be inspected using the ls() function. The following example can be used to obtain data for the search term "data science" in the US between September 1 and October 6:

library(gtrendsR)
# specify search term, area, source and time
# frame
google_trends <- gtrends("data science", geo = c("US"),
    gprop = c("web"), time = "2012-09-01 2020-10-06")
# inspect trend over time data frame
head(google_trends$interest_over_time)

Although we haven't covered data visualization yet (see chapter 5), you could also easily plot the data to see the increasing trend for the search term we selected using the plot()-function. Note that the argument type = "b" indicates that both - a combination of line and points - should be used.

# plot data
plot(google_trends$interest_over_time[, c("date", "hits")],
    type = "b")

Another advantage of R is that it is open to user contributions. This often means that packages that allow users to collect data to investigate timely issues are available fairly quickly. As an example, consider the recent pandemic where many resources were made available via R packages to researchers (see here for an overview). For example, we might want to get information on the number of daily confirmed cases in the US on the state level. We could obtain this information in just one line of code using the 'COVID19' package.

library(COVID19)
covid_data <- covid19(country = "US", level = 2, start = "2020-01-01")

## We have invested a lot of time and effort in creating COVID-19 Data Hub, please cite the following when using it:
## 
##   Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
##   Source Software 5(51):2376, doi: 10.21105/joss.02376.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {COVID-19 Data Hub},
##     year = {2020},
##     doi = {10.21105/joss.02376},
##     author = {Emanuele Guidotti and David Ardia},
##     journal = {Journal of Open Source Software},
##     volume = {5},
##     number = {51},
##     pages = {2376},
##   }
## 
## To retrieve citation and metadata of the data sources see ?covid19cite. To hide this message use 'verbose = FALSE'.

head(covid_data)

Again, we could plot this data easily. In the following example, we first subset the data to the state of New York and then plot the development over time using the plot()-function. The argument type = "l" indicates that a line plot should be produced.

# plot data
plot(covid_data[covid_data$administrative_area_level_2 ==
    "New York", c("date", "confirmed")], type = "l")

Learning check {-}

(LC3.1) Which of the following are data types are recognized by R?

Factor
Date
Decimal
Vector
None of the above

(LC3.2) What function should you use to check if an object is a data frame?

type()
str()
class()
object.type()
None of the above

(LC3.3) You would like to combine three vectors (student, grade, date) in a data frame. What would happen when executing the following code?

student <- c("Max", "Jonas", "Saskia", "Victoria")
grade <- c(3, 2, 1, 2)
date <- as.Date(c("2020-10-06", "2020-10-08", "2020-10-09"))
df <- data.frame(student, grade, date)

Error because a data frame can not have different data types
Error because you should use as.data.frame() instead of data.frame()
Error because all vectors need to have the same length
Error because the column names are not specified
This code should not report an error

You would like to analyze the following data frame

(LC3.4) How can you obtain Christina's grade from the data frame?

df[4,2]
df[2,4]
df[student="Christina","grade"]
df[student=="Christina","grade"]
None of the above

(LC3.5) How can you add a new variable 'student_id' to the data frame that assigns numbers to students in an ascending order?

df$student_id <- 1:nrow(df)
df&student_id <- 1:nrow(df)
df[,"student_id"] <- 1:nrow(df)
df$student_id <- 1:length(df)
None of the above

(LC3.6) How could you obtain all rows with students who obtained a 1?

df[df$grade==1,]
df[grade == min(df$grade),]
df[,df$grade==1]
df[grade==1,]
None of the above

**(LC3.7) How could you create a subset of observations where the grade is not missing (NA) **

df_subset <- df[grade!=NA,]
df_subset <- df[isnot.na(grade),]
df_subset <- df[!is.na(grade),]
df_subset <- df[,grade!=NA]
None of the above

(LC3.8) What is the share of students with a grade better than 3?

df[grade<3,]/nrow(df)
nrow(df[grade<3,])/length(df)
nrow(df[grade<3,])/nrow(df)
nrow(df[,grade<3])/nrow(df)
None of the above

(LC3.9) You would like to load a .csv file from your working directory. What function would you use do it?

read.table(file_name.csv)
load.csv("file.csv")
read.table("file.csv")
get.table(file_name.csv)
None of the above

(LC3.10) After you loaded the file, you would like to inspect the types of data contained in it. How would you do it?

ncol(df)
nrow(df)
dim(df)
str(df)
None of the above

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

03-data_import.md

03-data_import.md

Data import and export

Getting data for this course

Import data created by other software packages

Import data from Qualtrics

Export data

Import data from the Web

Scraping data from websites

Scraping data from APIs

Scraping data from APIs directly

Scraping data from APIs via R packages

Learning check {-}

Files

03-data_import.md

Latest commit

History

03-data_import.md

File metadata and controls

Data import and export

Getting data for this course

Import data created by other software packages

Import data from Qualtrics

Export data

Import data from the Web

Scraping data from websites

Scraping data from APIs

Scraping data from APIs directly

Scraping data from APIs via R packages

Learning check {-}