While Duke Libraries Data and Visualization Services provides a number of workshops on learning how to use R for data management, mapping, and visualization, we also receive a several questions each year from students searching for datasets to use in applied statistics courses at Duke. As others have noted, finding data sets that are engaging (for students) that provide a range of data types for analysis and don’t present too many data loading/cleaning challenges is harder than one might think.
In the hope of making the search a little easier, we offer this blog post to share some considerations for choosing a dataset for an applied statistics course. We also suggests few “curated” datasets designed to ease the process of data access in R.
We work with a number of students using linear models in R. Three of the largest data access and management challenges for students in these courses include:
Format - Are the data in a format that is easy to import into R? R can readily input delimited files. Using packages like googlesheets, readxl, and haven allows one to import google sheets, excel worksheets, and statistical formats. No matter which package you employ, it’s always a great idea to run descriptive statistics on your data after loading any data into R.
Missing data - It’s important to determine if there are enough observations of the variables of interest in your data. Fortunately, R makes it fairly easy to generate summary statistics and identify missing observations.
Factors - Several years ago we had many questions about how to deal with non-numeric data that R automatically transformed into factors. With the rise of the tidyverse, these questions have declined, but it remains important to know how to create (and occasionally remove) factors for statistical analysis.
R Datasets from DVS
Data and Visualization Services has created a set of four datasets that we often suggest as a starting point for students who want to explore survey data that has been subsetted for use in R. In addition to these four datasets, we also offer a few other sources that have been popular with Duke students.
ANES 2012 Survey
The American National Elections Study (ANES) contains a variety of attitudinal and voter intention questions and is conducted both before and after an election. This allows the study to compare intentions and beliefs before an election, and actions following an election.
General Social Survey (R Subset)
The General Social Survey (GSS) contains both a standard core of demographic, behavioral, and attitudinal questions, and a second group of questions on special topics that are not asked every year. This survey has been periodically conducted in the United States since 1972, and the cumulative dataset will provide respondents from all survey years. This subset mostly includes only questions that were included every year or almost every year.
World Values Survey (R Subset)
The World Values Survey is a global network of social scientists studying changing values and their impact on social and political life, led by an international team of scholars, with the WVS association and secretariat headquartered in Stockholm, Sweden. This particular extract covers wave 6 of the WVS (2010-2014).
Behavioral Risk Factors Surveillance System (BRFSS)
BRFSS is an ongoing public health surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. This extract of the BRFSS provides data from 2013.
Other Datasets/Data Sources of Interest
Dataset(s) of movie information underlying the popular IMDB database
The world’s largest social science data repository. Some datasets are readily available for R, but many will require some processing or conversion.
DVS is always happy to consult with Duke affiliates on other possibilities, please contact us at email@example.com for further possibilities.