Sometimes, data that we want to combine for analyses are separated across different spreadsheets or data tables. How can we combine these different data tables? Join operations (FMI on joining two data tables) offer a way to merge data across multiple data tables (also called data frames in R parlance).
Below, I will first create two data tables that store different characteristics about fruit.
### Load dplyr and stringr packages into R workspace
library(dplyr)
library(stringr)
### Table storing colors of 5 fruits
dt1 <- tibble::tibble(fruit=c("apple","pear","orange","kiwi","mangosteen"),
color=c("red","green","orange","brown","purple"))
dt1 # display table
## # A tibble: 5 × 2
## fruit color
## <chr> <chr>
## 1 apple red
## 2 pear green
## 3 orange orange
## 4 kiwi brown
## 5 mangosteen purple
### Table storing prices of 3 fruits
dt2 <- tibble::tibble(fruit=c("pear","orange","mangosteen","cherimoya"),
price=c(1.25,1,5,4.7)) # price per pound
dt2 # display table
## # A tibble: 4 × 2
## fruit price
## <chr> <dbl>
## 1 pear 1.25
## 2 orange 1
## 3 mangosteen 5
## 4 cherimoya 4.7
All join operations assume there is some column between your data tables that has consistent values that you can use to merge the records in the data tables together. In the case of this simple example, the shared column is fruit.
The first join that we will look at is the left_join function from the dplyr package. The left_join will keep all of the rows in the left-hand side table and attempt to match entries from the second (right-hand side) table. If there is no match, R will populate NA (missing value) at that missing join location.
left_join(dt1, dt2)
## # A tibble: 5 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green 1.25
## 3 orange orange 1
## 4 kiwi brown NA
## 5 mangosteen purple 5
# Note that the left_join function detects that both data tables both have a column named "fruit". What would happen if they didn't have a column with the same name?
We see that R has merged the two data tables, dt1 and dt2 together; the resulting data table above has all of the unique columns across these two data tables (fruit - the column we used to merge the two data tables together, color from dt1, and price from dt2). We also see that R has matched the records across the two data tables–see for yourself for example that the row for pear has the correct values from dt1 (color=green) and dt2 (price=1.25).
NA values?What we see above is that because dt2 didn’t have any data for apple and kiwi, R has attempted to match all 5 fruits from dt1 (left-hand side data table) with the fruits in dt2, and the fruits that were in dt1 but not dt2 get an NA value in the price column, which is merged from dt2 to dt1. Note also that the fruit that is only in dt2 but not dt1 doesn’t show up in the merged data table (no cherimoya row).
In this case, R retains all of the records from dt2. Because dt1 has some fruits that dt2 does not, that means that the fruits that are unique to dt1 will not show up in the merged data table.
right_join(dt1, dt2)
## # A tibble: 4 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 pear green 1.25
## 2 orange orange 1
## 3 mangosteen purple 5
## 4 cherimoya <NA> 4.7
As before, because cherimoya is only defined in dt2, but not dt1, R populates an NA value for all of the columns that are coming from dt1 (in this case, the column color) to the final merged data table.
In the inner_join function, only those rows that have exact matches across the two data tables are preserved in the final merged data table.
inner_join(dt1, dt2)
## # A tibble: 3 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 pear green 1.25
## 2 orange orange 1
## 3 mangosteen purple 5
We see that this resulting data table presents only the three fruits that are included in both dt1 and dt2.
In a full join operation, all unique elements across dt1 and dt2 are preserved in the final merged data table. For non-matching pairs, R will fill in an NA value.
full_join(dt1, dt2)
## # A tibble: 6 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green 1.25
## 3 orange orange 1
## 4 kiwi brown NA
## 5 mangosteen purple 5
## 6 cherimoya <NA> 4.7
We see that this merged table has all 6 unique fruits across the two data tables. We see that R has assigned NA in price (the column from dt2) for the two fruits that are in dt1 but not dt2 (apple and kiwi). We also see that there is an NA value in color (the column from dt1) for cherimoya, which is a fruit that was only listed in dt2.
R can’t read our minds, for better or for worse. So if we notice that our data tables are not consistent in how they present the fruits, then we’ll see different join behavior. R is looking for an exact match of the words in the joining column (fruit).
### Modifying dt2
dt2 <- tibble::tibble(fruit=c("Pear","Orange","Mangosteen","Cherimoya"),
price=c(1.25,1,5,4.7)) # price per pound
dt2 # display table
## # A tibble: 4 × 2
## fruit price
## <chr> <dbl>
## 1 Pear 1.25
## 2 Orange 1
## 3 Mangosteen 5
## 4 Cherimoya 4.7
# dt1 # display dt1 to compare the fruit column visually
### See what happens - R can't join the fruits up - the words are not identical because R is case sensitive
full_join(dt1,dt2)
## # A tibble: 9 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green NA
## 3 orange orange NA
## 4 kiwi brown NA
## 5 mangosteen purple NA
## 6 Pear <NA> 1.25
## 7 Orange <NA> 1
## 8 Mangosteen <NA> 5
## 9 Cherimoya <NA> 4.7
Dang! R has treated pear as a different value than Pear. That’s really silly and unfortunate. It’s not “automagically” joining up the data across the two tables because R is case sensitive (that is, pear is different from Pear to R).
We can fix this example here by taking dt2 and making the fruit column all lower-case then re-running the join operation.
### Let's fix this problem and make the fruit names consistent
dt2new <- dt2 %>%
mutate(fruit=tolower(fruit)) # a function that takes character/strings (words) and converts them to lower case
dt2new # confirm that now the fruit names are lower case
## # A tibble: 4 × 2
## fruit price
## <chr> <dbl>
## 1 pear 1.25
## 2 orange 1
## 3 mangosteen 5
## 4 cherimoya 4.7
### Re-run full_join
full_join(dt1,dt2new)
## # A tibble: 6 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green 1.25
## 3 orange orange 1
## 4 kiwi brown NA
## 5 mangosteen purple 5
## 6 cherimoya <NA> 4.7
Given dt1, dt2, and dt2new that have just been defined above, run and contrast the outputs of the following:
left_join(dt1, dt2)
left_join(dt1, dt2new)right_join(dt1, dt2)
right_join(dt1, dt2new)inner_join(dt1, dt2)
inner_join(dt1, dt2new)inner_join(dt1, dt2) tells you that it didn’t find any matches! It produces a table that has 0 x 3 dimensions, or 0 rows because it didn’t find any matching records across dt1 and dt2.As an additional illustration of data merging, I will illustrate below how one can add on socio-economic data from the USDA and the US Census to the Provided Datasheet. In the code chunk below, I pull in data from the GitHub repository that stores the class R tutorials.
### Read in data
CA_county_data <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_protectedareas_datasheet.tsv") # this is a link to the Provided Datasheet for the class project
### Take a look at the first few rows of the data table
CA_county_data
## # A tibble: 58 × 26
## County Birds Trees Reptiles Mammals area_of_county_ha lat_county_centroid_y
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alameda 235 30 27 54 212722. 37.7
## 2 Alpine 174 29 15 71 192485 38.6
## 3 Amador 172 41 21 69 156940. 38.4
## 4 Butte 184 45 21 67 434375. 39.7
## 5 Calaver… 173 39 21 66 268563. 38.2
## 6 Colusa 189 40 20 58 299495. 39.2
## 7 Contra … 233 35 25 57 208175 37.9
## 8 Del Nor… 199 56 19 71 318486. 41.7
## 9 El Dora… 174 41 21 71 462662. 38.8
## 10 Fresno 173 38 25 72 1556855. 36.8
## # ℹ 48 more rows
## # ℹ 19 more variables: long_county_centroid_x <dbl>, PAs_gapstatus1_HAs <dbl>,
## # PAs_gapstatus2_HAs <dbl>, PAs_gapstatus3_HAs <dbl>,
## # PAs_gapstatus4_HAs <dbl>, PAs_gapstatus1thru4_HAs <dbl>,
## # PAs_gapstatus1_Ct <dbl>, PAs_gapstatus2_Ct <dbl>, PAs_gapstatus3_Ct <dbl>,
## # PAs_gapstatus4_Ct <dbl>, PAs_gapstatus1thru4_Cts <dbl>,
## # farmland_value_USDperHA <dbl>, farmland_area_HAs <dbl>, …
### Data on socio-economic statuses in California
CA_county_SES <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_ses.tsv") # URL for spreadsheet storing socioeconomic data by county in California
### Take a look at the first few rows of the data table
CA_county_SES
## # A tibble: 58 × 6
## State County rurban PctPoverty PctChildPoverty MedianHHincome
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 CA Alameda County Urban 8.9 9.8 107589
## 2 CA Alpine County Rural 17.2 29.6 58112
## 3 CA Amador County Rural 9.8 12.5 62640
## 4 CA Butte County Urban 16.1 16.1 58394
## 5 CA Calaveras County Rural 12.1 19.6 68248
## 6 CA Colusa County Rural 12 15.3 59048
## 7 CA Contra Costa County Urban 7.9 9.9 106555
## 8 CA Del Norte County Rural 17.9 25.8 48979
## 9 CA El Dorado County Urban 8.7 9.1 86202
## 10 CA Fresno County Urban 20.5 29 56926
## # ℹ 48 more rows
### Data on racial composition in California
CA_county_demog <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_census.tsv")
### Take a look at the first few rows of the data table
CA_county_demog
## # A tibble: 58 × 10
## STNAME CTYNAME TOT_POP BAC IAC AAC NAC LATINX NHLATINX WA
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California Alamed… 1650950 0.133 0.0235 0.334 0.0162 0.224 0.776 0.493
## 2 California Alpine… 1047 0.0143 0.233 0.0210 0.00287 0.100 0.900 0.729
## 3 California Amador… 37429 0.0289 0.0418 0.0244 0.00532 0.134 0.866 0.900
## 4 California Butte … 226231 0.0296 0.0445 0.0616 0.00653 0.160 0.840 0.858
## 5 California Calave… 45322 0.0159 0.0390 0.0259 0.00519 0.116 0.884 0.914
## 6 California Colusa… 21496 0.0203 0.0377 0.0228 0.00791 0.587 0.413 0.911
## 7 California Contra… 1137268 0.111 0.0221 0.202 0.0113 0.254 0.746 0.654
## 8 California Del No… 27382 0.0424 0.127 0.0452 0.00427 0.196 0.804 0.782
## 9 California El Dor… 185976 0.0165 0.0278 0.0599 0.00545 0.127 0.873 0.890
## 10 California Fresno… 976830 0.0679 0.0421 0.121 0.00517 0.527 0.473 0.764
## # ℹ 48 more rows
Note that you can find a description of the Provided Datasheet Data Explanation here.
On the other hand, the columns in CA_county_SES are as follows:
State: Abbreviated name for California (CA)County: County namerurban: Rural or urban designation for the countyPctPoverty: The percentage of the county’s population that lives in poverty in 2019PctChildPoverty: The percentage of the county’s children (ages 0-17) that live in poverty in 2019MedianHHincome: Median annual household income in the countyThese data for California were modified from national County-level Data setsat the USDA Economic Research Service. The county-level data sets website opens up a data viewer; here is the data viewer for California.
The columns in CA_county_demog are as follows:
STNAME: The name of the stateCTYNAME: The name of the countyTOT_POP: Total population of the countyBAC: The proportion of county residents who self-identify as Black or African-American alone or in combination (in combination includes multiracial respondents)IAC: The proportion of county residents who self-identify as American Indian and Alaska Native alone or in combinationAAC: The proportion of county residents who self-identify as Asian alone or in combinationNAC: The proportion of county residents who self-identify as Native Hawaiian and Other Pacific Islander alone or in combinationLATINX: The proportion of the county residents who self-identify as Hispanic and/or LatinxNHLATINX: The proportion of the county residents who self-identify as Not Hispanic and/or not LatinxWA: The proportion of the county residents who self-identify as WhiteAll of the racial/ethnic groupings above are those defined by the US Census as described in the metadata document. These data were calculated from the US Census Bureau County Population by Characteristics: 2010-2019; specifically the Annual County Resident Population Estimates by Age, Sex, Race... section of the site.
Let’s say that for an analysis, you wanted to combine some information about biodiversity (say Birds species richness) with socio-economic data. We can use the data joining operations that we just covered above to merge these data tables together, which permits us to plot our data and (as we will see after Spring Break) perform other analyses such as linear regression models.
We will join up CA_county_data (the Provided Datasheet with biodiversity and land use data) with CA_county_SES and CA_county_demog. To do so, we will merge the data tables based on the column that stores county names.
The first thing that we will need to do is ensure that the county names are consistent (no weird issues with lower/upper case or anything else). What do we see?
### Show the first six counties in each data table
head(CA_county_data$County) # Here we are using the $ symbol to refer to a particular column that we want to pull out of a data table. This is pretty much equivalent to calling dplyr::select(CA_county_data, County)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
head(CA_county_SES$County)
## [1] "Alameda County" "Alpine County" "Amador County" "Butte County"
## [5] "Calaveras County" "Colusa County"
head(CA_county_demog$CTYNAME) # note that county in this data table is stored in a column called CTYNAME, not County
## [1] "Alameda County" "Alpine County" "Amador County" "Butte County"
## [5] "Calaveras County" "Colusa County"
Oof! We see that our matches won’t work–both CA_county_SES and CA_county_demog have County added to the end of each county’s name, while CA_county_data does not. No problem there–we’ll use a “string substitution” function (akin to using find and replace to search for then delete a string or replace it with something else) to remove County from the end of the county names. We’ll then check the county name columns in the two data tables to ensure that we’ve got names for counties that will match across all three data tables.
CA_county_SES$County <- str_replace_all(CA_county_SES$County," County","") # replace " County" at the end of the county names with nothing - basically like find and replace deleting the " County" string.
CA_county_demog$CTYNAME <- str_replace_all(CA_county_demog$CTYNAME," County","")
### Checking that we've removed " County" by inspecting the first few values in the county name columns
head(CA_county_SES$County)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
head(CA_county_demog$CTYNAME)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
Below, we will use inner_join to first join CA_county_data with CA_county_SES, then CA_county_demog.
### First we will check the dimensions of the CA_county_data column
dim(CA_county_data) # 58 rows by 26 columns
## [1] 58 26
### Merge the data table that has socioeconomic data to CA_county_data
CA_merged <- inner_join(CA_county_data, CA_county_SES)
dim(CA_merged) # now we have 31 columns
## [1] 58 31
### Merge the demographic data table to the merged data table above, which was stored in CA_merged
CA_merged <- inner_join(CA_merged, CA_county_demog, by=c("County"="CTYNAME"))
# Remember my question earlier about doing matches across data tables where the column names aren't the same? Specifying by=c(columnNameForMatchingInDataTable1 = column_name_for_matching_in_data_table_2) is how we can tell R which columns to merge the data tables on.
dim(CA_merged) # now we have 40 columns
## [1] 58 40
### Display the final merged data table
# View(CA_merged) # uncomment to run