Sometimes, data that we want to combine for analyses are separated across different spreadsheets or data tables. How can we combine these different data tables? Join
operations (FMI on join
ing two data tables) offer a way to merge data across multiple data tables (also called data frames in R
parlance).
Below, I will first create two data tables that store different characteristics about fruit.
### Load dplyr and stringr packages into R workspace
library(dplyr)
library(stringr)
### Table storing colors of 5 fruits
dt1 <- tibble::tibble(fruit=c("apple","pear","orange","kiwi","mangosteen"),
color=c("red","green","orange","brown","purple"))
dt1 # display table
## # A tibble: 5 × 2
## fruit color
## <chr> <chr>
## 1 apple red
## 2 pear green
## 3 orange orange
## 4 kiwi brown
## 5 mangosteen purple
### Table storing prices of 3 fruits
dt2 <- tibble::tibble(fruit=c("pear","orange","mangosteen","cherimoya"),
price=c(1.25,1,5,4.7)) # price per pound
dt2 # display table
## # A tibble: 4 × 2
## fruit price
## <chr> <dbl>
## 1 pear 1.25
## 2 orange 1
## 3 mangosteen 5
## 4 cherimoya 4.7
All join operations assume there is some column between your data tables that has consistent values that you can use to merge the records in the data tables together. In the case of this simple example, the shared column is fruit
.
The first join that we will look at is the left_join
function from the dplyr
package. The left_join
will keep all of the rows in the left-hand side table and attempt to match entries from the second (right-hand side) table. If there is no match, R
will populate NA
(missing value) at that missing join location.
left_join(dt1, dt2)
## # A tibble: 5 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green 1.25
## 3 orange orange 1
## 4 kiwi brown NA
## 5 mangosteen purple 5
# Note that the left_join function detects that both data tables both have a column named "fruit". What would happen if they didn't have a column with the same name?
We see that R
has merged the two data tables, dt1
and dt2
together; the resulting data table above has all of the unique columns across these two data tables (fruit
- the column we used to merge the two data tables together, color
from dt1
, and price
from dt2
). We also see that R
has matched the records across the two data tables–see for yourself for example that the row for pear
has the correct values from dt1
(color=green
) and dt2
(price=1.25
).
NA
values?What we see above is that because dt2
didn’t have any data for apple
and kiwi
, R
has attempted to match all 5 fruits from dt1
(left-hand side data table) with the fruits in dt2
, and the fruits that were in dt1
but not dt2
get an NA
value in the price
column, which is merged from dt2
to dt1
. Note also that the fruit that is only in dt2
but not dt1
doesn’t show up in the merged data table (no cherimoya
row).
In this case, R
retains all of the records from dt2
. Because dt1
has some fruits that dt2
does not, that means that the fruits that are unique to dt1
will not show up in the merged data table.
right_join(dt1, dt2)
## # A tibble: 4 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 pear green 1.25
## 2 orange orange 1
## 3 mangosteen purple 5
## 4 cherimoya <NA> 4.7
As before, because cherimoya
is only defined in dt2
, but not dt1
, R
populates an NA
value for all of the columns that are coming from dt1
(in this case, the column color
) to the final merged data table.
In the inner_join
function, only those rows that have exact matches across the two data tables are preserved in the final merged data table.
inner_join(dt1, dt2)
## # A tibble: 3 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 pear green 1.25
## 2 orange orange 1
## 3 mangosteen purple 5
We see that this resulting data table presents only the three fruits that are included in both dt1
and dt2
.
In a full join operation, all unique elements across dt1
and dt2
are preserved in the final merged data table. For non-matching pairs, R
will fill in an NA
value.
full_join(dt1, dt2)
## # A tibble: 6 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green 1.25
## 3 orange orange 1
## 4 kiwi brown NA
## 5 mangosteen purple 5
## 6 cherimoya <NA> 4.7
We see that this merged table has all 6 unique fruits across the two data tables. We see that R
has assigned NA
in price
(the column from dt2
) for the two fruits that are in dt1
but not dt2
(apple
and kiwi
). We also see that there is an NA
value in color
(the column from dt1
) for cherimoya
, which is a fruit that was only listed in dt2
.
R
can’t read our minds, for better or for worse. So if we notice that our data tables are not consistent in how they present the fruits, then we’ll see different join behavior. R
is looking for an exact match of the words in the joining column (fruit
).
### Modifying dt2
dt2 <- tibble::tibble(fruit=c("Pear","Orange","Mangosteen","Cherimoya"),
price=c(1.25,1,5,4.7)) # price per pound
dt2 # display table
## # A tibble: 4 × 2
## fruit price
## <chr> <dbl>
## 1 Pear 1.25
## 2 Orange 1
## 3 Mangosteen 5
## 4 Cherimoya 4.7
# dt1 # display dt1 to compare the fruit column visually
### See what happens - R can't join the fruits up - the words are not identical because R is case sensitive
full_join(dt1,dt2)
## # A tibble: 9 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green NA
## 3 orange orange NA
## 4 kiwi brown NA
## 5 mangosteen purple NA
## 6 Pear <NA> 1.25
## 7 Orange <NA> 1
## 8 Mangosteen <NA> 5
## 9 Cherimoya <NA> 4.7
Dang! R
has treated pear
as a different value than Pear
. That’s really silly and unfortunate. It’s not “automagically” joining up the data across the two tables because R
is case sensitive (that is, pear
is different from Pear
to R
).
We can fix this example here by taking dt2
and making the fruit
column all lower-case then re-running the join operation.
### Let's fix this problem and make the fruit names consistent
dt2new <- dt2 %>%
mutate(fruit=tolower(fruit)) # a function that takes character/strings (words) and converts them to lower case
dt2new # confirm that now the fruit names are lower case
## # A tibble: 4 × 2
## fruit price
## <chr> <dbl>
## 1 pear 1.25
## 2 orange 1
## 3 mangosteen 5
## 4 cherimoya 4.7
### Re-run full_join
full_join(dt1,dt2new)
## # A tibble: 6 × 3
## fruit color price
## <chr> <chr> <dbl>
## 1 apple red NA
## 2 pear green 1.25
## 3 orange orange 1
## 4 kiwi brown NA
## 5 mangosteen purple 5
## 6 cherimoya <NA> 4.7
Given dt1
, dt2
, and dt2new
that have just been defined above, run and contrast the outputs of the following:
left_join(dt1, dt2)
left_join(dt1, dt2new)
right_join(dt1, dt2)
right_join(dt1, dt2new)
inner_join(dt1, dt2)
inner_join(dt1, dt2new)
inner_join(dt1, dt2)
tells you that it didn’t find any matches! It produces a table that has 0 x 3
dimensions, or 0
rows because it didn’t find any matching records across dt1
and dt2
.As an additional illustration of data merging, I will illustrate below how one can add on socio-economic data from the USDA and the US Census to the Provided Datasheet. In the code chunk below, I pull in data from the GitHub repository that stores the class R
tutorials.
### Read in data
CA_county_data <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_protectedareas_datasheet.tsv") # this is a link to the Provided Datasheet for the class project
### Take a look at the first few rows of the data table
CA_county_data
## # A tibble: 58 × 26
## County Birds Trees Reptiles Mammals area_of_county_ha lat_county_centroid_y
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alameda 235 30 27 54 212722. 37.7
## 2 Alpine 174 29 15 71 192485 38.6
## 3 Amador 172 41 21 69 156940. 38.4
## 4 Butte 184 45 21 67 434375. 39.7
## 5 Calaver… 173 39 21 66 268563. 38.2
## 6 Colusa 189 40 20 58 299495. 39.2
## 7 Contra … 233 35 25 57 208175 37.9
## 8 Del Nor… 199 56 19 71 318486. 41.7
## 9 El Dora… 174 41 21 71 462662. 38.8
## 10 Fresno 173 38 25 72 1556855. 36.8
## # ℹ 48 more rows
## # ℹ 19 more variables: long_county_centroid_x <dbl>, PAs_gapstatus1_HAs <dbl>,
## # PAs_gapstatus2_HAs <dbl>, PAs_gapstatus3_HAs <dbl>,
## # PAs_gapstatus4_HAs <dbl>, PAs_gapstatus1thru4_HAs <dbl>,
## # PAs_gapstatus1_Ct <dbl>, PAs_gapstatus2_Ct <dbl>, PAs_gapstatus3_Ct <dbl>,
## # PAs_gapstatus4_Ct <dbl>, PAs_gapstatus1thru4_Cts <dbl>,
## # farmland_value_USDperHA <dbl>, farmland_area_HAs <dbl>, …
### Data on socio-economic statuses in California
CA_county_SES <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_ses.tsv") # URL for spreadsheet storing socioeconomic data by county in California
### Take a look at the first few rows of the data table
CA_county_SES
## # A tibble: 58 × 6
## State County rurban PctPoverty PctChildPoverty MedianHHincome
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 CA Alameda County Urban 8.9 9.8 107589
## 2 CA Alpine County Rural 17.2 29.6 58112
## 3 CA Amador County Rural 9.8 12.5 62640
## 4 CA Butte County Urban 16.1 16.1 58394
## 5 CA Calaveras County Rural 12.1 19.6 68248
## 6 CA Colusa County Rural 12 15.3 59048
## 7 CA Contra Costa County Urban 7.9 9.9 106555
## 8 CA Del Norte County Rural 17.9 25.8 48979
## 9 CA El Dorado County Urban 8.7 9.1 86202
## 10 CA Fresno County Urban 20.5 29 56926
## # ℹ 48 more rows
### Data on racial composition in California
CA_county_demog <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_census.tsv")
### Take a look at the first few rows of the data table
CA_county_demog
## # A tibble: 58 × 10
## STNAME CTYNAME TOT_POP BAC IAC AAC NAC LATINX NHLATINX WA
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California Alamed… 1650950 0.133 0.0235 0.334 0.0162 0.224 0.776 0.493
## 2 California Alpine… 1047 0.0143 0.233 0.0210 0.00287 0.100 0.900 0.729
## 3 California Amador… 37429 0.0289 0.0418 0.0244 0.00532 0.134 0.866 0.900
## 4 California Butte … 226231 0.0296 0.0445 0.0616 0.00653 0.160 0.840 0.858
## 5 California Calave… 45322 0.0159 0.0390 0.0259 0.00519 0.116 0.884 0.914
## 6 California Colusa… 21496 0.0203 0.0377 0.0228 0.00791 0.587 0.413 0.911
## 7 California Contra… 1137268 0.111 0.0221 0.202 0.0113 0.254 0.746 0.654
## 8 California Del No… 27382 0.0424 0.127 0.0452 0.00427 0.196 0.804 0.782
## 9 California El Dor… 185976 0.0165 0.0278 0.0599 0.00545 0.127 0.873 0.890
## 10 California Fresno… 976830 0.0679 0.0421 0.121 0.00517 0.527 0.473 0.764
## # ℹ 48 more rows
Note that you can find a description of the Provided Datasheet Data Explanation here.
On the other hand, the columns in CA_county_SES
are as follows:
State
: Abbreviated name for California (CA
)County
: County namerurban
: Rural or urban designation for the countyPctPoverty
: The percentage of the county’s population that lives in poverty in 2019PctChildPoverty
: The percentage of the county’s children (ages 0-17) that live in poverty in 2019MedianHHincome
: Median annual household income in the countyThese data for California were modified from national County-level Data setsat the USDA Economic Research Service. The county-level data sets website opens up a data viewer; here is the data viewer for California.
The columns in CA_county_demog
are as follows:
STNAME
: The name of the stateCTYNAME
: The name of the countyTOT_POP
: Total population of the countyBAC
: The proportion of county residents who self-identify as Black or African-American alone or in combination
(in combination
includes multiracial respondents)IAC
: The proportion of county residents who self-identify as American Indian and Alaska Native alone or in combination
AAC
: The proportion of county residents who self-identify as Asian alone or in combination
NAC
: The proportion of county residents who self-identify as Native Hawaiian and Other Pacific Islander alone or in combination
LATINX
: The proportion of the county residents who self-identify as Hispanic
and/or LatinxNHLATINX
: The proportion of the county residents who self-identify as Not Hispanic
and/or not LatinxWA
: The proportion of the county residents who self-identify as White
All of the racial/ethnic groupings above are those defined by the US Census as described in the metadata document. These data were calculated from the US Census Bureau County Population by Characteristics: 2010-2019; specifically the Annual County Resident Population Estimates by Age, Sex, Race...
section of the site.
Let’s say that for an analysis, you wanted to combine some information about biodiversity (say Birds
species richness) with socio-economic data. We can use the data joining operations that we just covered above to merge these data tables together, which permits us to plot our data and (as we will see after Spring Break) perform other analyses such as linear regression models.
We will join up CA_county_data
(the Provided Datasheet with biodiversity and land use data) with CA_county_SES
and CA_county_demog
. To do so, we will merge the data tables based on the column that stores county names.
The first thing that we will need to do is ensure that the county names are consistent (no weird issues with lower/upper case or anything else). What do we see?
### Show the first six counties in each data table
head(CA_county_data$County) # Here we are using the $ symbol to refer to a particular column that we want to pull out of a data table. This is pretty much equivalent to calling dplyr::select(CA_county_data, County)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
head(CA_county_SES$County)
## [1] "Alameda County" "Alpine County" "Amador County" "Butte County"
## [5] "Calaveras County" "Colusa County"
head(CA_county_demog$CTYNAME) # note that county in this data table is stored in a column called CTYNAME, not County
## [1] "Alameda County" "Alpine County" "Amador County" "Butte County"
## [5] "Calaveras County" "Colusa County"
Oof! We see that our matches won’t work–both CA_county_SES
and CA_county_demog
have County
added to the end of each county’s name, while CA_county_data
does not. No problem there–we’ll use a “string substitution” function (akin to using find and replace to search for then delete a string or replace it with something else) to remove County
from the end of the county names. We’ll then check the county name columns in the two data tables to ensure that we’ve got names for counties that will match across all three data tables.
CA_county_SES$County <- str_replace_all(CA_county_SES$County," County","") # replace " County" at the end of the county names with nothing - basically like find and replace deleting the " County" string.
CA_county_demog$CTYNAME <- str_replace_all(CA_county_demog$CTYNAME," County","")
### Checking that we've removed " County" by inspecting the first few values in the county name columns
head(CA_county_SES$County)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
head(CA_county_demog$CTYNAME)
## [1] "Alameda" "Alpine" "Amador" "Butte" "Calaveras" "Colusa"
Below, we will use inner_join
to first join CA_county_data
with CA_county_SES
, then CA_county_demog
.
### First we will check the dimensions of the CA_county_data column
dim(CA_county_data) # 58 rows by 26 columns
## [1] 58 26
### Merge the data table that has socioeconomic data to CA_county_data
CA_merged <- inner_join(CA_county_data, CA_county_SES)
dim(CA_merged) # now we have 31 columns
## [1] 58 31
### Merge the demographic data table to the merged data table above, which was stored in CA_merged
CA_merged <- inner_join(CA_merged, CA_county_demog, by=c("County"="CTYNAME"))
# Remember my question earlier about doing matches across data tables where the column names aren't the same? Specifying by=c(columnNameForMatchingInDataTable1 = column_name_for_matching_in_data_table_2) is how we can tell R which columns to merge the data tables on.
dim(CA_merged) # now we have 40 columns
## [1] 58 40
### Display the final merged data table
# View(CA_merged) # uncomment to run