R Left Join Merge


If you want to use dplyr left join or any other type of join in R to combine information from two or multiple data frames, this post might be very helpful. Here is how to left join only selected columns in R. The merge Function in R; Merge Multiple Data Frames in List; Row Bind Data in R; Column Bind Data in R; Join Data with dplyr Package; R Functions List (+ Examples) The R Programming Language. Summary: In this tutorial you learned how to join two data frames by their row names in the R programming language. In case you have any further. But, generally if you want a left join you should do: merge (x,y,all.x=T). The above code is saying keep all the observations from your x dataset (your left dataset) However, I suspect that that this won't solve your issue since you are getting MORE and not LESS observations than expected.

I’ve been encountering lists of data frames both at work and at play. Most of the time, I need only bind them togetherwith dplyr::bind_rows() or purrr::map_df(). But recently I’ve needed to join them by a shared key. This operation ismore complex. In fact, I admitted defeat earlier this year when I allowed rcicero::get_official() to return a list of data frames rather than a single, tidy table. Forgiveable at the time, but now I know better.

For a quick demonstration, let’s get our list of data frames:

Now we have a list of data frames that share one key column: “A”. I needed some programmatic way to join each data frame to the next,and while cycling through abstractions, I recalled the reduce function from Python, and I was ready to bet my life R had something similar. And we do:for basers, there’s Reduce(), but for civilized, tidyverse folk there’s purrr::reduce().

Here’s how to create and merge df_list together with base R and Reduce():

Hideous, right?! Behold the glory of the tidyverse:

There’s just no comparison. I need to go back and implement this little trick in rcicero pronto.


By way of conclusion, here’s an example from my maxprepsr package that I’ve since learned violates CBS Sports’ Terms of Use. Don’t do this, but here’s the idea:

That is quite a bit of power with just a dash of tidyverse piping.

This article is also available in Spanish.

Merging—also known as joining—two datasets by one or more common ID variables (keys) is a common task for any data scientist. If you get the merge wrong you can create some serious damage to your downstream analysis so you’d better make sure you’re doing the right thing! In order to do so, I’ll walk you through three different approaches to joining tables in R: the {base} way, the {dplyr} way and the SQL way (yes, you can use SQL in R).

Types of Merges

First of, though, let’s review the different ways you can merge datasets. Borrowing from the SQL terminology I will cover these four types:

  • Left join
  • Right join
  • Inner join
  • Full join

Left Join

In a left join involving datasets L and R the final table—let’s call it LR—will contain all records from dataset L but only those records from dataset R whose key (ID) is contained in L.

Right Join

A right join is just like a left join but the other way around: the final table contains all rows from R and only those from L with a matching key. Note that you can re-write any right join of L with R as a left join of R with L.

Inner Join

In an inner join only those records from L and R who have a matching key in the other dataset are contained in the final table.

Full Join

By using a full join the resulting dataset contains all rows from L and all rows from R regardless of whether or not there’s a matching key.

The {base} Way

Enough of the theory, let’s explore how to actually perform a merge in R. First of, the {base} way. In {base} R you use a single function to perform all merge types covered above. Conveniently, it is called merge().

To illustrate the concepts I will use two fictitious datasets of a clinical trial. One table contains demographic information and the other one adverse events recorded throughout the course of the trial. Note that patient P2 has a record in demographics but not in adverse_events and that P4 is contained in adverse_events but not in demographics.

By default, merge() will perform an inner join: only those patients that appear in both the demographics and adverse_events datasets are included in the final table.

To perform a left join, set the all.x parameter to TRUE. For a right join do the same with the all.y parameter.

Finally, a full join can be performed by either setting both all.x and all.y to TRUE or specifying all = TRUE.

In the two example datasets I created, the common key is conveniently called id in both tables. However, this doesn’t necessarily have to be the case. If the two datasets you’d like to merge have different names for their common ID variables you can specify them individually using the by.x and by.y parameters of merge().


The {dplyr} Way

Unlike {base} R—which uses a single function to perform the different merge types—{dplyr} provides one function for each type of join. And fortunately they are named just as you’d expect: left_join(), right_join(), inner_join() and full_join(). Personally I’m a big fan of this interface and thus tend to use {dplyr} for joining datasets much more often than {base}.

Left Join Merge In R

In case the ID variable names of the two tables do not match you need to pass a named vector as argument to by. The name and value corresponds to the key in the first and second table, respectively.

The SQL Way

When it comes to merging tables there’s no way one cannot mention the structured query language (SQL). There are several R packages available from CRAN to directly send SQL queries from R to a database. The {tidyquery} package does something different, though. It takes the SQL query you provide the query() function as input, translates it to {dplyr} code and then executes this {dplyr} code to produce the final result.

For simple queries—like joining tables—this is probably overkill given {dplyr}’s interface is so similar to SQL. However, if you are a SQL wizard and write more complex queries, {tidyquery} can be a great way to become proficient in {dplyr} as it can actually show you the translated {dplyr} code.

Join Tables In R

By the way, there’s also the {dbplyr} package which translates your {dplyr} code into SQL. That way you don’t actually need to learn SQL in order to query a database.

In this articles we’ve covered the four most common ways of joining tables and how to implement them in R using {base}, {dpyr} and SQL via {tidyquery}. Armed with this knowledge you should be able to confidently merge any datasets you come across in R. If you do get stuck feel free to ask a question in the comments below.