Abstract: As R developers, our first instinct may be to approach databases the same way we do regular files. We start by reading all of the data into memory and then proceed to data exploration. But what if there is a better way.
The dplyr R package simplifies data transformation. It provides a consistent set of functions, called verbs, that can be used in succession and interchangeably to gain understanding of the data iteratively. Another nice thing about dplyr is that it can interact with databases directly. It accomplishes this by translating the dplyr verbs into SQL queries. This incredibly convenient feature allows us to ‘speak’ directly with the database from R:
- Run data exploration over all of the data - Instead of coming up with a plan to decide what data to import, we can focus on analyzing the data inside the database, which in turn should yield faster insights.
- Use the SQL Engine to run the data transformations - We are, in effect, pushing the computation to the database because dplyr is sending SQL queries to the database.
- Collect a targeted dataset - After become familiar with the data and choosing the data points that will either be shared or modeled, a final query can then be used to bring back only that data into memory in R.
- All your code is in R! - Because we are using dplyr to communicate with the database, there is no need to change language, or tools, to perform the data exploration.
The talk will expand on an article I published in the RStudio RViews blog: https://rviews.rstudio.com/2017/05/17/databases-using-r/
Bio: Edgar has a background in deploying enterprise reporting and Business Intelligence solutions. He has posted multiple articles and blog posts sharing analytics insights and server infrastructure for Data Science.
Solutions Engineer at RStudio