Structured data is the gold standard to work with. With a properly set up database, data is easily retrieved and exported for analysis. R and other statistical programming languages can pull directly from these sources. Back in October 2017, RStudio added a “Connections” tab to natively connect R to outside data sources. It’s incredibly easy to set up and use.
A large portion of analytics work is ad hoc. Upper management sees an overall trend in the monthly numbers and asks you to dive in. You grab some data, slice and dice, and produce a nice visualization to go along with your write-up. The COO reads your explanation, asks you a few questions about the results, then moves on. Most analysis works this way, and a static spreadsheet works just fine.
On occasion though, a larger question is asked. What is the current forecast of widgets sold? The trend the last few weeks has been down year over year. Beyond this month, what does that mean for the rest of the year and our busy season? A static spreadsheet will not answer the bell when the COO asks you a follow-up question the next week. Your data source must be dynamic and paired with your analytical tool.
The first step of analytical problem solving is to identify a data source and import it for investigation. R accommodates a wide range of sources, from the straightforward (think Excel or a CSV) to the complicated (like SAS). My favorite is to hit a database directly, and it is the clear winner when the data needs to be dynamic.
Loading data from SQL into R has typically been a clunky process. You would load RODBC or a similar package and pass SQL code through a function. Previously, I routinely passed SQL statements into R to pull data from the CRM like so:
library(RODBC)  # legacy ODBC interface
db <- odbcConnect('ODBC', uid = 'user', pwd = 'pass')
df <- sqlQuery(db, query = 'select * from table', as.is = TRUE)
It’s easy enough, but without the aid of a tool like SSMS, it’s very easy to mess up the syntax or use an improper field name. In the code above, I ran a simple “select *” query. Typically I pass through much more complex queries that reference multiple tables and fields. It gets tricky when the query lives inside R and you need to make changes. To combat such issues, RStudio’s “Connections” tab makes the process seamless.
Advantages of the new tab:
- A view into the tables and fields in your SQL database. Gone are the days of trying to remember what tables and fields are called (is it fieldid or field_id???).
- Pull only the data you absolutely need. Data exploration is a key first step of analysis, and it is easily accomplished in the new pane.
- Filter and join data with R syntax instead of awkwardly passing SQL through a function. Rather than writing SQL inside R, you can use dplyr to manipulate the data.
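As a sketch of the setup, connections made this way go through DBI-compatible drivers such as the odbc package; the DSN name, user, and password below are placeholders for your own credentials:

```r
library(DBI)

# Open a DBI connection through the odbc package. RStudio's Connections
# pane will list the tables and fields available on this connection.
# 'my_dsn', 'user', and 'pass' are placeholder credentials.
con <- dbConnect(odbc::odbc(), dsn = 'my_dsn', uid = 'user', pwd = 'pass')
```

The same `con` object is then handed to dplyr for querying, so the connection is opened once and reused.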
Here I am using a simple bit of R code to aggregate the number of cities present in my data by state (con is the database connection):
city <- tbl(con, in_schema('dbo', 'Cities'))  # lazy reference to the dbo.Cities table
a <- city %>%
  group_by(state) %>%   # assuming the table has a 'state' column
  summarise(no_cities = n())
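One wrinkle worth noting: dplyr builds the SQL lazily, so nothing is pulled from the database until you ask for it. A minimal sketch, assuming the `a` object from above:

```r
# show_query() prints the SQL that dplyr generated behind the scenes,
# which is handy for sanity-checking what will run on the server
a %>% show_query()

# collect() executes the query and brings only the aggregated rows into R
result <- a %>% collect()
```

This is what makes the approach efficient: the grouping and counting happen on the database server, and only the small summary table travels over the wire.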
The Connections tab makes it so much easier to connect to your database and run “live” analysis. I love using both R and SQL, but I strongly dislike writing SQL inside R. This enhancement eliminates those concerns. RStudio was already the best IDE on the market, but this integration continues to raise the bar.