Friday, April 28, 2017

A Simple SQL Server to Practice SQL in R

A Simple SQL Server to Practice SQL in R

I was originally introduced to SQL while programming in SAS. The PROC SQL procedure makes it easy to begin learning SQL because there is no need to have a separate SQL Server to hold the data. However, spending much more time in R, I wanted to continue practicing SQL there. I also wanted to practice learning SQL in an environment that would better emulate how I would be using it outside of a textbook environment.

There are many ODBC Databases that implement the SQL language standard such as MySQL, SQL Lite, MS SQL, etc. Initially I had trouble setting one up at home to practice with because I didn’t want to deal with the details of opening the proper ports and setting the appropriate security settings. I just wanted to get coding and munging. So, I finally settled on what I found to be an easy setup.

I use Microsoft’s SQL Server 2016 Express LocalDB. It is a free SQL server that runs on the same computer that you use for all your coding. There is no need to deal with ports, server accounts, or even separate database management software. A separate database management program is installed along with SQL Express, but I really haven’t needed to use it yet.

So, here is a quick tutorial on how to set up the SQL Express LocalDB. This is the process I went through on Windows 8 to get it set up.


Installing the Server

Go Microsoft’s SQL Express Page and download SQL Server 2016 Express LocalDB.


Run the file that is downloaded and install the Basic installation type.


After the installation process finishes, some basic information about the server will be shown.


The primary thing that we will need is the Connection String. You can copy the entire line by clicking on the copy text icon on the right side of the line. I recommend saving it in a separate text-file for easy reference. Now close the installation and restart your computer. Once restarted, the SQL server will be automatically running in the background, ready to connect to from R.


Connecting to the Server

Now that your computer has restarted, start R and load the RODBC package. Let’s also create a dataset that we will be using to get started with our SQL database.

library(RODBC)
round(runif(26, 0, 100))
##  [1] 20  7 19 28 74 83 67 35 35 56 53 13 15 48  1 83 31 32 55 42 23 33 72
## [24] 37 33 67
data = as.data.frame(list(letters, round(runif(26, 0,100))))
names(data) = c("letter", "freq")

To use the SQL database, we must create a connection handle, called a channel. There are multiple ways to create this channel. One common way is to use the function odbcConnect(), however, since we have a specific connection string, I found it easier to use the function odbcDriverConnect() to create the channel. Initially I had some trouble until I discovered that 2 things needed to be added:

  1. The driver type need to be added to the connection string we copied earlier at the end of the server installation. Manually copy into the beginning of the connection string you saved this: “Driver={SQL Server};”

  2. Because a single backslash, \ ,is an escape character, to use it in the connection string you need to change any instances of it to a double backslash, \\.

mychannel = odbcDriverConnect("Driver={SQL Server};Server=localhost\\SQLEXPRESS;Database=master;Trusted_Connection=True")

The lack of a username and password to access the SQL Server may be disconcerning to those that have seen them used before. However, this does not mean there is a lack of security. The syntax Trusted_Connection=True specifies that we are using Windows Authentication. Here is a short description of this from Microsoft.

When a user connects through a Windows user account, SQL Server validates the account name and password using the Windows principal token in the operating system. This means that the user identity is confirmed by Windows. SQL Server does not ask for the password, and does not perform the identity validation. Windows Authentication is the default authentication mode, and is much more secure than SQL Server Authentication.

Lets try testing it out now. The RODBC package has many functions that will allow you to conveniently interact with the database. Because the point of this is to practice writing SQL queries, I decided to stick to just two of RODBC’s functions.

  • sqlSave() to write an R dataframe to the SQL server
  • sqlQuery() to issue queries to the SQL server

To start out, lets get the basics down. We’ll save the dataframe we made and redownload it to another R variable.

sqlSave(mychannel, data, tablename = "LetterData")
fromsqldb = sqlQuery(mychannel, "select * from LetterData")
head(fromsqldb)
##   rownames letter freq
## 1        1      a   61
## 2        2      b    4
## 3        3      c   62
## 4        4      d   60
## 5        5      e   46
## 6        6      f   63

Notice that the R rownames are stored as a column in the SQL table. This can be useful if the rownames can be used as keys in the SQL table. Since the rownames are not being used as keys in the R dataframe, lets instead exclude the R rownames. We will be doing this using the sqlSave() option rownames = FALSE. If we try doing this directly, we will get an error because a table named LetterData has already been written to the SQL Server.

sqlSave(mychannel, data, tablename = "LetterData", rownames = FALSE)

Instead, we can drop the table first and rewrite it to the database without the R rownames.

sqlQuery(mychannel, "drop table LetterData")
## character(0)
sqlSave(mychannel, data, tablename = "LetterData", rownames = FALSE)

Note that sqlQuery() returned an empty character vector. This could be useful in larger programs where we may not be monitoring the output a few lines at a time.

results = sqlQuery(mychannel, "drop table NoTableNamedThis")
if(length(results)!=0) {print("Table Drop Failed!")}
## [1] "Table Drop Failed!"

Say we had some new frequencies for each letter. We can update a table using the sqlSave() option append = TRUE.

data[,2] = round(runif(26, 0, 100))
sqlSave(mychannel, data, tablename = "LetterData", rownames = FALSE, append = TRUE)

As the last example, note that the familiar multi-line queries can be easily written and submitted using the common continuation of R code to the next line.

fromsqldb2 = sqlQuery(mychannel, "
                      select *
                      from LetterData
                      where freq > 90")
fromsqldb2
##   letter freq
## 1      i   92
## 2      s   92
## 3      i  100
## 4      s   95
## 5      u   92
## 6      y   98
## 7      z   93

When we’re done using the channel to the SQL Server, we should close the connection so that computer resources are not being used to keep it open.

sqlQuery(mychannel, "drop table LetterData")
## character(0)
close(mychannel)

It’s a pretty basic setup, but one that should be accessible to most of you. I highly recommend exploring the RODBC documentation to get an idea of some of the details that may show up when using more advanced queries.

A Simple SQL Server to Practice SQL in R

A Simple SQL Server to Practice SQL in R A Simple SQL Server to Practice SQL in R ...