To announce an event, contribute a post, or provide feedback or suggestions about this blog, please contact us at r.user.group.sl@gmail.com.
Tuesday, 3 January 2017
Let’s make life easier with GitHub.
Git is a version control system, a tool that tracks changes to our code and shares those changes with others. Git is most useful when combined with GitHub, a website that allows us to share our code with the world. Together, Git and GitHub are the most popular version control setup for developers of R packages.
Git and GitHub are generally useful for all software development and data analysis, not just R packages. I’ve included them here because they are so useful when you’re making a package.
· It makes sharing your package easy. Any R user can install your package with just two lines of code.
· Readers can easily browse code and read documentation (via Markdown). They can report bugs, suggest new features with GitHub issues, and propose improvements to your code with pull requests.
· With Git, you and a collaborator can work on the same file at the same time. Git will either combine your changes automatically, or it will show you all the ambiguities and conflicts.
· It’s very easy to accidentally introduce a mistake that takes a few minutes to track down. Git makes this problem easy to spot because it allows you to see exactly what’s changed and undo any mistakes.
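As a hedged illustration of the “two lines of code” point in the first bullet, the sketch below wraps the usual devtools route in a small helper. The helper name install_from_github and the repository string username/packagename are placeholders invented for this example, and the call needs network access, so the function is only defined here and not run.

```r
# Hypothetical helper: install a package straight from a GitHub repository.
# Assumes the devtools package; "username/packagename" below is a placeholder.
install_from_github <- function(repo) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")    # line 1: fetch devtools from CRAN if needed
  }
  devtools::install_github(repo)    # line 2: build and install from GitHub
}

# Usage (not run, requires network access):
# install_from_github("username/packagename")
```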
RStudio makes day-to-day use of Git simpler. Once you’ve set up a project to use Git, you’ll see a new pane and toolbar icon. These provide shortcuts to the most commonly used Git commands. However, because only a handful of the 150+ Git commands are available in RStudio, you also need to be familiar with using Git from the shell (aka the command line or the console). It’s also useful to be familiar with using Git in a shell because if you get stuck you’ll need to search for a solution with the Git command names.
A few years ago I was the CTO and co-founder of a startup in the medical practice management software space. One of the problems we were trying to solve was how medical office visit schedules can optimize everyone’s time. Too often, office visits are scheduled to optimize the physician’s time, and patients have to wait way too long in overcrowded waiting rooms in the company of people coughing contagious diseases out their lungs.
One of my co-founders, a hospital medical director, had a multivariate linear model that could predict the required length for an office visit based on the reason for the visit, whether the patient needs a translator, the average historical visit lengths of both doctor and patient, and other possibly relevant factors. One of the subsystems I needed to build was a monthly regression task to update all of the coefficients in the model based on historical data.
After exploring many options, I chose to implement this piece in R, taking advantage of the wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.
One of the attractions for me was the R scripting language, which makes it easy to save and rerun analyses on updated data sets; another attraction was the ability to integrate R and C++. A key benefit for this project was the fact that R, unlike Excel and other GUI analysis programs, is completely auditable.
Alas, that startup ran out of money not long after I implemented a proof-of-concept Web application, at least partially because our first hospital customer had to declare Chapter 7 bankruptcy. Nevertheless, I continue to favor R for statistical analysis and data science.
Start by installing R and RStudio on your desktop. Both are free. RStudio is optional, but I like it, and you probably will, too. There are a half-dozen other R IDEs and a dozen editors with some R support, but don’t go crazy trying them all.
Try running R from a command shell (Figure 1), the R Console (Figure 2), and RStudio (Figure 3). Familiarize yourself with some of the R tutorials and demos.
The power of R is illustrated by the deceptively simple calls in Figure 3 to do statistical analysis. For example, the linear-model call in that figure says “find the best fit coefficients, fitted values, and residuals for a linear model where y varies with x for the supplied data and weight vectors. Save them in object fm1 and then summarize the results.” Earlier in this session we had defined the following:
w <- 1 + sqrt(x) / 2
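Putting those pieces together, a session along these lines might look like the sketch below. Only w and the name fm1 come from the text; the values of x, the data frame name dummy, the simulated y, the 1/w^2 weighting, and the seed are assumptions modeled on the sample session in An Introduction to R, not recovered from Figure 3.

```r
set.seed(42)                   # assumed seed, only so the sketch is repeatable

x <- 1:20                      # assumed predictor values
w <- 1 + sqrt(x) / 2           # the weight vector defined in the text
dummy <- data.frame(x = x, y = x + rnorm(x) * w)   # assumed noisy sample data

# Weighted least squares: observations with larger w count for less.
fm1 <- lm(y ~ x, data = dummy, weights = 1 / w^2)
summary(fm1)

coef(fm1)                      # best-fit intercept and slope
head(fitted(fm1))              # fitted values
head(residuals(fm1))           # residuals
```

The formula y ~ x reads as “y varies with x”, and summary() prints the coefficient table along with the residual summary.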
Reading this code is straightforward. Writing it takes a little study. But it isn’t hard and there’s lots of free help available, not to mention dozens of books.
In addition to the R help available on the Web and from the Help menu items in the R Console and RStudio, you can get help from the R command line. For example:
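A few of the built-in help commands, as a minimal sketch; the topics chosen here (lm, read.csv, and names starting with read.) are only illustrations, and what is displayed depends on the documentation installed locally.

```r
help("lm")                     # show the manual page for lm()
?read.csv                      # shorthand for help("read.csv")

# List the names of visible objects that start with "read."
hits <- apropos("^read\\.")
print(hits)
```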
To get data into R, either use its sample data, listed by the data() function, or load it from a file:
mydata <- read.csv("filename.txt")
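Both routes can be tried without any file of your own. The sketch below loads the bundled mtcars sample data, writes it to a temporary CSV file (the temporary path is an assumption standing in for a real file name), and reads it back with the same read.csv() call.

```r
data(mtcars)                   # load a sample data set that ships with R
str(mtcars)                    # inspect its columns and types

# Write the sample data to a temporary CSV file, standing in for your own file.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

mydata <- read.csv(csv_path)   # same call as in the text, different file name
head(mydata)
```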
R is extremely extensible. The library() and require() functions load and attach add-on packages; require() is designed for use inside other functions. Many add-on packages and the R distributions live in CRAN, the worldwide Comprehensive R Archive Network. The other two common R archives are Omegahat and Bioconductor. Additional packages live in R-Forge.
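The difference between the two functions shows up when a package is missing: library() stops with an error, while require() returns FALSE with a warning, which is what makes it convenient inside a function. A minimal sketch, in which the helper name describe_if_available and the package name noSuchPackage123 are made up for illustration:

```r
library(stats)   # attaches the package; stops with an error if it is missing

# Hypothetical helper: report whether a package could be attached.
describe_if_available <- function(pkg) {
  ok <- suppressWarnings(
    require(pkg, character.only = TRUE, quietly = TRUE)
  )
  if (ok) paste(pkg, "is attached") else paste(pkg, "is not installed")
}

describe_if_available("stats")
describe_if_available("noSuchPackage123")   # made-up name, assumed not to exist
```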
The R installation copies the base packages and the recommended packages from CRAN into a local library directory, which on a Mac is currently at /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. Running the R library() command without any arguments will list the local packages and the library location. RStudio will also generate the correct library() command to load a listed package when you check its check box in the Packages tab. The command help(package = packageName) will display the functions in the specified package.
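The same information is available programmatically. The sketch below uses .libPaths() and installed.packages() rather than a bare library() call so that the results can be stored in variables; the choice of the stats package for the help index is just an example.

```r
.libPaths()                              # library directories R searches

pkgs <- rownames(installed.packages())   # names of locally installed packages
head(pkgs)
"stats" %in% pkgs                        # base packages are always present

help(package = "stats")                  # index of the functions in one package
```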
There are R packages and functions to load data from any reasonable source, not only CSV files. Beyond the obvious case of delimiters other than commas, which are handled using the read.table() function, you can copy and paste data tables, read Excel files, connect Excel to R, bring in SAS and SPSS data, and access databases, Salesforce, and RESTful interfaces. See, for example, the foreign package.
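For the delimiter case, a short sketch with read.table() and a tab separator. The file contents and column names below are made up for illustration and written to a temporary file so the example is self-contained.

```r
# Create a small tab-delimited file (made-up data, for illustration only).
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("id\tvisit_minutes\ttranslator",
             "1\t15\tno",
             "2\t30\tyes",
             "3\t20\tno"), tsv_path)

# header = TRUE takes column names from the first line; sep sets the delimiter.
visits <- read.table(tsv_path, header = TRUE, sep = "\t")
str(visits)
```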
You don’t really need to learn the syntax for standard data imports, as the RStudio Tools | Import Dataset menu item will help you generate the correct commands interactively by looking at the data from a text file or URL and setting the correct conversion options in drop-down lists based on what you see.
You can see a list of the currently available packages by name on CRAN; this list is much more extensive than the list of recommended packages downloaded to your desktop by default. To install a package from one of the default archives, use the install.packages function:
install.packages("ggplot2")
Note that ggplot2 is a popular advanced graphics package that has more options than the standard graphics package. Nevertheless, graphics can do a lot. In addition to the graphics in Figures 2 and 3, consider Figures 4 and 5.
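To give a feel for the standard graphics package, here is a small sketch (not one of the figures) that draws a scatter plot of the bundled mtcars data with a fitted line. Writing to a temporary PDF file is an assumption made so the example also runs without a display; in an interactive session you can drop the pdf() and dev.off() lines.

```r
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)                           # draw into a file instead of a window

plot(mpg ~ wt, data = mtcars,
     main = "Fuel economy vs. weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "red")   # overlay a least-squares line

invisible(dev.off())                     # close the device and finish the file
```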
R can do much more in terms of graphics and statistical analysis. Do read Sharon Machlis’s tutorial and follow up with her links to additional information. At this point, I want to expand my discussion to how you can analyze big data in R.
R in the cloud
When R programmers talk about “big data,” they don’t necessarily mean data that goes through Hadoop. They generally use “big” to mean data that can’t be analyzed in memory.
The fact is you can easily get 16GB of RAM in a desktop or laptop computer. R running in 16GB of RAM can analyze millions of rows of data with no problem. Times have changed quite a bit since the days when a database table with a million rows was considered big.
One of the first steps many developers take when their program needs more RAM is to run it on a bigger machine. You can run R on a server; a common 4U Intel server can hold up to 2TB of RAM. Of course, hogging an entire 2TB server for one personal R instance might be a bit wasteful. So people run large cloud instances for as long as they need them, run VMs on their server hardware, or run the likes of RStudio Server on their server hardware.
RStudio Server comes in Free and Pro editions. Both have the same features for individual analysts, but the Pro version offers more in the way of scale: authorization and security, management visibility, performance tuning, support, and a commercial license. According to Roger Oberg of RStudio, the company’s intent is not to create paid-only features for individuals.
RStudio Server Pro is integrated with several big data systems. For example, when I was reviewing the IBM Bluemix PaaS, I noticed that R and RStudio are part of IBM's DashDB service (Figure 6). In fact, this is an installation of RStudio Server Pro on Bluemix and SoftLayer, according to Oberg and Tareef Kawaf of RStudio.
There is an additional strategy for running R against big data: Bring down only the data that you need to analyze. In the spirit of MapReduce, Hadoop, Spark, and Storm, you want to winnow the data as you stream it to make in-memory analysis tractable on the reduced data set. To use Kawaf’s example, you may have 100TB of data but need “only” 5 columns and 20 million rows, a mere few hundred megabytes of reduced data.
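The size of that reduced data set can be checked with quick arithmetic. The sketch below assumes every column is a plain numeric vector, stored at 4 bytes per integer or 8 bytes per double, and ignores per-object overhead; the actual footprint depends on the column types.

```r
rows <- 20e6                                    # 20 million rows
cols <- 5                                       # 5 columns
bytes_per_value <- c(integer = 4, double = 8)   # R's storage size per element

size_mb <- rows * cols * bytes_per_value / 1e6  # total size in megabytes
size_mb
# integer: 400 MB, double: 800 MB
```

Either way the reduced data fits comfortably in the 16GB of RAM mentioned earlier.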
You may also want to perform some of the analysis in the database instead of in the app. IBM has done a good job of providing an example, along with the R source code. Consider the analysis shown in Figure 7.
Streaming the data out of the database and into R can take a significant amount of time. If you eliminate most of the network streaming, you can vastly reduce the time needed for the analysis. You’ll notice that the timing for the in-database regression analysis is 2.7 seconds. The same task with the regression done in-application took 1.47 minutes -- more than 30 times longer. The regression coefficients computed were exactly the same. All that changed was that one analysis did the regression where the data resided, and the other first streamed the data from the database to the R application.
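To see why the coefficients can match while so little data moves, consider a simple one-predictor regression: it can be computed from a handful of sums that a database could produce in place. The sketch below is only an illustration of that idea on simulated data, not IBM's implementation; the data, the seed, and the variable names are all assumptions.

```r
set.seed(1)                               # assumed seed for repeatability
x <- runif(1000)                          # simulated predictor
y <- 2 + 3 * x + rnorm(1000, sd = 0.1)    # simulated response

# What the data store would send back: five numbers instead of 1000 rows.
n   <- length(x)
sx  <- sum(x)
sy  <- sum(y)
sxx <- sum(x^2)
sxy <- sum(x * y)

# Closed-form least-squares estimates from those sums.
slope     <- (n * sxy - sx * sy) / (n * sxx - sx^2)
intercept <- (sy - slope * sx) / n

# The same fit after bringing every row into lm().
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```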
The IBM implementation is not unique; I happened to have a Bluemix account. Vertica (HP), Greenplum (Pivotal), Oracle, and Teradata all have R packages. I’m not sure how far the others have gone in the direction of in-database analytics, however.
By the way, I was pleasantly surprised to find that running RStudio Server Pro in a browser feels exactly like running RStudio on my desktop -- nicely done.
Shiny and R Markdown
Of course, developers and analysts never really get away with simply writing the code and determining the results. Top management wants monthly reports, and middle management wants to play with the data without knowing anything about what’s under the covers. Enter shiny and rmarkdown, two R packages from RStudio for Web applications and reporting, respectively.
Figure 8 shows a simple Shiny app running in RStudio. The code is from Lesson 2 of the Shiny tutorial.
You can use Shiny to build interactive and “reactive” Web apps, with widgets that correspond to HTML control elements such as input fields. By “reactive,” RStudio means that when a value changes, all values with dependencies on the changed value are recalculated, as you’d expect from a spreadsheet program. Figure 9 shows an interactive Shiny app with two widgets for input and a shaded choropleth map of U.S. census data for output.
The interactive Shiny app in Figure 9 is a good example of how you can allow middle management to play with the data without their having to know what’s under the covers.
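For a sense of how little code such an app needs, here is a minimal sketch (not the app from Figure 9), assuming the shiny package is installed. The input ID n, the output ID hist, and the histogram itself are choices made for this example. Moving the slider changes input$n, and the plot that depends on it is redrawn automatically.

```r
library(shiny)

# User interface: one input widget and one output slot.
ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server logic: the plot depends on input$n, so it is re-rendered
# whenever the slider value changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste(input$n, "random draws"))
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch the app only in an interactive session.
if (interactive()) runApp(app)
```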
To limit what is recomputed when input changes, the reactive wrapper function caches its values and recomputes only those that are invalid. I’ll forgo burdening you with an example, although you’ll find one in Shiny Lesson 6. Shiny apps can run on your own hardware, or you can publish them to the shinyapps.io server. For a quick example, have a look at Figure 10.
Shiny apps should satisfy the needs of middle managers. Now what about top management?
If you are a GitHub user or have simply been paying attention to the Web and developer landscapes over the last 10 years, you’ll know about the Markdown language for generating formatted documents in HTML and other tag-based markup languages. RStudio includes a Markdown implementation and extends it to include embedded R code chunks and both LaTeX and MathML in the R Markdown package. You can also create interactive R Markdown documents using Shiny and publish them to your own Shiny server or to shinyapps.io. For an example, see Figure 11.