By the power of R, statistical computing at your fingertips
5 December, 2007 – 1:07 pmI’ve explored in previous posts the use of tools such as onboard Analytics (LoadRunner), off-the-shelf tools (Excel) and custom web based implementations (JGraph, ChartDirector) used to analyze the nitty gritty of performance metrics.
All of these tool’s use are governed by some common factors being:
Enter stage right, the R-Project.
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Using R escapes the proprietary factor although some employers may object to you using GPL in their environment. It is a tad weak on the share-ability factor but is very strong on expediency and the subsequent timeliness of your analysis. Oh and the pimp factor, well, that just depends on your imagination. For this demo I’m sticking with corporate grey on a white background.
So how do you use it?
Essentially the reason I’m finding R so powerful, is its ability to quickly analyze and present meaningful statistical information from a raw data source. In my current implementation, I’m using a sequence of simple MySQL import scripts to import raw data (e.g. iostat, vmstat etc) in a database table. For my demo, the table name is based on the server I am sampling.
Then using a simple R script, that is executed from the R console, I loop through that table (don’t even need to bother with what the column names are, as it automatically enumerates those) and present the data as a series of line charts with a lowess curve fitted.
The script itself is as simple as this.
library(RMySQL)
MySQL
m <- dbDriver("MySQL")
con <- dbConnect(m, group="DatabaseName")
perfstat <- function(metric, table, dtgFrom, dtgTo) {
print(metric)
query<-paste("SELECT ", metric, " result FROM ", table, "
WHERE dtg BETWEEN ", dtgFrom, " AND ", dtgTo, sep="")
rs<-dbSendQuery(con, query)
df<-fetch(rs,n=-1)
quartz()
plot(df$result, yaxs ="i", type="l", col="grey", xlab="Elapsed Time",
ylab=metric, main=paste("Line chart of ",metric), lwd=1)
lines(lowess(df$result,f=.05,iter=100), col="blue", lwd="0.25")
}
table<-"perfstats_ServerName"
dtgFrom<-"'2007-11-29 19:00:00'"
dtgTo <-"'2007-11-30 19:00:00'"
fields <- dbListFields(con, table)
for (i in fields) {
if (i == "dtg") print("skipping") else
perfstat(paste("`",i,"`",sep=""),table, dtgFrom, dtgTo) }
dbDisconnect(con)
In that example, I’m sampling performance stats from a remote server spanning a 24 hour period, from which data was sampled at a 5 second interval, and presenting that info as follows.
![]()
With a slight twist of the code, you can plot result distributions using my favourite representation (boxplot) to help show the distribution of your data. Code like this:
library(RMySQL)
MySQL
m <- dbDriver("MySQL")
con <- dbConnect(m, group="mysql")
perfstat <- function(metric, table, dtgFrom, dtgTo) {
print(metric)
query<-paste("SELECT ", metric, " result, device FROM ", table, "
WHERE dtg BETWEEN ", dtgFrom, " AND ", dtgTo, " ORDER BY device", sep="")
rs<-dbSendQuery(con, query)
df<-fetch(rs,n=-1)
quartz()
# set plot parameters
op <- par(las=2, lwd=0.5, omd=c(0,1,0,1), mar=c(5,15,4,2), cex=0.8)
boxplot(df$result ~ df$device, data = df, col="lightgray", horizontal=TRUE, lwd=0.5, range=0, notch=TRUE,
xlab=metric, main=paste("Boxplot of ",metric))
}
table<-"_iostat"
dtgFrom<-"'2007-11-13 18:00:00'"
dtgTo <-"'2007-11-14 18:00:00'"
fields <- dbListFields(con, table)
for (i in fields) {
if (i == "dtg") print("skipping") else
perfstat(paste("`",i,"`",sep=""),table, dtgFrom, dtgTo) }
dbDisconnect(con)
Will present very useful boxplots (grouped by device ID) something like this:

Which makes it very easy to spot performance bottlenecks from a 20,000′ level.
I’m a realtive noob with this program, so will be exploring better ways to script and automate with it in future. But for the time being, its ability to quickly analyze data in a meaningful manner makes it one of the sharper tools in your toolbox.
I’m currently using it from my Macbook Pro, but hope to integrate it on some production Windows servers in the near future (using its ability to output straight to png). This a great product, it’s free and relatively easy to use. You could use this tool to automate ongoing performance benchmarking aspects of your application development life cycle, or as a simple health check for your key systems.









4 Responses to “By the power of R, statistical computing at your fingertips”
this is cool
By tim on Dec 6, 2007
Cool, now I’ll have to take another look at R. I had checked it out briefly about a year ago, but it fell off the radar.
Since I’m using SiteScope to collect performance data, I could have it log into a MySQL DB… now you’ve got me thinking.
How do you get your data into MySQL?
By Charlie on Apr 18, 2008
Hi Charlie,
I would use Ruby|Perl to open up your existing SiteScope\log files, parse them by line, and then use a mysqlimport statement at the end of the script.
For example, this line in a SiteScope log:
00:03:06 02/23/2008 good IF.1 IF Server Ping 0.01 sec 2:2216 ok 10.0 100
Could be translated into a csv using ruby so that it reads something like:
02-23-2008 00:03:06,good,IF.1,IF Server,Ping,0.01,10.0,100Then at the end of your ruby script suck that data into MySql using a puts %x{mysql import command}
I’m doing something similar here
A mysql import command on windows looks like this:
mysqlimport -h host -P 3306 -s -u myimportuser --password=mypassword --fields-optionally-enclosed-by=""" --fields-terminated-by=, --lines-terminated-by="\r\n" --local perfstats D:\path\to\your\file_created_by_ruby.csv
I’ve left some current Ruby scripts that do something very similar at work, so will post them on Monday for you.
I’m tending to use SiteScope less and less though. I hate its lack of flexibility (in presenting data) and I prefer to tweak my own monitors launched via scheduled tasks or cron as appropriate, that way I can control exactly what counters I want, and how they are formatted. For example, its not easy (possible?) to setup a custom DTrace monitor for Solaris using SiteScope, and it’s sometimes easier (to get access) to just ask a remote sysadmin to run a set of native commands he’s used to rather than impose the clunky use of SiteScope -> ssh -> …
For example, on windows, for batch collection I just use typeperf as in
typeperf -cf D:\monitors\perfcounters.txt -si 55 -sc 5 -s \\%1 -oFor solaris or *ix, I just use vmstat, iostat and sar mostly via cron. Or other arrangements depending on the system and level of access granted …
For app servers I use JMX, called via a ruby script as per that previous link I showed you.
I import all that data usually into a MySql database, which you can still use SiteScope as a reporting tool against that data (if you wish). But like in this article, tools like R let you quickly analyze tons of data, without necessarily worrying about how you set the monitor up in the first place. In other words, I’m leaning towards (and I know some people will hate this approach :)), monitor *everything*, report on select few … way of doing things…
I’ve even mucked around with a SiteScope ‘faker’ which can re-present data from a MySql database when a LoadRunner controller is running, giving you the perception of live counters during a load test run
All SiteScope really does is present XML data via a http port 8888 in this case …
Hmm, too much coffee this morning
Cheers,
Tim Koopmans
By Tim on Apr 19, 2008
I’ve had a look at using this from Windows with a WAMP installation. There are a couple of extras you need to do. Firstly install the pre-compiled binaries into your \libraries path for RMySQL and DBI from here
Then copy the libmysqll.dll from D:\wamp\bin\mysql\mysql5.0.51a\bin into your \R\R-2.7.0\library\RMySQL\libs directory.
After that you should be able to run similar reports to those identified above. Substitute the quartz() function (for Mac OSX) to the windows() function for WinXP.
Also instead of using a C:\my.cnf you can also pass in connection properties with your dbConnect call:
con<-dbConnect(m, user="root", port="3306", host="myhost", dbname="mydb")By Tim on Apr 28, 2008