Digithead's Lab Notebook: 09/01/2010

How to send an HTTP PUT request from R

I wanted to get R talking to CouchDB. CouchDB is a NoSQL database that stores JSON documents and exposes a ReSTful API over HTTP. So, I needed to issue the basic HTTP requests: GET, POST, PUT, and DELETE from within R. Specifically, to get started, I wanted to add documents to the database using PUT.

There's a CRAN package called httpRequest, which I thought would do the trick. This wound up being a dead end. There's a better way. Skip to the RCurl section unless you want to snicker at my hapless flailing.

Stuff that's totally beside the point

As Edison once said, "Failures? Not at all. We've learned several thousand things that won't work."

The httpRequest package is very incomplete, which is fair enough for a package at version 0.0.8. It implements only basic GET, POST, and multipart POST. Both POST methods seem to expect name/value pairs in the body of the request, whereas accessing web services typically requires XML or JSON in the request body. And, if I'm interpreting the HTTP spec right, these methods mishandle termination of response bodies.

Given this shaky foundation to start with, I implemented my own PUT function. While I eventually got it working for my specific purpose, I don't recommend going that route. HTTP, especially 1.1, is a complex protocol and implementing it is tricky. As I said, I believe the httpRequest methods, which send HTTP/1.1 in their request headers, get it wrong.

Specifically, they read the HTTP response with a loop like one of the following:

repeat{
  ss <- read.socket(fp,loop=FALSE)
  output <- paste(output,ss,sep="")
  if(regexpr("\r\n0\r\n\r\n",ss)>-1) break()
  if (ss == "") break()
}

repeat{
 ss <- rawToChar(readBin(scon, "raw", 2048))
 output <- paste(output,ss,sep="")
 if(regexpr("\r\n0\r\n\r\n",ss)>-1) break()
 if(ss == "") break()
 #if(proc.time()[3] > start+timeout) break()
}

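Notice that they're counting on a blank line, a zero followed by a blank line, or the server closing the connection to signal the end of the response body. The zero followed by a blank line looks like the terminating chunk of chunked transfer coding (RFC 2616, section 3.6.1), which only applies when the response is sent with Transfer-Encoding: chunked, and nothing guarantees that the marker won't be split across two reads. Looking through RFC 2616 we find this description of an HTTP message: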

generic-message = start-line
                  *(message-header CRLF)
                  CRLF
                  [ message-body ]

While the headers section ends with a blank line, the message body is not required to end in anything in particular. The part of the spec that covers message length (section 4.4) lists 5 ways that a message may be terminated, 4 of which are not "server closes connection". None of them is "a blank line". HTTP 1.1 was specifically designed this way so web browsers could download a page and all its images using the same open connection.
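To make those five cases concrete, here is a rough sketch in R of how a client might decide where a response body ends, following the order given in RFC 2616 section 4.4. The function body.framing and its arguments are invented for this illustration and are not part of httpRequest or RCurl; headers is assumed to be a named character vector of already-parsed response headers.

```r
# Hypothetical helper (not part of httpRequest or RCurl): decide how the
# end of an HTTP/1.1 response body is signalled, per RFC 2616 section 4.4.
body.framing <- function(status.code, headers, request.method="GET") {
  # header names are case-insensitive, so normalize before looking them up
  names(headers) <- tolower(names(headers))

  # 1. some responses never carry a body, whatever the headers say
  if (request.method == "HEAD" || status.code %in% c(204, 304) ||
      (status.code >= 100 && status.code < 200))
    return("no-body")

  # 2. a Transfer-Encoding other than "identity" means chunked framing:
  #    the body ends with a zero-size chunk, i.e. "0\r\n" plus a blank line
  te <- headers["transfer-encoding"]
  if (!is.na(te) && tolower(te) != "identity")
    return("chunked")

  # 3. otherwise Content-Length gives the body size in bytes
  if (!is.na(headers["content-length"]))
    return("content-length")

  # 4. multipart/byteranges bodies delimit themselves
  ct <- headers["content-type"]
  if (!is.na(ct) && grepl("^multipart/byteranges", tolower(ct)))
    return("multipart-byteranges")

  # 5. last resort: read until the server closes the connection
  "connection-close"
}
```

Only the last case corresponds to "read until the connection closes". A response with status 200 and a Content-Length header, for example, falls under case 3, and the connection may well stay open after the body has been sent.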

For my PUT implementation, I fell back to HTTP 1.0, where I could at least count on the connection closing at the end of the response. Even then, socket operations in R are confusing, at least for a clueless newbie such as myself.

One set of socket operations consists of: make.socket, read.socket/write.socket and close.socket. Of these functions, the R Data Import/Export guide states, "For new projects it is suggested that socket connections are used instead."

OK, socket connections, then. Now we're looking at: socketConnection, readLines, and writeLines. Actually, tons of IO methods in R can accept connections: readBin/writeBin, readChar/writeChar, cat, scan and the read.table methods among others.

At one point, I was trying to use the Content-Length header to properly determine the length of the response body. I would read the header lines using readLines, parse those to find Content-Length, then try reading the response body with readChar. By the name, I got the impression that readChar was like readLines but one character at a time. According to some helpful tips I got on the r-help mailing list, this is not the case. Apparently, readChar is for binary-mode connections, which seems odd to me. I didn't chase this down any further, so I still don't know how you would properly use Content-Length with the R socket functions.
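One possible approach, offered as a sketch under assumptions rather than something tested against a live server: treat the connection as binary and read bytes with readBin, since Content-Length is a byte count, not a character count. The helper read.response below is hypothetical, and the server's reply is simulated with an in-memory rawConnection so the example runs without a network.

```r
# Hypothetical helper: read one HTTP response from a binary-mode connection,
# using Content-Length (a byte count) to know where the body ends.
read.response <- function(con) {
  # read the header block one byte at a time until the blank line (CRLF CRLF);
  # slow but simple, and it never reads past the start of the body
  end.marker <- charToRaw("\r\n\r\n")
  bytes <- raw(0)
  repeat {
    b <- readBin(con, "raw", n=1)
    if (length(b) == 0) break            # connection closed early
    bytes <- c(bytes, b)
    n <- length(bytes)
    if (n >= 4 && identical(bytes[(n-3):n], end.marker)) break
  }
  lines <- strsplit(rawToChar(bytes), "\r\n", fixed=TRUE)[[1]]
  lines <- lines[lines != ""]

  response <- list(status=lines[1], headers=character(0), body="")
  for (line in lines[-1]) {
    # split on the first colon only; header names are case-insensitive
    colon <- regexpr(":", line, fixed=TRUE)
    key <- tolower(substr(line, 1, colon - 1))
    response$headers[key] <- trimws(substr(line, colon + 1, nchar(line)))
  }

  # read exactly Content-Length bytes; a read may return fewer than asked for
  remaining <- as.integer(response$headers["content-length"])
  body <- raw(0)
  while (!is.na(remaining) && remaining > 0) {
    chunk <- readBin(con, "raw", n=remaining)
    if (length(chunk) == 0) break        # connection closed early
    body <- c(body, chunk)
    remaining <- remaining - length(chunk)
  }
  response$body <- rawToChar(body)
  response
}

# simulate a server's reply with an in-memory connection instead of a socket;
# the trailing "extra bytes" stand in for whatever follows on a kept-alive connection
reply <- charToRaw(paste0("HTTP/1.1 201 Created\r\n",
                          "Content-Type: application/json\r\n",
                          "Content-Length: 11\r\n",
                          "\r\n",
                          "{\"ok\":true}extra bytes"))
con <- rawConnection(reply)
response <- read.response(con)
close(con)
```

With a real server, the same function would presumably be handed a socket opened in binary mode, for example socketConnection(host, port, open="r+b", blocking=TRUE), with the request written using writeBin(charToRaw(request), con). This sketch ignores chunked responses entirely.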

Falling back to HTTP 1.0, we can just call readLines 'til the server closes the connection. In an amazing, but not recommended, feat of beating a dead horse until you actually get somewhere, I finally came up with the following code, with a couple variations commented out:

http.put <- function(host, path, data.to.send, content.type="application/json", port=80, verbose=FALSE) {

  if(missing(path))
    path <- "/"
  if(missing(host))
    stop("No host URL provided")
  if(missing(data.to.send))
    stop("No data to send provided")

  # Content-Length counts bytes, not characters
  content.length <- nchar(data.to.send, type="bytes")

  header <- NULL
  header <- c(header,paste("PUT ", path, " HTTP/1.0\r\n", sep=""))
  header <- c(header,"Accept: */*\r\n")
  header <- c(header,paste("Content-Length: ", content.length, "\r\n", sep=""))
  header <- c(header,paste("Content-Type: ", content.type, "\r\n", sep=""))
  request <- paste(c(header, "\r\n", data.to.send), sep="", collapse="")

  if (verbose) {
    cat("Sending HTTP PUT request to ", host, ":", port, "\n")
    cat(request, "\n")
  }

  con <- socketConnection(host=host, port=port, open="w+", blocking=TRUE, encoding="UTF-8")
  on.exit(close(con))

  # sep="" keeps writeLines from appending a newline beyond Content-Length
  writeLines(request, con, sep="")

  response <- list()

  # read whole HTTP response and parse afterwards
  # lines <- readLines(con)
  # write(lines, stderr())
  # flush(stderr())
  # 
  # # parse response and construct a response 'object'
  # response$status = lines[1]
  # first.blank.line = which(lines=="")[1]
  # if (!is.na(first.blank.line)) {
  #   header.kvs = strsplit(lines[2:(first.blank.line-1)], ":\\s*")
  #   response$headers <- sapply(header.kvs, function(x) x[2])
  #   names(response$headers) <- sapply(header.kvs, function(x) x[1])
  # }
  # response$body = paste(lines[(first.blank.line+1):length(lines)])

  response$status <- readLines(con, n=1)
  if (verbose) {
    write(response$status, stderr())
    flush(stderr())
  }
  response$headers <- character(0)
  repeat{
    ss <- readLines(con, n=1)
    if (verbose) {
      write(ss, stderr())
      flush(stderr())
    }
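    # stop at the blank line ending the headers, or if the connection closed early
    if (length(ss) == 0 || ss == "") break
    # split on the first colon only, so values containing colons stay intact
    key.value <- regmatches(ss, regexpr(":\\s*", ss), invert=TRUE)
    response$headers[key.value[[1]][1]] <- key.value[[1]][2]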
  }
  response$body = readLines(con)
  if (verbose) {
    write(response$body, stderr())
    flush(stderr())
  }

  # doesn't work. something to do with encoding?
  # readChar is for binary connections??
  # if (any(names(response$headers)=='Content-Length')) {
  #   content.length <- as.integer(response$headers['Content-Length'])
  #   response$body <- readChar(con, nchars=content.length)
  # }

  return(response)
}

After all that suffering, which was undoubtedly good for my character, I found an easier way.

RCurl

Duncan Temple Lang's RCurl is an R wrapper for libcurl, which provides robust support for HTTP 1.1. The paper R as a Web Client - the RCurl package lays out a strong case that wrapping an existing C library is a better way to get good HTTP support into R. RCurl works well and seems capable of everything needed to communicate with web services of all kinds. The API, mostly inherited from libcurl, is dense and a little confusing. Even given the docs and paper for RCurl and the docs for libcurl, I don't think I would have figured out PUT.

Luckily, at that point I found R4CouchDB, an R package built on RCurl and RJSONIO. R4CouchDB is part of a Google Summer of Code effort, NoSQL interface for R, through which high-level APIs were developed for several NoSQL DBs. Finally, I had stumbled across the answer to my problem.
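For the curious, one way to express a PUT with RCurl is to hand libcurl options straight to getURL. This is a minimal sketch, assuming RCurl is installed and a CouchDB instance is listening on localhost:5984. The wrapper couch.put, the database name testdb, and the document id doc1 are all invented for illustration, and this is not necessarily how R4CouchDB does it internally.

```r
# Hypothetical wrapper, not part of RCurl or R4CouchDB: send a JSON document
# with HTTP PUT by passing libcurl options through RCurl::getURL.
couch.put <- function(url, json) {
  RCurl::getURL(url,
    customrequest = "PUT",                                   # use the PUT verb
    postfields    = json,                                    # request body
    httpheader    = c("Content-Type" = "application/json"))  # body is JSON
}

# Example call, left commented out because it needs a running CouchDB with an
# existing database called "testdb" (both are assumptions):
# couch.put("http://localhost:5984/testdb/doc1", '{"foo": true, "bogosity": 42}')
```

The options customrequest, postfields, and httpheader map onto libcurl's CURLOPT_CUSTOMREQUEST, CURLOPT_POSTFIELDS, and CURLOPT_HTTPHEADER, which is where the density of the API comes from.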

I'm mainly documenting my misadventures here. In the next installment, CouchDB and R, we'll see what actually worked. In the meantime, is there a conclusion from all this fumbling?

My point if I have one

HTTP is so universal that a high quality implementation should be a given for any language. HTTP-based APIs are being used by databases, message queues, and cloud computing services. And let's not forget plain old-fashioned web services. Mining and analyzing these data sources is something lots of people are going to want to do in R.

Others have stumbled over similar issues. There are threads on r-help about hanging socket reads, R with CouchDB, and getting R to talk over Stomp.

RCurl gets us pretty close. It could use high-level methods for PUT and DELETE and a high-level POST amenable to web-service use cases. More importantly, this stuff needs to be easier to find without sending the clueless noob running down blind alleys. RCurl is greatly superior to httpRequest, but that's not obvious without trying it or looking at the source. At minimum, it would be great to add a section on HTTP and web services with RCurl to the R Data Import/Export guide. And finally, take it from the fool: trying to roll your own HTTP (1.1 especially) is a fool's errand.

Getting started with CouchDB

I'm investigating using CouchDB for a data mining application. CouchDB is a schema-less document-oriented database that stores JSON documents and uses JavaScript as a query language. You write queries in the form of map-reduce. Applications connect to the database over a ReSTful HTTP API. So, Couch is a creature of the web in a lot of ways.

What I have in mind (eventually) is sharding a collection of documents between several instances of CouchDB, each running on its own node. Then, I want to run distributed map-reduce queries over the whole collection of documents. But, I'm just a beginner, so we're going to start off with the basics. The CouchDB wiki has a ton of getting started material.

CouchDB's installation instructions cover several options for installing on Mac OS X, as well as other OSes. I used MacPorts.

sudo port selfupdate
sudo port install couchdb

Did I remember to update my port definitions the first time through? Of f-ing course not. Port tries to be helpful, but it's a little late sometimes with the warnings. Anyway, now that it's installed, let's start it up. I came across CouchDB on Mac OS 10.5 via MacPorts, which tells you how to start CouchDB using Apple's launchctl.

sudo launchctl load /opt/local/Library/LaunchDaemons/org.apache.couchdb.plist
sudo launchctl start org.apache.couchdb

To verify that it's up and running, type:

curl http://localhost:5984/

...which should return something like:

{"couchdb":"Welcome","version":"1.0.1"}

Futon, the web-based management tool for CouchDB, can be found at http://localhost:5984/_utils/.

Being a nerd, I tried to run Futon's test suite. After the tests failed, I found this: The tests run only(!) in a separate browser and that browser needs to be Firefox. Maybe that's been dealt with by now.

Let's create a test database and add some bogus records like these:

{
   "_id": "3f8e4c80b3e591f9f53243bfc8158abf",
   "_rev": "1-896ed7982ecffb9729a4c79eac9ef08a",
   "description": "This is a bogus description of a test document in a couchdb database.",
   "foo": true,
   "bogosity": 99.87526349
}

{
   "_id": "f02148a1a2655e0ed25e61e8cee71695",
   "_rev": "1-a34ffd2bf0ef6c5530f78ac5fbd586de",
   "foo": true,
   "bogosity": 94.162327,
   "flapdoodle": "Blither blather bonk. Blah blabber jabber jibber splat. Pickle plop dribble quibble."
}

{
   "_id": "9c24d1219b651bfeb044a0162857f8ab",
   "_rev": "1-5dd2f82c03f7af2ad24e726ea1c26ed4",
   "foo": false,
   "bogosity": 88.334,
   "description": "Another bogus document in CouchDB."
}

When I first looked at CouchDB, I thought Views were more or less equivalent to SQL queries. That's not really true in some ways, but I'll get to that later. For now, let's try a couple in Futon. First, we'll just use a map function, no reducer. Let's filter our docs by bogosity. We want really bogus documents.

Map Function

function(doc) {
  if (doc.bogosity > 95.0)
    emit(null, doc);
}

Now, let's throw in a reducer. This mapper emits the bogosity value for all docs. The reducer takes their sum.

Map Function

function(doc) {
  emit(null, doc.bogosity);
}

Reduce Function

function (key, values, rereduce) {
  return sum(values);
}

It's a fun little exercise to try and take the average. That's tricky because, for example, ave(ave(a,b), ave(c)) is not necessarily the same as ave(a,b,c). That's important because the reducer needs to be free to operate on subsets of the keys emitted from the mapper, then combine the values. The wiki doc Introduction to CouchDB views explains the requirements on the map and reduce functions. There's a great interactive emulator and tutorial on CouchDB and map-reduce that will get you a bit further writing views.
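The arithmetic behind that exercise can be sketched in R. This is only an illustration of the idea, not a CouchDB view (real reduce functions are written in JavaScript), and the names reduce.stats and rereduce.stats are made up. The trick assumed here is to carry a (sum, count) pair through each step instead of an average, because pairs from different subsets can be combined by simple addition.

```r
# Each "reduce" step returns a (sum, count) pair instead of an average.
reduce.stats <- function(values) {
  c(sum=sum(values), count=length(values))
}

# The "rereduce" step combines pairs from different subsets by adding them.
rereduce.stats <- function(partials) {
  c(sum=sum(sapply(partials, `[[`, "sum")),
    count=sum(sapply(partials, `[[`, "count")))
}

# bogosity values from the three test documents
bogosity <- c(99.87526349, 94.162327, 88.334)

# averaging the averages of two unequal subsets gives the wrong answer...
naive <- mean(c(mean(bogosity[1:2]), mean(bogosity[3])))

# ...while combining (sum, count) pairs matches the true mean
partials <- list(reduce.stats(bogosity[1:2]), reduce.stats(bogosity[3]))
total <- rereduce.stats(partials)
average <- total[["sum"]] / total[["count"]]
```

Here naive differs from mean(bogosity) because the two subsets have different sizes, while average agrees with it no matter how the values are split up.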

One fun fact about CouchDB's views is that they're stored in CouchDB as design documents, which are just regular JSON like everything else. This is in contrast to SQL where a query is a completely different thing from the data. (OK, yes, I've heard of stored procs.)

That's the basics. At this point, a couple questions arise:

How do you do parameterized queries? For example, what if I wanted to let a user specify a cut-off for bogosity at run time?
How do I more fully get my head around these map-reduce "queries"?
Can CouchDB do distributed map-reduce like Hadoop?

There's more to design documents than views. Both _show and _list functions let you transform documents. List functions use a cursor-like iterator that enables on-the-fly filtering and aggregating as well. Apparently, there are plans for _update and _filter functions too. I'll have to do some more reading and hacking and leave those for later.

Digithead's Lab Notebook

Monday, September 27, 2010

How to send an HTTP PUT request from R

Stuff that's totally beside the point

RCurl

My point if I have one

Thursday, September 23, 2010

Getting started with CouchDB
