One of the little applications which I built earlier this year ended up being more useful than I originally anticipated. Useful enough to have hit its first performance bottleneck! With performance problems I generally grumble “nice problem to have” while profiling and refactoring, but in this case I knew what the performance problem was; I just lacked the appropriate solution.
This little application, Uplink, receives anonymous telemetry information from short-lived “trials” defined within the Jenkins core application. The entire end-to-end system is defined by the design document JEP-214. What the JEP does not describe is how we use and analyze the data on the other end. At the moment the “data science” behind Uplink has been exporting large dumps of JSON information from Uplink, and then bash scripting the heck out of it. As time has gone on the amount of data ingested, and therefore exportable, has increased quite a bit. This growth in data has required numerous iterations on the “Export” functionality, whilst everything else remained largely unchanged.
First iteration
The first cut at “export” functionality was as simple and straightforward as possible:
- Receive authorized “Export” HTTP request
- Send the database `SELECT * FROM events WHERE ...`
- Receive results
- Format an HTTP response with the right `Content-Disposition` headers, etc.
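In code, that first cut looked roughly like the sketch below. The route, table name, and handler shape are illustrative rather than Uplink's actual code (and authorization is omitted); the point is that the entire result set is buffered in memory before a single response is written.

```javascript
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool(); // connection settings come from the environment

// First-iteration "export": pull *everything* into memory, then respond.
app.get('/export', async (req, res) => {
  const result = await pool.query('SELECT * FROM events');

  res.set({
    'Content-Disposition': 'attachment; filename=export.json',
    'Content-Type': 'application/json',
  });
  // result.rows holds the full dataset in the Node process's memory.
  res.send(JSON.stringify(result.rows));
});

app.listen(3000);
```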
This worked for much longer than I honestly thought it would. The Node application lives close enough to the database to retrieve large datasets within an HTTP timeout and deliver those to the client. Once the total dataset exceeded a couple hundred megabytes, things stopped working.
Second iteration
The consumer of this data was, and still is, a single person wielding bash scripts a’plenty. To keep things as simple as possible, we changed the frontend to require that any “Export” define a date range to export. Initially the Data Scientist™ would request a whole week at a time, and when that stopped working, they would request individual daily exports instead. Eventually this also stopped working, somewhere around a daily dataset size of a couple of hundred megabytes.
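The change itself was small: the same query, just bounded by the caller-supplied dates. A minimal sketch, assuming a `received_at` timestamp column (the real schema may differ):

```javascript
const { Pool } = require('pg');
const pool = new Pool();

// Second iteration: only export events inside a caller-supplied date range.
function exportRange(startDate, endDate) {
  return pool.query(
    'SELECT * FROM events WHERE received_at >= $1 AND received_at < $2',
    [startDate, endDate]
  );
}
```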
Third iteration
Clearly loading big stupid `SELECT * FROM` result sets into the application to format and serve them to clients was not scalable. For the third iteration I resolved to implement a direct stream from the database through the web application to the client. In effect, I wanted the PostgreSQL database connection to give me results immediately, which would then be written directly to the HTTP output stream; no in-memory storage.
I discovered a very useful Node package to solve the first part of the problem: pg-query-stream. The pg-query-stream package uses a database-side cursor to avoid the need to create large datasets in memory on the database or web application.
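Roughly speaking (the query and batch size below are just placeholders), a `QueryStream` wraps a database-side cursor and hands rows to the application in small batches as a readable object stream:

```javascript
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');

const pool = new Pool();

pool.connect((err, client, release) => {
  if (err) { throw err; }

  // Rows are fetched from a database-side cursor in batches rather than
  // materializing the whole result set; 100 is an arbitrary batch size.
  const query = new QueryStream('SELECT * FROM events', [], { batchSize: 100 });
  const stream = client.query(query);

  stream.on('data', (row) => {
    // Each row arrives as the cursor is read, so memory use stays flat
    // regardless of how large the full result set is.
  });
  stream.on('end', release);
});
```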
The “trick”, which to be honest isn’t a very incredible trick since Node streams are designed to be pluggable in this way, was to connect the `pg-query-stream` directly to the ExpressJS `Response`, which looks like a writable stream. To form a proper HTTP response, the ExpressJS handler must first write the response code and headers, for which `response.send()` will not work, so `response.writeHead()` is used instead:
```javascript
const QueryStream = require('pg-query-stream');
const JSONStream = require('JSONStream');

// `dbConnection` is an already-acquired `pg` client, and `req`/`response` are
// the ExpressJS request and response objects for the "Export" handler.
const query = new QueryStream('SELECT * FROM events WHERE ...');
const datastream = dbConnection.query(query);

response.writeHead(200, {
  'Content-Disposition' : `attachment; filename=${req.body.type}-${req.body.startDate}.json`,
  'Content-Type': 'application/json',
});

/*
 * Pipe the data to JSONStream to convert to a proper JSON string first.
 *
 * Once it has been formatted, _then_ pipe to the ExpressJS response object
 */
datastream.pipe(JSONStream.stringify(false)).pipe(response);
datastream.on('end', () => { response.end(); });
```
(You can view the actual code used here)
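For completeness: a streaming handler like this also needs to deal with query errors mid-stream and with returning the client to its pool. The sketch below is one way to shape that, assuming a `pg` Pool and the same `JSONStream` formatting; the route, filenames, and `received_at` column are illustrative rather than Uplink's actual code.

```javascript
const express = require('express');
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');
const JSONStream = require('JSONStream');

const app = express();
const pool = new Pool();
app.use(express.json());

app.post('/export', (req, res, next) => {
  pool.connect((err, client, release) => {
    if (err) { return next(err); }

    const query = new QueryStream(
      'SELECT * FROM events WHERE received_at >= $1 AND received_at < $2',
      [req.body.startDate, req.body.endDate]
    );
    const datastream = client.query(query);

    res.writeHead(200, {
      'Content-Disposition': `attachment; filename=${req.body.startDate}.json`,
      'Content-Type': 'application/json',
    });

    // Hand the client back to the pool when the cursor is exhausted.
    datastream.on('end', release);

    datastream.on('error', (streamErr) => {
      // A truthy argument tells the pool to discard this client instead of reusing it.
      release(streamErr);
      // The 200 status has already been sent, so all we can do is cut the body short.
      res.end();
    });

    // pipe() ends the response for us once the formatted stream finishes.
    datastream.pipe(JSONStream.stringify(false)).pipe(res);
  });
});

app.listen(3000);
```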
This approach is, as far as I can tell, “infinitely” scalable. So long as the database can stream data to the Node application, the Node application will continue to write data into the response for the end-user.
I was so worried that I was going to have to find some way to generate bulk files on the server with some background job processing system, or something else equally complex. I’m thrilled that the solution simply required connecting one streamy thing to another streamy thing, which Node is quite well suited for.
Neat!