<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brokenco.de//feed/by_tag/expressjs.xml" rel="self" type="application/atom+xml" /><link href="https://brokenco.de//" rel="alternate" type="text/html" /><updated>2026-05-03T00:12:50+00:00</updated><id>https://brokenco.de//feed/by_tag/expressjs.xml</id><title type="html">rtyler</title><subtitle>a moderately technical blog</subtitle><author><name>R. Tyler Croy</name></author><entry><title type="html">Streaming HTTP data with PostgreSQL and ExpressJS</title><link href="https://brokenco.de//2018/12/06/streaming-results-node-pg.html" rel="alternate" type="text/html" title="Streaming HTTP data with PostgreSQL and ExpressJS" /><published>2018-12-06T00:00:00+00:00</published><updated>2018-12-06T00:00:00+00:00</updated><id>https://brokenco.de//2018/12/06/streaming-results-node-pg</id><content type="html" xml:base="https://brokenco.de//2018/12/06/streaming-results-node-pg.html"><![CDATA[<p>One of the <a href="https://github.com/jenkins-infra/uplink">little applications</a> which
I built earlier this year ended up more useful than I originally anticipated.
Useful enough to have hit its first performance bottleneck! When facing
performance problems I generally grumble “nice problem to have” while profiling
and refactoring, but in this case I knew what the performance problem was; I
just lacked the appropriate solution.</p>

<p>This little application, Uplink, receives anonymous telemetry information from
short-lived “trials” defined within the <a href="https://github.com/jenkinsci/jenkins">Jenkins core
application</a>. The entire end-to-end system is
defined by the design document
<a href="https://github.com/jenkinsci/jep/blob/master/jep/214/README.adoc">JEP-214</a>.
What the JEP does not describe is how we use and analyze the data on the other
end. At the moment the “data science” behind Uplink consists of exporting large
dumps of JSON information from Uplink, and then bash scripting the heck out of
them. As time has gone on the amount of data ingested, and therefore exportable,
has increased quite a bit. This growth in data has required numerous iterations
on the “Export” functionality, whilst everything else remained largely
unchanged.</p>

<p><strong>First iteration</strong></p>

<p>The first cut at “export” functionality was as simple and straightforward as
possible:</p>

<ol>
  <li>Receive authorized “Export” HTTP request</li>
  <li>Send the database <code class="language-plaintext highlighter-rouge">SELECT * FROM events WHERE ...</code></li>
  <li>Receive results</li>
  <li>Format an HTTP response with the right <code class="language-plaintext highlighter-rouge">Content-Disposition</code> headers, etc.</li>
</ol>

<p>This worked for much longer than I honestly thought it would. The Node
application lives close enough to the database to retrieve large datasets
within an HTTP timeout and deliver those to the client. Once the <strong>total</strong>
dataset exceeded a couple hundred megabytes, things stopped working.</p>
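
<p>A minimal sketch of that buffer-everything flow, not Uplink’s actual code.
<code class="language-plaintext highlighter-rouge">makeFakeDb</code> and <code class="language-plaintext highlighter-rouge">exportBuffered</code> are hypothetical names, the filename is made up, and the
fake client only mimics the <code class="language-plaintext highlighter-rouge">{ rows }</code> result shape that node-postgres returns. The point is
that both the full result set and the full serialized body sit in memory before the first byte
is sent:</p>

```javascript
// Hypothetical stand-in for a database client: resolves with every row at once.
function makeFakeDb(rows) {
  return { query: async (_sql) => ({ rows }) };
}

// Buffer-everything export: the whole result set and the whole serialized
// body are held in memory before anything is written to the response.
async function exportBuffered(db, response) {
  const result = await db.query('SELECT * FROM events');
  const body = JSON.stringify(result.rows);

  response.writeHead(200, {
    'Content-Disposition': 'attachment; filename=export.json',
    'Content-Type': 'application/json',
  });
  response.end(body);

  // Report how many bytes had to be held in memory for the body alone.
  return Buffer.byteLength(body);
}
```

<p>Memory use here grows linearly with the size of the export, which is consistent with
the failure showing up only once the dataset got large.</p>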

<p><strong>Second iteration</strong></p>

<p>The consumer of this data was, and still is, a single person wielding bash
scripts a’plenty. To keep things as simple as possible, we changed the frontend
to require that any “Export” define a date range to export. Initially the Data
Scientist™ would request a whole week at a time, and when that stopped
working, they would request individual daily exports instead. Eventually this also
stopped working, somewhere around a <em>daily</em> dataset size of a couple of hundred
megabytes.</p>
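
<p>A date-bounded export might be sketched as below. This is an illustration only: the
column names <code class="language-plaintext highlighter-rouge">type</code> and <code class="language-plaintext highlighter-rouge">created_at</code> and the helper <code class="language-plaintext highlighter-rouge">buildExportQuery</code> are assumptions, not
Uplink’s schema or code. The returned object uses the <code class="language-plaintext highlighter-rouge">{ text, values }</code> query shape that
node-postgres accepts, so user input never ends up concatenated into the SQL string:</p>

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Build a parameterized query covering `days` days starting at `startDate`
// (an ISO date such as '2018-12-06'). Column names are hypothetical.
function buildExportQuery(type, startDate, days = 1) {
  const start = new Date(`${startDate}T00:00:00Z`);
  if (Number.isNaN(start.getTime())) {
    throw new RangeError(`Invalid start date: ${startDate}`);
  }
  const end = new Date(start.getTime() + days * DAY_MS);

  return {
    // Half-open range [start, end) so adjacent exports never overlap.
    text: 'SELECT * FROM events WHERE type = $1 AND created_at >= $2 AND created_at < $3',
    values: [type, start.toISOString(), end.toISOString()],
  };
}
```

<p>Shrinking <code class="language-plaintext highlighter-rouge">days</code> from 7 to 1 is the week-to-day step described above. It only postpones
the problem, because each window is still loaded into memory in full.</p>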

<p><strong>Third iteration</strong></p>

<p>Clearly loading big stupid <code class="language-plaintext highlighter-rouge">SELECT * FROM</code> result sets into the application to
format and serve them to clients was not scalable. For the third iteration I
resolved to implement a direct stream from the database through the web
application to the client. In effect, I wanted the PostgreSQL database
connection to give me results <em>immediately</em>, which would then be written
directly to the HTTP output stream; no in-memory storage.</p>

<p>I discovered a very useful Node package to solve the first part of the problem:
<a href="https://github.com/brianc/node-pg-query-stream/#pg-query-stream">pg-query-stream</a>.
The pg-query-stream package uses a database-side cursor to avoid the need to
create large datasets in memory on the database or web application.</p>
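
<p>To illustrate the cursor idea without a database, here is a stdlib-only simulation. It is
not how pg-query-stream is implemented internally; <code class="language-plaintext highlighter-rouge">FakeCursor</code> and <code class="language-plaintext highlighter-rouge">CursorStream</code> are
hypothetical classes. The stream asks the cursor for one batch at a time, and only when the
consumer is ready for more, so at most one batch is held in memory:</p>

```javascript
const { Readable } = require('node:stream');

// Hypothetical stand-in for a server-side cursor: hands out rows in batches
// and records the largest batch it ever returned.
class FakeCursor {
  constructor(totalRows) {
    this.position = 0;
    this.totalRows = totalRows;
    this.largestBatch = 0;
  }

  async read(count) {
    const rows = [];
    while (rows.length < count && this.position < this.totalRows) {
      rows.push({ id: this.position++ });
    }
    this.largestBatch = Math.max(this.largestBatch, rows.length);
    return rows;
  }
}

// Object-mode Readable that fetches one batch per demand from the consumer.
class CursorStream extends Readable {
  constructor(cursor, batchSize = 100) {
    super({ objectMode: true, highWaterMark: batchSize });
    this.rowCursor = cursor;
    this.batchSize = batchSize;
    this.fetching = false;
  }

  _read() {
    if (this.fetching) return; // a batch is already on its way
    this.fetching = true;
    this.rowCursor.read(this.batchSize).then((rows) => {
      this.fetching = false;
      if (rows.length === 0) {
        this.push(null); // cursor exhausted: end the stream
        return;
      }
      for (const row of rows) this.push(row);
    }, (err) => this.destroy(err));
  }
}
```

<p>Iterating with <code class="language-plaintext highlighter-rouge">for await (const row of new CursorStream(new FakeCursor(250), 100))</code> visits
all 250 rows while the cursor never returns more than 100 at once.</p>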

<p>The “trick”, which to be honest isn’t a very incredible trick since Node
streams are designed to be pluggable in this way, was to connect the
<code class="language-plaintext highlighter-rouge">pg-query-stream</code> directly to the ExpressJS <code class="language-plaintext highlighter-rouge">Response</code>, which is itself a writable
stream. To form a proper HTTP response, the ExpressJS handler must first write
the response code and headers. <code class="language-plaintext highlighter-rouge">response.send()</code> will not work for that, since it
also sends a body and ends the response, so
<code class="language-plaintext highlighter-rouge">response.writeHead</code> is used instead:</p>

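<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Assumed imports; dbConnection is an already-connected pg client</span>
<span class="kd">const</span> <span class="nx">QueryStream</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">pg-query-stream</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">JSONStream</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">JSONStream</span><span class="dl">'</span><span class="p">);</span>

<span class="kd">const</span> <span class="nx">query</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">QueryStream</span><span class="p">(</span><span class="dl">'</span><span class="s1">SELECT * FROM events WHERE ...</span><span class="dl">'</span><span class="p">);</span>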
<span class="kd">const</span> <span class="nx">datastream</span>  <span class="o">=</span> <span class="nx">dbConnection</span><span class="p">.</span><span class="nx">query</span><span class="p">(</span><span class="nx">query</span><span class="p">);</span>

<span class="nx">response</span><span class="p">.</span><span class="nx">writeHead</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="p">{</span>
    <span class="dl">'</span><span class="s1">Content-Disposition</span><span class="dl">'</span> <span class="p">:</span> <span class="s2">`attachment; filename=</span><span class="p">${</span><span class="nx">req</span><span class="p">.</span><span class="nx">body</span><span class="p">.</span><span class="nx">type</span><span class="p">}</span><span class="s2">-</span><span class="p">${</span><span class="nx">req</span><span class="p">.</span><span class="nx">body</span><span class="p">.</span><span class="nx">startDate</span><span class="p">}</span><span class="s2">.json`</span><span class="p">,</span>
    <span class="dl">'</span><span class="s1">Content-Type</span><span class="dl">'</span><span class="p">:</span> <span class="dl">'</span><span class="s1">application/json</span><span class="dl">'</span><span class="p">,</span>
<span class="p">});</span>

<span class="cm">/*
 * Pipe the data to JSONStream to convert to a proper JSON string first.
 *
 * Once it has been formatted, _then_ pipe to the ExpressJS response object
 */</span>
<span class="nx">datastream</span><span class="p">.</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">JSONStream</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="kc">false</span><span class="p">)).</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">response</span><span class="p">);</span>
<span class="nx">datastream</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">end</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="nx">response</span><span class="p">.</span><span class="nx">end</span><span class="p">();</span> <span class="p">});</span>
</code></pre></div></div>
<p>(<em>You can view the actual code used <a href="https://github.com/jenkins-infra/uplink/blob/7a4b6377552d901b850c4c39570a67dd86b0a209/src/controllers/export.ts#L19-L32">here</a></em>)</p>
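
<p>The same source, formatter, response shape can be tried with nothing but Node’s standard
library. The sketch below is an assumption-laden stand-in rather than the code above:
<code class="language-plaintext highlighter-rouge">ndjsonStringify</code> and <code class="language-plaintext highlighter-rouge">streamRows</code> are hypothetical helpers, and the transform only
approximates the newline-separated output of <code class="language-plaintext highlighter-rouge">JSONStream.stringify(false)</code>. It uses
<code class="language-plaintext highlighter-rouge">stream.pipeline</code> instead of chained <code class="language-plaintext highlighter-rouge">.pipe()</code> calls, because <code class="language-plaintext highlighter-rouge">.pipe()</code> does not forward errors
from one stage to the next:</p>

```javascript
const { Readable, Transform, pipeline } = require('node:stream');

// Serialize each incoming row object as one JSON document per line.
function ndjsonStringify() {
  return new Transform({
    writableObjectMode: true, // objects in, text out
    transform(row, _encoding, callback) {
      callback(null, JSON.stringify(row) + '\n');
    },
  });
}

// Stream `rows` (any iterable of objects) into `destination`, a writable
// stream such as an HTTP response. `done` is called once, with an error if
// any stage failed, after which pipeline() has destroyed the other stages.
function streamRows(rows, destination, done) {
  pipeline(Readable.from(rows), ndjsonStringify(), destination, done);
}
```

<p>In an ExpressJS handler the destination would be the response object, and
the <code class="language-plaintext highlighter-rouge">done</code> callback is a reasonable place to log a failed export. Headers still need to be
written before the first row flows.</p>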

<hr />

<p>This approach is, as far as I can tell, “infinitely” scalable. So long as the
database can stream data to the Node application, the Node application will
continue to write data into the response for the end-user.</p>
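
<p>What keeps memory bounded in a setup like this is backpressure: a writable stream’s
<code class="language-plaintext highlighter-rouge">write()</code> returns <code class="language-plaintext highlighter-rouge">false</code> once its internal buffer reaches <code class="language-plaintext highlighter-rouge">highWaterMark</code>, and <code class="language-plaintext highlighter-rouge">pipe()</code>
pauses the source until the destination emits <code class="language-plaintext highlighter-rouge">'drain'</code>. A small sketch with a hypothetical slow
consumer (<code class="language-plaintext highlighter-rouge">makeSlowSink</code> and <code class="language-plaintext highlighter-rouge">demoBackpressure</code> are made-up names, and the 4-byte limit is
deliberately tiny):</p>

```javascript
const { Writable } = require('node:stream');

// Hypothetical slow consumer: a tiny 4-byte buffer and a deferred write
// callback, standing in for a client on a slow connection.
function makeSlowSink() {
  return new Writable({
    highWaterMark: 4,
    write(_chunk, _encoding, callback) {
      setImmediate(callback);
    },
  });
}

function demoBackpressure() {
  const sink = makeSlowSink();
  const first = sink.write('ab');    // 2 bytes pending, below the limit
  const second = sink.write('cdef'); // 6 bytes pending, limit exceeded
  return { sink, first, second };
}
```

<p>Here <code class="language-plaintext highlighter-rouge">first</code> is <code class="language-plaintext highlighter-rouge">true</code> and <code class="language-plaintext highlighter-rouge">second</code> is <code class="language-plaintext highlighter-rouge">false</code>; a well-behaved producer stops writing at that
point and waits for <code class="language-plaintext highlighter-rouge">'drain'</code>, which is what <code class="language-plaintext highlighter-rouge">pipe()</code> does on the producer’s behalf.</p>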

<p>I was so worried that I was going to have to find some way to generate bulk
files on the server with some background job processing system, or something
else equally complex. I’m thrilled that the solution simply required connecting
one streamy thing to another streamy thing, which Node is quite well suited
for.</p>

<p>Neat!</p>]]></content><author><name>R. Tyler Croy</name></author><category term="javascript" /><category term="expressjs" /><category term="postgresql" /><summary type="html"><![CDATA[One of the little applications which I built earlier this year ended up more useful than I originally anticipated. Useful enough to have hit its first performance bottleneck! When facing performance problems I generally grumble “nice problem to have” while profiling and refactoring, but in this case I knew what the performance problem was; I just lacked the appropriate solution.]]></summary></entry></feed>