Entering into the data platform space with a lot of experience in more
traditional production operations is a lot of fun, especially when you ask
questions like “what if X goes horribly wrong?” My favorite scenario to
consider is: “how much damage could one accidentally cause with our existing
policies and controls?” At Scribd we have made Delta Lake a cornerstone of our
data platform, and as such I’ve spent a lot of time thinking about what could
go wrong and how we would defend against it.
Howdy!
Welcome to my blog where I write about software development, cycling, and other random nonsense. This is not the only place I write; you can find more words I typed on the Buoyant Data blog, Scribd tech blog, and GitHub.
Understanding big data partitioning
Data partitioning is one of the principles to utilize when developing large data sets, but do you know what that actually means for the storage format? I didn’t! Many “big data” storage systems such as HDFS, S3, and Azure Data Lake Storage are all effectively file systems. This past year or so, I’ve become much more familiar with Delta Lake and kind of just assumed that data partitioning was something being done at the transaction log level. Turns out I guessed wrong.
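The excerpt does not spell out where partitioning actually happens, so the following is only a sketch of one common convention on file-system-like storage: Hive-style partitioning, where each partition column becomes a `column=value` path segment. The function name, table root, column names, and file name are all hypothetical and chosen for illustration.

```rust
/// Build a Hive-style object key, where each partition column is encoded as a
/// `column=value` directory segment between the table root and the data file.
/// Assumption: this mirrors a common layout convention, not a specific
/// library's API; every name used here is hypothetical.
fn partition_path(table_root: &str, partitions: &[(&str, &str)], file_name: &str) -> String {
    // Start from the table root, without a trailing slash.
    let mut segments = vec![table_root.trim_end_matches('/').to_string()];
    // Append one `column=value` segment per partition column, in order.
    for (column, value) in partitions {
        segments.push(format!("{}={}", column, value));
    }
    // The data file itself lives at the end of the path.
    segments.push(file_name.to_string());
    segments.join("/")
}
```

With a layout like this, a reader that only needs one day of data can list a single prefix instead of scanning every file in the table.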
Building a goede search engine
This weekend I finally got around to building a little Rust “full text search engine” based on the educational post written by my Scribd colleague Bart, titled Building a full-text search engine in 150 lines of Python code. Bart did a great job writing an accessible post which introduced some common search concepts using Python. My objective wasn’t necessarily to write something faster or better, but to use the exercise as Rust practice. My day job is no longer writing code, so the opportunity for a problem with fixed scope which would work out my Rust muscles was too good to pass up. In this post I want to share some things which I’ve learned in the process of duplicating Bart’s work.
Subscribe to my "Podcast Picks"
I have always been a fan of podcasts, but have never really had a good way to share the interesting things I am listening to. A couple weeks ago I struck upon an idea that seems so bafflingly simple in retrospect: I could just host my own podcast feed.
Software-defined networks with FreeBSD Jails
As a comprehensive operating system FreeBSD never ceases to impress me. The
recent iterations of FreeBSD Jails, as an example, have been an absolute joy
to use. The introduction of the vnet(9) network subsystem has completely
transformed what I had originally thought about software-defined networking.
My previous exposure to the concept of software-defined networking was through
both OpenStack and Docker, two very different approaches to the broad domain
of “SDN”. FreeBSD’s vnet system has resonated most strongly with me and has
allowed me some measure of success in deploying real production-grade
virtualized networks.
Dynamically adding parameters in sqlx
Bridging data types between the database and a programming language is such a
foundational feature of most database-backed applications that many developers
overlook it, until it doesn’t work. For many of my Rust-based applications I
have been enjoying sqlx which strikes
the right balance between “too close to the database”, working with raw cursors
and buckets of bytes, and “too close to the programming language”, magic object
relational mappings. It reminds me a lot of what I wanted Ruby Object
Mapper to be back when it was called “data mapper.” sqlx
can do many things, but it’s not a silver bullet and it errs on the side of
“less magic” in many cases, which leaves the developer to deal with some
trade-offs. Recently I found myself with just such a trade-off: mapping a Uuid
such that I could do IN queries.
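One generic way to handle an `IN` clause whose length is only known at runtime is to generate one placeholder per value and then bind the values in the same order. The sketch below shows only the string-building half, using the standard library and assuming PostgreSQL-style numbered placeholders (`$1`, `$2`, ...). The function names, table name, and `id` column are hypothetical, and this is not necessarily the approach the post itself settles on.

```rust
/// Return `count` numbered placeholders separated by commas,
/// e.g. "$1, $2, $3" for `count = 3`.
/// Assumption: PostgreSQL-style `$n` placeholders; other databases use `?`.
fn in_placeholders(count: usize) -> String {
    (1..=count)
        .map(|i| format!("${}", i))
        .collect::<Vec<_>>()
        .join(", ")
}

/// Build a SELECT statement with an IN clause sized for `count` ids.
/// The table name and `id` column are hypothetical. Callers should handle
/// `count == 0` themselves, since an empty `IN ()` is not valid SQL.
fn select_by_ids(table: &str, count: usize) -> String {
    format!("SELECT * FROM {} WHERE id IN ({})", table, in_placeholders(count))
}
```

Each Uuid would then be bound to the resulting query in order, for example by calling `bind` once per value on the query object, so the values themselves never get interpolated into the SQL string.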
Thoughts on WebTorrent
WebTorrent is one of the most novel uses of some modern browser technologies that I have recently learned about. Using WebRTC, it is able to implement a truly peer-to-peer data transport on top of support offered by existing browsers. I came across WebTorrent when I was doing some research on what potential future options might exist for more scalable distribution of free and open source libraries and applications. In this post, I want to share some thoughts and observations I jotted down while considering WebTorrent.
Technically I'm microblogging now.
I am a big fan of the open web, and although I have enjoyed Twitter, the platform has regressed dramatically in form and function since I first adopted it. I remember when Twitter actively avoided building a walled garden, with fantastic APIs and RSS feeds open to the public. Much of the popularity of the platform hinged upon the incredible third party applications and integrations developers like me built in the first five-ish years of its existence. Over time the site has strayed from open APIs and standards, and while I still enjoy Twitter, I want some more flexibility, which is why you can now subscribe to my microblog with any RSS-capable client.
Synchronizing notes with Nextcloud and Vimwiki
The quantity of things I need to keep track of or be responsible for has
exploded in the past few years, so much so that I have had to really focus on
organizing my “personal knowledgebase.” When I originally tried to spend some
time improving my information management system, I found numerous different
services offering to improve my productivity and to help me keep track of
everything. Invariably many of these tools were web apps. In order to quickly
and productively work with information, a <textarea/> in a web page is the
choice of just about last resort. I recently revisited Vimwiki and have been
quite satisfied both by my productivity boost and the benefits that come with
having raw text to work with. The best benefit: easy synchronization of notes
with Nextcloud.
Reverse proxying a Tide application with Nginx
Every now and again I’ll encounter a silly problem, fix it, forget about it, and then later run into the exact same problem again. Today’s example is a confusing error I encountered when reverse-proxying a Tide application with Nginx. In the Tide application, I was greeted with an ever-so-descriptive error: