Building a goede search engine

This weekend I finally got around to building a little Rust “full text search engine” based on the educational post written by my Scribd colleague Bart: titled Building a full-text search engine in 150 lines of Python code. Bart did a great job writing an accessible post which introduced some common search concepts using Python, my objective wasn’t necessarily to write something faster or better but to use the exercise as Rust practice. My day job is no longer writing code so the opportunity for a problem with fixed scope which would work out my Rust muscles was too good to pass up. In this post I want to share some things which I’ve learned in the process of duplicating Bart’s work.

The exercise revolves around parsing, indexing, and searching a dump of abstracts from the English language Wikipedia. This meant consuming a rather large Gzipped dump of XML which led me to two instrumental crates:

flate2 which has the decoder support to read a Gzip (.gz) file.
quick_xml which allows for parsing XML from strings or streams.
rust-stemmer which helps with stemming, a concept which was new to me.

Initially I spent a lot of time just trying to get the file loaded. It seemed that no matter what I tried I just could not get the file to deserialize properly. I would look at snippets of the file in zcat and it all looked like valid XML to me. I copied some of the entries out of the .xml.gz file and created a simple test data set which loaded perfectly fine, but for whatever reason I could not properly load the entire dataset generated by Wikipedia. On a guess I unzipped the file and then re-gzipped it with the gunzip and gzip command line tools and that loaded properly. Unfortunately it looks like I had found a bug in flate2. Oops.

Once I was able to actually load the Wikipedia data set, the rest of the exercise went quite smoothly. I needed to refer to Rust docs quite a bit, had to borrow some suggestions from Stack Overflow when it came time to find the intersection of multiple HashSet structs, since Rust’s HashSet.intersection will only work on two sets.

After a couple hours of evening tinkering, goedesearch was born.

Performance

As one might expect, converting a Python project to Rust did introduce performance improvements. Some aspects of the Python code are intentionally slow in order to make it easier to introduce and consider piece by piece.

To give you an idea of the inherent performance improvements of Rust, just the act of deserializing a gzipped XML file into memory in my initial attempt took 4-5 minutes where the Python version took 40 or so. Since I work on an aarch64 machine, I wonder how optimized, or not Python on that architecture may be. Once the file was loaded, the search performance of the two implementations is close enough to not be really worth discussion.

The inherent Rust performance improvements aside, I still have done some minor optimizations of my Rust code to further enhance the exercise. For example, my initial version of the code used quick_xml’s serde support to deserialize entries, but I have since moved away from that for performance reasons. The original flow was:

Deserialize the .xml.gz into a Vec<Article.
Then iterate through that vector and run the indexing process.

I re-factored this code to use a more streaming approach where each <article/> in the data set is deserialized into an Article and then indexed. This speeds up the time by reducing two steps but the whole process is still fairly CPU-bound.

My next steps are to explore some crates I have been watching for distributing the indexing work across multiple cores on the machine. I am optimistic that the safe concurrency support in Rust allows me to make minimal changes to my code to take this indexing operation from dominating a single core, to utilizing every code available. Ultimately, I would love for the loading/indexing problem to be I/O bound rather than CPU bound.

Implementing goedesearch is a great example of how taking a project in one language, and porting it to another, can help teach you more about a domain. The full-text search engine is really a toy that is not meant for serious use, but playing is one of the key ways that we learn! Bart suggested checking out tantivy for a more serious Rust search implementation. The folks behind tantivy also gave a FOSDEM 2019 presentation which may be of interest.

For me, I don’t have any great plans for implementing a search engine, in Rust or otherwise, but I’m grateful to have had a fun little project to relax with this weekend!

Should you want to review it, all of the code can be found in this repository on GitHub.