Data partitioning is one of the principles to utilize when developing large data sets, but do you know what that actually means for the storage format? I didn’t! Many “big data” storage systems such as HDFS, S3, and Azure Data Lake Storage all are effectively a file system. This past year or so, I’ve become much more familiar with Delta Lake and kind of just assumed that data partitioning was something being done at the transaction log level. Turns out I guessed wrong.

Data partitioning is almost always just special directories in the file system.

I feel like I have been duped!

Consider the following example from the delta-rs test data:

delta-0.8.0-partitioned
├── _delta_log
│   └── 00000000000000000000.json
├── year=2020
│   ├── month=1
│   │   └── day=1
│   │       └── part-00000-8eafa330-3be9-4a39-ad78-fd13c2027c7e.c000.snappy.parquet
│   └── month=2
│       ├── day=3
│       │   └── part-00000-94d16827-f2fd-42cd-a060-f67ccc63ced9.c000.snappy.parquet
│       └── day=5
│           └── part-00000-89cdd4c8-2af7-4add-8ea3-3990b2f027b5.c000.snappy.parquet
└── year=2021
    ├── month=12
    │   ├── day=20
    │   │   └── part-00000-9275fdf4-3961-4184-baa0-1c8a2bb98104.c000.snappy.parquet
    │   └── day=4
    │       └── part-00000-6dc763c0-3e8b-4d52-b19e-1f92af3fbb25.c000.snappy.parquet
    └── month=4
        └── day=5
            └── part-00000-c5856301-3439-4032-a6fc-22b7bc92bebb.c000.snappy.parquet

In the above example the delta-0.8.0-partitioned Delta table is partitioned by year, month, and day. The table metadata does have the partitionColumns defined, for example:

{
    "partitionColumns":["year","month","day"]
}

What’s really giving the queries the performance benefits of partitioning is the fact that the file system layout takes these “partitions” (special directory names) into consideration when determining which files to actually load up for the query.

At the end of the day, it all just comes down to binary files on the file systems.

It’s a UNIX system! I know this!