I would guess most developers think of a URL as a string with https:// at
the beginning. In many cases, the assumptions made about these URL-shaped
strings are confusing, misleading, or flat-out incorrect. The url crate is compliant with the WHATWG URL Standard, but while technically correct is the best kind of correct, that doesn’t mean it isn’t still confusing.
Here are some common misconceptions that I have seen crop up as I have worked on incorporating more and more url::Url usage in my Rust projects.
Slashes are load-bearing
Most web frameworks will take a request like https://example.com/hello// and route that to the handler for /hello, conveniently dropping the redundant trailing slashes. From a URL specification standpoint, this is probably not correct. Where I might see a couple of trailing slashes, a URL parser sees a hello path segment followed by two empty path segments. Consider the following.
use url::Url;
let left = Url::parse("s3://bucket/prefix/")?;
let right = Url::parse("s3://bucket/prefix")?;
These are not equivalent.
The path_segments() are different too:
left: ["prefix", ""]
right: ["prefix"]
This is because the trailing slash introduces another path segment, one that just happens to be empty. Cue subtle bugs from user code that expects the two URLs to behave identically because, well, S3 treats them as such, as do most other web servers today.
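To make the empty segment concrete, here is a minimal sketch using only the standard library. It is a simplified model of how a path splits into segments, not the url crate's actual parser, and the `segments` helper is a hypothetical name I made up for illustration.

```rust
/// Simplified model of path segmentation (an assumption for illustration,
/// not the url crate's implementation): drop the leading slash, then split
/// on every remaining '/'. A trailing slash therefore produces one extra,
/// empty segment. Real parsers also deal with percent-encoding, dot
/// segments, and more.
fn segments(path: &str) -> Vec<&str> {
    path.strip_prefix('/').unwrap_or(path).split('/').collect()
}
```

Under this model, `segments("/prefix/")` has two entries while `segments("/prefix")` has one, and `segments("/hello//")` is a hello segment followed by two empty ones.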
Join the fun
Because a trailing slash means there’s an empty path segment on the Url, joining onto a Url behaves differently than you might otherwise expect. For example:
left.join("_delta_log")?; // produces `s3://bucket/prefix/_delta_log`
right.join("_delta_log")?; // produces `s3://bucket/_delta_log`
The docs try to make this clear:
A trailing slash is significant. Without it, the last path component is considered to be a “file” name to be removed to get at the “directory” that is used as the base.
Even so, the behavior hinges on a single trailing slash, so this nuance is easy for most developers to miss.
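The rule quoted from the docs can be sketched on plain path strings. This is a rough model under stated assumptions, not how the url crate implements join: it ignores absolute references, `..` segments, queries, and fragments. Both `join_path` and `as_directory` are hypothetical helper names.

```rust
/// Simplified model of joining a relative reference onto a base path:
/// everything after the last '/' in the base is treated as a "file" name
/// and dropped before the relative part is appended.
fn join_path(base: &str, relative: &str) -> String {
    // Index just past the last '/', or 0 if the base has no slash at all.
    let dir_end = base.rfind('/').map_or(0, |i| i + 1);
    format!("{}{}", &base[..dir_end], relative)
}

/// Hypothetical guard: make sure a base that is meant to be a "directory"
/// ends with a slash before anything is joined onto it.
fn as_directory(base: &str) -> String {
    if base.ends_with('/') {
        base.to_string()
    } else {
        format!("{}/", base)
    }
}
```

With these, `join_path("/prefix", "_delta_log")` loses the prefix, while running the base through `as_directory` first keeps it. Whether silently adding the slash is the right call depends on whether the input is really meant to be a directory, so treat the guard as one option rather than a fix.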
File URLs are weird
A file URL is one which starts with file://, but because a slash is not
always a slash on operating systems, especially those developed in Redmond, WA,
their behavior is not always consistent with what developers expect.
I ended up filing a bug against the url crate for this behavior, but as of today these two produce different results:
Url::parse("file:///home/tyler/../../dev/null")?;
Url::from_file_path("/home/tyler/../../dev/null").expect("path must be absolute"); // the error type is (), so `?` will not convert it
The resulting Url structs are not equivalent. Parsing the file URL canonicalizes it, removing the .. segments from the path and producing a Url that is effectively /dev/null. The second Url, however, has a .path() of the full uncanonicalized path passed in.
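The canonicalization that parsing performs can be approximated with a purely lexical pass over the path. The sketch below is an assumption-laden illustration, not the url crate's code: `remove_dot_segments` is a hypothetical name, it expects an absolute Unix-style path, and it does not reproduce every edge case of the specification (for example, how a trailing `..` is serialized). It also never touches the filesystem, so unlike `std::fs::canonicalize` it does not resolve symlinks.

```rust
/// Lexically remove "." and ".." segments from an absolute, '/'-separated
/// path. Illustration only: assumes the input starts with '/'.
fn remove_dot_segments(path: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    for segment in path.strip_prefix('/').unwrap_or(path).split('/') {
        match segment {
            // "." refers to the current directory, so it adds nothing.
            "." => {}
            // ".." drops the previous segment; at the root there is
            // nothing to drop, so it is ignored.
            ".." => {
                out.pop();
            }
            other => out.push(other),
        }
    }
    format!("/{}", out.join("/"))
}
```

Applied to `/home/tyler/../../dev/null`, this yields `/dev/null`, which is one way to bring a path into a comparable shape before handing it to a constructor that leaves `..` segments in place.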
The oddities of file URLs abound, and the RFC documents a lot of “quirks” about Windows drive letters and file URLs, which lead to irritating bugs like this one.
Url types are better than raw str types for working with URL-shaped data in
any Rust program. The additional structure is really important for many reasons.
However, using Url doesn’t absolve the developer of thinking about
user input where slashes are plentiful and path segments are goofy.
Personally, I was hoping that simply adopting Url would let me care less
about garbage input, but unfortunately more structured garbage is still
garbage.