While the http crate generally has a great API, I have been unsatisfied with how it handles URLs. To create an HTTP request a full URL is needed, with a scheme (http/https), an authority (example.org) and a path (/search?q=rust), but http does not enforce this and allows you to state only the path. This means both clients and servers are either unable to determine protocol and authority information or have to do this manually.
There is already the well-established rust-url crate, but it is not used by http. This has been criticized by multiple people, as the http crate does not even provide conversion functions for better ergonomics. As Armin Ronacher (mitsuhiko) puts it:
With the url crate we have a pretty damn good (new) standard compliant crate that is very popular. Right now I use this everywhere and then when doing requests to hyper I need to convert them which is not very great ergonomics and performance wise.
I’ve forked the http crate and created my own http-with-url demo. Each request is created with a complete URL:
A client can deconstruct the URL into pieces: in HTTP/1.1 www.google.com is the Host header, /search?q=rust is part of the request line and https tells the client to establish a secure connection.
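The deconstruction described above can be sketched with std alone; split_url is a hypothetical helper for illustration, a real client would use the url crate's parser:

```rust
/// Hypothetical helper splitting an absolute URL into scheme, authority
/// and path; a real implementation would use the url crate instead.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    // "https://www.google.com/search?q=rust" -> scheme plus the rest
    let (scheme, rest) = url.split_once("://")?;
    // Everything up to the first '/' is the authority; the rest is the path.
    match rest.find('/') {
        Some(i) => Some((scheme, &rest[..i], &rest[i..])),
        None => Some((scheme, rest, "/")),
    }
}
```

The three pieces map directly onto the request: authority into the Host header, path into the request line, scheme into the choice of plain or TLS connection.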
A server already knows the protocol and can build a complete URL from the Host header and the path. Currently the http crate leaves this task to application authors.
Some edge cases:
HTTP defines “server-wide” requests with a path value of *. They are rarely if ever used but can’t be expressed by the standard url crate. Instead I allow adding an extension called WholeServer to requests. If it is present the path of the URL is ignored and instead a server-wide request is sent.
Servers sometimes don’t know the host. The Host header field is only required in HTTP/1.1; the ancient HTTP/1.0 assumed that each host/domain had its own IP address, so the server knew the host for every request sent to a given IP address. As the host part can’t be omitted in an absolute URL, I propose to use a placeholder value like the broadcast IP 255.255.255.255.
HTTP demo client
To demonstrate how to use the http-with-url crate I have written a simple client.
It combines a few existing crates to parse HTTP (httparse), provide proven TLS encryption (native-tls) and of course the http crate itself. With these crates I found it quite easy to write the “glue” code to build a functioning HTTP client.
For a long time web developers have used JavaScript for all client-side scripting. But with the rise of asm.js and the upcoming WebAssembly it is possible to write applications in your language of choice.
Rust is particularly well suited for in-browser usage since it is easy to write correct application code, it is fast, highly portable and requires no garbage collection. Compilation to asm.js has been supported since before Rust 1.0, and since the 1.14 release WebAssembly is also supported (I did not try this).
As an exercise I have written a zoomable world map that uses vector data from OpenStreetMap through Mapzen to draw tiles in the browser. The vector data is supplied in protobuf format, and it is first decoded and then formatted as an SVG fragment with Rust. You can try out the app if you use a recent version of Firefox or Chrome. Note (2020): The Mapzen tile service is no longer available, therefore the map doesn't work anymore.
Vector tiles
OpenStreetMap is a collaborative wiki world map. Anyone can add and edit so-called features that represent geographic things like roads, water, buildings, state borders and many others. These features are represented by points (a pair of latitude and longitude, for example a statue), lines (a list of points, for example a road) and polygons (a closed line, for example a lake). The features are stored in a giant spatial database and can be queried, but there is no way to get all information for a given area except downloading the entire database, which consumes many GB of storage (to be fair, you can also download only the data for smaller areas like countries, but that is still hundreds of megabytes).
Here the Mapzen API helps by saying “just give me some coordinates and a zoom level and you get all relevant information to draw a beautiful map for this area”. To get some data you need to know the tile you are interested in. To simplify implementation and increase speed, all such services don’t accept coordinate ranges like “everything inside 49.9° N to 50.13° N and 8.41° E to 8.63° E” but expect you to specify quadratic tiles they can compute once and then cache for different zoom levels. Maptiler.org lets you actually view these tiles as an overlay to Google Maps. Another thing the Mapzen API does for you is cartographic generalization: if you are looking at a map of all of Europe you probably do not want to know where exactly the Zeil in Frankfurt is; a single point for a city is sufficient, and individual streets don’t matter at this zoom level. These generalizations cut down tile size and help you select what to display at which zoom level.
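The quadratic tiles mentioned above follow the standard “slippy map” addressing scheme used by OpenStreetMap-style services. A sketch of the usual latitude/longitude-to-tile formula (an illustration, not code from this project):

```rust
/// Standard slippy-map tile math: which tile contains a coordinate
/// at a given zoom level (2^zoom × 2^zoom tiles cover the world).
fn tile(lat_deg: f64, lon_deg: f64, zoom: u32) -> (u32, u32) {
    let n = 2f64.powi(zoom as i32);
    // Longitude maps linearly onto the x axis.
    let x = ((lon_deg + 180.0) / 360.0 * n).floor() as u32;
    // Latitude goes through the Web Mercator projection for the y axis.
    let lat = lat_deg.to_radians();
    let y = ((1.0 - (lat.tan() + 1.0 / lat.cos()).ln() / std::f64::consts::PI) / 2.0 * n)
        .floor() as u32;
    (x, y)
}
```

For example, the Frankfurt-area coordinates from the range above land on a single tile at zoom 10.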
But why do we even care about vector tiles? Can’t you just send some images across the web? Yes, this is possible and was done by all web maps (that I know of) until a few years ago, but it has several downsides:
Vector tile maps allow smooth animations when zooming or moving
no custom styles: a map style is rendered once and then served to all users; you just can’t change the color of streets or create a black-and-white map for data visualization.
overlays are not that nice: while you can certainly add some custom markers to the map, they will always look like you took a paper map, added some pins and drew on it with crayons. It works, but it is not pretty. With vector maps you can attach handlers and custom styles to anything on the map. Every building can be highlighted or clickable.
The Mapzen Vector Tile Service provides different serialization formats like GeoJSON, TopoJSON and Protocol Buffers and you can select layers that group different kinds of features together. Do not care about boundaries? Just exclude them from your query. (Even if you request all data you don’t need to draw them all, it is up to you.)
Processing tiles with Rust
A simple interface is used to process tiles: a function called process takes a single tile in protobuf binary format and returns a string representing an SVG fragment for the tile.
```rust
/// Reads a Vector File and produces an SVG fragment for a tile.
pub fn process<R: Read>(mut r: R) -> ProtobufResult<String> {
    let tile: Tile = protobuf::parse_from_reader(&mut r)?;
    let mut storage = Storage::new();
    for raw_layer in tile.get_layers() {
        let mut layer = Layer::new(raw_layer);
        layer.paint(&mut storage)?;
    }
    Ok(String::from(storage))
}
```
First the binary file is read with a generated rust-protobuf parser.
Next a store for the generated SVG is created. Text can be written to the store on different layers. Each layer is represented by an integer between 0 and 500. This property is present as sort_rank in the Mapzen data. Features with a low rank will be drawn first and will be eventually hidden by other features. For example a forest will have a low rank and a street crossing will have a higher rank.
The processing of the layers happens in the for-loop. Each layer is “painted” to the storage.
Last the storage is stringified and returned.
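The rank-ordered storage described above could look roughly like this; the internals of Storage here are an assumption for illustration, not the project's actual code:

```rust
use std::collections::BTreeMap;

/// Collects SVG fragments keyed by sort_rank (0..=500 in the Mapzen data).
/// Assumed shape, not the project's actual implementation.
struct Storage {
    layers: BTreeMap<u16, String>,
}

impl Storage {
    fn new() -> Storage {
        Storage { layers: BTreeMap::new() }
    }

    /// Append an SVG fragment at the given rank.
    fn write(&mut self, rank: u16, fragment: &str) {
        self.layers.entry(rank).or_default().push_str(fragment);
    }
}

impl From<Storage> for String {
    fn from(s: Storage) -> String {
        // BTreeMap iterates in key order: low ranks come first, so
        // higher-ranked features are emitted later and drawn on top.
        s.layers.into_values().collect()
    }
}
```

A BTreeMap gives the draw order for free: stringifying simply concatenates the fragments from the lowest rank (forest) to the highest (street crossing).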
Display a map in the browser
To call the Rust code from JS some wrappers are needed. On the Rust side, process should consume a binary array. This is represented in C-like fashion as a pointer and a length: p: *const u8, len: usize. The return value is a string; it can easily be a zero-terminated string since the fragment won’t contain internal NULs: *const c_char. The complete function header is: pub extern fn process_web(p: *const u8, len: usize) -> *const c_char
Another helper function is pub extern fn free_cstring_web(p: *mut c_char). It takes a pointer to a zero terminated string and frees it. Why doesn’t the code just use normal free()? The docs advise against it because if Rust frees the string it can assure the string is never used afterwards by zeroing the first byte. (Are there additional arguments against normal free?)
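Put together, the two wrappers described above can be sketched as follows; process here is a trivial stand-in for the real tile processor so the sketch is self-contained:

```rust
use std::ffi::CString;
use std::os::raw::c_char;
use std::slice;

// Stand-in for the real tile processor from earlier (an assumption
// to keep this sketch self-contained).
fn process(bytes: &[u8]) -> String {
    format!("<g><!-- {} bytes --></g>", bytes.len())
}

#[no_mangle]
pub extern "C" fn process_web(p: *const u8, len: usize) -> *const c_char {
    // Rebuild a byte slice from the pointer/length pair passed in from JS.
    let bytes = unsafe { slice::from_raw_parts(p, len) };
    let svg = process(bytes);
    // Hand ownership of a NUL-terminated copy to the caller.
    CString::new(svg).unwrap().into_raw()
}

#[no_mangle]
pub extern "C" fn free_cstring_web(p: *mut c_char) {
    if p.is_null() {
        return;
    }
    // Take ownership back so Rust frees the allocation.
    unsafe {
        drop(CString::from_raw(p));
    }
}
```

Every pointer returned by process_web must eventually be passed to free_cstring_web, otherwise the allocation leaks.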
To really make the functions callable from JS the following block is also needed:
```rust
#[cfg_attr(target_arch = "asmjs",
           link_args = "-s EXPORTED_FUNCTIONS=['_process_web','_free_cstring_web']")]
extern {}

/// asm.js expects a main function.
fn main() {}
```
The important part is the link_args: it tells the compiler to re-expose the named functions.
A JavaScript wrapper is used to call process: it takes a Blob, writes it to memory accessible by Rust code and then calls _process.
To call _process it needs to be described in JS: Module.cwrap('process_web', 'number', ['number', 'number']) It is a function with two arguments (both numbers for pointer and length) and returns another number which is a pointer to the SVG fragment.
After _process has been called, the result needs to be loaded into a JS string. Then both the array and the result string are freed. Forgetting this keeps them in memory and results in the script crashing with “not enough memory” after a few tiles have loaded.
The rest of the JavaScript fetches the data and binds to Leaflet. I moved the Rust code to a Web Worker because otherwise the web app freezes while processing a tile.
Summary
The project shows that it is possible to write ordinary Rust code to be used in web apps as long as the code does not need to work directly with the DOM. (DOM Integration is planned for a later release of WebAssembly.)
Usually the motivation to add Rust to a web app will be better performance. In this case the map still renders somewhat slowly when first visiting an area and is less responsive than the maps created by the Mapzen team with pure JavaScript. This is not Rust's fault: Mapzen made some different choices, like Canvas instead of SVG and better caching, and had a lot more time and knowledge to optimize their implementation.
The map still misses labels. Labels need to be shown above all other map content and they may span multiple tiles. For this reason the tiles must all be drawn to the same element so labels can be rendered without errors. Another possibility is to use Canvas for the output like the original Mapzen tools, but this requires calling into JS and will be slow with Rust. Canvas is better than SVG in this case because with SVG the browser needs to keep track of all the nodes of the map. Unfortunately I was not able to run compiled WebAssembly binaries in Firefox, but they will reduce file size and source code parsing time considerably.
It was a fun project since it combines two of my interests: Rust and mapping. Although using Rust did not really pay off in terms of faster map load times (it turns out parsing a binary tile and emitting SVG takes its time even with Rust; caching SVG was the biggest visible improvement), I learned a lot.
Today Manishearth blogged about mitigating underhandedness by fuzzing your code. Underhanded code contains a backdoor that is hidden by the syntax or semantics of the programming language and should not be visible even to a cautious reviewer. Fuzzing tests a program with random (or not so random) inputs and reports found bugs like memory access errors or panics. Simple fuzzing uses random inputs, but modern tools inspect the executable and modify the input to try out more branches and check obscure edge cases. Like all kinds of testing, fuzzing is a good way to find underhanded code, but also many other kinds of bugs. Unlike with unit tests, no one needs to write all kinds of test cases with a given input and an expected output, only to still miss some bugs because of misconceptions about the topic and the impossibility of testing every path. Of course fuzzing can’t find every kind of error: wrong behavior that does not cause illegal memory access or panics stays undetected. The article contains a tutorial on how to use cargo fuzz, a new command-line tool for simple fuzzing of Rust crates. If you haven’t already, read the article now and try it with one of your own crates (or one from crates.io).
The crate I selected for fuzzing is httpdate. I blogged about it before. It is very simple: it parses a date found in an HTTP header to a SystemTime and is also able to format a time for a header.
With the command line utility a fuzzing folder is created. A fuzzer needs a function to test; it has to take a slice of bytes as input and test it. The function I wrote is very simple: it tries to read the bytes as a UTF-8 string and, if this fails, does nothing. Otherwise it feeds the input to parse_http_date. If this function panics we have found a bug! In case the fuzzer actually produced a valid timestamp, it also formats the time again (and checks that a non-empty string is produced).
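The shape of that fuzz function looks roughly like this; parse_http_date and fmt_http_date are trivial placeholders here standing in for the httpdate functions, so the sketch is self-contained:

```rust
// Placeholders standing in for httpdate's real functions.
fn parse_http_date(s: &str) -> Result<u64, ()> {
    s.trim().parse::<u64>().map_err(|_| ())
}
fn fmt_http_date(t: u64) -> String {
    t.to_string()
}

/// The body a cargo-fuzz target would run on every generated input.
pub fn fuzz_one(data: &[u8]) {
    // Non-UTF-8 input cannot be a date string; just ignore it.
    if let Ok(s) = std::str::from_utf8(data) {
        // A panic anywhere in here is a bug the fuzzer has found.
        if let Ok(t) = parse_http_date(s) {
            // Round-trip: formatting a parsed time must give a non-empty string.
            assert!(!fmt_http_date(t).is_empty());
        }
    }
}
```

In a real fuzz target this body would sit inside cargo-fuzz's target macro with the real httpdate functions substituted in.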
The whole logic is really simple and I had a few tests, so I did not really expect many crashes. But I was proven wrong: the fuzzer immediately tried to parse some Ùníçôde characters as a date. This crashed the parser, which works with string slices. Take the string "foobar". Slicing it with s[3..5] gives "ba". But if you try the same with "später" it will panic with a message that 3 is not a character boundary. This is because 'ä' takes up two bytes instead of one like ASCII symbols, and Rust's slice indices work on bytes, not codepoints. The fix was simple: just check before parsing whether s.is_ascii() is true. I recompiled the code and started the fuzzer again (cargo fuzz stops after the first bug found). This turned up another bug: dates containing something like “May 35” were accepted because I never checked whether the individual values were in a valid range! This caused some arithmetic under- and overflows in the conversion from a date with year-month-day-hour-minute-second to seconds since the epoch. Now I check that all numbers are in a valid range. This is a quite obvious bug; I should have been able to spot it earlier. After this the fuzzer found no further bugs. I just added some checks that all required whitespace is present in the date strings and then published a bugfix release 0.2.1 to crates.io.
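The boundary behavior is easy to reproduce with plain std; this is exactly the failure mode the fuzzer triggered:

```rust
fn main() {
    let s = "foobar";
    assert_eq!(&s[3..5], "ba"); // fine: every char is one byte

    let t = "später";
    // 'ä' occupies bytes 2 and 3, so byte index 3 is inside a character.
    assert!(!t.is_char_boundary(3));
    // &t[3..5] would panic: "byte index 3 is not a char boundary".

    // The fix: only slice by raw byte positions after checking for ASCII.
    assert!(s.is_ascii());
    assert!(!t.is_ascii());
}
```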
Why the title? While many Rustaceans say “if it compiles, it works”, and this is mostly true, the code will still contain errors. Since fuzzing in Rust is now as easy as writing a single function, every library that works with inputs should be fuzzed. In the next days or weeks I will try to find bugs in other crates from the Rust ecosystem and write about my findings.
If you also want to fuzz a crate but don’t know which one, the crates listed for the keyword “parser” are usually simple. If they have already been fuzzed in the past, for example with afl.rs, you will most likely not find anything, but otherwise I am certain there will be crashes in many crates.
TL;DR: I’ve written a httpdate
crate to parse and format dates found in HTTP. It has no dependencies and uses
std::time::SystemTime to store all times.
HTTP communicates some timestamps. To do this it uses an old format called
IMF-fixdate: Sat, 15 Oct 2016 11:48:13 GMT. (For compatibility with ancient
endpoints it supports parsing two more formats.) While Rust has the excellent
chrono crate and the deprecated but usable time crate (chrono still
depends on it) for time handling they are both overkill for the task. These
crates do a lot of things that are generally useful if you work with dates but
in this case are not needed like timezones, extensive formatting and parsing
methods and a lot more.
I wrote a small crate called httpdate to parse a timestamp to a
std::time::SystemTime and format it as required for HTTP. It is both less code
(more than ten thousand lines vs about 300 lines) so it compiles in an instant
and it is faster because it is written specially for HTTP timestamps (not that
it would matter in normal use but some micro-benchmarks will be faster).
Usage
Using httpdate is very straightforward:
With parse_http_date(s: &str) -> Result<SystemTime, ()> a date from HTTP is parsed.
The function tries to parse the string in all supported formats.
With fmt_http_date(d: SystemTime) -> String a system time is formatted
to an IMF-fixdate.
Using std::time::SystemTime
Rust's standard system time is used to store the timestamps from HTTP internally.
From the documentation:
Although a SystemTime cannot be directly inspected, the UNIX_EPOCH
constant is provided in this module as an anchor in time to learn information
about a SystemTime. By calculating the duration from this fixed point in
time, a SystemTime can be converted to a human-readable time, or perhaps
some other string representation.
To use a SystemTime I first convert it to the number of seconds since the
epoch (sub-second precision is discarded in HTTP):
```rust
let secs_since_epoch = v            // the system time
    .duration_since(UNIX_EPOCH)     // get the duration from the epoch
    .unwrap()                       // Dates prior to 1970 are not expected:
                                    // common HTTP dates are either the current
                                    // date, last-modified dates or events in
                                    // the future like cache expiration.
                                    // Only if you do time-travel you may have
                                    // problems.
    .as_secs();                     // Convert the duration to an integer.
```
A SystemTime is constructed by adding a duration to the epoch:
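In std terms this is just:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Reconstruct a SystemTime from whole seconds since the epoch.
fn from_secs_since_epoch(secs: u64) -> SystemTime {
    UNIX_EPOCH + Duration::from_secs(secs)
}
```

Round-tripping through duration_since(UNIX_EPOCH) gives back the same second count.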
The crate converts strings to an internal DateTime struct first. This structure has
fields for time (second, minute, hour) and date (day, month, year, day of week). The
result is then converted to a SystemTime. To get a HTTP date from a SystemTime
the above steps are applied the other way round, but calculating the day for a given moment
is more complex.
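One part of that calculation can be shown compactly: the day of the week falls out of a simple modulo, because 1970-01-01 (the epoch) was a Thursday. A sketch for illustration, not necessarily the crate's code:

```rust
/// Day of week for a timestamp, using the fact that the Unix epoch
/// (1970-01-01) was a Thursday.
fn weekday(secs_since_epoch: u64) -> &'static str {
    const DAYS: [&str; 7] = ["Thu", "Fri", "Sat", "Sun", "Mon", "Tue", "Wed"];
    // 86400 seconds per day; the remainder modulo 7 picks the weekday.
    DAYS[((secs_since_epoch / 86400) % 7) as usize]
}
```

The IMF-fixdate example from above, Sat, 15 Oct 2016 11:48:13 GMT, corresponds to 1476532093 seconds since the epoch and indeed lands on "Sat".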
I’ve decided to give tokio, the new “network application framework” based on
mio, a try and write an echo server.
RFC 862 is just one page and the section describing the TCP echo server is
just four lines long, so I paste it here in its entirety:
One echo service is defined as a connection based application on TCP.
A server listens for TCP connections on TCP port 7. Once a
connection is established any data received is sent back. This
continues until the calling user terminates the connection.
The main function first creates a Reactor. The reactor manages the
connections with the clients. The server should listen to all connections
on port 7000 (port 7 requires root, so I’ve changed it). Echo::from_stream
constructs a new task for a TCP stream. It is called by the reactor for each
connection. Finally the reactor is run; it now listens for all incoming connections.
Echo is a simple struct. It contains a TCP stream and a buffer of 128 bytes to
store data sent by the client.
```rust
struct Echo {
    buf: Vec<u8>,
    stream: TcpStream,
}
```
A tokio TcpStream behaves a little differently than a blocking TcpStream from the standard library. It returns data as long as there are ready bytes. If there are no more bytes left to read, it returns an io::Error of kind WouldBlock. If the socket was closed and there are no more bytes to read, it reports that it successfully read zero bytes. The non-blocking nature of the stream renders some methods of std::io::Read useless. For example one cannot use read_to_end, since the stream may block at any time before being closed. On the other hand, a read_to_block function that reads until the next block would be useful. In the echo server one could then omit the loop and just read all contents from the stream and write them to the output.
As Echo is a tokio task it shall implement the Task trait. The method tick
is called by the reactor whenever the TCP stream contains data. Here the method
reads all data from the stream and writes it back to the stream immediately.
If there are no more bytes to read the task is terminated by returning
Tick::Final, it won’t be called again by tokio. If some bytes were read all
of them are written to the TCP stream. If there is a WouldBlock error the
reactor is informed that the task has currently no more work to do but should
be called again if the server receives more bytes on the connection.
This is the complete code minus some imports and the trivial from_stream
constructor of Echo.
To test the echo server use $ telnet localhost 7000 and enter some text. It will
be promptly returned by the server.
Nitpicking about Tokio
The echo server scratches just the surface of Tokio, but I really like how Tokio
abstracts over Mio. It is a lot simpler than rotor, another abstraction crate
for mio.
Creating a server is straightforward and one needs to only implement a single
trait method to create a simple task. The documentation is good considering this
is a new project but the API still needs more polish. In no particular order:
It would be nice if listen, TcpStream and others would follow std and
accept all types implementing ToSocketAddrs so one can write
TcpStream::connect("127.0.0.1:7") and does not need to parse the addresses.
tokio::io::TcpStream has both read and write methods itself and
implements std::io::Read and Write. The two sets of methods do the same thing.
I am not sure if I was supposed to use try_read instead of read. It should
make working with non-blocking resources easier, but it only splits out
the WouldBlock error case. Its return type is Result<Option<usize>>. One
checks if there is an error, then checks for None, which indicates that the
resource would block, and then reads the length. I found it more concise to
write Err(ref e) if e.kind() == ErrorKind::WouldBlock inside a match statement
to test if a resource would block; it also reduces rightward drift.
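With std's Read trait the pattern looks like this; read_step is a hypothetical helper that mirrors what try_read reports, written as a plain match (a sketch, not tokio's code):

```rust
use std::io::{ErrorKind, Read};

/// Ok(Some(n)): n bytes read (0 means the peer closed the connection).
/// Ok(None): the resource would block; try again later.
/// Err(e): a real I/O error.
fn read_step<R: Read>(r: &mut R, buf: &mut [u8]) -> std::io::Result<Option<usize>> {
    match r.read(buf) {
        Ok(n) => Ok(Some(n)),
        // The one extra arm that try_read's Option encodes for you.
        Err(ref e) if e.kind() == ErrorKind::WouldBlock => Ok(None),
        Err(e) => Err(e),
    }
}
```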
The next protocol I am going to implement will be more complex (really!) than
echo but it will be a fun task to do it with tokio.
By the way this is my first blog post, please write suggestions to improve this
text on Reddit.
Update
As Carl Lerche commented on Reddit the Echo server should not use
write_all() because tokio may raise a WouldBlock error while writing the message.
In this case you can't recover from the error because you don't know how many bytes
were written before tokio blocked.
For this reason the server needs to keep additional state between calls to the tick function.
It needs to remember if there are pending bytes, and if this is the case where in the
buffer they are located. A pending: Range<usize>, needs to be added to the Echo struct.
If the start and the end of the range are at the same position there are no pending bytes
left. This is also the initial state: pending: 0..0. If there are pending bytes waiting to be written
the range gives the position of the remaining bytes in the buffer. Although the buffer is always
filled starting at the first byte, partial writes may occur: only the first x bytes get written,
and the pending bytes end up in the middle of the buffer.
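The bookkeeping for that range can be sketched in isolation (std-only; the real Echo struct of course also holds the tokio TcpStream, and the method names here are assumptions):

```rust
use std::ops::Range;

/// Std-only sketch of the pending-bytes bookkeeping; the real Echo
/// struct also contains the tokio TcpStream.
struct Echo {
    buf: Vec<u8>,
    pending: Range<usize>,
}

impl Echo {
    fn new() -> Echo {
        // 0..0 is empty: nothing is waiting to be written.
        Echo { buf: vec![0; 128], pending: 0..0 }
    }

    /// A read placed `n` fresh bytes at the start of the buffer.
    fn filled(&mut self, n: usize) {
        self.pending = 0..n;
    }

    /// A (possibly partial) write consumed `n` pending bytes;
    /// returns true once nothing is left to write.
    fn wrote(&mut self, n: usize) -> bool {
        self.pending.start += n;
        self.pending.is_empty()
    }
}
```

After a partial write only the start of the range moves, so the remaining bytes sit in the middle of the buffer exactly as described above.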
The new body of the tick() function looks like this:
The match statement is called in a loop to do as much work as possible. First, if there are no pending bytes, some text is read from the TCP stream and the range of the bytes read is stored in the pending value. If there are pending bytes, the server tries to write them to the TCP stream and the range of the remaining pending bytes is updated. If the stream would block, the function returns; tick will be called again when the stream is unblocked. If zero bytes were read, the connection was closed and Tick::Final tells tokio that it should remove the task.
Obviously the code is now a bit longer than 35 lines, but at least it is correct. ☺
The complete code
This is the updated version of the code without write_all().