Encoding Time Series Data
I spent just shy of 9 years working on a suite of products that stored and processed environmental sensor data. This kind of data is naturally stored and processed as a time series: a set of data points, each with a time, a value, and often additional fields. Time series are common for sensor data, but they can represent pretty much anything measured over time, such as prices, performance metrics, and analytical data.
There is a lot of subtlety to working with time series data, especially if you are concerned with scientific accuracy. One surprising challenge, which I’ll be discussing in this post, is storing and transmitting the data efficiently.
At my previous company, the data we dealt with was relatively low frequency (typically a reading every 5-15 minutes), but some of the data sets went back into the 1800s! Although the data is simple in structure, the size can get surprisingly big when you need to move around millions of points.
Test application
I built a simple app to measure the transmission size of the data in the different variations discussed in this post. You don’t need to look at it to follow along, but you might find it interesting. The source code is available here: https://github.com/jessemcdowell/time-series-encoding-test
The test application is written in TypeScript and uses Node.js, Express, and Axios. It took some fiddling to control compression and to measure the transmission size (as opposed to the total size of the data, which would normally be the more important number). The numbers below also include other transmission overhead like headers and chunked encoding. This skews them a bit, but the results are still more than good enough to illustrate the differences between approaches.
In application
A typical way to store a point in application memory is as an array of simple structures/objects. With an 8-byte integer for the timestamp, and an 8-byte floating point for the value, you can represent each point with 16 bytes. This will have enough precision for most use cases.
For one million points packed in an efficient array, this would consume 16 million bytes of memory. I’m going to use this as a baseline for evaluating other options.
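As an aside, a plain array of JavaScript objects won't actually hit that 16-byte figure, because each object carries its own overhead. A pair of typed arrays will, though. Here's a minimal sketch (the names and the millisecond-epoch convention are assumptions for illustration):

```typescript
// Roughly 16 bytes per point: an 8-byte timestamp and an 8-byte value.
const POINT_COUNT = 1_000_000;

const times = new BigInt64Array(POINT_COUNT);  // e.g. milliseconds since the Unix epoch
const values = new Float64Array(POINT_COUNT);  // 8-byte floating point values

// Filling in the first point, just to show usage.
times[0] = BigInt(Date.now());
values[0] = 21.4503;
```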
This is an efficient, convenient way to handle the data in an application, but chances are you will want to write it to disk, or return it from a web service. It is possible to send this binary representation over the wire, but there are some drawbacks to consider. I’ll come back to this.
JSON
JSON is a common way to transmit data between server and client or server and server. A simple representation might look like the following:
```json
[
  { "time": "2023-06-01T00:00:00.000Z", "value": 21.4503 },
  { "time": "2023-06-01T00:05:00.000Z", "value": 21.4621 },
  { "time": "2023-06-01T00:10:00.000Z", "value": 21.4098 }
]
```
This is just three points, and it already takes quite a few bytes. Excluding the whitespace (included above for readability), it's still 157 bytes, with each point taking 52 bytes. At that rate, 1 million points takes about 52 million bytes (roughly 50 MB)!
Here’s how it breaks down:
- 19 bytes - object wrapper, commas, and property names
- 26 bytes - time string and its quotes
- 7 bytes - the value
52 bytes is a lot more than the 16 we need in application memory: more than 200% overhead! It also has lower precision than the in-memory version can support. It is possible to use a less precise timestamp format, or to use shorter property names (like `t` and `v`). Those would help a bit, but we can do better.
CSV
CSV has far fewer wrapper characters, and doesn't need to repeat the property names for each point. Here are the same values as the JSON example:
```csv
time,value
2023-06-01T00:00:00.000Z,21.4503
2023-06-01T00:05:00.000Z,21.4621
2023-06-01T00:10:00.000Z,21.4098
```
The total is 109 bytes (or 112 with CRLF line endings). The header is just 11 bytes, and can be omitted if desired. Each point is 33 bytes, which is better than the 52 of JSON, but more than double that of our 16 bytes in memory (106% overhead). 1 million points takes just under 36 million bytes.
There are other advantages to CSV as well. It's easy to produce and easy to parse, and it can handle multiple values or other kinds of data for each timestamp. It's more readable in debugging tools, and many applications can consume it more readily than JSON. It can also be extended by adding columns, though it can't naturally handle structured data the way JSON can.
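To show how little work producing it takes, here's a rough sketch of serializing points to CSV (the column layout matches the example above; the test application's implementation may differ):

```typescript
// Serializes points into CSV rows with a "time,value" header.
function toCsv(points: { time: string; value: number }[]): string {
  const rows = points.map((p) => `${p.time},${p.value}`);
  return ["time,value", ...rows].join("\n");
}
```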
Custom binary format
A time series can be stored or transmitted in a binary format very similar to the way it would be stored in application memory. You can encode it however you want, but my simple implementation takes 16 million bytes which is considerably better than both text formats discussed above. There is some added complexity to this approach, however. In particular, you will need your own versioning strategy if you ever need to change the structure.
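Here's a sketch of what that packing could look like in TypeScript, using the same 8-byte timestamp and 8-byte value layout. The interface, field order, and endianness are choices I've made up for illustration, not the format from the test application:

```typescript
// Hypothetical point shape: 8-byte integer timestamp plus 8-byte double value.
interface Point {
  time: bigint;   // e.g. milliseconds since the Unix epoch
  value: number;
}

// Writes interleaved time/value pairs, 16 bytes per point.
function encodePoints(points: Point[]): ArrayBuffer {
  const buffer = new ArrayBuffer(points.length * 16);
  const view = new DataView(buffer);
  points.forEach((point, i) => {
    view.setBigInt64(i * 16, point.time, true);      // little-endian timestamp
    view.setFloat64(i * 16 + 8, point.value, true);  // little-endian value
  });
  return buffer;
}
```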
You may also have some difficulty getting your web server and client to transmit and receive the data correctly. An easy way to get around this is to re-pack the binary data into something like Base64 and send that through a normal string property. This is especially helpful if you have a bunch of different time series or other related data you want to return in the same response. The drawback to this approach is that Base64 adds about 33% overhead, so the same million points from before would take 21.3 million bytes instead.
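The re-pack itself is a one-liner in Node (a sketch; `encodePoints` and `points` are the hypothetical helpers from the previous example):

```typescript
// Re-packing the binary data as Base64 so it can travel inside a JSON response.
const base64 = Buffer.from(encodePoints(points)).toString("base64");

// Illustrative response shape: the series travels as an ordinary string property.
const responseBody = JSON.stringify({ series: base64 });
```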
Unpacking precise timestamps in JavaScript
Another bit of complexity you might encounter with binary data, or any kind of data really, is JavaScript's inherent limitation when dealing with 64-bit integers: its numbers are 64-bit floats with only 53 bits of integer precision. For most applications this probably won't matter. At my previous company, however, it was a significant and memorable challenge. Some of our features required quoting exact timestamps back to the server, and even a single millisecond of lost precision broke things.
Another challenge was different timestamp representations on the client and server. Our server (C# and C++ based) used a single 64-bit integer to represent 100-nanosecond intervals. Our client (JavaScript) used two number properties (one to store seconds, and a second to store nanoseconds). When unpacking the binary data, every timestamp had to be divided by 10,000,000 to get seconds, with the remainder multiplied by 100 for nanoseconds. Doing this with floating point in native JavaScript was very slow. It took some frustrating research and experimentation, but I got the performance gains I needed using a bit of WebAssembly.
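For illustration, here's roughly what that unpacking looks like today using BigInt. This is only a sketch of the conversion; whether it's fast enough depends on your data volumes, and in our case we ended up in WebAssembly:

```typescript
// Splitting 64-bit timestamps (100-nanosecond ticks) into seconds + nanoseconds.
const TICKS_PER_SECOND = 10_000_000n;  // 10^7 ticks of 100 ns each

function splitTicks(ticks: bigint): { seconds: number; nanoseconds: number } {
  const seconds = ticks / TICKS_PER_SECOND;      // BigInt division truncates
  const remainder = ticks % TICKS_PER_SECOND;    // leftover ticks within the second
  return {
    seconds: Number(seconds),
    nanoseconds: Number(remainder) * 100,        // each tick is 100 ns
  };
}
```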
Standardized binary format
Another option is to use a common binary format to transmit the data. MsgPack is very similar to JSON, but a bit smaller and faster. It also has the same drawback with time series data: it includes the key names with every single point. A MsgPack equivalent of the million points takes around 28 million bytes. This is 76% overhead compared to the 16 million bytes in memory.
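Using MsgPack from TypeScript is about as easy as JSON. Here's a sketch using the @msgpack/msgpack package (the test application may use a different library):

```typescript
// Encoding points with MsgPack; the keys are still written for every point.
import { encode, decode } from "@msgpack/msgpack";

const points = [
  { time: "2023-06-01T00:00:00.000Z", value: 21.4503 },
  { time: "2023-06-01T00:05:00.000Z", value: 21.4621 },
];

const bytes = encode(points);        // Uint8Array, ready to transmit
const roundTripped = decode(bytes);  // back to plain objects
```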
Protobuf (the binary protocol used in gRPC) does not include keys in the data, but it is harder to use: you have to define a schema that is kept in sync between your client and server. The same million points takes around 14 million bytes in Protobuf. This is a modest 11% less than the custom binary format, but with more safeguards and flexibility.
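For reference, a schema for this kind of payload could look something like the following. This is only a sketch; the schema in the test application may differ, and the encoding choices (varint timestamps versus fixed-width fields) affect the final size:

```proto
syntax = "proto3";

message TimeSeries {
  // proto3 packs repeated scalar fields by default. Varint-encoded timestamps
  // (e.g. epoch milliseconds) can take fewer than 8 bytes each, which helps the
  // payload come in under a fixed 16-bytes-per-point layout.
  repeated int64 times = 1;
  repeated double values = 2;
}
```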
Compressed binary
Gzip compression is an easy way to reduce the size of the data being transmitted. Even better, this is a feature you probably already have enabled on your web servers and clients.
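In an Express app, for example, it can be as simple as turning on the standard compression middleware (assuming the `compression` package; most web servers and reverse proxies have an equivalent switch):

```typescript
// Enabling gzip for Express responses.
import express from "express";
import compression from "compression";

const app = express();
app.use(compression());  // compresses responses when the client advertises support
```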
I tested a few configurations of compressed binary data to see how they performed. One interesting discovery was that the compression was noticeably better if I arranged the data in sets (`time1-time2-time3-value1-value2-value3`) instead of pairs (`time1-value1-time2-value2-time3-value3`). I assume this is because of how gzip handles adjacent similarities. Here are the results for similar 1-million-point blocks:
- Binary in any arrangement with no compression: 16 million bytes
- Binary pairs (`time-value-time-value`) with gzip: 10.3 million bytes
- Binary sets (`time-time-value-value`) with gzip: 8.7 million bytes
- Binary pairs, encoded in Base64, with gzip: 10.8 million bytes
- Binary sets, encoded in Base64, with gzip: 10.1 million bytes
I also tested MsgPack and Protobuf with gzip. Even though they are arranged in pairs, they performed slightly better:
- MsgPack: 9.8 million bytes
- Protobuf: 8.2 million bytes
One drawback to sets over pairs is that you need all the points available before you can send them in sequence. If you had a very large number of points and wanted to stream them incrementally, you would need to develop your own chunk-capable format. This is also a drawback of MsgPack, which writes the size of each array at its start. I don't know if Protobuf has this limitation too.
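To make the set arrangement concrete, here's a sketch of packing all the timestamps, then all the values, and gzipping the result with Node's zlib. The layout details here are mine, not necessarily the test application's:

```typescript
// Packs points as "sets" (all timestamps first, then all values) and gzips them.
import { gzipSync } from "zlib";

function encodeAsSetsAndGzip(times: bigint[], values: number[]): Buffer {
  const count = values.length;
  const buffer = Buffer.alloc(count * 16);
  for (let i = 0; i < count; i++) {
    buffer.writeBigInt64LE(times[i], i * 8);             // all timestamps first
    buffer.writeDoubleLE(values[i], count * 8 + i * 8);  // then all values
  }
  return gzipSync(buffer);
}
```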
Compressed CSV and JSON
Compression is usually pretty effective wherever you have lots of repeating sequences, and that is certainly the case with JSON and CSV time series data. The same million points packed in JSON took 7.5 million bytes when gzipped. That's a compression factor of nearly 7 compared to the plain JSON, and more than 2x better than straight binary!
A million points of CSV compressed to 6.9 million bytes, which is even better. The compression factor is not as impressive, but the overall size is smaller. This is the smallest of all the options in this post.
Downsampling
Another great way to reduce the size of a time series is to send fewer points. Downsampling (sometimes called decimation) is a technique that excludes some or most of the points, and it can have a tremendous impact on the size of the data and the performance of your applications.
If you use this technique for data that is being graphed, the user might not even notice. For example, if you are displaying a year of data in a spacious 1200-pixel-wide area, you only have about 3.3 pixels of width per day. If you have a point per hour, most of those 24 points won't change the image at all and can be safely excluded. If you had a point every minute, you could exclude almost all the points!
There are a few ways to do this. The best technique will depend on your use case and the correct interpolation method for the data. I’m not going to get into those in this post. The result of downsampling is still a time series, so you should be able to use any of the techniques already discussed to transmit the downsampled data efficiently.
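As a trivial illustration, a naive every-Nth-point downsample looks like this. A real implementation should pick points based on the correct interpolation method for the data, so treat this as a sketch of the shape of the operation rather than a recommendation:

```typescript
// A deliberately naive downsample that keeps every Nth point.
function downsampleEveryNth<T>(points: T[], step: number): T[] {
  return points.filter((_, index) => index % step === 0);
}

// e.g. reduce a year of hourly readings to roughly one point per day
const hourly = Array.from({ length: 24 * 365 }, (_, i) => ({ hour: i, value: Math.sin(i / 24) }));
const daily = downsampleEveryNth(hourly, 24);  // 365 points
```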
Conclusion
Time series data can be larger than you expect, but there are lots of ways to store and transmit it more efficiently. For most projects, I would start with gzip-compressed CSV and use downsampling where possible. For cases with more demanding requirements, there are plenty of other options to consider.
My testing and discussion were purely focused on the size of the data. Fewer bytes will generally lead to better performance, but in some situations, serialization speed or memory utilization might be more important. If that matters to you, you should work with an architect to find the best approach.