Any time you store data there are trade-offs to consider. I wanted to run sqlChain on a low-cost VPS where every GB pushes up the monthly fee. At a volunteer level I really couldn't see paying for the 300 GB of SSD disk space that several APIs reportedly demand. On Amazon EC2 that would be $30/month just for data. No big deal when you have a business model, but kind of costly if you are paying from spare cash to "support the community". I'm also developing on an old laptop with very modest resources, and I wanted sqlChain to be usable without requiring high-end hardware.

In building sqlChain I chose to reduce data size by eliminating redundancies and using a few tricks, at the cost of no longer being able to query all values directly in SQL. Some data requires external scripting to access. For the purpose of providing an API layer I think this has worked well, but some analysis cannot be done with SQL alone. C'est la vie. This article discusses these trade-offs.

The first thing is not storing textual or hex versions of data. That's obvious, and it comes at a slight cost in query simplicity because you have to convert values both when inserting and when selecting. In sqlChain, addresses are stored as 20-byte binary pubkey hashes, and tx/block hashes are stored as 32-byte binary values.
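
For illustration, the conversions look roughly like this sketch (not sqlChain's actual code, and ignoring anything beyond plain base58check addresses):

    import hashlib

    B58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

    def addr_to_pkh(addr):
        # Decode a base58check address (1- or 3-) to its 20-byte pubkey hash
        n = 0
        for c in addr:
            n = n*58 + B58.index(c)
        raw = n.to_bytes(25, 'big')   # version byte + 20-byte hash + 4-byte checksum
        if hashlib.sha256(hashlib.sha256(raw[:21]).digest()).digest()[:4] != raw[21:]:
            raise ValueError('bad address checksum')
        return raw[1:21]

    def hash_to_bin(hexhash):
        # Tx/block hashes arrive from RPC as hex strings; store the raw 32 bytes
        return bytes.fromhex(hexhash)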

To get results in a time useful for an API you need to index the data. This is probably the biggest cost, so choosing what and how you index greatly affects both size and speed. I looked for ways to reduce indexed column size. Instead of indexing on 32-byte hash values I use a scheme that truncates hashes to 5-byte IDs, converted to decimal values requiring only 6 or 8 bytes. This means IDs may not be unique, a small trade-off in speed: on insert you check an ID and increment it until an unused value is found. It makes queries only slightly more complex. Instead of a query like "select * from trxs where hash=%s;" you use "select * from trxs where id>=%s and hash=%s limit 1;", with an index built only on the id column. I use this for transaction ids, address ids, and output ids, with the further constraint that output ids are derived from transaction ids, cutting down again on indexing.
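
Here's a self-contained sketch of how such an insert and lookup work, using an in-memory sqlite3 table as a stand-in for the real schema (the names are illustrative, not sqlChain's actual code):

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('create table trxs (id integer primary key, hash blob)')

    def hash_to_id(txhash):
        # Truncate the 32-byte binary hash to its first 5 bytes, as a decimal integer
        return int.from_bytes(txhash[:5], 'big')

    def insert_tx(txhash):
        txid = hash_to_id(txhash)
        # On (rare) collision, increment until an unused ID is found
        while db.execute('select 1 from trxs where id=?', (txid,)).fetchone():
            txid += 1
        db.execute('insert into trxs (id,hash) values (?,?)', (txid, txhash))
        return txid

    def find_tx(txhash):
        # id>= skips past any rows displaced by collisions; the hash
        # comparison picks out the exact match
        return db.execute('select id from trxs where id>=? and hash=? limit 1',
                          (hash_to_id(txhash), txhash)).fetchone()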

To see how well this works I did some testing on collision rates for the IDs, then revised the method until collisions were rare enough that you almost always get the first record indexed. In the last test I ran on transaction ids, only 1 in 41,000 records required even one increment. For address IDs I use a slightly different conversion that encodes the address type (1- or 3-) in the lowest ID bit (odd/even). Both address and transaction ids take 6 bytes, and output ids take 7 bytes. In all cases my index files are actually smaller than the data files, usually much smaller, which I consider a big win.
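
As a rough sketch of the odd/even idea (the exact derivation lives in lib/sqlchain.py and may differ):

    def addr_to_id(pkh, is_p2sh):
        # Truncate the 20-byte pubkey hash to 5 bytes, shift left one bit,
        # and use the low bit to mark the address type: even = 1- (p2pkh),
        # odd = 3- (p2sh). The result fits in the 6 bytes stored.
        return (int.from_bytes(pkh[:5], 'big') << 1) | (1 if is_p2sh else 0)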

The blockchain has tremendous data redundancies. Almost all input/output records contain standard values; the weird transactions that deviate are a very small percentage of the whole, so I looked for ways to encode these redundancies. The biggest space saver is the encoding of output scriptPK data, which is very often just the pubkey hash wrapped in a few extra script bytes. We're already storing the pubkey hash for each output as an ID value, so I can truncate all standard outputs to a single byte. If that byte is zero, it indicates no more data, and the pubkey hash and the ID's low bit are enough to derive back the original scriptPK. Very few scriptPKs have a non-zero length byte and store actual script data.
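
Decoding then looks something like this sketch (the opcode constants are the standard p2pkh/p2sh templates, but the storage details are simplified here):

    def decode_spk(stored, pkh, addr_id):
        # First stored byte is a script length; zero means "standard output,
        # rebuild the scriptPK from the pubkey hash and the ID's low bit"
        if stored[0] != 0:
            return stored[1:1+stored[0]]             # rare non-standard script, kept verbatim
        if addr_id & 1:                              # 3- address: p2sh
            return b'\xa9\x14' + pkh + b'\x87'       # OP_HASH160 <pkh> OP_EQUAL
        return b'\x76\xa9\x14' + pkh + b'\x88\xac'   # OP_DUP OP_HASH160 <pkh> OP_EQUALVERIFY OP_CHECKSIG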

The locktime and sequence values in each transaction and input record are likewise encoded as single bits, along with a variable input/output count, in a 1-byte header. This reduces data, and since these values rarely need to be indexed, not much is given up. If you wanted to count how many transactions have >1000 outputs, or a given locktime, you'd have to use a script instead of a SQL query.
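
One plausible layout for such a header byte (illustrative only; the actual bit assignments are in lib/sqlchain.py):

    STD_SEQ  = 0x80   # set when sequence is the default 0xffffffff
    STD_LOCK = 0x40   # set when locktime is zero

    def encode_hdr(sequence, locktime, io_count):
        # Flag bits for the common values, low 6 bits for a small in/out
        # count (larger counts would need an overflow escape, elided here)
        hdr = io_count & 0x3f
        if sequence == 0xffffffff:
            hdr |= STD_SEQ
        if locktime == 0:
            hdr |= STD_LOCK
        return hdr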

The biggest space saver of all is the input script data (sig, pubkey). I still carry this along for now, but will add a pruning option that can drop it. Obviously it's critical for verifying transactions, but once verified by bitcoind it rarely or never gets used again. It accounts for over 25 GB of data that in most cases can simply be dropped, with no loss of function other than spewing out interesting hex digits on API requests. Rather surprisingly, considering the size of the blockchain, the only data you actually need for spending, other than your private key, is the address, or more specifically the pubkey hash (pkh) bytes, linked to a specific output id (tx,n). These are all maintained in the outputs table as a pair (out_id, addr_id). It's deliberate that the table with the most records (outputs) has the smallest row size.
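
To make the output-id derivation concrete, something along these lines (MAX_OUTS is an assumed bound for this sketch; the real multiplier is in lib/sqlchain.py):

    MAX_OUTS = 4096   # assumed per-tx output bound, for illustration only

    def out_id(tx_id, n):
        # The output ID encodes its parent tx ID and output index n,
        # so no separate index column (or index on it) is needed
        return tx_id * MAX_OUTS + n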

The full SQL schema is in the GitHub repo, but only tells part of the story. The code to handle bit flags and ID values is in the sqlchain support module, lib/sqlchain.py.

