When I first started this project I thought I'd just whip up a small program to convert the raw blockchain data into a SQL database for running some queries, and possibly extend it to pull out data for some nifty animated data visualization sequences. My idea was to start with RPC calls and later patch on a front end so it could talk directly on the network.
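For the curious, pulling raw blocks over RPC amounts to a simple loop. Here's a minimal sketch of the idea, assuming a local Bitcoin Core node; the URL, credentials, and function names are placeholders rather than anything from my actual script:

```python
import requests

RPC_URL = "http://127.0.0.1:8332"      # Bitcoin Core's default RPC port
RPC_AUTH = ("rpcuser", "rpcpassword")  # placeholders; match your bitcoin.conf

def rpc(method, *params):
    """Issue one JSON-RPC call against the local node."""
    payload = {"jsonrpc": "1.0", "id": "convert",
               "method": method, "params": list(params)}
    resp = requests.post(RPC_URL, json=payload, auth=RPC_AUTH)
    resp.raise_for_status()
    return resp.json()["result"]

tip = rpc("getblockcount")
for height in range(tip + 1):
    blockhash = rpc("getblockhash", height)
    block = rpc("getblock", blockhash)  # dict of header fields plus txids
    # ... hand the block off to the SQL insertion code ...
```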
The problem quickly became processing time. While my code started out fabulously quick, converting blocks at a rate of 80,000 per second, it always degenerated into a sluggish crawl of 0.25 blocks per second long before it was done. At first I thought SQL indexing was the problem, and I looked at various ways to cut down on indices and avoid building them on the fly (see the sketch below). It wasn't too long before I determined that the real problem was the sheer number of transactions per block, and the number of linkages between them, that made progress so painful.
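The standard trick for bulk loads is to create the tables bare, insert everything, and only build the indices at the end, so each B-tree is constructed once rather than rewritten on every insert. A minimal sketch of the pattern with Python's sqlite3 module; the table and column names are made up for illustration, not my actual schema:

```python
import sqlite3

con = sqlite3.connect("blockchain.db")
con.execute("PRAGMA journal_mode = WAL")  # fewer fsyncs during the load
con.execute("PRAGMA synchronous = OFF")   # risky, but fine for a rebuildable DB

# A bare table: no indices or constraints to maintain while loading.
con.execute("CREATE TABLE IF NOT EXISTS txs (txid BLOB, blockheight INTEGER)")

def load(rows):
    """Bulk-insert (txid, blockheight) tuples in a single transaction."""
    with con:
        con.executemany("INSERT INTO txs VALUES (?, ?)", rows)

# Only after the full load, pay the indexing cost exactly once:
# con.execute("CREATE INDEX idx_txs_txid ON txs (txid)")
```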
I adapted my code in various ways. Initially I was working with SQLite3, and during the indexing-problem phase I concluded that SQLite3 can't handle large databases because of index rewriting. That's probably still true, but to a lesser extent than I thought. As I progressed and ran speed tests I saw that the actual query rate didn't drop off much at all; rather, the number of transactions per block, and of inputs/outputs per transaction, increased massively, and this was the underlying reason for the slowdown. But before that realization I'd already migrated to MySQL, hoping for better index handling. While MySQL was somewhat better, it didn't solve the problem of slow data conversion.
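For what it's worth, MySQL has its own knobs for the same bulk-load trick. For MyISAM tables you can suspend non-unique index maintenance during the load and rebuild in one pass afterwards; roughly like this, again with placeholder names and using the MySQLdb module:

```python
import MySQLdb

db = MySQLdb.connect(user="converter", passwd="secret", db="blockchain")
cur = db.cursor()

# Skip per-row constraint work while loading.
cur.execute("SET unique_checks = 0")
cur.execute("SET foreign_key_checks = 0")

# MyISAM only: stop maintaining non-unique indices until we're done.
cur.execute("ALTER TABLE txs DISABLE KEYS")

rows = [("00" * 32, 0)]  # illustrative (txid, blockheight) tuples
cur.executemany("INSERT INTO txs (txid, blockheight) VALUES (%s, %s)", rows)

# Rebuild the indices in one bulk pass instead of row by row.
cur.execute("ALTER TABLE txs ENABLE KEYS")
db.commit()
```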
The real problem is that SQL indexes are inherently based on ordering of data (B-trees), while the blockchain is primarily unordered data. Sure, the blocks are ordered, but the block-related data (headers) is a very small part of the overall mass, adding up to only around 30MB of the current 40GB that makes up the blockchain. Once you look inside blocks, the connectedness is primarily hashes, or key-value type linkages, and the connections are between input-output pairs that span the entire blockchain. Which, dummy me, is why the reference client uses LevelDB: hash maps and key-value indexing. And yet a SQL interface is so useful for asking the kinds of questions that are external to how the blockchain functions.
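To make that linkage concrete: every transaction input names the output it spends by the hash of an earlier transaction plus an output index, so resolving it is a pure key-value lookup that an ordered index gains nothing from. A toy sketch of the idea, with a plain dict standing in for LevelDB:

```python
# Map (txid, vout) -> output data. The key is a hash, so insertion
# order and sort order are meaningless; a hash map is the natural fit.
utxo = {}

def add_output(txid, vout, value, script):
    utxo[(txid, vout)] = (value, script)

def spend_input(prev_txid, prev_vout):
    # An input may reference an output from anywhere in the chain's
    # history, so there's no locality for a B-tree range scan to exploit.
    return utxo.pop((prev_txid, prev_vout))
```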
After all that messin' about, I do have a rather flabby conversion utility that will suck up blockchain data and run crazy with SQL insertions in spastic fits, building a full SQL database. At every step I tried to keep data size to a minimum, sometimes even at the cost of query convenience (more on this later). You see, I'd read that some SQL conversions were resulting in 300GB databases and needed high-performance servers with SSD RAID arrays. I didn't want to go down that road if at all possible, and needed to work with a regular sluggish hard disk using less space than the actual blockchain. At least that was the goal, so far sadly missed. Hint: don't try working with the blockchain in SQL format without an SSD. There are simply too many queries/insertions or, more basically, too many IOPS. I tried all sorts of ways to cut down on these IOPS. I wrote out unindexed data tables with intermediate table linkages to be "fixed up" later (sketched below). In the end, the cost of a fully connected, useful-to-query SQL database is a gazillion IOPS.
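The "fixed up later" idea was roughly this: write rows that reference transactions by raw hash with no lookups at load time, then run one big pass that swaps the hashes for compact integer ids. Something along these lines, with hypothetical table and column names:

```python
import MySQLdb

db = MySQLdb.connect(user="converter", passwd="secret", db="blockchain")
cur = db.cursor()

# Pass 1 wrote input rows carrying the raw previous-txid hash, doing no
# lookups at all. Pass 2 resolves every hash to an integer id in one
# join-driven update instead of millions of point queries.
cur.execute("""
    UPDATE inputs
    JOIN txs ON txs.txid = inputs.prev_txid_hash
    SET inputs.prev_tx_id = txs.id
""")

# With every row resolved, the fat 32-byte hash column can be dropped.
cur.execute("ALTER TABLE inputs DROP COLUMN prev_txid_hash")
db.commit()
```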
I made some efforts to run with the tables on a RAM disk (tmpfs), with good results. The problem was lack of RAM, with only 2GB or 4GB available on my current systems and the tables already adding up to around 10GB. You'll see code in the script to handle multi-pass conversions; this was to allow building the tables consecutively in RAM. It worked well, to an extent, until hitting the outputs table. Some more work followed using MERGE tables and building tables in slices (see the sketch below), but in the end most of this was too much hassle for only moderate gains.
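For anyone unfamiliar with them, MySQL MERGE tables present several identically structured MyISAM tables as one, so each slice can be built small enough to fit in RAM and queried as a whole afterwards. A rough sketch of the shape of it, with illustrative names:

```python
import MySQLdb

cur = MySQLdb.connect(user="converter", passwd="secret", db="blockchain").cursor()

# Build the outputs table in slices, each small enough to live on tmpfs.
slice_ddl = """
    CREATE TABLE outputs_%d (
        tx_id INT NOT NULL,
        n SMALLINT NOT NULL,
        value BIGINT NOT NULL
    ) ENGINE=MyISAM
"""
for i in range(4):  # four slices, for example
    cur.execute(slice_ddl % i)
    # ... bulk-load this slice while it still fits in RAM ...

# Stitch the slices together behind a single queryable table.
cur.execute("""
    CREATE TABLE outputs (
        tx_id INT NOT NULL,
        n SMALLINT NOT NULL,
        value BIGINT NOT NULL
    ) ENGINE=MERGE UNION=(outputs_0, outputs_1, outputs_2, outputs_3)
      INSERT_METHOD=LAST
""")
```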
The game changer was moving to my rather dated 30GB SSD (already burdened by the Ubuntu OS and various other project data, movies, photos, emails) and comparing it with the first hard disk conversions. Well, even this was a massive improvement. Whereas running on a hard disk slowed to a ridiculous 1 block per 20 seconds (with 150,000 blocks to go; aborted), my weary old SSD gave me an astonishing 0.3 blocks per second up until the last 10,000 blocks. And this is where I stand today - with a Python conversion script, and the burning desire to go buy a new SSD.
I'm waiting on more space and even more speed. Recent reviews of low-end drives promise 20,000-40,000 IOPS, compared with my current rather lame 600 IOPS. Either coincidentally, or perhaps because each database operation ultimately bottoms out in a disk operation, that just happens to be roughly the number of queries per second I get out of MySQL while running the converter.
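If queries really do map roughly one-to-one onto IOPS, the back-of-the-envelope math is encouraging. This is just my own extrapolation from the numbers above, not a benchmark:

```python
# Rough extrapolation from the figures in this post.
current_iops = 600   # the weary old 30GB SSD
observed_rate = 0.3  # blocks per second it actually achieved

ops_per_block = current_iops / observed_rate  # ~2000 operations per block

for new_iops in (20000, 40000):
    print("%d IOPS -> ~%.0f blocks/sec" % (new_iops, new_iops / ops_per_block))
# Suggests roughly 10-20 blocks/sec on a modern drive, if nothing
# else becomes the bottleneck first.
```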
Bottom line reached: buy a new SSD soon and revisit my conversion efforts. In the meantime I've pushed the in-progress script to my GitHub. Coming soon - SQL & SQLer Too, a more in-depth look at the schema I settled on, and how to use it for various reporting. Further sequels to follow on as-yet-unknown uses involving rendering OpenGL blockchain data visualizations destined to end up on YouTube.