Sunday, October 11, 2009

An idea for a new Storage Engine for MySQL, Part II

So, a few days ago I blogged about an idea I had for a new storage engine for MySQL. The MySQL API is C++, so it would require me to actually learn C++. Which I tried, but it hurt my eyes. So I decided to just go ahead and write a prototype in C#, to quickly test out the basic premise: reducing or eliminating the storing of multiple copies of the same data, using B-trees and hash values. (And gaining performance in the process.)

Since then I've spent 4 days in Istanbul, which was brilliant, but before I went I made a quick prototype, using .Net 4.0 beta 1. The preliminary results were encouraging enough to motivate me to polish it up a bit. In essence, I took 70MB of logs and imported them into the storage engine in 18 seconds, resulting in about 22MB on disk. (That is 32-bit aligned, mind you, so if size were the only concern it could be trimmed down quite a bit.) It's fairly simplistic, and for now everything is stored as byte arrays; it doesn't even support deleting data. But with a little brain-work, I guess it could be a useful way of getting quick and simple permanent storage on .Net. (I really don't like the idea of requiring SQL Server for trivial stuff.)

Not the first project with that aim, I know, but it is an interesting learning experience nonetheless. Incidentally, memory-mapped files are supported in .Net 4.0, which is cool. In time, I hope to also make a MySQL version, which would make it possible to use the same datafiles from a .Net application and MySQL. That might be vaguely useful; for instance, writing a small .Net snippet to create a database from some moderately complex dataset, and exporting it to MySQL with a simple file copy. Which I would file under C, for "Cool".
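For reference, the new memory-mapped file API is pleasantly simple to use. A minimal sketch (the file name and capacity are just placeholders, not what the prototype uses):

  using System;
  using System.IO;
  using System.IO.MemoryMappedFiles;

  class MmapSketch
  {
      static void Main()
      {
          // Create (or open) a 1 MB file and map it into memory.
          using (var mmf = MemoryMappedFile.CreateFromFile("column.dat", FileMode.OpenOrCreate, "column", 1024 * 1024))
          using (var accessor = mmf.CreateViewAccessor())
          {
              accessor.Write(0, 42);              // write an int at offset 0
              int value = accessor.ReadInt32(0);  // read it back
              Console.WriteLine(value);
          }
      }
  }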

So, this is where I am at the moment:
  1. I've got a crude XML format for defining a row-format, and a parser for it.
  2. A class hierarchy that works fairly well.
  3. A B-tree implementation that seems to work (I'm sure there's a few off-by-ones, but that's hardly the point).
  4. A data loader which loads in data based on the supplied dataformat.
  5. A cmd-line app that lets me play with it.
I can load data, and retrieve it as arrays of strings based on row numbers. Each column uses two files, both memory-mapped. One datafile contains an entry for each row, which is simply the record number in the B-tree where the data can be found. The B-tree file contains entries for nodes and data, stored in pairs: a node entry followed by a datablock of static size.

A variation of this is used for really small fields, for instance ints, in which case the column simply contains the values directly in the row entry, instead of a pointer to the B-tree. These columns do not have B-trees.
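To make that a bit more concrete, here is a rough sketch of how a column could be wired up on top of two memory-mapped files. The names, record sizes and capacity handling are made up for illustration; the real prototype differs in the details:

  using System.IO;
  using System.IO.MemoryMappedFiles;

  // One column = one row file plus (for most columns) one B-tree file, both memory-mapped.
  class Column
  {
      const int RowEntrySize = 4;           // one 32-bit entry per row
      readonly MemoryMappedFile rowFile;
      readonly MemoryMappedFile btreeFile;  // null for small columns that store values inline
      readonly MemoryMappedViewAccessor rows;

      public Column(string name, long capacity, bool inlineValues)
      {
          rowFile = MemoryMappedFile.CreateFromFile(name + ".rows", FileMode.OpenOrCreate, name + "_rows", capacity);
          rows = rowFile.CreateViewAccessor();
          btreeFile = inlineValues ? null
              : MemoryMappedFile.CreateFromFile(name + ".btree", FileMode.OpenOrCreate, name + "_btree", capacity);
      }

      // For normal columns the row entry is a record number into the B-tree file;
      // for inline columns it is the value itself.
      public int ReadRowEntry(long rowNumber)
      {
          return rows.ReadInt32(rowNumber * RowEntrySize);
      }
  }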

I didn't bring a laptop to Istanbul, but I did think about it every now and then. So here is the current TO-DO list:

1. Make stress-tests. Randomly retrieve lots of rows, timing it (a rough sketch of this follows after the list)
2. Add more statistics: maximum search length, number of levels in the b-tree, timing of searches, etc.
3. Single datafile:
 - Extents. Need map of extents, each column has one or two extents * n, data and row
 - Need logic to get a new extent when one is full
 - rowCount on table-object, not column
 - Extents handled by table-object
 - mmap one extent at a time
4. Support deletes
 - Use free-lists
5. Support variable datasizes
 - Another segment for actual data, for a total of two or three:
 - Non-optimizing:
  - 1 segment for rowdata, with pointers to another segment with valuedata.
  - a single list for fixed sizes
 - Optimizing:
  - 1 segment for rowdata, 1 with b-tree, another for value-data
  - the b-tree requires two reads anyway, so should not take longer (ValEntry+Data)
6. Create general library for .Net
 - Need decent way of retrieving data, apart from row-number
 - GetRow on Table-object, return row as object[]
 - RegEx'es on dataload
7. Create MySQL storage engine using the same dataformat
8. Indexing?
9. Store data as native datatypes, at least int and char[]
10. Profile it, optimize
11. Verify that 32-bit alignment actually improves performance.
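For item 1, the stress-test doesn't need to be anything fancy. Something along these lines should do, assuming the library ends up exposing a GetRow-style lookup that returns a row as object[] (the names here are placeholders):

  using System;
  using System.Diagnostics;

  class StressTest
  {
      // Fetch 'reads' random rows through the supplied lookup and report the elapsed time.
      static void Run(Func<long, object[]> getRow, long rowCount, int reads)
      {
          var rnd = new Random();
          var sw = Stopwatch.StartNew();
          for (int i = 0; i < reads; i++)
          {
              long rowNumber = (long)(rnd.NextDouble() * rowCount);
              object[] row = getRow(rowNumber);
          }
          sw.Stop();
          Console.WriteLine("{0} random reads in {1} ms", reads, sw.ElapsedMilliseconds);
      }
  }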

Friday, October 2, 2009

An idea for a new storage engine for MySQL

This week I've spent some time analysing about a gigabyte of logs from an old IIS5.0 server. The logs are in the standard text-format, so I quickly decided to dump it all into MySQL.

Using MySQL for this sort of thing (or any database, for that matter) works like a charm. Yes, of course I could have used Perl, or for this particular case, any number of log analysers, but I find that working with SQL is simply quicker and more convenient in practical terms, for on-the-fly one-time analysis where I don't really know up front what I'm looking for. It also allows me to use MySQL to aggregate the data, and use Pivot tables in Excel to chew through and graph it. It simply works well for me. (And no, I don't care that other open source tools exist. I want it to work now, not after fiddling around with it for hours. And Excel is already installed on my laptop.)

Ideally I'd like to just dump the whole thing in Excel, but Excel will only allow you some 65000 rows, which is not nearly sufficient. My dataset is about 6.4 million rows.

Anyway, importing the logfiles was quite easy: a small command line to clean the data (think grep and sed), and load data infile... from MySQL did the trick. At a leisurely pace, to be sure, but it did it. (I have MySQL running on Linux in a virtual machine, so I'm not blaming this lack of speed on Linux or MySQL.)

So, while I was working with this data it struck me how much redundancy there was in the columns. For instance, in 6.4 million rows, I have only 2479 unique IPs. This is a quick-and-dirty thing, so I didn't spend time optimizing anything. The client IP address in my table is varchar(15), because xxx.xxx.xxx.xxx is 15 bytes. The log files are dumped as text, keeping everything as quick and painless as possible. In essence, the database is storing 6.4 million ~15-byte values, for only 2479 unique values. Or, to state it in real numbers, my database has 70,095,785 bytes of raw data stored in the IP-address field, not counting any overhead. The unique values sum up to 30,228 bytes.

In a production environment, you might create a second table with unique IP addresses, and only reference that table from the "log" table. You'd save on diskspace, but your SQL would be more cumbersome to write. As I said, this was ad-hoc, and I don't need any extra headache on this type of work.

I've looked at the numbers, and the pattern seems to repeat. The data is about two months' worth, so the date field only has 60 unique dates. There are, however, 6.4 million dates stored in the database. The request URL has only 657 unique values, yet I am storing those 6.4 million times. It all adds up!

I've looked around for an alternative storage engine, but without much luck. Packed MyISAM tables compress by rows, which won't do me any good. The Archive engine does the same, if I read it correctly.

Here I think it's time to pause and explain that I am not a DBA, I am not a programmer by trade or education (although I have programmed in several languages, for my pet projects), and I do not intimately know MySQL. I use it occasionally. I browsed the source for the first time yesterday, and read a couple of Internets about the storage engine API. So the rest of the post should be regarded with scepticism by those looking to learn, and mercy by those who already know.

With that in mind, I'll take a couple of minutes to try to explain what the Storage Engine is, and why I'd like a new one. An RDBMS (Relational Database Management System; MySQL is one) consists of a lot of pieces. Some parts parse SQL, generate execution plans, optimize, and do all sorts of more or less interesting things, all of which are important. The Storage Engine is the bit that actually puts your data onto your disk (or wherever you want it). In MySQL, this part can be changed, depending on your needs. For instance, there is a storage engine that never stores anything to disk; it only keeps data in memory. This is convenient for things that do not need to live long, but need to be fast. The rest of the RDBMS doesn't know much about this; it only knows how to ask the storage engine for data. It really doesn't know what the storage engine does with the data, or how. It just wants to get what it asks for.

Likewise, when you write a select-statement, you don't need to worry much about which storage engine is used. It all just magically works. So you can use different storage engines, with different properties, just as long as those storage engines follow the rules, i.e. the API.
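Just to illustrate the division of labour, you could imagine it as an interface like the one below. This is emphatically not the real MySQL storage engine API (that is a C++ handler class with far more methods); it's only a made-up sketch of the idea that the server hands rows to the engine and asks for them back, without caring how they are stored:

  // A made-up interface, purely to illustrate the separation.
  interface IStorageEngine
  {
      void Open(string tableName);
      void WriteRow(object[] values);            // the server hands the engine a row to store
      object[] ReadRowByPosition(long position); // the server asks for a row back
      void Close();
  }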

For a better explanation, check out the official documentation.

So back to my plan. Here is the crux of what I want: a storage engine that automatically checks whether any value stored to a column has been stored before. If so, simply point to the existing datablock. (And use copy-on-write for updates.) In this particular case it would save me a LOT of data. I haven't crunched the numbers properly yet, but keep in mind that the entire gigabyte of logfiles compresses to about 17MB, so there's a lot of redundancy in there. Potentially I might even be able to hold the whole table in memory at once, which would speed things up quite a bit.

Think of it like this: a list for each column in a table, with one entry for every row. Let's call this the "row-list". Another list for each column in a table, with one entry for every unique value. Let's call this the "value-list". Each entry in the row-list is just a pointer to the correct entry in the value-list.
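In code, the two lists could look something like this. A rough C# sketch of the idea, not the actual engine; the field sizes follow the estimate further down, and a plain dictionary stands in for the B-tree:

  using System.Collections.Generic;

  // One entry per unique value in the column (the "value-list").
  class ValueEntry
  {
      public uint Hash;       // 32-bit hash of the value
      public uint RefCount;   // how many rows point at this value
      public byte[] Data;     // the value itself
  }

  // One pointer per row (the "row-list"): an index into the value-list.
  class ColumnStore
  {
      public List<int> RowList = new List<int>();
      public List<ValueEntry> ValueList = new List<ValueEntry>();
      // Lookup from hash to value-list index; the prototype would use a B-tree here.
      public Dictionary<uint, int> Index = new Dictionary<uint, int>();
  }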

So, when an Insert comes along, the engine would look at each value, compute a hash value, and check it against the value-list (or B-tree) for the relevant column, looking for a match (comparing the actual values too, since two different values can share a hash). If found, insert the pointer to the corresponding node in the value-list into the row-list, and increase the reference counter on the value-list node. If not found, create a new node in the value-list, and point to it in the row-list.

A Delete would simply decrease the reference count on the value-list node, delete that node if the count reaches zero, and remove the entry from the row-list.
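Using the little ColumnStore sketch from above, the insert and delete paths could look roughly like this (again only a sketch; the real thing would go through the B-tree and handle hash collisions against more than one stored value):

  using System.Linq;   // for SequenceEqual

  static class ColumnOps
  {
      public static void InsertValue(ColumnStore col, byte[] value)
      {
          uint hash = ComputeHash(value);
          int index;
          if (col.Index.TryGetValue(hash, out index) &&
              col.ValueList[index].Data.SequenceEqual(value))   // guard against hash collisions
          {
              col.ValueList[index].RefCount++;                  // value seen before: just bump the counter
          }
          else
          {
              index = col.ValueList.Count;                      // new value: create a value-list node
              col.ValueList.Add(new ValueEntry { Hash = hash, RefCount = 1, Data = value });
              col.Index[hash] = index;                          // (real collision handling would keep several indexes per hash)
          }
          col.RowList.Add(index);                               // the row-list just points at the node
      }

      public static void DeleteRow(ColumnStore col, int rowNumber)
      {
          int index = col.RowList[rowNumber];
          col.RowList.RemoveAt(rowNumber);
          if (--col.ValueList[index].RefCount == 0)
              col.Index.Remove(col.ValueList[index].Hash);      // the node could go on a free-list here
      }

      // Any decent hash would do; FNV-1a as a simple stand-in.
      static uint ComputeHash(byte[] data)
      {
          const uint offsetBasis = 2166136261;
          const uint prime = 16777619;
          uint hash = offsetBasis;
          foreach (byte b in data)
              hash = (hash ^ b) * prime;
          return hash;
      }
  }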

This all sounds frightfully involved, but my gut tells me it would pay off by reducing disk IO and memory-footprint. (Not sure if cache-locality might become an issue, though)

A quick estimate of data volumes goes like this: I'm guessing I'd need 4 bytes (a 32-bit pointer) per row per column, in the row-list. Then I'd need a 32-bit hash value and a reference counter on the value-list, plus the value itself. Total: 4 bytes per row, 8 bytes per unique value, plus the actual data. I'm comparing against the raw data, not making any assumptions on MyISAM overhead. (I would also probably need some other bookkeeping datastructures/fields, but I guess MyISAM also uses those, so it evens out. In my head it does.)
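Applied to the IP-address column from earlier (6.4 million rows, 2479 unique values, 70,095,785 bytes of raw data), that estimate works out to roughly:

  Row-list:    6,400,000 rows x 4 bytes               = 25,600,000 bytes
  Value-list:  2,479 values x 8 bytes + 30,228 bytes  =     50,060 bytes
  Total:       ~25.7 MB, against ~70 MB of raw data today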

A quick graph in Excel tells me that on a 15-byte value, I'd save on data up to about 48% unique values. In my dataset the unique values are less than 1%, so there is potential for BIG savings, as demonstrated with the IP-address example. I estimate 32-bit pointers are more than sufficient; remember, we're talking about unique values. If you have more than 4 billion of those, you should probably look at a different storage engine! (The corresponding number for 64-bit pointers, incidentally, would be ~22%.)
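That break-even point falls out of a simple comparison. With R rows and U unique values of 15 bytes each, raw storage is 15R bytes, while the pointer scheme needs 4R + (8 + 15)U bytes, so it wins as long as

  4R + 23U < 15R  =>  U < (11/23)R, i.e. up to ~48% unique values.

With 64-bit pointers (and, presumably, the hash and reference counter also grown to 8 bytes each) it becomes 8R + 31U < 15R, i.e. roughly 22%, matching the number above.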

Of course, the storage engine would need to be smart enough not to use this scheme for small fields. No sense in replacing a single-byte value with 12 bytes of overhead.

One other interesting aspect is that I'd have 'free' indices; if value A matches an expression, this would apply to ALL rows that point to that node in the value-list. So there would be no need to evaluate those. I don't know if I would be able to get the optimizer to utilize this in any meaningful way, though.

I think this storage engine would be very useful for things like logs that have a lot of redundancy in them. It would, as far as I understand, drastically cut down the importance of getting your data to normal form, and might even throw in some nice enhancements for 'free'. The write speed would probably be lower than, say, MyISAM, but that would be fine for the sort of scenario I'm looking at. It may, on the other hand, turn out to be higher for certain workloads, since there'd be way less disk IO, for both read and write. But that's speculation/wishful thinking.

So, step one in my plan for World Dominance: Learn C++.