This week I've spent some time analysing about a gigabyte of logs from an old IIS 5.0 server. The logs are in the standard text format, so I quickly decided to dump the whole lot into MySQL.
Using MySQL for this sort of thing (or any database, for that matter) works like a charm. Yes, of course I could have used Perl, or, for this particular case, any number of log analysers, but I find that working with SQL is simply quicker and more convenient for on-the-fly, one-off analysis where I don't really know up front what I'm looking for. It also lets me aggregate the data in MySQL and then use pivot tables in Excel to chew through and graph it. It simply works well for me.
(And no, I don't care that other open source tools exist. I want it to work now, not after fiddling around with it for hours. And Excel is already installed on my laptop.)
Ideally I'd like to just dump the whole thing into Excel, but Excel will only allow you some 65,000 rows (65,536, to be exact), which is nowhere near enough. My dataset is about 6.4 million rows.
Anyway, importing the log files was quite easy: a small command line to clean up the data (think grep and sed), and LOAD DATA INFILE ... from MySQL did the trick. At a leisurely pace, to be sure, but it got there. (I have MySQL running on Linux in a virtual machine, so I'm not blaming the lack of speed on Linux or MySQL.)
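For the curious, the whole import boils down to something like the sketch below. The table layout, column names and file path are just my guesses at a reasonable mapping of the IIS fields, picked for illustration, not the exact schema I used:

    -- a minimal table for the cleaned-up, space-separated log lines
    CREATE TABLE iis_log (
      log_date    VARCHAR(10)  NOT NULL,
      log_time    VARCHAR(8)   NOT NULL,
      client_ip   VARCHAR(15)  NOT NULL,
      method      VARCHAR(8)   NOT NULL,
      request_url VARCHAR(255) NOT NULL,
      status      SMALLINT     NOT NULL
    ) ENGINE=MyISAM;

    -- bulk-load the cleaned text file in one go
    LOAD DATA INFILE '/tmp/cleaned.log'
    INTO TABLE iis_log
    FIELDS TERMINATED BY ' '
    LINES TERMINATED BY '\n'
    (log_date, log_time, client_ip, method, request_url, status);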
So, while I was working with this data, it struck me how much redundancy there was in the columns. For instance, in 6.4 million rows I have only 2479 unique IPs. This is a quick-and-dirty thing, so I didn't spend time optimizing anything. The client IP address in my table is a varchar(15), because xxx.xxx.xxx.xxx is 15 bytes. The log files are dumped as text, keeping everything as quick and painless as possible. In essence, the database is storing 6.4 million ~15-byte values for only 2479 unique values. Or, to state it in real numbers, my database has 70,095,785 bytes of raw data stored in the IP address field, not counting any overhead. The unique values sum up to 30,228 bytes.
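Numbers like these are easy to pull out with a couple of aggregate queries, something along these lines (same illustrative column names as above):

    -- raw bytes stored in the column versus how many distinct values there are
    SELECT COUNT(*)                  AS total_rows,
           COUNT(DISTINCT client_ip) AS unique_ips,
           SUM(LENGTH(client_ip))    AS raw_bytes
    FROM iis_log;

    -- and the bytes needed if every address were stored only once
    SELECT SUM(LENGTH(client_ip)) AS unique_bytes
    FROM (SELECT DISTINCT client_ip FROM iis_log) AS u;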
In a production environment you might create a second table with the unique IP addresses, and only reference that table from the "log" table. You'd save on disk space, but your SQL would be more cumbersome to write. As I said, this was ad hoc, and I don't need any extra headaches for this type of work.
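For completeness, that "proper" layout would look something like this (names made up); it's not hard, it's just one more join to drag around on every throwaway query:

    -- one row per distinct address
    CREATE TABLE ip_address (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      ip VARCHAR(15)  NOT NULL,
      UNIQUE KEY uq_ip (ip)
    );

    -- the log rows only carry the 4-byte id
    CREATE TABLE iis_log_norm (
      log_date    VARCHAR(10)  NOT NULL,
      request_url VARCHAR(255) NOT NULL,
      ip_id       INT UNSIGNED NOT NULL
    );

    -- and every ad-hoc query grows a join
    SELECT a.ip, COUNT(*) AS hits
    FROM iis_log_norm l
    JOIN ip_address a ON a.id = l.ip_id
    GROUP BY a.ip
    ORDER BY hits DESC;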
I've looked at the numbers, and the pattern seems to repeat. The data covers about two months, so the date field only has 60 unique dates; there are, however, 6.4 million dates stored in the database. The request URL has only 657 unique values, yet I am storing those 6.4 million times. It all adds up!
I've looked around for an alternative storage engine, but without much luck. Packed MyISAM tables compress row by row, which won't do me any good. The Archive engine does the same, if I read the documentation correctly.
Here I think it's time to pause and explain that I am not a DBA, I am not a programmer by trade or education (although I have programmed in several languages for my pet projects), and I do not know MySQL intimately. I use it occasionally. I browsed the source for the first time yesterday, and read a couple of Internets about the storage engine API. So the rest of this post should be regarded with scepticism by those looking to learn, and mercy by those who already know.
With that in mind, I'll take a couple of minutes to try to explain what the storage engine is, and why I'd like a new one. An RDBMS (Relational DataBase Management System; MySQL is one) consists of a lot of pieces. Some parts parse SQL, generate execution plans, optimize, and do all sorts of more or less interesting things, all of which are important. The storage engine is the bit that actually puts your data onto your disk (or wherever you want it). In MySQL this part can be swapped out, depending on your needs. For instance, there is a storage engine that never stores anything to disk; it only keeps it in memory. That is convenient for things that don't need to live long but need to be fast. The rest of the RDBMS doesn't know much about this; it only knows how to ask the storage engine for data. It really doesn't know what the storage engine does with the data, or how. It just wants to get what it asks for.
Likewise, when you write a SELECT statement, you don't need to worry much about which storage engine is used. It all just magically works. So you can use different storage engines, with different properties, as long as those storage engines follow the rules, i.e. the API.
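To make that concrete: the engine is just a clause on CREATE TABLE, and the queries themselves don't change. (The tables here are made up purely for illustration.)

    -- kept entirely in RAM, gone after a restart: fast but volatile
    CREATE TABLE session_cache (
      session_id INT UNSIGNED NOT NULL PRIMARY KEY,
      last_seen  DATETIME     NOT NULL
    ) ENGINE=MEMORY;

    -- the ordinary on-disk engine (the default at the time)
    CREATE TABLE page_hits (
      url  VARCHAR(255) NOT NULL,
      hits INT UNSIGNED NOT NULL
    ) ENGINE=MyISAM;

    -- the SELECTs neither know nor care which engine sits underneath
    SELECT COUNT(*) FROM session_cache;
    SELECT COUNT(*) FROM page_hits;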
For a better explanation, check out the official documentation.
So, back to my plan. Here is the crux of what I want: a storage engine that automatically checks whether any value stored to a column has been stored before. If so, simply point to the existing data block (and use copy-on-write for updates). In this particular case it would save me a LOT of data. I haven't crunched the numbers properly yet, but keep in mind that the entire gigabyte of log files compresses to about 17 MB, so there's a lot of redundancy in there. Potentially I might even be able to hold the whole table in memory at once, which would speed things up quite a bit.
Think of it like this: one list for each column in a table, with one entry for every row. Let's call this the "row-list". Another list for each column, with one entry for every distinct value. Let's call this the "value-list". Each entry in the row-list is just a pointer to the correct entry in the value-list.
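Sketched as ordinary tables, purely to make the picture concrete (the real thing would obviously live inside the engine, not as SQL-level tables), one column would look something like this:

    -- the "value-list": one entry per distinct value, plus bookkeeping
    CREATE TABLE ip_values (
      id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      val       VARCHAR(15)  NOT NULL,
      ref_count INT UNSIGNED NOT NULL DEFAULT 0,
      UNIQUE KEY uq_val (val)   -- stands in for the hash/B-tree lookup
    );

    -- the "row-list": one entry per row, holding nothing but a pointer
    CREATE TABLE ip_rows (
      row_id INT UNSIGNED NOT NULL,
      val_id INT UNSIGNED NOT NULL   -- 4 bytes, instead of up to 15
    );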
So, when an insert comes along, the engine would look at each value, compute a hash, and check it against the value-list (or a B-tree over it) for the relevant column, looking for a match. If one is found, insert a pointer to the corresponding value-list node into the row-list and increase the reference counter on that node. If not, create a new node in the value-list and point to it from the row-list.
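Staying with the table sketch above, the insert bookkeeping can even be written as a standard MySQL idiom; again, this only illustrates the logic, not how the engine would actually do it:

    -- reuse the existing value if there is one, bumping its refcount,
    -- otherwise create it; LAST_INSERT_ID(id) hands back the id either way
    INSERT INTO ip_values (val, ref_count)
    VALUES ('10.0.0.1', 1)
    ON DUPLICATE KEY UPDATE
      ref_count = ref_count + 1,
      id        = LAST_INSERT_ID(id);

    -- the row-list entry is then just a pointer to that node
    INSERT INTO ip_rows (row_id, val_id)
    VALUES (1, LAST_INSERT_ID());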
A delete would simply decrease the reference count on the value-list node, removing that node if the count reaches zero, and delete the entry in the row-list.
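And the delete side of the bookkeeping, in the same sketch (the ids are made up):

    -- drop the row-list entry, then decrease the refcount on its value node
    DELETE FROM ip_rows WHERE row_id = 1;
    UPDATE ip_values SET ref_count = ref_count - 1 WHERE id = 42;

    -- garbage-collect value nodes nobody points at any more
    DELETE FROM ip_values WHERE ref_count = 0;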
This all sounds frightfully involved, but my gut tells me it would pay off by reducing disk I/O and the memory footprint.
(Not sure whether cache locality might become an issue, though.)
A quick estimate of data volumes goes like this: I'm guessing I'd need 4 bytes (a 32-bit pointer) per row per column in the row-list. Then I'd need a 32-bit hash value and a reference counter per value-list node, plus the value itself. In total: 4 bytes per row, 8 bytes per unique value, plus the actual data. For the IP column that works out to roughly 6.4 million × 4 bytes ≈ 25.6 MB of row-list plus a few tens of kilobytes of value-list, against the 70 MB of raw values stored today. I'm comparing against the raw data, not making any assumptions about MyISAM overhead. (I would also probably need some other bookkeeping data structures/fields, but I guess MyISAM uses those too, so it evens out. In my head it does.)
A quick graph in Excel tells me that for a 15-byte value, the scheme saves space up to roughly 48% unique values (raw storage is 15 bytes per row; the scheme needs 4 bytes per row plus 23 bytes per unique value, which breaks even at 11/23 ≈ 48% unique). In my dataset the unique values are less than 1%, so there is potential for BIG savings, as demonstrated with the IP address example. I estimate 32-bit pointers are more than sufficient; remember, we're talking about unique values. If you have more than 4 billion of those, you should probably look at a different storage engine! (The corresponding break-even for 64-bit pointers, incidentally, would be ~22%.)
Of course, the storage engine would need to be smart enough not to use this scheme for small fields. There's no sense in replacing a one-byte value with 12 bytes of overhead.
One other interesting aspect is that I'd get 'free' indices: if value A matches an expression, that applies to ALL rows that point to that node in the value-list, so there would be no need to evaluate it for those rows again. I don't know whether I'd be able to get the optimizer to exploit this in any meaningful way, though.
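In the table sketch from earlier the effect is easy to see: a filter only has to be evaluated against the 2479 distinct addresses, and the reference counts already say how many of the 6.4 million rows it covers:

    -- the LIKE is checked once per distinct address, never once per log row
    SELECT SUM(ref_count) AS matching_rows
    FROM ip_values
    WHERE val LIKE '192.168.%';   -- example prefix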
I think this storage engine would be very useful for things like logs, which have a lot of redundancy in them. It would, as far as I understand, drastically cut down the importance of getting your data into normal form, and might even throw in some nice enhancements for 'free'. The write speed would probably be lower than, say, MyISAM's, but that would be fine for the sort of scenario I'm looking at. It may, on the other hand, turn out to be higher for certain workloads, since there'd be far less disk I/O, for both reads and writes. But that's speculation/wishful thinking.
So, step one in my plan for World Dominance: Learn C++.