Tuesday, July 19, 2011

The trouble with physics

I just finished reading Lee Smolin's book "The Trouble with Physics". An enjoyable read, and he makes some very good points about the current state of physics and "what's wrong with it". As a layman, I tend to agree with most of what he writes.


There is, however, one thing about physics, and science in general, that bugs me: the tendency to boldly declare that "this is how nature works!"


Take a concrete example. A tennis ball dropped from a height is said to "obey Newton's laws". Let's engage in a gedankenexperiment:

  1. Drop a tennis ball from a given height. In a vacuum, if you so wish. Very accurately track the downward acceleration.
  2. Observe that this acceleration, within experimental error, is in accordance with Newton's laws.
  3. Place said tennis ball on a tabletop, with a pen and a standard calculus test.
  4. Wait.
  5. Observe that the tennis ball is really bad at calculus.
  6. Conclude that the tennis ball can NOT "obey Newton's laws" at all; it can't tell an integral from a racket! The tennis ball does as it damn well pleases, and Newton's laws are a model that describes what the tennis ball pleases. A very good model, perhaps, but a model nonetheless.

If we accept that whatever mathematical constructs we have of the real world are, in fact, MODELS of the real world, I think we'd be a lot more productive in science in general, and perhaps physics in particular. If we accept our knowledge as models, rather than "laws of nature", it would probably be easier to accept refinements of these models. Refinements that would better explain our observations.


The Oxford English Dictionary defines scientific method as "consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses."


Now would be a good time to observe that all previous models of the physical universe have been proven false. And I suppose also to formulate a hypothesis that our current knowledge will be surpassed. Assuming that we are now at the peak of human knowledge, and that we now know how nature works, seems preposterous to me. At best, we can hope that our current theories and models are good ones that will last a while, before being replaced by something better and more accurate. I find it ever so slightly ironic when scientists (or perhaps the press is more to blame for my impression of what the scientists proclaim) say that we now know "how nature really works". It means they can all go home. Job done. Turn off the lights and lock the door, we've got it.


It also strikes me that "old knowledge dies hard". Case in point: string theory. For some reason it is easier to believe that we live in a multidimensional universe with "invisible rolled-up dimensions" than to believe that general relativity is incorrect or incomplete. I think Occam would have something to say about that..


I think it is interesting to ponder this unwillingness to tear down old truths. I am not arguing that we should accept every claim of cold fusion at face value, but I do believe that there's something to be said for taking a second look at long-established truths every now and then. Technology improves, our ability to measure and quantify improves, experimental error continues to shrink. Perhaps some of those old truths are no longer quite as true? Or at least, not complete?


Which brings me to another pet peeve: "It's so beautiful it must be true!" Why? Beautiful, yes, but so what? The tennis ball, at least, has proven itself useless at calculus, so I don't imagine it cares much for the beauty of the laws it is supposed to obey. I find it strange, this belief that the world must operate according to

  1. mathematical principles (perhaps not so strange on its own, but still),
  2. these mathematical principles must be describable within our current framework of mathematics, and
  3. the principles must be beautiful.

Wednesday, October 27, 2010

TO-DO

  1. Update TO-DO list.

Wednesday, August 4, 2010

Huffman coding

I've just completed a small project in C#: Huffman coding. For no particular reason, other than to keep my brain occupied. There's something oddly satisfying about making code do what it is supposed to. Even if, as in this case, it is not very useful. The code does a poor job of saving disk space; Serialize() is way too greedy for that. I tried half-heartedly to save space by using a base class with only the essential fields for decoding, but it was not all that effective. It did prove to be a useful exercise, though.

I couldn't find the motivation to code up something smarter; the challenge was keeping track of the bits. Which I think is correct, at least it decodes whatever I throw at it correctly. (Note that "whatever" in this context is limited to 3 random files that happened to have a short path name on my computer, so I suppose the testing hasn't been all that thorough. Which doesn't matter, really, as it is pretty useless. :))

Huffman coding is in principle quite simple. Count how many occurrences there are in the input of each symbol, and assign a short code to the most common ones and a long one to the least common ones. There are a number of pages describing the exact mechanisms, but in short it builds a binary tree out of the symbols, with a short path from the root of the tree to the common symbols and a long path to the less common symbols. When decoding, start at the root node, read a bit from the input stream, and take the left child if it is 0, or the right child if it is 1. Get another bit from the input stream, and keep going until you hit a node with no children, which means you've arrived at the symbol the code represents. Output the symbol, go back to the root node, and get another bit from the input. Quite simple, really.
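
To make the decode walk concrete, here is a minimal, self-contained toy (not part of the real program below; the tree and names are made up) that decodes a bit string against a hand-built three-symbol tree where 'A' = 0, 'B' = 10 and 'C' = 11:

using System;
using System.Text;

class ToyHuffmanDecode {
    class Node {
        public char Symbol;
        public Node Left, Right;            // both null => leaf
    }

    static void Main() {
        // Hand-built tree: 'A' = 0, 'B' = 10, 'C' = 11
        Node root = new Node {
            Left = new Node { Symbol = 'A' },
            Right = new Node {
                Left = new Node { Symbol = 'B' },
                Right = new Node { Symbol = 'C' }
            }
        };

        string bits = "01011010";           // 0 | 10 | 11 | 0 | 10  ->  "ABCAB"
        StringBuilder output = new StringBuilder();
        Node current = root;

        foreach (char bit in bits) {
            current = (bit == '0') ? current.Left : current.Right;

            // Leaf reached: emit the symbol, restart at the root for the next code
            if (current.Left == null && current.Right == null) {
                output.Append(current.Symbol);
                current = root;
            }
        }

        Console.WriteLine(output);          // prints "ABCAB"
    }
}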

There are a number of scenarios that will make the code fail: if the input is all a single symbol, for instance, and a bunch of others I haven't considered.
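
For what it's worth, here is one possible guard for the single-symbol case, sketched against the Node class in the code further down. It is not part of the program, just an idea: if the tree collapses to a single leaf, wrap it in a dummy parent so the lone symbol still gets a proper 1-bit code.

        // Hypothetical helper (not in the code below). If every byte in the input is the
        // same symbol, maketree() returns a lone leaf and makecodes() would hand that
        // symbol a zero-length code. Wrapping the leaf in a dummy parent gives the real
        // symbol the 1-bit code "0" instead.
        static Node EnsureNonTrivialTree(Node root) {
            if (!root.isLeaf)
                return root;                        // normal case: a proper two-child tree

            Node dummy = new Node();
            dummy.isLeaf = true;
            dummy.value = (byte)(root.value + 1);   // any value different from the real symbol
            dummy.count = 0;

            Node parent = new Node();
            parent.count = root.count;
            parent.left = root;                     // real symbol -> code "0"
            parent.right = dummy;                   // dummy -> code "1", never actually emitted
            return parent;
        }

        // Possible call site in encode(), right after building the tree:
        //     Node root = EnsureNonTrivialTree(maketree(values));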

It could be interesting to make the symbol size of the input variable. Currently it works on bytes, but in principle it could work on any fixed number of bits. The tree would be a lot larger, so finding a better way to store it on disk than Serialize() would definitely be required. Also, I suppose the efficiency requires that some symbols are way more prevalent than others; I'm not sure a 16-bit symbol size would work, for example. Interesting, though..

The only reason I'm publishing is that I've recently gotten into mathematics, and writing about stuff makes me think about the subject in a more structured way, aiding my learning. So I've planned a series of posts, and it starts with Huffman coding. The plan is to move on to Markov chains, then a hybrid Huffman-Markov coding, then to Hidden Markov Models. If I ever get around to writing it..

Anyway, the code. For what it's worth. 
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Runtime.InteropServices;

namespace Huffman {
    class Program {
        static StreamReader finput;
        static StreamWriter foutput;

        // This class contains fields required when decoding.
        [Serializable]
        class DecodeNode {
            public byte value;                      // Which symbol the Node represents
            public DecodeNode left = null;          // Left child-node
            public DecodeNode right = null;         // Right child-node
        }

        // This class also contains definitions only required when encoding
        class Node : DecodeNode, IComparable<Node> {
            public ulong count;             // Number of times the symbol occurs in input-file
            public uint code = 0;           // Code for this symbol
            public uint bitcount = 0;       // Number of bits in the code
            public bool isLeaf = false;     // Flag to signify if it has a value

            // This will allow us to compare the frequencies of other values easily.
            // Ends up with "lowest to highest"
            public int CompareTo(Node other) {
                return (count.CompareTo(other.count));
            }

            // There MUST be a more elegant way of losing the "extra info" in Node before serializing..
            public DecodeNode ToDecodeNode() {
                DecodeNode d = new DecodeNode();

                if (this.isLeaf) {
                    d.value = this.value;
                } else {
                    d.left = ((Node)this.left).ToDecodeNode();
                    d.right = ((Node)this.right).ToDecodeNode();
                }

                return (d);
            }

            // Count the number of nodes in the (sub-)tree, including this one. Not required, just info
            public int TreeCount() {
                int i = 1;

                if (left != null)
                    i += ((Node)left).TreeCount();
                if (right != null)
                    i += ((Node)right).TreeCount();

                return (i);
            }
        }

        // Used for the code-table, when encoding
        class CodeEntry {
            public uint code;
            public byte bitcount;
        }

        // Support-method for encode()
        // Recursively traverse the tree, making an encode table as we go along.
        // Assumes maximum 32 bit codes, returns a codetable for use in encoding.
        static CodeEntry[] makecodes(Node v) {

            // Set up encode table
            CodeEntry[] ct = new CodeEntry[256];

            // Only called with this proto from root, therefore code and codelength == 0
            makecodes(ct, v, 0, 0);

            return (ct);
        }

        static void makecodes(CodeEntry[] ctable, Node v, uint code, byte bitcount) {
            if (v.isLeaf) {
                CodeEntry e = new CodeEntry();
                e.bitcount = bitcount;
                e.code = code;

                ctable[v.value] = e;
                return;
            }

            bitcount++;         // Code is 1 bit longer
            code <<= 1;         // Left-shift 1 bit

            // Left child
            if (v.left != null)
                makecodes(ctable, (Node)v.left, code, bitcount);

            // Right child
            code++; // Make last bit 1
            if (v.right != null)
                makecodes(ctable, (Node)v.right, code, bitcount);
        }

        static Node maketree(List<Node> values) {

            // Remove all values where count == 0
            values.RemoveAll(delegate(Node n) { return n.count == 0; });

            // Now build a tree. 
            // Remove the two least common values, add a new node which is NOT a leaf node. Frq = count of node 1 + count of node 2
            // Insert the new node into the list. Do until only one node is left.
            Node node, node1, node2;
            while (values.Count > 1) {
                // Sort the array so that the LEAST common values are "on top". 
                values.Sort();

                // Grab 2 least frequent nodes from the list
                node1 = values[0];
                node2 = values[1];
                values.RemoveRange(0, 2);

                // Create a parent node for these
                node = new Node();
                node.count = node1.count + node2.count;
                node.left = node1;
                node.right = node2;

                // Add new parent node to list
                values.Add(node);
            }

            return (values[0]);
        }

        // For debugging
        static void dumptree(DecodeNode n, string level) {
            level += ".";
            if ((n.left == null) && (n.right == null))
                Console.WriteLine(level + "Val: " + n.value);
            else {
                Console.WriteLine(level);
                dumptree(n.left, level);
                dumptree(n.right, level);
            }
        }

        static void encode() {
            // 500 seems to be a ballpark figure for number of nodes in the tree.
            List<Node> values = new List<Node>(500);

            // Add objects with values, so we can later identify them
            for (int i = 0; i < 256; i++) {
                Node v = new Node();
                v.value = (byte)i;
                v.isLeaf = true;
                values.Add(v);
            }

            Stream fin = finput.BaseStream;
            int b;

            // Count each byte. The values are at their offsets in the list
            while ((b = fin.ReadByte()) != -1)
                values[(byte)b].count++;

            Node root = maketree(values);

            #region Dump tree to disk
            // This is pretty inefficient use of disk space. Serialize() uses a lot of space..

            DecodeNode decoderoot = root.ToDecodeNode();    // Convert to smaller objects, saves diskspace in output.

            // decoderoot contains all nodes, with enough information to recreate the tree at decode-time.
            Stream fout = foutput.BaseStream;
            BinaryFormatter binfmt = new BinaryFormatter();
            binfmt.Serialize(fout, decoderoot);
            decoderoot = null; 
            #endregion

            // Create codes
            CodeEntry[] codeTable = makecodes(root);

            uint buffer = 0;
            int bits_in_buffer = 0;

            fin.Seek(0, 0);                   // Start from beginning

            // Now encode the entire stream.
            CodeEntry e;
            while ((b = fin.ReadByte()) != -1) {

                // Pull relevant entry from the codelist
                e = codeTable[b];

                // Add new code to buffer. Left-shift the code so it lands next to whatever is already in the buffer
                bits_in_buffer += e.bitcount;
                buffer |= (e.code << (32 - bits_in_buffer));

                // Quick check..
                if (bits_in_buffer > 32)
                    Console.WriteLine("Fatal! Buffer overflow in encode()");

                // Write buffer to output
                while (bits_in_buffer >= 8) {
                    fout.WriteByte((byte)(buffer >> 24));   // WriteByte takes the right 8 bits out of an int, so we must right-shift by 24 bits
                    buffer <<= 8;                           // Shift the whole buffer 8 bits to the left
                    bits_in_buffer -= 8;
                }
            }

            // Write out any bits left in the buffer, will be less than 8.
            // There is a slight chance decode() will output an extra character at the end of the output,
            // if a value-node is found at the end of however many padding 0's are written. Can be solved by inserting
            // a virtual "EOF-node" in the tree, or simply writing the filesize into the encoded file and stop
            // decoding when the correct number of bytes has been written. 
            if (bits_in_buffer != 0)
                fout.WriteByte((byte)(buffer >> 24));

            foutput.Close();
            return;
        }

        static void decode() {
            DecodeNode root;

            Stream fout = foutput.BaseStream;
            Stream fin = finput.BaseStream;

            // Recreate the b-tree
            BinaryFormatter binfmt = new BinaryFormatter();
            root = (DecodeNode)binfmt.Deserialize(fin);

            // Read through the whole file, byte by byte, decoding as we go
            int v;
            byte val;

            DecodeNode current = root;
            while ((v = fin.ReadByte()) != -1) {
                val = (byte)v;

                // Do each bit, left to right
                for (int i = 0; i < 8; i++) {

                    // Walk the tree
                    if ((val & 128) == 128)  // Check if MSB is 1
                        current = current.right;
                    else
                        current = current.left;

                    // If we are at a leaf-node, output the value, and reset to root of tree for the next bits.
                    if ((current.left == null) && (current.right == null)) {
                        fout.WriteByte(current.value);
                        current = root;
                    }

                    // Shift next bit into position
                    val <<= 1;
                }
            }

            fout.Close();

            return;
        }

        static void usage() {
            System.Console.WriteLine("Usage:");
            System.Console.WriteLine("Huffman [-c|-d] <in-file> <out-file>");
            System.Console.WriteLine("-c    encode <in-file>, write result in <out-file> ");
            System.Console.WriteLine("-d    decode <in-file>, write result in <out-file> ");
        }

        static void Main(string[] args) {
            if (args.Length != 3) {
                usage();
                return;
            }

            // Open input stream
            try {
                finput = new StreamReader(args[1]);
            } catch (Exception e) {
                System.Console.WriteLine(args[1] + " is not a valid input-file:\n" + e.Message);
                usage();
                return;
            }

            // Open output stream
            try {
                foutput = new StreamWriter(args[2]);
            } catch (Exception e) {
                System.Console.WriteLine(args[2] + " is not a valid output-file:\n" + e.Message);
                usage();
                return;
            }

            if (args[0].Equals("-c"))
                encode();
            else if (args[0].Equals("-d"))
                decode();
            else
                usage();
        }
    }
}



Tuesday, November 10, 2009

10 Rules for a Successful Project

I've been part of many projects over the years. Some large, as in several hundred million dollars, some smaller. Most of them have been less than successful. The latest project, however, is a resounding success. It is large, over 50 million euros, it is relatively high tech, and it is hugely fun. It has also been an eye-opener.

Whole books have been written about project management, hundreds of them. I've read a few, but learned little. The books tend to be very general, and contain what in programming books are referred to as "patterns". I think they all have their value, but they need to be used with caution. You cannot read these books and blindly apply your freshly acquired knowledge; you need to thoroughly analyze the applicability of each and every point made. (This applies to many or most books, obviously.)

I think the books would be more useful if they simply contained a list of rules, which should take up a page or two, with the rest of the book used to elaborate on each point and provide examples of how the rule can be followed in practice (it's a book after all, so it must by convention contain more than two pages).

So after very little deliberation, I hereby grant you unrestricted access to my accumulated knowledge of project management after a good couple of years in a well-run project (and many years in less well-run projects).

1. Understand the assignment
This is the single most important rule. When you were in school, I am sure the teacher told you before every exam to read the assignment carefully and make sure you understand exactly what is required. This is fairly obvious, but I think you will find that too many projects crumble because of this alone. Make sure that not only YOU understand the requirements, but also all other members of the project team, as well as the customer. Many requirement matrices are blindly based on "hand-me-downs" (someone did something vaguely similar somewhere else, or someone stumbled across Google), so frequently the customer does not really know what a particular requirement means, and why it is there. This is just a fact of life, and it's better to recognize it and deal with it.

If there is no requirement matrix, make one. Yes, really. Document whatever expectations the customer has before commencing the project. Then you have a baseline from which you can estimate cost and resources, and something to bargain with when the requirements change. Make the customer adopt it as his own.

Going through the requirement matrix with the customer will ensure there are no unpleasant surprises when the project is done. Even if you've met all the requirements on time and on budget, the project will still be a failure if the customer thinks he's getting something he's not.

In practice, we use an Excel workbook, with every single requirement on its own row (currently about 1200). We then go through each requirement and try to "translate" it into something tangible (requirements are usually formulated roundly, so as to not impose restrictions on the actual solution). We then go over those "interpretations" with the customer, and ensure that we understand what the customer means by the requirement, and the customer "understands our understanding". Labour-intensive for a while, but very, very useful. We also do a number of other things to those requirements, but that's the most important part. There are tools to assist in this work, but I think the mentality of ensuring you know what you are getting into is by far the most important. The tools will follow.

Almost as important as understanding the requirements is making sure the customer/PM understands your requirements. If you are to deliver a, b and c, you need x, y and z. Make sure any caveats are clearly documented. If, at a later date, these conditions are not met, you have documentation on whose responsibility this was.

As an example, I was once tasked with implementing an ERP system at a startup GSM operator. Following my own advice, I tried to get a requirement matrix for the system. What is it supposed to do? What information do you actually want it to provide? There was no such matrix, so I arranged meetings with the stakeholders, starting with the project owner. As it turned out, nobody really knew WHAT they wanted, and most of them didn't know WHY they should want it at all! You can imagine the success I would have had, delivering a system when nobody can say what it is supposed to do. I killed the project.

2. Well defined roles and responsibilities
Again something that should be self-evident, but is sinned against time and time again. Every member of the team must know exactly what is expected of them, and what their authority is. It's hard to do your job if you don't know what your job is. I cannot stress this enough. As I see it, it's the Project Manager's job to define and communicate these roles. If he does not do this, he is not a good PM! Simple as that.

Under the same heading I want to add "processes". Depending on the size of the project team, well-defined processes are incredibly important. They ensure people know what to do in certain circumstances, and also enforce #3 on this list.

3. Document it!
"It" referring to pretty much everything. If you are in a meeting, make sure minutes are taken and sent out. If it is a meeting where anything gets decided, you will have a written record of that decision. If it's a meeting where potential issues as discussed, you will have a written record of those troubles being alerted. If someone wants you to do something not previously agreed on, get it in writing. You will need it.

Making sure things are documented in this way not only serves to cover your ass, it also acts as a mental filter for those making requests. If they understand that things are documented and their request might come back and bite them on the behind, they will think their request through more thoroughly. Which is a Good Thing.

4. Manage Change
Change will happen. There will be changes to your requirement matrix, or schedule, or something else. Make sure these changes are documented, and that you clearly communicate any and all effects of that change. If you are in a project for a paying customer, having a clear requirement matrix (as you of course will have after reading Rule 1), you can now recover any additional cost associated with implementing that change. If you do not have a requirement matrix when the project starts, you are working blind. The cost estimates for the project were based on air, and it's hard to charge extra (or ask for more resources) for more air.

5. Interfaces
In any project you will have interfaces to other systems/parts of the organisation/something. It could be technical interfaces (protocol x/y/z), or it could be organizational (so-and-so are your responsibilities when operating this and that). Make sure those interfaces are documented.

Translation: make sure you have a document which clearly states what you will deliver, and what you need the other party to deliver, with a timeline. Make sure it is signed by both parties. Partially to cover your ass if the other party does not deliver on time, and partially to force the other party to recognize the interface.

6. Take responsibility - Establish trust in the group
If you are a Project Manager, or in any other position where you get to take decisions, take them. Remember, a bad decision is infinitely better than no decision at all. And you need to take those decisions, even if you neither like nor care for them. That's the price you pay for playing the Big Man.

This is also paramount when it comes to establishing trust in the project group. If the people who work with you do not trust you or each other, they will never pull in the same direction as you. Which is a sure-fire way of cocking up a project. Projects are difficult enough when everybody is on the same page and pulling in the same direction, let alone having to deal with individual agendas.

If you are not in a position to make decisions, you still need to earn the trust of your managers. Easy enough; deliver quality work, on time. That's all there is to it.

7. Don't assume
Anything. Make sure "someone" reads through any and all documentation sent from, or used by, the project group. Any solution or process should be approved by a key member of the project team. If you have sub-contractors, make sure you have someone on the team with enough knowledge of the subject to ask the tricky questions and tell if something fishy is going on. Yes, I have seen BIG projects fail because the project team assumed the sub-contractor had the best possible technical resources to do the job, and that it was not really worth kicking up a shit-storm to get resources on the project team who could do QA on the design documents. (Turns out, there were no design documents. The big, multinational tech company was winging it. And failing.)

8. Get commitment
I've seen too many projects staffed with 5% resources. "This guy will use 5% of his time, that guy will use 10%, together with all the others that will make up the total number of hours required". I've never seen it work.

I think it boils down to ownership: if I only have a marginal role in a project, I am unlikely to look really hard for potential problems and take action to avoid them. Someone else must have done that already. But when ALL the resources are marginal, nobody will. I've turned down several projects because they only wanted a few percent of my time. It has nothing to do with money (I get paid the same either way) and everything to do with success. I don't want to be part of a project I do not think will succeed. So, when you staff your project, make sure your key members are 100% allocated (or damn close). If you are a key member, insist that you are 100% allocated. If this is not possible, I would walk away.

9. Risk Management
Risk Management is one of those terms that is kinda hard to pin down, as it could be pretty much anything. I like to think of it this way: EVERYTHING we do in the project is rooted in risk management. Everything that is documented is documented in order to minimize the risk that we're not in control. Any product chosen is chosen in order to minimize the risk of it not doing whatever it needs to do properly. Any technical solution or routine is established to minimize risks. The Excel sheet in Rule #1 is used to minimize the risk that we miss our target. Having a conscious view of risk management is important, as well as useful.

In practice, we do risk management sessions at various times in the project. For any big change, we assess the associated risks and come up with ways to mitigate them. Then we make sure the risks identified are communicated to the "higher-ups". Again, the mentality is the important thing. How you choose to implement it is up to you.

10. Test
After anything has been implemented (not only technical solutions), make sure you test it! Every testable requirement should be tested and documented, but so should functionality that has not specifically been requested. I've never seen a requirement matrix that specifies that the users must be able to change the passwords on their PCs, but it's a pretty big deal if it doesn't work..

This is also one of the things we go through when we discuss the requirements with the customer. At the end of the project, the time comes to get paid. If it's a sizeable project, the customer wants to make sure you've delivered everything you've agreed to. Asking, for every requirement, "how should we document delivery of this requirement?" is worthwhile. Once you can produce test reports for the requirements, objections tend to go away. Which is another Good Thing.

There are more..
There are a lot more rules, but those are the Top 10 in my book. I will mention one more, but without a number (or else it would be "Top 11 in my book", which doesn't really work..): a plan. Really? A plan? Yes. A plan. Every project has a plan. But they are frequently badly formed and not really followed. The plan should contain all your milestones, the most important tasks, and whatever interdependencies there are. That does not mean it should be as detailed as possible, but rather that the "right" tasks should be there. And it should be followed up. Every week the team should assemble and report the status of their tasks. If something is falling behind, the plan will show any snowball effect, and solutions or new priorities should be worked into the plan. The plan is not a static document, but rather a tool for managing progress or the lack thereof.

In general, this last statement applies to many traditional PM tools. They are not something you just need to have because the book says so; they are tools, and need to be used as such. If something really does not help with finishing the project, just drop it. Find something else that works.

A lot of what I have written can be interpreted as "Cover Your Ass" tactics. It's not. When everything is documented, uncertainty and risk are minimized. Everybody knows who is responsible. How often have you been in a situation where the other party says "Me? I thought you were supposed to do that?" Having it documented (and accessible) will aid in minimizing silly misunderstandings like that.

I do not suggest you let something slide because you know it's not your job. Quite the opposite, actually: make whoever is responsible aware that something is not up to snuff. This is perhaps the most important rule of all: have an environment in the project team (and include the customer in this) where everyone is looking out for each other! Making mistakes is not a hanging offence! (Making the same mistake twice is, however.)

Sunday, October 11, 2009

An idea for a new Storage Engine for MySQL, Part II

So, a few days ago I blogged about an idea I had for a new storage engine for MySQL. The MySQL API is C++, so it would require me to actually learn C++. Which I tried, but it hurt my eyes. So I decided to just go ahead and write a prototype in C#, just to quickly test out the basic premise: reducing or eliminating the storing of multiple copies of the same data, using B-trees and hash values. (And gaining performance in the process..)

Since then I've spent 4 days in Istanbul, which was brilliant, but before I went I made a quick prototype, using .Net 4.0 beta 1. The preliminary results were encouraging enough to motivate me to polish it up a bit. In essence, I took 70MB of logs and imported it into the storage engine in 18 seconds, resulting in about 22MB on disk. (That is 32-bit aligned, mind you, so if size was the only concern it could be trimmed down quite a bit.) Fairly simplistic, and for now everything is stored as byte arrays; it doesn't even support deleting data. But with a little brain-work, I guess it could be a useful way of getting quick and simple permanent storage on .Net. (I really don't like the idea of requiring SQL Server for trivial stuff.)

Not the first project with that aim, I know, but it is an interesting learning experience nonetheless. Incidentally, memory-mapped files are supported in .Net 4.0, which is cool. In time, I hope to also make a MySQL version, which would make it possible to use the same data files from a .Net application and MySQL, which might be vaguely useful; for instance, writing a small .Net snippet to create a database from some moderately complex dataset and exporting it to MySQL with a simple file copy. Which I would file under C, for "Cool".

So, this is where I am at the moment:
  1. I've got a crude XML-format for defining a row-format, and a parser for it.
  2. A class-hierarchy that works fairly well
  3. A b-tree implementation that seems to work (I'm sure there are a few off-by-ones, but that's hardly the point)
  4. A data loader which loads in data based on the supplied dataformat
  5. A cmd-line app that lets me play with it

I can load data and retrieve it as arrays of strings based on row numbers. Each column uses two files, both memory-mapped. One data file contains an entry for each row, which is simply the record number in the b-tree where the data can be found. The b-tree file contains entries for nodes and data, stored in pairs: a node entry, followed by a data block of static size.

A variation of this is used for really small fields, for instance ints, in which case the column simply contains the values directly in the row entry, instead of a pointer to the b-tree. These columns do not have b-trees.
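
For the curious, here is a minimal sketch of the "two memory-mapped files per column" idea, writing one fixed-size value block and reading it back through the row file. The file names, block size and layout are invented for illustration; the real prototype's node/data format is more involved.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class ColumnSketch {
    const int BlockSize = 16;           // pretend every value is a fixed-size 16-byte block
    const long Capacity = 1 << 20;      // 1 MB per file, plenty for the sketch

    static void Main() {
        using (var rowFile = MemoryMappedFile.CreateFromFile("col.row", FileMode.Create, null, Capacity))
        using (var valueFile = MemoryMappedFile.CreateFromFile("col.val", FileMode.Create, null, Capacity))
        using (var rows = rowFile.CreateViewAccessor())
        using (var values = valueFile.CreateViewAccessor()) {

            // Store one value block and point row 0 at it.
            byte[] data = new byte[BlockSize];
            data[0] = 42;
            int recordNo = 0;                               // first block in the value file
            values.WriteArray(recordNo * BlockSize, data, 0, BlockSize);
            rows.Write(0 * 4, recordNo);                    // 4 bytes per row: the record number

            // Look up row 0 again: read the record number, then the block it points to.
            int rec = rows.ReadInt32(0 * 4);
            byte[] back = new byte[BlockSize];
            values.ReadArray(rec * BlockSize, back, 0, BlockSize);
            Console.WriteLine(back[0]);                     // 42
        }
    }
}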

I didn't bring a laptop to Istanbul, but I did think about it every now and then. So here is the current TO-DO list:

1. Make stress-tests. Randomly retrieve lots of rows, timing it
2. Add more statistics, maximal search length, number of levels in the b-tree, timing searches etc
3. Single datafile:
 - Extents. Need map of extents, each column has one or two extents * n, data and row
 - Need logic to get a new extent when one is full
 - rowCount on table-object, not column
 - Extents handled by table-object
 - mmap one extent at a time
4. Support deletes
 - Use free-lists
5. Support variable datasizes
 - Another segment for actual data, for a total of two or three:
 - Non-optimizing:
  - 1 segment for rowdata, with pointers to another segment with valuedata.
  - a single list for fixed sizes 
 - Optimizing:
  - 1 segment for rowdata, 1 with b-tree, another for value-data
  - the b-tree requires two reads anyway, so should not take longer (ValEntry+Data)
6. Create general library for .Net
 - Need decent way of retrieving data, apart from row-number
 - GetRow on Table-object, return row as object[]
 - RegEx'es on dataload
7. Create MySQL StorageEngine using same data format
8. Indexing?
9. Store data as native datatypes, at least int and char[]
10. Profile it, optimize
11. Verify that 32-bit alignment actually improves performance.

Friday, October 2, 2009

An idea for a new storage engine for MySQL

This week I've spent some time analysing about a gigabyte of logs from an old IIS 5.0 server. The logs are in the standard text format, so I quickly decided to dump it all into MySQL.

Using MySQL for this sort of thing (or any database, for that matter) works like a charm. Yes, of course I could have used Perl, or for this particular case, any number of log analysers, but I find that working with SQL is simply quicker and more convenient in practical terms, for on-the-fly one-time analysis where I don't really know up front what I'm looking for. It also allows me to use MySQL to aggregate the data, and use pivot tables in Excel to chew through and graph it. It simply works well for me. (And no, I don't care that other open source tools exist. I want it to work now, not after fiddling around with it for hours. And Excel is already installed on my laptop.)

Ideally I'd like to just dump the whole thing in Excel, but Excel will only allow you some 65000 rows, which is not nearly sufficient. My dataset is about 6.4 million rows.

Anyway, importing the log files was quite easy: a small command line to clean the data (think grep and sed), and load data infile... from MySQL did the trick. At a leisurely pace, to be sure, but it did it. (I have MySQL running on Linux in a virtual machine, so I'm not blaming this lack of speed on Linux or MySQL..)

So, while I was working with this data it struck me how much redundancy there was in the columns. For instance, in 6.4 million rows, I have only 2479 unique IPs. This is a quick-and-dirty thing, so I didn't spend time optimizing anything. The client IP address in my table is varchar(15), because xxx.xxx.xxx.xxx is 15 bytes. The log files are dumped as text, keeping everything as quick and painless as possible. In essence, the database is storing 6.4 million ~15-byte values, for only 2479 unique values. Or, to state it in real numbers, my database has 70,095,785 bytes of raw data stored in the IP-address field, not counting any overhead. The unique values sum up to 30,228 bytes.

In a production environment, you might create a second table with unique IP addresses, and only reference that table from the "log" table. You'd save on disk space, but your SQL would be more cumbersome to write. As I said, this was ad hoc, and I don't need any extra headaches for this type of work.

I've looked at the numbers, and the pattern seems to repeat. The data is about two months' worth, so the date field only has 60 unique dates. There are, however, 6.4 million dates stored in the database. The request URL has only 657 unique values, yet I am storing those 6.4 million times. It all adds up!

I've looked around for an alternative storage engine, but without much luck. Packed MyISAM tables compress by row, which won't do me any good. The Archive engine does the same, if I read it correctly.

Here I think it's time to pause and explain that I am not a DBA, I am not a programmer by trade or education (although I have programmed in several languages for my pet projects), and I do not intimately know MySQL. I use it occasionally. I browsed the source for the first time yesterday, and read a couple of Internets about the storage engine API. So the rest of the post should be regarded with scepticism by those looking to learn, and mercy by those who already know.

With that in mind, I'll take a couple of minutes to try to explain what the storage engine is, and why I'd like a new one. An RDBMS (Relational DataBase Management System; MySQL is one) consists of a lot of pieces. One part parses SQL, generates execution plans, optimizes, and does all sorts of more or less interesting bits, all of which are important. The storage engine is the bit that actually puts your data onto your disk (or wherever you want it). In MySQL, this part can be swapped out, depending on your needs. For instance, there is a storage engine that never stores anything to disk; it only keeps it in memory. This would be convenient for things that do not need to live long, but need to be fast. The rest of the RDBMS doesn't know much about this; it only knows how to ask the storage engine for data. It really doesn't know what the storage engine does with the data, or how. It just wants to get what it asks for.

Likewise, when you write a select statement, you don't need to worry much about which storage engine is used. It all just magically works. So you can use different storage engines, with different properties, just as long as those storage engines follow the rules, i.e. the API.

For a better explanation, check out the official documentation.

So back to my plan. Here is the crux of what I want: a storage engine that automatically checks whether a value stored to a column has been stored before. If so, simply point to the existing data block. (And use copy-on-write for updates.) In this particular case it would save me a LOT of data. I haven't crunched the numbers properly yet, but keep in mind that the entire gigabyte of log files compresses to about 17MB, so there's a lot of redundancy in there.. Potentially I might even be able to hold the whole table in memory at once, which would speed things up quite a bit.

Think of it like this: a list for each column in a table, with one entry for every row. Let's call this the "row-list". Another list for each column, with one entry for every distinct value. Let's call this the "value-list". Each entry in the row-list is just a pointer to the correct entry in the value-list.

So, when an insert comes along, the engine would look at each value, compute a hash value, and check this hash against the value-list (or B-tree) for the relevant column, looking for a match. If found, insert a pointer to the corresponding node in the value-list into the row-list, and increase the reference counter on the value-list node. If not found, create a new node in the value-list, and point to it from the row-list.

A delete would simply decrease the reference count on the value-list node, delete that node if the count reaches zero, and remove the entry in the row-list.
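
As an in-memory illustration of that row-list/value-list bookkeeping (a Dictionary standing in for the hash lookup and on-disk B-tree; all names here are made up, this is not the proposed engine):

using System;
using System.Collections.Generic;

class DedupColumn {
    class ValueEntry {
        public string Value;
        public int RefCount;
    }

    readonly List<int> rowList = new List<int>();                  // one entry per row -> index into valueList
    readonly List<ValueEntry> valueList = new List<ValueEntry>();  // one entry per distinct value
    readonly Dictionary<string, int> lookup = new Dictionary<string, int>();

    // Insert: reuse an existing value node if there is one, otherwise create it.
    public int Insert(string value) {
        int idx;
        if (lookup.TryGetValue(value, out idx)) {
            valueList[idx].RefCount++;
        } else {
            idx = valueList.Count;
            valueList.Add(new ValueEntry { Value = value, RefCount = 1 });
            lookup[value] = idx;
        }
        rowList.Add(idx);
        return rowList.Count - 1;                                  // the new row number
    }

    // Delete: drop the row and decrease the reference count on its value.
    public void Delete(int row) {
        int idx = rowList[row];
        rowList[row] = -1;                                         // sketch: just mark the row as gone
        if (--valueList[idx].RefCount == 0)
            lookup.Remove(valueList[idx].Value);                   // the value node is now unused
    }

    public string Read(int row) {
        return valueList[rowList[row]].Value;
    }
}

For the IP-address column above, Insert() would run 6.4 million times, but valueList would never grow past 2479 entries.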

This all sounds frightfully involved, but my gut tells me it would pay off by reducing disk IO and memory footprint. (Not sure if cache locality might become an issue, though.)

A quick estimate of data volumes goes like this: I'm guessing I'd need 4 bytes (a 32-bit pointer) per row per column in the row-list. Then I'd need a 32-bit hash value and a reference counter per entry in the value-list, plus the value itself. Total: 4 bytes per row, 8 bytes per unique value, plus the actual data. I'm comparing against the raw data, not making any assumptions about MyISAM overhead. (I would also probably need some other bookkeeping data structures/fields, but I guess MyISAM also uses those, so it evens out. In my head it does.)

A quick graph in Excel tells me that for a 15-byte value, I'd save on data up to about 48% unique values. In my dataset the unique values are less than 1%, so there is potential for BIG savings, as demonstrated by the IP-address example. I estimate 32-bit pointers are more than sufficient; remember, we're talking about unique values. If you have more than 4 billion of those, you should probably look at a different storage engine! (The corresponding number for 64-bit pointers, incidentally, would be ~22%..)
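
For the record, here is the arithmetic behind that ~48% figure as a small sketch, assuming the 4 bytes per row and 8 bytes per unique value from the estimate above:

using System;

class BreakEven {
    static void Main() {
        double rowPtr = 4;              // bytes per row in the row-list
        double valueOverhead = 8;       // hash + reference counter per unique value
        double valueSize = 15;          // e.g. a varchar(15) IP address

        // Dedup wins while: rowPtr * rows + (valueOverhead + valueSize) * unique < valueSize * rows,
        // i.e. unique / rows < (valueSize - rowPtr) / (valueOverhead + valueSize)
        double breakEven = (valueSize - rowPtr) / (valueOverhead + valueSize);
        Console.WriteLine("Break-even unique fraction: {0:P1}", breakEven);     // ~47.8 %

        // The IP-address column from the logs: 6.4 million rows, 2479 distinct values.
        double rows = 6.4e6, unique = 2479;
        double dedup = rowPtr * rows + (valueOverhead + valueSize) * unique;    // ~25.7 MB
        double raw = valueSize * rows;                                          // ~96 MB at the full 15 bytes
        Console.WriteLine("Dedup: {0:N0} bytes vs raw: {1:N0} bytes", dedup, raw);
    }
}

(The measured ~70MB of raw IP data is lower than the 96MB above because real addresses are shorter than the full 15 characters; the conclusion is the same either way.)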

Of course, the storage engine would need to be smart enough not to use this scheme for small fields. No sense in replacing a single-byte value with 12 bytes of overhead..

One other interesting aspect is that I'd have 'free' indices; if value A matches an expression, this would apply to ALL rows that point to that node in the value-list. So there would be no need to evaluate those. I don't know if I would be able to get the optimizer to utilize this in any meaningful way, though.

I think this storage engine would be very useful for things like logs that have a lot of redundancy in them. It would, as far as I understand, drastically cut down the importance of getting your data into normal form, and might even throw in some nice enhancements for 'free'. The write speed would probably be lower than, say, MyISAM's, but that would be fine for the sort of scenario I'm looking at. It may, on the other hand, turn out to be higher for certain workloads, since there'd be way less disk IO.. For both reads and writes. But that's speculation/wishful thinking.

So, step one in my plan for World Dominance: Learn C++.

Monday, September 28, 2009

Videogames, part II

So.. Number 2 on my list of annoyances in videogames (in no particular order): superfluous "Press whichever button!" This actually stopped me from playing through the greatest videogame in the history of mankind (in fact, the history of the universe, judging by its disciples): Metal Gear Solid 2. The number of keypresses required just to be able to move an arthritis-ridden, stooping, smoking old geezer around the screen proved too much for me. I did not even finish the first level. It went straight to Gamestop, where I could at least recoup some of my money. So much the better that the game studios never see a dime of that extra value generated from the second-hand market. It is my one and only way of letting them know that I do not appreciate them wasting my time. By the way, MGS2 is by no means the only one, it just happens to be the worst one I've experienced so far. The post isn't about MGS2 in particular, but about console games in general.

First, there are about 600,000 logos, pictures of god knows what, god knows why, and videos. This all invariably results in vigorous keypresses on just about every key there is on the controller. I've also resorted to picking up any other controllers connected to the console in question, just in case there's a bug. To no avail, of course. Who, in their right mind, thinks that I care who made the sound engine?? Or, more accurately, who thinks that I care about it so much that I will find this information interesting, not only before I play the game for the first time, but every single time I play it?? This cannot wait until the credits roll?? This, I think, can be likened to prominently displaying, for 15 minutes, the name of the gaffer before every movie shown. (No, I don't know what a 'gaffer' is either. But I know every movie has one.)

Then there's the "Press Start". Ok... Well, I've spent some time considering this, and I've come to the conclusion that it might be excusable, in the event that the game console has more than one controller attached. If not, NOOOOOOOOO! Get it? Just no. We, as a species, have placed people on the moon. The Cell-processor is, as far as I'm concerned, the grestest feat since the Alpha-processor, with a straight lineage to the wheel. It's a marvel. A testimony to human ingenuity. Yet, there it is, all nicely tucked into a sleek black console in my "study". (That's what my wife calls the room where she puts all of my stuff.) So don't tell me we can't make a videogame that doesn't require me to tell it that I want to play. It's on. That means I want to play it.

Well, so I've accepted the Press Start.. For now. So now what? Well, now comes, if you're lucky, the "Please select a storage device". Hm. I only have the one, so I guess I'll choose that one. What would any reasonably intelligent being guess, if pushed? There you go. Game studios are entirely staffed by an alien race, come to this planet to subdue us all into a mindless dance of pressing buttons for no apparent reason. They, meanwhile, will probably ravage the women we ignore while in front of our consoles. Meh.. The game, by the way, needs to know which storage device to use every time it is started..

"Please select storage device" is only acceptable if there is more than one storage device! And only, only the first time the game is started. How hard can this be??

Well, that's three (or many). Time to play! Nope. First, they feel the need to inform you that this particular game has a cutting-edge feature. One feature so new, so inventive, so revolutionary, that reminding you of it every time you start the game is entirely justified. Nay, required, even. Never mind that you've logged the better part of your toddler's life playing the game; never mind that you've started the game far more times than you've called your mother in your entire life. No, the task they perform, and what you need to acknowledge with a keypress, is that of saving your progress. And it is of vital importance that you do not, I repeat, not, under any circumstance, within the timespan of 2 ms, take a dive towards your game console and yank out the power cable. Good to know.

Then there's the Main Menu. Dear Lord.. I don't even know where to begin with that one. I've started games on my console literally thousands of times. I do not own thousands of games. That leads to the inevitable conclusion, again if you are a reasonable human being, that I continue where I left off far more often than I start a new game. In short, the number of keypresses from when I start the console with a game disc in it until I'm exploding slugs with a flamethrower should be nil. If you insist, I can settle for one: "Do you want to continue where you left off? [Yes][No]". That would be sensible. That would make me happy.

There are more keypresses. Sooo many more. But I think I've made my point. Superfluous keypresses are bad.