How I learned to stop worrying and love statistics

09/08/2012

I've always hated statistics. Nothing smells more like accountancy, rimmed glasses, bookkeeping, and horrible little news reports than statistics. A career in statistics was the image of compiling endless financial reports to a stony board of directors in an attempt to squeeze out those few more dollars from the public. It was the lowest. It was the selling of an mathematical mind to the machine and the end of all beauty and expanse. There was no doubt in my mind that statistics was simply evil.

So it was a mysterious change when it happened, and it all began with a search engines module at University. This was easily one of the best courses I took in my time at University and from the beginning of the course what became most clear was that making an effective search engine had nothing to do with understanding the English language, with extracting semantic meaning from queries or documents, with logic, reason, or human experience. It was all to do with raw, unadulterated statistics.

And suddenly I saw the glint of gold. I saw a promise in statistics. Hiding beneath dusty logarithm lookup tables and hypothesis testing was the promise of an Oracle Machine. Something that could be queried and provide answers in milliseconds. This was knowledge like had never been seen before and yet it was nothing to do with knowledge, logic, semantics, or meaning. It was just numbers, just data and statistics and a query box. Ultimately the question in my mind was "how can this be?", and secondly, against my better judgement, "how can I get it?".

An internet search engine relies on the systematic de-construction and processing of text. The text is crippled; stripped of meaning until it is completely void and will fit into nice neat data structures for processing. Only then would the data shine through. And once the numbers were ready, the statistical algorithms could roll along and process the data. Finally the questions we all had could be answered in the blink of an eye. Building a search engine is not re-inventing the wheel, it is rediscovering the holy grail.

The first thing to go is syntax. The hierarchy of language, which structures and subjugates words into a towering tree, is unimportant under statistics. All web pages, documents, and queries are reformed and stored as jumbled lists of words. Context is not truly lost. Those words which often are together are still in association via their combined presence in a list. Everything is just a little more anarchic. The words have been freed of their sentences. There is no longer a primary verb, or a root pronoun. Under the statistical system all words are equal, and as you would expect, some are more equal than others.

The important words are those which do not occur often. "The" is largely a useless citizen; syntactic glue. No room is left in our system for such common words and where possible they are removed. The "aardvarks" and "armamentaria" are king, because you can be sure if they exist in a query then they must be key. So how are these statuses assigned? Not by some governing hand. We look toward the Laws of Text, Zipf's law and Heaps' law. These laws tell you, in beautiful fairness and balance, the relative importance of words in a language. Even the numbers and numerals can be governed using Benford's law. Nothing is left to chance, all is mathematical.

But all this begs the question. Do we really need words in the first place? Is this bureaucracy? Can something smaller suffice - say, a symbol, a letter? In languages such as Japanese, with no spaces to separate words, we can simply assume that each overlapping pair of symbols, as well as each individual symbol, is a word in its own right. As we accumulate more and more web pages and documents, the pairs which are actually words will continue to appear, while those which are not words will not. It soon becomes clear what is, and what isn't a word. This system, of taking N-grams, sets of symbols is effective. Even more effective than just splitting via words. Even in European languages. The reason is we can match policeman with policemen, even with no idea of the semantic relationship. Words are not required. A good search engine can use just symbols.

We note that in using N-Grams, the more documents you have the better. This is another devilish aspect of statistics. In statistics, more is more. You can never have too much data. The reason is simple. Signal adds up and noise cancels out. More precisely, when you have more data, the probability of something becoming statistically significant via chance is lessened, while the probability of something becoming statistically significant via actuality, is increased. In a logical formula, the smaller the formula the better. But in our search engine, the more websites scanned the better - even if what they contain is largely junk.

So now our documents are simply jumbled lists of words and their relative importance. We have spiders crawling the web and accumulating more data for us, and we have an index slowly ticking over and processing the document data. All that remains now is to design the statistical models via which we rate our documents for a given query. Because of our destruction of the text we can build effective data structures and feed them into a huge database. The final step is just to turn it on.

A human-free system. A system of knowledge automated by the cold clicking hands of a computer.

The secret in statistics is rather simple. The power it provides is a concise mantra. In logic, deduction and mathematical proof one can divulge true answer to precise questions. Statistics, on the other hand, can provide compelling answers to all questions. Statistics focuses on the question rather than the answer. As Douglas Adams revealed, if you wish to know the answer to the meaning of life the universe and everything, you must first know the question.

The difference can be shown with a trick. When queried with the question...

"How many legs does the average person have?"

A logical system will answer ~1.99
A statistical system will answer 2

A good logical system will know the answer to a question.

A good statistical system will know what question you are asking.

This is both the beauty and the danger of statistics. Much like a search engine, the goal of a statistical system is to tell you exactly what you want to hear. A statistical system does not answer a question with the precision and truth of a logical system, but it should capture the absolute intuition of what you are asking from it. It will know when "average" is translated to "mean" and when one really intends for the "mode". When a CEO asks his statistician "is the company doing good?", a good statistician will formalise and calculate the exact notion the CEO holds of "company doing good", and present it to the CEO.

The danger comes when the intuitive notion of "company doing good" differs from person to person. Perhaps the CEO is unconcerned with the variable counting toxic waste dumped, while a citizen rates this variable highly on their intuition of that evaluation. The power of statistics comes from its subjectiveness and lack of true meaning, but it is also its heel.

What really is the "mean" or the "standard deviation" other than the formalisation of some human intuition? In exact terms the mean is not "the average" because as we discussed above, that is a subjective and relative notion. The mean is only itself - that is the sum of all data points divided by the count of data points. The same is true for search engines. If I search for "The Best Page In The Universe" Google does not return the best page in the universe. It returns the tf.idf weighted sum of my query terms against its index including user ranked weights, individual behaviour weights and pagerank.

Statistics is not boring. Far from it. At its heart it is the beautiful and twisted cousin of logic and reason. It is deceptively powerful. It gives you the chance to throw your pennies in the well and get an answer back. Most of all, statistics is agnostic, subjective and human. Unlike the godlike sentience of logic and reason, statistics is the devil inside. For that reason I love it.