How I learned to stop worrying and love statistics

Created on Aug. 9, 2012, 12:25 p.m.

I've always hated statistics. Nothing smells more like accountancy, rimmed glasses, bookkeeping, and horrible little news reports than statistics. A career in statistics was the image of compiling endless financial reports to a stony board of directors in an attempt to squeeze out those few more dollars from the public. It was the lowest. It was the selling of a mathematical mind to the machine and the end of all beauty and expanse. There was no doubt in my mind that statistics was simply evil.

So it was a mysterious change when it happened, and it all began with a search engines module at University. This was easily one of the best courses I took in my time at University and from the beginning of the course what became most clear was that making an effective search engine had nothing to do with understanding the English language, with extracting semantic meaning from queries or documents, with logic, reason, or human experience. It was all to do with raw, unadulterated statistics.

And suddenly I saw the glint of gold. I saw a promise in statistics. Hiding beneath dusty logarithm lookup tables and hypothesis testing was the promise of an Oracle Machine. Something that could be queried and provide answers in milliseconds. This was knowledge like nothing seen before, and yet it had nothing to do with knowledge, logic, semantics, or meaning. It was just numbers, just data and statistics and a query box. Ultimately the question in my mind was "how can this be?", and secondly, against my better judgement, "how can I get it?".

An internet search engine relies on the systematic de-construction and processing of text. The text is crippled; stripped of meaning until it is completely void and will fit into nice neat data structures for processing. Only then does the data shine through. And once the numbers are ready, the statistical algorithms can roll along and process the data. Finally the questions we all have can be answered in the blink of an eye. Building a search engine is not re-inventing the wheel, it is rediscovering the holy grail.



The first thing to go is syntax. The hierarchy of language, which structures and subjugates words into a towering tree, is unimportant under statistics. All web pages, documents, and queries are reformed and stored as jumbled lists of words. Context is not truly lost. Words which often appear together are still associated through their shared presence in a list. Everything is just a little more anarchic. The words have been freed of their sentences. There is no longer a primary verb, or a root pronoun. Under the statistical system all words are equal, and as you would expect, some are more equal than others.
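As a rough illustration of the "jumbled list of words", here is a minimal sketch of a bag-of-words representation. The function name `bag_of_words` and the very simple tokenising rule are my own assumptions for the example, not how any particular search engine does it.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out the words, and count them.

    Word order is thrown away; only which words occur, and how often, survives.
    The tokenising rule (runs of letters and apostrophes) is an assumption.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Two sentences with opposite meanings become the same jumbled list.
dog_bites = bag_of_words("The dog bit the man")
man_bites = bag_of_words("The man bit the dog")
print(dog_bites == man_bites)
```

The two sentences mean very different things, yet as bags of words they are indistinguishable, which is exactly the loss of syntax described above.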

The important words are those which do not occur often. "The" is largely a useless citizen; syntactic glue. No room is left in our system for such common words and where possible they are removed. The "aardvarks" and "armamentaria" are king, because you can be sure if they exist in a query then they must be key. So how are these statuses assigned? Not by some governing hand. We look toward the Laws of Text, Zipf's law and Heaps' law. These laws tell you, in beautiful fairness and balance, the relative importance of words in a language. Even the numbers and numerals can be governed using Benford's law. Nothing is left to chance, all is mathematical.
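One simple way to remove the useless citizens, sketched below, is to drop any word that turns up in too large a share of the documents. The function `drop_common_words` and the 80% cut-off are hypothetical choices for illustration; real systems often use hand-made stop word lists or a weighting scheme instead.

```python
def drop_common_words(documents, max_doc_fraction=0.8):
    """Remove words that appear in more than max_doc_fraction of the documents.

    The threshold is an assumed, illustrative value.
    """
    tokenised = [doc.lower().split() for doc in documents]

    # Count how many documents each word appears in (not how many times).
    doc_freq = {}
    for words in tokenised:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    limit = max_doc_fraction * len(documents)
    return [[w for w in words if doc_freq[w] <= limit] for words in tokenised]

docs = ["the aardvark sleeps", "the cat sleeps", "the dog barks"]
print(drop_common_words(docs))
```

Here "the" appears in every document and is removed, while "aardvark", which appears in only one, is kept.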

But all this raises the question: do we really need words in the first place? Is this bureaucracy? Can something smaller suffice - say, a symbol, a letter? In languages such as Japanese, with no spaces to separate words, we can simply assume that each overlapping pair of symbols, as well as each individual symbol, is a word in its own right. As we accumulate more and more web pages and documents, the pairs which are actually words will continue to appear, while those which are not words will not. It soon becomes clear what is, and what isn't, a word. This system of taking N-grams, short runs of adjacent symbols, is effective. It is even more effective than just splitting on words, and this holds even in European languages. The reason is that we can match "policeman" with "policemen", even with no idea of the semantic relationship. Words are not required. A good search engine can use just symbols.
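The policeman example can be sketched in a few lines. The helper names `char_ngrams` and `ngram_overlap` are mine, and measuring overlap as shared N-grams over total N-grams (Jaccard similarity) is just one reasonable assumption about how a match might be scored.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping runs of n characters in the text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(a, b, n=2):
    """Fraction of N-grams the two strings share (Jaccard similarity)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# The two words share six of their ten distinct pairs of letters.
print(ngram_overlap("policeman", "policemen"))
# A completely unrelated word shares none.
print(ngram_overlap("policeman", "aardvark"))
```

Nothing here knows that one word is the plural of the other. The match falls out of the shared symbols alone.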

We note that in using N-grams, the more documents you have the better. This is another devilish aspect of statistics. In statistics, more is more. You can never have too much data. The reason is simple. Signal adds up and noise cancels out. More precisely, when you have more data, the probability of something becoming statistically significant via chance is lessened, while the probability of something becoming statistically significant via actuality is increased. In a logical formula, the smaller the formula the better. But in our search engine, the more websites scanned the better - even if what they contain is largely junk.
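"Signal adds up and noise cancels out" can be seen in a small simulation. This is only a sketch under assumed numbers: a weak signal of 0.1 buried in Gaussian noise ten times larger, averaged over more and more samples. The function name `noisy_mean` and all the parameter values are hypothetical.

```python
import random

def noisy_mean(n, signal=0.1, noise=1.0, seed=0):
    """Average n samples of a small constant signal plus zero-mean Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(signal + rng.gauss(0.0, noise) for _ in range(n))
    return total / n

# The signal contributes the same amount to every sample, while the noise
# pulls in random directions, so its average shrinks roughly as 1/sqrt(n).
for n in (10, 1000, 100000):
    print(n, noisy_mean(n))
```

With only a handful of samples the estimate can land almost anywhere, but with a hundred thousand it should sit very close to the true 0.1.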

So now our documents are simply jumbled lists of words and their relative importance. We have spiders crawling the web and accumulating more data for us, and we have an index slowly ticking over and processing the document data. All that remains now is to design the statistical models via which we rate our documents for a given query. Because of our destruction of the text we can build effective data structures and feed them into a huge database. The final step is just to turn it on.

A human-free system. A system of knowledge automated by the cold clicking hands of a computer.



The secret in statistics is rather simple. The power it provides is a concise mantra. In logic, deduction and mathematical proof one can derive true answers to precise questions. Statistics, on the other hand, can provide compelling answers to all questions. Statistics focuses on the question rather than the answer. As Douglas Adams revealed, if you wish to know the answer to the meaning of life, the universe and everything, you must first know the question.

The difference can be shown with a trick. When queried with the question...

"How many legs does the average person have?"

  • A logical system will answer ~1.99
  • A statistical system will answer 2

A good logical system will know the answer to a question.

A good statistical system will know what question you are asking.
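The trick is really the difference between the mean and the mode, which can be checked on a made-up population. The numbers below are invented purely for illustration: 995 people with two legs, four with one, and one with none.

```python
from statistics import mean, mode

# Hypothetical population, invented for this example.
legs = [2] * 995 + [1] * 4 + [0] * 1

def literal_average(values):
    """The arithmetic mean: the sum of the values divided by their count."""
    return mean(values)

def intended_average(values):
    """The mode: the most common value, which is what the question usually means."""
    return mode(values)

print(literal_average(legs))   # just under 2
print(intended_average(legs))  # exactly 2
```

Both are legitimate answers to "the average". Which one is right depends entirely on what the person asking had in mind.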



This is both the beauty and the danger of statistics. Much like a search engine, the goal of a statistical system is to tell you exactly what you want to hear. A statistical system does not answer a question with the precision and truth of a logical system, but it should capture the absolute intuition of what you are asking from it. It will know when "average" is translated to "mean" and when one really intends the "mode". When a CEO asks his statistician "is the company doing good?", a good statistician will formalise and calculate the exact notion the CEO holds of "company doing good", and present it to the CEO.

The danger comes when the intuitive notion of "company doing good" differs from person to person. Perhaps the CEO is unconcerned with the variable counting toxic waste dumped, while a citizen rates this variable highly in their intuition of that evaluation. The power of statistics comes from its subjectiveness and lack of true meaning, but this is also its Achilles' heel.

What really is the "mean" or the "standard deviation" other than the formalisation of some human intuition? In exact terms the mean is not "the average" because, as we discussed above, that is a subjective and relative notion. The mean is only itself - that is, the sum of all data points divided by the count of data points. The same is true for search engines. If I search for "The Best Page In The Universe" Google does not return the best page in the universe. It returns the tf.idf weighted sum of my query terms against its index, including user ranked weights, individual behaviour weights and PageRank.
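To make "the tf.idf weighted sum of my query terms" a little more concrete, here is a minimal sketch of the textbook form of that score: term frequency in the document multiplied by the log of the inverse document frequency, summed over the query terms. The function `tfidf_scores` and the three toy documents are my own inventions, and the user weights, behaviour weights and PageRank are not modelled at all.

```python
import math
from collections import Counter

def tfidf_scores(query, documents):
    """Score each document as the sum of tf * idf over the query terms.

    tf  = how many times the term occurs in the document
    idf = log(number of documents / number of documents containing the term)
    This is one common textbook variant, assumed here for illustration.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n = len(tokenised)
    doc_freq = Counter(w for words in tokenised for w in set(words))

    scores = []
    for words in tokenised:
        tf = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if doc_freq[term]:  # ignore terms that appear in no document
                score += tf[term] * math.log(n / doc_freq[term])
        scores.append(score)
    return scores

docs = [
    "the best page in the universe",
    "the worst page",
    "a page about the universe",
]
print(tfidf_scores("best page universe", docs))
```

Note that "page" appears in every document, so it contributes nothing to any score. The first document ranks highest not because it is the best page in the universe, but because it contains the rarest query terms.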

Statistics is not boring. Far from it. At its heart it is the beautiful and twisted cousin of logic and reason. It is deceptively powerful. It gives you the chance to throw your pennies in the well and get an answer back. Most of all, statistics is agnostic, subjective and human. Unlike the godlike sentience of logic and reason, statistics is the devil inside. For that reason I love it.
