Recognizing a language is solving a problem

13/11/2011

All though Could an Animal learn to Program I've been talking as if the ability to recognize a language is some kind of general case in problem solving. Any problem which can be formalized can be re-realized in the form of a language recognition problem. And more than that - the complexity of the problem corresponds to how complicated a machine, grammar or algorithm you need to recognize it. This is at first a very unintuitive notion. Unfortunately how the problem complexity maps to the language class it a bit beyond this blog, but I'll try to give some form of explanation on why the rest is true.

First, what does it mean to recognize a language? In formal language theory it means that given some string you can do some kind of algorithm or computation and either accept or reject the string based on if it is in the language or not. So we have two concepts of the language. The set of strings this machine accepts but also the actual specification of the machine itself. From the machine we can generate the set of strings, simply by running it on every possible string and only including the ones it accepts on. But it isn't always possible to know what the machine looks like given a set of strings. Asking what this machine looks like is actually one way of asking "How do we solve this problem?".

So how do we phrase a problem in terms of language recognition? It is impossible to give a general answer to this so it is better to give some examples. We'll start with a "yes, no" problem as those are a bit easier. For example: "I have some shape and I want to know if it has 10 sides".

The idea is to represent this shape as a string and then check if it is in the language of "ten sided shapes" using some machine. One way to do this would be to number the letters "A-Z" as 0 to 26 and then use these to encode the coordinates X,Y of each corner of the shape. We then just concatenate these into a string and feed it into our machine. A triangle with points at (0,0) (2,8) (5,4) would become the string "AACIFE". Due to the geometric fact that polygons in this encoding must have the same number of edges as corners, our machine would then just get the length of the string, divide it by two (each coordinate is two symbols long) and see if the result was equal to ten. In other words our machine would recognize all strings of length 20. So given this encoding, the language of strings of length 20 is the same as the set of shapes with ten sides.

But what about functions? How do we solve 2+2 by recognizing a language? Well a function can be thought of in terms of it's mapping. Addition maps two number to one, and the way we encode this as a string it to group both the inputs and the outputs together into one string. If we use our previous encoding, but imagining a unique symbol for all numbers (not just 0 to 26), and we also using a special symbol (say "|") to separate our inputs and outputs then we can think of the language of addition being the set of strings: "X|Y|Z" where X,Y and Z are symbols in our alphabet and the value of Z is equal the value of X plus the value of Y.

A machine which recognizes this language is simple. It first checks the string is in the correct format (three symbols separated by | symbols). It then finds the encoded value of X, the encoded value of Y and checks that the value of Z is equal to X plus Y. If all of these things are the case it accepts, otherwise it rejects. You can see that given any string this will only accept strings in the correct format and which represent an addition in our encoding. If we want to find an output Z given that we know X and Y we just enumerate for all different Zs until the machine accepts our string. Then we know we have the correct value for Z.

This probably seems backwards and weird but it is a very powerful way of thinking about problem solving and really gets the core fundamentals of it.

Here is a demonstration of the power of this model for problem solving, and I'll warn you that the maths is about to get heavy.

We can find out how many problems, of the total number of possible problems are actually solvable.

First, if you can, imagine the set of all possible languages, including languages with an infinite number of strings in them. This itself an infinite set so the answer to the above problem is not going to be a fixed number, but we can try to consider what proportion of this infinite set we can possibly recognize.

Lets look at the languages with a non-infinte number of strings first. These are easy. We can recognize these just be enumerating all the strings and checking if the input is equal to any of them. Just like we did for palindromes of length three. Sure, some languages might be huge but technically they are still recognizable in this way.

Next the infinite languages. This is a bit harder but we can definitely recognize some of these. For example there are an infinte number of possible strings in the English language, yet I can recognize that perfectly fine. The addition language I explained above is also infinite and we can recognize that. We can perform addition on numbers of any size. So perhaps it is possible to recognize all infinite languages?

Unfortunately not. Consider a language which is infinite but the strings in it have no seeming pattern or relation. Like the decimal expansion of pi it just goes on forever and you never know what the next string will be next. And there are lots of these languages. In fact the vast majority of languages will be like this and there is no way to generate or recognize them because they just express no logic or reasoning behind them.

The only infinite languages we can recognize are the ones with some kind of "pattern" in, like the addition language, palindromes, or even English. For the mathematicians here, this relation actually corresponds to the size of Natural Numbers in comparison to the size of the Real Numbers. There are an infinite number of languages, and an infinite number of recognizeable languages, but the number of total languages is a "bigger" kind of infinity.

And that is really nice way to see what the serious maths behind theory of computation is about. In short - how much of a "pattern" do you need in an infinite language for it to become recognizable by some machine. But we don't stop there, we even get into classifying these different kinds of patterns so we can say what problems are similar. This helps us solve problems in future because we can quickly see how we solved previous similar ones.

The language of English, with all of it's infinite possible strings, might not appear to have much of a pattern if you were to just list strings at random. But the really amazing thing is that there is a pattern, which you are recognizing right now just by reading this text. This is what mathematicians are doing all the time, just more formally. Looking for these weird patterns in infinite alien languages and recognizing what before was impossible to understand.