On the origin of computational complexity

My field is “Informatics”

I usually tell people I’m a programmer, but that describes my job, not my field. The thing I studied for at uni is “computer science”, or “informatics” as I like to call it. It is a field of science, and is not to be confused with the field of technology called “programming” or “software engineering”, or indeed also “computer science”. The language in which informatics was taught to me includes such constructs as the Turing machine, formal grammars, finite state machines, graphs, and expressions. In a way, this language influences how we understand the topics we study. So even though people may say it’s useless to say the same thing in a different language than the one it’s already been said in, I think it deepens our understanding of a subject if we describe it in different words. So this blogpost describes informatics in my words.

The basis: Booleans

Booleans are variables that maybe true or false. They are an important building block of everything else in informatics. When we talk about information in informatics we usually assume that this information is represented as a string of 0s and 1s. It is possible to represent information using more than two digits or letters, but Shannon’s Theory of Information explains that information can always be translated into a stream of 0s and 1s, so for simplicity’s sake, we usually assume we deal with Boolean (or ‘binary’) variables. A valuation of variables “p, q, …” can be any sequence of 0s and 1s that has one digit per variable. For instance:

p q r

0 1 0

A truth table can describe a function, so it is more powerful: it assigns a 0 or a 1 to each possible valuation, for instance:

p q r value

0 0 0 1

0 0 1 0

0 1 0 1

0 1 1 0

1 0 0 0

1 0 1 0

1 1 0 1

1 1 1 1

Rewriting expressions

A state of knowledge about a list of such Boolean variables can be written as a stream of 0s and 1s, but to give a better insight into the interrelation of the variables, and also because it’s often shorter, we often use Boolean expressions. These consist of variable names “p, q, …” and operators “NOT(…)”, “(… AND …)” and “(… OR …)”. Here are some examples of expressions that are equivalent to the truth table mentioned earlier:

((p AND q) OR (NOT(p) AND NOT(r)))

(NOT((p OR r) AND (NOT(p AND q))))

As you can see in this example already, one truth table can be equivalent to several different expressions. These expressions are interpreted by us humans to mean ‘AND’, ‘NOT’ and ‘OR’ in their real-world meanings, but the way they are defined in informatics is only behavioural. They are defined in terms of how the resulting expression corresponds to a truth table. Mathematicians will tell you that this is on purpose; the system of expressions is defined as a mathematical abstraction, not in relation to real-world phenomena which we map it onto.

When computer programs run, or we make logical deductions using pencil and paper, no information is added during the process. Information might be removed, when some of the information that the process started with is forgotten or thrown away halfway through, but that is optional. The information is only being reorganized. So reasoning, the running of computer programs, is rewriting expressions. In fact, 3-SAT, one of the most famous computation tasks in informatics, can be solved by rewriting an expression from Product-of-Sums form “(a OR b OR c) AND (d OR e OR f) AND …” to Sum-of-Products form “(a AND b AND c) OR (d AND e AND f) OR …”. Running an algorithm (or reasoning about Boolean information) is a process in time: it is a process of abstract represented states through time, that has its counterpart in the physical world, of a machine changing its physical status through time.

This is why I don’t feel this rewriting of Boolean expressions is a good way to model reasoning. It does not take advantage of the fact that on the one hand AND, OR and NOT have a meaning in our real world, and on the other hand so does time, and duration. Maybe it is because of abstracting so much from these real-world phenomena, that we still haven’t been able to prove what is arguably the central point of informatics: that ‘hard’ algorithmic tasks (problems that are in NP but not in P), exist. In other words, we haven’t been able to formalize properly that logical processes fundamentally take time.

We just observe that computation takes time, but leave the question of why this is so to philosophers. I like philosophy. In fact, I think the philosophical implications of informatics are even more fascinating than the technological ones. Informatics is at the stage where biology was before Darwin: biologists were studying species, but not their origin. In informatics, we are studying computational complexity, but we do not study its origin. Hence the title of this blogpost. :)

Causality

In the real world, we are familiar with causality: the future is partially unpredictable, meaning there are several candidate outcomes for the future, when viewed from the present. In quantum physics such uncertainty is sometimes called superposition, but we know it also as the logical ‘OR’ operator (real semantic ‘OR’ this time, not the syntactical ‘OR’ from Boolean expressions).

Causality is the relationship between cause and effect. It describes how causes in the relative past interrelate with effects in the relative future. Memory (retaining information from the past) is part of the same relationship between past and present as causality.

Unlike the future, the past can be remembered, meaning information accumulates. Knowing several things at the same time is what we know as the semantic ‘AND’ operator. Accumulating multiple bits of information this way is exactly what an algorithm does when it ‘runs’. It converts ‘OR’ information (future) into ‘AND’ information (past). Moving the present forward into the future, while putting the past behind you, is what we know as the passing of time. Although not everybody realises this, and many people will give a more mundane definition of informatics, this is what informatics is about - at least for me. It studies this fascinating relationship between information and time.

Knowledge items

To express the fact that we are accumulating knowledge, we need to interpret the syntactical ‘AND’ operator from our Boolean expressions as the semantic ‘AND’ of causality/memory. This means the entries in a truth table should be treated as collectable items. I call them knowledge items. You can collect them, and they add up to your knowledge. You can throw them away if you don’t need them anymore. This is exactly what an algorithm will generally attempt to do when it is rewriting expressions: accumulate useful knowledge items, which are arguments of the expression’s outer ‘AND’ operator.

At the same time, an algorithm will attempt to resolve uncertainty by resolving ‘OR’ terms. Note that the total information state does not change. The algorithm’s knowledge about the thing it’s modeling (the problem instance) does not change during the process. So in a way, when you give a problem instance to a reasoning machine, you could say the machine already ‘knows’ the answer when it starts. It does not receive any further information from the outside during the reasoning process. It is only rewriting the information into a format that, as time advances, has more ANDs and less ORs.

In a truth table, the zeroes are the knowledge points. This is because when you combine two truth tables with the ‘AND’ operator, the resulting truth table is found by putting a zero wherever there was one in at least one of the earlier truth tables. The ones are irrelevant in this process, look:

p q r a b a AND b

0 0 0 1 1 1

0 0 1 0 1 0

0 1 0 1 1 1

0 1 1 0 1 0

1 0 0 1 0 0

1 0 1 1 0 0

1 1 0 1 1 1

1 1 1 1 1 1

So the zeroes in a truth table are the knowledge items we collect as time progresses.

Non-deterministic deduction

A logical deduction is a linear sequence of collections of knowledge items. At each step, new knowledge items can be formed from ones that are already in the collection, using for instance the ‘AND’ operator as shown in the last example. The choice of which knowledge items to combine next, is not part of this process. The deduction is an exact path - a proof, not an algorithm. The class of NP problems are problems whose instances are solved by such proofs, such that the number of steps in the proof is polynomial in the length of the description of the problem instance. Polynomial means that the way the length of the shortest proof increases, is bounded above by a function that is a polynomial function of the length of the description of the problem instance. For instance if the length of the proof is never greater than the number of bytes in the problem instance description, to the power of 7, then we say the problem’s non-deterministic computational complexity is O(n^7).

But non-deterministic algorithms are pretty useless to engineers. they don’t tell you which recipe to follow to arrive from question to answer. They are only the proofs of logical deductions, not the reasoning processes that lead to obtaining these proofs.

Deterministic tree traversal

To obtain a proof, one option is depth-first tree traversal: simply try out all possible proofs until you find one that answers the question you wanted to answer. But usually, algorithms will use a combination of trying out options (elements of an ‘OR’ operator in the logical expression that we are rewriting), and heuristics: strategies that are a function from the current collection of knowledge items, to a choice of which step to take next. These heuristics are what we express in computer programs.

Since usually the expression we are rewriting contains a lot of ‘OR’ operators, and there is no way to try out all possible options without ‘visiting’ them, deterministic algorithms often take exponential time in the length of the description of the problem instance. Exponential functions grow a lot faster than polynomial functions, which is why even on fast computers, some computations can take seconds, or hours, or years, depending on how big the logical expression is that the computation rewrites.

The relation between time and logic

Nobody in informatics has so far been able to prove that solving problems like 3-SAT is fundamentally impossible to do in polynomial time. It will be up to the smart theorists to keep trying to come up with a demonstration that P is not NP. This is one of the reasons informatics is not always recognized as a ‘science’ that studies phenomena like information and time in a semantic way. Instead, our field is often categorized a mere result of mathematics rather than a science that studies something: just modeling, in a syntactical way without interpreting the results as part of our real-world surroundings.

That’s why I hope that with this blogpost I’ve been able to share some of my fascination for my field of science. We are Informatics. We study the relation between time and logic.