Learn why automatic data wrangling is so difficult, learn the meaning of cognitive computing, and get a view from the perspective of cognitive ergonomics.
Anyone who is even remotely familiar with the contemporary data science landscape will probably recognize the truth of the following two statements:
Data wrangling is necessary with almost every new project.
Data wrangling is difficult and tedious.
No, no, we don’t want this… you know the drill.
Can’t we expect our super-smart algorithms to automatically infer this type of common knowledge and these expectations among data analysts when given a command such as “build me a multinomial regression model with Y as the criterion and select all meaningful data as predictors; iterate model selection until the best model is selected”? It turns out that what is solved by mere project specification and some bare intuition in your mind (before it starts taking long hours of coding) presents a rather difficult riddle when posed as a computational, algorithmic problem. Why is that so?
Let’s assume that we want to solve the problem by imposing a set of formal constraints upon the eligible data types that can enter the model. In R, continuous predictors would fall under the double type. However, an integer sometimes needs to be treated as continuous in regression; character, factor, and integer columns would do as categorical predictors. In a discrete model, the dependent variable is always categorical.
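The type constraints just described can be written down in a few lines. The following is only a minimal sketch in Python rather than R: the type names mirror the R types mentioned above, while the role labels, the function names, and the example columns are hypothetical and chosen purely for illustration.

```python
# Hypothetical mapping from R-style storage types to the roles a column
# may play in a model. Type names follow the text; role labels are
# illustrative assumptions.
ROLE_BY_TYPE = {
    "double": {"continuous"},
    "integer": {"continuous", "categorical"},  # ambiguous: a count or a code?
    "character": {"categorical"},
    "factor": {"categorical"},
}


def eligible_roles(column_types):
    """Return the roles each column may play, judged by storage type alone."""
    return {
        name: ROLE_BY_TYPE.get(rtype, set())
        for name, rtype in column_types.items()
    }


def can_be_discrete_outcome(rtype):
    """In a discrete model the dependent variable must be categorical."""
    return "categorical" in ROLE_BY_TYPE.get(rtype, set())


# Example columns (made up for this sketch).
columns = {
    "income": "double",
    "age": "integer",
    "region": "factor",
    "customer_id": "integer",
}
roles = eligible_roles(columns)
```

Note that a column such as `customer_id` passes the type check as easily as `age` does, even though it carries no meaning as a predictor; the types alone say nothing about what a variable stands for.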
This is extremely easy to automate, but it would help only for datasets where the variable semantics are already settled. In other words, the problem of letting the algorithm decide which variables make sense as predictors and which do not is what really makes the automation of data wrangling difficult.
Obviously, we would need to build a semantic model: a structured knowledge repository that our data wrangling automation would consult in order to inspect all variable names and descriptions and see which of them match some predefined schema, that is, the schema that defines what is allowable and what is not in building a particular statistical model. Our task would then be to define the binding of all columns from our SQL tables to a set of abstract variables from our semantic model, perform the appropriate selection, and then easily build a statistical model in the desired programming language.
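To make the idea of binding columns to abstract variables concrete, here is a deliberately naive sketch in Python. Everything in it is an assumption made for illustration: the `AbstractVariable` class, the keyword lists, the role names, and the example table are hypothetical, and simple keyword matching stands in for whatever matching procedure a real semantic model would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractVariable:
    name: str        # concept in the (hypothetical) semantic model
    role: str        # "predictor", "criterion", or "excluded"
    keywords: tuple  # cues looked for in column names and descriptions


# A toy semantic model; earlier entries take precedence over later ones.
SEMANTIC_MODEL = [
    AbstractVariable("identifier", "excluded", ("id", "key", "uuid")),
    AbstractVariable("timestamp", "excluded", ("created", "updated", "timestamp")),
    AbstractVariable("outcome", "criterion", ("churn", "outcome", "label")),
    AbstractVariable("demographic", "predictor", ("age", "gender", "region")),
    AbstractVariable("monetary", "predictor", ("income", "price", "revenue")),
]


def bind_column(column_name, description=""):
    """Bind a column to the first abstract variable whose keywords match."""
    text = f"{column_name} {description}".lower()
    tokens = set(re.split(r"[^a-z0-9]+", text))
    for variable in SEMANTIC_MODEL:
        if tokens & set(variable.keywords):
            return variable
    return None  # the semantic model has nothing to say about this column


def select_variables(columns):
    """Group column names by the role of the abstract variable they bind to."""
    selection = {"predictor": [], "criterion": [], "excluded": [], "unbound": []}
    for name, description in columns.items():
        variable = bind_column(name, description)
        selection[variable.role if variable else "unbound"].append(name)
    return selection


# Column names and descriptions as they might come from a table's metadata.
table = {
    "customer_id": "primary key",
    "created_at": "row creation time",
    "age": "age in years",
    "monthly_income": "self-reported income",
    "churn_flag": "1 if the customer left",
    "notes": "free text",
}
selection = select_variables(table)
```

In this toy table, `age` and `monthly_income` end up as predictors, `churn_flag` as the criterion, and `notes` stays unbound. The unbound columns are exactly where the difficulty lies: someone, or something, still has to decide what they mean.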
We can probably solve this kind of data wrangling automation for a more or less wide class of data science projects; but can we solve the general case, one that would do for any given relational database and a wide class of statistical models? We are now well aware of the scope of the problem: its solution would be almost a true artificial intelligence. And that is what takes up to 80% of your daily work routine.
I was motivated to write up this short summary of the data wrangling automation problem a long time ago, maybe because my background as a cognitive psychologist makes me think about similar problems in the cognitive ergonomics of computer programming more often than it sparks the imagination of my colleagues with a background in software engineering and similar fields. But the motivation for this very blog post came as a consequence of reading some recent discussions on how to define cognitive computing properly.
What is cognitive computing? On one hand, we are being told that it is essentially programming computers to perform cognitive operations in a way similar to what minds naturally do. But at least half of the typical data scientist’s toolkit comes from people with a background in the cognitive sciences. These people would be able to list a dozen fundamental research areas that have spawned mathematical models used by data scientists nowadays but that were initially developed in order to understand the workings of the human mind.
On the other hand, sometimes the explanation of cognitive computing seems to be tightly related to aspects of UX/UI design, i.e. cognitive computing means computers being able to react adaptively to our natural language or motor inputs and manage their outputs to match our original intentions. I guess the automation of data wrangling as I have discussed it falls close to this second connotation.
I have sometimes encountered the claim that cognitive computing is not the same as AI because the former is of a probabilistic nature while the latter is not. That is true only if you put an equality sign between AI and the old classic AI research program based on the idea of rule-guided behavior (cognitive psychologists started writing about the “probabilistic turn” in the study of human cognition more than ten years ago, not to mention the study of probabilistic causal networks, which has its roots back in the 80s). The whole contemporary discourse on cognitive computing is obviously motivated by some recent developments that have created the need to redefine the meaning of the term, but the redefinition itself seems to be taking too much time and to be struggling with fine-grained distinctions from similar terms; it seems to be so edgy that even rumors that cognitive computing is just another marketing hype have started appearing.
One take-home message is that cognitive computing is certainly not marketing hype in itself; as I have tried to illustrate above with the example of the data wrangling automation problem, the problems it may address are real, and many would benefit from their solution. A realistic research and development program (semantic modeling plus you-name-it-probabilistic-learning-approach) is available to address more or less wide classes of typical problems of a similar type, and the application of such programs is well under way. The most likely source of the uncertainty in the discussion of what cognitive computing is and what it is not is the natural relation of this term to the possibility of obtaining general solutions for wide classes of problems similar to my illustration.
We should maybe start making a distinction between cognitive computing in a general and in a narrow sense: the former addressing the typical fundamental questions of AI research (irrespective of whether the specific approach under discussion is deterministic or probabilistic), and the latter reserved for cognitive applications that solve a constrained class of problems that prevent the user from interacting with the computer in a cognitively ergonomic way.
A final remark, to keep us in line with the nature of the problem exemplified at the beginning of this post: in the same way that we have tried to discover why data wrangling is difficult, we could ask and try to understand why coding in general is hard. Every cognitive psychologist can testify that the human mind does not exhibit much of a preference for abstraction. In our everyday lives, we rarely organize our thinking around general categories and abstract concepts.