Automation in Data Science
In my work as a data scientist, I have noticed that many tasks that used to be difficult keep getting easier because of automation.
For example, AutoML promises to automate the entire model-building process.
While that is amazing, the work of a data scientist is much more than just implementing a machine learning model.
As it turns out, the aspects of data science that sound the sexiest will be the first to be automated, and the ones that are the hardest to automate are the ones you would least expect.
Overview
When people talk about data science, most of them focus on AI and machine learning. But data scientists actually spend most of their time on very different kinds of work.
This article will attempt to list all the types of work that a good data scientist should be able to do. For each of them, I will investigate how well it can be automated. Where appropriate, I will list some tools that can help with automation.
If you think there is a (non-commercial) product that works well to automate something, or if you think I missed an important aspect of data science work, then just send me a message and I will include it in the list.
1. Talking to the client
Status of automation: Impossible
The first and arguably most important step in the work of a data scientist is talking to the client.
This does not just mean "ask the client what the problem is". There is a vast gulf in understanding between a businessperson and a data scientist.
The client generally does not know what the data scientist can do, and the data scientist generally does not know what the client wants.
It is extremely important to be able to bridge this gap in understanding.
Many companies actually employ managers to act as go-betweens for data scientists and clients. This is better than letting a purely technical data scientist try to figure out the client's needs. But a data scientist who understands the business context of the client is much better still, because it cuts out the middleman and reduces the risk that something important will get lost in translation.
Talking to the client will not improve the performance of the machine learning model (and just like that, about half the technical readership of this article has lost interest). Instead, understanding the client's pain points ensures that you work on building the right kind of model in the first place. The most accurate and best-performing model in the world is useless if the client can't actually use it to drive their profits.
A one-hour discussion with the client can completely redefine the project, and increase the project's monetary worth tenfold.
I will never forget the look on the face of one of my clients when I told him that I could easily extend my model to break down the data by a dozen different categories, analyze all of them, and then only report on the ones that had an anomaly. It turned out that the client had been performing this exact task manually for years, and it was costing him an enormous amount of time. He didn't ask us to automate this, because he didn't know it was possible at all until I brought it up. Without that discussion, we would have wasted months of work on less profitable tasks.
On a large scale, talking to clients ensures that you build the right kind of model, and is absolutely critical to ensure that the project leads to profit in the end.
On a smaller scale, talking to clients also has some immediate benefits that are no less important. For example, picking the most useful metric to train your model on. Many data scientists don't think about this at all and simply go with Accuracy, or L2-loss, or whatever they were taught to use in university. A five-minute discussion with the client might show that their profit actually stems only from the top-5 results, or something like that. If you don't account for that by altering the metric you use, you optimize your model in the wrong direction.
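To make that concrete, here is a minimal sketch of how such a client insight might translate into a custom metric. The scenario and all names are hypothetical; the point is only that a few lines of code can encode what the client actually cares about:

```python
import numpy as np

def revenue_in_top_k(true_revenue, model_scores, k=5):
    """Hypothetical client-driven metric: how much revenue the model's
    top-k recommendations capture, instead of overall accuracy."""
    true_revenue = np.asarray(true_revenue)
    top_k = np.argsort(model_scores)[::-1][:k]  # indices of the k highest scores
    return true_revenue[top_k].sum()

# A model with worse overall accuracy can still win on this metric
# if it ranks the few high-revenue items correctly.
```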
Now for the terrible news:
Virtually none of this can be effectively automated.
Talking to people is AI-complete. If anyone can actually figure out how to automate this task, then the robot rebellion will be just a few days behind.
In other words, the least technical part of a data scientist's work is the one that is the hardest to automate.
In fact, this is an area where technological progress is actually likely to make things harder: working from home and communicating only via text messages makes it much less likely that you will discover what the customer really wants.
The single most useful thing I have seen to help solve this problem was not a program, or a company policy, but simply a coffee machine placed in a central location of the office. Clients and data scientists met each other occasionally for a coffee, and that led to better goal alignment than anything else we did.
2. Data preparation and data cleaning
Status of automation: Partially automated. Too subjective, and too many edge cases to automate completely.
Data scientists spend about 80% of their time on data preparation and cleaning.
This involves:
Getting access to the data.
Putting the data in the right format.
Connecting the data.
Identifying and fixing errors and/or anomalies in the data.
All of this is very time-consuming, and extraordinarily boring.
How well can this be automated?
It depends.
Some parts of this cannot be automated at all, because they rely on client interactions. Sometimes, cleaning a table with erroneous data requires calling the data owner and asking for clarification.
But much of it can be automated.
Most tasks in data preparation and cleaning can be solved by applying a small number of simple heuristics to the data until everything is fixed.
Example heuristics are:
If a table has dates, then check if their distribution is noticeably influenced by weekends, holidays, or other regular events that may be relevant.
If a table has categorical columns that were entered by hand, then check if they contain any typos and correct them.
If a column is numerical, check if it ever takes on a value that falls outside a sensible range.
These heuristics tend to be very simple, but the problem is that there are thousands of them, and most are highly subjective.
To my knowledge, there is no existing comprehensive list of all heuristics one should check, nor is there a comprehensive method to suggest how to parameterize the subjective heuristics in a useful way.
Instead of having a checklist to work through, data scientists are expected to look at the situation and come up with the most relevant heuristics to test on their own, based on experience and intuition.
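To give a flavor of what such heuristics look like in code, here is a minimal sketch. The thresholds and checks are illustrative assumptions, not established defaults; in practice, each one is a judgment call:

```python
import pandas as pd

def run_basic_checks(df: pd.DataFrame) -> None:
    """Report suspicious patterns instead of silently fixing them."""
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_datetime64_any_dtype(series):
            # Heuristic: is the date distribution skewed by weekends?
            weekend_share = series.dt.dayofweek.isin([5, 6]).mean()
            print(f"{col}: {weekend_share:.0%} of dates fall on weekends")
        elif pd.api.types.is_numeric_dtype(series):
            # Heuristic: flag values far outside the bulk of the data.
            low, high = series.quantile([0.01, 0.99])
            n_extreme = ((series < low) | (series > high)).sum()
            print(f"{col}: {n_extreme} values outside the 1%-99% range")
        else:
            # Heuristic: near-duplicate categories often indicate typos.
            n_raw = series.nunique()
            n_norm = series.dropna().astype(str).str.strip().str.lower().nunique()
            if n_norm < n_raw:
                print(f"{col}: {n_raw - n_norm} categories differ only in case/whitespace")
```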
To make matters worse, some artifacts in the data are specific to the client. For example, an item in the database may have been reclassified a year ago, but older entries still use the old classification system. You can only recognize this problem if you understand the business context and talk to the client about it. This cannot be fully automated. The best you can do in such a case is to automatically point out anomalies in the data, to save the data scientist time in running the initial tests.
To summarize:
It is possible to automate data preparation and data cleaning in principle. Many simpler things have already been automated, or at least made very easy, by popular libraries.
There are thousands of edge cases that you can't effectively cover with a single automated system.
There are some tasks that cannot be automated effectively, because they rely on interactions with clients.
Nevertheless, there is still low-hanging fruit here. I expect we will see progress in this area in the coming years.
Tools for automating data cleansing
There are many, many small tools and libraries that can be used to clean your data. Each of them usually only saves you a little bit of time, but it does add up. You can find a tool for most common types of data cleaning by just googling for it.
You can also find curated lists of data cleaning programs online, which give a good overview of what kinds of cleaning tasks have already been solved.
Be careful when using tools like this. It's easy to accidentally damage your data if you just run automated cleaning tools naively and rely on the defaults. What may be a mistake that needs cleaning in one dataset may be a deliberate choice in another.
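As a minimal sketch of the safer "flag, don't silently fix" approach (the file and column names here are made up for illustration):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# What counts as "bad" is dataset-specific: an age of 0 may be an entry
# error in one table and a legitimate newborn in another. So flag first.
suspicious = df[(df["age"] <= 0) | (df["age"] > 120)]  # assumed column name
print(f"{len(suspicious)} rows flagged for manual review")
suspicious.to_csv("flagged_for_review.csv", index=False)

# Only after reviewing the flagged rows (or asking the data owner)
# do you actually change anything:
# df.loc[suspicious.index, "age"] = float("nan")
```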
There is also elody.com. It is still in its infancy, but it should become very useful for data cleaning as it grows. The website is free to use. (Note: Elody is my own website. I built it specifically to help automate data science. It is not very powerful yet, but will hopefully grow larger with time.)
Elody is like GitHub, except that it lets you execute programs rather than just store them. It is also like Wikipedia, in that anyone can contribute and the programs can be linked to each other. Developers can upload programs and connect them to each other through formal rules. End users can run the uploaded programs directly on the website without the need to download or install anything.
The end result is that you can start a scenario with just one click, and Elody will automatically apply all the data cleaning programs that make sense for your data, using earlier results to decide what to try next. Once the website is mature, this crowdsourced approach will be able to handle the core problem of automating data cleansing: there are a lot of edge cases, and picking the right ones is highly subjective.
3. Data exploration and feature engineering
Status of automation: Partially automated, promising future
After the data has been prepared and cleaned, it is finally time to start with the interesting parts of data science.
Data exploration and feature engineering are all about examining the data under a microscope and extracting insight from it.
This phase has two aspects:
Data exploration is about understanding the data.
Feature engineering is about understanding the problem, and relating the data to it.
Data exploration can be automated very well. Just a small number of basic visualizations are enough to quickly get a good overview of most types of data. There are many tools that are designed to help us do so.
However, this still requires a data scientist to spend time looking at the graphics. Just as with data preparation, there are a thousand different kinds of anomalies that you should look for.
There is no system (yet) that tries to summarize them all. While we can automate the creation of graphics that make data exploration easier, the actual exploration itself still needs to be done by a data scientist.
Note that some of your findings during data exploration may require you to talk to the client for clarification, and may even force you to go back to data preparation and cleansing. You might, for example, notice that a data distribution changes abruptly on a specific date. When you ask the client about it, they explain that it's because they changed the way the data was generated on that day, and forgot to tell you. Suddenly you have to go back and adapt all your data to take this into account.
Situations like these can't be handled automatically, but an automated system could still help you to identify these anomalies more quickly.
Finally, all of the insights you gathered during data exploration need to be combined with your understanding of the task to perform effective feature engineering.
The surprising thing here is that a lot of this can actually be automated, because of two simple observations:
Most, if not all, features that are usually created during feature engineering are simply the result of applying basic functions to existing features.
It is possible to measure how useful a feature is and discard those features that prove useless, which keeps the data small enough to handle.
As a result, it is actually possible to automate feature engineering almost completely.
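The following sketch shows the two observations in miniature: brute-force feature generation by combining numeric columns, followed by automatic selection with a model-based importance measure. The input dataframe and target are assumed to exist, and the generated combinations are toy examples:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def generate_candidate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic functions (here: products and differences) to all
    pairs of numeric columns to produce candidate features."""
    numeric = df.select_dtypes("number")
    candidates = numeric.copy()
    for a in numeric.columns:
        for b in numeric.columns:
            if a < b:
                candidates[f"{a}_times_{b}"] = numeric[a] * numeric[b]
                candidates[f"{a}_minus_{b}"] = numeric[a] - numeric[b]
    return candidates

X = generate_candidate_features(features_df)  # assumed input dataframe
selector = SelectFromModel(RandomForestClassifier(n_estimators=100))
selector.fit(X, target)                       # assumed target column
X_useful = X.loc[:, selector.get_support()]   # discard the useless features
```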
So much for the theory, anyway.
I have not heard of an example where automated feature engineering outperformed manual feature engineering by experts. It looks like, for the moment at least, expert knowledge still has these automated algorithms beat.
(If you are reading this and know of an example where automated feature engineering did succeed in beating experts, please let me know.)
Despite its current shortcomings, I expect that automated feature engineering will soon become competitive.
Note that automating feature engineering removes half the reason for data exploration, but data exploration remains important regardless:
Data exploration is still useful for other tasks, like finding anomalies in the data. Finding these can sometimes be extremely valuable.
Eight times out of ten, an anomaly will be irrelevant. The ninth time, it will mean that your entire dataset is flawed and you dodged a bullet by realizing it early. The tenth time, it will spark an amazing new insight that starts an entire spin-off use case to solve a completely different problem and make five times as much money as the original project.
Tools for automating data exploration and feature engineering
For data exploration, if you are using Python, use numpy and pandas for the analysis, and matplotlib, seaborn, and/or plotly for the visualizations.
These packages allow you to filter, group and visualize pretty much anything you want with ease. The only problem is that you still need to manually write what you want.
Using lens saves you a bit more time by automating all the basic exploration that you should always do.
Note that this will not automate everything. There are still a lot of common cases that you need to remember to check on your own, because they are not easy to read from the default visualizations. For example: Are dates affected by weekdays, holidays, etc.? If you have location data, in which countries or cities do the locations lie? If you have multiple tables, how are they connected?
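A minimal sketch of what this kind of manual check looks like (file and column names are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical file

# The basics that automated exploration already covers:
print(df.describe(include="all"))      # value ranges and cardinalities
print(df.isna().mean().sort_values())  # share of missing values per column

# One of the easy-to-miss checks: are dates affected by weekdays?
df["order_date"].dt.day_name().value_counts().plot(kind="bar")
plt.show()
```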
For feature engineering, use featuretools to automatically generate new features.
Be aware that you should be careful when using featuretools. It is great for generating a vast number of features from your data, but it still has its flaws:
It obviously doesn't understand the data, and so will not be able to create features that are obvious to humans but require an unusual combination of functions to produce algorithmically. For example, it is easily able to generate a feature like "How many movies the customer watched in total" from raw data, but something intricate like "How often the customer watched a movie within the first week of its release" gets harder to generate. You can tell featuretools to go deeper and create more features, but then you will run into the second problem:
Featuretools can easily create a huge number of features. This can lead to overfitting, so you should investigate the trained model and check if it makes sense. You can use shap to make this easy.
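Here is a rough sketch of what using featuretools looks like, based on the movie example above. I am assuming the pre-1.0 entity/relationship API here, and all dataframes and column names are hypothetical:

```python
import featuretools as ft

# Two tables: one row per customer, and one row per watched movie.
es = ft.EntitySet(id="movies")
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers_df,
                              index="customer_id")
es = es.entity_from_dataframe(entity_id="watches", dataframe=watches_df,
                              index="watch_id", time_index="watch_time")
es = es.add_relationship(ft.Relationship(es["customers"]["customer_id"],
                                         es["watches"]["customer_id"]))

# Deep feature synthesis stacks primitives (count, mean, time since, ...)
# to produce features like "COUNT(watches)" per customer. Raising max_depth
# creates more intricate features, and a lot more of them.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="customers",
                                      max_depth=2)
```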
My own aforementioned project elody.com will also be able to handle data exploration and feature engineering once it is mature.
Elody allows developers to automate the entire pipeline from data cleansing, to data exploration, to feature engineering, to model building, and visualization. Combining formal rules with user feedback means that Elody will always automatically use the most appropriate and trustworthy program to solve any given problem.
For example, this scenario performs a generic data exploration, but if any geographic data is found, either in the form of a latitude/longitude pair or because a column contains city names, it will visualize it on Google Maps. As more developers add features, Elody will become ever more useful.
4. Modelling
Status of automation: Fully automated, and competitive with expert data scientists
Model building is the core part of the work of a data scientist.
This is where we take all our precious data and turn it into results, which our clients then turn into profit.
This is the part that data scientists spend the most time learning about in university, so naturally this is also the first part of data science that got completely automated.
Model building is an extremely difficult process, with lots of different parameters. However, it is also a very rigid process, and all parameters are mathematically well defined and independent of real-world fuzziness. This makes it possible to automate model building much more effectively than any other aspect of data science.
Modelling can be further split into different subtasks.
Model construction
Model validation
Hyperparameter optimization
(You could arguably also count feature engineering as a part of modelling. Some people do, but I choose to treat it as a separate part because feature engineering still has some non-mathematical aspects to it that require understanding the business context, while modelling is purely a math problem.)
Libraries that automate specific parts of these subtasks have existed for years, but it still took expertise to use them properly.
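For example, hyperparameter optimization and model validation have long been easy to automate in scikit-learn. A minimal sketch, assuming X and y are an already-prepared feature matrix and target:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 30],
}

# GridSearchCV automates two subtasks at once: it tries every parameter
# combination (hyperparameter optimization) and evaluates each one with
# 5-fold cross-validation (model validation).
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X, y)  # X, y assumed to be defined elsewhere
print(search.best_params_, search.best_score_)
```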
Now, there are libraries that can automate the entire modelling process. These libraries just take your data as it is and build a complete model from it in one go, with no human interaction needed.
In May 2019, Google's AutoML finished in second place in a KaggleDays Hackathon. Many competitors were at the master or grandmaster level according to Kaggle's progression system.
A completely automated system beat several human experts who had gathered in a competition specifically to prove their skills. If this system can place that highly in such a competition, how much better will it perform compared to the average data scientist, who doesn't participate in competitions at all?
Admittedly, the system is not perfect yet. The competition lasted only 8.5 hours. That is more than enough for an automated system to run, but for the human competitors it was barely enough to get started. It is unlikely that AutoML would have performed as well if the competition lasted longer.
But automated model building tools are being improved every day. It is only a matter of time until they win Kaggle competitions with regularity.
I fully expect that data scientists won't need to build their own models anymore within 10 years' time.
Tools for automating model building
Automating parts of the modelling process
This article lists tools for automating different parts of the modelling process.
Even if you don't want to automate your entire modelling process, I would highly recommend taking a look at these tools. They can save you a lot of time.
Note that this article treats feature engineering as a part of modelling, so it also lists some libraries that can make feature engineering easier for you.
Automating the entire modelling process
Running Google's AutoML costs money.
The open-source community has created Auto-Keras as a free alternative.
Both Auto-Keras and Google's AutoML use neural networks as the basis of the models they build.
There are also other automated machine learning libraries that use different types of models:
This article compares the performance of four automated machine learning libraries: auto_ml, auto-sklearn, TPOT, and H2O's AutoML solution.
The results: auto-sklearn performs best on classification datasets, while TPOT performs best on regression datasets.
However, note that this article was published more than half a year ago and doesn't include AutoML or Auto-Keras. All of these libraries are under continuous development. By the time you are done researching which of them is best, half of them will probably have had a major update again.
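To show how little code the fully automated route requires, here is a minimal TPOT sketch. X and y are assumed to be a cleaned feature matrix and target, and the parameters are illustrative:

```python
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# X, y assumed to be defined elsewhere.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# TPOT searches over whole pipelines (preprocessing, model, hyperparameters)
# using genetic programming. More generations = better results, more time.
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# The winning pipeline is exported as plain scikit-learn code.
tpot.export("best_pipeline.py")
```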
5. Presentation for clients and documentation
Status of automation: Impossible
Once the modelling is done and we have our results, we need to present those results to our clients. This is a science all of its own.
As I mentioned before, data scientists and their clients speak completely different languages. What may be obvious to one, may be confusing to the other.
You need to take all that data you spent months working with and condense it into a single graphic that can be understood by the type of person who needs to look at the keyboard while typing. That is no small feat.
It's easy to get clients to nod along to your results, but if you want them to actually understand those results, you need to be really clear, and your documentation must be both concise and comprehensive.
This is a small task, but very important.
Even the best project becomes completely worthless if the client does not understand how it is supposed to be used. Nothing is worse than solving a problem perfectly, only to see it thrown out because its intended users don't understand how to use it.
Just like the initial client interaction that is needed to clarify requirements at the start of the project, this task is impossible to automate.
6. Deployment to production
Status of automation: Depends on the company. May be completely automated, or may be impossible.
Some data science projects are only about building prototypes.
Other projects require us to turn the finished project into a completely automated end-to-end solution that can be deployed to production.
Whether or not this can be automated is largely out of the data scientist's hands.
It depends almost entirely on the technology stack used by the client, and the requirements they set.
I have had a project where I could literally just wrap my project in a script, and it was accepted for production. But I have also had a project where the entire codebase needed to be rewritten because the manager in charge wanted to use another programming language and failed to tell us about this in advance.
Smaller companies will likely just take your code as it is, because they have no guidelines for deployment for production yet.
Larger companies that are new to data science will make you jump through hoops to get your code deployed.
Larger companies that have worked with data scientists enough will have set up a system (cloud, data warehouse, data lake, etc.). In this case, the process is largely automated.
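To illustrate the "just wrap it in a script" end of the spectrum, here is a minimal sketch of a prediction service. The model file, route, and field names are all hypothetical:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained and serialized model (hypothetical file).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # expects a list of numbers
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```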
Summary
To summarize:
Talking to the client, and explaining results to the client, are the two aspects of work that you would least associate with data science, but they are also the hardest ones to automate.
Data preparation and cleaning are partially automated, but are hard to automate completely because there are thousands of special cases that need to be accounted for.
Data exploration and feature engineering are partially automated and it looks like feature engineering will become more heavily automated soon.
Model building is already fully automated and competitive with experts.
Deployment to production can be automated, but it's up to the client to do so.
Conclusion
What can we learn from this?
If you are a data scientist and you want to ensure your skills will remain in demand for the decades to come, then you should not focus on the machine learning aspects (unless you are actually a researcher in that area).
Instead, you should focus on learning how to understand business, so that you can communicate more effectively with clients.
Once the technical aspects of our work are automated, the best way to have an impact is to bridge the gap between the technical and the non-technical.