The Plight of the Data Scientist

Eduard H. van Kleef, Ph.D.
7 min read · Feb 20, 2022


Photo by Annie Spratt on Unsplash

It was Monday morning. Louise was looking at her monitor. For the past several months she had been working on her Machine Learning (ML) model. After much hard work, and even sacrificing part of the past weekend, it now looked as if she was ready to present results. Her boss had asked her to use ML to help the sales team find the most promising customer enquiries. That had been six months ago.

Photo by charlesdeluvio on Unsplash

She had originally contacted the sales team to find out what sales-related data they had that might be useful for such a model. She had not exactly encountered a lot of enthusiasm for her request. They told her this was very sensitive data, all the more so since the introduction of the GDPR privacy rules. They didn’t even want to discuss with her what kind of data they had. Only after she had fed the situation back to her boss, and he had met with the head of sales to remind the latter that they had actually discussed this, did the shop floor agree to give her anonymized data on a USB stick.

They probably would have exploded if they had found out she was analyzing it from home, on the Windows desktop machine the company had made available to her but over a private internet connection. “Real” IT people were rumoured to use only the Linux operating system, never Windows. Louise didn’t care. She had never used Linux, and Python and many other tools worked fine under Windows. Working from home also meant that she had no problems with firewalls and the like when installing Python updates on her computer, although getting administrator rights for it had been another battle some time ago.

Anyway, the reports that she wrote for her boss and higher management were expected to be created with the MS Office suite, and she had learned to use that under Windows.

Photo by Lukas Blazek on Unsplash

Now the model was finally somewhat ready. Actually, in her experience, a model was never ready. But it was good enough to present the results to her “stakeholders”, the people with a business interest in what she was about to present.

She had carefully “wrangled” and cleaned the data. She had considered the boundary between genuine data and outliers. She had split the data randomly into training, test and holdout sets. She had tried a number of models and hyperparameters. She had visualized inputs, intermediate results and outputs where useful. And she had carefully crafted a PowerPoint storyline that underlined the insights she had gathered from the model. Now she was going to present it all back to the people who had originally been so hostile, to show them how it would actually help them do their jobs better and increase sales.
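
As a rough illustration, the random split into training, test and holdout sets might look like the sketch below, assuming scikit-learn and pandas. The file name, target column and model choice are purely illustrative, not Louise’s actual setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative input: anonymized customer enquiries with a binary
# "converted" label (did the enquiry lead to a sale?).
df = pd.read_csv("enquiries_anonymized.csv")
X, y = df.drop(columns=["converted"]), df["converted"]

# First carve off a 20% holdout set, then split the remainder 75/25,
# giving a 60/20/20 train/test/holdout split overall.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# One of several models and hyperparameter settings one might try.
model = RandomForestClassifier(n_estimators=200, max_depth=8)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```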

Photo by LinkedIn Sales Solutions on Unsplash

After the remote presentation was over, she leaned back in her chair, not really knowing what to think or how to feel. After an introduction by her boss and the head of sales, it had gone really well. She had shown some preliminary results, which had piqued interest. After that she had given a high-level introduction to the data processing and to the methodology by which such results were derived. Although she had seen some blank stares in the audience, several people seemed to have grasped the concept and were actually quite enthusiastic now.

The trouble had started after that. The head of sales had asked what steps would be needed to get this “tool” onto all her staff’s desktops. Louise didn’t think that was a fair question: she was a data scientist, not an automation specialist. Her boss had answered for her that they would need to look into it and would get back to sales with an answer. Afterwards he had told her to investigate and draft a preliminary proposal.

Photo by Icons8 Team on Unsplash

A tale of three pipelines

Louise had thought about her boss’s request for a while. She had done some internet research and set up a meeting with some of the salespeople who were to be the beneficiaries of her “tool”, to do some “requirements gathering”. As she had already feared, these people were not about to install Python, start it from a command line and enter parameters through the keyboard. They wanted a “proper application” with a Graphical User Interface (GUI).

After further investigation and discussions, she came to the conclusion that creating a proper application in a corporate environment would actually involve virtually the same work as creating an app for the outside world:

  • Creating the GUI, or the “frontend”
  • Creating the core application, including the webserver and the (connection to the) ML model, or the “backend” (a minimal sketch of such a backend follows this list)
  • Connecting the backend with a number of data sources, such as a simple user database (so salespeople would need to log in before handling sensitive customer data) and a connection to the sales system.
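
A minimal sketch of what such a backend might look like, assuming Flask and a model saved with joblib. The endpoint, file name and JSON format are assumptions for illustration, not a finished design:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("enquiry_model.joblib")  # hypothetical trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[...], [...]]},
    # one feature vector per customer enquiry.
    features = request.get_json()["features"]
    # Probability of the positive class, assuming a binary classifier.
    scores = model.predict_proba(features)[:, 1]
    return jsonify({"scores": scores.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```

A real deployment would of course also add the user database and login step mentioned above.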

Thinking about the connection to the sales system, she realised it would actually be nice to have such a connection herself, so she wouldn’t have to transport her training data on a USB stick the next time she needed to retrain the model.

Photo by Marvin Meyer on Unsplash

Creating the application was well beyond her abilities, so she requested and obtained a budget to have it developed by the in-house development team. Working closely with these people, she noticed that they had a very professional way of working, which made them very efficient as a team:

  • Work was done according to “agile” principles, which meant that Louise (and some salespeople whom she had wisely co-invited) could view the progress of the application on a regular basis and give feedback on what they liked and didn’t like. Also, the individual tasks were distributed in a very transparent way, which meant that everyone could see what everyone else was working on, and developers could select the tasks to which they felt most suited.
  • They used a common “code versioning” system, which meant that they could work on different parts of the code at the same time without sabotaging each other’s work. Also, should a change in the code prove faulty or unwanted, they could always go back to an older (working) version and start again from there.
  • Once they filed (“committed”) some new code, some magic happened: the code was immediately and automatically tested, built, integrated into the larger project, tested again and finally deployed to a test server, from which the progress of the application could be inspected. This practice was apparently known as “continuous integration / continuous deployment or delivery” (CI/CD).

The whole thing was known as the “DevOps pipeline”. So far, Louise had always thought of the preparatory data processing for her models as a “pipeline”; apparently, there was more than one sort of pipeline in computing. She also realised that there were many aspects of the DevOps pipeline she could gainfully employ in the development of ML models. Versioning code, and even data, could be useful if she ever needed to eliminate a recently introduced bug or demonstrate to auditors how a model had worked at a certain point in time. Automatic tests certainly sounded nice…
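
The kind of automatic test Louise was imagining might look roughly like this pytest sketch, where clean_enquiries is an illustrative stand-in for one of her real data-cleaning steps:

```python
import pandas as pd

def clean_enquiries(raw: pd.DataFrame) -> pd.DataFrame:
    # Illustrative cleaning step: drop enquiries with missing revenue.
    return raw.dropna(subset=["revenue"])

def test_clean_enquiries_drops_rows_with_missing_revenue():
    raw = pd.DataFrame({"revenue": [100.0, None], "country": ["DE", "FR"]})
    cleaned = clean_enquiries(raw)
    assert len(cleaned) == 1
    assert cleaned["revenue"].notna().all()
```

In a CI/CD pipeline, a test runner such as pytest would execute tests like this automatically on every commit.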

Photo by Mike Benna on Unsplash

Unbeknownst to her, she was about to get acquainted with yet another sort of pipeline. She met up with Sylvia, an old friend and colleague she hadn’t seen for ages, who told her that she now prepared data-analytical reports for management. Sylvia’s work was in many ways similar to her own: data analysts took data from the databases of company operations (again the sales database, but also finance, supply chain and so on) and condensed it into reports that gave management an overview of the business situation in near real time. To make that possible, data was taken overnight from the operational systems and, after some processing, stored in a Data Warehouse (DWH), a process known as “Extract, Transform, Load” (ETL). From there it could be displayed through further online systems, known as “Business Intelligence” (BI) systems. Louise realised that with a direct connection between her model and a relevant DWH she could retrain her model at any time and, moreover, immediately apply it to make predictions…
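
As a very rough sketch, one nightly ETL step might look like the following in Python, assuming pandas and SQLAlchemy; all connection strings, table names and columns are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connections to the operational system and the warehouse.
ops = create_engine("postgresql://user:password@sales-db/sales")
dwh = create_engine("postgresql://user:password@dwh-host/warehouse")

# Extract: yesterday's orders from the operational sales system.
orders = pd.read_sql(
    "SELECT * FROM orders WHERE order_date >= CURRENT_DATE - 1", ops)

# Transform: condense to one summary row per region.
summary = orders.groupby("region", as_index=False)["amount"].sum()

# Load: append to the warehouse table that the BI system reads.
summary.to_sql("daily_sales_by_region", dwh, if_exists="append", index=False)
```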

Conclusion

Louise’s situation is not uncommon among data scientists. Creating a superb ML model is no longer enough: models need to generate a return on the investment made in them, and therefore need to be deployed so that others can use them. In recent years, many data scientists have already learnt the range of skills described above. All of these skills depend on cooperation with people in adjacent fields to be effective. Many of those people (data engineers, analysts, developers and so on) themselves see only part of the whole, and should be interested in obtaining a better overview of it and of their work’s place in it. It will therefore be profitable for all of them to think outside their respective boxes and learn about the work going on in neighbouring departments.

If you liked this article, please don’t forget to leave some claps, a comment and/or a follow. If you would like me to continue Louise’s story, please say so in a comment. Finally, I have published a more business-oriented piece on the subject:

Before investing in an Artificial Intelligence project, think of these Business Considerations.
