
Do you recall how many times you’ve read articles titled “This is what a Data Scientist does” or “Differences between a Data Scientist and a Data Analyst”? Such articles usually come with colourful (and sometimes funnily shaped) Venn diagrams that arbitrarily present the overlap between data professions and the distribution of activities (e.g. ML modelling, data storage, data visualisation) across the different roles. That’s usually fine for the average reader who wants a rough, high-level overview, but do we, as data professionals, know in detail what other data roles consist of? Wouldn’t it be great if we could extract all this (and more) information directly from actual data and get more detailed and less biased results?
The present analysis uses the data collected from the 2019 Kaggle ML & DS Survey and attempts to build a profile around 6 key data roles, shedding some light on their activities and preferences and debunking some urban myths.
- Does a Data Engineer spend time doing Computer Vision?
- Is Machine Learning something that a Data Analyst does?
- Which data professions use linear regression more than any other ML algorithm? (spoiler alert: ALL of them)
- Is anyone still reading academic publications or do we all just learn from blogs?
Let’s find out!
Methodology
The analysis focuses on 6 data roles: Data Scientist, Data Analyst, Research Scientist, Business Analyst, Data Engineer and Statistician. A key component of each role’s profile is the data-related activities professionals practice as part of their job. Based on the data provided, 7 key areas are taken into consideration:
- Data Analysis – Analysing and understanding data to influence product or business decisions
- Data Visualisation – Using data visualisation libraries and tools on a regular basis
- Data Infrastructure – Building and/or running the data infrastructure that the company/organisation is using
- Applied Machine Learning – Building or iterating over ML models to improve existing products/workflows or applying ML on new problems
- Machine Learning Research – Doing research that advances the state of the art of ML
- Computer Vision – Using computer vision methods on a regular basis
- Natural Language Processing – Using NLP methods on a regular basis
Information about most data activities can be extracted directly from the answers provided to the question “Select any activities that make up an important part of your role at work”. Some activities, however, are inferred indirectly from answers to related questions. For example, if someone’s answer to the question “Which categories of computer vision methods do you use on a regular basis?” is anything but None, then one can infer that this individual is practicing Computer Vision, at least to some extent.
- It is important to highlight that the subsequent radar charts are not an indication of how skilled people from each profession are. They show the proportion of people from that role practicing each data activity.
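To make the inference rule and the radar chart values more concrete, here is a minimal sketch of how the share of respondents per role and activity could be computed. The field names (`role`, `activities`, `cv_methods`) and the sample responses are hypothetical and do not match the real survey export; the actual analysis may have been built differently.

```python
from collections import defaultdict


def does_cv(answers):
    """Infer Computer Vision practice: any answer other than 'None' counts."""
    return any(answer != "None" for answer in answers)


def activity_shares(responses):
    """Return {role: {activity: share of that role's respondents}}."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for response in responses:
        role = response["role"]
        totals[role] += 1
        # Activities selected directly in the multi-select question.
        for activity in response["activities"]:
            counts[role][activity] += 1
        # Activity inferred indirectly from a related question.
        if does_cv(response["cv_methods"]):
            counts[role]["Computer Vision"] += 1
    return {
        role: {act: n / totals[role] for act, n in counts[role].items()}
        for role in totals
    }


# Made-up responses, for illustration only.
sample = [
    {"role": "Data Engineer",
     "activities": ["Data Infrastructure"],
     "cv_methods": ["None"]},
    {"role": "Data Engineer",
     "activities": ["Data Infrastructure", "Data Analysis"],
     "cv_methods": ["Image classification"]},
]
shares = activity_shares(sample)
print(shares["Data Engineer"])
```

Each value is a proportion of people in the role, not a skill score, which is exactly what the radar charts plot.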
Profiles are additionally complemented by information about:
- Salaries – How much does each profession earn per year (focusing on US salaries).
- Additional data from the 2018 Kaggle ML & DS Survey and the 2017 Kaggle ML & DS Survey are used here.
- Education / Learning – Academic degrees, online learning platforms and media sources.
- Tools and Algorithm Preferences – Algorithms, programming languages and other tools.
Profiles
Data Scientist
- Data Scientists are the highest-paid group. Given that they are very active across all 7 categories, this should be no surprise.
- More than 70% of Data Scientists do applied Machine Learning, but just over 20% claim to do groundbreaking ML research. Another piece of evidence that you don’t need a PhD to join the club.
- Almost 40% build or run Data Infrastructure in their organisations. Is this the best way to use a Data Scientist’s time? Maybe management needs to understand the importance of hiring Data Engineers.
Data Analyst
- Yes, quite a few Data Analysts actually do Machine Learning.
- Local development environments and programming languages seem to have been established as the main analysis tools, replacing spreadsheets.
Research Scientist
- Obviously, the most academically inclined group: 58% have a PhD and it’s the only group to have Journal Publications among its top 3 media sources for Data Science topics.
- Also, the only group where MATLAB makes it into the top 3 programming languages.
Business Analyst
- Just when I thought we were done with using spreadsheets as the main data analysis tool…
- Business Analysts, as the title implies, are usually closely involved in other, more business-related activities which are not being captured here, hence the relatively small polygon area.
Data Engineer
- Who knew that 20% of Data Engineers are involved in NLP and computer vision activities? And almost 50% of them work on Machine Learning applications too!
- This is the second-highest-paid group, after the Data Scientists.
Statistician
- Statisticians love R. This is the only group where R is the dominant programming language.
- It is also the only group that uses statistical software like SPSS and SAS as one of its main data analysis tools.
- Statisticians have the second-highest share of PhDs. Interestingly though, academic journals are not amongst their top 3 media sources for Data Science.
Profile Similarity
Observing the radar charts of data activities for the different roles, we can already form a mental image of how similar these roles are in reality. However, to acquire a more holistic view, let’s plot everything on a 2-D “map”. We currently have 7 dimensions (the 7 data activities), so we will use PCA to reduce them down to 2. PCA’s top 2 components explain 91.7% of the variance observed in the data, so we have managed to retain quite a lot of information:
- Data Scientists and Business Analysts are the two most dissimilar professions.
- The Data Scientist profession is actually much more similar to the Data Engineer than it is to the Data Analyst.
- Data Analysts, Business Analysts and Statisticians form a little cluster, meaning that their jobs, in terms of data-related activities, have many similarities.
- Data Scientists and Research Scientists seem to have the most “unique” activity patterns, meaning that their individual professions are quite different to all the rest.
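For readers who want to see what the reduction from 7 activity dimensions to a 2-D map involves, here is a rough, dependency-free sketch of PCA. It is not the code behind the chart: it assumes one row of activity shares per role, uses power iteration with deflation instead of a library routine, and the role labels and numbers are invented purely for illustration.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pca_2d(rows, iters=500):
    """Project rows onto their top two principal components.

    Returns (projections, explained_variance_ratios).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the activity columns (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))

    components, eigenvalues = [], []
    for _component in range(2):
        # Fixed, non-uniform start vector so the sketch is deterministic.
        start = [float(j + 1) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in start))
        v = [x / norm for x in start]
        for _step in range(iters):
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: variance captured along direction v.
        lam = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        eigenvalues.append(lam)
        # Deflate: remove the found direction before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    projections = [[sum(a * b for a, b in zip(c, comp)) for comp in components]
                   for c in centred]
    ratios = [lam / total_var for lam in eigenvalues]
    return projections, ratios


# Invented shares for four placeholder roles across 7 activities.
profiles = {
    "Role A": [0.9, 0.8, 0.4, 0.7, 0.2, 0.2, 0.2],
    "Role B": [0.8, 0.7, 0.2, 0.3, 0.1, 0.1, 0.1],
    "Role C": [0.6, 0.5, 0.6, 0.5, 0.1, 0.2, 0.2],
    "Role D": [0.7, 0.6, 0.1, 0.1, 0.0, 0.0, 0.0],
}
points, ratios = pca_2d(list(profiles.values()))
for name, (x, y) in zip(profiles, points):
    print(f"{name}: ({x:.3f}, {y:.3f})")
print("explained variance ratios:", [round(r, 3) for r in ratios])
```

The two ratios add up to the share of variance retained by the 2-D map, which is the kind of figure quoted above. In practice a library implementation would be the more robust choice.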
Salary Comparison
It was previously mentioned that, on average, Data Scientist is the highest-paid data profession. A more comprehensive comparison across all roles is presented below.
Of course, not knowing the exact internal distributions of seniority level and experience within each role, we have to take the below with a pinch of salt, but it still makes for an interesting high-level comparison.
As salary responses are given as a range, each individual’s salary is assumed to be the midpoint of that range.
Cases where the response to the annual salary question was “> $500,000” have been omitted, as the actual figure could be anything from 500K to a couple of million.
- Statistician – Average salary: $111K – Median salary: $95K
- Research Scientist – Average salary: $117K – Median salary: $112K
- Data Scientist – Average salary: $137K – Median salary: $137K
- Data Engineer – Average salary: $125K – Median salary: $112K
- Data Analyst – Average salary: $86K – Median salary: $85K
- Business Analyst – Average salary: $86K – Median salary: $85K
- Data Scientists are the highest paid professionals, followed by Data Engineers and Research Scientists.
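The salary handling described above (range midpoints, open-ended bracket dropped) can be sketched as follows. The bracket format `"100,000-124,999"` and the sample responses are assumptions made for illustration, not a faithful copy of the survey data.

```python
from collections import defaultdict
from statistics import mean, median


def salary_midpoint(bracket):
    """Map a bracket such as '100,000-124,999' to its midpoint.

    Returns None for the open-ended '> $500,000' bracket, which is omitted.
    """
    bracket = bracket.strip()
    if bracket.startswith(">"):
        return None
    low, high = (
        float(part.replace(",", "").replace("$", ""))
        for part in bracket.split("-")
    )
    return (low + high) / 2


def summarise(responses):
    """Return {role: (average salary, median salary)} from (role, bracket) pairs."""
    by_role = defaultdict(list)
    for role, bracket in responses:
        value = salary_midpoint(bracket)
        if value is not None:
            by_role[role].append(value)
    return {role: (mean(vals), median(vals)) for role, vals in by_role.items()}


# Made-up responses, for illustration only.
sample = [
    ("Statistician", "100,000-124,999"),
    ("Statistician", "80,000-89,999"),
    ("Statistician", "> $500,000"),
]
summary = summarise(sample)
print(summary)
```

Reporting the median next to the average is useful here because a handful of very high brackets can pull the average up noticeably.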
Thoughts / Discussion
Data science is a team sport. We work together, we learn from each other’s blogs and Kaggle kernels and we attend each other’s talks at meetups and conferences. I strongly believe that by raising awareness of what other data professionals do, we can collaborate better, as well as build more efficient and comprehensive data teams.
I hope you enjoyed reading through this analysis as much as I enjoyed building it. I also hope that you learnt something you didn’t know or reconsidered something you thought you knew. I know I certainly did.
Some thoughts:
- Linear/Logistic regression and tree-based methods are the most popular choices of algorithms across all professions. Is this due to their ease of use and interpretability, or have we done something wrong when it comes to teaching more advanced algorithms and making them accessible?
- It seems that many Data Scientists spend time building data infrastructure and many Data Engineers spend time doing data analysis. Is this the best use of their time? Couldn’t this lead to employees being unhappy? How do we raise awareness of this issue?
- Even though many professionals across different roles are practicing machine learning, the fields of computer vision and NLP seem to be uncharted territory for many, despite becoming increasingly important in this age of big data. How do we close this gap?
- Data visualisation was the dominant data activity across all roles and for a good reason. It enables us to understand our data, tell stories, discover insights and engage audiences. So, next time, before you fit that fancy neural network, make sure you visualise your dataset first 🙂