
We spoke to our data science team, Saleh, Vyron and Tom, about how data science and AI have changed over the last ten years. In short – lots!
The volume of data, technological advancements, the required skill set, data privacy and governance, the overall increase in data literacy and the ethics of AI are key themes that have changed over the last decade.
From small data sets to billions of records
Now the volume of data is vast, and data modelling happens continuously, in near real time. With the explosion of cloud computing, data models can be continuously updated with real-time data, something that would have been prohibitively expensive ten years ago, when cloud services were still in their early stages.
This, in turn, has made data science more valuable. It can now continuously generate insights in rapidly changing environments for a variety of stakeholders, compared to the batch analysis norm of a decade ago, which involved one-off projects with limited analysis.
“We have moved from one-time analytics to implementing proper data science projects, from working with a small set of data to big data with billions of records.”
Saleh Seyedzadeh, Principal Data Scientist at The Data Lab
Better infrastructure, data collection and processing power
Behind the big data lies infrastructure such as cloud computing, which has become affordable and made large-scale data processing possible. With cloud platforms taking over back-end IT management, data scientists can focus on core modelling tasks.
The Internet has also become faster, and it is now cheaper and easier to deploy Internet of Things (IoT) devices remotely, allowing seamless data collection from a variety of sources, such as sensors and cameras. Data collection technologies keep growing and producing billions of records!
The advent of GPUs (graphics processing units), pioneered by the company NVIDIA, has also driven rapid changes in the data science landscape.
Originally designed to speed up computer graphics, GPUs have become more flexible and programmable over time and can process many pieces of data at the same time. This parallelism has accelerated computing, driven down costs and provided the computational power needed for large-scale AI and machine learning (ML) models.
Ultimately, GPUs made the mass adoption of Large Language Models (LLMs) like ChatGPT, Gemini and Claude possible.
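To make the parallelism point concrete, here is a minimal sketch (not from the article) that times the same matrix multiplication on a CPU and, if one is available, on a GPU using the open-source PyTorch library. The matrix size and timing approach are illustrative assumptions.

```python
# A minimal sketch of why GPUs matter for data science: the same matrix
# multiplication, timed on the CPU and (if available) on a GPU via PyTorch.
import time
import torch

def timed_matmul(device: str, size: int = 4096) -> float:
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # start timing only after the data is on the GPU
    start = time.perf_counter()
    c = a @ b  # thousands of multiply-adds run in parallel on a GPU
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU kernel to finish before stopping the clock
    return time.perf_counter() - start

print(f"CPU: {timed_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {timed_matmul('cuda'):.3f}s")
```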
Data science in the C-suite
As a result of this growth in scale, data scientists' roles and skills have become more specialised, and teams have grown. Many organisations now employ data scientists at the C-suite level. According to Foundry's AI Priorities Study 2023, 11% of midsize to large organisations have already appointed a chief AI officer (CAIO), and another 21% are actively seeking to fill this position.
However, specialist data skills are in short supply in the UK! According to a UK Parliament research briefing published last year, demand for people with specialist data skills is increasing: there are around 178,000 unfilled data specialist roles, but only around 10,000 new data scientists graduate from universities each year.
Data governance, quality and privacy
Data governance, including data quality and privacy, is now a key consideration, putting data management at the forefront of the data science field.
Stricter data privacy regulations, such as GDPR, require robust data governance and transparency throughout data pipelines.
Because data models and analyses are now continuously updated, data quality directly affects the accuracy of data science products. Data scientists need to check that the assumptions made when an analysis was first created still hold true for the new data, as the simple check sketched below illustrates.
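As an illustration only, here is a minimal sketch of one such check, assuming NumPy and SciPy are available: it compares the distribution of a feature at the time the analysis was built with the distribution in a newer batch, using a two-sample Kolmogorov-Smirnov test. The feature, data and threshold are hypothetical.

```python
# A minimal sketch of checking that new data still matches the assumptions of
# the original analysis: compare a feature's distribution at build time with
# its distribution in the latest batch of records.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_ages = rng.normal(loc=40, scale=10, size=5_000)  # feature seen when the model was built
latest_ages = rng.normal(loc=47, scale=12, size=1_000)    # feature in the newest batch

result = ks_2samp(training_ages, latest_ages)  # two-sample Kolmogorov-Smirnov test
if result.pvalue < 0.01:  # illustrative threshold
    print(f"Possible drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print("No significant drift detected in this feature")
```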
Ethics of data and AI: from niche to central
Ethics in AI and data science is another recent priority due to the wider societal impact of AI technologies. With generative AI systems producing racist and biased outputs, safety, equality, diversity and inclusion have never been more important.
For example, Clearview AI and Amazon's Rekognition provide law enforcement with facial recognition tools, in Clearview's case built on databases of images scraped from the internet. Although facial recognition technology is more accurate than it was ten years ago, it still has higher error rates for women and people of colour, leading to misidentifications such as the widely publicised case of Porcha Woodruff, a Black woman who was eight months pregnant when she was falsely accused of carjacking.
In financial services, some models may discriminate against certain demographics by penalising attributes like postcodes, which indirectly reflect socioeconomic biases. However, due to the opacity of how these decisions are made, addressing the existing bias and discrimination is challenging.
GenAI tools have also been known to produce misleading outputs while presenting them as facts. LLMs like ChatGPT can provide incorrect answers, leading to misinformation, and GenAI tools can create deepfakes and false representations, including images, videos, and text, that can be used to spread propaganda and misinformation.
Ultimately, addressing ethical challenges in AI requires human oversight by socially representative groups, stronger governance frameworks such as ethical AI guidelines, and the promotion of transparency and explainability in AI decision-making.
“Ten years ago, biases weren’t really considered as our data science models didn’t have any huge impacts on people’s lives.”
Tom Lowe, Data Scientist at The Data Lab
Data and AI literacy
Data science and AI now affect us all and are not considered the elusive “magic” they once were to the general public.
Early adopters included the financial and pharmaceutical industries, but now we’re even seeing traditional sectors like manufacturing catch up – though there are still challenges with cultural shifts and concerns around data sharing.
The growth of open source – freely available source code that encourages open collaboration – has democratised data and AI. Open-source libraries such as Meta's PyTorch and Google's TensorFlow, and open LLMs like Meta's Llama, have increased accessibility and made it easier for people to engage with data and AI, allowing almost anyone with a laptop to code and build machine learning models, as the short sketch below shows. As a result, more professionals outside of traditional data science roles now use accessible tools for basic analytics.
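As a rough illustration of how low that barrier now is, the following sketch trains a tiny classifier on synthetic data using PyTorch. It runs on an ordinary laptop with no GPU, and the data, architecture and training settings are all hypothetical choices made for the example.

```python
# A minimal sketch of a small PyTorch model trained on synthetic data,
# runnable on a laptop. Nothing here comes from the article itself.
import torch
from torch import nn

# Synthetic binary classification data: 2 features, label is 1 when their sum is positive
X = torch.randn(1_000, 2)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):          # a short training loop
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimiser.step()

with torch.no_grad():
    accuracy = ((model(X) > 0).float() == y).float().mean()
print(f"final loss: {loss.item():.3f}, training accuracy: {accuracy:.2%}")
```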
“Open source has really lowered the barriers to data science. It’s becoming more important day by day, especially with LLMs where we get the added advantage of transparency – you can see the code, how they work and how they have been interpreted”
Vyron Christodoulou, Data Scientist at The Data Lab
Data and AI literacy is becoming a key skill in society. The challenge now is protecting those who aren't data literate, so they aren't influenced maliciously, for example through targeted ads, insurance pricing or loan decisions.
The bottom line
Ultimately, the scale of data science has increased, skill sets are changing, technology has advanced, and regulation has grown. But let's make sure we protect vulnerable groups and prioritise ethics, so that we create a society where everyone benefits from the adoption of data and AI.
10-Year Data Science Timeline
2014
- Big data and cloud storage become more prevalent and cheaper, enabling the storage and processing of large amounts of data in the cloud
- The release of GANs (generative adversarial networks) enables the creation of highly realistic synthetic data, opening new possibilities in image generation and other domains
2015
- TensorFlow, a library used to write machine learning models, is released, democratising deep learning
- The expansion of smart devices (IoT) leads to exponential growth in data production
2017
- Kubernetes, a system that automates the deployment of applications across computer clusters, gains widespread adoption. It helps manage containerised applications at scale, which is crucial for efficiently running data-heavy applications with complex processing needs.
- The use of specialised hardware to train ML/AI models becomes more prevalent, and companies like Google and NVIDIA release powerful chips (such as the second generation of TPUs) that allow computers to process data much faster.
2018
- Natural language processing (NLP) breakthroughs, such as Google's BERT, enable the processing of more complex, unstructured data from a wider variety of sources
- GDPR takes effect and the CCPA is passed, introducing stricter data privacy regulations, making everyone more aware of the ubiquity of data, and emphasising transparency and ethical sourcing of data
2020
- ML gains popularity by helping to analyse medical data during the COVID-19 pandemic
2021
- Large language models like GPT-3 become widely available, highlighting the ability to work with diverse, large-scale unstructured data
2022
- AI tools become more common and are used to write code, create ads, manage customer service (through chatbots), and help ideate and design products.
2023-2024
- As AI models become more widespread, frameworks focusing on bias reduction, fairness and transparency gain importance. Governments across the world publish responsible AI frameworks, for example the AI Risk Management Framework in the US and "A pro-innovation approach to AI regulation: government response" in the UK.