Who is a Data Scientist? Are they statisticians who have a solid understanding of data and modeling, and love commenting on the shape of bell curves? Are they scientists, who can design experiments and test hypotheses? Are they programmers who eat code for lunch and can process vast amounts of data in their sleep? Or are they business analysts who understand which questions are relevant to ask when looking at a data set?
The Data Science job market is still fairly nascent, and the question of what makes a Data Scientist has not been answered conclusively. This is what led us to conceive the AdzunaDataBot! The project had a two-fold goal: to build an understanding of the Data Science job market in the UK and monitor it on a continuous basis, and to build expertise in product development on one of the major cloud platforms through a pilot project. AdzunaDataBot gathers jobs data from Adzuna, a UK-based job board aggregator, stores and processes it on a cloud platform, and presents it visually in an easily interpretable format for interested users. Adzuna’s data store can be accessed through its web API, which can be queried by keyword and provides a rich variety of information about each job ad posted on the job boards Adzuna aggregates.
While it may be true that a ‘complete’ data scientist would (should?) have all the skills mentioned above, few among us can claim to have reached that pinnacle. However, all aspiring Data Scientists begin in one of those areas of speciality and build their skills in the other areas as their career progresses. But how do we prioritize these skills in the order in which the job market values them today? We, at The Data Lab, took a Data Science approach to this Data Science problem by analysing actual jobs data to see what the market says, and voilà, AdzunaDataBot!
Not only will the data from AdzunaDataBot be useful to individuals who want to make smarter career choices, it will also be very useful for program coordinators at universities, skills academies and bootcamps, helping them to correctly identify the different kinds of data science positions and to tailor their programs so that students gain the required skills. This goes to the heart of the core mission of The Data Lab, which is to drive collaboration between Scottish industry, public sector and academia, to exploit the value of Data Science together. Training people up with the right skill sets is the first step in ensuring Scottish industry is in the best position to exploit the techniques of Data Science effectively.
A public preview of the AdzunaDataBot is available here.
An API a day, and with a cloud solution to play, makes an easy data product today
APIs, or Application Programming Interfaces to give them their full credentials, are becoming increasingly common on the web, with all kinds of services wanting to build an ecosystem around their product by enticing developers with the ‘cool-factor’ of their API. Web APIs offer a well-defined way to programmatically access the underlying data which powers many of the services we use on the web today. The move towards a more open data culture has further encouraged the adoption of open APIs. While Facebook and Twitter might be the first ones to come to mind, they are by no means the only ones to offer access to their data mines. All kinds of services offer API access to their data, including flight pricing engines, job boards, hotel booking services and auction websites, among many others. Developers from yesteryear might look back nostalgically on their days scraping data from HTML pages, but Data Scientists are not complaining! The ability to access clean datasets from APIs allows us to spend more time building the data product, which after all is the more interesting (and potentially lucrative) bit.
Cloudy days ahead
So with this in mind, we started this project with the aim of collecting data from the Adzuna web API, storing it, and building a data product around it. We decided to build this solution on the cloud for consistency and interoperability between platforms. A few different cloud solutions were evaluated, including PaaS solutions like IBM BlueMix and Heroku, and the IaaS solution AWS EC2. While each platform has its advantages, we chose AWS because we wanted to start with a simple solution without too many fancy services attached to it. AWS EC2 is simple to configure and get started with, and it has great documentation to get unfamiliar developers up to speed quickly. The AWS free tier, which is available to anyone for a 12-month introductory period, was sufficient for this task.
Implementation
The AdzunaDataBot was implemented completely in R. The infrastructure components from AWS included a free tier EC2 Linux box and a MySQL database to store the data. To configure the EC2 environment for running R, we followed this very easy-to-follow blog post by Amazon. The API call returns a JSON object which can be easily read into an R dataframe. Since making a call to the API and returning a dataframe object is core functionality which can be leveraged across many different applications, it was implemented as an R package, called adzunar, which has been released separately on GitHub. This allowed us to experiment with the search terms and the results while abstracting away the details of actually making the API calls.
We set up a job to query the Adzuna API on a daily basis, with the keywords data science, and store the results in the MySQL DB. The free tier of the MySQL DB on Amazon RDS comes with 750 hours of usage per month and 20GB of storage, which is more than sufficient for the prototype we built. This data was then queried on a daily basis to render an HTML page using flexdashboard, a super cool publishing tool available for R. Flexdashboard gives non-web developers (like us) the ability to render R plots (such as those made with ggplot2) onto a beautiful HTML page with just a few lines of code!
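The daily collect-and-store step can be sketched in a few lines. The project itself was written in R with the adzunar package and a MySQL database; the snippet below is only an illustrative Python sketch that uses an in-memory SQLite database as a stand-in. The sample payload is made up, and the field names (results, location.display_name, salary_min and so on) are assumptions about the shape of the response rather than a verified schema. It flattens each nested job record into a table row and keys the table on the job id, so that re-running the daily job does not store the same advert twice.

```python
import json
import sqlite3

# Made-up sample payload for illustration. The field names are assumptions
# about the shape of the API response, not a verified schema.
SAMPLE_RESPONSE = """
{
  "results": [
    {"id": "1", "title": "Data Scientist",
     "location": {"display_name": "London"},
     "salary_min": 45000, "salary_max": 55000,
     "description": "Python, statistics and Spark experience required."},
    {"id": "2", "title": "Data Engineer",
     "location": {"display_name": "Edinburgh"},
     "salary_min": 40000, "salary_max": 50000,
     "description": "Java and Hadoop pipelines."}
  ]
}
"""


def flatten_job(job):
    """Turn one nested job record into a flat dict suitable for a table row."""
    return {
        "id": job.get("id"),
        "title": job.get("title"),
        "location": (job.get("location") or {}).get("display_name"),
        "salary_min": job.get("salary_min"),
        "salary_max": job.get("salary_max"),
        "description": job.get("description", ""),
    }


def parse_jobs(raw_json):
    """Parse the raw JSON text and return one flat row per job advert."""
    payload = json.loads(raw_json)
    return [flatten_job(job) for job in payload.get("results", [])]


def store_jobs(conn, rows):
    """Insert rows, skipping adverts whose id is already in the table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id TEXT PRIMARY KEY,
               title TEXT,
               location TEXT,
               salary_min REAL,
               salary_max REAL,
               description TEXT
           )"""
    )
    conn.executemany(
        """INSERT OR IGNORE INTO jobs
           VALUES (:id, :title, :location, :salary_min, :salary_max, :description)""",
        rows,
    )
    conn.commit()


# SQLite in memory stands in for the MySQL database used in the project.
conn = sqlite3.connect(":memory:")
rows = parse_jobs(SAMPLE_RESPONSE)
store_jobs(conn, rows)
store_jobs(conn, rows)  # a repeated run with the same adverts adds no duplicates
stored = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(stored)
```

Keying on the advert id is one simple way to make a scheduled job safe to re-run; whether the real bot deduplicates this way is not stated here.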
The complete code for this implementation is available on GitHub for any interested Data Scientists out there. This project is still a work in progress, so contributions are most welcome!
Nuggets from AdzunaDataBot: programmers are in demand
The development version of the app has already yielded some pretty useful information regarding the Data Science job market in the UK.
We know that:
- The top five skills mentioned are Python, statistics, Java, Hadoop and Spark. Programmers and Data Engineers are clearly in demand
- London forms the most significant hub for Data Science jobs in the UK
- The median salary on offer is £49k per annum
- There is a large variance in the salaries on offer, starting from £20k all the way to £200k
- The most popular buzzword in use among all the job adverts is analytics
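Summaries like the ones above boil down to counting keyword mentions and taking simple statistics over salaries. The following is a minimal Python sketch of that idea, not the R code behind the dashboard. The adverts, the salary field and the skill list are hypothetical and exist only to show the mechanics. Each advert is counted at most once per skill, and matching is done on whole words so that, for example, "javascript" would not count as a mention of "java".

```python
import re
import statistics
from collections import Counter

# Hypothetical adverts for illustration only; this is not AdzunaDataBot data.
ADVERTS = [
    {"description": "Python and statistics for analytics", "salary": 45000},
    {"description": "Java, Hadoop and Spark engineer", "salary": 60000},
    {"description": "Python, Spark, analytics", "salary": 30000},
]

# Hypothetical list of skills to look for.
SKILLS = ["python", "statistics", "java", "hadoop", "spark"]


def count_skills(adverts, skills):
    """Count how many adverts mention each skill as a whole word."""
    counts = Counter()
    for ad in adverts:
        # A set means each advert contributes at most one count per skill.
        words = set(re.findall(r"[a-z+#]+", ad["description"].lower()))
        counts.update(skill for skill in skills if skill in words)
    return counts


def salary_summary(adverts):
    """Return the median, minimum and maximum of the salaries present."""
    salaries = [ad["salary"] for ad in adverts if ad.get("salary") is not None]
    return {
        "median": statistics.median(salaries),
        "min": min(salaries),
        "max": max(salaries),
    }


skill_counts = count_skills(ADVERTS, SKILLS)
summary = salary_summary(ADVERTS)
print(skill_counts.most_common(5))
print(summary)
```

The median is used rather than the mean because a handful of very high salaries would otherwise pull the headline figure upwards.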
For future work, we can look to split the analysis by experience level to identify the skills required for entry-level data scientists vs those required for experienced hires.