The Data Lab recently sent representatives to the Strata Data Conference in London, the UK leg of one of the world’s largest data conference series, presented by O’Reilly and Cloudera. The event was held at the gargantuan ExCeL London facility near City Airport, and attendees numbered in the thousands.
The general structure of the conference was training courses and tutorials on the first two days, followed by keynotes and sessions on the last two. There were various other events and activities organised too, such as exhibitor stands, book signings, and a Data After Dark social gathering. As always there were lots of interesting people in attendance, and plenty of time for networking and catching up with old acquaintances.
We asked our data scientist, Richard Carter, to name his Top Three highlights of the week, and here is his response:
This was my first experience of a Strata Conference and the scale was mind-blowing. There were so many fascinating speakers covering the whole data-verse, from strategy to regulations, machine learning to systems architecture. There was so much to learn, and particularly for me coming from a technical background it was a great opportunity to hear more about some of the new advances in data science that are finding their way into industry.
Here, in no particular order, are my top three highlights of the week.
- Building Your First Big Data Application on AWS
This tutorial session on the Tuesday morning was my introduction to Strata, and what a way to start. In the company of three exceptionally talented Amazon employees (let’s be honest, have you ever met a bad one?) we got stuck straight into Amazon Web Services. I have been using AWS quite a lot recently, but my experience there has been more around compute (EC2), databases (Aurora and DynamoDB), and storage (S3). On this course we took it up a notch to look at streaming data, distributed computing, and real-time analytics.
We started with Kinesis Producer UI to generate some dummy weblogs as the basis of our big data application. With Kinesis Firehose we then collected the logs and sent them along two paths. The first sent the raw logs to S3, where we used Athena and EMR to query and analyse them. The second processed and aggregated web log metrics with Kinesis Analytics, then delivered the results via Kinesis Firehose to a Redshift database. At the very end of the course we used QuickSight to visualise the logs and discover insights.
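For anyone curious what pushing records into a Firehose delivery stream actually looks like, here is a minimal Python sketch using boto3. The stream name and the shape of the dummy weblog are illustrative assumptions of mine rather than anything from the tutorial, and it assumes your AWS credentials and region are already configured.

```python
import json
import random
import time

import boto3  # AWS SDK for Python

# Firehose client; credentials/region come from the usual AWS configuration.
firehose = boto3.client("firehose")

# Hypothetical delivery stream name; in the tutorial the stream's destination
# would be an S3 bucket (raw logs) or Redshift (aggregated metrics).
STREAM_NAME = "weblog-delivery-stream"

def fake_weblog():
    """Generate a single dummy weblog record, in the spirit of the generated
    logs used as the basis of the application."""
    return {
        "timestamp": time.time(),
        "ip": f"10.0.0.{random.randint(1, 254)}",
        "path": random.choice(["/", "/products", "/checkout"]),
        "status": random.choice([200, 200, 200, 404, 500]),
    }

# Push a handful of records; Firehose buffers them and writes batches to the
# configured destination.
for _ in range(10):
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(fake_weblog()) + "\n").encode("utf-8")},
    )
```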
If that all sounds like a lot then it was, but what continues to amaze me about AWS is how such a powerful set of computing equipment is only a button-press away, and how user-friendly the whole experience is. The AWS website is stuffed full of detailed information for getting the most from the myriad services that are on offer, but nothing compares to the opportunity to ask questions of the people who design and use these systems on a daily basis.
- What Kaggle has Learned from Almost a Million Data Scientists
Anthony Goldbloom co-founded Kaggle in 2010 and is its CEO. Kaggle has led the way in promoting data science globally through its online data competitions. Companies and public sector organisations upload data sets around a specific challenge they are facing, and thousands of competitors then use their brainpower – either individually or collectively – to build the most accurate models possible. At the end of a pre-defined period the winning entrants collect a prize, usually in the form of money.
With over 4 million models now submitted, Kaggle has been able to extract patterns for successful entries to its data science challenges. The first insight Goldbloom shared was that winners tend to follow a three-step approach: explore the data with descriptive statistics, create and select relevant features, then tune parameters and ensemble models. Interestingly, he said that the second of these (the creation and selection of features) appeared to be the most important differentiator in terms of entries’ scores.
The other major take-home point concerned the techniques that have actually been successful. Goldbloom split the competitions broadly into structured and unstructured data problems. For the former, he said that boosting algorithms now lead the way, taking over from the random forests that proved successful in the early years. For unstructured data the ubiquitous deep learning approaches show the most promise, with recurrent neural networks for time series problems and convolutional neural networks for image-based challenges.
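As a rough illustration of that boosted-tree recipe, here is a minimal scikit-learn sketch in Python on synthetic data; the data set, features and parameter grid are stand-ins of my own rather than anything from an actual Kaggle competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a structured (tabular) competition data set.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# The "tune parameters" step of the winners' recipe: cross-validated search
# over boosting hyperparameters (ensembling several tuned models would follow).
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```

Libraries such as XGBoost or LightGBM, which are popular in competition entries, slot into the same pattern.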
In terms of what makes a successful data scientist, Goldbloom again had a three-point answer: 1. be creative, 2. avoid overfitting, and 3. use version control. All good advice, and a reminder that keeping up with the latest techniques or in-vogue languages is not as important as one might think.
- A) Real-Time Intelligence Gives Uber the Edge and B) Applying Machine and Deep Learning to Unleash Value in the Automotive Industry
I am cheating here by including two talks in my final selection, around a broadly common theme. The first was a keynote given by M. C. Srivas, Chief Data Architect at Uber, whilst the second was a session given by two young data scientists from BMW.
As a child growing up through the 1980s a regular TV highlight was Knight Rider. The star of the show for me was not the beautifully coiffured David Hasselhoff, but rather the Knight Industries Two Thousand Pontiac Trans-Am, or KITT for short. What I find incredible is how a talking, self-driving, intelligent car that was science-fiction just thirty years ago is now a reality on the streets of today.
Srivas talked specifically about Uber’s use of real-time data for analytics: the challenges not only of matching current supply and demand for Uber cars around the globe, but also of predicting demand to ensure that drivers are well positioned for where fares are likely to occur.
With an example taken from academics trying to mimic the motion of a sidewinder snake climbing laterally up a sand dune, he spoke about how incorporating feedback into machine learning algorithms helps to keep such models on a continuously self-improving trajectory.
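Nothing from Uber’s actual stack was shown, but the general idea of a feedback loop (a model updated continually as new outcomes arrive) can be sketched with a simple online learner in Python; the drifting synthetic signal and batch sizes below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A simple online learner: each batch of observed outcomes (the "feedback")
# nudges the model, so it keeps adapting as conditions drift.
model = SGDClassifier()
classes = np.array([0, 1])  # e.g. "fare occurred here" vs. "no fare"

rng = np.random.default_rng(0)
for step in range(100):
    # Stand-in for a fresh batch of real-time observations and their outcomes;
    # the decision boundary drifts slowly to mimic changing demand patterns.
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > step / 100.0 - 0.5).astype(int)

    model.partial_fit(X_batch, y_batch, classes=classes)
```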
The session by Josef Viehhauser and Dominik Schniertshauer from BMW Group encompassed the wide and varied uses of data science within the automotive giant. What surprised and impressed me was that this is not limited to the data that modern cars generate, but also includes things like the analysis of component parts through imaging technology and methods to predict the depreciation of cars throughout their working life.
Both of these speakers were genuinely enthusiastic about their work at BMW, and it struck me how lucky we are to work in a field where important breakthroughs are achieved on a regular basis, and how easy it is to incorporate these into our own workflows to derive new insights.