by Roman Popat, Data Scientist at The Data Lab
The Data Lab escaped to San Jose a few weeks ago to attend Strata+Hadoop World. What a week it was! Strata is a huge conference, so choosing between such a wide variety of sessions can be quite daunting. Despite this, I found a good mix of hard-core statistics and machine learning, data ethics and more high-level data strategy. It was a really good overview of the state of the art in data science.
Among my favourite findings was that while computer vision with deep neural networks has come on in leaps and bounds, it’s still constrained by the need for a large, human-annotated training set of images. This means a lot of undergrad volunteers or a lot of Mechanical Turk work to get you there. Alexei (Alyosha) Efros (UC Berkeley) gave a very entertaining account of the power and limitations of computer vision. I highly recommend watching his talk. Did you know that most of the world’s visual data will never be seen by a human?
Sticking with neural networks for a minute, Stephen Merity (MetaMind) gave a fantastic and insightful talk on training networks. Imagine you are shown a photograph. The photograph is then removed and you are asked a series of questions about it. Wouldn’t it be useful to be shown the photograph a second time or, even better, to have the photograph and the query available simultaneously? The same intuition can be applied to neural networks. Merity and colleagues are developing Dynamic Memory Networks (DMN), which incorporate memory and attention into neural networks to achieve high performance on question answering tasks in visual and text data.
A remaining challenge in data science is the deluge of ‘dark data’. A huge reservoir of untapped data is inaccessible to current software because the data lacks structure. Mike Cafarella (Assistant Professor, University of Michigan), cofounder of Apache Hadoop, is on a quest to change this with a new piece of software called ‘DeepDive’. The software uses machine learning to extract information from an unstructured source into a user-defined schema. DeepDive has already been successful in identifying potential patterns in human trafficking in the sex industry by scanning adverts for escort services. There is a huge prize for perfecting access to dark data and I look forward to seeing future applications of this technology.
The power of statistics and big data can be employed to estimate the number of hidden victims in conflict zones. Megan Price (Human Rights Data Analysis Group, HRDAG) and her team have made some stunning progress in this area. They examined reports of deaths in Syria from four different sources. By cross-referencing, the team could identify not only how many deaths had been reported by each source but also how many times a single death had been reported across several sources. This lends itself to a technique called Multiple Systems Estimation (MSE), or Capture-Recapture, which is used to estimate the total size of an unseen population. By thinking about the underlying processes that generate the data and the sources of bias that skew it, Price and colleagues can provide governments and advocacy organisations with a much more accurate picture of the reality and drive more effective policy decisions.
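To give a feel for the capture-recapture idea, here is a minimal sketch of its simplest two-list form (the Lincoln-Petersen estimator, plus Chapman's small-sample correction). This is not HRDAG's method: with four sources and realistic reporting biases, MSE needs more careful modelling. The record identifiers below are made up purely for illustration, and the sketch assumes the two lists are independent samples of the same population.

```python
def lincoln_petersen(source_a, source_b):
    """Two-list capture-recapture estimate of total population size.

    If n_a records appear in list A, n_b in list B and m in both, then
    (assuming the lists are independent) N is estimated as n_a * n_b / m.
    """
    a, b = set(source_a), set(source_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between sources; estimate is undefined")
    return len(a) * len(b) / overlap


def chapman(source_a, source_b):
    """Chapman's variant, which reduces bias when the overlap is small."""
    a, b = set(source_a), set(source_b)
    overlap = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1


# Hypothetical record identifiers, invented for this example only.
list_a = {f"id{i}" for i in range(0, 60)}    # 60 records
list_b = {f"id{i}" for i in range(40, 90)}   # 50 records, 20 shared with list_a

observed = len(list_a | list_b)              # distinct records actually seen
estimate = lincoln_petersen(list_a, list_b)  # 60 * 50 / 20
corrected = chapman(list_a, list_b)

print(f"observed: {observed}, estimated total: {estimate:.0f}")
print(f"Chapman estimate: {corrected:.1f}")
```

In this toy case 90 distinct records are observed, but the size of the overlap suggests a total of around 150, so roughly 60 are estimated to be missing from both lists.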
Last but not least, I am an R user, so this one is particularly exciting for me. In April 2015 Microsoft acquired a company called Revolution Analytics. This company rebuilt R with multi-threaded capabilities via BLAS/LAPACK libraries. This means faster multi-core performance within your R applications. This is now a Microsoft product and MS have integrated it into the MS Azure Stack and plan to grow support and integration in the future. David Smith (formerly Revolution Analytics and now Microsoft) gave a great overview of this journey at the Bay Area R User Group. So will R become a tool used routinely in production? Watch this space…
I came away from this trip brimming with excitement about the widespread use of analytics and the endless opportunities for collaboration. The Data Lab assembled a Scottish delegation of companies and it was very rewarding to spend time with them, understand what their ambitions were and collectively digest the experience.
Here is what some of them had to say about the trip:
Introducing new, or disruptive, technologies can be quite a challenge for any large business. Therefore, how others are building teams and environments to successfully drive innovation is certainly relevant to my team at Aggreko.
Steve Faull, Development Manager, Aggreko
…for what you’re trying to achieve I think it’s critical that the future leaders of this wave in Scotland get a chance to see the broader stage on which it’s being developed.
Paul Fleming, CIO, Stirling Council
It was a hugely valuable experience for me and SNAP40. Keep up the good work. The Data Lab are well and truly helping Scotland develop something cutting-edge in data-driven industry.
Stewart Whiting, Data Scientist, SNAP40
Thanks all for a great time.