The Data Lab

Which Data Science Platform is Best? The Challenges of Explainable ML and AI

Tech blog 31/08/2018

I recently finished a project that tested the performance of Machine Learning algorithms across different data science platforms, with an explicit focus on explainability. In this post, I describe some of the criteria and the platforms that were used in the project.

In the internet era, sectors ranging from pharmaceuticals to molecular biology generate vast amounts of data. The need for automated tools that can analyse these data and surface business-critical insights is greater today than it has ever been.

Data science platforms, software hubs for integrating data and for building and deploying models, have seen a rapid rise in both supply and demand. But which of these platforms is ‘best’?

Without getting into a philosophical debate, ‘best’ probably means very different things to different individuals and organisations. However, platforms that offer interpretable Machine Learning (ML) and Artificial Intelligence (AI) outputs are generally preferred to ‘black box’ ones.

Platforms that met our criteria

The number of available tools keeps increasing as friction in the market for data science services disappears. How would one choose the right platform for data science tasks?
In developing this project, we first set out what would qualify as a data science platform. To make it onto our list, a platform had to allow for data pre-processing, feature selection, classifier choice and parameter tuning, and to support open source use. Of the 41 platforms identified, R and Python were set as the benchmarks, and five others that satisfied the selection criteria were chosen for further study using supervised and unsupervised ML models.
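
As a rough illustration of this kind of screening, the sketch below keeps only the candidates that support every one of the five criteria. The class, the function names and the candidate platforms are hypothetical placeholders, not the 41 platforms surveyed or the procedure actually used in the project.

```python
from dataclasses import dataclass

# The five capabilities named above as selection criteria.
CRITERIA = (
    "data_preprocessing",
    "feature_selection",
    "classifier_choice",
    "parameter_tuning",
    "open_source_support",
)


@dataclass(frozen=True)
class Platform:
    name: str
    capabilities: frozenset  # names of the capabilities the platform supports


def meets_criteria(platform, criteria=CRITERIA):
    """Return True only if the platform supports every criterion."""
    return all(c in platform.capabilities for c in criteria)


def shortlist(platforms, criteria=CRITERIA):
    """Return the names of the platforms that satisfy all criteria."""
    return [p.name for p in platforms if meets_criteria(p, criteria)]


# Hypothetical candidates, for illustration only.
candidates = [
    Platform("platform_a", frozenset(CRITERIA)),
    Platform("platform_b", frozenset(CRITERIA) - {"parameter_tuning"}),
    Platform("platform_c", frozenset({"data_preprocessing"})),
]

print(shortlist(candidates))
```

Treating the criteria as an all-or-nothing checklist mirrors the wording above; a weighted score would be an alternative if partial support should count.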

R is an open source statistical platform that allows users to build very advanced ML models. Its functionality is widely known among industry users and academics alike. R's libraries are fully open source, and the functions within them are transparent and documented mathematically. R is arguably the best visualisation platform available. It runs on Windows, Mac OS X and Linux, and it works with data sources such as Microsoft Excel, Microsoft Access, MySQL, SQLite and Oracle. One of R's remarkable features is its adaptability: thanks to its popularity, expressive power and transparency, developers keep building creative interfaces to software that complements R's strengths. R's memory management has been a drawback. However, recent advances in techniques allow developers to better understand R's memory management and ultimately make functions and loops run faster.

Python is a widely used data science (DS) platform and programming language. It is an object-oriented language that is also widely used for web and game development. Python is used in many different software packages and in sectors ranging from academia to pharmaceuticals. It is used in services such as Google's search engine, YouTube, Dropbox, Reddit, Quora, Disqus and FriendFeed, and organisations such as NASA, IBM and Mozilla also rely on it. Because it is open source and allows systems to be integrated quickly and effectively, Python is exceptionally appealing to startups and smaller companies.

H2O is an open source, in-memory, distributed ML platform. H2O is written in Java. Inside H2O, a distributed key-value store enables data, models and other objects to be accessed across different machines. Its algorithms are built on a distributed map/reduce framework and use the Java Fork/Join framework for multi-threading. Data is loaded into an H2O data frame, which is distributed across the nodes of the cluster and stored in memory. H2O's intelligent data parser can guess the schema of incoming datasets and supports data ingest from multiple sources in various formats. H2O's REST API gives external programs and scripts access via JSON over HTTP. The API is used by H2O's web interface (Flow UI), R binding (H2O-R) and Python binding (H2O-Python).
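
To give a feel for the map/reduce idea, here is a minimal, conceptual sketch in plain Python. It is not H2O code and does not use H2O's API: a toy column is split into chunks, each chunk is summarised independently (the map step), and the partial summaries are merged (the reduce step). Threads stand in for the nodes of a cluster, and all function names are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split a column of values into roughly equal contiguous chunks."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_chunk(chunk):
    """Map step: summarise one chunk as a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, n_chunks=4):
    """Compute the mean from per-chunk summaries rather than one pass."""
    if not values:
        raise ValueError("cannot take the mean of an empty column")
    chunks = split_into_chunks(values, n_chunks)
    # Worker threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(combine, partials, (0, 0))
    return total / count


column = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]  # toy data
print(distributed_mean(column))
```

Because each chunk only reports a small summary, the same pattern works when the chunks live in the memory of different machines, which is the appeal of this design for in-memory distributed platforms.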

BigML allows developers and enterprises to create ML algorithms. It offers an abstract, simple interface to a wide range of ML algorithms, which can be used in isolation at a very high level or combined, by means of domain-specific languages (DSLs), into new and more complex algorithmic workflows. It therefore serves the whole gamut of users, from those who barely know the particulars of the algorithm they are invoking to savvy data scientists who can combine many algorithms in complex ways. BigML lets users tap the functionality of the platform through an open API (an API whose methods and outcomes are public and usable by anyone without any kind of reverse engineering) rather than a proprietary one. BigML connects to R via the ‘bigml’ package, which wraps the BigML API. However, this package is old and has not been updated. It includes methods that provide straightforward access to basic API functionality, as well as methods that accommodate local R data types and concepts. BigML also offers many other language bindings, all open source, for languages such as Python, Java, Ruby and Clojure.

RapidMiner offers data mining and ML procedures including data loading and transformation, data preprocessing and visualisation, modelling, evaluation, and deployment. It is written in Java. It also integrates the learning schemes and attribute evaluators of the Weka machine learning environment and the statistical modelling schemes of the R project. The platform benefits from an extensive built-in library and integrates with existing databases and with the most common open source DS programming languages, such as R and Python. Its Auto ML function automates the lifecycle of building ML models.

Dataiku's Data Science Studio (DSS) platform allows connection to any data store, eliminating integration stages. DSS detects wrong entries while automatically cleansing, transforming and enriching data. Its visualisation features make it easier to find correlations, variables and patterns that help predict future outcomes and trends. DSS also supports collaborative data science, bringing together different roles such as data engineers, business analysts, business stakeholders, hardcore coders, R users and Python users. This, in turn, provides an efficient way for these different roles to work together on DS projects. The platform runs Python in memory.

Azure ML Studio allows users to develop models in the cloud. It is also integrated with R and Python environments, which makes it possible for data scientists to write and run R and Python programs in the cloud as well.

So, which platform is best?

One of the remarkable features of the chosen platforms is that they offer extensive support for collaborative data science at scale and integrate with the benchmark platforms.
This project paves the way for more work on comparing platforms by algorithm performance, rather than on testing algorithms alone. The need for automated ML and AI will keep rising, and research in this area will enable industry and academia to weigh the trade-offs between different platforms and decide which is best suited to their purpose. All in all, there is no one-size-fits-all platform that solves every problem. The measures in this project show that the chosen platforms provide functionality and results similar to those of the benchmark platforms.
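
One simple way to picture a comparison by algorithm performance is to score each platform's predictions against the same held-out labels. The sketch below does this with plain accuracy. The labels, predictions and platform names are made up for illustration and are not results from this project, and accuracy is only one of many measures that could be used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have equal length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_platforms(y_true, predictions_by_platform):
    """Score every platform on the same test set and sort best first."""
    scores = {
        name: accuracy(y_true, preds)
        for name, preds in predictions_by_platform.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "platform_a": [1, 0, 1, 1, 0, 1, 0, 1],
    "platform_b": [1, 0, 0, 1, 0, 1, 1, 1],
}

for name, score in rank_platforms(y_true, predictions):
    print(f"{name}: {score:.3f}")
```

Holding the test set fixed across platforms is what makes the scores comparable; differences then reflect the platform's implementation and defaults rather than the data split.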

For more information on the criteria, the platforms not included here, the ML algorithms and the results of this project, please see my thesis, and reference it appropriately if used.

If the work in this project is of interest to you, please get in touch.

Tags: AI, ML
