"Data is the new science. Big Data holds the answers." - Pat Gelsinger, CEO, VMware.
It should come as no surprise that companies are quickly gearing up to adopt technologies that can make big data projects profitable. Careers in big data analysis are emerging across industries, and this has compelled professionals to upgrade their skills in order to exploit the opportunity.
Before jumping into any career, it is best to research the kinds of projects you will work on in that role. This will let you equip yourself with the necessary skills and reduce the time it takes to become productive in your chosen role.
Below is a non-exhaustive list of widely used open source Data Science projects created by contributors from around the world. Depending upon the license of each project, you can use its source code for your personal and commercial work. This article aims to add technical tools to your arsenal for dealing with Big Data problems.
1. Apache Mahout
Apache Mahout is an environment for quickly creating scalable and efficient machine learning applications. Mahout provides an environment for building scalable algorithms by incorporating components like H2O algorithms, Scala and Spark, and Hadoop MapReduce algorithms. The latest version of Apache Mahout is 0.11.1, released on 6th November 2015.
Apache Mahout includes a math environment called Samsara, which can be used to build your own mathematical solutions alongside some standard algorithms. Mahout Samsara covers general linear algebra, statistical operations and data structures. It can be customized in Scala using Mahout extensions, or used as a library. Mahout's algorithms include many new implementations built for fast operation on Mahout-Samsara, among them the Naive Bayes classifier, matrix decomposition, collaborative filtering and neural networks. The new spark-itemsimilarity job enables high-quality co-occurrence recommenders that can use entire user click streams and context in order to make recommendations.
Apache Mahout Github
2. Apache Spark
Apache Spark is an open source data processing engine for real-time analysis of Big Data. Spark is currently deployed by Yahoo, Tencent and Baidu, collectively processing petabytes of data on clusters of more than 8,000 nodes. On GitHub, Apache Spark has had over 750 contributors, making it one of the largest open source projects in data processing.
Apache Spark can execute programs up to 100 times faster than Hadoop MapReduce in memory, and 10 times faster when executing from disk. This disparity in speed comes from Spark's Directed Acyclic Graph (DAG) engine and its enhanced in-memory computing capability. Spark offers easy-to-use APIs for working with large datasets: you can quickly write applications in Java, Python, Scala or R, and Spark offers close to 100 high-level operators for building parallel applications.
Spark Ecosystem. Image Source: Databricks
Apache Spark offers a stack of powerful libraries, including MLlib for machine learning, Spark Streaming and GraphX, as well as libraries for DataFrames and SQL. These standard libraries can be used to enhance application performance and developer productivity. Spark runs on a variety of platforms; for example, it can run on a standalone cluster, on Hadoop YARN, on EC2 or on Apache Mesos. Apache Spark can access data from sources like Hive, HBase, Tachyon, Cassandra or HDFS.
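Much of Spark's speed comes from lazy evaluation: transformations only record lineage in the DAG, and nothing executes until an action is called. The snippet below is a toy pure-Python sketch of that idea, not Spark's actual API; the `LazyDataset` name is invented for illustration.

```python
# Toy illustration of lazy evaluation: transformations build a lineage
# (a simple chain standing in for Spark's DAG), and work only happens
# when an action such as collect() is invoked.
# LazyDataset is an invented name, not part of any Spark API.

class LazyDataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded lineage, not yet executed

    def map(self, fn):                # transformation: just records the step
        return LazyDataset(self.data, self.ops + [("map", fn)])

    def filter(self, fn):             # transformation: just records the step
        return LazyDataset(self.data, self.ops + [("filter", fn)])

    def collect(self):                # action: now the whole chain runs
        items = self.data
        for kind, fn in self.ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())   # [0, 4, 16, 36, 64]
```

In real PySpark the same chain would be written against an RDD or DataFrame, and the engine would additionally optimize and parallelize the recorded plan across the cluster.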
Apache Spark Github
3. Apache Storm
Apache Storm processes unbounded streams of data and is a reliable computation system for real-time processing of Big Data. It is open source and can be used with all mainstream high-level programming languages, which makes it easy to adopt. Apache Storm is used for applications like online machine learning, continuous computation, real-time analytics, ETL and distributed RPC. It offers benefits like easy setup and usability, fault tolerance and scalability, and it is fast enough to process over a million tuples per second per node.
The latest version of Apache Storm, 0.9.6, was released on 5th November 2015.
A Storm cluster has three types of nodes: the Nimbus node, Zookeeper nodes and Supervisor nodes. The Nimbus node is similar to the Hadoop JobTracker; it is used to upload computations, distribute code across the cluster and monitor computation. Zookeeper nodes coordinate the Storm cluster, and Supervisor nodes control workers according to instructions from the Nimbus node.
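The work those worker nodes perform is organized as a topology of spouts (stream sources) and bolts (processing stages). This is a conceptual sketch in plain Python generators, not Storm's actual Java/Clojure API:

```python
# Conceptual sketch of a Storm-style topology: a spout emits an
# unbounded stream of tuples, and bolts transform them stage by stage.
# Plain Python, not the real Storm API.

def sentence_spout(sentences):
    """Spout: emits one tuple (here, a sentence) at a time."""
    for s in sentences:
        yield s

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: maintains running word counts (a stateful bolt)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = sentence_spout(["storm processes streams", "storm is fast"])
counts = count_bolt(split_bolt(stream))
print(counts["storm"])   # 2
```

In real Storm each stage would run as parallel tasks on Supervisor-managed workers, with tuples routed between them over the network.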
Apache Storm GitHub
4. NLTK (Natural Language Toolkit)
NLTK is a suite of packages for building Python applications that work with natural human language. Remember how easily Jarvis interacted with Tony Stark in the movies? Jarvis was NLTK on steroids.
NLTK has built-in text processing libraries for operations like tokenization, classification, tagging, stemming, semantic reasoning and parsing. It also offers wrappers for high-quality NLP libraries and has a large community of contributors. It is a powerful Python tool for computational and empirical linguistics, artificial intelligence, cognitive science, machine learning and information retrieval.
Text prediction is one use case for NLTK: from a partial text input, we can drive an intelligent incremental search by predicting a list of highly likely matches. Other applications of NLTK include text categorization, postal address parsing and voice-command interfaces.
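The text-prediction idea can be illustrated without NLTK itself: given a partial input, rank known words sharing that prefix by frequency. This is a toy sketch; a real predictor built on NLTK would draw frequencies from a corpus (e.g. via `nltk.FreqDist`) and use n-gram context.

```python
# Toy prefix-based text prediction: given a partial input, return the
# most frequent known words starting with it. A real NLTK-based system
# would use corpus frequencies and n-gram language models instead.
from collections import Counter

corpus = "the theory of the thermal theory is the thing".split()
freq = Counter(corpus)   # word -> occurrence count

def predict(prefix, k=3):
    matches = [w for w in freq if w.startswith(prefix)]
    # Most frequent matches first (stable sort keeps ties in corpus order).
    return sorted(matches, key=lambda w: -freq[w])[:k]

print(predict("the"))   # 'the' ranks first (3 occurrences), then 'theory'
```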
Natural Language Toolkit GitHub
5. mLoss (Machine Learning Open Source Software)
Machine Learning Open Source Software (mLoss) is a single platform offering an aggregate collection of open source software. It is a venue where open source machine learning software is collected and peer reviewed, each entry accompanied by a short project description. mLoss itself is not software but a website supporting the open source movement for machine learning applications.
The software listed on the mLoss website is compliant with 48 different open source licenses. More than 1,100 authors have contributed to this open source movement. It is by far the largest collection of machine learning software, currently supporting 107 data formats and related applications across 26 operating systems and 51 high-level programming languages.
Some of the popular software available on mLoss website are:
- dlib ml: C++ toolkit for machine learning algorithms
- R-Cran-Caret: classification and regression training toolkit
- Shogun: machine learning toolbox for Python, Matlab, Octave and R, with a focus on SVMs
- Armadillo: C++ linear algebra library
- MLPY: machine learning module for Python built on top of NumPy/SciPy
- MyMediaLite: multipurpose library for recommender systems
6. Julia Statistics
Julia is a high-level dynamic programming language for technical computing. Although its syntax will be familiar to users of other technical computing environments, it is not yet widely adopted. Julia offers distributed parallel execution and an elaborate mathematical function library optimized for numerical accuracy.
JuliaStats is an open source collection of machine learning and statistical tools, aimed to create scalable and efficient implementation in Julia. Some of the packages available under JuliaStats are mentioned below:
- StatsBase: provides basic statistical functionality such as descriptive statistics and moments, sampling functions, counting, sorting, cross-correlation, auto-correlation and weighted statistics.
- DataArrays: provides array types that allow for missing data and optimize computation on arrays with repetitive values.
- DataFrames: represents tabular datasets in Julia. It provides functionality like database-style indexing and merging, and creating formulas and model frames.
- Distributions: enables working with probability distributions in Julia, including univariate and multivariate distributions, PDFs, CDFs and maximum likelihood estimation.
- MultivariateStats: covers multivariate statistical analysis. Operations like dimensionality reduction, linear regression, LDA and multidimensional scaling can be performed with this package.
- MLBase: includes support functions for machine learning, covering operations like data preprocessing, model selection and cross validation.
- Clustering: provides algorithms for data clustering, including k-means, k-medoids and various others, along with techniques for evaluating clustering results.
These are just a few of the JuliaStats packages available for data analysis and machine learning applications. Other packages, like HypothesisTests, KernelDensity, Distances, NMF, GLM, MCMC and TimeSeries, can also be used depending on your requirements. The source code for all these packages is available on the JuliaStats GitHub page.
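To make the Clustering package's core idea concrete, here is a minimal one-dimensional k-means sketch. It is written in Python purely for illustration (the data and function name are invented); Clustering.jl's `kmeans` implements the same assign-then-update loop over multidimensional data.

```python
# Minimal 1-D k-means sketch: the assign/update loop behind k-means
# clustering. Illustrative only; invented toy data.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0]))  # ~[1.0, 9.0]
```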
Julia Statistics GitHub
7. Scikit-Learn
Scikit-Learn is an open source machine learning library for the Python programming language. It is an efficient tool for data analysis and data mining, built upon SciPy, NumPy and Matplotlib. Scikit-Learn is open source and available for commercial and personal use.
Scikit-Learn is mostly used for -
- Classification: identifying which category an object in the dataset belongs to. Algorithms such as nearest neighbours, SVM and random forests are available in this toolkit for applications like image recognition and spam detection.
- Regression: predicting a continuous-valued attribute associated with a data object, using algorithms like ridge regression, SVR and Lasso. Regression can be used to predict stock prices, weather and electrical load.
- Dimensionality Reduction: Scikit-Learn offers algorithms like PCA, feature selection, Cholesky decomposition and matrix factorization to reduce the number of random variables in a problem.
- Data Processing: feature extraction and data preprocessing modules in the Scikit-Learn library can transform input data into a usable format for further machine learning applications.
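A minimal classification example in the spirit of the list above, assuming scikit-learn is installed; the two clusters of 2-D points are invented toy data.

```python
# Tiny classification example with scikit-learn's estimator API:
# fit() learns from labelled points, predict() classifies new ones.
from sklearn.neighbors import KNeighborsClassifier

# Two invented clusters of 2-D points, labelled 0 and 1.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))   # -> [0 1]
```

Every Scikit-Learn estimator follows this same fit/predict pattern, which is why swapping in an SVM or random forest is usually a one-line change.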
These are some of the most popular projects on GitHub, and the reason is obvious: they are the tools and software that people use in their jobs to solve real industry big data problems. Do let us know if we missed any crucial project that you feel should be included in this list.