joseph misiti bd647adab2 Merge pull request #25 from tmills/master
Added cleartk and ctakes to java section.
2014-07-16 07:42:37 -04:00

A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.

If you want to contribute to this list (please do), send me a pull request or contact me @josephmisiti.

C++

Computer Vision

  • CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library
  • OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS.

General-Purpose Machine Learning

Neural Nets

Clojure

General-Purpose Machine Learning

  • Clojure Toolbox - A categorised directory of libraries and tools for Clojure

Go

Natural Language Processing

  • go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm.
  • paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm
  • snowball - Snowball Stemmer for Go.

General-Purpose Machine Learning

  • Go Learn - Machine Learning for Go
  • go-pr - Pattern recognition package in Go lang.
  • bayesian - Naive Bayesian Classification for Golang.
  • go-galib - Genetic Algorithms library written in Go / golang

Data Analysis / Data Visualization

  • go-graph - Graph library for Go/golang language.
  • SVGo - The Go Language library for SVG generation

Java

Natural Language Processing

  • [CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml) - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words.
  • [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml) - A natural language parser is a program that works out the grammatical structure of sentences.
  • [Stanford POS Tagger](http://nlp.stanford.edu/software/tagger.shtml) - A Part-Of-Speech Tagger (POS Tagger).
  • [Stanford Named Entity Recognizer](http://nlp.stanford.edu/software/CRF-NER.shtml) - Stanford NER is a Java implementation of a Named Entity Recognizer.
  • [Stanford Word Segmenter](http://nlp.stanford.edu/software/segmenter.shtml) - Tokenization of raw text is a standard pre-processing step for many NLP tasks.
  • Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
  • Stanford Phrasal - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java.
  • Stanford English Tokenizer - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words".
  • Stanford Tokens Regex - A framework for defining regular expressions over sequences of tokens.
  • Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions.
  • Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion
  • Stanford Topic Modeling Toolbox - Topic modeling tools for social scientists and others who wish to perform analysis on datasets.
  • Twitter Text Java - A Java implementation of Twitter's text processing library
  • MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  • OpenNLP - A machine learning based toolkit for the processing of natural language text.
  • LingPipe - A tool kit for processing text using computational linguistics.
  • ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
  • Apache cTAKES - Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text.

General-Purpose Machine Learning

  • MLlib in Apache Spark - Distributed machine learning library in Spark
  • Mahout - Distributed machine learning
  • Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes.
  • Weka - Weka is a collection of machine learning algorithms for data mining tasks
  • ORYX - Simple real-time large-scale machine learning infrastructure.
  • H2O - ML engine that supports distributed learning on data stored in HDFS.

Data Analysis / Data Visualization

  • Hadoop - Hadoop/HDFS
  • Spark - Spark is a fast and general engine for large-scale data processing.
  • Impala - Real-time Query for Hadoop

JavaScript

Natural Language Processing

  • Twitter-text-js - A JavaScript implementation of Twitter's text processing library
  • NLP.js - NLP utilities in JavaScript and CoffeeScript

Data Analysis / Data Visualization

General-Purpose Machine Learning

  • Convnet.js - ConvNetJS is a JavaScript library for training Deep Learning models. [DEEP LEARNING]
  • Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser
  • Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm
  • Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js
  • Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser
  • LDA.js - LDA topic modeling for node.js
  • Learning.js - Javascript implementation of logistic regression/c4.5 decision tree
  • Machine Learning - Machine learning library for Node.js
  • Node-SVM - Support Vector Machine for Node.js
  • Brain - Neural networks in JavaScript
  • Bayesian-Bandit - Bayesian bandit implementation for Node and the browser.

Julia

General-Purpose Machine Learning

  • PGM - A Julia framework for probabilistic graphical models.
  • DA - Julia package for Regularized Discriminant Analysis
  • Regression - Algorithms for regression analysis (e.g. linear regression and logistic regression)
  • Local Regression - Local regression, so smooooth!
  • Naive Bayes - Simple Naive Bayes implementation in Julia
  • Mixed Models - A Julia package for fitting (statistical) mixed-effects models
  • Simple MCMC - Basic MCMC sampler implemented in Julia
  • Distance - Julia module for Distance evaluation
  • Decision Tree - Decision Tree Classifier and Regressor
  • Neural - A neural network in Julia
  • MCMC - MCMC tools for Julia
  • GLM - Generalized linear models in Julia
  • Online Learning
  • GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet
  • Clustering - Basic functions for clustering data: k-means, dp-means, etc.
  • SVM - SVMs for Julia
  • Kernel Density - Kernel density estimators for Julia
  • Dimensionality Reduction - Methods for dimensionality reduction
  • NMF - A Julia package for non-negative matrix factorization
  • ANN - Julia artificial neural networks

Natural Language Processing

Data Analysis / Data Visualization

  • Graph Layout - Graph layout algorithms in pure Julia

  • Data Frames Meta - Metaprogramming tools for DataFrames

  • Julia Data - library for working with tabular data in Julia

  • Data Read - Read files from Stata, SAS, and SPSS

  • Hypothesis Tests - Hypothesis tests for Julia

  • Gadfly - Crafty statistical graphics for Julia.

  • Stats - Statistical tests for Julia

  • RDataSets - Julia package for loading many of the data sets available in R

  • DataFrames - library for working with tabular data in Julia

  • Distributions - A Julia package for probability distributions and associated functions.

  • Data Arrays - Data structures that allow missing values

  • Time Series - Time series toolkit for Julia

  • Sampling - Basic sampling algorithms for Julia

Misc Stuff / Presentations

Matlab

Computer Vision

  • Contourlets - MATLAB source code that implements the contourlet transform and its utility functions.
  • Shearlets - MATLAB code for shearlet transform
  • Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles.
  • Bandlets - MATLAB code for bandlet transform

Natural Language Processing

  • NLP - An NLP library for Matlab

General-Purpose Machine Learning

Data Analysis / Data Visualization

  • matlab_bgl - MatlabBGL is a Matlab package for working with graphs.
  • gaimc - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions.

Python

Natural Language Processing

  • NLTK - A leading platform for building Python programs to work with human language data.
  • Pattern - A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others.
  • TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
  • jieba - Chinese Words Segmentation Utilities.
  • SnowNLP - A library for processing Chinese text.
  • loso - Another Chinese segmentation library.
  • genius - A Chinese word segmentation library based on Conditional Random Fields.
  • nut - Natural language Understanding Toolkit
  • Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)

General-Purpose Machine Learning

  • Bayesian Methods for Hackers - Book/IPython notebooks on Probabilistic Programming in Python
  • MLlib in Apache Spark - Distributed machine learning library in Spark
  • scikit-learn - A Python module for machine learning built on top of SciPy.
  • graphlab-create - A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.
  • BigML - A library that contacts external servers.
  • pattern - Web mining module for Python.
  • NuPIC - Numenta Platform for Intelligent Computing.
  • Pylearn2 - A Machine Learning library based on Theano.
  • hebel - GPU-Accelerated Deep Learning Library in Python.
  • gensim - Topic Modelling for Humans.
  • PyBrain - Another Python Machine Learning Library.
  • Crab - A flexible, fast recommender engine.
  • python-recsys - A Python library for implementing a Recommender System.
  • thinking bayes - Book on Bayesian Analysis
  • Restricted Boltzmann Machines - Restricted Boltzmann Machines in Python. [DEEP LEARNING]
  • Bolt - Bolt Online Learning Toolbox
  • CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree
  • nilearn - Machine learning for NeuroImaging in Python
  • Shogun - The Shogun Machine Learning Toolbox
  • Pyevolve - Genetic algorithm framework.

Data Analysis / Data Visualization

  • SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
  • NumPy - A fundamental package for scientific computing with Python.
  • Numba - Python JIT (just in time) compiler to LLVM aimed at scientific Python by the developers of Cython and NumPy.
  • NetworkX - A high-productivity software for complex networks.
  • Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
  • Open Mining - Business Intelligence (BI) in Python (Pandas web interface)
  • PyMC - Markov Chain Monte Carlo sampling toolkit.
  • zipline - A Pythonic algorithmic trading library.
  • PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib.
  • SymPy - A Python library for symbolic mathematics.
  • statsmodels - Statistical modeling and econometrics in Python.
  • astropy - A community Python library for Astronomy.
  • matplotlib - A Python 2D plotting library.
  • bokeh - Interactive Web Plotting for Python.
  • plotly - Collaborative web plotting for Python and matplotlib.
  • vincent - A Python to Vega translator.
  • d3py - A plotting library for Python, based on D3.js.
  • ggplot - Same API as ggplot2 for R.
  • Kartograph.py - Rendering beautiful SVG maps in Python.
  • pygal - A Python SVG Charts Creator.
  • pycascading

Misc Scripts / IPython Notebooks / Codebases

Kaggle Competition Source Code

Ruby

Natural Language Processing

  • Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I've encountered so far for Ruby
  • Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independent front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.
  • Stemmer - Expose libstemmer_c to Ruby
  • Ruby Wordnet - This library is a Ruby interface to WordNet
  • Raspell - raspell is an interface binding for Ruby
  • UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing
  • Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets

General-Purpose Machine Learning

Data Analysis / Data Visualization

  • rsruby - Ruby - R bridge
  • data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby
  • ruby-plot - gnuplot wrapper for Ruby, especially for plotting ROC curves into SVG files
  • plot-rb - A plotting library in Ruby built on top of Vega and D3.
  • scruffy - A beautiful graphing toolkit for Ruby
  • SciRuby
  • Glean - A data management tool for humans
  • Bioruby
  • Arel

Misc

R

General-Purpose Machine Learning

Data Analysis / Data Visualization

Scala

Natural Language Processing

  • ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries.
  • Breeze - Breeze is a numerical processing library for Scala.
  • Chalk - Chalk is a natural language processing library.
  • FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

Data Analysis / Data Visualization

  • MLlib in Apache Spark - Distributed machine learning library in Spark
  • Scalding - A Scala API for Cascading
  • Summingbird - Streaming MapReduce with Scalding and Storm
  • Algebird - Abstract Algebra for Scala
  • xerial - Data management utilities for Scala
  • simmer - Reduce your data. A unix filter for algebird-powered aggregation.
  • PredictionIO - PredictionIO, a machine learning server for software developers and data engineers.
  • BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis.

General-Purpose Machine Learning

  • Conjecture - Scalable Machine Learning in Scalding
  • brushfire - Decision trees for Scalding
  • ganitha - Scalding-powered machine learning
  • adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.
  • bioscala - Bioinformatics for the Scala programming language
  • BIDMach - CPU and GPU-accelerated Machine Learning Library.

Credits

  • Some of the Python libraries were cut-and-pasted from vinta
  • The few Go references I found were pulled from this page