README.md |
A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.
If you want to contribute to this list (please do), send me a pull request or contact me @josephmisiti
C++
Computer Vision
- CCV - C-based/Cached/Core Computer Vision Library, A Modern Computer Vision Library
- OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. It has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS.
General-Purpose Machine Learning
- MLPack
- DLib
- ecogg
- shark
- Vowpal Wabbit (VW) - A fast out-of-core learning system.
- sofia-ml - Suite of fast incremental algorithms.
Neural Nets
=======
Clojure
General-Purpose Machine Learning
- Clojure Toolbox - A categorised directory of libraries and tools for Clojure
Go
Natural Language Processing
- go-porterstemmer - A native Go clean room implementation of the Porter Stemming algorithm.
- paicehusk - Golang implementation of the Paice/Husk Stemming Algorithm
- snowball - Snowball Stemmer for Go.
General-Purpose Machine Learning
- Go Learn - Machine Learning for Go
- go-pr - Pattern recognition package in Go lang.
- bayesian - Naive Bayesian Classification for Golang.
- go-galib - Genetic Algorithms library written in Go / golang
Data Analysis / Data Visualization
Java
Natural Language Processing
- [CoreNLP] (http://nlp.stanford.edu/software/corenlp.shtml) - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words
- [Stanford Parser] (http://nlp.stanford.edu/software/lex-parser.shtml) - A natural language parser is a program that works out the grammatical structure of sentences
- [Stanford POS Tagger] (http://nlp.stanford.edu/software/tagger.shtml) - A Part-Of-Speech Tagger (POS Tagger
- [Stanford Name Entity Recognizer] (http://nlp.stanford.edu/software/CRF-NER.shtml) - Stanford NER is a Java implementation of a Named Entity Recognizer.
- [Stanford Word Segmenter] (http://nlp.stanford.edu/software/segmenter.shtml) - Tokenization of raw text is a standard pre-processing step for many NLP tasks.
- Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
- Stanford Phrasal: A Phrase-Based Translation System
- Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java.
- Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"
- Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions.
- Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion
- Stanford Topic Modeling Toolbox - Topic modeling tools to social scientists and others who wish to perform analysis on datasets
- Twitter Text Java - A Java implementation of Twitter's text processing library
- MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- OpenNLP - a machine learning based toolkit for the processing of natural language text.
- LingPipe - A tool kit for processing text using computational linguistics.
- ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
- Apache cTAKES - Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text.
General-Purpose Machine Learning
- MLlib in Apache Spark - Distributed machine learning library in Spark
- Mahout - Distributed machine learning
- Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes.
- Weka - Weka is a collection of machine learning algorithms for data mining tasks
- ORYX - Simple real-time large-scale machine learning infrastructure.
- H2O - ML engine that supports distributed learning on data stored in HDFS.
Data Analysis / Data Visualization
- Hadoop - Hadoop/HDFS
- Spark - Spark is a fast and general engine for large-scale data processing.
- Impala - Real-time Query for Hadoop
Javascript
Natural Language Processing
- Twitter-text-js - A JavaScript implementation of Twitter's text processing library
- NLP.js - NLP utilities in javascript and coffeescript
Data Analysis / Data Visualization
General-Purpose Machine Learning
- Convnet.js - ConvNetJS is a Javascript library for training Deep Learning models[DEEP LEARNING]
- Clustering.js - Clustering algorithms implemented in Javascript for Node.js and the browser
- Decision Trees - NodeJS Implementation of Decision Tree using ID3 Algorithm
- Node-fann - FANN (Fast Artificial Neural Network Library) bindings for Node.js
- Kmeans.js - Simple Javascript implementation of the k-means algorithm, for node.js and the browser
- LDA.js - LDA topic modeling for node.js
- Learning.js - Javascript implementation of logistic regression/c4.5 decision tree
- Machine Learning - Machine learning library for Node.js
- Node-SVM - Support Vector Machine for nodejs
- Brain - Neural networks in JavaScript
- Bayesian-Bandit - Bayesian bandit implementation for Node and the browser.
Julia
General-Purpose Machine Learning
- PGM - A Julia framework for probabilistic graphical models.
- DA - Julia package for Regularized Discriminant Analysis
- Regression - Algorithms for regression analysis (e.g. linear regression and logistic regression)
- Local Regression - Local regression, so smooooth!
- Naive Bayes - Simple Naive Bayes implementation in Julia
- Mixed Models - A Julia package for fitting (statistical) mixed-effects models
- Simple MCMC - basic mcmc sampler implemented in Julia
- Distance - Julia module for Distance evaluation
- Decision Tree - Decision Tree Classifier and Regressor
- Neural - A neural network in Julia
- MCMC - MCMC tools for Julia
- GLM - Generalized linear models in Julia
- Online Learning
- GLMNet - Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet
- Clustering - Basic functions for clustering data: k-means, dp-means, etc.
- SVM - SVM's for Julia
- Kernal Density - Kernel density estimators for julia
- Dimensionality Reduction - Methods for dimensionality reduction
- NMF - A Julia package for non-negative matrix factorization
- ANN - Julia artificial neural networks
Natural Language Processing
- Topic Models - TopicModels for Julia
- Text Analysis - Julia package for text analysis
Data Analysis / Data Visualization
-
Graph Layout - Graph layout algorithms in pure Julia
-
Data Frames Meta - Metaprogramming tools for DataFrames
-
Julia Data - library for working with tabular data in Julia
-
Data Read - Read files from Stata, SAS, and SPSS
-
Hypothesis Tests - Hypothesis tests for Julia
-
Gladfly - Crafty statistical graphics for Julia.
-
Stats - Statistical tests for Julia
-
RDataSets - Julia package for loading many of the data sets available in R
-
DataFrames - library for working with tabular data in Julia
-
Distributions - A Julia package for probability distributions and associated functions.
-
Data Arrays - Data structures that allow missing values
-
Time Series - Time series toolkit for Julia
-
Sampling - Basic sampling algorithms for Julia
Misc Stuff / Presentations
- JuliaCon Presentations - Presentations for JuliaCon
- SignalProcessing - Signal Processing tools for Julia
- Images - An image library for Julia
Matlab
Computer Vision
- Contourlets - MATLAB source code that implements the contourlet transform and its utility functions.
- Shearlets - MATLAB code for shearlet transform
- Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet transform designed to represent images at different scales and different angles.
- Bandlets - MATLAB code for bandlet transform
Natural Language Processing
- NLP - An NLP library for Matlab
General-Purpose Machine Learning
- Training a deep autoencoder or a classifier on MNIST digits - Training a deep autoencoder or a classifier on MNIST digits[DEEP LEARNING]
- t-Distributed Stochastic Neighbor Embedding - t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
- Spider - The spider is intended to be a complete object orientated environment for machine learning in Matlab.
- LibSVM - A Library for Support Vector Machines
- LibLinear - A Library for Large Linear Classification
- Machine Learning Module - Class on machine w/ PDF,lectures,code
- Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind.
Data Analysis / Data Visualization
- matlab_gbl - MatlabBGL is a Matlab package for working with graphs.
- gamic - Efficient pure-Matlab implementations of graph algorithms to complement MatlabBGL's mex functions.
Python
Natural Language Processing
- NLTK - A leading platform for building Python programs to work with human language data.
- Pattern - A web mining module for the Python programming language. It has tools for natural language processing, machine learning, among others.
- TextBlob - Providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.
- jieba - Chinese Words Segmentation Utilities.
- SnowNLP - A library for processing Chinese text.
- loso - Another Chinese segmentation library.
- genius - A Chinese segment base on Conditional Random Field.
- nut - Natural language Understanding Toolkit
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
General-Purpose Machine Learning
- Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming in Python
- MLlib in Apache Spark - Distributed machine learning library in Spark
- scikit-learn - A Python module for machine learning built on top of SciPy.
- graphlab-create - A library with various machine learning models (regression, clustering, recommender systems, graph analytics, etc.) implemented on top of a disk-backed DataFrame.
- BigML - A library that contacts external servers.
- pattern - Web mining module for Python.
- NuPIC - Numenta Platform for Intelligent Computing.
- Pylearn2 - A Machine Learning library based on Theano.
- hebel - GPU-Accelerated Deep Learning Library in Python.
- gensim - Topic Modelling for Humans.
- PyBrain - Another Python Machine Learning Library.
- Crab - A flexible, fast recommender engine.
- python-recsys - A Python library for implementing a Recommender System.
- thinking bayes - Book on Bayesian Analysis
- Restricted Boltzmann Machines -Restricted Boltzmann Machines in Python. [DEEP LEARNING]
- Bolt - Bolt Online Learning Toolbox
- CoverTree - Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree
- nilearn - Machine learning for NeuroImaging in Python
- Shogun - The Shogun Machine Learning Toolbox
- Pyevolve - Genetic algorithm framework.
- Caffe - A deep learning framework developed with cleanliness, readability, and speed in mind.
Data Analysis / Data Visualization
- SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
- NumPy - A fundamental package for scientific computing with Python.
- Numba - Python JIT (just in time) complier to LLVM aimed at scientific Python by the developers of Cython and NumPy.
- NetworkX - A high-productivity software for complex networks.
- Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
- Open Mining - Business Intelligence (BI) in Python (Pandas web interface)
- PyMC - Markov Chain Monte Carlo sampling toolkit.
- zipline - A Pythonic algorithmic trading library.
- PyDy - Short for Python Dynamics, used to assist with workflow in the modeling of dynamic motion based around NumPy, SciPy, IPython, and matplotlib.
- SymPy - A Python library for symbolic mathematics.
- statsmodels - Statistical modeling and econometrics in Python.
- astropy - A community Python library for Astronomy.
- matplotlib - A Python 2D plotting library.
- bokeh - Interactive Web Plotting for Python.
- plotly - Collaborative web plotting for Python and matplotlib.
- vincent - A Python to Vega translator.
- d3py - A plottling library for Python, based on D3.js.
- ggplot - Same API as ggplot2 for R.
- Kartograph.py - Rendering beautiful SVG maps in Python.
- pygal - A Python SVG Charts Creator.
- pycascading
Misc Scripts / iPython Notebooks / Codebases
- pattern_classification
- thinking stats 2
- hyperopt
- numpic
- 2012-paper-diginorm
- ipython-notebooks
- decision-weights
- Sarah Palin LDA - Topic Modeling the Sarah Palin emails.
- Diffusion Segmentation - A collection of image segmentation algorithms based on diffusion methods
- Scipy Tutorials - SciPy tutorials. This is outdated, check out scipy-lecture-notes
- Crab - A recommendation engine library for Python
- BayesPy - Bayesian Inference Tools in Python
- scikit-learn tutorials - Series of notebooks for learning scikit-learn
- sentiment-analyzer - Tweets Sentiment Analyzer
- group-lasso - Some experiments with the coordinate descent algorithm used in the (Sparse) Group Lasso model
- mne-python-notebooks - IPython notebooks for EEG/MEG data processing using mne-python
- pandas cookbook - Recipes for using Python's pandas library
Kaggle Competition Source Code
- wiki challange - An implementation of Dell Zhang's solution to Wikipedia's Participation Challenge on Kaggle
- kaggle insults - Kaggle Submission for "Detecting Insults in Social Commentary"
- kaggle_acquire-valued-shoppers-challenge - Code for the Kaggle acquire valued shoppers challenge
- kaggle-cifar - Code for the CIFAR-10 competition at Kaggle, uses cuda-convnet
- kaggle-blackbox - Deep learning made easy
- kaggle-accelerometer - Code for Accelerometer Biometric Competition at Kaggle
- kaggle-advertised-salaries - Predicting job salaries from ads - a Kaggle competition
- kaggle amazon - Amazon access control challenge
- kaggle-bestbuy_big - Code for the Best Buy competition at Kaggle
- kaggle-bestbuy_small
- Kaggle Dogs vs. Cats - Code for Kaggle Dovs vs. Cats competition
- Kaggle Galaxy Challenge - Winning solution for the Galaxy Challenge on Kaggle
- Kaggle Gender - A Kaggle competition: discriminate gender based on handwriting
- Kaggle Merck - Merck challenge at Kaggle
- Kaggle Stackoverflow - Predicting closed questions on Stack Overflow
- kaggle_acquire-valued-shoppers-challenge - Code for the Kaggle acquire valued shoppers challenge
- wine-quality - Predicting wine quality
Ruby
Natural Language Processing
- Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
- Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby objects in any language. It includes a generic language-independant front end, a module for mapping language codes into language names, and a module which contains various English-language utilities.
- Stemmer - Expose libstemmer_c to Ruby
- Ruby Wordnet - This library is a Ruby interface to WordNet
- Raspel - raspell is an interface binding for ruby
- UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing
- Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and hashtags in tweets
General-Purpose Machine Learning
- Ruby Machine Learning - Some Machine Learning algorithms, implemented in Ruby
- Machine Learning Ruby
- jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby.
- CardMagic-Classifier - A general classifier module to allow Bayesian and other types of classifications.
- Neural Networks and Deep Learning - Code samples for my book "Neural Networks and Deep Learning" [DEEP LEARNING]
Data Analysis / Data Visualization
- rsruby - Ruby - R bridge
- data-visualization-ruby - Source code and supporting content for my Ruby Manor presentation on Data Visualisation with Ruby
- ruby-plot - gnuplot wrapper for ruby, especially for plotting roc curves into svg files
- plot-rb - A plotting library in Ruby built on top of Vega and D3.
- scruffy - A beautiful graphing toolkit for Ruby
- SciRuby
- Glean - A data management tool for humans
- Bioruby
- Arel
Misc
R
General-Purpose Machine Learning
- Clever Algorithms For Machine Learning
- Machine Learning For Hackers
- Machine Learning Task View on CRAN - A list of ML packages in R, grouped by algorithm type.
- caret - Unified interface to ~150 ML algorithms in R.
- SuperLearner and subsemble - Multi-algorithm ensemble learning packages.
- Introduction to Statistical Learning
Data Analysis / Data Visualization
- Learning Statistics Using R
- ggplot2 - A data visualization package based on the grammar of graphics.
Scala
Natural Language Processing
- ScalaNLP - ScalaNLP is a suite of machine learning and numerical computing libraries.
- Breeze - Breeze is a numerical processing library for Scala.
- Chalk - Chalk is a natural language processing library.
- FACTORIE - FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.
Data Analysis / Data Visualization
- MLlib in Apache Spark - Distributed machine learning library in Spark
- Scalding - A Scala API for Cascading
- Summing Bird - Streaming MapReduce with Scalding and Storm
- Algebird - Abstract Algebra for Scala
- xerial - Data management utilities for Scala
- simmer - Reduce your data. A unix filter for algebird-powered aggregation.
- PredictionIO - PredictionIO, a machine learning server for software developers and data engineers.
- BIDMat - CPU and GPU-accelerated matrix library intended to support large-scale exploratory data analysis.
General-Purpose Machine Learning
- Conjecture - Scalable Machine Learning in Scalding
- brushfire - decision trees for scalding
- ganitha - scalding powered machine learning
- adam - A genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.
- bioscala - Bioinformatics for the Scala programming language
- BIDMach - CPU and GPU-accelerated Machine Learning Library.