工具清單

Cortical.io - Retina: an API performing complex NLP operations (disambiguation, classification, streaming text filtering, etc...) as quickly and intuitively as the brain.
CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words
Stanford Parser - A natural language parser is a program that works out the grammatical structure of sentences
Stanford POS Tagger - A Part-Of-Speech Tagger (POS Tagger
Stanford Name Entity Recognizer - Stanford NER is a Java implementation of a Named Entity Recognizer.
Stanford Word Segmenter - Tokenization of raw text is a standard pre-processing step for many NLP tasks.
Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
Stanford Phrasal: A Phrase-Based Translation System
Stanford English Tokenizer - Stanford Phrasal is a state-of-the-art statistical phrase-based machine translation system, written in Java.
Stanford Tokens Regex - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words"
Stanford Temporal Tagger - SUTime is a library for recognizing and normalizing time expressions.
Stanford SPIED - Learning entities from unlabeled text starting with seed sets using patterns in an iterative fashion
Stanford Topic Modeling Toolbox - Topic modeling tools to social scientists and others who wish to perform analysis on datasets
Twitter Text Java - A Java implementation of Twitter's text processing library
MALLET - A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
OpenNLP - a machine learning based toolkit for the processing of natural language text.
LingPipe - A tool kit for processing text using computational linguistics.
ClearTK - ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
Apache cTAKES - Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text.
ClearNLP - The ClearNLP project provides software and resources for natural language processing. The project started at the Center for Computational Language and EducAtion Research, and is currently developed by the Center for Language and Information Research at Emory University. This project is under the Apache 2 license.
CogcompNLP - This project collects a number of core libraries for Natural Language Processing (NLP) developed in the University of Illinois' Cognitive Computation Group, for example illinois-core-utilities which provides a set of NLP-friendly data structures and a number of NLP-related utilities that support writing NLP applications, running experiments, etc, illinois-edison a library for feature extraction from illinois-core-utilities data structures and many other packages.

通用機器學習

aerosolve - A machine learning library by Airbnb designed from the ground up to be human friendly.
Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications
ELKI - Java toolkit for data mining. (unsupervised: clustering, outlier detection etc.)
Encog - An advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks.
FlinkML in Apache Flink - Distributed machine learning library in Flink
H2O - ML engine that supports distributed learning on Hadoop, Spark or your laptop via APIs in R, Python, Scala, REST/JSON.
htm.java - General Machine Learning library using Numenta’s Cortical Learning Algorithm
java-deeplearning - Distributed Deep Learning Platform for Java, Clojure,Scala
Mahout - Distributed machine learning
Meka - An open source implementation of methods for multi-label classification and evaluation (extension to Weka).
MLlib in Apache Spark - Distributed machine learning library in Spark
Neuroph - Neuroph is lightweight Java neural network framework
ORYX - Lambda Architecture Framework using Apache Spark and Apache Kafka with a specialization for real-time large-scale machine learning.
Samoa SAMOA is a framework that includes distributed machine learning for data streams with an interface to plug-in different stream processing platforms.
RankLib - RankLib is a library of learning to rank algorithms
rapaio - statistics, data mining and machine learning toolbox in Java
RapidMiner - RapidMiner integration into Java code
Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes.
SmileMiner - Statistical Machine Intelligence & Learning Engine
SystemML - flexible, scalable machine learning (ML) language.
WalnutiQ - object oriented model of the human brain
Weka - Weka is a collection of machine learning algorithms for data mining tasks
LBJava - Learning Based Java is a modeling language for the rapid development of software systems, offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application.

語音識別

CMU Sphinx - Open Source Toolkit For Speech Recognition purely based on Java speech recognition library.

資料分析/資料視覺化

Flink - Open source platform for distributed stream and batch data processing.
Hadoop - Hadoop/HDFS
Spark - Spark is a fast and general engine for large-scale data processing.
Storm - Storm is a distributed realtime computation system.
Impala - Real-time Query for Hadoop
DataMelt - Mathematics software for numeric computation, statistics, symbolic calculations, data analysis and data visualization.
Dr. Michael Thomas Flanagan's Java Scientific Library

深度學習

Deeplearning4j - Scalable deep learning for industry with parallel GPUs