Date Posted: November 21, 2006
Update (November 27, 2007): New algorithms; input and output files now in XML format; support for sparse data files; MATLAB functions for writing PML data files and reading PML models; improved performance; new data access modes.
What is IBM Parallel Machine Learning Toolbox?
Large data sets are common in Web applications, bioinformatics, and speech and image processing. Many sophisticated machine learning algorithms cannot process such large amounts of data on a single node. IBM® Parallel Machine Learning Toolbox (PML) can do so by distributing the computations across multiple nodes. Distributing the work can shorten training dramatically: for example, from several weeks on a single node to days or even hours on multiple nodes.
The toolbox enables the application of machine learning tools to large data sets by distributing the required computations to computing nodes in a parallel fashion. The toolbox can work on various types of architecture, from multi-core machines to Blue Gene®.
PML contains many commonly used machine learning algorithms and includes an API for incorporating additional algorithms. Standard supported algorithms include the following:
- Classification: Support-vector machine (SVM) and linear least squares
- Clustering: k-means, fuzzy k-means, kernel k-means, and Iclust
- Feature reduction: principal component analysis (PCA) and kernel PCA
The toolbox runs on Windows®, Linux®, and UNIX®.
How does it work?
PML can be used in two modes:
- The built-in algorithms can be run using a simple textual interface. Users specify the location of the data and then select the algorithm and parameters to run. The output is provided in a text file.
- Algorithms can be added by making use of a simple API. This mode makes it possible for researchers to test their own algorithms, using PML as the basis for distributing the computations in an easy, reliable way.
PML uses the standard MPICH2 library for low-level communications, so it can run on widely varying architectures, such as a single-node machine, small clusters, grids, and Blue Gene. After initialization, the toolbox distributes computations to multiple computing nodes, which return their results to a master node. The master node performs the necessary updates and sends the updated results back to the computing nodes. This cycle repeats until the results converge according to the pre-specified parameters.
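The distribute, aggregate, update, and repeat cycle described above can be illustrated with a small sketch. This is not PML code and does not use the PML API or MPICH2: it is a minimal Python simulation, assuming one-dimensional k-means as the algorithm and using threads to stand in for computing nodes. The names `worker_step`, `master_update`, and `distributed_kmeans` are hypothetical and chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_step(shard, centroids):
    """Computing node (simulated): assign each 1-D point in this shard to its
    nearest centroid and return per-cluster partial sums and counts."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in shard:
        nearest = min(range(len(centroids)),
                      key=lambda j: (x - centroids[j]) ** 2)
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def master_update(partials, centroids):
    """Master node (simulated): combine the partial results from all workers
    and compute the updated centroids. Empty clusters keep their old value."""
    k = len(centroids)
    total_sums = [sum(p[0][j] for p in partials) for j in range(k)]
    total_counts = [sum(p[1][j] for p in partials) for j in range(k)]
    return [total_sums[j] / total_counts[j] if total_counts[j] else centroids[j]
            for j in range(k)]


def distributed_kmeans(shards, centroids, tol=1e-9, max_iter=100):
    """Repeat the cycle until no centroid moves by more than tol.
    tol and max_iter play the role of pre-specified parameters (assumed)."""
    iteration = 0
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for iteration in range(1, max_iter + 1):
            # Send the current centroids to every worker with its shard.
            partials = list(pool.map(worker_step, shards,
                                     [centroids] * len(shards)))
            # Master aggregates the results and performs the update.
            new_centroids = master_update(partials, centroids)
            shift = max(abs(a - b) for a, b in zip(new_centroids, centroids))
            centroids = new_centroids
            if shift <= tol:
                break
    return centroids, iteration


if __name__ == "__main__":
    shards = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]  # one shard per simulated node
    print(distributed_kmeans(shards, [0.0, 5.0]))
```

Each worker returns only small per-cluster summaries rather than its raw data, which is the property that makes this pattern attractive for large data sets. In an MPI-based setting the aggregation and redistribution steps would typically map onto reduce and broadcast operations, though the source does not say how PML implements them.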
About the technology author(s)
Udi Aharoni is a research staff member in the Machine Learning group at the IBM Haifa Research Laboratory.
Amol Ghoting, Ph.D., is a research staff member in the Mathematical Sciences Department at the IBM T. J. Watson Research Center. He is interested in data mining, database systems, high performance computing, and architecture-conscious algorithms.
Edwin Pednault, Ph.D., is a research staff member in the Mathematical Sciences Department at the IBM T. J. Watson Research Center. He is the inventor of the transform regression algorithm in DB2 Intelligent Miner Modeling and is the architect of the parallelization scheme used in the Parallel Machine Learning Toolbox. Dr. Pednault's research interests center on high-performance predictive modeling algorithms and their applications.
Dan Pelleg, Ph.D., is a research staff member in the Machine Learning group at the IBM Haifa Research Laboratory. In the past, he has worked on bioinformatics, Web search, clustering, and accelerated data-mining algorithms.
Ramesh Natarajan, Ph.D., is a member of the Data Abstraction group at the IBM T. J. Watson Research Center. He works in the areas of statistical data mining and databases. Dr. Natarajan is the recipient of two IBM research division awards for his contributions to the IBM SP-2 parallel computer project.
Elad Yom-Tov, Ph.D., is a research staff member on the Machine Learning Team at the IBM Haifa Research Laboratory, where he works on the applications of machine learning to search technologies, autonomic computing, bioinformatics, and hardware verification (among other projects). Dr. Yom-Tov is the author (with David Stork) of the Computer Manual to Accompany Pattern Classification, a book and MATLAB toolbox on pattern classification.
