Running Hama K-Means in 5 minutes

Edward J. Yoon <edwardyoon AT apache DOT org>

As you may already know, the Apache Hama project provides a set of machine learning algorithms that can be applied to very large data sets in many domains.

In this article, I explain how to run the BSP-based (Bulk Synchronous Parallel) K-Means algorithm using Apache Hama. I assume that you have already installed a Hama cluster and tested it.

1. Download the Iris data set [Data set Information].

2. Then run K-Means as follows (the TRUNK version is recommended):
  % $HAMA_HOME/bin/hama jar hama-examples-x.x.x.jar kmeans /tmp/kmeans.txt /tmp/result 10 3
  ...
  setosa: [5.1, 3.5, 1.4, 0.2] belongs to cluster 2
  setosa: [4.9, 3.0, 1.4, 0.2] belongs to cluster 2
  setosa: [4.7, 3.2, 1.3, 0.2] belongs to cluster 2
  setosa: [4.6, 3.1, 1.5, 0.2] belongs to cluster 2
  setosa: [5.0, 3.6, 1.4, 0.2] belongs to cluster 2
  ...
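
To make the "belongs to cluster" lines above easier to read, here is a minimal single-machine sketch of plain K-Means in Python. It is not Hama's BSP implementation, only an illustration of the assign-and-update loop. It assumes that the last two arguments of the command above are the maximum number of iterations (10) and the number of clusters (3). The five setosa rows are copied from the output above; the versicolor and virginica rows are taken from the standard Iris data set. The kmeans function and the hand-picked initial centers are hypothetical choices made so that the run is deterministic.

```python
import math


def kmeans(points, initial_centers, max_iterations):
    """Plain K-Means on one machine (illustration only, not Hama's code)."""
    centers = [list(c) for c in initial_centers]
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


labels = ["setosa"] * 5 + ["versicolor"] * 2 + ["virginica"] * 2
points = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
]

# Assumption: start from one hand-picked row per species (deterministic).
initial_centers = [points[0], points[5], points[7]]
assignments, centers = kmeans(points, initial_centers, max_iterations=10)

for label, point, cluster in zip(labels, points, assignments):
    print(f"{label}: {point} belongs to cluster {cluster}")
```

The cluster numbers themselves carry no meaning. What matters is that rows of the same species end up with the same number, just as all the setosa rows land in cluster 2 in the Hama output.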