Tuesday, November 8, 2016

Running k-Means Clustering on Spark with Cloudera in your Machine

Here are some steps to start using Spark. You can download a VirtualBox and a Cloudera Hadoop distribution and start testing it locally on your machine.


Download kmeans.py example that uses MLLIB furnished by Spark.

Create a kmeans_data.txt file that looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

Download VirtualBox.

Download Cloudera CDH5 trial version.
Open VirtualBox, import the downloaded Cloudera's Virtual Box and run it.

Inside VirtualBox:

1 - (needs internet access) Install python numpy library. In a terminal, type:

$ sudo yum install numpy

2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want)

3 - Launch Cloudera Enterprise Trial by clicking on an icon on Cloudera's Desktop or run this command:

$ sudo cloudera-manager --force --enterprise

4 - Open Cloudera Manager Webinterface on your browser. Here are the credentials for that:

user: cloudera
password: cloudera

5 - Start HDFS on ClouderaManager Webinterface (on your browser)

6 - Start Spark on ClouderaManager Webinterface (on your browser)

7 - Put the kmeans_data.txt into HDFS. Run:

$ hadoop fs -put kmeans_data.txt

8 - Run the Spark job kmeans.py locally with 2 threads:

$ spark-submit --master local[2] kmeans.py

7 - Get the result from HDFS, and put it in your current directory:

$ hadoop fs -get KMeansModel/*

8 - The result will be stored in parquet. Read the result with parquet-tools:

$ parquet-tools cat KMeansModel/data/part-r-000..

Here is an example output of what this command should give:

Small note: While running these steps,  errors might appear in some part of the process due to initialization timing issues. I know that is a annoying advice, but if that happens just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside kmeans.py to point it to your data, and also maybe change where the output will be written (target/org/apache/spark/PythonKMeansExample/KMeansModel).


  1. After reading this blog i very strong in this topics and this blog really helpful to all... explanation are very clear so very easy to understand... thanks a lot for sharing this blog

    hadoop training institute in tambaram | big data training institute in tambaram | hadoop training in chennai tambaram | big data training in chennai tambaram

  2. Very interesting blog post.Quite informative and very helpful.This indeed is one of the recommended blog for learners.Thank you for providing such nice piece of article. I'm glad to leave a comment. Expect more articles in future. You too can check this Python tutorial for updated knowledge on Python.https://www.youtube.com/watch?v=HcsvDObzW2U