Saturday, September 22, 2012

Running Hadoop on Ubuntu in Pseudo-Distributed Mode

I set up Hadoop on Ubuntu in pseudo-distributed mode today; the detailed steps follow:

Install Ubuntu 11.10 i386 in VirtualBox. On this release, OpenJDK 6 installs to <strong>/usr/lib/jvm/java-6-openjdk</strong> by default.
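If the JDK is not already on the system, it can be pulled from the Ubuntu repositories (the package name below is an assumption for 11.10; adjust if your release differs):
<pre lang="bash">
# Install OpenJDK 6 (assumed package name on Ubuntu 11.10)
sudo apt-get update
sudo apt-get install openjdk-6-jdk
# Verify the install location used later for JAVA_HOME
ls /usr/lib/jvm/
</pre>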

Add a dedicated Hadoop user account for running Hadoop
<pre lang="bash">sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop</pre>
Configure SSH for the hadoop user
<pre lang="bash">sudo apt-get install ssh
su - hadoop
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys</pre>
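Logging in to localhost should now work without a password prompt (accept the host key fingerprint on the first connection):
<pre lang="bash">
ssh localhost
exit
</pre>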
Download the latest stable release of Hadoop from Hadoop's homepage. I downloaded release 1.0.2 as a gzipped tar file (hadoop-1.0.2.tar.gz), then uncompressed it:
<pre lang="bash">tar zxvf hadoop-1.0.2.tar.gz
mv hadoop-1.0.2 hadoop</pre>
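Optionally, point an environment variable at the install location so the conf directory referenced below is easy to find (the /home/hadoop path is an assumption based on the layout above):
<pre lang="bash">
# Assumes the tarball was unpacked under /home/hadoop
echo 'export HADOOP_INSTALL=/home/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_INSTALL/hadoop/bin' >> ~/.bashrc
source ~/.bashrc
</pre>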
Configure Hadoop

The $HADOOP_INSTALL/hadoop/conf directory contains Hadoop's configuration files; the following files need to be edited.

hadoop-env.sh
<pre lang="bash">export JAVA_HOME=/usr/lib/jvm/java-6-openjdk</pre>
core-site.xml

<pre lang="xml">&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;!-- Put site-specific property overrides in this file. --&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.default.name&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:9000&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>

hdfs-site.xml

<pre lang="xml">&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;!-- Put site-specific property overrides in this file. --&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.replication&lt;/name&gt;
    &lt;value&gt;1&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>

mapred-site.xml

<pre lang="xml">&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;!-- Put site-specific property overrides in this file. --&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.job.tracker&lt;/name&gt;
    &lt;value&gt;localhost:9001&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
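conf/masters and conf/slaves can be left at their defaults for a single-node setup; a quick sanity check:
<pre lang="bash">
cat conf/masters   # should read: localhost
cat conf/slaves    # should read: localhost
</pre>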

Format the HDFS filesystem
<pre lang="bash">
bin/hadoop namenode -format
</pre>
Start your single-node cluster
<pre lang="bash">
bin/start-all.sh
</pre>
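To confirm the cluster is up, jps (shipped with the JDK) should show the five Hadoop daemons, and the web UIs should respond:
<pre lang="bash">
jps
# Expected: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker (plus Jps)
# NameNode web UI:   http://localhost:50070
# JobTracker web UI: http://localhost:50030
</pre>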
Run the WordCount example job
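The job reads test_wc.txt from /home/hadoop; any plain-text file will do, for example (file name and contents are just placeholders):
<pre lang="bash">
echo "hello hadoop hello world bye world" > /home/hadoop/test_wc.txt
</pre>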
<pre lang="bash">
bin/hadoop fs -copyFromLocal /home/hadoop/test_wc.txt test_wc.txt
bin/hadoop fs -ls
bin/hadoop jar hadoop-examples-1.0.2.jar wordcount test_wc.txt test_wc-output
bin/hadoop fs -cat test_wc-output/part-r-00000
bin/hadoop fs -copyToLocal test_wc-output /home/hadoop/test_wc-output
</pre>

Stop your single-node cluster
<pre lang="bash">
bin/stop-all.sh
</pre>
References:

Hadoop: The Definitive Guide
Running Hadoop On Ubuntu Linux (Single-Node Cluster)
