Thursday, April 14, 2011

Running hadoop in Windows (Pseudo-Distributed Mode):

In the previous post I explained how to start Hadoop as a standalone service on Windows. In this post I will explain how to start Hadoop in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process on localhost.

Configuration:
Add the configuration below to start the services required to mimic a distributed setup.

conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>


conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
We need to set up passphraseless ssh authentication to connect to localhost. Try $ ssh localhost; it should connect to your local machine without prompting for a password/passphrase. If it does not, read on to set up the OpenSSH server/client and password-less authentication.

The OpenSSH client/server programs can be installed as part of the Cygwin installation by selecting the "openssh" package.
1) Open a Cygwin console and run "ssh-host-config -y". This generates the configuration files required to start the ssh server, sets up a local Windows user account, and creates a Windows service (sshd).
2) The ssh service can now be started either with the Cygwin command (cygrunsrv -S sshd) or the standard Windows command (net start sshd).


Note: Sometimes the service will not start; rebooting the system may help.
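
To confirm that the service was created and is actually running, you can query it from the Cygwin console (just a quick sanity check, not part of the official setup):

$ cygrunsrv -Q sshd
$ cygrunsrv -L        # lists all installed Cygwin services if the name differs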

Run the following commands in a Cygwin console to set up public key authentication for local ssh connections:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
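
If ssh still prompts for a password after this, it is often a file-permission issue. A minimal sketch of the usual fix under Cygwin, assuming the default ~/.ssh location:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost        # should now log in without a password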


Execution of Hadoop in pseudo-distributed mode:

Format a new distributed-filesystem:
$ bin/hadoop namenode -format

Start the Hadoop daemons (this starts the NameNode and the JobTracker, along with the other daemons):
$ bin/start-all.sh
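
You can verify that the daemons came up by listing the running Java processes with the JDK's jps tool (assuming the JDK's bin directory is on your PATH):

$ jps
(expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker in the output)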

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Browse the web interfaces for the NameNode and the JobTracker; by default they are available at:
  • NameNode - http://localhost:50070/
  • JobTracker - http://localhost:50030/
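
To confirm that HDFS and MapReduce are really working, you can run the grep example that ships with Hadoop, as in the Apache quick-start (the examples jar file name varies between releases, so adjust the wildcard to match your version):

$ bin/hadoop fs -put conf input
$ bin/hadoop jar hadoop-*examples*.jar grep input output 'dfs[a-z.]+'
$ bin/hadoop fs -cat output/*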

Stop the daemons with:
$ bin/stop-all.sh

Sounds simple, doesn't it?
However, setting up public key authentication troubled me a lot due to issues with my Cygwin installation. I deleted my previous installation files (which did not include the ssh server), re-installed Cygwin along with the OpenSSH client/server programs, and followed the steps above to start the ssh server and set up public key authentication.

When I tried to connect to localhost via ssh I got a password prompt, which should not happen. I deleted all the key pairs and followed the same steps again to set up the sshd service and public key authentication. This time it seemed the correct keys were picked up by the ssh program, but the connection was closed by sshd with the error message below (Connection closed by ::1).
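
When debugging this kind of failure, running the client in verbose mode and looking at the server log usually shows which key or permission is being rejected (the log path assumes sshd was installed via cygrunsrv, which writes its output to /var/log/sshd.log by default):

$ ssh -vvv localhost
$ tail /var/log/sshd.log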

I decided on a clean removal and re-installation of Cygwin without any customization.

Follow the steps below to remove Cygwin cleanly:

1) Remove the sshd service configuration (cygrunsrv -E sshd and cygrunsrv -R sshd)
2) Delete the installation folder (default --> c:\cygwin) and all its sub-folders
3) Remove the CYGWIN environment variable and any Cygwin entries in the PATH variable, if defined
4) Remove the following entries completely from the registry (regedit, or see the sketch after this list):
  • HKEY_CURRENT_USER/Software/Cygnus Solutions
  • HKEY_CURRENT_USER/Software/Cygwin
  • HKEY_LOCAL_MACHINE/Software/Cygnus Solutions
  • HKEY_LOCAL_MACHINE/Software/Cygwin
5) Remove the local user/group sshd (compmgmt.msc)
6) Search for all public/private key files (id_rsa/id_dsa/id_dsa.pub/id_rsa.pub/authorized_keys) and delete them
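
If you prefer the command line to regedit, the same keys can be removed with reg.exe from an elevated prompt (a sketch assuming the default key paths; keys that were never created are simply reported as not found):

$ reg delete "HKCU\Software\Cygnus Solutions" /f
$ reg delete "HKCU\Software\Cygwin" /f
$ reg delete "HKLM\SOFTWARE\Cygnus Solutions" /f
$ reg delete "HKLM\SOFTWARE\Cygwin" /f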

Re-install Cygwin in the default path, make sure to select the openssh package as part of the installation, and set up the ssh server and password-less authentication as described above. This time public key authentication worked for me, and it should work for you as well.

With the important step completed, I happily executed the command to start the Hadoop daemons. Unfortunately I ended up with the error below:

localhost: Error: JAVA_HOME is not set.


I am not sure why this error occurred even after defining the JAVA_HOME environment variable; it did not happen when I started Hadoop in standalone mode. I simply followed the Apache documentation to fix it: edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
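
In my case that meant uncommenting the JAVA_HOME line in conf/hadoop-env.sh and pointing it at the JDK, using a Cygwin-style path without spaces (the path below is only an example; substitute wherever your JDK is installed):

# conf/hadoop-env.sh
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0_24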


Finally, I was able to start Hadoop in pseudo-distributed mode on a Windows system and to access the NameNode and JobTracker via their web URLs.

Reference:
1) Apache Hadoop documentation
2) Setup Hadoop
