
How-to: Install CDH on Mac OSX 10.9 Mavericks


This overview will cover the basic tarball setup for your Mac.

If you’re an engineer building applications on CDH and becoming familiar with all its rich features for designing the next big solution, a native Mac OSX install becomes essential. Sure, you may argue that your MBP, with its four-core hyper-threaded i7, SSD, and 16GB of DDR3 memory, is sufficient for spinning up a VM, and in most instances, such as using a VM for a quick demo, you’re right. However, when experimenting with a slightly heavier, more resource-intensive workload, you’ll want to explore a native install.

In this post, I will cover setup of a few basic dependencies and the necessities to run HDFS, MapReduce with YARN, Apache ZooKeeper, and Apache HBase. Use it as a guideline for getting your local CDH environment set up so that you can build and run applications on the Apache Hadoop stack.

Note: This process is not supported, so you should be comfortable acting as a self-supporting sysadmin. With that in mind, the environment settings suggested throughout this guide assume the default bash shell and can be set in your ~/.profile.

Dependencies

Install the Java version that is supported for the CDH version you are installing. In my case for CDH 5.1, I’ve installed JDK 1.7 u67. Historically the JDK for Mac OSX was only available from Apple, but since JDK 1.7, it’s available directly through Oracle’s Java downloads. Download the .dmg (in the example below, jdk-7u67-macosx-x64.dmg) and install it.

Verify and configure the installation:

Old Java path: /System/Library/Frameworks/JavaVM.framework/Home
New Java path: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home

export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home"

Note: You’ll notice that after installing the Oracle JDK, the original path used for version management, /System/Library/Frameworks/JavaVM.framework/Versions, is not updated; you now control your Java versions independently.
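If you’d rather not hard-code the JDK path, OSX’s java_home helper can resolve it for you. A quick sketch of checking that the new JDK is active (the update number below reflects my install; yours may differ):

# Confirm the Oracle JDK is the one in use (should report 1.7.0_67 here)
java -version

# Optionally resolve JAVA_HOME dynamically instead of hard-coding the path
export JAVA_HOME="$(/usr/libexec/java_home -v 1.7)"
echo $JAVA_HOME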

Enable ssh on your Mac by turning on remote login. You can find this option under your toolbar’s Apple icon > System Preferences > Sharing.

  1. Check the box for Remote Login to enable the service. 
  2. Allow access for: “Only these users: Administrators”

    Note: In this same window, you can modify your computer’s hostname.

Enable password-less ssh login to localhost for MRv1 and HBase. 

  1. Open your terminal.
  2. Generate an rsa or dsa key.
    1. ssh-keygen -t rsa -P ""
    2. Continue through the key generator prompts (use the default options).
    3. Authorize the new key for localhost logins: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  3. Test: ssh localhost

Homebrew

Another toolkit I admire is Homebrew, a package manager for OSX. While Xcode developer command-line tools are great, the savvy naming conventions and ease of use of Homebrew get the job done in a fun way. 

I haven’t needed Homebrew for much beyond installing the dependencies required to build native Snappy libraries for Mac OSX, and for an easy MySQL install for Hive. Snappy is commonly used within HBase, HDFS, and MapReduce for compression and decompression.
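For reference, a minimal sketch of those two installs. This assumes Homebrew itself is already installed and that the snappy and mysql formula names haven’t changed:

# Native Snappy libraries, used later when rebuilding snappy-java for OSX
brew install snappy

# MySQL, handy as a backing database for the Hive metastore
brew install mysql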

CDH

Finally, the easy part: The CDH tarballs are very nicely packaged and easily downloadable from Cloudera’s repository. I’ve downloaded tarballs for CDH 5.1.0.

Download the tarballs and explode them into a lib directory, where you can point to the latest versions with simple symlinks as shown below (a sketch of the commands follows the layout). Although Mac OSX’s “Make Alias” feature is bi-directional, do not use it; instead, use the command-line ln -s command, as in ln -s source_file target_file.

  • /Users/jordanh/cloudera/
    • cdh5.1/
      • hadoop -> /Users/jordanh/cloudera/lib/hadoop-2.3.0-cdh5.1.0
      • hbase -> /Users/jordanh/cloudera/lib/hbase-0.98.1-cdh5.1.0
      • hive -> /Users/jordanh/cloudera/lib/hive-0.12.0-cdh5.1.0
      • zookeeper -> /Users/jordanh/cloudera/lib/zookeeper-3.4.5-cdh5.1.0
    • lib/
    • ops/
      • dn/
      • logs/hadoop, logs/hbase, logs/yarn
      • nn/
      • pids/
      • tmp/
      • zk/
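Here’s a rough sketch of the commands behind that layout; the paths mirror my home directory above, and the tarball names should match whatever you actually downloaded:

# Create the lib, versioned, and ops directories
mkdir -p ~/cloudera/lib ~/cloudera/cdh5.1
mkdir -p ~/cloudera/ops/{dn,nn,pids,tmp,zk,logs/hadoop,logs/hbase,logs/yarn}

# Explode each tarball into lib/ and point a version-agnostic symlink at it
cd ~/cloudera/lib
tar xzf ~/Downloads/hadoop-2.3.0-cdh5.1.0.tar.gz
ln -s ~/cloudera/lib/hadoop-2.3.0-cdh5.1.0 ~/cloudera/cdh5.1/hadoop
# ...repeat the tar/ln -s pattern for the hbase, hive, and zookeeper tarballs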

You’ll notice above that you’ve created a handful of directories under a folder named ops. You’ll use them later to customize the configuration of the essential components for running Hadoop. Set your environment properties according to the paths where you’ve exploded your tarballs. 

~/.profile

CDH="cdh5.1"
export HADOOP_HOME="/Users/jordanh/cloudera/${CDH}/hadoop"
export HBASE_HOME="/Users/jordanh/cloudera/${CDH}/hbase"
export HIVE_HOME="/Users/jordanh/cloudera/${CDH}/hive"
export HCAT_HOME="/Users/jordanh/cloudera/${CDH}/hive/hcatalog"
# ZK_HOME points at the zookeeper symlink created above; M2_HOME and ANT_HOME
# should point at your local Maven and Ant installs, if you use them.
export ZK_HOME="/Users/jordanh/cloudera/${CDH}/zookeeper"

export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${ZK_HOME}/bin:${HBASE_HOME}/bin:${HIVE_HOME}/bin:${HCAT_HOME}/bin:${M2_HOME}/bin:${ANT_HOME}/bin:${PATH}
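After saving ~/.profile, reload it and confirm the binaries resolve from the new PATH; the versions reported should match the CDH 5.1.0 tarballs exploded above:

source ~/.profile
hadoop version    # expect Hadoop 2.3.0-cdh5.1.0
hbase version     # expect HBase 0.98.1-cdh5.1.0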

Update your main Hadoop configuration files, as shown in the samples below; both files live under $HADOOP_HOME/etc/hadoop. You can also download all files referenced in this post directly from here.

core-site.xml

<configuration>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
  <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/Users/jordanh/cloudera/ops/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>A comma-separated list of the compression codec classes that can
    be used for compression/decompression. In addition to any classes specified
    with this property (which take precedence), codec classes on the classpath
    are discovered using a Java ServiceLoader.</description>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/Users/jordanh/cloudera/ops/nn</value>
  <description>Determines where on the local filesystem the DFS name node
    should store the name table(fsimage).  If this is a comma-delimited list
    of directories then the name table is replicated in all of the
    directories, for redundancy. </description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/Users/jordanh/cloudera/ops/dn/</value>
  <description>Determines where on the local filesystem an DFS data node
    should store its blocks.  If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
  </description>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>localhost:50075</value>
  <description>
    The datanode http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
  </description>
</property>
</configuration>
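As a quick sanity check that Hadoop is picking up these files, hdfs getconf can echo the effective values back (run after setting PATH as above):

hdfs getconf -confKey fs.defaultFS       # expect hdfs://localhost:8020
hdfs getconf -confKey dfs.replication    # expect 1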

The YARN and MRv2 configuration and setup below are adapted from the CDH 5 installation docs. I won’t digress into the specifics of each property or the details of how YARN and MRv2 operate, but there’s some great information that my colleague Sandy has already shared for developers and admins.

Be sure to make the necessary adjustments for your system’s memory and CPU constraints; these parameters directly determine how many containers can run concurrently, and therefore how your machine performs when you execute jobs. For example, with the settings below the NodeManager offers 8,192MB and four vcores to containers, and each map or reduce task requests 1,024MB (with a 768MB heap) and one vcore, so a job tops out at roughly eight concurrent 1GB containers (or four, if the scheduler also enforces the vcore limit), one of which is the job’s ApplicationMaster.

Next, edit yarn-site.xml and mapred-site.xml (also under $HADOOP_HOME/etc/hadoop) as shown.

yarn-site.xml

<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>the valid service name should only contain a-zA-Z0-9_ and can not start with numbers</description>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
  <description>Whether to enable log aggregation</description>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://localhost:8020/tmp/yarn-logs</value>
  <description>Where to aggregate logs to.</description>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
  <description>Amount of physical memory, in MB, that can be allocated
    for containers.</description>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
  <description>Number of CPU cores that can be allocated
    for containers.</description>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
  <description>The minimum allocation for every container request at the RM,
    in MBs. Memory requests lower than this won't take effect,
    and the specified value will get allocated at minimum.</description>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
  <description>The maximum allocation for every container request at the RM,
    in MBs. Memory requests higher than this won't take effect,
    and will get capped to this value.</description>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
  <description>The minimum allocation for every container request at the RM,
    in terms of virtual CPU cores. Requests lower than this won't take effect,
    and the specified value will get allocated the minimum.</description>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>2</value>
  <description>The maximum allocation for every container request at the RM,
    in terms of virtual CPU cores. Requests higher than this won't take effect,
    and will get capped to this value.</description>
</property>
</configuration>

mapred-site.xml

<configuration>
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>localhost:8021</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/tmp/job-history/</value>
  <description></description>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.
  Can be one of local, classic or yarn.
  </description>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
  <description>
      The number of virtual cores required for each map task.
  </description>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
  <description>
      The number of virtual cores required for each reduce task.
  </description>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
  <description>Larger resource limit for maps.</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
  <description>Larger resource limit for reduces.</description>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx768m</value>
  <description>Heap-size for child jvms of maps.</description>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx768m</value>
  <description>Heap-size for child jvms of reduces.</description>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
  <description>The amount of memory the MR AppMaster needs.</description>
</property>
</configuration>

 

hadoop-env.sh / yarn-env.sh (or simply export these in your ~/.profile)

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR="/Users/jordanh/cloudera/ops/logs/hadoop"
export YARN_LOG_DIR="/Users/jordanh/cloudera/ops/logs/yarn"

# The directory where pid files are stored when processes run as daemons. /tmp by default.
export HADOOP_PID_DIR="/Users/jordanh/cloudera/ops/pids"
export YARN_PID_DIR=${HADOOP_PID_DIR}

You can configure HBase to run without downloading Apache ZooKeeper separately: HBase bundles ZooKeeper and can either spin it up as a separate instance or run everything in a single JVM in standalone mode. For ease of use, configuration, and management, I recommend either of these over a separately downloaded ZooKeeper tarball on your machine.

The primary configuration difference between running HBase in distributed and standalone mode is the hbase.cluster.distributed property in hbase-site.xml. Set the property to false to launch HBase in standalone mode, or true to spin up separate instances for services such as HBase’s ZooKeeper and RegionServer. Update the following HBase configuration files as specified to run it this way.

Note regarding hbase-site.xml: hbase.cluster.distributed is false by default, which launches HBase in standalone mode. Also, hbase.zookeeper.quorum defaults to localhost and does not need to be overridden in our scenario.

<configuration>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false for standalone mode and true for distributed mode.  If
      false, startup will run all HBase and ZooKeeper daemons together
      in the one JVM.
    </description>
  </property>
<property>
    <name>hbase.tmp.dir</name>
    <value>/Users/jordanh/cloudera/ops/tmp/hbase-${user.name}</value>
    <description>Temporary directory on the local filesystem.
    Change this setting to point to a location more permanent
    than '/tmp' (The '/tmp' directory is often cleared on
    machine restart).
    </description>
  </property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/Users/jordanh/cloudera/ops/zk</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
    <description>The directory shared by region servers and into
    which HBase persists.  The URL should be 'fully-qualified'
    to include the filesystem scheme.  For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase.  By default HBase writes
    into /tmp.  Change this configuration else all data will be lost
    on machine restart.
    </description>
  </property>
</configuration>

Note regarding $HBASE_HOME/conf/hbase-env.sh: By default, HBASE_MANAGES_ZK is set to true; it is listed below only to make the setting explicit.

# Where log files are stored.  $HBASE_HOME/logs by default.
export HBASE_LOG_DIR="/Users/jordanh/cloudera/ops/logs/hbase"

# The directory where pid files are stored. /tmp by default.
export HBASE_PID_DIR="/Users/jordanh/cloudera/ops/pids"

# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true

Pulling it All Together

By now, you should have HDFS, YARN, and HBase set up. Hadoop setup and configuration is quite tedious, to say nothing of managing it over time (hence Cloudera Manager, which unfortunately is not available for Mac OSX).

These are the bare essentials for getting your local machine ready to run MapReduce jobs and build applications on HBase. In the next few steps, we will start and stop the services and run examples to confirm that each service is operating correctly. The steps are listed in the order required at startup to satisfy dependencies; reverse the order when halting the services.
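Once the daemons below are up, jps (which ships with the JDK) is a quick way to confirm they are all running; with the full stack started I’d expect process names along these lines:

jps
# Typical output (PIDs will differ): NameNode, DataNode, ResourceManager,
# NodeManager, JobHistoryServer, HMaster, HRegionServer, HQuorumPeer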

Service HDFS

NameNode

format:  hdfs namenode -format

start:  hdfs namenode

stop:  Ctrl-C

url:  http://localhost:50070/dfshealth.html

DataNode

start:  hdfs datanode

stop:  Ctrl-C

url:  http://localhost:50075/browseDirectory.jsp?dir=%2F&nnaddr=127.0.0.1:8020

Test

hadoop fs -mkdir /tmp

hadoop fs -put /path/to/local/file.txt /tmp/

hadoop fs -cat /tmp/file.txt
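If you’d rather not dedicate a terminal to each daemon, the sbin scripts in the tarball can run them in the background instead, with output going to the HADOOP_LOG_DIR set earlier; a minimal sketch:

# Start the HDFS daemons in the background
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode

# ...and stop them later
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop namenode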

Service YARN

ResourceManager

start:  yarn resourcemanager

stop:  Ctrl-C

url:  http://localhost:8088/cluster

NodeManager

start:  yarn nodemanager

stop:  Ctrl-C

url:  http://localhost:8042/node

MapReduce Job History Server

start:  mapred historyserver (or mr-jobhistory-daemon.sh start historyserver)

stop:  Ctrl-C (or mr-jobhistory-daemon.sh stop historyserver)

url:  http://localhost:19888/jobhistory/app
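With the ResourceManager and NodeManager up, the yarn CLI offers a quick health check before submitting anything:

yarn node -list           # should list one NodeManager in RUNNING state
yarn application -list    # empty until the test jobs below are submitted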

Test Vanilla YARN Application

hadoop jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar -appname DistributedShell -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar -shell_command "ps wwaxr -o pid,stat,%cpu,time,command | head -10" -num_containers 2 -master_memory 1024

Test MRv2 YARN TestDFSIO

hadoop org.apache.hadoop.fs.TestDFSIO -write -nrFiles 5 -size 1GB
hadoop org.apache.hadoop.fs.TestDFSIO -read -nrFiles 5 -size 1GB

Test MRv2 YARN Terasort/Teragen

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar teragen 100000000 /tmp/eval/teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar terasort /tmp/eval/teragen /tmp/eval/terasort

Test MRv2 YARN Pi

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar pi 100 100

Service HBase

HBase Master/RegionServer/ZooKeeper

start:  start-hbase.sh

stop:  stop-hbase.sh

logs:  /Users/jordanh/cloudera/ops/logs/hbase/

url:  http://localhost:60010/master-status

Test

hbase shell
create 'URL_HITS', {NAME=>'HOURLY'},{NAME=>'DAILY'},{NAME=>'YEARLY'}
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090110', '10'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090111', '5'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090112', '30'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090113', '80'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090114', '7'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'DAILY:20140901', '10012'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'YEARLY:2014', '93310101'

scan 'URL_HITS'
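Still in the same shell, a few more commands are handy for spot-checking a single cell and cleaning up the test table afterwards (the row key and column match the puts above):

get 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090113'
count 'URL_HITS'

disable 'URL_HITS'
drop 'URL_HITS'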

Kite SDK Test

Get familiar with the Kite SDK by trying out this example, which loads data into HDFS and then HBase. Note that a few common issues may surface on OSX when running through the Kite SDK example; they are easily resolved with the additional setup/configuration specified below.

Problem:  NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException

Resolution:  Fix your classpath by making sure to set HIVE_HOME and HCAT_HOME in your environment.

export HIVE_HOME="/Users/jordanh/cloudera/${CDH}/hive"
export HCAT_HOME="/Users/jordanh/cloudera/${CDH}/hive/hcatalog"

Problem:  InvocationTargetException Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path

Resolution:  Snappy libraries are not compiled for Mac OSX out of the box. The snappy-java port introduced in CDH 5 will likely need to be recompiled on your machine.

git clone https://github.com/xerial/snappy-java.git
cd snappy-java
make 

cp target/snappy-java-1.1.1.3.jar $HADOOP_HOME/share/hadoop/common/lib/asnappy-java-1.1.1.3.jar
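To exercise the Snappy codec end to end after the rebuild, one option is to rerun a small MapReduce job with Snappy-compressed map output; this sketch reuses the /tmp/file.txt uploaded during the HDFS test earlier, and the output path is simply my choice:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /tmp/file.txt /tmp/eval/wordcount-snappy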

Landing Page

Creating a landing page will help consolidate all the HTTP addresses of the services that you’re running. Please note that localhost can be replaced with your local hostname (such as jakuza-mbp.local).

Service Apache HTTPD

start: sudo -s launchctl load -w /System/Library/LaunchDaemons/org.apache.httpd.plist

stop: sudo -s launchctl unload -w /System/Library/LaunchDaemons/org.apache.httpd.plist

logs: /var/log/apache2/

url: http://localhost/index.html

Create index.html (edit /Library/WebServer/Documents/index.html, which you can download here).

It will look something like this:
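For instance, here is a minimal page written straight from the shell; the links assume the default ports used throughout this post, and the markup is only a sketch to adapt:

sudo tee /Library/WebServer/Documents/index.html > /dev/null <<'EOF'
<html>
  <body>
    <h1>Local CDH Services</h1>
    <ul>
      <li><a href="http://localhost:50070/dfshealth.html">HDFS NameNode</a></li>
      <li><a href="http://localhost:8088/cluster">YARN ResourceManager</a></li>
      <li><a href="http://localhost:19888/jobhistory/app">MapReduce Job History</a></li>
      <li><a href="http://localhost:60010/master-status">HBase Master</a></li>
    </ul>
  </body>
</html>
EOF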

 

Conclusion

With this guide, you should have a locally running Hadoop cluster with HDFS, MapReduce, and HBase. These are the core components of Hadoop and a good initial foundation for building and prototyping your applications locally.

I hope this will be a good starting point on your dev box to try out more ways to build your products, whether they are data pipelines, analytics, machine learning, search and exploration, or more, on the Hadoop stack. 

Jordan Hambleton is a Solutions Architect at Cloudera.

