This overview will cover the basic tarball setup for your Mac.
If you’re an engineer building applications on CDH and becoming familiar with all the rich features for designing the next big solution, a native Mac OSX install becomes essential. Sure, you may argue that your MBP, with its four-core hyper-threaded i7, SSD, and 16GB of DDR3 memory, is sufficient for spinning up a VM, and in most instances (such as using a VM for a quick demo) you’re right. However, when experimenting with a slightly heavier, more resource-intensive workload, you’ll want to explore a native install.
In this post, I will cover the setup of a few basic dependencies and the necessities to run HDFS, MapReduce with YARN, Apache ZooKeeper, and Apache HBase. It should be used as a guideline for getting your local CDH box set up, with the objective of enabling you to build and run applications on the Apache Hadoop stack.
Note: This process is not supported, so you should be comfortable operating as a self-supporting sysadmin. With that in mind, the configurations suggested throughout this guideline are for your default bash shell environment and can be set in your ~/.profile.
Dependencies
Install the Java version that is supported for the CDH version you are installing. In my case, for CDH 5.1, I’ve installed JDK 1.7u67. Historically the JDK for Mac OSX was only available from Apple, but since JDK 1.7 it’s available directly through Oracle’s Java downloads. Download the .dmg (in the example below, jdk-7u67-macosx-x64.dmg) and install it.
Verify and configure the installation:
Old Java path: /System/Library/Frameworks/JavaVM.framework/Home
New Java path: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home"
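To double-check which JDK is active, OSX’s built-in java_home helper and the java binary are handy (the version strings below assume the u67 install from this example):
/usr/libexec/java_home -v 1.7    # prints the home of the installed JDK 1.7
java -version                    # should report 1.7.0_67 once ${JAVA_HOME}/bin is on your PATH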
Note: You’ll notice that after installing the Oracle JDK, the original path used to manage versioning, /System/Library/Frameworks/JavaVM.framework/Versions, will not be updated; you now have control to manage your Java versions independently.
Enable ssh on your Mac by turning on Remote Login. You can find this option under the Apple menu > System Preferences > Sharing.
- Check the box for Remote Login to enable the service.
- Allow access for: “Only these users: Administrators”
Note: In this same window, you can modify your computer’s hostname.
![macos-f1]()
Enable password-less ssh login to localhost for MRv1 and HBase.
- Open your terminal.
- Generate an rsa or dsa key.
ssh-keygen -t rsa -P ""
- Continue through the key generator prompts (use default options).
- Test:
ssh localhost
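If ssh localhost still prompts for a password, you most likely need to append the new public key to your authorized keys (a standard step, assuming the default rsa key path from above):
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys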
Homebrew
Another toolkit I admire is Homebrew, a package manager for OSX. While the Xcode developer command-line tools are great, Homebrew’s savvy naming conventions and ease of use get the job done in a fun way.
I haven’t needed Homebrew for much more than installing the dependencies required for building native Snappy libraries for Mac OSX and for an easy install of MySQL for Hive. Snappy is commonly used within HBase, HDFS, and MapReduce for compression and decompression.
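For reference, the corresponding Homebrew formulas are each a one-line install (a sketch, assuming Homebrew itself is already installed from brew.sh):
brew install snappy    # native Snappy compression library
brew install mysql     # optional: MySQL as a backing metastore for Hive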
CDH
Finally, the easy part: The CDH tarballs are very nicely packaged and easily downloadable from Cloudera’s repository. I’ve downloaded tarballs for CDH 5.1.0.
Download and explode the tarballs into a lib directory where you can manage the latest versions with a simple symlink, as shown below. Do not use Mac OSX’s Finder “Make Alias” feature (a Finder alias is not a symlink); instead use the command-line ln -s command, such as ln -s source_file target_file.
/Users/jordanh/cloudera/
  cdh5.1/
    hadoop -> /Users/jordanh/cloudera/lib/hadoop-2.3.0-cdh5.1.0
    hbase -> /Users/jordanh/cloudera/lib/hbase-0.98.1-cdh5.1.0
    hive -> /Users/jordanh/cloudera/lib/hive-0.12.0-cdh5.1.0
    zookeeper -> /Users/jordanh/cloudera/lib/zookeeper-3.4.5-cdh4.7.0
  ops/
    dn/
    logs/hadoop, logs/hbase, logs/yarn
    nn/
    pids/
    tmp/
    zk/
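To actually create the ops subdirectories up front, a single mkdir will do it (a sketch assuming the layout above; adjust the base path for your own username):
mkdir -p /Users/jordanh/cloudera/ops/{dn,nn,pids,tmp,zk,logs/hadoop,logs/hbase,logs/yarn}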
You’ll notice above that you’ve created a handful of directories under a folder named ops. You’ll use them later to customize the configuration of the essential components for running Hadoop. Set your environment properties according to the paths where you’ve exploded your tarballs.
~/.profile
CDH="cdh5.1"
export HADOOP_HOME="/Users/jordanh/cloudera/${CDH}/hadoop"
export HBASE_HOME="/Users/jordanh/cloudera/${CDH}/hbase"
export HIVE_HOME="/Users/jordanh/cloudera/${CDH}/hive"
export HCAT_HOME="/Users/jordanh/cloudera/${CDH}/hive/hcatalog"
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${ZK_HOME}/bin:${HBASE_HOME}/bin:${HIVE_HOME}/bin:${HCAT_HOME}/bin:${M2_HOME}/bin:${ANT_HOME}/bin:${PATH}
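With ~/.profile updated, a quick sanity check confirms that the symlinked tarballs resolve on your PATH (version strings assume the CDH 5.1.0 tarballs above):
source ~/.profile
hadoop version    # should report 2.3.0-cdh5.1.0
hbase version     # should report 0.98.1-cdh5.1.0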
Update your main Hadoop configuration files under $HADOOP_HOME/etc/hadoop, as shown in the sample files below; the first block belongs in core-site.xml and the second in hdfs-site.xml. You can also download all files referenced in this post directly from here.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/jordanh/cloudera/ops/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
<description>A comma-separated list of the compression codec classes that can
be used for compression/decompression. In addition to any classes specified
with this property (which take precedence), codec classes on the classpath
are discovered using a Java ServiceLoader.</description>
</property>
</configuration>
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/jordanh/cloudera/ops/nn</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/jordanh/cloudera/ops/dn/</value>
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored.
</description>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>localhost:50075</value>
<description>
The datanode http server address and port.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
The YARN and MRv2 configuration and setup below are adapted from the CDH 5 installation docs. I will not digress into the specifics of each property or the orchestration and details of how YARN and MRv2 operate, but there’s some great information that my colleague Sandy has already shared for developers and admins.
Be sure to make the necessary adjustments per your system’s memory and CPU constraints. Per the image below, it is easy to see how these parameters will affect your machine’s performance when you execute jobs.
![macos-f2]()
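For example, with the sample values used below, a NodeManager advertising 8192 MB and 4 vcores can host at most 8192 / 1024 = 8 containers at the 1024 MB minimum allocation, only 8192 / 2048 = 4 containers at the 2048 MB maximum, and never more than 4 concurrent containers by CPU, since each container needs at least 1 vcore.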
Next, edit the following files as shown: yarn-site.xml and mapred-site.xml (also under $HADOOP_HOME/etc/hadoop), followed by the log and pid directory exports, which belong in hadoop-env.sh and yarn-env.sh.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>the valid service name should only contain a-zA-Z0-9_ and can not start with numbers</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Whether to enable log aggregation</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://localhost:8020/tmp/yarn-logs</value>
<description>Where to aggregate logs to.</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
<description>Amount of physical memory, in MB, that can be allocated
for containers.</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of CPU cores that can be allocated
for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>The minimum allocation for every container request at the RM,
in MBs. Memory requests lower than this won't take effect,
and the specified value will get allocated at minimum.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
<description>The maximum allocation for every container request at the RM,
in MBs. Memory requests higher than this won't take effect,
and will get capped to this value.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM,
in terms of virtual CPU cores. Requests lower than this won't take effect,
and the specified value will get allocated the minimum.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>2</value>
<description>The maximum allocation for every container request at the RM,
in terms of virtual CPU cores. Requests higher than this won't take effect,
and will get capped to this value.</description>
</property>
</configuration>
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/tmp/job-history/</value>
<description></description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs.
Can be one of local, classic or yarn.
</description>
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>1</value>
<description>
The number of virtual cores required for each map task.
</description>
</property>
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>1</value>
<description>
The number of virtual cores required for each reduce task.
</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
<description>Larger resource limit for maps.</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
<description>Larger resource limit for reduces.</description>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx768m</value>
<description>Heap-size for child jvms of maps.</description>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx768m</value>
<description>Heap-size for child jvms of reduces.</description>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
<description>The amount of memory the MR AppMaster needs.</description>
</property>
</configuration>
# Where log files are stored. $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR="/Users/jordanh/cloudera/ops/logs/hadoop"
export YARN_LOG_DIR="/Users/jordanh/cloudera/ops/logs/yarn"
# The directory where pid files are stored when processes run as daemons. /tmp by default.
export HADOOP_PID_DIR="/Users/jordanh/cloudera/ops/pids"
export YARN_PID_DIR=${HADOOP_PID_DIR}
You can configure HBase to run without separately downloading Apache ZooKeeper; it ships with a bundled ZooKeeper that you can run either as a separate instance or in standalone mode within a single JVM. For ease of use, configuration, and management, I recommend using this bundled ZooKeeper in either distributed or standalone mode rather than a separately downloaded ZooKeeper tarball on your machine.
The primary configuration difference between running HBase in distributed versus standalone mode is the hbase.cluster.distributed property in hbase-site.xml. Set the property to false to launch HBase in standalone mode, or to true to spin up separate instances for services such as HBase’s ZooKeeper and RegionServer. Update the following HBase configurations as specified to run it in the distributed configuration.
Note regarding hbase-site.xml: the hbase.cluster.distributed property is set to false by default, which launches HBase in standalone mode. Also, hbase.zookeeper.quorum is set to localhost by default and does not need to be overridden in our scenario.
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false for standalone mode and true for distributed mode. If
false, startup will run all HBase and ZooKeeper daemons together
in the one JVM.
</description>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/Users/jordanh/cloudera/ops/tmp/hbase-${user.name}</value>
<description>Temporary directory on the local filesystem.
Change this setting to point to a location more permanent
than '/tmp' (The '/tmp' directory is often cleared on
machine restart).
</description>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/jordanh/cloudera/ops/zk</value>
<description>Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
</description>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
<description>The directory shared by region servers and into
which HBase persists. The URL should be 'fully-qualified'
to include the filesystem scheme. For example, to specify the
HDFS directory '/hbase' where the HDFS instance's namenode is
running at namenode.example.org on port 9000, set this value to:
hdfs://namenode.example.org:9000/hbase. By default HBase writes
into /tmp. Change this configuration else all data will be lost
on machine restart.
</description>
</property>
</configuration>
Note regarding $HBASE_HOME/conf/hbase-env.sh: HBASE_MANAGES_ZK is set to true by default and is listed below only to make the setting explicit.
# Where log files are stored. $HBASE_HOME/logs by default.
export HBASE_LOG_DIR="/Users/jordanh/cloudera/ops/logs/hbase"
# The directory where pid files are stored. /tmp by default.
export HBASE_PID_DIR="/Users/jordanh/cloudera/ops/pids"
# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true
Pulling it All Together
By now, you should have accomplished setting up HDFS, YARN, and HBase. Hadoop setup and configuration is quite tedious, to say nothing of managing it over time (hence Cloudera Manager, which is unfortunately not available for Macs).
These are the bare essentials for getting your local machine ready for running MapReduce jobs and building applications on HBase. In the next few steps, we will start/stop the services and provide examples to ensure each service is operating correctly. The steps are listed in the order required for initialization so that dependencies are respected; reverse the order when halting the services.
Service HDFS
NameNode
format: hdfs namenode -format
start: hdfs namenode
stop: Ctrl-C
url: http://localhost:50070/dfshealth.html
DataNode
start: hdfs datanode
stop: Ctrl-C
url: http://localhost:50075/browseDirectory.jsp?dir=%2F&nnaddr=127.0.0.1:8020
Test
hadoop fs -mkdir /tmp
hadoop fs -put /path/to/local/file.txt /tmp/
hadoop fs -cat /tmp/file.txt
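If you’d rather not keep a terminal tab open per daemon, one option is the sbin scripts bundled with the Hadoop tarball, which run the same processes in the background using the HADOOP_PID_DIR and HADOOP_LOG_DIR set earlier:
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
# later, to shut them down:
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop namenode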
Service YARN
ResourceManager
start: yarn resourcemanager
stop: Ctrl-C
url: http://localhost:8088/cluster
NodeManager
start: yarn nodemanager
stop: Ctrl-C
url: http://localhost:8042/node
MapReduce Job History Server
start: mapred historyserver (or mr-jobhistory-daemon.sh start historyserver to run it as a background daemon)
stop: Ctrl-C (or mr-jobhistory-daemon.sh stop historyserver)
url: http://localhost:19888/jobhistory/app
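The same pattern works for the YARN daemons via their sbin counterparts, which share the pid and log directories configured above:
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
# and to stop them:
yarn-daemon.sh stop nodemanager
yarn-daemon.sh stop resourcemanager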
Test Vanilla YARN Application
hadoop jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar -appname DistributedShell -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar -shell_command "ps wwaxr -o pid,stat,%cpu,time,command | head -10" -num_containers 2 -master_memory 1024
Test MRv2 YARN TestDFSIO
hadoop org.apache.hadoop.fs.TestDFSIO -write -nrFiles 5 -size 1GB
hadoop org.apache.hadoop.fs.TestDFSIO -read -nrFiles 5 -size 1GB
Test MRv2 YARN Terasort/Teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar teragen 100000000 /tmp/eval/teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar terasort /tmp/eval/teragen /tmp/eval/terasort
Test MRv2 YARN Pi
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar pi 100 100
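While any of these jobs run, you can track them from the command line as well as from the ResourceManager UI; the logs command works here because log aggregation was enabled in yarn-site.xml (substitute a real ID from the list output for the placeholder):
yarn application -list
yarn logs -applicationId <application_id>    # available once the application finishes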
Service HBase
HBase Master/RegionServer/ZooKeeper
start: start-hbase.sh
stop: stop-hbase.sh
logs: /Users/jordanh/cloudera/ops/logs/hbase/
url: http://localhost:60010/master-status
Test
hbase shell
create 'URL_HITS', {NAME=>'HOURLY'},{NAME=>'DAILY'},{NAME=>'YEARLY'}
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090110', '10'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090111', '5'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090112', '30'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090113', '80'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090114', '7'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'DAILY:20140901', '10012'
put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'YEARLY:2014', '93310101'
scan 'URL_HITS'
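Once the scan looks right, a couple of additional commands are handy for spot checks (standard HBase shell operations, run from the same shell session):
get 'URL_HITS', 'com.cloudera.blog.osx.localinstall'
count 'URL_HITS'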
Kite SDK Test
Get familiar with the Kite SDK by trying out this example, which loads data into HDFS and then HBase. Note that there are a few common issues on OSX that may surface when running through the Kite SDK example. They can be easily resolved with the additional setup/config specified below.
Problem: NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException
Resolution: Fix your classpath by making sure to set HIVE_HOME and HCAT_HOME in your environment.
export HIVE_HOME="/Users/jordanh/cloudera/${CDH}/hive"
export HCAT_HOME="/Users/jordanh/cloudera/${CDH}/hive/hcatalog"
Problem: InvocationTargetException
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
Resolution: Snappy libraries are not compiled for Mac OSX out of the box. A Snappy Java port was introduced in CDH 5 and will likely need to be recompiled on your machine.
git clone https://github.com/xerial/snappy-java.git
cd snappy-java
make
cp target/snappy-java-1.1.1.3.jar $HADOOP_HOME/share/hadoop/common/lib/snappy-java-1.1.1.3.jar
Landing Page
Creating a landing page will help consolidate all the HTTP addresses of the services that you’re running. Please note that localhost can be replaced with your local hostname (such as jakuza-mbp.local).
Service Apache HTTPD
start: sudo -s launchctl load -w /System/Library/LaunchDaemons/org.apache.httpd.plist
stop: sudo -s launchctl unload -w /System/Library/LaunchDaemons/org.apache.httpd.plist
logs: /var/log/apache2/
url: http://localhost/index.html
Create index.html by editing /Library/WebServer/Documents/index.html (a sample is available for download here).
It will look something like this:
![macos-f3]()
Conclusion
With this guide, you should have a locally running Hadoop cluster with HDFS, MapReduce, and HBase. These are the core components of Hadoop, and a good initial foundation for building and prototyping your applications locally.
I hope this will be a good starting point on your dev box to try out more ways to build your products, whether they are data pipelines, analytics, machine learning, search and exploration, or more, on the Hadoop stack.
Jordan Hambleton is a Solutions Architect at Cloudera.