
Hello, Apache Hadoop 2.4.0


The community has voted to release Apache Hadoop 2.4.0.

Hadoop 2.4.0 includes myriad improvements to HDFS and MapReduce, including (but not limited to):

  • ACL Support in HDFS — which allows, among other things, easier access to Apache Sentry-managed data by components that use it (already shipping in CDH 5.0.0)
  • Native support for rolling upgrades in HDFS (equivalent functionality already shipping inside CDH 4.5.0 and later)
  • Usage of protocol-buffers for HDFS FSImage for smooth operational upgrades
  • Complete HTTPS support in HDFS
  • Automatic Failover for ResourceManager HA in YARN
  • Preview version of the YARN Timeline Server for storing and serving generic application history

Congratulations to everyone who contributed! See full release notes here.

Justin Kestelyn is Cloudera’s developer outreach director.


How-to: Run a Simple Apache Spark App in CDH 5


Getting started with Spark (now shipping inside CDH 5) is easy using this simple example.

Apache Spark is a general-purpose cluster computing framework that, like MapReduce in Apache Hadoop, offers powerful abstractions for processing large datasets. For various reasons pertaining to performance, functionality, and APIs, Spark is already becoming more popular than MapReduce for certain types of workloads. (For more background about Spark, read this post.)

In this how-to, you’ll learn how to write, compile, and run a simple Spark program written in Scala on CDH 5 (in which Spark ships and is supported by Cloudera). The full code for the example is hosted at https://github.com/sryza/simplesparkapp.

Writing

Our example app will be a souped-up version of WordCount, the classic MapReduce example. In our version of WordCount, the goal is to learn the distribution of letters in the most popular words in our corpus. That is, we want to:

  1. Read an input set of text documents
  2. Count the number of times each word appears
  3. Filter out all words that show up less than a million times
  4. For the remaining set, count the number of times each letter occurs

In MapReduce, this would require two MapReduce jobs, as well as persisting the intermediate data to HDFS in between them. In contrast, in Spark, you can write a single job in about 90 percent fewer lines of code.

Our input is a huge text file where each line contains all the words in a document, stripped of punctuation. The full Scala program looks like this:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val threshold = args(1).toInt

    // split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // filter out words with less than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    System.out.println(charCounts.collect().mkString(", "))
  }
}

 

Spark uses “lazy evaluation”, meaning that transformations don’t execute on the cluster until an “action” operation is invoked. Examples of action operations are collect, which pulls data to the client, and saveAsTextFile, which writes data to a filesystem like HDFS.
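To make the laziness concrete, here is a minimal sketch (the file paths are hypothetical): the transformations only build up a lineage of RDDs, and no cluster work happens until an action such as reduce or saveAsTextFile is called.

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalExample {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Lazy Eval Example"))

    // Transformations: these return new RDDs immediately and trigger no cluster work yet.
    val lines   = sc.textFile("hdfs:///somedirectory/inputfile.txt")
    val words   = lines.flatMap(_.split(" "))
    val lengths = words.map(_.length)

    // Actions: only now does Spark schedule tasks and actually read the input.
    val totalChars = lengths.reduce(_ + _)                    // returns a value to the client
    lengths.saveAsTextFile("hdfs:///somedirectory/lengths")   // writes results to HDFS

    println("Total characters: " + totalChars)
    sc.stop()
  }
}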

It’s worth noting that in Spark, the definition of “reduce” is slightly different from that in MapReduce. In MapReduce, a reduce function call accepts all the records corresponding to a given key. In Spark, the function passed to a reduce or reduceByKey call accepts just two arguments – so if it’s not associative and commutative, bad things will happen. A positive consequence is that Spark knows it can always apply a combiner. Based on that definition, the Spark equivalent of MapReduce’s reduce is more like a groupByKey followed by a map, as shown in the sketch below.
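Here is that equivalence side by side (variable names are only for illustration): both produce the same per-word counts, but reduceByKey lets Spark combine values map-side before the shuffle, while the groupByKey form ships every value across the network first.

// Assuming `tokenized` is an RDD[String] of words, as in the example above.
val pairs = tokenized.map((_, 1))

// reduceByKey: the two-argument function must be associative (and commutative),
// which is what lets Spark apply it as a combiner before the shuffle.
val counts1 = pairs.reduceByKey(_ + _)

// MapReduce-style "reduce": group all values for a key, then map over each group.
val counts2 = pairs.groupByKey().map { case (word, ones) => (word, ones.sum) }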

For those more comfortable with Java, here’s the same program using Spark’s Java API:

import java.util.ArrayList;
import java.util.Arrays;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.SparkConf;
import scala.Tuple2;

public class JavaWordCount {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count"));
    final int threshold = Integer.parseInt(args[1]);

    // split each document into words
    JavaRDD<String> tokenized = sc.textFile(args[0]).flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
          return Arrays.asList(s.split(" "));
        }
      }
    );

    // count the occurrence of each word
    JavaPairRDD<String, Integer> counts = tokenized.map(
      new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2(s, 1);
        }
      }
    ).reduceByKey(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      }
    );

    // filter out words with less than threshold occurrences
    JavaPairRDD<String, Integer> filtered = counts.filter(
      new Function<Tuple2<String, Integer>, Boolean>() {
        public Boolean call(Tuple2<String, Integer> tup) {
          return tup._2 >= threshold;
        }
      }
    );

    // count characters
    JavaPairRDD<Character, Integer> charCounts = filtered.flatMap(
      new FlatMapFunction<Tuple2<String, Integer>, Character>() {
        public Iterable<Character> call(Tuple2<String, Integer> s) {
          ArrayList<Character> chars = new ArrayList<Character>(s._1.length());
          for (char c : s._1.toCharArray()) {
            chars.add(c);
          }
          return chars;
        }
      }
    ).map(
      new PairFunction<Character, Character, Integer>() {
        public Tuple2<Character, Integer> call(Character c) {
          return new Tuple2(c, 1);
        }
      }
    ).reduceByKey(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      }
    );

    System.out.println(charCounts.collect());
  }
}

 

Because Java doesn’t support anonymous functions, the program is considerably more verbose, but it still requires a fraction of the code needed in an equivalent MapReduce program.

Compiling

We’ll use Maven to compile our program. Maven expects a specific directory layout that informs it where to look for source files. Our Scala code goes under src/main/scala, and our Java code goes under src/main/java. That is, we place SparkWordCount.scala in the src/main/scala/com/cloudera/sparkwordcount directory and JavaWordCount.java in the src/main/java/com/cloudera/sparkwordcount directory.

Maven also requires you to place a pom.xml file in the root of the project directory that tells it how to build the project. A few noteworthy excerpts are included below.

To compile Scala code, include:

<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
</plugin>

 

which requires adding the scala-tools plugin repository:

<pluginRepositories>
  <pluginRepository>
    <id>scala-tools.org</id>
    <name>Scala-tools Maven2 Repository</name>
    <url>http://scala-tools.org/repo-releases</url>
  </pluginRepository>
</pluginRepositories>

 

Then, include Spark and Scala as dependencies:

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.0-cdh5.0.0</version>
  </dependency>
</dependencies>

 

Finally, to generate our app jar, simply run:

mvn package

 

It will show up in the target directory as sparkwordcount-0.0.1-SNAPSHOT.jar.

Running

Running your Spark application is like running any Java program: you include the application jar and its dependencies on the classpath and specify the main class. In a CDH installation, the Spark and Hadoop jars are included on every node. Some will recommend packaging these dependencies inside your Spark application jar itself, but Cloudera recommends referencing the locally installed jars. Doing so ensures that the client uses the same versions of these jars as the server, and means you don’t need to recompile apps when you upgrade the cluster.

The following includes local Hadoop and Spark jars in the classpath and then runs the application. Before running, place the input file into a directory on HDFS. The repository supplies an example input file in its “data” directory.

Spark’s trunk contains a script called spark-submit that abstracts away the pain of building the classpath. Its inclusion in an upcoming release will make launching an application much easier.

source /etc/spark/conf/spark-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera

# system jars:
CLASSPATH=/etc/hadoop/conf
CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*:$HADOOP_HOME/../hadoop-mapreduce/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*:$HADOOP_HOME/../hadoop-yarn/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*:$HADOOP_HOME/../hadoop-hdfs/lib/*
CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*

# app jar:
CLASSPATH=$CLASSPATH:target/sparkwordcount-0.0.1-SNAPSHOT.jar

$JAVA_HOME/bin/java -cp $CLASSPATH -Dspark.master=local com.cloudera.sparkwordcount.SparkWordCount hdfs:///somedirectory/inputfile.txt 2
  

 

-Dspark.master specifies the cluster against which to run the application; local will run all tasks in the same local process. To run against a Spark standalone cluster instead, include a URL containing the master’s address (such as spark://masterhost:7077). To run against a YARN cluster, include yarn-client — Spark will determine the YARN ResourceManager’s address from the YARN configuration file.

The output of the program should look something like this:

(e,6), (f,1), (a,4), (t,2), (u,1), (r,2), (v,1), (b,1), (c,1), (h,1), (o,2), (l,1), (n,4), (p,2), (i,1)
  

Congratulations, you have just run a simple Spark application in CDH 5. Happy Sparking!

Sandy Ryza is an engineer on the data science team at Cloudera. He is an Apache Hadoop committer and recently led Cloudera’s Spark development.


Spark Summit 2014 is coming (June 30 – July 2)! Register here to get 20% off the regular conference price.

Cloudera Live: The Instant Apache Hadoop Experience


Get started with Apache Hadoop and use-case examples online in just seconds.

Today, we announced Cloudera Live, a new online service for developers and analysts (currently in public beta) that makes it easy to learn, explore, and try out CDH, Cloudera’s open source software distribution containing Apache Hadoop and related projects. No downloads, no installations, no waiting — just point-and-play!


Try Cloudera Live (Beta)

Cloudera Live is just that: a complete, live, CDH 5 cluster with a Hue interface (based on Hue 3.5.0, the latest and greatest). It includes pre-packaged examples/patterns for using Impala, Search, Apache HBase, and many other Hadoop ecosystem components. (Note: Cloudera Live is currently read-only, so loading data via the Apache Sqoop app isn’t possible. To explore CDH with ingested data, download our QuickStart VM.)

After spending some time with Cloudera Live (within a three-hour session), you may be wondering: How did we do it? As you’ll find from the answer below, the combination of Amazon Web Services (AWS) and Cloudera Manager made it easy.

Inside Cloudera Live

Cloudera Live is hosted on four AWS m3.large instances running Ubuntu 12.04, each with 100GB of storage. (If you ever build your own cluster on AWS for your own use and need less performance, a single xlarge instance will be enough — or you could install fewer services on an even smaller instance.)

We configured the security group to allow all traffic between the instances (the first rule — don’t forget that on multi-machine clusters!) and opened up the Cloudera Manager and Hue ports to the outside.

We used Cloudera Manager to auto-install everything for us based on this guide. Moreover, post-install monitoring and configuration was greatly simplified.

The first step was to connect to one of the machines:

ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com

 

Next, we retrieved and started Cloudera Manager:

wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

 

After logging in with the default credentials (admin/admin), we entered the public DNS names of our machines (such as ec2-11-222-333-444.compute-1.amazonaws.com) in the Install Wizard and clicked Go. Et voilà, Cloudera Manager set up the entire cluster automatically! Hence, Cloudera Live was born.

We hope you enjoy Cloudera Live, and we need your feedback whether you do or not! You can provide it via the upstream Hue list, the Hue forum at cloudera.com/community, or by clicking on the “Feedback” tab in the demo itself.

Bringing the Best of Apache Hive 0.13 to CDH Users


More than 300 bug fixes and stable features in Apache Hive 0.13 have already been backported into CDH 5.0.0.

Last week, the Hive community voted to release Hive 0.13. We’re excited about the continued efforts and progress in the project and the latest release — congratulations to all contributors involved!

Furthermore, thanks to continual feedback from customers about their needs, we were able to test and make more than 300 Hive 0.13 fixes and stable features generally available via CDH 5.0.0, which we released last month. Thus, Cloudera customers can confidently take advantage of them in production right now, including:

  • Native Parquet support
    As we reported some time back, native support for Parquet, the open source, general-purpose, columnar storage format for the Apache Hadoop ecosystem, went upstream via HIVE-5783, in large part due to the efforts of Criteo engineers. Users of CDH 5.0.0 can thus easily create Parquet tables in Hive and benefit from improved performance and compression.
  • Scale and precision support for DECIMAL datatype
    Per HIVE-3976, users can now specify the scale and precision of DECIMAL when creating a table.
  • New CHAR datatype
    CHAR datatypes are now supported in Hive (HIVE-5191), in addition to VARCHAR.
  • Maven refactoring
    Hive is Maven-ized (see HIVE-5610 for the merge to trunk) in CDH 5.0.0 for faster, easier builds.
  • Public parallel testing framework
    Cloudera proposed (HIVE-4739) to sponsor an open, public test cluster (on Amazon EC2) for the Hive community, and this environment is now available for users of Hive in CDH 5.0.0, as well as those of upstream Hive 0.13. 
  • SSL encryption for LDAP username/password
    Per HIVE-5351, HiveServer2 supports encrypted communications via SSL with client drivers to enable secure LDAP username/password authentication as an alternative to Kerberos.

Thanks to these ongoing backports — which give CDH users continual access to the best of upstream Hive code — it will also be much easier for those users to upgrade to future releases of Hive!

Hive: The Batch-Processing Spoke in the Enterprise Data Hub

To summarize, as part of our ongoing effort to backport upstream Hive bits into CDH, CDH 5.0.0 users have access to many of the production-ready pieces of Hive 0.13. Furthermore, that functionality is present alongside differentiated components to ensure that enterprise data hub users have access to the best possible tools for their workloads, whether it be Apache Spark for interactive analytics, Hive for batch processing, Impala for interactive SQL, or multiple other options.

Justin Kestelyn is Cloudera’s developer outreach director.

How-to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager


It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!

I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I wanted to run a virtual cluster instead, making use of the 16GB of RAM my shiny new 15″ Retina Macbook Pro has ;)

Vagrant

There are some tutorials, and repositories available for installing a local virtualized cluster, but none of them did what I wanted to do: install the bare cluster using Vagrant, and install the Hadoop stack using the Cloudera Manager. So I created a simple Vagrant setup myself. You can find it here.

Setting up the Virtual Machines

As per the instructions from the GitHub repo:

Depending on the hardware of your computer, installation will probably take between 15 and 25 minutes.

First install VirtualBox and Vagrant.

Install the Vagrant Hostmanager plugin.

$ vagrant plugin install vagrant-hostmanager

 

Clone this repository.

$ git clone https://github.com/DandyDev/virtual-hadoop-cluster.git

 

Provision the bare cluster. It will ask you to enter your password, so it can modify your /etc/hosts file for easy access in your browser. It uses the Vagrant Hostmanager plugin to do so.

$ cd virtual-hadoop-cluster
$ vagrant up

 

Now we can install the Hadoop stack.

Installing Hadoop and Related Components

  1. Surf to: http://vm-cluster-node1:7180.
  2. Login with admin/admin.
  3. Select Cloudera Express and click Continue twice.
  4. On the page where you have to specify hosts, enter the following: vm-cluster-node[1-4] and click Search. Four nodes should pop up and be selected. Click Continue.
  5. On the next page (“Cluster Installation > Select Repository”), leave everything as is and click Continue.
  6. On the next page (“Cluster Installation > Configure Java Encryption”) I’d advise to tick the box, but only if your country allows it. Click Continue.
  7. On this page do the following:
     • Login To All Hosts As: Another user -> enter vagrant
     • In the two password fields enter: vagrant
     • Click Continue.
  8. Wait for Cloudera Manager to install the prerequisites… and click Continue.
  9. Wait for Cloudera Manager to download and distribute the CDH packages… and click Continue.
  10. Wait while the installer inspects the hosts, and click Run Again if you encounter any (serious) errors (I got some that went away the second time). After this, click Finish.
  11. For now, we’ll install everything but HBase. You can add HBase later, but it’s quite taxing for the virtual cluster. So on the “Cluster Setup” page, choose “Custom Services” and select the following: HDFS, Hive, Hue, Impala, Oozie, Solr, Spark, Sqoop2, YARN, and ZooKeeper. Click Continue.
  12. On the next page, you can select which services end up on which nodes. Usually Cloudera Manager chooses the best configuration here, but you can change it if you want. For now, click Continue.
  13. On the “Database Setup” page, leave it on “Use Embedded Database.” Click Test Connection (it says it will skip this step) and click Continue.
  14. Click Continue on the “Review Changes” step. Cloudera Manager will now try to configure and start all services.
  15. And you’re done! Have fun experimenting with Hadoop!

    Cloudera Enterprise 5.1 is Now Available


    Cloudera Enterprise’s newest release contains important new security and performance features, and offers support for the latest innovations in the open source platform.

    We’re pleased to announce the release of Cloudera Enterprise 5.1 (comprising CDH 5.1, Cloudera Manager 5.1, and Cloudera Navigator 2.0).

    Cloudera Enterprise 5, released April 2014, was a milestone for users in terms of security, performance, and support for the latest community-driven innovations, and this update includes significant new investments in those areas, as well as a host of bug fixes. Here are some of the highlights (incomplete; see the Release Notes for CDH, Cloudera Manager, and Cloudera Navigator for a full list of features and fixes):

    Security

    • HDFS now includes support for access control lists (ACLs).
    • Apache Sentry (incubating) now supports GRANT/REVOKE statements.
    • Cloudera Search now supports document-level security (via Sentry).
    • Cloudera Manager has several new security-related features, such as ability to manage/deploy Kerberos configurations, integrate Kerberos with Active Directory, and manage Hadoop/CDH SSL-related configurations.
    • Spark Streaming is now integrated with Kerberos.
    • Cloudera Navigator 2.0 now provides comprehensive metadata, lineage, and auditing support across enterprise data hubs. Navigator also now includes enterprise-grade encryption and key management via Navigator Encrypt and Navigator Key Trustee, respectively.

    Performance

    • Impala now utilizes HDFS caching for improved performance.
    • Impala queries and COMPUTE STATS statements are significantly faster.
    • HBase has improved write performance for WAL.

    Support for the Latest Open Source Innovations

    Cloudera Enterprise 5.1 is re-based on the latest stable component releases, including:

    • Apache Crunch 0.10
    • Apache Flume 1.5.0
    • Apache HBase 0.98.1
    • Apache Mahout  0.9.0
    • Apache Sentry (incubating) 1.3
    • Apache Spark 1.0
    • HBase Lily Indexer 1.5
    • Hue 3.6
    • Impala 1.4

    …with new platform support for:

    • RHEL 6.5/CentOS 6.5
    • OEL 6.5 with UEK 2 and UEK3
    • MySQL 5.6
    • PostgreSQL 9.2 

    Furthermore, this release contains a number of enhancements in the areas of resource management (now supports three different modes of Impala RM via YARN), and SQL support (DECIMAL support across Apache Hive, Impala, Apache Avro, Apache Parquet [incubating], and text file formats; plus ORDER BY without LIMIT in Impala).

    Over the next few weeks, we’ll be publishing blog posts that cover a number of these new features in detail. In the meantime, as always, we value your feedback; please provide any comments and suggestions through our community forums. You can also file bugs via issues.cloudera.org.

    New in CDH 5.1: Document-level Security for Cloudera Search


    Cloudera Search now supports fine-grain access control via document-level security provided by Apache Sentry.

    In my previous blog post, you learned about index-level security in Apache Sentry (incubating) and Cloudera Search. Although index-level security is effective when the access control requirements for documents in a collection are homogenous, often administrators want to restrict access to certain subsets of documents in a collection.

    For example, consider a simple hierarchy of increasingly restrictive security classifications: confidential, secret, and top-secret, and a user with access to view confidential and secret documents querying the corpus. Without document-level security, this query becomes unnecessarily complex. Consider two possible implementations:

    • You could store the confidential and secret documents in non-intersecting collections. That would require complexity at the application or client level to query multiple collections and to combine and score the results.
    • You could duplicate and store the confidential documents with the secret ones in a single collection. That would reduce the application-layer complexity, but add storage overhead and complexity associated with keeping multiple copies of documents in sync.

    In contrast, document-level security, integrated via Sentry and now shipping in CDH 5.1, provides an out-of-the-box solution to this problem without adding extra complexity at the application/client layer or significant storage overhead. In this post, you’ll learn how it works. (Note: only access control is addressed here; other security requirements such as encryption are out of scope.)

      Document-Level Security Model

      You may recall from my previous post that a Sentry policy file specifies the following sections:

      • [groups]: maps a Hadoop group to its set of Sentry roles
      • [roles]: maps a Sentry role to its set of privileges (such as QUERY access on a collection “logs”)

      A simple policy file specification giving every user of the hadoop group “ops” the ability to query collection “logs” would look like this:

    [groups]
    # Assigns each Hadoop group to its set of roles
    ops = ops_role
    [roles]
     ops_role = collection=logs->action=Query
    

     

    In document-level security, the Sentry role names are used as the authorization tokens that specify the set of roles that can view certain documents. The authorization tokens are specified in the individual Apache Solr documents, rather than in the Sentry policy file with the index-level permissions. This separation is done for a couple of reasons:

    • There are many more documents than collections; specifying thousands or millions of document-level permissions per collection in a single policy file would not scale.
    • Because the tokens are indexed in the Solr documents themselves, we can use Solr’s built-in filtering capabilities to efficiently enforce authorization requirements.

    The filtering works by having a Solr SearchComponent intercept the query and append a FilterQuery before the query is executed.

    A few important considerations to note here:

    • Document-level authorization does not supersede index-level authorization; if a user has the ability to view a document according to document-level security rules, but not according to index-level security rules, the request will be rejected.
    • The document-level component adds a FilterQuery with all of the user’s roles OR’ed together (a slight simplification of the actual FilterQuery used; see the sketch below). Thus, to be able to view a document, the document must contain at least one of the user’s roles in the authorization token field. The name of the token field (sentryAuthField in the configuration shown below) is configurable.
    • Because multiple FilterQuerys work together as an intersection, a malicious user can’t bypass the document-level filter by specifying his/her own trivial FilterQuery (such as fq=*:*).
    • Using a FilterQuery is efficient, because Solr caches previously used FilterQuerys. Thus, when a user makes repeated queries on a collection with document-level security enabled, we only pay the cost of constructing the filter on the first query and use the cached filter on subsequent requests.
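    As a rough illustration of the filter the component appends (a simplified sketch, not the actual Solr component code; the field and role names are hypothetical), it boils down to OR’ing the user’s roles over the configured token field:

    // Builds a filter query matching documents that carry at least one of the user's roles.
    // For example, authField = "sentry_auth" and roles = Seq("ops_role", "analyst_role")
    // yields: sentry_auth:ops_role OR sentry_auth:analyst_role
    def docLevelFilter(authField: String, roles: Seq[String]): String =
      roles.map(role => s"$authField:$role").mkString(" OR ")

    // Because this is appended as an additional fq parameter, it intersects with any
    // filters the user supplies, so a trivial fq=*:* cannot widen the result set.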

    Enabling Document-Level Security

    By default, document-level security is disabled to maintain backward compatibility with prior versions of Cloudera Search.  Enabling the feature for a collection involves small modifications to the default solrconfig.xml configuration file:

     <searchComponent name="queryDocAuthorization"
         class="org.apache.solr.handler.component.QueryDocAuthorizationComponent" >
    	<!-- Set to true to enable document-level authorization -->
    	<bool name="enabled">false</bool>
    
    	<!-- Field where the auth tokens are stored in the document -->
    	<str name="sentryAuthField">sentry_auth</str>
    ...
    </searchComponent>
    

     

    Simply change enabled from “false” to “true” and, if desired, change the sentryAuthField field. Then, upload the configuration and create the collection using solrctl.
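    For illustration, here is a hedged sketch (in Scala, using the SolrJ client) of what indexing a document with an authorization token might look like once the feature is enabled; the ZooKeeper address, collection name, field values, and role are invented for the example:

    import org.apache.solr.client.solrj.impl.CloudSolrServer
    import org.apache.solr.common.SolrInputDocument

    // Hypothetical ZooKeeper ensemble and collection.
    val server = new CloudSolrServer("zk1.example.com:2181/solr")
    server.setDefaultCollection("logs")

    val doc = new SolrInputDocument()
    doc.addField("id", "event-12345")
    doc.addField("message", "user login succeeded")
    // The Sentry role(s) permitted to see this document, stored in the configured auth field.
    doc.addField("sentry_auth", "ops_role")

    server.add(doc)
    server.commit()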

    Integration with the Hue Search App

    As with index-level security, document-level security is already integrated with the Hue Search App via secure impersonation in order to provide an intuitive and extensible end-user application.

    Conclusion

    CDH 5.1 brings fine-grain access control to Cloudera Search via the integration of Sentry’s document-level security features. Document-level security handles complex security requirements, while being simple to set up and efficient to use.

    Cloudera Search is available for download with extensive documentation. If you have any questions, please contact us at the Cloudera Search Forum.

    Gregory Chanan is a Software Engineer at Cloudera, and an Apache HBase committer.

    New in CDH 5.1: Apache Spark 1.0


    Spark 1.0 reflects a lot of hard work from a very diverse community.

    Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).

    In this post, we’ll describe some of Spark 1.0’s changes and new features that are relevant to CDH users.

    API Incompatibilities

    In anticipation of some features coming down the pipe, the release includes a few incompatible changes that will enable Spark to avoid breaking compatibility in the future. Most applications will require a recompile to run against Spark 1.0, and some will require changes in source code.

    There are two changes in the core Scala API:

    • The cogroup and groupByKey operators now return Iterables over their values instead of Seqs (see the sketch below). This change means that the set of values corresponding to a particular key need not all reside in memory at the same time.
    • SparkContext.jarOfClass now returns Option[String] instead of Seq[String].
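    For example, code written against Spark 0.9 that treated the grouped values as a Seq needs a small adjustment. Here is a sketch of the change (assuming an existing SparkContext named sc; the path is only for illustration):

    import org.apache.spark.SparkContext._  // pair-RDD operations such as groupByKey

    val grouped = sc.textFile("hdfs:///somedirectory/inputfile.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupByKey()

    // Spark 0.9: the grouped values were a Seq[Int], so positional access such as ones(0) compiled.
    // Spark 1.0: the grouped values are an Iterable[Int]; use iterator-friendly operations instead.
    val counts = grouped.mapValues(ones => ones.sum)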

    Spark’s Java APIs were updated to accommodate Java 8 lambdas. Details on these changes are available under the Java section here.

    The MLLib API contains a set of changes that allows it to support sparse and dense vectors in a unified way. Details on these changes are available here. (Note that MLLib is still a beta component, meaning that its APIs may change in the future.)

    Details on the future-proofing of Spark streaming APIs are available here.

    spark-submit

    Most Spark programming examples focus on spark-shell, but prior to 1.0, users who wanted to submit compiled Spark applications to a cluster found it to be a convoluted process requiring guess-and-check, different invocations depending on the cluster manager and deploy mode, and oodles of boilerplate. Spark 1.0 includes spark-submit, a command that abstracts across the variety of deploy modes that Spark supports and takes care of assembling the classpath for you. A sample invocation:

    spark-submit --class com.yourcompany.MainClass --deploy-mode cluster --master yarn appjar.jar apparg1 apparg2

     

    Avro Support

    We fixed a couple critical Apache Avro bugs that were preventing Spark from reading and writing Avro data. Stay tuned for a future post explaining best practices on interacting with Avro and Apache Parquet (incubating) data from Spark.

    PySpark on YARN

    One of the remaining items in Spark on YARN compared to other cluster managers was lack of PySpark support. Spark 1.0 allows you to launch PySpark apps against YARN clusters. PySpark currently only works in yarn-client mode. Starting a PySpark shell against a YARN installation is as simple as running:

    MASTER=yarn-client pyspark

     

    and running a PySpark script is as simple as running:

    spark-submit --master yarn yourscript.py apparg1 apparg2

     

    Spark History Server

    A common complaint with Spark has been that the per-application UI, which displays task metrics and other useful information, disappears after an app completes, leaving users in a rut when trying to debug failures. To address this, Spark 1.0 offers a History Server that displays information about applications after they have completed. Cloudera Manager provides easy setup and configuration of this daemon.
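    If you are configuring this by hand rather than through Cloudera Manager, the History Server replays event logs that each application writes for itself. A minimal sketch of opting in programmatically (property names per the upstream Spark 1.0 documentation; the HDFS directory is only an example and must be readable by the History Server):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("History-enabled app")
      // Write an event log that the History Server can replay after the app completes.
      .set("spark.eventLog.enabled", "true")
      // Example location; adjust to wherever your History Server is configured to look.
      .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")

    val sc = new SparkContext(conf)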

    Spark SQL

    Spark SQL, which deserves a blog post of its own, is a new Spark component that allows you to run SQL statements inside of a Spark application that manipulate and produce RDDs. Due to its immaturity and alpha component status, Cloudera does not currently offer commercial support for Spark SQL. However, we bundle it with our distribution so that users can try it out.
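    As a taste of the API, here is an untested sketch against the Spark 1.0 SQL API (the data layout, path, and names are invented; it assumes an existing SparkContext named sc):

    import org.apache.spark.sql.SQLContext

    case class Record(key: String, value: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicitly converts an RDD of case classes to a SchemaRDD

    val records = sc.textFile("hdfs:///somedirectory/records.csv")
      .map(_.split(","))
      .map(parts => Record(parts(0), parts(1).toInt))

    records.registerAsTable("records")

    // SQL statements return SchemaRDDs, which behave like ordinary RDDs plus schema information.
    val totals = sqlContext.sql("SELECT key, SUM(value) FROM records GROUP BY key")
    totals.collect().foreach(println)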

    A Note on Stability

    While we at Cloudera are quite bullish on Spark, it’s important to acknowledge that even its core components are not yet as stable as many of the more mature Hadoop ecosystem components. The 1.0 mark does not mean that Spark is now bug-free and ready to replace all your production MapReduce uses — but it does mean that people building apps on top of Spark core should be safe from surprises in future releases. Existing APIs will maintain compatibility, existing deploy modes will remain supported, and the general architecture will remain the same.

    Conclusion

    Cloudera engineers are working hard to make Spark more stable, easier to use, easier to debug, and easier to manage. Expect future releases to bring greater robustness, enhanced scalability, and deeper insight into what is going on inside a Spark application, both while it is running and after it has completed.

    Sandy Ryza is a data scientist at Cloudera, and an Apache Hadoop committer.


    New in CDH 5.1: Hue’s Improved Search App


    An improved Search app in Hue 3.6 makes the Hadoop user experience even better.

    Hue 3.6 (now packaged in CDH 5.1) has brought the second version of the Search App up to even higher standards. The user experience has been greatly improved, as the app now provides a very easy way to build custom dashboards and visualizations.

    Below is a video demo-ing how to interactively explore some real Apache log data coming from the live Hue demo at cloudera.com/live. In just a few clicks, you can look for pages with errors, find the most popular Hue apps, identify the top Web browsers, or inspect user traffic on a gradient colored world map:

    The main features in the new app include:

    • Dynamic interface that updates live
    • Drag-and-drop dashboard builder
    • Text, Timeline, Pie, Line, Bar, Map, Filters, Grid and HTML widgets
    • Solr Index creation wizard from a file

    More is on the roadmap, such as integration with other Hue apps like Hive/HBase, export/import of results to Hadoop, and more data types to plot.

    This tutorial explains how to index the Apache Log into Solr and start doing your own analytics. In the meantime, feel free to give the search dashboards a try via Hue 3.6 in CDH 5.1!

    As usual, we welcome any feedback via @gethue, the hue-user list, or our community discussion forum.

    New in CDH 5.1: HDFS Read Caching


    Applications using HDFS, such as Impala, will be able to read data up to 59x faster thanks to this new feature.

    Server memory capacity and bandwidth have increased dramatically over the last few years. Beefier servers make in-memory computation quite attractive, since a lot of interesting data sets can fit into cluster memory, and memory is orders of magnitude faster than disk.

    For the latest release of CDH 5.1, Cloudera contributed a read caching feature to HDFS to allow applications in the Apache Hadoop ecosystem to take full advantage of the potential of in-memory computation (HDFS-4949). By using caching, we’ve seen a speedup of up to 59x compared to reading from disk, and up to 3x compared to reading from page cache.

    We’ll cover performance evaluation in more detail in a future blog post. Here, we’ll focus on the motivation and design of HDFS caching.

    Motivation

    A form of memory caching is already present on each HDFS DataNode: the operating system page cache. The page cache automatically caches recently accessed data on the local filesystem. Because of the page cache, reading the same file more than once will often result in a dramatic speedup. However, the OS page cache falls short when considered in the setting of a distributed system.

    One issue is the lack of global information about the in-memory state of each node. Given the choice of multiple HDFS replicas from which to read some data, an application is unable to schedule its tasks for cache-locality. Since the application is forced to schedule its tasks blindly, performance suffers.


    When a data analyst runs a query, the application scheduler chooses one of the three block replica locations and runs its task there, which pulls the replica into the page cache. However, if the analyst runs the same query again, the scheduler has no way of knowing which replica is in the page cache, and thus no way to place its task for cache locality.

    Another issue is the page cache’s replacement algorithm, which is a modified version of “least-recently used” eviction. LRU-like algorithms are susceptible to large scans that wipe out the existing contents of the cache. This happens quite commonly on shared Hadoop clusters.

    Consider a data analyst running interactive queries on a memory-sized working set: If a large I/O-heavy MapReduce job runs at the same time, it will evict the data analyst’s working set from the page cache, leading to poor interactive performance. Without application-level knowledge of which dataset to keep in memory, the page cache can do no better for mixed workloads. Finally, although reading data from the page cache is faster than disk, it is still inefficient compared to reading directly from memory (so-called zero-copy reads).

    Another source of inefficiency is checksum verification. These checksums are intended to catch disk and network errors, and can theoretically be skipped if the client is reading from local in-memory data that has already been checksummed. However, skipping redundant checksumming safely is impossible with the page cache since there’s no way to guarantee that a read is coming from memory. By fixing these two issues, we were able to improve read performance by up to 3x compared to reading from page cache.

    Architecture

    The above issues resulted in the following three design requirements:

    1. Global knowledge of cluster cache state, so tasks can be scheduled for cache locality
    2. Global control over cluster cache state, for predictable performance for mixed workloads
    3. Pinning of data in local caches, to enable zero-copy reads and skipping checksums

    Based on these requirements, we decided to add centralized cache management to the NameNode.


    Example of an HDFS client caching a file: First, it sends a cache directive asking the NameNode to cache the file. The NameNode chooses some DataNodes to cache the requested file, with cache commands piggy-backed on the DataNode heartbeat. DataNodes respond with a cache report when the data is successfully cached.

    Caching is explicit and user-driven. When a user wants something cached, they express their intent by creating a cache directive on the NameNode. A cache directive specifies the desired path to cache (meaning a file or directory in HDFS), a desired cache replication factor (up to the file’s replication factor), and the cache pool for the directive (used to enforce quotas on memory use). The system does not automatically manage cache directives, so it’s up to users to manage their outstanding cache directives based on their usage patterns.
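    Cache directives are usually managed with the hdfs cacheadmin CLI (an example appears later in this post), but they can also be created programmatically. Here is a hedged sketch in Scala against the HDFS Java API added by HDFS-4949; the filesystem URI, pool, and path are examples only:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.hdfs.DistributedFileSystem
    import org.apache.hadoop.hdfs.protocol.{CacheDirectiveInfo, CachePoolInfo}

    val conf = new Configuration()
    val dfs = FileSystem.get(new URI("hdfs://localhost:8020"), conf).asInstanceOf[DistributedFileSystem]

    // Create a pool (used to enforce quotas on memory use), then a directive for one file.
    dfs.addCachePool(new CachePoolInfo("testPool"))

    val directive = new CacheDirectiveInfo.Builder()
      .setPath(new Path("/myfile"))
      .setPool("testPool")
      .setReplication(1.toShort)  // cache the file on one DataNode
      .build()

    val directiveId = dfs.addCacheDirective(directive)
    println("Added cache directive " + directiveId)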

    Assuming that this cache directive is valid, the NameNode will attempt to cache said data. It will select cache locations from the set of DataNodes with the data on disk, and ask them to cache the data by piggy-backing a cache command on the DataNode heartbeat reply. This is the same way block replication and invalidation commands are sent.

    When a DataNode receives a cache command, it pulls the desired data into its local cache by using mmap() and mlock() methods and then verifies its checksums. This series of operations guarantees that the data will remain resident in memory, and that it is safe to read without further checksum verification. Using the mmap() and mlock() methods has the advantage of storing the data off-heap, so large amounts of data can be cached without affecting garbage collection.

    Because mlock() takes advantage of the OS page cache, if the block is already held there, we don’t need to copy it. The disadvantage of mlock is that the block must already exist in the filesystem before it can be locked in memory. So we cannot cache replicas on nodes that don’t have the replica already on disk.

    DataNodes periodically send cache reports to the NameNode, which contain the state of their local cache. As soon as the NameNode knows that a block has been successfully cached on a DataNode, application schedulers can query the NameNode for this information and use it to schedule tasks for cache-locality.
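    Here is a hedged sketch of how a scheduler (or any HDFS client) might discover cached replica locations through the standard FileSystem API; the path is an example:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/myfile"))

    // Each BlockLocation lists the DataNodes holding the block on disk and, separately,
    // the DataNodes that currently have it cached in memory.
    for (loc <- fs.getFileBlockLocations(status, 0, status.getLen)) {
      println("replicas: " + loc.getHosts.mkString(", "))
      println("cached:   " + loc.getCachedHosts.mkString(", "))
    }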

    Zero-copy Reads

    Zero-copy read (ZCR) is the final step in efforts to improve the efficiency of the HDFS read path. Copies are one of the most obvious sources of inefficiency; the more time spent copying data, the fewer CPU cycles are left for useful work. ZCR is theoretically optimal in this regard, hence the name “zero-copy.”

    The standard HDFS remote read path copies data from the kernel into the DataNode prior to sending it on to the DFSClient via a TCP socket. Short-circuit local reads eliminate this copy by “short-circuiting” the trip through the DataNode. Instead, the client simply reads the block file directly from the local filesystem.

    However, even when using short-circuit reads, the DFSClient still needs to copy the data from kernel page cache into the client’s address space. ZCR, implemented in HDFS-4953, allows us to avoid that copy. Instead of copying, we use the mmap() system call to map the block from page cache directly into the client’s address space. ZCR also avoids the context switch overhead of repeated invocations of the read system call, which can be significant.

    However, mmap() has some disadvantages. One difficulty is handling I/O errors. If a read() system call encounters an I/O error, it simply returns an error code. Accessing a memory-mapped segment can’t return an error, so any error results in a SIGBUS signal instead. Unless a signal handler has been installed, the calling process is terminated.

    Fortunately, if a client is reading data that is cached by HDFS, it will never hit an I/O error (and thus never get a SIGBUS) — because the data is pinned in memory with mlock(). This approach lets us safely do ZCR without worrying about unexpected program termination. The client can also skip checksum verification when reading cached data, as the data is already checksummed by the datanode when it’s cached.

    The ZCR API is described in HDFS-5191. In addition to a Java API, there is also a C API that allows applications such as Impala to take full advantage of zero-copy reads.
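    Here is a hedged sketch of what a zero-copy read might look like from a client, in Scala against the Hadoop Java API (the path is an example; see HDFS-5191 for the authoritative API description):

    import java.util.EnumSet
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path, ReadOption}
    import org.apache.hadoop.io.ElasticByteBufferPool

    val fs = FileSystem.get(new Configuration())
    val in = fs.open(new Path("/myfile"))

    // A buffer pool the client supplies in case HDFS has to fall back to a copying read.
    val pool = new ElasticByteBufferPool()

    // Request up to 1MB; skipping checksums is safe for cached blocks because the DataNode
    // already verified them when it pinned the block in memory.
    val buf = in.read(pool, 1024 * 1024, EnumSet.of(ReadOption.SKIP_CHECKSUMS))
    if (buf != null) {
      println("Zero-copy read returned " + buf.remaining() + " bytes")
      in.releaseBuffer(buf)  // always hand buffers back to the stream when done
    }
    in.close()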

    Example CLI usage

    Here’s a simple example of creating a new cache pool and adding a cache directive for a file. This example assumes you’ve already configured your cluster correctly according to the official documentation.

    $ hadoop fs -put myfile /
    $ # Add a new cache pool and cache directive
    $ hdfs cacheadmin -addPool testPool
    Successfully added cache pool testPool.
    $ hdfs cacheadmin -addDirective -path /myfile -pool testPool
    Added cache directive 1
    $ # Wait for a minute or two for the NameNode to gather all datanode cache statistics. 512 of 512 bytes of our file should be cached.
    $ hdfs cacheadmin -listPools -stats testPool
    Found 1 result.
    NAME      OWNER   GROUP   MODE            LIMIT  MAXTTL  BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
    testPool  andrew  andrew  rwxr-xr-x   unlimited   never           512           512                0             1             1
    $ # Look at the datanode stats, see that our DN is using 1 page of cache
    $ hdfs dfsadmin -report
    ...<snip>...
    Live datanodes (1):
    ...<snip>...
    Configured Cache Capacity: 64000 (62.50 KB)
    Cache Used: 4096 (4 KB)
    Cache Remaining: 59904 (58.50 KB)
    Cache Used%: 6.40%
    Cache Remaining%: 93.60%
    

     

    Future Work

    There are a number of further improvements we’d like to explore. For example, a current limitation of the system is that users need to manually specify what files and directories should be cached. Instead, HDFS could automatically manage what is cached based on workload patterns or hints.

    Another potential improvement would be to extend HDFS caching to output files as well as input files. One potential use case for this so-called write-caching is for intermediate stages of a multi-job pipeline. Write-caching could avoid writing to disk at all, if durability is not required. This avenue of development is being pursued in HDFS-5851.

    Conclusion

    Due to increasing memory capacity, many interesting working sets are able to fit in aggregate cluster memory. By using HDFS centralized cache management, applications can take advantage of the performance benefits of in-memory computation. Cluster cache state is aggregated and controlled by the NameNode, allowing application schedulers to place their tasks for cache locality. Explicit pinning of datasets allows users to isolate their working sets from other users on shared clusters. Finally, the new zero-copy read API offers substantially improved I/O performance by allowing clients to safely skip the overhead of checksumming and the read() syscall.

    In a follow-up post, we’ll analyze the performance of HDFS caching using a number of micro and macro benchmarks. Stay tuned!

    Colin McCabe and Andrew Wang are both Software Engineers at Cloudera, and Hadoop committers/PMC members.

    Apache Hadoop 2.5.0 is Released


    The Apache Hadoop community has voted to release Apache Hadoop 2.5.0.

    Apache Hadoop 2.5.0 is a minor release in the 2.x release line that includes a number of features and improvements; more details can be found in the documentation and release notes.

    The next minor release (2.6.0) is expected to include some major features as well, including transparent encryption in HDFS along with a key management server,  work-preserving restarts of all YARN daemons, and others. Refer to the roadmap for a full, updated list.

    Currently, Hadoop 2.5 is scheduled to ship inside CDH 5.2 (in late 2014).

    Karthik Kambatla is Software Engineer at Cloudera and a Hadoop committer.

    Running CDH 5 on GlusterFS 3.3


    The following post was written by Jay Vyas (@jayunit100) and originally published in the Gluster.org Community.

    I have recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 Volumes. This is made possible by the fact that Apache Hadoop has a pluggable filesystem architecture that allows the computational components within the CDH 5 distribution to be configured to use alternative filesystems to HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.

    Running a Single CDH 5 Deployment on One or More GlusterFS Volumes

    Given that the CDH 5 distribution is comprised of other components besides YARN and MapReduce, I used the Apache Bigtop System Testing Framework to explicitly validate that Apache Sqoop, Apache Flume, Apache Pig, Apache Hive, Apache Oozie, Apache Mahout, Apache ZooKeeper, Apache Solr and Apache HBase also ran successfully.

    Work is Still in Progress to Enable the Use of Impala

    If you would like to participate in accelerating the work on Impala, please reach out to us on the Gluster mailing list.

    Implementation details for this solution and the specific setup required for all the components are available on the glusterfs-hadoop project wiki. If you have additional questions, feel free to reach out to me on FreeNode (IRC handle jayunit100), @jayunit100 on twitter, or via the Gluster mailing list.

    How-to: Install CDH on Mac OSX 10.9 Mavericks


    This overview will cover the basic tarball setup for your Mac.

    If you’re an engineer building applications on CDH and becoming familiar with all the rich features for designing the next big solution, it becomes essential to have a native Mac OSX install. Sure, you may argue that your MBP, with its four-core hyper-threaded i7, SSD, and 16GB of DDR3 memory, is sufficient for spinning up a VM, and in most instances — such as using a VM for a quick demo — you’re right. However, when experimenting with a slightly heavier workload that is a bit more resource intensive, you’ll want to explore a native install.

    In this post, I will cover setup of a few basic dependencies and the necessities to run HDFS, MapReduce with YARN, Apache ZooKeeper, and Apache HBase. It should be used as a guideline to get your local CDH box setup with the objective to enable you with building and running applications on the Apache Hadoop stack.

    Note: This process is not supported and thus you should be comfortable as a self-supporting sysadmin. With that in mind, the configurations throughout this guideline are suggested for your default bash shell environment that can be set in your ~/.profile.

    Dependencies

    Install the Java version that is supported for the CDH version you are installing. In my case for CDH 5.1, I’ve installed JDK 1.7 u67. Historically the JDK for Mac OSX was only available from Apple, but since JDK 1.7, it’s available directly through Oracle’s Java downloads. Download the .dmg (in the example below, jdk-7u67-macosx-x64.dmg) and install it.

    Verify and configure the installation:

    Old Java path: /System/Library/Frameworks/JavaVM.framework/Home
    New Java path: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home

    export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home"

    Note: You’ll notice that after installing the Oracle JDK, the original path used to manage versioning /System/Library/Frameworks/JavaVM.framework/Versions, will not be updated and you now have the control to manage your versions independently. 

    Enable ssh on your mac by turning on remote login. You can find this option under your toolbar’s Apple icon > System Preferences > Sharing.

    1. Check the box for Remote Login to enable the service. 
    2. Allow access for: “Only these users: Administrators”

      Note: In this same window, you can modify your computer’s hostname.

    Enable password-less ssh login to localhost for MRv1 and HBase. 

    1. Open your terminal.
    2. Generate an rsa or dsa key.
      1. ssh-keygen -t rsa -P ""
      2. Continue through the key generator prompts (use default options).
    3. Test: ssh localhost

    Homebrew

    Another toolkit I admire is Homebrew, a package manager for OSX. While Xcode developer command-line tools are great, the savvy naming conventions and ease of use of Homebrew get the job done in a fun way. 

    I haven’t needed Homebrew for much else than for installing dependencies required for building native Snappy libraries for Mac OSX and ease of install of MySQL for Hive. Snappy is commonly used within HBase, HDFS, and MapReduce for compression and decompression.

    CDH

    Finally, the easy part: The CDH tarballs are very nicely packaged and easily downloadable from Cloudera’s repository. I’ve downloaded tarballs for CDH 5.1.0.

    Download and explode the tarballs in a lib directory where you can manage the latest versions with a simple symlink, as shown below. Although Mac OSX’s “Make Alias” feature is bi-directional, do not use it; instead, use the command-line ln -s command, such as ln -s source_file target_file.

    • /Users/jordanh/cloudera/
    • cdh5.1/
      • hadoop -> /Users/jordanh/cloudera/lib/hadoop-2.3.0-cdh5.1.0
      • hbase -> /Users/jordanh/cloudera/lib/hbase-0.98.1-cdh5.1.0
      • hive -> /Users/jordanh/cloudera/lib/hive-0.12.0-cdh5.1.0
      • zookeeper -> /Users/jordanh/cloudera/lib/zookeeper-3.4.5-cdh4.7.0
    • ops/
      • dn
      • logs/hadoop, logs/hbase, logs/yarn
      • nn/
      • pids
      • tmp/
      • zk/

    You’ll notice above that you’ve created a handful of directories under a folder named ops. You’ll use them later to customize the configuration of the essential components for running Hadoop. Set your environment properties according to the paths where you’ve exploded your tarballs. 

    ~/.profile
    
    CDH="cdh5.1"
    export HADOOP_HOME="/Users/jordanh/cloudera/${CDH}/hadoop"
    export HBASE_HOME="/Users/jordanh/cloudera/${CDH}/hbase"
    export ZK_HOME="/Users/jordanh/cloudera/${CDH}/zookeeper"
    export HIVE_HOME="/Users/jordanh/cloudera/${CDH}/hive"
    export HCAT_HOME="/Users/jordanh/cloudera/${CDH}/hive/hcatalog"
    
    export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${ZK_HOME}/bin:${HBASE_HOME}/bin:${HIVE_HOME}/bin:${HCAT_HOME}/bin:${M2_HOME}/bin:${ANT_HOME}/bin:${PATH}

    Update your main Hadoop configuration files, core-site.xml and hdfs-site.xml, as shown in the samples below. You can also download all files referenced in this post directly from here.

    <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:8020</value>
      <description>The name of the default file system.  A URI whose
        scheme and authority determine the FileSystem implementation.  The
        uri's scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class.  The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/Users/jordanh/cloudera/ops/tmp/hadoop-${user.name}</value>
      <description>A base for other temporary directories.</description>
    </property>
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
      <description>A comma-separated list of the compression codec classes that can
        be used for compression/decompression. In addition to any classes specified
        with this property (which take precedence), codec classes on the classpath
        are discovered using a Java ServiceLoader.</description>
    </property>
    </configuration>

     

    <configuration>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/Users/jordanh/cloudera/ops/nn</value>
      <description>Determines where on the local filesystem the DFS name node
        should store the name table(fsimage).  If this is a comma-delimited list
        of directories then the name table is replicated in all of the
        directories, for redundancy. </description>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/Users/jordanh/cloudera/ops/dn/</value>
      <description>Determines where on the local filesystem an DFS data node
        should store its blocks.  If this is a comma-delimited
        list of directories, then data will be stored in all named
        directories, typically on different devices.
        Directories that do not exist are ignored.
      </description>
    </property>
    <property>
      <name>dfs.datanode.http.address</name>
      <value>localhost:50075</value>
      <description>
        The datanode http server address and port.
        If the port is 0 then the server will start on a free port.
      </description>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
      </description>
    </property>
    </configuration>

    I adapted the YARN and MRv2 configuration and setup from the CDH 5 installation docs. I won’t digress into the specifics of each property or the orchestration and details of how YARN and MRv2 operate, but there’s some great information that my colleague Sandy has already shared for developers and admins.

    Be sure to adjust these parameters to your system's memory and CPU constraints; they directly affect your machine's performance when you execute jobs.

    Next, edit the following files as shown (first yarn-site.xml, then mapred-site.xml, then the log and pid directory exports for Hadoop and YARN).

    <configuration>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
      <description>the valid service name should only contain a-zA-Z0-9_ and can not start with numbers</description>
    </property>
    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
      <description>Whether to enable log aggregation</description>
    </property>
    <property>
      <name>yarn.nodemanager.remote-app-log-dir</name>
      <value>hdfs://localhost:8020/tmp/yarn-logs</value>
      <description>Where to aggregate logs to.</description>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>
      <description>Amount of physical memory, in MB, that can be allocated
        for containers.</description>
    </property>
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>4</value>
      <description>Number of CPU cores that can be allocated
        for containers.</description>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
      <description>The minimum allocation for every container request at the RM,
        in MBs. Memory requests lower than this won't take effect,
        and the specified value will get allocated at minimum.</description>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>2048</value>
      <description>The maximum allocation for every container request at the RM,
        in MBs. Memory requests higher than this won't take effect,
        and will get capped to this value.</description>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-vcores</name>
      <value>1</value>
      <description>The minimum allocation for every container request at the RM,
        in terms of virtual CPU cores. Requests lower than this won't take effect,
        and the specified value will get allocated the minimum.</description>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>2</value>
      <description>The maximum allocation for every container request at the RM,
        in terms of virtual CPU cores. Requests higher than this won't take effect,
        and will get capped to this value.</description>
    </property>
    </configuration>

     

    <configuration>
    <property>
      <name>mapreduce.jobtracker.address</name>
      <value>localhost:8021</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.done-dir</name>
      <value>/tmp/job-history/</value>
      <description></description>
    </property>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
      <description>The runtime framework for executing MapReduce jobs.
      Can be one of local, classic or yarn.
      </description>
    </property>
    <property>
      <name>mapreduce.map.cpu.vcores</name>
      <value>1</value>
      <description>
          The number of virtual cores required for each map task.
      </description>
    </property>
    <property>
      <name>mapreduce.reduce.cpu.vcores</name>
      <value>1</value>
      <description>
          The number of virtual cores required for each reduce task.
      </description>
    </property>
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>1024</value>
      <description>Larger resource limit for maps.</description>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>1024</value>
      <description>Larger resource limit for reduces.</description>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx768m</value>
      <description>Heap-size for child jvms of maps.</description>
    </property>
    <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx768m</value>
      <description>Heap-size for child jvms of reduces.</description>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>1024</value>
      <description>The amount of memory the MR AppMaster needs.</description>
    </property>
    </configuration>

     

    # Where log files are stored.  $HADOOP_HOME/logs by default.
    export HADOOP_LOG_DIR="/Users/jordanh/cloudera/ops/logs/hadoop"
    export YARN_LOG_DIR="/Users/jordanh/cloudera/ops/logs/yarn"
    
    # The directory where pid files are stored when processes run as daemons. /tmp by default.
    export HADOOP_PID_DIR="/Users/jordanh/cloudera/ops/pids"
    export YARN_PID_DIR=${HADOOP_PID_DIR}

    You can configure HBase to run without downloading Apache ZooKeeper separately: HBase bundles ZooKeeper and can run it either as a separate instance (distributed mode) or inside a single JVM (standalone mode). For ease of use, configuration, and management, I recommend either of these options over a separately downloaded ZooKeeper tarball on your machine.

    The primary configuration difference between running HBase in distributed and standalone mode is the hbase.cluster.distributed property in hbase-site.xml. Set the property to false to launch HBase in standalone mode, or to true to spin up separate instances for services such as HBase's ZooKeeper and RegionServer. Update the following HBase configuration files as specified to run it in this type of configuration.

    Note regarding hbase-site.xml: the hbase.cluster.distributed property is set to false by default, which launches HBase in standalone mode. Also, hbase.zookeeper.quorum defaults to localhost and does not need to be overridden in our scenario.

    <configuration>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
        <description>The mode the cluster will be in. Possible values are
          false for standalone mode and true for distributed mode.  If
          false, startup will run all HBase and ZooKeeper daemons together
          in the one JVM.
        </description>
      </property>
    <property>
        <name>hbase.tmp.dir</name>
        <value>/Users/jordanh/cloudera/ops/tmp/hbase-${user.name}</value>
        <description>Temporary directory on the local filesystem.
        Change this setting to point to a location more permanent
        than '/tmp' (The '/tmp' directory is often cleared on
        machine restart).
        </description>
      </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/Users/jordanh/cloudera/ops/zk</value>
        <description>Property from ZooKeeper's config zoo.cfg.
        The directory where the snapshot is stored.
        </description>
      </property>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:8020/hbase</value>
        <description>The directory shared by region servers and into
        which HBase persists.  The URL should be 'fully-qualified'
        to include the filesystem scheme.  For example, to specify the
        HDFS directory '/hbase' where the HDFS instance's namenode is
        running at namenode.example.org on port 9000, set this value to:
        hdfs://namenode.example.org:9000/hbase.  By default HBase writes
        into /tmp.  Change this configuration else all data will be lost
        on machine restart.
        </description>
      </property>
    </configuration>

    Note regarding $HBASE_HOME/conf/hbase-env.sh: HBASE_MANAGES_ZK is set to true by default and is listed below only to make the setting explicit.

    # Where log files are stored.  $HBASE_HOME/logs by default.
    export HBASE_LOG_DIR="/Users/jordanh/cloudera/ops/logs/hbase"
    
    # The directory where pid files are stored. /tmp by default.
    export HBASE_PID_DIR="/Users/jordanh/cloudera/ops/pids"
    
    # Tell HBase whether it should manage its own instance of Zookeeper or not.
    export HBASE_MANAGES_ZK=true

    Pulling it All Together

    By now, you should have accomplished setting up HDFS, YARN, and HBase. Hadoop setup and configuration is quite tedious, and managing it over time is even more so (hence Cloudera Manager, which is unfortunately not available for Mac OS X).

    These are the bare essentials for getting your local machine ready for running MapReduce jobs and building applications on HBase. In the next few steps, we will start/stop the services and provide examples to ensure each service is operating correctly. The steps are listed in dependency order for initialization; reverse the order to halt the services.

    Service HDFS

    NameNode

    format:  hdfs namenode -format

    start:  hdfs namenode

    stop:  Ctrl-C

    url:  http://localhost:50070/dfshealth.html

    DataNode

    start:  hdfs datanode

    stop:  Ctrl-C

    url:  http://localhost:50075/browseDirectory.jsp?dir=%2F&nnaddr=127.0.0.1:8020

    Test

    hadoop fs -mkdir /tmp

    hadoop fs -put /path/to/local/file.txt /tmp/

    hadoop fs -cat /tmp/file.txt

    Service YARN

    ResourceManager

    start:  yarn resourcemanager

    stop:  Ctrl-C

    url:  http://localhost:8088/cluster

    NodeManager

    start:  yarn nodemanager

    stop:  Ctrl-C

    url:  http://localhost:8042/node

    MapReduce Job History Server

    start:  mapred historyserver (or mr-jobhistory-daemon.sh start historyserver)

    stop:  Ctrl-C (or mr-jobhistory-daemon.sh stop historyserver)

    url:  http://localhost:19888/jobhistory/app

    Test Vanilla YARN Application

    hadoop jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar -appname DistributedShell -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.3.0-cdh5.1.0.jar -shell_command "ps wwaxr -o pid,stat,%cpu,time,command | head -10" -num_containers 2 -master_memory 1024

    Test MRv2 YARN TestDFSIO

    hadoop org.apache.hadoop.fs.TestDFSIO -write -nrFiles 5 -size 1GB
    hadoop org.apache.hadoop.fs.TestDFSIO -read -nrFiles 5 -size 1GB

    Test MRv2 YARN Terasort/Teragen

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar teragen 100000000 /tmp/eval/teragen
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar terasort /tmp/eval/teragen /tmp/eval/terasort

    Test MRv2 YARN Pi

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.0.jar pi 100 100

    Service HBase

    HBase Master/RegionServer/ZooKeeper

    start:  start-hbase.sh

    stop:  stop-hbase.sh

    logs:  /Users/jordanh/cloudera/ops/logs/hbase/

    url:  http://localhost:60010/master-status

    Test

    hbase shell
    create 'URL_HITS', {NAME=>'HOURLY'},{NAME=>'DAILY'},{NAME=>'YEARLY'}
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090110', '10'
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090111', '5'
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090112', '30'
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090113', '80'
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'HOURLY:2014090114', '7'
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'DAILY:20140901', '10012'
    put 'URL_HITS', 'com.cloudera.blog.osx.localinstall', 'YEARLY:2014', '93310101'
    
    scan 'URL_HITS'

    Kite SDK Test

    Get familiar with the Kite SDK by trying out this example that loads data to HDFS and then to HBase. Note that a few common issues may surface on OS X when running through the Kite SDK example; they can be easily resolved with the additional setup/config specified below.

    Problem:  NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException

    Resolution:  Fix your classpath by making sure to set HIVE_HOME and HCAT_HOME in your environment.

    export HIVE_HOME="/Users/jordanh/cloudera/${CDH}/hive"
    export HCAT_HOME="/Users/jordanh/cloudera/${CDH}/hive/hcatalog"

    Problem:  InvocationTargetException Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path

    Resolution:  Snappy libraries are not compiled for Mac OS X out of the box. A Java port of Snappy was introduced in CDH 5 and will likely need to be recompiled on your machine.

    git clone https://github.com/xerial/snappy-java.git
    cd snappy-java
    make 
    
    cp target/snappy-java-1.1.1.3.jar $HADOOP_HOME/share/hadoop/common/lib/asnappy-java-1.1.1.3.jar

    Landing Page

    Creating a landing page will help consolidate all the HTTP addresses of the services that you’re running. Please note that localhost can be replaced with your local hostname (such as jakuza-mbp.local).

    Service Apache HTTPD

    start: sudo -s launchctl load -w /System/Library/LaunchDaemons/org.apache.httpd.plist

    stop: sudo -s launchctl unload -w /System/Library/LaunchDaemons/org.apache.httpd.plist

    logs: /var/log/apache2/

    url: http://localhost/index.html

    Create index.html (edit /Library/WebServer/Documents/index.html, which you can download here).

    It will look something like this:

     

    Conclusion

    With this guide, you should have a locally running Hadoop cluster with HDFS, MapReduce, and HBase. These are the core components of Hadoop, and they make a good initial foundation for building and prototyping your applications locally.

    I hope this will be a good starting point on your dev box to try out more ways to build your products, whether they are data pipelines, analytics, machine learning, search and exploration, or more, on the Hadoop stack. 

    Jordan Hambleton is a Solutions Architect at Cloudera.

    The Definitive "Getting Started" Tutorial for Apache Hadoop + Your Own Demo Cluster


    Using this new tutorial alongside Cloudera Live is now the fastest, easiest, and most hands-on way to get started with Hadoop.

    At Cloudera, developer enablement is one of our most important objectives. One only has to look at examples from history (Java or SQL, for example) to know that knowledge fuels the ecosystem. That objective is what drives initiatives such as our community forums, the Cloudera QuickStart VM, and this blog itself.

    Cloudera Live

    Today, we are providing what we believe is a model for Hadoop developer enablement going forward: a definitive end-to-end tutorial and free, cloud-based demo cluster and sample data for hands-on exercises, via the Cloudera Live program.

    When Cloudera Live was launched in April 2014, it initially contained a read-only environment where users could experiment with CDH, our open source platform containing the Hadoop stack, for a few hours. Today, we are launching a new interactive version (hosted by GoGrid) in which you can use pre-loaded datasets or your own data, and which is available to you for free for two weeks. Furthermore, the environment is available in two other flavors—with Tableau or Zoomdata included—so you can test-drive CDH and Cloudera Manager alongside familiar BI tools, too.

    Now, back to that tutorial:

    To There and Back

    Most Hadoop tutorials take a piecemeal approach: they either focus on one or two components, or at best a segment of the end-to-end process (just data ingestion, just batch processing, or just analytics). Furthermore, few if any provide a business context that makes the exercise pragmatic.

    This new tutorial closes both gaps. It takes the reader through the complete Hadoop data lifecycle—from data ingestion through interactive data discovery—and does so while emphasizing the business questions concerned: What products do customers view on the Web, what do they like to buy, and is there a relationship between the two?

    Getting those answers is a task that organizations with traditional infrastructure have been doing for years. However, the ones that bought into Hadoop do the same thing at greater scale, at lower cost, and on the same storage substrate (with no ETL, that is) upon which many other types of analysis can be done.

    To learn how to do that, in this tutorial (and assuming you are using our sample dataset) you will:

    • Load relational and clickstream data into HDFS (via Apache Sqoop and Apache Flume respectively)
    • Use Apache Avro to serialize/prepare that data for analysis
    • Create Apache Hive tables
    • Query those tables using Hive or Impala (via the Hue GUI)
    • Index the clickstream data using Flume, Cloudera Search, and Morphlines, and expose a search GUI for business users/analysts

    Go Live

    We think that even on its own, this tutorial will be a huge help to developers of all skill levels—and with Cloudera Live in the mix as a demo backend for doing the hands-on exercises, it’s almost irresistible.

    If you have any comments or encounter a roadblock, let us know about it in this discussion forum.

    Justin Kestelyn is Cloudera’s developer outreach director.

    Cloudera Enterprise 5.2 is Released


    Cloudera Enterprise 5.2 contains new functionality for security, cloud deployments, and real-time architectures, and support for the latest open source component releases and partner technologies.

    We’re pleased to announce the release of Cloudera Enterprise 5.2 (comprising CDH 5.2, Cloudera Manager 5.2, Cloudera Director 1.0, and Cloudera Navigator 2.1).

    This release reflects our continuing investments in Cloudera Enterprise’s main focus areas, including security, integration with the partner ecosystem, and support for the latest innovations in the open source platform (including Impala 2.0, its most significant release yet, and Apache Hive 0.13.1). It also includes a new product, Cloudera Director, that streamlines deployment and management of enterprise-grade Hadoop clusters in cloud environments; new component releases for building real-time applications; and new support for significant partner technologies like EMC Isilon. Furthermore, this release ships the first results of joint engineering with Intel, including WITH GRANT OPTION for Hive and Impala and performance optimizations for MapReduce.

    Here are some of the highlights (incomplete; see the respective Release Notes for CDH, Cloudera Manager, and Cloudera Navigator for full lists of features and fixes):

    Security

    • Via Apache Sentry (incubating) 1.4, GRANT and REVOKE statements in Impala and Hive can now include WITH GRANT OPTION, for delegation of granting and revoking privileges (joint work with Intel under Project Rhino).
    • Hue has a new Sentry UI that supports policy management for visually creating/editing roles in Sentry and permissions on files in HDFS.
    • Kerberos authentication is now supported in Apache Accumulo.
    • In Impala, authentication can now be done through a combination of Kerberos and LDAP.

    Data Management and Governance

    • Cloudera Navigator 2.1 features a brand-new auditing UI that is unified with lineage and discovery, so you now have access to all Navigator functionality from a single interface.
    • Navigator 2.1 includes role-based access control so you can restrict access to auditing, metadata and policy management capabilities.
    • We’re also shipping a beta policy engine in Navigator 2.1. Targeted to GA by year-end, the policy engine allows you to set up rules and notifications so you can classify data as it arrives and integrate with data preparation and profiling tools. Try it out and let us know what you think!
    • And we’ve added lots of top-requested enhancements, such as Sentry auditing for Impala and integration with Hue.

    Cloud Deployment

    • Cloudera Director is a simple and reliable way to deploy, scale, and manage Hadoop in the cloud (initially for AWS) in an enterprise-grade fashion. It’s free to download and use, and supported by default for Cloudera Enterprise customers. See the User Guide for more details.

    Real-Time Architecture

    • Re-base on Apache HBase 0.98.6
    • Re-base on Apache Spark/Streaming 1.1
    • Re-base on Impala 2.0
    • Apache Sqoop now supports import into Apache Parquet (incubating) file format
    • Apache Kafka integration with CDH is now incubating in Cloudera Labs; a Kafka-Cloudera Labs parcel (unsupported) is available for installation. Integration with Flume, via a special Source and Sink, has also been provided.

    Impala 2.0

    • Disk-based query processing: enables large queries to “spill to disk” if their in-memory structures are larger than the currently available memory. (Note that this feature only uses disk for the portion that doesn’t fit in the available memory.)
    • Greater SQL compatibility: SQL 2003 analytic window functions, support for legacy data types (such as CHAR and VARCHAR), better compliance with SQL standards (WHERE, EXISTS, IN), and additional vendor-specific SQL extensions.

    New Open Source Releases and Certifications

    Cloudera Enterprise 5.2 includes multiple new component releases:

    • Apache Avro 1.7.6
    • Apache Crunch 0.11
    • Apache Hadoop 2.5
    • Apache HBase 0.98.6
    • Apache Hive 0.13.1
    • Apache Parquet (incubating) 1.5 / Parquet-format 2.1.0
    • Apache Sentry (incubating) 1.4
    • Apache Spark 1.1
    • Apache Sqoop 1.4.5
    • Impala 2.0
    • Kite SDK 0.15.0

    …with new certifications on:

    • Filesystems: EMC Isilon
    • OSs: Ubuntu 14.04 (Trusty)
    • Java: Oracle JDK1.7.0_67

    Over the next few weeks, we'll publish blog posts that cover some of these and other new features in detail. In the meantime, as always, we value your feedback; please provide any comments and suggestions through our community forums. You can also file bugs via issues.cloudera.org.


    Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D


    Cloudera Labs contains ecosystem innovations that one day may bring developers more functionality or productivity in CDH.

    Since its inception, one of the defining characteristics of Apache Hadoop has been its ability to evolve/reinvent and thrive at the same time. For example, two years ago, nobody could have predicted that the formative MapReduce engine, one of the cornerstones of “original” Hadoop, would be marginalized or even replaced. Yet today, that appears to be happening via Apache Spark, with Hadoop becoming the stronger for it. Similarly, we’ve seen other relatively new components, like Impala, Apache Parquet (incubating), and Apache Sentry (also incubating), become widely adopted in relatively short order.

    Cloudera Labs

    This unique characteristic requires Cloudera to be highly sensitive to new activity at the “edges” of the ecosystem — in other words, to be vigilant for the abrupt arrival of new developer requirements, and new components or features that meet them. (In fact, Cloudera employees are often the creators of such solutions.) When there is sufficient market interest and customer success with them seems assured, these new components often join the Cloudera platform as shipping product.

    Today, we are announcing a new program that externalizes this thought process: Cloudera Labs (cloudera.com/labs). Cloudera Labs is a virtual container for innovations being incubated within Cloudera Engineering, with the goal of bringing more use cases, productivity, or other types of value to developers by constantly exploring new solutions for their problems. Although Labs initiatives are not supported or intended for production use, you may find them interesting for experimentation or personal projects, and we encourage your feedback about their usefulness to you. (Note that inclusion in Cloudera Labs is not a precondition for productization, either.)

    Apache Kafka is among the “charter members” of this program. Since its origin just a couple of years ago as proprietary LinkedIn infrastructure for highly scalable and resilient real-time data transport, it has become one of the hottest projects associated with Hadoop. To stimulate feedback about Kafka’s role in enterprise data hubs, today we are making a Kafka-Cloudera Labs parcel (unsupported) available for installation.

    Other initial Labs projects include:

    • Exhibit
      Exhibit is a library of Apache Hive UDFs that let you treat array fields within a Hive row as if they were “mini-tables” and then execute SQL statements against them for deeper analysis.
    • Hive-on-Spark Integration
      A broad community effort is underway to bring Apache Spark-based data processing to Apache Hive, reducing query latency considerably and allowing IT to further standardize on Spark for data processing.
    • Impyla
      Impyla is a Python (2.6 and 2.7) client for Impala, the open source MPP query engine for Hadoop. It communicates with Impala using the same standard protocol as ODBC/JDBC drivers.
    • Oryx
      Oryx, a project jointly spearheaded by Cloudera Engineering and Intel, provides simple, real-time infrastructure for large-scale machine learning/predictive analytics applications.
    • RecordBreaker
      RecordBreaker, a project jointly developed by Hadoop co-founder Mike Cafarella and Cloudera, automatically turns your text-formatted data into structured Avro data–dramatically reducing data prep time.

    As time goes on, and some of the projects potentially graduate into CDH components (or otherwise remain as Labs projects), more names will join the list. And of course, we’re always interested in hearing your suggestions for new Labs projects.

    New in CDH 5.2: More SQL Functionality and Compatibility for Impala 2.0


    Impala 2.0 is the most SQL-complete/SQL-compatible release yet.

    As we reported in the most recent roadmap update (“What’s Next for Impala: Focus on Advanced SQL Functionality”), more complete SQL functionality (and better SQL compatibility with other vendor extensions) is a major theme in Impala 2.0.

    In this post, we’ll describe the highlights (not a complete list), and provide links to the docs that drill down into these features.

    Analytic (Window) Functions

    Analytic functions (aka window functions) are a special category of built-in functions. Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and bucketed analysis for large data sets. You might also see the term “window functions” in database literature, referring to the interval (the “window”) to which the function call applies, particularly when the query includes a ROWS clause.

    Like aggregate functions, analytic functions examine the contents of multiple input rows to compute each output value. However, rather than being limited to one result value per GROUP BY group, they operate on sliding windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER() clause.

    Impala 2.0 now supports the following analytic query clauses and pure analytic functions:

    • OVER Clause
    • Window Clause
    • DENSE_RANK() Function
    • FIRST_VALUE() Function
    • LAG() Function
    • LAST_VALUE() Function
    • LEAD() Function
    • RANK() Function
    • ROW_NUMBER() Function
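
    For illustration, here is a minimal sketch of an analytic query. The sales table and its columns are hypothetical; the query ranks each sale within its region by amount and looks back at the same salesperson's previous sale:

    -- Hypothetical table: rank each sale within its region by amount,
    -- and fetch the previous sale amount for the same salesperson.
    SELECT
      region,
      salesperson,
      amount,
      RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region,
      LAG(amount) OVER (PARTITION BY salesperson ORDER BY sale_date) AS prev_amount
    FROM sales;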

    See the docs for more details about these functions.

    New Data Types

    New data types in Impala 2.0 provide greater compatibility with source code from traditional database systems:

    • VARCHAR is like the STRING data type, but with a maximum length. See VARCHAR Data Type for details.
    • CHAR is like the STRING data type, but with a fixed length. Short values are padded with spaces on the right. See CHAR Data Type for details.
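
    As a quick sketch (the table and column names are hypothetical), CHAR suits fixed-width codes while VARCHAR bounds the length of free-form strings:

    -- Hypothetical example: CHAR for fixed-width codes, VARCHAR for length-limited strings.
    CREATE TABLE customers (
      country_code CHAR(2),
      name VARCHAR(100)
    );

    -- Casting literals shows the padding/length behavior of each type.
    SELECT CAST('US' AS CHAR(2)) AS code,
           CAST('Acme Corporation' AS VARCHAR(100)) AS name;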

    Subquery Enhancements

    Impala 2.0 also supports a number of subquery enhancements including:

    • Subqueries in the WHERE clause (for example, with the IN operator).
    • EXISTS and NOT EXISTS operators (always used in conjunction with subqueries).
    • IN and NOT IN operators where the right-hand side is a subquery, rather than a hardcoded list of values.
    • Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT MAX(some_other_column) FROM table) or WHERE column IN (SELECT some_other_column FROM table WHERE conditions).
    • Correlated subqueries let you cross-reference values from the outer query block and the subquery.
    • Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.
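
    For illustration, here are sketches of the newly supported patterns, using hypothetical orders and customers tables:

    -- Uncorrelated subquery with IN:
    SELECT * FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EMEA');

    -- Correlated subquery with EXISTS:
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

    -- Scalar subquery compared against an aggregate:
    SELECT * FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders);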

    See the docs for more details.

    SQL Operations That Spill to Disk

    Certain memory-intensive operations now write temporary data to disk (known as “spilling to disk”) when Impala is close to exceeding its memory limit for a particular node.

    For example, when large tables are joined, Impala keeps the distinct values of the join columns from one table in memory, to compare them to incoming values from the other table. When a query uses a GROUP BY clause for columns with millions or billions of distinct values, Impala keeps a similar number of temporary results in memory, to accumulate the aggregate results for each value in the group. When a large result set is sorted by the ORDER BY clause, each node sorts its portion of the result set in memory. The DISTINCT and UNION operators also build in-memory data structures to represent all values found so far, to eliminate duplicates as the query progresses.

    The result is a query that completes successfully, rather than failing with an out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back in. Thus, while this feature improves reliability and reduces memory usage, you should optimize your queries, system parameters, and hardware configuration to make spilling rare.
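
    As a rough sketch (clicks is a hypothetical table), you could lower the per-query memory limit and run a high-cardinality aggregation; with spilling, the query completes rather than failing, just more slowly:

    -- Cap memory for this session so the aggregation is likely to spill.
    SET MEM_LIMIT=1g;

    -- Grouping on a high-cardinality column builds large in-memory hash tables;
    -- the portion that no longer fits in memory is written to temporary disk storage.
    SELECT user_id, COUNT(*) AS visits
    FROM clicks
    GROUP BY user_id
    ORDER BY visits DESC
    LIMIT 100;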

    See the docs for more details.

    More SQL on the Way

    The features above are just a few of the most notable highlights in Impala 2.0, which also includes additional SQL functionality such as vendor-specific extensions like DECODE and DATE_PART.

    Stay tuned to this blog for information about SQL functionality in future releases.

    John Russell is the technical writer for Impala, and the author of Getting Started with Impala (O’Reilly Media).

    New in CDH 5.2: Apache Sentry Delegated GRANT and REVOKE


    This new feature, jointly developed by Cloudera and Intel engineers, makes management of role-based security much easier in Apache Hive, Impala, and Hue.

    Apache Sentry (incubating) provides centralized authorization for services and applications in the Apache Hadoop ecosystem, allowing administrators to set up granular, role-based protection on resources and to review them in one place. Previously, only designated Sentry administrators could GRANT and REVOKE privileges on an authorizable object. In Apache Sentry 1.5.0 (shipping inside CDH 5.2), we have implemented a new feature (SENTRY-327) that allows admin users to delegate the GRANT privilege to other users using WITH GRANT OPTION. If a user has the GRANT OPTION privilege on a specific resource, the user can now grant the GRANT privilege to other users on the same resource. Apache Hive, Impala, and Hue have all been updated to take advantage of this new Sentry functionality.

    In this post, we’ll provide an overview of how this new feature works.

    Delegating GRANT/REVOKE Privileges

    You can use Hive or Impala to grant privileges using the GRANT ... WITH GRANT OPTION SQL statement:

    GRANT
        priv_type [, priv_type ] ...
        ON table_or_view_name
        TO principal_specification [, principal_specification] ...
        [WITH GRANT OPTION];

    Note: Impala currently only supports granting/revoking a single privilege at a time (IMPALA-1341).

    When WITH GRANT OPTION is specified, the command will give members of the target role privileges to issue their own GRANT statements. Initially, only a pre-defined set of Sentry admin users can issue GRANT statements.

    For example, the following commands will create a new role, sales_dept, and provide members of the role the GRANT OPTION privilege on database salesdb:

    USE salesdb;
    CREATE ROLE sales_dept;
    GRANT ROLE sales_dept TO GROUP sales_grp;
    GRANT ALL ON DATABASE salesdb TO ROLE sales_dept WITH GRANT OPTION;

    This will give users belonging to the sales_dept role the ability to grant equivalent or lesser privileges—privileges on salesdb or tables under salesdb—to other roles. This status includes the ability to grant using the GRANT OPTION privilege.

    Thus, a user who belongs to the sales_dept role will now have privileges to execute commands such as:

    GRANT ALL ON TABLE marketing TO ROLE marketing_dept;
    GRANT SELECT ON DATABASE salesdb TO ROLE marketing_dept;

    The GRANT OPTION privilege also allows for granting the GRANT OPTION to other roles. For example, the following will grant the GRANT OPTION privilege to role marketing_dept, which will give members of that role the ability to grant it to other roles:

    GRANT SELECT ON DATABASE salesdb TO ROLE marketing_dept WITH GRANT OPTION;

    Viewing Granted Privileges

    When managing role privileges, you can determine which privileges have been granted to a role and whether the privilege was granted using WITH GRANT OPTION, using:

    SHOW GRANT ROLE <roleName>;

    This statement returns all privileges granted to a role by all users. It can be executed by admin users or by any user who currently belongs to the role.

    An example from Impala is shown below. The statement returns similar results in Hive:

    +----------+----------+-----------+-----------+--------------+-----------------+
    | scope    | database | table     | privilege | grant_option | create_time     |
    +----------+----------+-----------+-----------+--------------+-----------------+
    | TABLE    | salesdb  | marketing | ALL       | false        | Wed, Oct 01 ... |
    | DATABASE | salesdb  |           | SELECT    | true         | Wed, Oct 01 ... |
    +----------+----------+-----------+-----------+--------------+-----------------+

    Revoking the GRANT Privilege

    If a user has the GRANT OPTION privilege, they can also revoke privileges from roles. The Impala and Hive syntax for REVOKE is:

    REVOKE [GRANT OPTION FOR]
        priv_type [, priv_type ] ...
        ON table_or_view_name
        FROM principal_specification [, principal_specification] ... ;

    To revoke only the grant option from a privilege, the GRANT OPTION FOR clause can be added to a REVOKE statement. When this clause is specified, the target privilege will be preserved, but users in the role will no longer be allowed to issue GRANT statements.
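
    For example, a statement of the following form (building on the salesdb grant above) would remove just the grant option from role marketing_dept while preserving its SELECT privilege; note the caveat about Hive support below:

    REVOKE GRANT OPTION FOR SELECT ON DATABASE salesdb FROM ROLE marketing_dept;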

    Hive does not currently support the GRANT OPTION FOR clause, but the REVOKE command without this clause will always revoke all privileges (those granted with and without WITH GRANT OPTION). For example, if a role named sales_dept was granted SELECT and INSERT privileges on the table marketing:

    USE salesdb;
    GRANT SELECT, INSERT ON TABLE marketing TO ROLE sales_dept;

    The following REVOKE will only remove the INSERT on the table marketing, preserving the SELECT privilege:

    REVOKE INSERT ON TABLE marketing FROM ROLE sales_dept;

    Furthermore, we support the revocation of child privileges when executing the REVOKE command. To revoke all privileges on the database salesdb along with all privileges granted on all child tables:

    REVOKE ALL ON DATABASE salesdb FROM ROLE sales_dept;

    Future Work

    The Hive integration with Sentry is based on Hive 0.13, which does not support the GRANT OPTION FOR clause in the Hive revoke command. In Hive 0.14.0, this syntax is supported and the grant option for a privilege can be removed while still keeping the privilege using REVOKE. (For more information, see SENTRY-473.)

    Impala syntax will also be enhanced to match the Hive syntax for granting/revoking multiple privileges to/from multiple roles in a single statement (IMPALA-1341).

    Acknowledgments

    This feature is co-developed by Intel and Cloudera. Many thanks to everyone who participated in this work (listed in alphabetical order):

    • Arun Suresh
    • Dapeng Sun
    • Haifeng Chen
    • Lenni Kuff
    • Prasad Mujumdar
    • Sravya Tirukkovalur
    • Xiaomeng Huang

    Xiaomeng Huang is a Software Engineer at Intel.

    Lenni Kuff is a Software Engineer at Cloudera.

    New in CDH 5.2: Impala Authentication with LDAP and Kerberos


    Impala authentication can now be handled by a combination of LDAP and Kerberos. Here’s why, and how.

    Impala, the open source analytic database for Apache Hadoop, supports authentication—the act of proving you are who you say you are—using both Kerberos and LDAP. Kerberos has been supported since release 1.0, LDAP support was added more recently, and with CDH 5.2, you can use both at the same time.

    Using LDAP and Kerberos together provides significant value; Kerberos remains the core authentication protocol and is always used when Impala daemons connect to each other and to the Hadoop cluster. However, Kerberos can require more maintenance to support. LDAP is ubiquitous across the enterprise and is commonly utilized by client applications connecting to Impala via ODBC and JDBC drivers. A mix of the two therefore frequently makes sense.

    This table demonstrates the various combinations and their use cases:

    In this post, I’ll explain why and how to set-up Impala authentication using a combination of LDAP and Kerberos.

    Kerberos

    Kerberos remains the primary authentication mechanism for Apache Hadoop. A little Kerberos terminology will help with the discussion to follow.

    • A principal is some Kerberos entity, like a person or a daemon process. For our purposes, a principal looks like name/hostname@realm for daemon processes, or just name@realm for users.
    • The name field can be associated with a process, like “impala”, or it can be a username, like “myoder”.
    • The hostname field can be the fully qualified name of the machine, or the Hadoop-specific magic _HOST string, which is auto-replaced with the fully qualified hostname.
    • The realm is similar to (but not necessarily the same as) a DNS domain. 

    Kerberos principals can prove that they are who they say that they are by either supplying a password (if the principal is a human) or by providing a “keytab” file. Impala daemons need a keytab file, which must be well protected: anyone who can read that keytab file can impersonate the Impala daemons.

    Basic Kerberos support in Impala is straightforward: supply the following arguments, and the daemons will use the given principal and the keys in the keytab file to take on the identity of that principal for all communication.

    • --principal=impala/hostname@realm and
    • --keytab_file=/full/path/to/keytab

    There is another wrinkle if the Impala daemon (impalad) sits behind a load balancer. When the clients running queries go through the load balancer (a proxy), they expect the impalad to have a principal that matches the name of the load balancer. So the impalad has to use a principal matching the name of the proxy when it services these external queries, but a principal matching its actual hostname when doing back-end communication between daemons. The flags to the impalad in this case would be:

    • --principal=impala/proxy-hostname@realm
    • --be_principal=impala/actual-hostname@realm
    • --keytab_file=/full/path/to/keytab

    The first --principal specifies what principal to use when the impalad services external queries, and the --be_principal specifies the principal for when the impalad is doing back-end communication. Keys for both of these principals must reside in the same keytab file.

    Debugging Kerberos

    Kerberos is an elegant protocol, but practical implementations are not always very helpful if something goes wrong. The top two things to check in case of failure are:

    • Time. Kerberos is dependent on synchronized clocks, so it is a best practice to install and use NTP (the Network Time Protocol) on all machines dealing with Kerberos.
    • DNS. Make sure that your hostnames are fully qualified and that forward (name->IP) and reverse (IP->name) DNS lookups are correct.

    Beyond that, it is possible to set two environment variables that will give you Kerberos debugging information. The output may be a little overwhelming, but frequently it will point the way to a solution.

    • KRB5_TRACE=/full/path/to/trace/output.log: This environment variable will instruct all kerberos clients and utilities to print debug output to the named file.
    • JAVA_TOOL_OPTIONS=-Dsun.security.krb5.debug=true: This environment variable is passed to the impala daemons, which in turn pass it to the internal java component.

    In CDH 5.2 and later you can also supply the --krb5_debug_file parameter, which will turn on Kerberos debugging and write the output to the given file. You can supply it in Cloudera Manager via the Impala Configuration “Service-Wide” -> “Advanced” -> “Impala Command Line Argument Advanced Configuration Snippet” parameters. (Environment variables like those above can be supplied in the adjacent “Impala Service Environment Advanced Configuration Snippet” parameters.) It also goes without saying that Google is your friend when debugging problems with Kerberos.

    Kerberos Flags

    The Cloudera documentation for Kerberos and Impala covers this in greater detail, but these are the basic flags:

    LDAP

    Kerberos is great, but it does require that the end user have a valid Kerberos credential, which is not practical in many environments—because every user who interacts with Impala and the Hadoop cluster must have a Kerberos principal configured. For organizations that use Active Directory to manage user accounts, it can be onerous to create corresponding user accounts for each user in an MIT Kerberos realm. Many corporate environments use the LDAP protocol instead, where clients authenticate themselves using their username and password.

    When configured to use LDAP, think of the impalad as an LDAP proxy: the client (the Impala shell, ODBC, JDBC, Hue, and so on) sends its username and password to the impalad, and the impalad takes the username and password and sends them to the LDAP server in an attempt to log in. In LDAP terminology, the impalad issues an LDAP "bind" operation. If the LDAP server returns success for the login attempt, the impalad accepts the connection.

    LDAP is only used to authenticate external clients, such as the Impala shell, ODBC, JDBC, and Hue. All other back-end authentication is handled by Kerberos.

    LDAP Configurations

    LDAP is complicated (and powerful) because it is so flexible; there are many ways to configure LDAP entities and authenticate those entities. In general, every person in LDAP has a Distinguished Name, or DN, which can be considered the username or principal according to LDAP. 

    Let’s examine how users are set up for two different LDAP servers.  The first user is named "Test1 Person" and resides in Windows 2008 Active Directory.

    # Test1 Person, Users, ad.sec.cloudera.com
    dn: CN=Test1 Person,CN=Users,DC=ad,DC=sec,DC=cloudera,DC=com
    cn: Test1 Person
    sAMAccountName: test1
    userPrincipalName: test1@ad.sec.cloudera.com

    The second is me: the entry for user myoder, residing in an OpenLDAP server:

    # myoder, People, cloudera.com
    dn: uid=myoder,ou=People,dc=cloudera,dc=com
    cn: Michael Yoder
    uid: myoder
    homeDirectory: /home/myoder

    Many items have been removed from the above for simplicity. Let’s note some of the similarities and differences in these two accounts:

    • DN: The first line after the comment is for the DN. This is the primary identifying string for one LDAP account. The name starts out specific (CN=Test1 Person and uid=myoder) and works out to the more general; DC=cloudera,DC=com corresponds to cloudera.com. They are quite different: the AD entry has a human name in the first field (CN=Test1 Person) and the OpenLDAP entry has an account username (uid=myoder).
    • CN: The Common Name. For AD, it’s the same as in the DN; for OpenLDAP it’s the human name, which is not the uid in the DN.
    • sAMAccountName: This AD-only entry is a legacy form of a username. Despite being deprecated it is widely used and documented.
    • userPrincipalName: This AD-only entry, by convention, should map to the user’s email name. It will usually look like this: sAMAccountName@fully.qualified.domain.com. This is the modern Active Directory username and is widely used.

    There is an additional interesting implementation detail about AD. Normally, authentication in LDAP is based on the DN. With AD, several items are tried in turn:

    • First, the DN
    • userPrincipalName
    • sAMAccountName + "@" + the DNS domain name
    • Netbios domain name + "\" + the sAMAccountName
    • And several other somewhat more obscure mechanisms (see the link above)

    LDAP and the Impalad

    Given all these differences, it is fortunate that the Impala daemon provides several mechanisms to address the variety of LDAP configurations out there. First, let’s start simple:

    • --enable_ldap_auth must be set, and
    • --ldap_uri=ldap://ldapserver.your.company.com needs to be specified.

    With just those set, the username given to the impalad (by the Impala shell, JDBC, ODBC, and so on) is passed straight through to the LDAP server unaltered. This approach works out great for AD if the username is fully qualified, like test1@ad.sec.cloudera.com—it’ll match either the userPrincipalName or the sAMAccountName plus the DNS domain name.

    It’s also possible to set up the impalad so that the domain (ad.sec.cloudera.com in this case) is automatically added to the username, by setting --ldap_domain=ad.sec.cloudera.com as an argument to the impalad. Now when a client username like "test1" comes in, the impalad appends that domain name so that the result passed to AD becomes test1@ad.sec.cloudera.com. This behavior can be a convenience to your users.

    So far, things are working smoothly for AD. But what about other LDAP directories, like OpenLDAP? It doesn’t have any of the sAMAccountName or userPrincipalName stuff, and instead we have to authenticate directly against the DN. Users aren’t going to know their LDAP DN! 

    Fortunately, the impalad has parameters for this scenario, too. The --ldap_baseDN=X parameter is used to convert the username into the LDAP DN, so that the resulting DN looks like uid=username,X. For example, if --ldap_baseDN=ou=People,dc=cloudera,dc=com, and the username passed in is "myoder", the resulting query passed to LDAP will look like uid=myoder,ou=People,dc=cloudera,dc=com—which does indeed match the DN of user myoder above. Presto!

    For maximum flexibility, it’s also possible to specify an arbitrary mapping from usernames into a DN via the --ldap_bind_pattern string. The idea is that the string specified must have a placeholder named #UID inside it, and that #UID is replaced with the username. For example, you could mimic the behavior of --ldap_baseDN by specifying --ldap_bind_pattern=uid=#UID,ou=People,dc=cloudera,dc=com. When the username of "myoder" comes in, it replaces the #UID, and we’ll get the same string as above. This option should be used when more control over the DN is needed.

    LDAP and TLS

    When using LDAP, the username and password are sent over the connection to the LDAP server in the clear. This means that without any other sort of protection, anyone can see the password traveling over the wire. To prevent that, you must protect the connection with TLS (Transport Layer Security, formerly known as SSL). There are two different connections to protect: between the client and the impalad, and between the impalad and the LDAP server.

    TLS Between the Client and the Impalad

    Authentication for TLS connections is done with certificates, so the impalad (as a TLS server) will need its own certificate. The impalad presents this certificate to clients in order to prove that it really is the impalad. In order to supply this certificate, the impalad must be started with --ssl_server_certificate=/full/path/to/impalad-cert.pem and --ssl_private_key=/full/path/to/impalad-key.pem.

    Now clients must use TLS to talk to the impalad. In the Impala shell, you accomplish that with the --ssl and --ca_cert=/full/path/to/ca-certificate.pem arguments. The ca_cert argument specifies the certificate that signed the ssl_server_certificate above. For ODBC connections, consult the documentation for the Cloudera ODBC driver for Impala; it offers a thorough description of the settings required for certificates, authentication, and TLS.

    Frankly, using TLS between the impala clients and the impalad is a good idea, regardless of whether or not LDAP is being used. Otherwise, your queries, and the results of those queries, go over the wire in the clear.

    TLS Between the Impalad and the LDAP Server

    There are two ways to turn on TLS with the LDAP Server:

    • Supply --ldap_tls as an argument to the impalad. The connection will take place over the usual LDAP port, but after the connection is first made it will issue a STARTTLS request which will upgrade the connection to a secure connection using TLS on that same port.
    • Supply a URI starting with ldaps://. This uses a different port than ldap://.

    Finally, the connection to the LDAP server needs its own authentication; this way, you know that the impalad is talking to the correct ldap server and you’re not giving your passwords to a rogue man-in-the-middle attacker. You’ll need to pass --ldap_ca_certificate to the impalad to specify the location of the certificate that signed the LDAP server’s certificate.

    LDAP Flags

    The Cloudera documentation for LDAP and Impala contains much of this information, and the documentation for TLS between the Impala client and the Impala daemon is required reading as well. In Cloudera Manager, some of these flags can be set in the Impala Configuration under the “Service-Wide” -> “Security” menu; the rest must be specified in the “Service-Wide” -> “Advanced” -> “Impala Command Line Argument Advanced Configuration Snippet” parameters.

    To summarize all these flags:

    Bringing it All Together

    Correctly implementing authentication in the most secure manner possible results in quite a lot of flags being passed to the Impala daemons. Here is an example invocation of the impalad (minus other flags), assuming that we want to enable both Kerberos and LDAP authentication:

    impalad --enable_ldap_auth \
        --ldap_uri=ldap://ldapserver.your.company.com \
        --ldap_tls \
        --ldap_ca_certificate=/full/path/to/certs/ldap-ca-cert.pem \
        --ssl_server_certificate=/full/path/to/certs/impala-cert.pem \
        --ssl_private_key=/full/path/to/certs/impala-key.pem \
        --principal=impala/_HOST@EXAMPLE.COM \
        --keytab_file=/full/path/to/keytab

    Connecting from the impala shell might look like this:

    impala-shell.sh --ssl \
        --ca_cert=/full/path/to/cert/impala-ca-cert.pem \
        -k

    when authenticating with Kerberos, or

    impala-shell.sh --ssl \
       --ca_cert=/full/path/to/cert/impala-ca-cert.pem \
       -l -u myoder@cloudera.com

    when authenticating with LDAP.

    Michael Yoder is a Software Engineer at Cloudera.

    New in CDH 5.2: New Security App and More in Hue


    Thanks to new improvements in Hue, CDH 5.2 offers the best GUI yet for using Hadoop.

    CDH 5.2 includes important new usability functionality via Hue, the open source GUI that makes Apache Hadoop easy to use. In addition to shipping a brand-new app for managing security permissions, this release is particularly feature-packed, and is becoming a great complement to BI tools from Cloudera partners like Tableau, MicroStrategy, and Zoomdata because a more usable Hadoop translates into better BI overall across your organization! 

    In the rest of this post, we’ll offer an overview of these improvements.

    Security

    To support the growth of the Apache Sentry (incubating) project and make it easier to secure your cluster, CDH 5.2 Hue contains a new Security app. Sentry privileges determine which Apache Hive/Impala databases and tables a user can see or modify. The Security app lets you create/edit/delete roles and privileges directly from your browser (there is no sentry-provider.ini file to edit anymore), and to take advantage of Sentry’s new WITH GRANT OPTION functionality. (Learn more here.)

    Search

    Hue’s initial Search dashboards introduced new ways to quickly explore a lot of data by dragging & dropping graphical widgets and leveraging Apache Solr capabilities. In CDH 5.2, the application is greatly improved. For example, the top menu bar is re-organized to split widgets displaying records or facets, new Heatmap and Tree widgets let you explore data in 2D or n dimensions, the new Marker Map is great for automatically plotting result rows on a Leaflet map, and index fields can now have their terms and stats retrieved in a single click. (Learn more here.)

    Apache Oozie

    Hue’s Oozie dashboard got a few improvements to make Oozie job management less tedious: page display is faster, you can now suspend/kill/resume jobs and rerun failed coordinator actions in bulk, and there’s a new Metrics feature. (Learn more here.)

    File Browser

    A lot of exciting work has been done on File Browser to provide the best user experience possible, including:

    • Drag & drop uploading
    • Quick links to most recently used paths (up to 10 most recent)
    • Cleaner, more streamlined actions bar
    • New Actions Context menu
    • Improved Copy/Move Modals

    and more.

    Query Editors

    The query editors for Impala and Hive are now more secure and convenient to use thanks to new support for LDAP passthrough, SSL encryption with HiveServer2, and automatic query timeout. Plus, there are pretty new graphs.

    And Don’t Forget…

    Kerberos support in the Apache HBase app, Solr Indexer improvements (picks up the Apache ZooKeeper config and gives a hint if pointing to the wrong Solr), a “kill application button” for YARN in Job Browser, and more SDK functionality for building your own apps.

    Look for even more in the next CDH release!

     
