How Cerner Uses CDH with Apache Kafka

Our thanks to Micah Whitacre, a senior software architect on Cerner Corp.’s Big Data Platforms team, for the post below about Cerner’s use case for CDH + Apache Kafka. (Kafka integration with CDH is currently incubating in Cloudera Labs.)

Over the years, Cerner Corp., a leading Healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera’s software platform containing Apache Hadoop and related projects—including HDFS, Apache HBase, Apache Crunch, Apache Hive, and Apache Oozie. Building upon those technologies, we have been able to architect solutions to handle our diverse ingestion and processing requirements.

At various points, however, we reached certain scalability limits and perhaps even abused the intent of certain technologies, causing us to look for better options. By adopting Apache Kafka, Cerner has been able to solidify our core infrastructure, utilizing those technologies as they were intended.

One of the early challenges Cerner faced when building our initial processing infrastructure was moving from batch-oriented processing to technologies that could handle a streaming near-real-time system. Building upon the concepts in Google’s Percolator paper, we built a similar infrastructure on top of HBase. Listeners interested in data of specific types and from specific sources would register interest in data written to a given table. For each write performed, a notification for each applicable listener would be written to a corresponding notification table. Listeners would continuously scan a small set of rows on the notification table looking for new data to process, deleting the notification when complete.

Our low-latency processing infrastructure worked well for a time but quickly reached scalability limits based on its use of HBase. Listener scan performance would degrade without frequent compactions to remove deleted notifications. During the frequent compactions, performance would degrade, causing severe drops in processing throughput. Processing would require frequent reads from HBase to retrieve the notification, the payload, and often supporting information from other HBase tables. The high number of reads would often contend with writes done by our processing infrastructure, which was writing transformed payloads and additional notifications for downstream listeners. The I/O contention and the compaction needs required careful management to distribute the load across the cluster, often segregating the notification tables on isolated region servers.

Adopting Kafka was a natural fit for reading and writing notifications. Instead of scanning rows in HBase, a listener would process messages off of a Kafka topic, updating its offset as notifications were successfully processed. 
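
To make the flow concrete, a notification listener can be mimicked from the command line with the console consumer that ships with Kafka. This is only a rough sketch: the topic name and ZooKeeper quorum below are hypothetical, and Cerner's real listeners are custom consumer applications that commit their offsets as they finish processing.

# Tail a hypothetical per-source notification topic; because offsets are
# tracked per consumer group, replaying is just a matter of resetting them
$ kafka-console-consumer.sh --zookeeper zk1:2181 --topic notifications.sourceA --from-beginning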

Kafka’s natural separation of producers and consumers eliminated contention at the HBase RegionServer due to the high number of notification read and write operations. Kafka’s consumer offset tracking helped to eliminate the need for notification deletes, and replaying notifications became as simple as resetting the offset in Kafka. Offloading the highly transient data from HBase greatly reduced unnecessary overhead from compactions and high I/O. 

Building upon the success of Kafka-based notifications, Cerner then explored using Kafka to simplify and streamline data ingestion. Cerner systems ingest data from multiple disparate sources and systems. Many of these sources are external to our data centers. The “Collector,” a secured HTTP endpoint, identifies and namespaces the data before it is persisted into HBase. Prior to utilizing Kafka, our data ingestion infrastructure targeted a single data store such as an HBase cluster.

The system satisfied our initial use cases but as our processing needs changed, so did the complexity of our data ingestion infrastructure. Data would often need to be ingested into multiple clusters in near real time, and not all data needed the random read/write functionality of HBase.

Utilizing Kafka in our ingestion platform helped provide a durable staging area, giving us a natural way to broadcast the ingested data to multiple destinations. The collector process stayed simple by persisting data into Kafka topics, segregated by source. Pushing data to Kafka resulted in a noticeable improvement as the uploading processes were no longer subject to intermittent performance degradations due to compaction or region splitting with HBase.
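
As a sketch of what source-segregated topics look like operationally (the topic name, partition count, replication factor, and ZooKeeper quorum here are illustrative placeholders, not Cerner's actual settings):

# Create one topic per upstream source so the collector only has to pick a topic
$ kafka-topics.sh --create --zookeeper zk1:2181 --topic ingest.sourceA --partitions 12 --replication-factor 3

# Confirm the per-source topics exist
$ kafka-topics.sh --list --zookeeper zk1:2181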

After data lands in Kafka, Apache Storm topologies push data to consuming clusters independently. Kafka and Storm allow the collector process to remain simple by eliminating the need to deal with multiple writes or the performance impact of the slowest downstream system. Storm’s at-least-once guarantee of delivering the data is acceptable because persistence of the data is idempotent.

The separation that Kafka provides also allows us to aggregate the data for processing as necessary. Some medical data feeds produce a high volume of small payloads that only need to be processed through batch methods such as MapReduce. LinkedIn’s Camus project allows our ingestion platform to persist batches of small payloads within Kafka topics into larger files in HDFS for processing. In fact, all the data we ingest into Kafka is archived into HDFS as Kite SDK Datasets using the Camus project. This approach gives us the ability to perform further analytics and processing that do not require low-latency processing on that data. Archiving the data also provides a recovery mechanism in case data delivery lags beyond the topic retention policies of Kafka.
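
Camus itself runs as a MapReduce job. A minimal sketch of launching it is shown below; the jar name and the contents of camus.properties (broker list, allowed topics, target HDFS path, and so on) are deployment-specific, so treat this as an outline rather than an exact recipe.

# Run Camus to copy recent Kafka messages into time-partitioned files in HDFS
$ hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar \
    com.linkedin.camus.etl.kafka.CamusJob -P camus.properties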

Cerner’s use of Kafka for ingesting data will allow us to continue to experiment and evolve our data processing infrastructure when new use cases are discovered. Technologies such as Spark Streaming, Apache Samza (incubating), and Apache Flume can be explored as alternatives or additions to the current infrastructure. Cerner can prototype Lambda and Kappa architectures for multiple solutions independently without affecting the processes producing data. As Kafka’s multi-tenancy capabilities develop, Cerner can also look to simplify some of its data persistence needs, eliminating the need to push to downstream HBase clusters. 

Overall, Kafka will play a key role in Cerner’s infrastructure for large-scale distributed processing and be a nice companion to our existing investments in Hadoop and HBase.

Micah Whitacre (@mkwhit) is a senior software architect on Cerner Corp.’s Big Data Platforms team, and an Apache Crunch committer.


Guidelines for Installing CDH Packages on Unsupported Operating Systems

Installing CDH on newer unsupported operating systems (such as Ubuntu 13.04 and later) can lead to conflicts. These guidelines will help you avoid them.

Some of the more recently released operating systems that bundle portions of the Apache Hadoop stack in their respective distro repositories can conflict with software from Cloudera repositories. Consequently, when you set up CDH for installation on such an OS, you may end up picking up packages with the same name from the OS’s distribution instead of Cloudera’s distribution. Package installation may succeed, but using the installed packages may lead to unforeseen errors. 

If you are installing packages on Ubuntu 14.04 with CDH 5.2.0 or later (a combination supported by Cloudera) through Cloudera Manager, this issue does not pertain to you—Cloudera Manager takes care of the necessary steps to avoid conflicts. Furthermore, if you are using CDH parcels instead of packages for any release or OS, conflicts are similarly not an issue.

If, however, you are either:

  • Manually installing CDH packages (any release) on a newer unsupported OS (such as Ubuntu 13.04, Ubuntu 13.10, Debian 7.5, Fedora 19, and Fedora 20; refer to the CDH 5 Requirements and Supported Versions guide for an up-to-date list of supported OSs), or
  • Manually installing CDH packages for a release earlier than CDH 5.2.0 on Ubuntu 14.04

then you should find the following good-faith installation guidelines helpful.

(Note: If you are installing CDH 5.2 packages manually on a supported OS like Ubuntu 14.04, the documentation lists the necessary steps you need to take. However, you may still find this blog post useful as background reading.)

The Problem in Action

As explained above, if you are mixing-and-matching packages between distributions, you may easily end up with misleading errors.

For example, here is an error when running hbase shell on Ubuntu 13.04 with CDH 5. Here, the zookeeper package is installed from the OS repositories (Ubuntu 13.04) whereas the hbase package is installed from CDH.

$ hbase shell
2014-09-08 13:27:51,017 INFO  [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.98.1-cdh5.1.2, rUnknown, Mon Aug 25 19:33:59 PDT 2014

hbase(main):001:0> status

ERROR: cannot link Java class org.apache.hadoop.hbase.client.HBaseAdmin, probable missing dependency: org/apache/zookeeper/KeeperException

Does My OS Have This Problem?

In the table below, you will find the mapping of various operating systems and the conflicting packages.

Red means a conflict exists: Installing packages today on the OS would install some package(s) from the OS repo instead of the CDH repo, or worse, result in a mix of packages from the OS and CDH repos. The value of the field represents the package that will be installed from the OS. For example, OS zookeeper means that the zookeeper package would be installed from the OS instead of the CDH repository, which will cause issues.

Orange means no conflict currently exists but that one could arise if the OS repo decides to bump up or change the package version.

If you are using a problematic OS, you will find the solution in the next section.

(Note: Even though Ubuntu 14.04 is listed as a “problematic” OS in the above table, the solution described below is already implemented in Cloudera Manager and described in the documentation. You don’t have to do anything extra if you are using Cloudera Manager or simply following the documentation.)

Solution

The best way to fix this problem is to ensure that all the packages are coming from the same CDH repository. The OS repository is added by default and it’s usually not a good idea to disable that repository. You can, however, set the priority of the CDH repo to be higher than that of the default OS repo. Consequently, if there is a package with the same name in the CDH and the default OS repo, the package from the CDH repository would take precedence over the one in the OS repository regardless of which one has the higher version. This concept is generally referred to as pinning.

For Debian-based OSs (e.g. Ubuntu 13.04, Ubuntu 13.10, Debian 7.5)

Create a file at /etc/apt/preferences.d/cloudera.pref with the following contents:

Package: *
Pin: release o=Cloudera, l=Cloudera
Pin-Priority: 501

No apt-get update is required after creating this file.

For those curious about this solution, the default priority of packages is 500. By creating the file above, you provide a higher priority of 501 to any package that has origin specified as “Cloudera” (o=Cloudera) and is coming from Cloudera’s repo (l=Cloudera), which does the trick.
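
A quick way to confirm that the pin took effect is to re-check the candidate version of one of the overlapping packages; with the preference file above in place, the Cloudera entry should be listed at priority 501 and become the candidate. (A minimal check, using zookeeper as the example package.)

# The "Candidate:" line should now point at the CDH version of the package
$ apt-cache policy zookeeper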

For RPM-based OSs (such as Fedora 19 and Fedora 20)

Install the yum-plugin-priorities package by running:

sudo yum install yum-plugin-priorities

This package enables us to use yum priorities which you will see in the next step.

Then, edit the relevant cloudera-cdh*.repo file under /etc/yum.repos.d/ and add this line at the bottom of that file:

priority = 98

The default priority for all repositories (including the OS repository) is 99, and a lower value takes higher precedence in RHEL/CentOS. By setting the priority to 98, we give the Cloudera repository higher precedence than the OS repository.
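
For reference, here is roughly what a cloudera-cdh5.repo file might look like after the edit. The exact file name, baseurl, and gpgkey depend on the CDH version and OS you configured, so treat this as an illustration rather than the literal contents:

[cloudera-cdh5]
# Cloudera's Distribution for Hadoop, Version 5
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
priority = 98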

For OSs Not on the List

In general, you will have a problem if the OS repository and the CDH repository provide a package with the same name. The most common conflicting packages are zookeeper and hadoop-client, so as a start you need to ascertain whether there is more than one repository delivering those packages.

On a Debian-based system, you can run something like apt-cache policy zookeeper. That command will list all the repositories where the package zookeeper is available. For example, here is the result of running apt-cache policy zookeeper on Ubuntu 13.04:

root@ip-172-26-1-209:/etc/apt/sources.list.d# apt-cache policy zookeeper
zookeeper:
    Installed: (none)
    Candidate: 3.4.5+dfsg-1~exp2
    Version table:
       3.4.5+dfsg-1~exp2 0
          500 http://us-west-1.ec2.archive.ubuntu.com/ubuntu/ raring/universe amd64 Packages
       3.4.5+cdh5.1.0+28-1.cdh5.1.0.p0.208~precise-cdh5.1.0 0
          500 http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/ precise-cdh5/contrib amd64 Packages

As you can see, the package zookeeper is available from two repositories: Ubuntu’s Raring Universe Repository and the CDH repository. So, you have a problem.

On a yum-based system, you can run something like yum whatprovides hadoop-client. That command will list all the repositories where the hadoop-client package is available. For example, here is the result from Fedora 20:

$yum whatprovides hadoop-client
Loaded plugins: priorities
cloudera-cdh4
hadoop-client-2.0.0+1604-1.cdh4.7.0.p0.17.el6.x86_64 : Hadoop client side dependencies
Repo        : cloudera-cdh4

hadoop-client-2.2.0-1.fc20.noarch : Libraries for Hadoop clients
Repo        : fedora

hadoop-client-2.2.0-5.fc20.noarch : Libraries for Apache Hadoop clients
Repo        : updates

As you can see, the package hadoop-client is available from multiple repositories: the Fedora 20 repositories and the CDH repository. Again, that’s a problem.

Conclusion

Managing package repositories that deliver conflicting packages can be tricky. You have to take the above steps on affected operating systems to avoid any conflicts.

To re-iterate, this issue is mostly contained to the manual use of packages on unsupported OSs:

  • If you are using parcels, you don’t have to worry about such problems. On top of that you get easy rolling upgrades.
  • If you are installing packages via Cloudera Manager, you don’t have to worry about such problems since Cloudera Manager takes care of pinning.
  • If the preceding points don’t apply to you, follow the instructions in this blog post to ensure there are no conflicts between CDH and OS packages.

Mark Grover is a Software Engineer on Cloudera Engineering’s Packaging and Integration team, an Apache Bigtop committer, and a co-author of the O’Reilly Media book, Hadoop Application Architectures.

For Apache Hadoop, The POODLE Attack Has Lost Its Bite

A significant vulnerability affecting the entire Apache Hadoop ecosystem has now been patched. What was involved?

By now, you may have heard about the POODLE (Padding Oracle On Downgraded Legacy Encryption) attack on TLS (Transport Layer Security). This attack combines a cryptographic flaw in the obsolete SSLv3 protocol with the ability of an attacker to downgrade TLS connections to use that protocol. The result is that an active attacker on the same network as the victim can potentially decrypt parts of an otherwise encrypted channel. The only immediately workable fix has been to disable the SSLv3 protocol entirely.

POODLE sent many technical people scrambling. Web servers needed configuration changes, software projects using TLS needed to change default behavior, and web browsers moved to phase out SSLv3 support. Cloudera has also taken action.

This blog post provides an overview of the POODLE vulnerability, discusses its impact on Apache Hadoop, and describes the fixes Cloudera pushed forward across the ecosystem.

What is POODLE?

Let’s begin with some background about SSL/TLS terminology: SSL (Secure Sockets Layer) is the former name for what is today called TLS. Between SSLv3 and TLSv1, the protocol was renamed. Even though this happened 15 years ago, the SSL name has stuck around. And even though SSLv3 has long been obsolete, and has been known to have other, lesser, vulnerabilities, its retirement has been drawn out due to the desire to provide backward compatibility for the sake of a smooth user experience.

In the meantime, SSLv3 has been replaced by TLSv1, TLSv1.1, and TLSv1.2. Under normal circumstances, the strongest protocol version that both sides support is negotiated at the start of the connection. However, an active attacker can introduce errors into this negotiation and force a fallback into the weakest protocol version: SSLv3.

POODLE—the attack on SSLv3—was discovered by Bodo Möller, Thai Duong, and Krzysztof Kotowicz at Google. Their report describes how the SSLv3 protocol can be tortured to reveal otherwise encrypted information, one byte at a time. Using the vulnerability, the researchers were able to extract an average of one byte for every 256 SSLv3 connection attempts. This might not sound bad to non-crypto-geeks, but attackers can realistically use it to retrieve session cookies: strings that identify a user in a secure session. If you have a session cookie for someone logged into say, Gmail, you can then gain access to his or her Gmail account.

The attack itself is an excellent piece of work. If you’re interested in more details, I can highly recommend this Imperial Violet blog post and this blog post by Matthew Green. The Wikipedia article on TLS has a huge amount of general information.

Leashing POODLE

One common thread between the Hadoop community and the security research community is the habit of devising creative project names; a clever acronym or portmanteau seems to be at least as valuable as the code or exploits themselves. In that spirit I bring you HADOODLE: fixes for POODLE across the Hadoop ecosystem.

As you all know, the Hadoop platform isn’t one project; rather, it’s a confederation of many different projects, all interoperating and cooperating in the same environment. The word “ecosystem” is overused, but describes the situation perfectly. This can be a great thing, because it lets multiple different groups solve a variety of problems independently of each other. It’s also a great model for fast-paced innovation. However, it can be problematic for the sort of pervasive changes required by security vulnerabilities.

The loose confederation of projects means that there are several different web servers and other usages of TLS. Rounding up and fixing POODLE meant educating and coordinating changes among 12 different projects comprising five different types of web servers and three programming languages. While conceptually simple, “turning off SSLv3” was done slightly differently for each of these technologies and required an awareness of TLS idiosyncrasies. Beyond the full system-level tests (provided by services like Apache Bigtop), every component of the ecosystem needed to be individually scrutinized to ensure that POODLE was really fixed. All in all, it was a fair amount of work.
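
If you want to double-check an individual endpoint after patching, one simple approach is to attempt an SSLv3-only handshake against it; a fixed service should refuse the connection. This is only a sketch: substitute the host and port of whichever TLS-enabled web UI or service you want to test.

# Attempt an SSLv3-only handshake; a patched service should reject it
$ openssl s_client -connect namenode.example.com:50470 -ssl3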

I’m happy to say that Cloudera took up the challenge, and today we’re able to announce patches for every current release of CDH and Cloudera Manager. Cloudera engineers contributed the following HADOODLE fixes upstream:

Cloudera also has fixes for Apache HBase, Impala, and Cloudera Manager. Every component not mentioned does not yet support TLS and hence is not vulnerable.

The table below shows the releases where the POODLE fixes are first available.

This issue is also described in our Technical Service Bulletin #37 (TSB-37). If you run a secure Hadoop cluster, I strongly recommend upgrading to the appropriate patch release above. 

Michael Yoder is a Software Engineer at Cloudera.

New in CDH 5.2: Improvements for Running Multiple Workloads on a Single HBase Cluster

These new Apache HBase features in CDH 5.2 make multi-tenant environments easier to manage.

Historically, Apache HBase has treated all tables, users, and workloads with equal weight. This approach is sufficient for a single workload, but when multiple users and workloads share the same cluster or table, conflicts can arise. Fortunately, starting with HBase in CDH 5.2 (HBase 0.98 + backports), workloads and users can now be prioritized.

One can categorize the approaches to this multi-tenancy problem in three ways:

  • Physical isolation or partitioning – each application or workload operates on its own table and each table can be assigned to a set of machines.
  • Scheduling – applications and workloads are scheduled based on access patterns or time and resources needed to complete.
  • Quotas – ad-hoc queries against a table are limited so that the table can be shared with other applications.

In this post, I’ll explain three new HBase mechanisms (see umbrella JIRA HBASE-10994 – HBase Multitenancy) focused on enabling some of the approaches above.

Throttling

In a multi-tenant environment, it is useful to enforce manual limits that prevent users from abusing the system. (A simple example is: “Let MyApp run as fast as possible and limit all the other users to 100 requests per second.”)

The new throttling feature in CDH 5.2 (HBASE-11598 – Add rpc throttling) allows an admin to enforce a limit on the number of requests or the amount of data per unit of time for a specified user, table, or namespace. Some examples are:

  • Throttle Table A to X req/min
  • Throttle Namespace B to Y req/hour
  • Throttle User K on Table Z to KZ MB/sec

An admin can also change the throttle at runtime. The change will propagate after the quota refresh period has expired; the default refresh period is currently 5 minutes and is configurable by modifying the hbase.quota.refresh.period property in hbase-site.xml. In future releases, a notification will be sent to apply the changes instantly.
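
For reference, both the quota subsystem and the refresh period are controlled from hbase-site.xml. The snippet below is a minimal sketch; the 60-second refresh shown is just an example value, not a recommendation.

<!-- Enable the quota subsystem on the cluster -->
<property>
  <name>hbase.quota.enabled</name>
  <value>true</value>
</property>
<!-- How often (in milliseconds) RegionServers pick up quota changes -->
<property>
  <name>hbase.quota.refresh.period</name>
  <value>60000</value>
</property>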

In the chart below, you can see an example of the results of throttling. 

Initially, User 1 and User 2 are unlimited, and then the admin decides that the User 1 job is more important and throttles the User 2 job, reducing contention with the User 1 requests.

The shell allows you to specify the limit in a descriptive way (e.g. LIMIT => 10req/sec or LIMIT => 50M/sec). To remove the limit, use LIMIT => NONE.

Examples:

$ hbase shell
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10M/sec'
hbase> set_quota TYPE => THROTTLE, USER => 'u1', TABLE => 't2', LIMIT => '5K/min'
hbase> set_quota TYPE => THROTTLE, USER => 'u1', NAMESPACE => 'ns2', LIMIT => NONE
hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/sec'
hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10M/sec'
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE

You can also place a global limit and exclude a user or a table from the limit by applying the GLOBAL_BYPASS property. Consider a situation with a production workload and many ad-hoc workloads. You can choose to set a limit for all the workloads except the production one, reducing the impact of the ad-hoc queries on the production workload.

$ hbase shell
hbase> set_quota NAMESPACE => 'ns1', LIMIT => '100req/min'
hbase> set_quota USER => 'u1', GLOBAL_BYPASS => true

Note that the throttle is always enforced; even when the production workload is currently inactive, the ad-hoc requests are all throttled.

Request Queues

Assuming no throttling policy is in place, when the RegionServer receives multiple requests, they are now placed into a queue waiting for a free execution slot (HBASE-6721 – RegionServer Group based Assignment).

The simplest queue is a FIFO queue, which means that each request has to wait for the completion of all the requests in the queue before it. And, as you can see from the picture below, fast/interactive queries can get stuck behind large requests. (To keep the example simple, let’s assume that there is a single executor.)

One solution would be to divide the large requests into small requests and interleave each chunk with other requests, allowing multiple requests to make progress. The current infrastructure doesn’t allow that; however, if you are able to guess how long a request will take to be served, you can reorder requests—pushing the long requests to the end of the queue and allowing short requests to jump in front of longer ones. At some point, though, you have to execute the large requests, and new requests will then be prioritized behind them. Because those short requests are newer, the result is not as bad as the FIFO case, but it is still suboptimal compared to the approach described above, in which large requests are split into multiple smaller ones.

Deprioritizing Long-running Scanners

Along the lines of what we described above, CDH 5.2 has a “fifo” queue and a new queue type called “deadline,” configurable by setting the hbase.ipc.server.callqueue.type property (HBASE-10993 – Deprioritize long-running scanners). Currently there is no way to estimate how long each request may take, so de-prioritization only affects scans and is based on the number of “next” calls a scan request has made. This assumes that when you are doing a full table scan, your job is probably not that interactive; so, if there are concurrent requests, you can delay long-running scans up to a limit tunable by setting the hbase.ipc.server.queue.max.call.delay property. The slope of the delay is calculated by a simple square root of (numNextCall * weight), where the weight is configurable by setting the hbase.ipc.server.scan.vtime.weight property.
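
Put together, a sketch of the relevant hbase-site.xml settings might look like the following; the delay and weight values are illustrative starting points rather than recommendations.

<!-- Use the "deadline" call queue so long-running scans get deprioritized -->
<property>
  <name>hbase.ipc.server.callqueue.type</name>
  <value>deadline</value>
</property>
<!-- Upper bound on how long a scan request can be delayed -->
<property>
  <name>hbase.ipc.server.queue.max.call.delay</name>
  <value>5000</value>
</property>
<!-- Weight applied to the number of next() calls when computing the delay -->
<property>
  <name>hbase.ipc.server.scan.vtime.weight</name>
  <value>1.0</value>
</property>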

Multiple-Typed Queues

Another way you can prioritize/deprioritize different kinds of requests is by having a specified number of dedicated handlers and queues. That way you can segregate the scan requests in a single queue with a single handler, and all the other available queues can service short Get requests.

Currently, some static tuning options are available to adjust the ipc queues/handlers based on the type of workload. This approach is an interim first step that will eventually allow you to change the settings at runtime as you do for throttling, and to enable dynamically adjusting values based on the load.

Multiple Queues

To avoid contention and separate different kinds of requests, a new property, hbase.ipc.server.callqueue.handler.factor, allows admins to increase the number of queues and decide how many handlers share the same queue (HBASE-11355 – Multiple Queues / Read-Write Queues).

Having more queues, such as one queue per handler, reduces contention when adding a task to a queue or selecting it from a queue. The trade-off is that if you have some queues with long-running tasks, a handler may end up waiting to execute from that queue rather than stealing from another queue which has waiting tasks.

Read and Write

With multiple queues, you can now divide read and write requests, giving more priority (queues) to one or the other type. Use the hbase.ipc.server.callqueue.read.ratio property to choose to serve more reads or writes (HBASE-11724 Short-Reads/Long-Reads Queues).

Similar to the read/write split, you can split gets and scans by tuning the hbase.ipc.server.callqueue.scan.ratio to give more priority to gets or to scans. The chart below shows the effect of the settings.

A scan ratio of 0.1 will give more queues/handlers to the incoming gets, which means that more of them can be processed at the same time and that fewer scans can be executed concurrently. A value of 0.9 will give more queues/handlers to scans, so the number of scan requests served will increase and the number of gets will decrease.
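
As a sketch, the three queue-related properties discussed above might be combined in hbase-site.xml as shown below; the ratios are examples only and need to be tuned against your actual mix of gets, scans, and writes.

<!-- 0 means all handlers share one queue; 1 means one queue per handler -->
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.1</value>
</property>
<!-- Fraction of the queues/handlers reserved for reads (the rest serve writes) -->
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.6</value>
</property>
<!-- Of the read queues, the fraction reserved for scans (the rest serve gets) -->
<property>
  <name>hbase.ipc.server.callqueue.scan.ratio</name>
  <value>0.1</value>
</property>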

Future Work

Aside from addressing the current limitations mentioned above (static conf, unsplittable large requests, and so on) and doing things like limiting the number of tables that a user can create or using the namespaces more, a couple of major new features on the roadmap will further improve interaction between multiple workloads:

  • Per-user queues: Instead of a global setting for the system, a more advanced way to schedule requests is to allow each user to have its own “scheduling policy” allowing each user to define priorities for each table, and allowing each table to define request-types priorities. This would be administered in a similar way to throttling.
  • Cost-based scheduling: Request execution can take advantage of the known system state to prioritize and optimize scheduling. For example, one could prioritize requests that are known to be served from cache, prefer concurrent execution of requests that are hitting two different disks, prioritize requests that are known to be short, and so on.
  • Isolation/partitioning: Separating workloads onto different machines is useful in situations where the admin understands the workload of each table and how to manually separate them. The basic idea is to reserve enough resources to run everything smoothly. (The only way to achieve that today is to set up one cluster per use case.)

Conclusion

Based on the above, you should now understand how to improve the interaction between different workloads using this new functionality. Note however that these features are only down payments on what will become more robust functionality in future releases.

Matteo Bertozzi is a Software Engineer at Cloudera and an HBase committer/PMC member.

Cloudera Enterprise 5.3 is Released

We’re pleased to announce the release of Cloudera Enterprise 5.3 (comprising CDH 5.3, Cloudera Manager 5.3, and Cloudera Navigator 2.2).

This release continues the drumbeat for security functionality in particular, with HDFS encryption (jointly developed with Intel under Project Rhino) now recommended for production use. This feature alone should justify upgrades for security-minded users (and an improved CDH upgrade wizard makes that process easier).

Here are some of the highlights (incomplete; see the respective Release Notes for CDH, Cloudera Manager, and Cloudera Navigator for full lists of features and fixes):

Security

  • Folder-level HDFS encryption (in addition to storage, management, and access to encryption zone keys) is now a production-ready feature (HDFS-6134). This feature integrates with Navigator Key Trustee so that encryption keys can be securely stored separately from the data, with all the enterprise access and audit controls required to pass most security compliance audits such as PCI.
  • The Cloudera Manager Agent can now be run as a single configured user when running as root is not permitted.
  • In Apache Sentry (incubating), data can now be shared across Impala, Apache Hive, Search, and other access methods such as MapReduce using only Sentry permissions.
  • A Sentry bug that affected CDH 5.2 upgrades has been patched (SENTRY-500).

Data Management and Governance

  • In Cloudera Navigator 2.2, policies are now generally available and enabled by default. Policies let you set, monitor and enforce data curation rules, retention guidelines, and access permissions. They also let you notify partner products, such as profiling and data preparation tools, whenever there are relevant changes to metadata.
  • Navigator 2.2’s REST API now supports user-defined relations. Using these new APIs, you can augment Navigator’s automatically-generated lineage with your own column-level lineage. This is particularly useful for custom MapReduce jobs that run on structured data sources.
  • Navigator 2.2 also features many top-requested enhancements, including metadata search auto-suggest and a number of other usability improvements.

Cloud Deployments

  • Cloudera Enterprise 5.3 is now a first-class citizen with respect to deployments on Microsoft Azure.
  • Apache Hadoop gets a new S3-native filesystem for improved performance on AWS (HADOOP-10400).

Real-Time Architecture

  • Apache Flume now includes an Apache Kafka Channel for tighter Kafka-Flume integration (FLUME-2500).
  • Apache HBase performance is significantly improved thanks to updated defaults (HBASE-2611, HBASE-12529).

New or Updated Open Source Components

  • Apache Spark 1.2
  • Hue 3.7
  • Impala 2.1

Other notables: Oracle JDK 1.8 is now supported, Impala now does incremental computation of table and column statistics (IMPALA-1122), and Apache Avro has new date, time, timestamp, and duration binary types (AVRO-739).

Over the next few weeks, we’ll publish blog posts that cover some of these features in detail. In the meantime:

As always, we value your feedback; please provide any comments and suggestions through our community forums. You can also file bugs via issues.cloudera.org.

New in Cloudera Manager 5.3: Easier CDH Upgrades

An improved upgrade wizard in Cloudera Manager 5.3 makes it easy to upgrade CDH on your clusters.

Upgrades can be hard, and any downtime to mission-critical workloads can have a direct impact on revenue. Upgrading the software that powers these workloads can often be an overwhelming and uncertain task that can create unpredictable issues. Apache Hadoop can be especially complex as it consists of dozens of components running across multiple machines. That’s why an enterprise-grade administration tool is necessary for running Hadoop in production, and is especially important when taking the upgrade plunge.

Cloudera Manager makes it easy to upgrade to the latest version of CDH. Not only does Cloudera Manager have a built-in upgrade wizard to make your CDH upgrades simple and predictable, it also features rolling-restart capability that enables zero-downtime upgrades under certain conditions.

This post illustrates how to leverage Cloudera Manager to upgrade your Cloudera cluster using the upgrade wizard, and also highlights some of the new features in Cloudera Enterprise 5.3.

Why a Wizard?

Upgrading can involve many steps that can depend on the services installed and the start/end versions. A wizard to upgrade across major versions (CDH 4 to CDH 5) has been available since Cloudera Manager 5. Cloudera Manager 5.3 introduces an enhanced CDH upgrade wizard that adds support for minor (CDH 5.x to CDH 5.y) and maintenance (CDH 5.b.x to CDH 5.b.y) version upgrades. The enhanced upgrade wizard performs service-specific upgrade steps that you would have had to run manually in the past.

Parcel and package installations are both supported by the CDH upgrade wizard. Using parcels is the preferred and recommended way, as packages must be manually installed, whereas parcels are installed by Cloudera Manager. Consider upgrading from packages to parcels so that the process is more automated, supports rolling upgrades, and provides an easier upgrade experience. (See this FAQ and this blog post to learn more about parcels.)

If you use parcels, have a Cloudera Enterprise license, and have enabled HDFS high availability, you can perform a rolling upgrade for non-major upgrades. This enables you to upgrade your cluster software and restart the upgraded services without incurring any cluster downtime. Note that it is not possible to perform a rolling upgrade from CDH 4 to CDH 5 because of incompatibilities between the two major versions.

Running the Upgrade Wizard

The Cloudera Manager version must always be equal to or greater than the CDH version you upgrade to. For example, to upgrade to CDH 5.3, you must be on Cloudera Manager 5.3 or higher.

  1. Log in to the Cloudera Manager Admin Console.
  2. To access the wizard, on the Home page, click the cluster’s drop down menu, and select “Upgrade Cluster.”

  3. Alternately, you can trigger the wizard from the Parcels page, by first downloading and distributing a parcel to upgrade to, and then selecting the “Upgrade” button for this parcel.

  4. When starting from the cluster’s Upgrade option, if the option to pick between packages and parcels is provided, click the “Use Parcels” radio button. Select the CDH version.

    If there are no qualifying parcels, the location of the parcel repository will need to be added under “Parcel Configuration Settings.”

  5. The wizard will now prompt you to back up existing databases. It will provide examples of additional steps to prepare your cluster for upgrade. Please read the upgrade documentation for a more complete list of actions to be taken at this stage, before proceeding with the upgrade. Check “Yes” for all required actions to be able to “Continue.”

  6. The wizard now performs consistency and health checks on all hosts in the cluster. This feature is particularly helpful if you have mismatched versions of packages across cluster hosts. If any problems are found, you will be prompted to fix these before continuing.

  7. The selected parcel is downloaded and distributed to all hosts.

     

  8. For major upgrades, the wizard will warn that the services are about to be shut down for the upgrade.

    For minor and maintenance upgrades, if you are using parcels and have HDFS high availability enabled, you will have the option to select “Rolling Upgrade” on this page. Supported services will undergo a rolling restart, while the rest will undergo a normal restart, with some downtime. Check “Rolling Upgrade” to proceed with this option.

    Until this point, you can exit and resume the wizard without impacting any running services.

  9. The Command Progress screen displays the results of the commands run by the wizard as it shuts down all services, activates the new parcel, upgrades services, deploys client configuration files, and restarts services.

    The service commands include upgrading HDFS metadata, upgrading the Apache Oozie database and installing ShareLib, upgrading the Apache Sqoop server and Hive Metastore, among other things.

  10. The Host Inspector runs to validate all hosts, as well as report CDH versions running on them.

  11. At the end of the wizard process, you are prompted to finalize the HDFS metadata upgrade. It is recommended at this stage to refer to the upgrade documentation for additional steps that might be relevant to your cluster configuration and upgrade path.

    For major (CDH 4 to CDH 5) upgrades, you have the option of importing your MapReduce configurations into your YARN service. Additional steps in the wizard will assist with this migration. On completion, Cloudera recommends that you review the YARN configurations for any additional tuning you might need.

  12. Your upgrade is now complete!

Next Steps

Cloudera Enterprise 5 provides additional enterprise-ready capabilities and marks the next step in the evolution of the Hadoop-based data management platform. Any enhancements are ineffective if the benefits of the enterprise data hub are not easily accessible to existing users. That’s why Cloudera has placed an increased emphasis on the upgrade experience, to make it easier to upgrade to the latest version of the software. The team will continue to work on making improvements to this experience.

To ensure the highest level of functionality and stability, consider upgrading to the most recent version of CDH.

Please refer to the upgrade documentation for more comprehensive details on using the CDH upgrade wizard. Also, register for the “Best Practices for Upgrading Hadoop in Production” webinar that will occur live on Feb. 12, 2015.

Jayita Bhojwani is a Software Engineer at Cloudera.

Vala Dormiani is a Product Manager at Cloudera.

Checklist for Painless Upgrades to CDH 5

Following these best practices can make your upgrade path to CDH 5 relatively free of obstacles.

Upgrading the software that powers mission-critical workloads can be challenging in any circumstance. In the case of CDH, however, Cloudera Manager makes upgrades easy, and the built-in Upgrade Wizard, available with Cloudera Manager 5, further simplifies the upgrade process. The wizard performs service-specific upgrade steps that, previously, you had to run manually, and also features a rolling restart capability that reduces downtime for minor and maintenance version upgrades. (Please refer to this blog post or webinar to learn more about the Upgrade Wizard).

As you prepare to upgrade your cluster, keep this checklist of some of Cloudera’s best practices and additional recommendations in mind. Please note that this information is a complement to, not a replacement for, the comprehensive upgrade documentation.

Backing Up Databases

You will need to take backups prior to the upgrade. It is recommended that you already have procedures in place to periodically back up your databases. Prior to upgrading, be sure to:

  1. Back up the Cloudera Manager server and management databases that store configuration, monitoring, and reporting data. (These include the databases that contain all the information about what services you have configured, their role assignments, all configuration history, commands, users, and running processes.)
  2. Back up all databases (if you don’t already have regularly scheduled backup procedures), including the Apache Hive Metastore Server, Apache Sentry server (contains authorization metadata), Cloudera Navigator Audit Server (contains auditing information), Cloudera Navigator Metadata Server (contains authorization, policies, and audit report metadata), Apache Sqoop Metastore, Hue, Apache Oozie, and Apache Sqoop.
  3. Back up NameNode metadata by locating the NameNode Data Directories in the HDFS service and backing up a listed directory. (You only need to back up one directory if more than one is listed.)

Note: Cloudera Manager provides an integrated, easy-to-use management solution for enabling Backup and Disaster Recovery and the key capabilities are fully integrated into the Cloudera Manager Admin Console. It also is automated and fault tolerant.

Cloudera Manager makes it easy to manage data stored in HDFS and accessed through Hive. You can define your backup and disaster recovery policies and apply them across services. You can select the key datasets that are critical to your business, schedule and track the progress of data replication jobs, and get notified when a replication job fails. Replication can be set up on files or directories in the case of HDFS and on tables in the case of Hive. Hive metastore information is also replicated which means that table definitions are updated. (Please refer to the BDR documentation for more details.)

A separate Disaster Recovery cluster is not required for a safe upgrade but the Backup and Disaster Recovery capability in Cloudera Manager can ease the upgrade process by ensuring the critical parts of your infrastructure are backed up before you take the upgrade plunge.

Recommended Practices for Upgrading to CDH 5

  1. Create a fine-grained, step-by-step production plan for critical upgrades (using the Upgrade Documentation as a reference).
  2. Document the current deployment by chronicling the existing cluster environment and dependencies, including:
    • The current CDH and Cloudera Manager versions that are installed
    • All third-party tools that interact with the cluster
    • The databases for Cloudera Manager, Hive, Oozie, and Hue
    • Important job performance metrics so pre-upgrade baselines are well defined
  3. Test the production upgrade plan in a non-production environment (e.g. a sandbox or test environment) so you can update the plan if there are unexpected outcomes. It also allows you to:
    • Test job compatibility with the new version
    • Run performance tests
  4. Upgrade to Cloudera Manager 5 before upgrading to CDH 5.
    • Ensure the Cloudera Manager minor version is equal to or greater than the target CDH minor version—the Cloudera Manager version must always be equal to or greater than the CDH version to which you upgrade.
  5. Reserve a maintenance window with enough time allotted to perform all steps.
    • For a major upgrade on production clusters, Cloudera recommends allocating up to a full-day maintenance window to perform the upgrade (but time is dependent on the number of hosts, the amount of Hadoop experience, and the particular hardware). Note that it is not possible to perform a rolling upgrade from CDH 4 to CDH 5 (major upgrade) due to incompatibilities between the two major versions.
  6. Maintain your own local Cloudera Manager and CDH package/parcel repositories to protect against external repositories being unavailable.
    • Read the reference documentation for details on how to create a local Yum repository, or
    • Pre-download a parcel to a local parcel repository on the Cloudera Manager server, where it is available for distribution to the other nodes in any of your clusters managed by this Cloudera Manager server. You can have multiple parcels for a given product downloaded to your Cloudera Manager server. Once a parcel has been downloaded to the server, it will be available for distribution on all clusters managed by the server. (Note: Parcel and package installations are equally supported by the Upgrade Wizard. Using parcels is the preferred and recommended way, as packages must be manually installed, whereas parcels are installed by Cloudera Manager. See this FAQ and this blog post to learn more about parcels.)
  7. Ensure there are no Oozie workflows in RUNNING or SUSPENDED status, as the Oozie database upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows. (Note: When upgrading from CDH 4 to CDH 5, the Oozie upgrade can take a very long time. You can reduce this time by reducing the amount of history Oozie retains; see the documentation.)
  8. Import MapReduce configurations to YARN as part of the Upgrade Wizard. (Note: If you do not import configurations during upgrade, you can manually import the configurations at a later time. In addition to importing configuration settings, the import process will configure services to use YARN as the MapReduce computation framework instead of MapReduce and overwrites existing YARN configuration and role assignments.)

Conclusion

These recommendations and notable points to address when planning an upgrade to a Cloudera cluster are intended to complement the upgrade documentation that is provided for Cloudera Manager and CDH. As mentioned, Cloudera Manager streamlines the upgrade process and strives to prevent job failures by making upgrades simple and predictable—which is especially necessary for production clusters.

Cloudera’s enterprise data hub is constantly evolving with more production-ready capabilities and innovative tools. To ensure the highest level of functionality and stability, consider upgrading to the most recent version of CDH.

    Cloudera Enterprise 5.4 is Released

    We’re pleased to announce the release of Cloudera Enterprise 5.4 (comprising CDH 5.4, Cloudera Manager 5.4, and Cloudera Navigator 2.3).

    Cloudera Enterprise 5.4 (Release Notes) reflects critical investments in a production-ready customer experience through  governance, security, performance and deployment flexibility in cloud environments. It also includes support for a significant number of updated open standard components–including Apache Spark 1.3, Impala 2.2, and Apache HBase 1.0 (as well as unsupported beta releases of Hive-on-Spark data processing and OpenStack deployments).

    Recently Cloudera made the upgrade process considerably easier via an improved CDH upgrade wizard; see details about that wizard here and best practices here. (Note: Due to metadata format changes in Apache Hadoop 2.6, upgrading to CDH 5.4.0 and later from any earlier release requires an HDFS metadata upgrade, as well. Note also that for parcel-based installations of CDH 5.4, Cloudera Manager 5.4 is required.)

    Here are some of the highlights (incomplete; see the respective Release Notes for CDH, Cloudera Manager, and Cloudera Navigator for full lists of features and fixes):

    Security

    • SSL and Kerberos support in Apache Flume for the Thrift source and sink.
    • SSL support across Cloudera Search (Solr and all integrations with CDH).
    • Cluster-wide redaction of sensitive data in logs is now possible.
    • HBase impersonation in Hue allows your client to authenticate to HBase as any user, and to re-authenticate at any time.
    • Solr metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs.
    • Kerberos is now supported for Apache Sqoop2.

    Performance

    • Includes beta release of Hive-on-Spark as an option for improved Hive data processing performance (unsupported).
    • MultiWAL support for HBase RegionServers allows you to increase throughput when a region writes to the write-ahead log (WAL).
    • You can now store medium-sized objects (MOBs) up to 10MB in size directly in HBase while maintaining read and write performance.
    • A new Kafka connector for Spark Streaming avoids the need for the HDFS WAL.
    • Hue pages render much faster.

    Data Management and Governance

    • Expanded coverage in Cloudera Navigator
      • Impala (CDH 5.4 and higher) lineage
      • Cloudera Search (CDH 5.4 and higher) auditing
      • Auditing of Navigator activity, such as audit views, metadata searches, and policy editing
      • Avro and Parquet schema inference
    • Platform enhancements
      • Redesigned metadata search provides autocomplete, faster filtering, and saved searches
      • SAML for single sign-on

    Cloud Deployments

    • OpenStack deployments are now possible as an unsupported beta.
    • HBase support on Microsoft Azure.

    Real-Time Architecture

    • Cloudera Distribution of Apache Kafka 1.3 installs by default and is supported for production use.
    • Spark Streaming now has a receiver-less “direct” connector for Kafka.

    New or Updated Open Source Components

    • Apache Hadoop 2.6
    • Apache HBase 1.0
    • Apache Hive 1.1
    • Apache Oozie 4.1
    • Apache Solr 4.10.3
    • Apache Spark 1.3
    • Cloudera Distribution of Apache Kafka 1.3
    • Hue 3.7
    • Impala 2.2
    • Kite SDK 1.0

    New/Updated OS & Java Support

    • RHEL 6.6/CentOS 6.6/OEL 6.6 (UEK3)
    • JDK8u40, JDK7u75

    Over the next few weeks, we’ll publish blog posts that cover some of these features in detail. In the meantime:

    As always, we value your feedback; please provide any comments and suggestions through our community forums. You can also file bugs via issues.cloudera.org.


    New in CDH 5.4: Apache HBase Request Throttling

    The following post about the new request throttling feature in HBase 1.1 (now shipping in CDH 5.4) originally published in the ASF blog. We re-publish it here for your convenience.

    Running multiple workloads on HBase has always been challenging, especially  when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.

    A new QoS (quality of service) feature that Apache HBase 1.1 introduces is request throttling, which controls the rate at which requests get handled by an HBase cluster. HBase typically treats all requests identically; however, the new throttling feature can be used to specify a maximum rate or bandwidth to override this behavior. The limit may be applied to requests originating from a particular user, or alternatively, to requests directed to a given table or a specified namespace.

    The objective of this post is to evaluate the effectiveness of this feature and the overhead it might impose on a running HBase workload. The performance runs carried out showed that throttling works very well, by redirecting resources from a user whose workload is throttled to the workloads of other users, without incurring a significant overhead in the process.

    Enabling Request Throttling

    It is straightforward to enable the request-throttling feature—all that is necessary is to set the HBase configuration parameter hbase.quota.enabled to true. The related parameter hbase.quota.refresh.period specifies the time interval in milliseconds at which the RegionServer re-checks for any new restrictions that have been added.

    The throttle can then be set from the HBase shell, like so:

    hbase> set_quota TYPE => THROTTLE, USER => 'uname', LIMIT => '100req/sec'
    hbase> set_quota TYPE => THROTTLE, TABLE => 'tbl', LIMIT => '10M/sec'
    hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns', LIMIT => 'NONE'
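
    To review the throttles currently in force, the shell also provides a list_quotas command; shown here as a minimal sketch, since output formatting varies by HBase version.

    hbase> list_quotas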

    Test Setup

    To evaluate how effectively HBase throttling worked, a YCSB workload was imposed on a 10 node cluster. There were 6 RegionServers and 2 master nodes. YCSB clients were run on the 4 nodes that were not running RegionServer processes. The client processes were initiated by two separate users and the workload issued by one of them was throttled.

    More details on the test setup follow.

    HBase version:

    HBase 0.98.6-cdh5.3.0-SNAPSHOT (HBASE-11598 was backported to this version)

    Configuration:
    CentOS release 6.4 (Final)
    CPU sockets: 2
    Physical cores per socket: 6
    Total number of logical cores: 24
    Number of disks: 12
    Memory: 64GB
    Number of RS: 6
    Master nodes: 2  (for the Namenode, Zookeeper and HBase master)
    Number of client nodes: 4
    Number of rows: 1080M
    Number of regions: 180
    Row size: 1K
    Threads per client: 40
    Workload: read-only and scan
    Key distribution: Zipfian
    Run duration: 1 hour

    Procedure

    An initial data set was first generated by running YCSB in its data generation mode. An HBase table was created with the table specifications above and pre-split. After all the data was inserted, the table was flushed, compacted and saved as a snapshot. This data set was used to prime the table for each run. Read-only and scan workloads were used to evaluate performance; this eliminates effects such as memstore flushes and compactions. One run with a long duration was carried out first to ensure the caches were warmed and that the runs yielded repeatable results.

    For the purpose of these tests, the throttle was applied to the workload emanating from one user in a two-user scenario. There were four client machines used to impose identical read-only workloads. The client processes on two machines were run by the user “jenkins”, while those on the other two were run as a different user. The throttle was applied to the workload issued by this second user. There were two sets of runs, one with both users running read workloads and the second where the throttled user ran a scan workload. Typically, scans are long running and it can be desirable on occasion to de-prioritize them in favor of more real-time read or update workloads. In this case, the scan was for sets of 100 rows per YCSB operation.
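
    For readers who want to reproduce a similar setup, a YCSB client run of this shape can be launched roughly as follows. This is only a sketch: the HBase binding name depends on the YCSB build in use, the table and column family names are placeholders, and workloadc is YCSB's stock read-only workload.

    # Read-only workload from one client node: 40 threads against a pre-loaded table
    $ bin/ycsb run hbase098 -P workloads/workloadc \
        -p table=usertable -p columnfamily=family \
        -p operationcount=10000000 -threads 40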

    For each run, the following steps were carried out:

    • Any existing YCSB-related table was dropped.

    • The initial data set was cloned from the snapshot.

    • The desired throttle setting was applied.

    • The desired workloads were imposed from the client machines.

    • Throughput and latency data was collected and is presented in the table below.

     

    The throttle was applied at the start of the job (the command used was the first in the list shown in the “Enabling Request Throttling” section above). The hbase.quota.refresh.period property was set to under a minute so that the throttle took effect by the time test setup was finished.

    The throttle option specifically tested here was the one to limit the number of requests (rather than the one to limit bandwidth).

    Observations and Results

    The throttling feature appears to work quite well. When applied interactively in the middle of a running workload, it goes into effect immediately after the quota refresh period and can be observed clearly in the throughput numbers put out by YCSB while the test is progressing. The table below has performance data from test runs indicating the impact of the throttle. For each row, the throughput and latency numbers are also shown in separate columns, one set for the “throttled” user (indicated by “T” for throttled) and the other for the “non-throttled” user (represented by “U” for un-throttled).

     

    Read + Read Workload

    As can be seen, when the throttle pressure is increased (by reducing the permitted throughput for user “T” from 2500 req/sec to 500 req/sec, as shown in column 1), the total throughput (column 2) stays around the same.  In other words, the cluster resources get redirected to benefit the non-throttled user, with the feature consuming no significant overhead. One possible outlier is the case where the throttle parameter is at its most restrictive (500 req/sec), where the total throughput is about 10% less than the maximum cluster throughput.

    Correspondingly, the latency for the non-throttled user improves while that for the throttled user degrades. This is shown in the last two columns in the table.

    The charts above show that the change in throughput is linear with the amount of throttling, for both the throttled and non-throttled user. With regard to latency, the change is generally linear, until the throttle becomes very restrictive; in this case, latency for the throttled user degrades substantially.

    One point that should be noted is that, while the throttle parameter in req/sec is indeed correlated to the actual restriction in throughput as reported by YCSB (ops/sec) as seen by the trend in column 4, the actual figures differ. As user “T”’s throughput is restricted down from 2500 to 500 req/sec, the observed throughput goes down from 2500 ops/sec to 1136 ops/sec. Therefore, users should calibrate the throttle to their workload to determine the appropriate figure to use (either req/sec or MB/sec) in their case.
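    To make that calibration concrete, here is a minimal Python sketch (the numbers are illustrative, taken loosely from the run above) showing how a measured pair of configured and observed throughput values can be scaled toward a new observed target, assuming the roughly linear behavior described below:

    # Hypothetical calibration helper; all figures below are illustrative only.
    def throttle_setting(target_ops_per_sec, configured_req_per_sec, observed_ops_per_sec):
        """Scale a measured (configured, observed) pair to hit a new observed target."""
        ratio = observed_ops_per_sec / float(configured_req_per_sec)  # ops observed per req/sec allowed
        return target_ops_per_sec / ratio

    # Example: a 500 req/sec throttle yielded roughly 1136 ops/sec in this test, so to hold
    # the throttled user near 800 ops/sec you would set the quota to roughly 352 req/sec.
    print(round(throttle_setting(800, 500, 1136)))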

     

    Read + Scan Workload

     

    With the read/scan workload, similar results are observed as in the read/read workload. As the extent of throttling is increased for the long-running scan workload, the observed throughput decreases and latency increases. Conversely, the read workload benefits, displaying better throughput and improved latency. Again, the specific numeric value used to specify the throttle needs to be calibrated to the workload at hand. Since scans break down into a large number of read requests, the throttle parameter needs to be much higher than in the case with the read workload. Shown above is a log-linear chart of the impact on throughput of the two workloads when the extent of throttling is adjusted.

    Conclusion

    HBase request throttling is an effective and useful technique to handle multiple workloads, or even multi-tenant workloads on an HBase cluster. A cluster administrator can choose to throttle long-running or lower-priority workloads, knowing that RegionServer resources will get re-directed to the other workloads, without this feature imposing a significant overhead. By calibrating the throttle to the cluster and the workload, the desired performance can be achieved on clusters running multiple concurrent workloads.

    Govind Kamat is a Performance Engineer at Cloudera, and an HBase contributor.

    New in CDH 5.4: Sensitive Data Redaction


    The best data protection strategy is to remove sensitive information from everyplace it’s not needed.

    Have you ever wondered what sort of “sensitive” information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever “leak” into a log file outside of HDFS? What about SQL queries? If you have a query like select * from table where creditcard = '1234-5678-9012-3456', where is that query information ultimately stored?

    This concern affects anyone managing a Hadoop cluster containing sensitive information. At Cloudera, we set out to address this problem through a new feature called Sensitive Data Redaction, and it’s now available starting in Cloudera Manager 5.4.0 when operating on a CDH 5.4.0 cluster.

    Specifically, this feature addresses the “leakage” of sensitive information into channels unrelated to the flow of data–not the data stream itself. So, for example, Sensitive Data Redaction will get credit-card numbers out of log files and SQL queries, but it won’t touch credit-card numbers in the actual data returned from an SQL query, nor will it modify the stored data itself.

    Investigation

    Our first step was to study the problem: load up a cluster with sensitive information, run queries, and see if we could find the sensitive data outside of the expected locations. We found that SQL queries themselves were written to several log files. Beyond the SQL queries, however, we did not observe any egregious offenders; developers seem to know that writing internal data to log files is a bad idea.

    That’s the good news. The bad news is that the Hadoop ecosystem is really big, and there are doubtless many code paths and log messages that we didn’t exercise. Developers are also adding code to the system all the time, and future log messages might reveal sensitive data.

    Looking more closely at how copies of SQL queries are distributed across the system was enlightening. Apache Hive writes a job configuration file that contains a copy of the query, and makes this configuration file available “pretty much everywhere.” Impala keeps queries and query plans around for debugging and record keeping and makes them available in the UI. Hue saves queries so they can be run again. This behavior makes perfect sense: users want to know what queries they have run, want to debug queries that went bad, and want information on currently running queries. When sensitive information is in the query itself, however, this helpfulness is suddenly much less helpful.

    One way to tackle such “leakage” of sensitive data is to put log files in an encrypted filesystem such as that provided by Cloudera Navigator Encrypt. This strategy is reasonable and addresses compliance concerns, especially in the event that some users require the original queries.

    That approach still allows some users to see the log files in the clear, however, since the contents of the log files make their way to the Cloudera Manager UI. In most cases, the original query strings are not strictly required, and the preferred solution is to simply remove (a.k.a. redact) the sensitive data entirely from places where it’s not needed.

    The Approach

    We decided to tackle redaction for log files as well as for SQL queries. Even though we observed little information leakage from log files, we decided that it would be better to be safe than sorry and apply redaction to all of them. We also wanted to protect against future log messages and code paths that we didn’t exercise. We therefore implemented “log redaction” code that plugs itself into the logging subsystem used by every component of Hadoop. This “log4j redactor” will inspect every log message as it’s generated, redact it, and pass it on to the normal logging stream.
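    The shipped redactor is a log4j plugin, but the general idea of inspecting and rewriting every message before it reaches the appenders can be sketched with Python’s standard logging module; this is only an analogy of the mechanism, not the actual implementation:

    import logging
    import re

    class RedactingFilter(logging.Filter):
        """Analogy only: rewrite each log record before it is handed to the handlers."""
        def __init__(self, rules):
            super(RedactingFilter, self).__init__()
            # rules is a list of (search_regex, replacement) pairs
            self._rules = [(re.compile(search), replace) for search, replace in rules]

        def filter(self, record):
            message = record.getMessage()
            for pattern, replacement in self._rules:
                message = pattern.sub(replacement, message)
            record.msg, record.args = message, ()
            return True  # never drop a record, only rewrite it

    logger = logging.getLogger("example")
    logger.addFilter(RedactingFilter([(r"\d{4}(-\d{4}){3}", "XXXX-XXXX-XXXX-XXXX")]))
    logger.warning("charged card 1234-5678-9012-3456")  # logged with the number redacted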

    The other component of this effort was to redact SQL queries at their source, which required work in Hive, Impala, and Hue (the open source GUI for Hadoop components). In both Hive and Impala, as soon as the query is entered it is split into two: the original copy that’s used to actually run the query, and a redacted copy of the query that’s shown to anything external. In Hue, queries are redacted as they are saved.

    Finally, Cloudera Manager makes this all easy to configure. The administrator is able to specify a set of rules for redaction in one place, click a button, and have the redaction rules take effect on everything throughout the cluster.

    Configuring Redaction

    Let’s see how this works in practice. In Cloudera Manager there are two new parameters in HDFS, one to enable redaction and one to specify what to redact. Let’s say that I want to redact credit-card numbers. But because credit-card numbers are also a boring demo, let’s also say that I just read a Harry Potter book and would feel more comfortable if the name “Voldemort” were replaced by “He Who Shall Not Be Named.”

    Redaction is an HDFS parameter that is applied to the whole cluster. The easiest way to find it is to simply search for ‘redact’ on the HDFS Configuration page:

    Here there is no “Log and Query Redaction Policy” defined yet, and I’ve clicked on the + sign to add one. There are four presets in the UI, and it’s easy to create custom rules from scratch. For the credit-card numbers, I’ll select the first entry, “Credit Card numbers (with separator)”.

    This action creates the first rule.

    • The “Description” is useful for record keeping and to remember what a rule is for, but has no impact on the actual redaction itself.
    • The “Trigger” field, if specified, limits the cases in which the “Search” field is applied. It’s a simple string match (not a regular expression), and if that string appears in the data to redact the “Search” regular expression is applied. This is a performance optimization: string matching is much faster than regular expressions.
    • The most important parameter is the “Search” field. It’s a regular expression that describes the text to redact. The search pattern shown here is fairly involved: it matches four digits, a separator, four more digits, and so on, which together describe a credit-card number.
    • The final field, “Replace,” is what to put in the place of the text matched by “Search.”

    Let’s now click on the + sign and select a custom rule.

    The fields start out blank; I filled them in so that instances of the name “Voldemort” are replaced with “He Who Shall Not Be Named”.

    Now, for a really useful part of the UI: It’s possible to test the existing redaction rules against some text in order to be certain that they work as expected. This takes the guesswork out of making redaction rules; you can easily see how they work in action. Here’s an example:

    The Test Redaction Rules box has my sample sentence to redact, and the Output box shows that Voldemort’s name and credit-card information have both been replaced with something much more comforting.
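    For the curious, the rule semantics can be approximated in a few lines of Python. The values below mirror the two rules configured in this example (the actual preset regular expression and replacement in Cloudera Manager are more elaborate), and the cheap “Trigger” string check runs before the more expensive regular expression, as described above:

    import re

    # (trigger, search_regex, replacement) triples approximating the two rules above.
    RULES = [
        ("-", r"\d{4}-\d{4}-\d{4}-\d{4}", "XXXX-XXXX-XXXX-XXXX"),   # credit cards (with separator)
        ("Voldemort", r"Voldemort", "He Who Shall Not Be Named"),   # custom rule
    ]

    def redact(text):
        for trigger, search, replacement in RULES:
            if trigger and trigger not in text:  # fast string match before the regex
                continue
            text = re.sub(search, replacement, text)
        return text

    print(redact("Voldemort paid with card 1234-5678-9012-3456"))
    # -> He Who Shall Not Be Named paid with card XXXX-XXXX-XXXX-XXXX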

    After checking the “Enable Log and Query Redaction” box (visible in the first screenshot) and restarting the cluster, these redaction rules are propagated to the cluster. The easiest place to see them in effect is in Hue. For the purposes of this entirely contrived example, let’s say I have a table containing Harry Potter character credit-card information. Let’s try an Impala search inside Hue:

    I typed the word “Voldemort” into the UI. After the query executed, the notification “This query had some sensitive information removed when saved” appeared. Let’s check out the list of recent queries:

    Hue replaced Voldemort’s name with “He Who Shall Not Be Named” in the list of recent queries. We can also see what Impala does with this query. Going to the Impala Daemon’s web UI, we see the same redaction taking place:

    The same redaction holds true for Hive, for log files, and for any other place where this query might have appeared.

    Conclusion

    We hope that you find this functionality useful. Several teams across CDH came together to make this project happen, including those working on Cloudera Manager, Impala, Hive, Hue, packaging, QA, and docs.

    Removing sensitive information from places where it’s not needed is the simplest and most effective data protection strategy. The Sensitive Data Redaction feature achieves that goal throughout CDH and provides an easy, intuitive UI in Cloudera Manager.

    Michael Yoder is a Software Engineer at Cloudera.

    Deploying Apache Kafka: A Practical FAQ


    This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise data hub.

    Cloudera added support for Apache Kafka, the open standard for streaming data, in February 2015 after its brief incubation period in Cloudera Labs. Apache Kafka now is an integrated part of CDH, manageable via Cloudera Manager, and we are witnessing rapid adoption of Kafka across our customer base.

    As more customers adopt Apache Kafka, a common set of questions about development and deployment has emerged. In this post, you’ll find answers to most of those questions.

    Should I use SSDs for my Kafka Brokers?

    Using SSDs instead of spinning disks has not been shown to provide a significant performance improvement for Kafka, for two main reasons:

    • Kafka writes to disk are asynchronous. That is, other than at startup/shutdown, no Kafka operation waits for a disk sync to complete; disk syncs are always in the background. That’s why replicating to at least three replicas is critical: a replica that crashes will lose any data that has not yet been synced to disk.
    • Each Kafka partition is stored as a sequential write-ahead log. Thus, disk reads and writes in Kafka are sequential, with very few random seeks. Sequential reads and writes are heavily optimized by modern operating systems.

    How do I encrypt the data persisted on my Kafka Brokers?

    Currently, Kafka does not provide any mechanism to encrypt the data persisted on the brokers (i.e. encrypting data at rest). Users can always encrypt the payload of the messages written to Kafka—that is, producers encrypt the data before writing to Kafka, and then the consumers can decrypt the received messages. But that would require the producers to share encryption protocols and keys with the Consumers.

    The other option is to use software that provides filesystem-level encryption such as Cloudera Navigator Encrypt, included as part of Cloudera Enterprise, which provides a transparent encryption layer between the application and the filesystem.

    Is it true that Apache ZooKeeper can become a pain point with a Kafka cluster?

    Older versions of Kafka’s high-level consumer (0.8.1 and older) used ZooKeeper to maintain read offsets (the most recently read offset of each partition of a topic). If there are many consumers simultaneously reading from Kafka, the read/write load on ZooKeeper may exceed its capacity, making ZooKeeper a bottleneck. However, this only occurs in extreme cases, when many hundreds of consumers use the same ZooKeeper cluster for offset management.

    In any case, this issue has been resolved in the current version of Kafka (0.8.2 at the time of this writing). Starting with version 0.8.2, the high-level consumer can use Kafka itself to manage offsets: it uses a separate Kafka topic to track recently read offsets, so ZooKeeper is no longer required for offset management. However, users get to choose whether they want offsets managed in Kafka or in ZooKeeper, via the consumer config parameter offsets.storage.
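    As an illustration only (shown here as a Python dict purely for readability; in practice these are entries in the consumer’s properties configuration, and the host names are placeholders), switching the 0.8.2 high-level consumer to Kafka-based offset storage looks roughly like this:

    # Illustrative 0.8.2 high-level consumer settings; host names are placeholders.
    consumer_config = {
        "group.id": "my-consumer-group",
        "zookeeper.connect": "zk1:2181,zk2:2181,zk3:2181",
        "offsets.storage": "kafka",      # store offsets in a Kafka topic instead of ZooKeeper
        "dual.commit.enabled": "false",  # set to "true" only while migrating offsets from ZooKeeper
    }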

    Cloudera highly recommends using Kafka to store offsets. However, you may choose to use ZooKeeper to store offsets for backwards compatibility. (You may, for example, have a monitoring console that reads offset information from ZooKeeper.) If you have to use ZooKeeper for offset management, we recommend using a dedicated ZooKeeper ensemble for your Kafka cluster. If a dedicated ZooKeeper ensemble is still a performance bottleneck, you can address the issue by using SSDs on your ZooKeeper nodes.

    Does Kafka support cross-data center availability?

    The recommended solution for cross-data center availability with Kafka is via MirrorMaker. Set up a Kafka cluster in each of your data centers, and use MirrorMaker to do near real-time replication of data between the Kafka clusters.

    A common architectural pattern when using MirrorMaker is to have one topic per data center (DC) for each “logical” topic: for example, if you want a topic for “clicks,” you’ll have “DC1.clicks” and “DC2.clicks” (where DC1 and DC2 are your data centers). DC1 writes to DC1.clicks and DC2 to DC2.clicks. MirrorMaker replicates all DC1 topics to DC2 and all DC2 topics to DC1. The application in each DC then has access to events written from both DCs; it is up to the application to merge the information and handle conflicts accordingly.

    Another more sophisticated pattern is to use local and aggregate Kafka clusters in each DC. This pattern is in use at LinkedIn, and has been described in detail in this blog post written by LinkedIn’s Kafka operations team. (Check out the section “Tiers and Aggregation”).

    What type of data transformations are supported on Kafka?

    Kafka itself does not transform data as it flows through the brokers. To perform data transformation, we recommend the following methods:

    • For simple event by event processing, use the Flume Kafka integration, and write a simple Apache Flume Interceptor
    • For complex processing, use Apache Spark Streaming to read from Kafka and process the data

    In either case, the transformed/processed data can be written back to a new Kafka topic (which is useful if there are multiple downstream consumers of the transformed data), or directly delivered to the end consumer of the data.

    For a more comprehensive description of real time event processing patterns, check out this blog post.

    How do I send large messages or payloads through Kafka?

    Cloudera benchmarks indicate that Kafka reaches maximum throughput with message sizes of around 10KB; larger messages show decreased throughput. However, in certain cases, users need to send messages much larger than 10KB.

    If the message payload sizes are on the order of hundreds of megabytes, we recommend exploring the following alternatives:

    • If shared storage is available (HDFS, S3, NAS), place the large payload on shared storage and use Kafka just to send a message with the payload location.
    • Handle large messages by chopping them into smaller parts before writing into Kafka, using a message key to make sure all the parts are written to the same partition so that they are consumed by the same consumer, and re-assembling the large message from its parts when consuming (see the sketch after this list).
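    A minimal sketch of the chunking approach follows (pure Python, with no Kafka client shown; the key and part-numbering convention is just one possibility, and in practice the part metadata would need to be serialized into the message body):

    import uuid

    CHUNK_SIZE = 512 * 1024  # assumption: keep each part comfortably under message.max.bytes

    def chunk_message(payload):
        """Split a large payload into (key, metadata, part) tuples that share a single key."""
        message_id = uuid.uuid4().hex  # same key for every part, so all parts land in one partition
        total = (len(payload) + CHUNK_SIZE - 1) // CHUNK_SIZE
        for index in range(total):
            part = payload[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]
            yield message_id, {"part": index, "of": total}, part

    def reassemble(parts):
        """Consumer side: collect all parts for one key and stitch the payload back together."""
        ordered = sorted(parts, key=lambda p: p[1]["part"])
        return b"".join(part for _, _, part in ordered)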

    While sending large messages through Kafka, keep the following in mind:

    Compression configs

    • Kafka Producers can compress messages. Ensure compression is turned on via the config parameter compression.codec. Valid options are “gzip” and “snappy”.

    Broker configs

    • message.max.bytes (default: 1000000): maximum size of a message the broker will accept. Increase this value to accommodate your largest message.
    • log.segment.bytes (default: 1GB): size of a Kafka data file. Make sure it’s larger than one message. Default should be fine since large messages should not exceed 1GB in size.
    • replica.fetch.max.bytes (default: 1MB): maximum size of data that a broker can replicate. This has to be larger than message.max.bytes, or a broker will accept messages and fail to replicate them, leading to potential data loss.

    Consumer configs

    • fetch.message.max.bytes (default: 1MB): maximum size of a message a consumer can read. This should be greater than or equal to the message.max.bytes configuration on the broker.

    A few other considerations:

    • Brokers will need to allocate a buffer of size replica.fetch.max.bytes for each partition they replicate. Do the math and make sure the number of partitions multiplied by the size of the largest message does not exceed available memory, or you’ll see OOMs (see the sketch after this list).
    • Same for consumers and fetch.message.max.bytes: Confirm there’s enough memory for the largest message for each partition the consumer reads.
    • Large messages may cause longer garbage collection pauses (as brokers need to allocate large chunks). Keep an eye on the GC log and on the server log. If long GC pauses cause Kafka to lose the ZooKeeper session, you may need to configure longer timeout values for zookeeper.session.timeout.ms.
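    As a back-of-the-envelope check for the first consideration above (all numbers are illustrative):

    # Illustrative only: worst-case replication fetch buffer memory on one broker.
    partitions_per_broker = 1000
    replica_fetch_max_bytes = 10 * 1024 * 1024  # 10MB, sized for the largest expected message

    buffer_bytes = partitions_per_broker * replica_fetch_max_bytes
    print("worst-case fetch buffers: %.1f GB" % (buffer_bytes / float(1024 ** 3)))  # ~9.8 GB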

    Does Kafka support the MQTT or JMS protocols?

    Currently, Kafka does not provide out-of-the-box support for the above protocols. However, users have been known to write adaptors to read data from MQTT or JMS and write to Kafka.

    For further information on running Kafka with CDH, download the Deployment Guide or watch the webinar “Bringing Real-Time Data to Hadoop.”

    If you have additional questions, please feel free to post them on our community forum.

    Anand Iyer is a senior product manager at Cloudera. His primary areas of focus are platforms for real-time streaming, Apache Spark, and tools for data ingestion into the Hadoop platform. Before joining Cloudera, he worked as an engineer at LinkedIn, where he applied machine learning techniques to improve the relevance and personalization of LinkedIn’s Feed.

    Gwen Shapira is a Software Engineer at Cloudera, and a committer on Apache Sqoop and Apache Kafka. She has 15 years of experience working with customers to design scalable data architectures.

    How-to: Run Apache Mesos on CDH


    Big Industries, Cloudera systems integration and reseller partner for Belgium and Luxembourg, has developed an integration of Apache Mesos and CDH that can be deployed and managed through Cloudera Manager. In this post, Big Industries’ Rob Gibbon explains the benefits of deploying Mesos on your cluster and walks you through the process of setting it up.

    [Editor's Note: Mesos integration is not currently supported by Cloudera, thus the setup described below is not recommended for production use.]

    Apache Mesos is a distributed, generic grid workload manager. Similar to YARN in Apache Hadoop, Mesos is designed to run the generic tasks and services that your Hadoop cluster might not otherwise be able to manage: it’s ideal for running things like Memcached, MySQL, Apache httpd, Nginx, HAProxy, Snort, ActiveMQ, or whatever it is that you need to run, for as long as you need to run it. Like YARN, Mesos is designed to scale, and Mesos services can be deployed on clusters of tens of thousands of nodes.

    Why would you want to run things like web servers, proxies, and caches on a Hadoop cluster, though? Well, when assembling a technical solution, especially an off-the-shelf one, the buyer commonly expects the vendor to provide a complete, ready-to-go platform with a single bill of materials. Solutions often make use of operational front-end serving components (reverse proxies, load balancers, web servers, application servers) and middle-tier components (object caches, JMS, workflow engines, and so on) in addition to backend components. While Hadoop is great at solving backend data-processing challenges, until Mesos it has been quite difficult to deploy and operate front-end and middle-tier components in a consistent manner as part of a complete, Hadoop-powered solution.

    For example, building a security information and event management (SIEM) solution on top of Hadoop means including a live-traffic inspection layer as well as an active archive of security-related event logs and reporting tools. While Hadoop perfectly fits the needs for the active archiving element, without Mesos integration to run a live-traffic inspection system and a reporting server, it would be quite difficult to deliver on the complete system requirements in a consistent way from a single platform.

    Mesos comes with a framework called Marathon to launch tasks on the cluster, and a scheduler framework called Chronos, which offers a highly available, fault-tolerant alternative to Unix cron.

    Docker

    In order to launch an application, Mesos Marathon uses Docker, an application virtualization system that enables portable, standardized, and containerized deployment of applications and components across the cluster. The engineer writes a Dockerfile, a text file containing a set of automation instructions for deploying and configuring the application.

    There are other ways to launch applications on Mesos, but Docker offers a robust solution with extensive features.

    Putting It Together: Mesos on Cloudera

    In order to get Apache Mesos running on a Cloudera environment, we put together Custom Service Descriptors (CSDs) and custom deployment parcels for CentOS 6.5, RHEL 6.5, and Ubuntu 14.04 LTS that can be installed into Cloudera Manager. With this approach, deploying Mesos and Docker is a similar experience to deploying other Hadoop components like YARN, Impala, or Hive. (Note that we made two parcels, one for Mesos and one for Docker, because Docker needs to be run as root whilst Mesos can run under a dedicated user account.)

    First, add the Big Industries parcel repo to Cloudera Manager.

    To get Cloudera Manager to work with the Mesos and Docker parcels, you need to copy the CSD files to the Cloudera Manager CSD repository:

    scp MESOS-1.0.jar myhost.com:/opt/cloudera/csd/.
    scp DOCKER-1.0.jar myhost.com:/opt/cloudera/csd/.

    Once done you’ll need to restart Cloudera Manager to pick up the changes:

    service cloudera-scm-server restart

    The next step is to download, distribute, and activate the Mesos and Docker parcels via Cloudera Manager.

    We can now set up and configure a new Mesos service on our cluster from Cloudera Manager, in the same way we would set up any other Hadoop service.

    You can choose which nodes of the cluster to use as Mesos slaves, Mesos masters, and where to deploy the Marathon service. You should deploy and run Docker on each node that will run as a Mesos slave. In order to ensure solid resource isolation, you can use Cloudera Manager’s Linux Control Groups integration to allocate appropriate system resource shares to the Mesos framework; this way Mesos and other Hadoop components like YARN and Impala can coexist.

    Set up the hosts-to-roles mappings in Cloudera Manager:

    Running Docker Images in Marathon

    Marathon has a REST API. Docker images can be started with a POST to:

    http://[host]:[port]/v2/apps

    containing a configuration file in JSON format.
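    For example, a minimal client-side sketch in Python (assuming the requests library is installed; Marathon’s HTTP port is commonly 8080, so substitute your own host and port) would be:

    import json
    import requests  # assumption: the requests package is available on the client

    MARATHON = "http://marathon-host:8080"  # substitute your own [host]:[port]

    with open("example.json") as f:         # the JSON app definition shown below
        app_definition = json.load(f)

    response = requests.post(MARATHON + "/v2/apps", json=app_definition)
    response.raise_for_status()
    print(response.json())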

    Editing the JSON files

    Various settings can be configured in the JSON file. They are used to configure Marathon and the service the Docker image contains. For example:

    {
       "id": "example",
       "instances": 1,
       "cpus": 0.5,
       "mem": 1024.0,
       "disk": 128,
       "constraints": [["hostname", "CLUSTER", "[desired-hostname]"]],
       "container": {
         "type": "DOCKER",
         "docker": {
           "image": "repo/example:1.0",
           "network": "BRIDGE",
           "parameters": [],
           "portMappings": [
             {
               "containerPort": 5000,
               "hostPort": 0,
               "protocol": "tcp",
               "servicePort": 5000
             }
           ]
         },
         "volumes": [
           {
             "hostPath": "/docker/packages",
             "containerPath": "/storage",
             "mode": "RW"
           }
         ]
       },
       "env": {
         "SETTINGS_FLAVOR": "local",
         "STORAGE_PATH": "/storage"
       },
       "ports": [ 0 ]
     }

    Further documentation can be found here. To find the exposed ports, volumes, and environment variables, check the Dockerfile for the following commands:

    EXPOSE
    ENV
    VOLUME

    Setting Up a Docker Registry Containing the Images

    Docker images must be made available to the Docker daemons on the cluster.

    The best way to do this is to provide a Docker Registry, which is comparable to a Git repository for Docker images. A JSON file to set up a registry using Marathon is included in the project. To move Docker images from one host to another (save on the source host, load on the destination), use the following commands:

    sudo docker save [imagename] > [imagename].tar
    sudo docker load < [imagename].tar

    To put these images in the Docker registry, first tag them and then push them. The IP address and port of the registry can be found in the Marathon UI.

    sudo docker tag imagename [registry_ip]:[port]/[image_name]:[version]
    sudo docker push [registry_ip]:[port]/[image_name]:[version]

    Note that when using an insecure private registry, like the one from the JSON file, it is important to add the --insecure-registry argument to the Docker daemon’s start command.

    Example: Launching Memcached on Mesos, on CDH

    To launch Memcached on a cluster, we need a Docker image for Memcached – we’ll use sameersbn/memcached:latest from the public registry.

    You need to create a marathon configuration file like the one below:

    {
       "id": "memcached",
       "cmd": "",
       "cpus": 0.5,
       "mem": 512,
       "instances": 5,
       "container": {
         "type": "DOCKER",
         "docker": {
           "image": "sameersbn/memcached:latest",
           "network": "BRIDGE",
           "portMappings": [
             {
               "containerPort": 11211,
               "hostPort": 0,
               "servicePort": 0,
               "protocol": "tcp"
             }
           ]
         }
       }
     }

    Then you need to launch it on the cluster via the Marathon REST API:

    curl -H "Content-Type: application/json" -X POST --data @memcached.json http://[host]:[port]/v2/apps

    Troubleshooting

    If the Marathon app stays in the deploying state:

    • Make sure there are enough resources on the slaves.
    • Make sure the IP address and the port number of the registry are set correctly and that the registry is added as an insecure registry on the Docker daemon.
    • Make sure the image name has a version if required.

    Conclusion

    In this article we have explained some of the features and benefits of Apache Mesos, seen how to deploy Mesos and Docker under CDH using Cloudera Manager and custom parcels, and had a look at launching an application component (Memcached) across the cluster using Mesos Marathon.

    The source code for the Cloudera Manager Mesos and Docker extensions is available on GitHub and is Apache v2-licensed.

    Rob Gibbon is architect, manager, and partner at Big Industries, the industry-leading Hadoop SI partner for Belgium and Luxembourg.

    How-to: Prepare Your Apache Hadoop Cluster for PySpark Jobs


    Proper configuration of your Python environment is a critical pre-condition for using Apache Spark’s Python API.

    One of the most enticing aspects of Apache Spark for data scientists is the API it provides in non-JVM languages for Python (via PySpark) and for R (via SparkR). There are a few reasons these language bindings have generated a lot of excitement: most data scientists think writing Java or Scala is a drag, they already know Python or R, and they don’t want to learn a new language just to write code for distributed computing. Most important, these languages already have a rich variety of numerical libraries with a statistical, machine learning, or optimization focus.

    Like everything in engineering, there are tradeoffs to be made when picking these non-JVM languages for your Spark code. Java offers advantages like platform independence by running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and higher performance because Spark itself runs in the JVM. If you choose to use Python, you lose those advantages. In particular, managing dependencies and making them available for PySpark jobs on a cluster can be a pain. In this blog post, I will explain what your options are.

    To determine what dependencies are required on the cluster ahead of time, it is important to understand where different parts of Spark code get executed and how computation is distributed on the cluster. Spark orchestrates its operations via the driver program. The driver program initializes a SparkContext, in which you define your data actions and transformations, e.g. map, flatMap, and filter. When the driver program is run, the Spark framework initializes executor processes on the worker nodes that then process your data across the cluster.

    Self-contained Dependency

    If the Python transformations you define use any third-party libraries, like NumPy or nltk, then the Spark executors will need access to those libraries when they execute your code on the remote worker nodes. A common situation is one where we have our own custom Python package that contains functionality we would like to apply to each element of our RDD. A simple example of this is illustrated below. I assume a SparkContext is already initialized as sc, as in the PySpark shell.

    def import_my_special_package(x):
        import my.special.package
        return x
    
    conf = SparkConf()
    sc = SparkContext()
    int_rdd = sc.parallelize([1, 2, 3, 4])
    int_rdd.map(lambda x: import_my_special_package(x))
    int_rdd.collect()

    After creating a SparkContext, you create a simple rdd of four elements and call it int_rdd. Then you apply the function import_my_special_package to every element of the int_rdd. This function just imports my.special.package and then returns the original argument passed to it. This has the same effect as using classes or functions defined in my.special.package because Spark requires that each Spark executor can import my.special.package when its functionality is needed.

    If you only need a single file inside my.special.package, you may direct Spark to make it available to all executors by using the --py-files option in your spark-submit command and specifying the local path to the file. You may also specify this programmatically by using the sc.addPyFile() function. If you use functionality from a package that spans multiple files, you will be better off making an *.egg for the package, as the --py-files flag also accepts a path to an egg file. (Caveat: if your package depends on compiled code and the machines in your cluster have different CPU architectures than the machine you compile your egg on, this will not work.)

    In short, if you have a self-contained dependency, there are two ways you can make the required Python dependency available to your executors:

    • If you only depend on a single file, you can use either the --py-files command-line option or programmatically add the file to the SparkContext with sc.addPyFile(path), specifying the local path to that Python file.
    • If you have a dependency on a self-contained module (meaning a module with no other dependencies), you can create an egg or zip file of that module and use either the --py-files command-line option or sc.addPyFile(path) with the local path to that egg or zip file, as shown in the sketch below.
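    For instance, a minimal sketch of the programmatic route looks like this (the paths are hypothetical); the same files could equally be passed via the --py-files flag on the spark-submit command line:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("self-contained-deps")
    sc = SparkContext(conf=conf)

    # Hypothetical local paths: a single module and an egg built from a self-contained module.
    sc.addPyFile("/home/me/deps/my_helpers.py")
    sc.addPyFile("/home/me/deps/my_special_package-0.1-py2.7.egg")

    # Functions imported from these modules can now be used inside RDD operations on the executors.
    int_rdd = sc.parallelize([1, 2, 3, 4])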

    Complex Dependency

    If the operations you want to apply in a distributed way rely on complex packages that themselves have many dependencies, you have a real challenge. Let’s take the simple snippet below as an example:

    from pyspark import SparkContext, SparkConf
    
    def import_pandas(x):
        import pandas
        return x
    
    conf = SparkConf()
    sc = SparkContext()
    int_rdd = sc.parallelize([1, 2, 3, 4])
    int_rdd.map(lambda x: import_pandas(x))
    int_rdd.collect()

    Again, all we are doing is importing pandas. Pandas depends on NumPy, SciPy, and many other packages. We have a problem here because we cannot make an egg that contains all of the required dependencies. Instead, we need to have the required Python environment already set up on the cluster and the Spark executors configured to use that Python environment.

    Now, let’s consider our options for making these libraries available. While pandas is too complex to just distribute a *.py file that contains the functionality we need, we could theoretically create an *.egg for it and try shipping it off to executors with the --py-files option on the command line or sc.addPyFile() on the SparkContext. A major issue with this approach is that *.egg files for packages containing native code (which most numerically oriented Python packages do) must be compiled for the specific machine they will run on.

    Anyone doing distributed computing with commodity hardware must assume that the underlying hardware is potentially heterogeneous. A Python egg built on a client machine will be specific to the client’s CPU architecture because of the required C compilation. Distributing an egg for a complex, compiled package like NumPy, SciPy, or pandas is a brittle solution that is likely to fail on most clusters, at least eventually. This means we should prefer the alternative approach: have our required Python packages already installed on each node of the cluster and specify the path to the Python binaries for the worker nodes to use.

    As long as the Python installations you want to use are in a consistent location on your cluster, you can set the PYSPARK_PYTHON environment variable to the path of your Python executable, and Spark will use it as the Python installation for your executors. You can set this environment variable on a per-session basis by executing the following line at the command line:

    export PYSPARK_PYTHON=/path/to/python

    If you would like to use this PYSPARK_PYTHON definition consistently, you can add that line to your spark-env.sh. In CDH this script is located at /etc/spark/conf/spark-env.sh. If you set PYSPARK_PYTHON in spark-env.sh, you should check that users have not already set this environment variable, for example with the following lines:

    if [ -z "${PYSPARK_PYTHON}" ]; then
      export PYSPARK_PYTHON=<path>
    fi

    If you have complex dependencies like pandas or SciPy, you can create the required Python environment on each node of your cluster and set PYSPARK_PYTHON to the path to the associated Python executable.

    Installing and Maintaining Python Environments

    Installing and maintaining Python environments on a cluster is not an easy task, but it is the best solution that allows making full use of the Python package ecosystem with PySpark. In the best possible world, you have a good relationship with your local sysadmin and they are able and willing to set up a virtualenv or install the Anaconda distribution of Python on every node of your cluster, with your required dependencies. If you are a data scientist responsible for administering your own cluster, you may need to get creative about setting up your required Python environment on your cluster. If you have sysadmin or devops support for your cluster, use it! They are professionals who know what they are doing. If you are on your own, the following, somewhat fragile, instructions may be useful to you.

    If you aren’t yet worrying about long term maintainability and just need to get a Python environment set up yourself, you could take the less maintainable path of setting up virtual environments on your cluster by executing commands on each machine using Cluster SSH, Parallel SSH, or Fabric.

    As an example, I provide instructions for setting up a standard data stack (including SciPy, NumPy, scikit-learn, and pandas) in a virtualenv on a CentOS 6/RHEL 6 system, assuming you have logged into every node in your cluster using cluster SSH and each node has Python and pip installed. (Note that you may need sudo access in order to install operating-system packages, like LAPACK and BLAS.):

    # install virtualenv:
    pip install virtualenv
    # create a new virtualenv:
    virtualenv <mynewenv>
    # activate it so the following pip installs go into the new environment:
    source <mynewenv>/bin/activate
    # install the non-Python dependencies SciPy requires that CentOS does not ship by default:
    yum install atlas atlas-devel lapack-devel blas-devel
    pip install numpy
    pip install scipy
    pip install scikit-learn
    pip install pandas

    Once you have a virtualenv setup in a uniform location on each node in your cluster, you can use it as the Python executable for your Spark executors by setting the PYSPARK_PYTHON environment variable to /path/to/mynewenv/bin/python.

    This is not particularly simple or easily maintainable. In a follow-up post I will discuss other options for creating and maintaining Python environments on a CDH cluster.

    Acknowledgements

    Thanks to Uri Laserson for his invaluable feedback on the blog post. Additional thanks to Sean Owen, Sandy Ryza, Mark Grover, Alex Moundalexis, and Stephanie Bodoff.

    Juliet is a Data Scientist at Cloudera, and contributor/committer/maintainer for the Sparkling Pandas project. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al.

    How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark


    Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH.

    The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark. This means that the project is heavily dependent on two of the fastest growing machine-learning open source projects out there. With every major release of Spark or H2O there are API changes and, less frequently, major data structure changes that affect Sparkling Water. Throw Cloudera releases into the mix, and you have a plethora of git commits dedicated to maintaining a few simple calls to move data between the different platforms.

    All that hard work on the backend means that users can easily benefit from programming in a uniform environment that combines both H2O and MLLib algorithms. Data scientists using a Cloudera-supported distribution of Spark (Spark 1.3/CDH 5.4 as of this writing) can easily incorporate the H2O library into their Spark applications. An entry point to the H2O programming world (called H2OContext) is created and allows for the launch of H2O, parallel import of frames into memory, and the use of H2O algorithms. This seamless integration into Spark makes launching a Sparkling Water application as easy as launching a Spark application:

    > bin/spark-submit --class water.YourSparklingWaterApp --master yarn-client sparkling-water-app-assembly.jar

    Setup and Installation

    Sparkling Water is certified on Cloudera and certified to work with versions of Spark installations that come prepackaged with each distribution. To install Sparkling Water, navigate to h2o.ai/download and download the version corresponding to the version of Spark available with your Cloudera cluster. Rather than downloading Spark and then distributing on the Cloudera cluster manually, simply set your SPARK_HOME to the spark directory in your opt directory:

    $ export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

    For ease of use, we are looking into taking advantage of Cloudera Manager and creating distributable H2O and Sparkling Water parcels. This will simplify the management of the various versions of Cloudera, Spark, and H2O.

    Architecture

    Figure 1 illustrates how the pieces fit together technically. The application developer implements a Spark application using the Spark API and the Sparkling Water library. After the resulting Sparkling Water application is submitted to a Spark cluster, it can create an H2OContext, which initializes H2O services on top of the Spark nodes. The application can then use any functionality provided by H2O, including its algorithms and interactive UI. H2O uses its own data structure, called H2OFrame, to represent tabular data, but H2OContext allows H2O to share data with Spark’s RDDs.

    Figure 1: Sparkling Water architecture

    Figure 2 illustrates the launch sequence of Sparkling Water on a Cloudera cluster. Both Spark and H2O are in-memory processes, and all computation occurs in memory with minimal writing to disk, occurring exclusively when specified by the user. Because all the data used in the modeling process needs to be read into memory, the recommended method of launching Spark and H2O is through YARN, which dynamically allocates available resources. When the job is finished, you can tear down the Sparkling Water cluster and free up resources for other jobs. All Spark and Sparkling Water applications launched with YARN will be tracked and listed in the history server that you can launch from Cloudera Manager.

    YARN allocates a container in which to launch the application master. When you launch with yarn-client, the Spark driver runs in the client process, and the application master submits a request to the ResourceManager to spawn the Spark executor JVMs. Finally, after creating a Sparkling Water cluster, you have access to HDFS to read data into either H2O or Spark.

    Figure 2: Sparkling Water on Cloudera [Launching on YARN]

    Programming Model

    The H2OContext exposes two operators: (1) publishing a Spark RDD as an H2O Frame, and (2) publishing an H2O Frame as a Spark RDD. The direction from Spark to H2O makes sense when data are prepared with the help of the Spark API and then passed to H2O algorithms:

    // ...
    val srdd: SchemaRDD = sqlContext.sql("SELECT * FROM ChicagoCrimeTable where Arrest = 'true'")
    // Publish the RDD as H2OFrame
    val h2oFrame: H2OFrame = h2oContext.asH2OFrame(srdd)
    // ...
    val dlModel: DeepLearningModel = new DeepLearning().trainModel.get
    ...

    The opposite direction from H2O Frame to Spark RDD is used in a situation when the user needs to expose H2O’s frames as Spark’s RDDs. For example:

    val prediction: H2OFrame = dlModel.score(testFrame)
    // ...
    // Exposes prediction frame as RDD
    val srdd: SchemaRDD = asSchemaRDD(prediction)

    The H2O context simplifies the programming model by introducing implicit conversion to hide asSchemaRDD and asH2OFrame calls.

    Sparkling Water excels in situations when you need to call advanced machine-learning algorithms from an existing Spark workflow. Furthermore, we found that it is the perfect platform for designing and developing smarter machine learning applications. In the rest of this post, we will demonstrate how to use Sparkling Water to create a simple machine-learning application that predicts arrest probability for a given crime in Chicago. (Although this app was tested on Spark 1.4, it should also work without modification on 1.3, the version inside CDH 5.4.)

    Example Application

    We’ve seen some incredible applications of Deep Learning with respect to image recognition and machine translation but this specific use case has to do with public safety; in particular, how Deep Learning can be used to fight crime in the forward-thinking cities of San Francisco and Chicago. The cool thing about these two cities (and many others!) is that they are both open data cities, which means anybody can access city data ranging from transportation information to building maintenance records. So if you are a data scientist or thinking about becoming a data scientist, there are publicly available city-specific datasets you can play with. For this example, we looked at the historical crime data from both Chicago and San Francisco and joined this data with other external data, such as weather and socioeconomic factors, using Spark’s SQL context:

    Figure 3: Spark + H2O Workflow

    We perform the data import, ad-hoc data munging (parsing the date column, for example), and joining of tables by leveraging the power of Spark. We then publish the Spark RDD as an H2O Frame (Fig. 2).

    val sc: SparkContext = // ...
    implicit val sqlContext = new SQLContext(sc)
    implicit val h2oContext = new H2OContext(sc).start()
    import h2oContext._
    
    val weatherTable = asSchemaRDD(createWeatherTable("hdfs://data/chicagoAllWeather.csv"))
    registerRDDAsTable(weatherTable, "chicagoWeather")
    // Census data
    val censusTable = asSchemaRDD(createCensusTable("hdfs://data/chicagoCensus.csv"))
    registerRDDAsTable(censusTable, "chicagoCensus")
    // Crime data
    val crimeTable  = asSchemaRDD(createCrimeTable("hdfs://data/chicagoCrimes10k.csv", "MM/dd/yyyy hh:mm:ss a", "Etc/UTC"))
    registerRDDAsTable(crimeTable, "chicagoCrime")
    
    val crimeWeather = sql("""SELECT a.Year, ..., b.meanTemp, ..., c.PER_CAPITA_INCOME
        |FROM chicagoCrime a
        |JOIN chicagoWeather b
        |ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
        |JOIN chicagoCensus c
        |ON a.Community_Area = c.Community_Area_Number""".stripMargin)
    
    // Publish result as H2O Frame
    val crimeWeatherHF: H2OFrame = crimeWeather
    
    // Split data into train and test datasets
    val frs = splitFrame(crimeWeatherHF, Array("train.hex", "test.hex"), Array(0.8, 0.2))
    val (train, test) = (frs(0), frs(1))

    Figures 4 and 5 below include some cool visualizations we made of the joined table using H2O’s Flow as part of Sparkling Water.

    Figure 4: San Francisco crime visualizations

    Figure 5: Chicago crime visualizations

    It is interesting how, in both cities, crime seems to occur most frequently during the winter—a surprising fact given how cold the weather gets in Chicago!

    Using H2O Flow, we were able to look at the arrest rates of every category of recorded crimes in Chicago and compare them with the percentage of total crimes each category represents. Some crimes with the highest arrest rates also occur least frequently, and vice versa.

    Figure 6: Chicago arrest rates and total % of all crimes by category

    Once the data is transformed to an H2O Frame, we train a deep neural network to predict the likelihood of an arrest for a given crime.

    def DLModel(train: H2OFrame, test: H2OFrame, response: String,
                epochs: Int = 10, l1: Double = 0.0001, l2: Double = 0.0001,
                activation: Activation = Activation.RectifierWithDropout, hidden:Array[Int] = Array(200,200))
               (implicit h2oContext: H2OContext) : DeepLearningModel = {
      import h2oContext._
      import hex.deeplearning.DeepLearning
      import hex.deeplearning.DeepLearningModel.DeepLearningParameters
    
      val dlParams = new DeepLearningParameters()
      dlParams._train = train
      dlParams._valid = test
      dlParams._response_column = response
      dlParams._epochs = epochs
      dlParams._l1 = l1
      dlParams._l2 = l2
      dlParams._activation = activation
      dlParams._hidden = hidden
    
      // Create a job
      val dl = new DeepLearning(dlParams)
      val model = dl.trainModel.get
      model
    }
    
    // Build Deep Learning model
    val dlModel = DLModel(train, test, 'Arrest)
    // Collect model performance metrics and predictions for test data
    val (trainMetricsDL, testMetricsDL) = binomialMetrics(dlModel, train, test)

    Here is a screenshot of our H2O Deep Learning model being tuned inside Flow and the resulting AUC curve from scoring the trained model against the validation dataset.

    Figure 7: Chicago validation data AUC

    The last building block of the application is formed by a function which predicts the arrest rate probability for a new crime. The function combines the Spark API to enrich each incoming crime event with census information and H2O’s deep learning model, which scores the event:

    def scoreEvent(crime: Crime, model: Model[_,_,_], censusTable: SchemaRDD)
                  (implicit sqlContext: SQLContext, h2oContext: H2OContext): Float = {
      import h2oContext._
      import sqlContext._
      // Create a single row table
      val srdd:SchemaRDD = sqlContext.sparkContext.parallelize(Seq(crime))
      // Join table with census data
      val row: DataFrame = censusTable.join(srdd, on = Option('Community_Area === 'Community_Area_Number)) //.printSchema
      val predictTable = model.score(row)
      val probOfArrest = predictTable.vec("true").at(0)
    
      probOfArrest.toFloat
    }
    
    val crimeEvent = Crime("02/08/2015 11:43:58 PM", 1811, "NARCOTICS", "STREET", false, 422, 4, 7, 46, 18)
    val arrestProbability = 100 * scoreEvent(crimeEvent, dlModel, censusTable)

    Figure 8: Geo-mapped predictions

    Because each of the crimes reported comes with latitude-longitude coordinates, we scored our hold out data using the trained model and plotted the predictions on a map of Chicago—specifically, the Downtown district. The color coding corresponds to the model’s prediction for likelihood of an arrest with red being very likely (X > 0.8) and blue being unlikely (X < 0.2). Smart analytics + resource management = safer streets.

    Further Reading

    If you’re interested in finding out more about Sparkling Water or H2O please join us at H2O World 2015 in Mountain View, CA. We’ll have a series of great speakers including Stanford Professors Rob Tibshirani and Stephen Boyd, Hilary Mason, the Founder of Fast Forward Labs, Erik Huddleston, the CEO of TrendKite, Danqing Zhao, Big Data Director for Macy’s and Monica Rogati, Equity Partner at Data Collective.

    Cloudera Enterprise 5.5 is Now Generally Available


    Cloudera Enterprise 5.5 (comprising CDH 5.5, Cloudera Manager 5.5, and Cloudera Navigator 2.4) has been released.

    Cloudera is excited to bring you news of Cloudera Enterprise 5.5. Our persistent emphasis on quality is especially pronounced in this release, with more than 500 issues identified and triaged during its development.

    A highlight of this release is the inclusion of Cloudera Navigator Optimizer (available in limited beta for select Cloudera Enterprise customers;

    Read More

    The post Cloudera Enterprise 5.5 is Now Generally Available appeared first on Cloudera Engineering Blog.


    Sustained Innovation in Apache Spark: DataFrames, Spark SQL, and MLlib


    Cloudera has announced support for Spark SQL/DataFrame API and MLlib. This post explains their benefits for app developers, data analysts, data engineers, and data scientists.

    In July 2015, Cloudera re-affirmed its position since 2013: that Apache Spark is on course to replace MapReduce as the default general-purpose data processing engine for Apache Hadoop. Thanks to initiatives like the One Platform Initiative,

    Read More

    The post Sustained Innovation in Apache Spark: DataFrames, Spark SQL, and MLlib appeared first on Cloudera Engineering Blog.

    Docker is the New QuickStart Option for Apache Hadoop and Cloudera


    Now there’s an even quicker “QuickStart” option for getting hands-on with the Apache Hadoop ecosystem and Cloudera’s platform: a new Docker image.

    You might already be familiar with Cloudera’s popular QuickStart VM, a virtual image containing our distributed data processing platform. Originally intended as a demo environment, the QuickStart VM quickly evolved over time into quite a useful general-purpose environment for developers, customers,

    Read More

    The post Docker is the New QuickStart Option for Apache Hadoop and Cloudera appeared first on Cloudera Engineering Blog.

    New in Cloudera Labs: Apache HTrace (incubating)


    Via a combination of beta functionality in CDH 5.5 and new Cloudera Labs packages, you now have access to Apache HTrace for doing performance tracing of your HDFS-based applications.

    HTrace is a new Apache incubator project that provides a bird’s-eye view of the performance of a distributed system. While log files can provide a peek into important events on a specific node, and metrics can answer questions about aggregate performance,

    Read More

    The post New in Cloudera Labs: Apache HTrace (incubating) appeared first on Cloudera Engineering Blog.

    DistCp Performance Improvements in Apache Hadoop


    Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.

    DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) Its popularity has grown despite relatively slow performance.

    In this post, we’ll provide a quick introduction to DistCp.

    Read More

    The post DistCp Performance Improvements in Apache Hadoop appeared first on Cloudera Engineering Blog.

    New in Cloudera Enterprise 5.5: Improvements to HUE for Automatic HA Setup and More


    Cloudera Enterprise 5.5 improves the life of the admin through a deeper integration between HUE and Cloudera Manager, as well as a rebase on HUE 3.9.

    Cloudera Enterprise 5.5 contains a number of improvements related to HUE (the open source GUI that makes Apache Hadoop easier to use), including easier setup for HUE HA, built-in activity monitoring for improved stability, and better security and reporting via Cloudera Navigator and Apache Sentry (incubating).

    Read More

    The post New in Cloudera Enterprise 5.5: Improvements to HUE for Automatic HA Setup and More appeared first on Cloudera Engineering Blog.
