Big Data, Smaller Problems: Configuring Kerberos Authentication for Hadoop

In my first SecurityWeek column, I provided an overview of what Hadoop is, why it is gaining ground in the enterprise, what some of its security challenges are, and ways to overcome them. In this installment, I will dive deeper into how to solve an actual and pressing Hadoop security challenge: authentication.
By default, Hadoop is not secure and simply trusts that users are who they say they are. This is an example of how Hadoop communicates with users on its own:
“Hello, Hadoop. I am Don Quixote. Please count the number of windmills I have encountered.”
“Hello, Don Quixote, I’m Hadoop. Here is a look at the number of windmills you have encountered.”
This example is accurate because Hadoop has no native authentication controls in place. On its own, Hadoop does not challenge user identities. Within real business use cases, especially when confidential and sensitive data sets are involved, restricting access to only authorized users is critical.
To enable permissions and authorization for Hadoop users, administrators need to first solve the challenge of user identity verification — a hurdle overcome through authentication.
Understanding Kerberos for Hadoop
Being able to confirm a user’s identity is the basic goal of authentication, something everyone is accustomed to when logging in to a laptop or a social media website.
So how does Hadoop solve the authentication problem? Hadoop has adopted a well-known authentication method that was developed at MIT (Massachusetts Institute of Technology) named Kerberos. Kerberos technology builds on cryptographic methods to establish ways for users (and systems) to identify themselves, and to create authentication tickets that can be presented to multiple services.
With Kerberos in place, the conversation between Hadoop and Don Quixote (or maybe not Don Quixote) would sound a bit different.
“Hello, Hadoop. I am Don Quixote. Please count the number of windmills I have encountered.”
“Hello, I am Hadoop. I have checked with our Kerberos server and confirmed that you are in fact NOT Don Quixote. The ticket you presented is for Sancho Panza.”
This example shows that the benefits of using Kerberos with Hadoop are clear. Kerberos can confirm a user’s identity to the services running in the cluster, including HDFS, MapReduce, YARN, Hive, Flume, Hue, Oozie and ZooKeeper, to name a few.
Understanding Kerberos
To understand better the capabilities Kerberos provides, let’s review the services that make up a Kerberos server.
The KDC, or Key Distribution Center, is the server that authenticates users and issues login tickets using the following two services:
• Authentication Service — This service issues Ticket Granting Tickets (TGTs).
• Ticket Granting Service — This service validates a TGT and issues a service ticket, thereby authenticating the user to the requested service.
Please note that being authenticated to Kerberos is not sufficient on its own; a user also needs a proper ticket for each service. For example, a ticket for HDFS cannot also be used for Hive.
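To make the distinction between the TGT and per-service tickets concrete, here is a sketch of what a user’s ticket cache might look like after authenticating and then touching HDFS and Hive. The principal names, timestamps and cache path are hypothetical, and the exact klist output varies by Kerberos implementation:

```
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: donquixote@DELAMANCHA.COM

Valid starting     Expires            Service principal
06/01/16 09:00:00  06/01/16 19:00:00  krbtgt/DELAMANCHA.COM@DELAMANCHA.COM
06/01/16 09:01:12  06/01/16 19:00:00  hdfs/node1.delamancha.com@DELAMANCHA.COM
06/01/16 09:02:45  06/01/16 19:00:00  hive/node1.delamancha.com@DELAMANCHA.COM
```

The krbtgt entry is the TGT issued by the Authentication Service; the hdfs and hive entries are the separate per-service tickets obtained from the Ticket Granting Service.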
As mentioned previously, by default Kerberos is not enabled on Hadoop. So I will cover some basic concepts and starting points to enabling and configuring Kerberos. For an in-depth technical understanding, I recommend reading “Kerberos: The Definitive Guide” and “Hadoop Security.”
Key Concepts for Configuring Kerberos on Hadoop
To lighten up the technical concepts, I will walk through configuring Kerberos for Hadoop using Don Quixote; his friend, Sancho Panza; his love, Dulcinea; and his horse, Rocinante.
A user in Kerberos is defined as a UPN or User Principal Name.
Example UPNs:
• donquixote@DELAMANCHA.COM
• sanchopanza@DELAMANCHA.COM
As you can see, a UPN looks very similar to an email address.
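On an MIT Kerberos KDC, user principals like these would typically be created with the kadmin.local tool on the KDC host itself. A minimal sketch, assuming an MIT KDC and using names that follow the article’s realm:

```
# Run on the KDC host; each command prompts for the new principal's password
$ kadmin.local -q "addprinc donquixote@DELAMANCHA.COM"
$ kadmin.local -q "addprinc sanchopanza@DELAMANCHA.COM"
```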
Systems that the user will log into or run Hadoop processes on are defined in Kerberos as SPN or Service Principal Name.
Example SPNs:
• gallop/rocinante.delamancha.com@DELAMANCHA.COM
• mapred/node1.delamancha.com@DELAMANCHA.COM
You can think of an SPN as the URL of a website, but the format is a bit different. Imagine rocinante and node1 are servers in the Hadoop cluster, running the gallop and mapred services, respectively.
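Since the service/host@REALM format can be confusing at first, this small shell sketch pulls an SPN apart into its three pieces with plain string handling. No KDC is needed; the SPN value is just the article’s example:

```shell
# Split a Kerberos SPN of the form service/host@REALM into its parts
spn="gallop/rocinante.delamancha.com@DELAMANCHA.COM"

service="${spn%%/*}"   # text before the first "/"
rest="${spn#*/}"       # everything after the "/"
host="${rest%@*}"      # text before the "@"
realm="${rest##*@}"    # text after the "@"

echo "service=$service host=$host realm=$realm"
```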
UPNs and SPNs are logically grouped into a realm, which is an administrative grouping and organization of users and services. An example realm is:
• DELAMANCHA.COM (Realms are uppercase by convention.)
A realm is pretty similar to what we know on the Internet as a domain name. Now with the basic understanding of a KDC, user principals, service principals and realms, let’s quickly walk through a Kerberos workflow. In this example, the KDC is running on the server kdc.delamancha.com.
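On client machines, the mapping from a realm to its KDC typically lives in /etc/krb5.conf. A minimal sketch for this example realm, assuming the KDC and admin server both run on kdc.delamancha.com:

```
[libdefaults]
    default_realm = DELAMANCHA.COM

[realms]
    DELAMANCHA.COM = {
        kdc = kdc.delamancha.com
        admin_server = kdc.delamancha.com
    }

[domain_realm]
    .delamancha.com = DELAMANCHA.COM
    delamancha.com = DELAMANCHA.COM
```

The [domain_realm] section is what lets clients infer that a host like rocinante.delamancha.com belongs to the DELAMANCHA.COM realm.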
Don Quixote wants Rocinante to gallop, but Rocinante will NOT gallop for anyone but Don Quixote. So, he must prove he is who he says he is.
1. Don Quixote sends an authentication request to the KDC.
2. The KDC sends back an encrypted ticket.
3. Don Quixote decrypts the ticket by providing his password (Dulcinea).
4. Now authenticated, Don Quixote sends a request for a service ticket to gallop.
5. The KDC validates the TGT and sends back a service ticket that can only be used with gallop.
6. Don Quixote presents the service ticket to gallop/rocinante.delamancha.com.
7. gallop/rocinante.delamancha.com decrypts the ticket, validating the user’s identity, and Rocinante runs the gallop service for Don Quixote.
Below is an example of what the commands would look like for running a Hadoop word count and checking the count for “windmill” as user donquixote:
$ kinit donquixote@DELAMANCHA.COM
Enter password for donquixote@DELAMANCHA.COM:
$ bin/hadoop jar wordcount.jar /usr/donquixote/input /usr/donquixote/output
$ bin/hadoop dfs -cat /usr/donquixote/output/part-00000 | grep windmill
windmill X
This example does not go into many technical details, such as how all the encryption keys are created or why rocinante trusts something encrypted by the KDC. But rest assured that years of Kerberos engineering have made the technology industrial-strength. The example also skips over many complexities, including crypto algorithms, ticket lifetimes and many other configuration aspects.
The important takeaway here is that most services in Hadoop — like HDFS, MapReduce, YARN, Hive, Flume, Hue, Oozie and ZooKeeper — already support Kerberos to enable authentication, but they are not ON by default.
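As a starting point, switching core Hadoop from its default trust-everyone mode to Kerberos is controlled in core-site.xml. A minimal sketch of the two relevant properties (these are the standard property names; the rest of the setup, including principals, keytabs and per-service settings, still has to be in place):

```
<property>
    <name>hadoop.security.authentication</name>
    <!-- default is "simple", i.e., no authentication -->
    <value>kerberos</value>
</property>
<property>
    <name>hadoop.security.authorization</name>
    <!-- enable service-level authorization checks -->
    <value>true</value>
</property>
```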
Make no mistake, Kerberos is not easy to configure and use. Because of this, when evaluating a Hadoop management platform, look for built-in installation scripts and configuration wizards that create the necessary user principals, service principals and underlying configuration files that enable Kerberos for Hadoop.
For those who like to get into the nuts and bolts, read the two books mentioned above on Kerberos and Hadoop security to reach a better understanding of exactly how authentication is implemented in Hadoop.
In the next installment, I will talk about authorization, or “Why Rocinante will not gallop for Sancho Panza.”
