Posts

Choosing Kerberos approach for Hadoop cluster in an enterprise environment

Factors to consider before choosing an approach for Kerberos implementation within an enterprise.
Choosing an approach for Kerberos implementation on a Hadoop cluster is critical from a long-term maintenance standpoint. Enterprises have their own security policies and guidelines, and a successful Kerberos implementation needs to adhere to the enterprise security architecture. There are multiple guides available on how to implement Kerberos, but I couldn't find information on which approach to choose or the pros and cons associated with each approach.
In a Hortonworks Hadoop cluster, there are three different ways of generating and managing keytabs and principals.
a. Use an MIT KDC specific to the Hadoop cluster - automated keytab management using Ambari
A KDC specific to the Hadoop cluster can be installed and maintained on one of the Hadoop nodes. All users/keytabs required for the Kerberos implementation are automatically managed using Ambari.
Pros…

Default mapred.tasktracker.map.tasks.maximum and Increasing io.sort.mb

The defaults are mapred.tasktracker.map.tasks.maximum = 2 and mapred.tasktracker.reduce.tasks.maximum = 2. If you want to change them, you should edit the file ${HADOOP_HOME}/conf/mapred-site.xml, where ${HADOOP_HOME} is the Hadoop installation path. For example, if you determine that you want 8 reducers (this can be done by calling conf.setNumReduceTasks(8); in your code) and you keep these default values, then, assuming you have 2 nodes in the cluster, each node will run 2 map tasks at the beginning, so 2 x 2 = 4 map tasks will be running in your cluster in total. When any of these map tasks finishes, the node will run the next map task in the queue. At any point, at most 4 map tasks will be running in your cluster. EDIT: I found the mistake. In the first link it says: "The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum)." It should be: "The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum)."
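As a hedged illustration of that corrected formula (not from the original post), a job driver could compute the reducer count from the cluster figures; the node count and per-node slot count below are hypothetical placeholders for your own cluster's values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        JobConf jobConf = new JobConf(conf);

        // Hypothetical cluster figures: adjust to your environment.
        int nodes = 2;              // number of worker nodes in the cluster
        int reduceSlotsPerNode = 2; // mapred.tasktracker.reduce.tasks.maximum

        // Rule of thumb quoted above: 0.95 * (nodes * reduce slots)
        // keeps all reducers busy in a single wave.
        int numReducers = (int) Math.floor(0.95 * nodes * reduceSlotsPerNode);

        jobConf.setNumReduceTasks(numReducers);
        System.out.println("Reducers: " + numReducers);
    }
}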
Increasing…

5 Tips for efficient Hive queries with Hive Query Language

Hive on Hadoop makes data processing so straightforward and scalable that we can easily forget to optimize our Hive queries. Well-designed tables and queries can greatly improve your query speed and reduce processing cost. This article includes five tips, which are as valuable for ad-hoc queries (to save time) as for regular ETL (Extract, Transform, Load) workloads (to save money). The three areas in which we can optimize our Hive utilization are:
Data Layout (Partitions and Buckets)
Data Sampling (Bucket and Block Sampling)
Data Processing (Bucket Map Join and Parallel Execution)
We will discuss these areas in detail in this article. If you like, you can also watch our webinar on the topic, given by Ashish Thusoo, co-founder of Apache Hive, and Sadiq Sid Shaik, Director of Product at Qubole. Example Data Set: We can illustrate the improvements best on an example data set we use at Qubole. The data consists of three tables. The table Airline Bookings All contains 276 million…
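As a minimal sketch of the data-layout tip above (not from the original article), a table could be partitioned and bucketed through the Hive JDBC driver; the table name, columns, and HiveServer2 endpoint are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveLayoutExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host/port for your cluster.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // Partitioning by date means queries that filter on booking_date
            // scan only the matching partitions; clustering (bucketing) on the
            // join key makes bucket map joins and bucket sampling possible.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS airline_bookings (" +
                "  account_id BIGINT, amount DOUBLE) " +
                "PARTITIONED BY (booking_date STRING) " +
                "CLUSTERED BY (account_id) INTO 32 BUCKETS");
        }
    }
}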

Big Data, Smaller Problems: Configuring Kerberos Authentication for Hadoop

In my first SecurityWeek column, I provided an overview of what Hadoop is, why it is gaining ground in the enterprise, what some of the security challenges are, and ways to overcome them. In this installment, I will dive deeper into how to solve an actual and pressing Hadoop security challenge: authentication.
By default, Hadoop is not secure and simply trusts that users are who they say they are. This is an example of how Hadoop communicates with users on its own:
"Hello, Hadoop. I am Don Quixote. Please count the number of windmills I have encountered.” “Hello, Don Quixote, I’m Hadoop. Here is a look at the number of windmills you have encountered.” This example is accurate because Hadoop has no native authentication controls in place. On its own, Hadoop does not challenge user identities. Within real business use cases, especially when confidential and sensitive data sets are involved, restricting access to only authorized users is critical.
To enable…

Hadoop Security: Kerberos Tutorial

There are the following things to remember:
1. There are three parties involved in this process overall:
a. Client: you, who want to access the FileServer (Principal)
b. KDC (it is made up of two components):
i. Authentication Service
ii. Ticket Granting Service
c. FileServer: the actual resource which you want to access
2. In total there are 3 secret keys (1 for the Client, 1 for the File Server, 1 for the KDC itself), which never ever travel over the network:
a. The Client key resides on the client machine as well as on the KDC
b. The Server key resides on the Server machine as well as on the KDC
c. The KDC key resides only on the KDC machine

[Diagram: the Kerberos ticket exchange between the Client machine, the File Server machine, and the KDC machine]
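On a Kerberized Hadoop cluster, a client process typically authenticates with a keytab holding its long-term secret key, so the key itself never travels over the network; only Kerberos tickets do. A minimal sketch, assuming a hypothetical principal and keytab path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster requires Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Hypothetical principal and keytab path; substitute your cluster's values.
        UserGroupInformation.loginUserFromKeytab(
            "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

        System.out.println("Logged in as: "
            + UserGroupInformation.getLoginUser().getUserName());
    }
}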

Big Data (Hadoop) Glossary

3Vs of Big Data
Three Vs of Big Data: Volume (Big), Velocity (Fast) and Variety (Smart). Defined more clearly, Big Data is typically explained via the 3Vs: Volume (2.5 quintillion bytes of data are estimated to be created every day), Variety (data from all possible sources, from structured to unstructured) and Velocity (the tremendous speed at which data is generated due to the increasing digitization of society).

ACID Properties
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.

Atomicity
Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left…
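As a hedged sketch of atomicity in practice (not part of the original glossary), here is the funds-transfer example from the ACID entry as a single JDBC transaction; the connection URL and the accounts(id, balance) table are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AtomicTransferExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and schema: accounts(id, balance).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:bank")) {
            conn.setAutoCommit(false); // group both updates into one transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, 100.0);
                debit.setLong(2, 1L);
                debit.executeUpdate();

                credit.setDouble(1, 100.0);
                credit.setLong(2, 2L);
                credit.executeUpdate();

                conn.commit(); // both changes become visible together...
            } catch (Exception e) {
                conn.rollback(); // ...or neither does: "all or nothing"
                throw e;
            }
        }
    }
}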