Showing posts from 2015

5 Tips for efficient Hive queries with Hive Query Language

Hive on Hadoop makes data processing so straightforward and scalable that we can easily forget to optimize our Hive queries. Well-designed tables and queries can greatly improve your query speed and reduce processing cost. This article covers five tips that are as valuable for ad-hoc queries, where they save time, as for regular ETL (Extract, Transform, Load) workloads, where they save money. The three areas in which we can optimize our Hive utilization are:
Data Layout (Partitions and Buckets)
Data Sampling (Bucket and Block sampling)
Data Processing (Bucket Map Join and Parallel execution)
We will discuss these areas in detail in this article. If you like, you can also watch our webinar on the topic, given by Ashish Thusoo, co-founder of Apache Hive, and Sadiq Sid Shaik, Director of Product at Qubole.

Example Data Set

We can best illustrate the improvements with an example data set we use at Qubole. The data consists of three tables. The table Airline Bookings All contains 276 million…

Big Data, Smaller Problems: Configuring Kerberos Authentication for Hadoop

In my first SecurityWeek column, I provided an overview of what Hadoop is, why it is gaining ground in the enterprise, what some of the security challenges are and ways to overcome them. In this installment, I will dive deeper into how to solve an actual and pressing Hadoop security challenge, specifically authentication.
By default, Hadoop is not secure and simply trusts that users are who they say they are. This is an example of how Hadoop communicates with users on its own:
“Hello, Hadoop. I am Don Quixote. Please count the number of windmills I have encountered.”
“Hello, Don Quixote, I’m Hadoop. Here is a look at the number of windmills you have encountered.”
This example is accurate because Hadoop has no native authentication controls in place. On its own, Hadoop does not challenge user identities. Within real business use cases, especially when confidential and sensitive data sets are involved, restricting access to only authorized users is critical.
To enable…

Hadoop Security : Kerberos Tutorial

Things to remember:
1. There are three parties involved in the overall process:
   a. Client: you, who want to access the FileServer (the principal)
   b. KDC (made of two components):
      i. Authentication Service
      ii. Ticket Granting Service
   c. FileServer: the actual resource you want to access
2. There are three secret keys in total (one for the Client, one for the File Server, one for the KDC itself), which never travel over the network:
   a. The Client key resides on the client machine as well as on the KDC
   b. The Server key resides on the server machine as well as on the KDC
   c. The KDC key resides only on the KDC machine
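The key placement described above can be modeled in a few lines. This is only a sketch of where each long-term key lives, not of the Kerberos protocol itself; the machine names, the `key_stores` dictionary and the `holders` helper are all hypothetical names chosen for illustration.

```python
import secrets

# Hypothetical long-term secret keys, one per party (random bytes for illustration).
client_key = secrets.token_bytes(16)
server_key = secrets.token_bytes(16)
kdc_key = secrets.token_bytes(16)

# Which machine stores which key. The KDC holds a copy of every key;
# the client and the file server each hold only their own.
key_stores = {
    "client_machine": {"client": client_key},
    "file_server_machine": {"file_server": server_key},
    "kdc_machine": {"client": client_key, "file_server": server_key, "kdc": kdc_key},
}


def holders(key_name):
    """Return the sorted list of machines that store the named key."""
    return sorted(m for m, keys in key_stores.items() if key_name in keys)


for name in ("client", "file_server", "kdc"):
    print(name, "->", holders(name))
```

Because each party shares its key only with the KDC, the client and the file server never need to exchange a secret directly.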

[Diagram: Client Machine, File Server Machine, KDC Machine]

Big Data (Hadoop) Glossary

3Vs of BigData
Three Vs of Big Data: Volume (Big), Velocity (Fast) and Variety (Smart). More clearly defined, Big Data is typically explained via the 3Vs: Volume (an estimated 2.5 quintillion bytes of data are created every day), Variety (data from all possible sources, from structured to unstructured) and Velocity (the tremendous speed at which data is generated due to the increasing digitization of society).

ACID Properties
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even though it involves multiple changes such as debiting one account and crediting another, is a single transaction.

Atomicity
Atomicity requires that each transaction be “all or nothing”: if one part of the transaction fails, the entire transaction fails, and the database state is left…
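
The funds-transfer example can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration of the "all or nothing" behavior, not a production pattern; the `accounts` table, the account names and the `transfer` helper are assumptions made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Failing here undoes BOTH updates above.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails and is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer debits and credits inside the transaction, then fails the balance check, so neither change is kept.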