Big Data (Hadoop) Glossary

3Vs of BigData

Three Vs of Big Data: Volume (Big), Velocity (Fast) and Variety (Smart)

More clearly define, Big Data is typically explained via 3Vs – Volume (2.5 Quintillion Bytes of data are estimated to be created every day), Variety (data from all possible source from structured to unstructured) and Velocity (tremendous speed of generating data due to increasing digitization of society).

ACID Properties

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.

Atomicity

Atomicity requires that each transaction be “all or nothing“: if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.

Consistency

The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, i.e. which demands that the data must meet all validation rules including constraints,cascades, triggers, and any combination thereof.

Isolation

The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e., one after the other. Providing isolation is the main goal of concurrency control. Depending on concurrency control method (i.e. if it uses strict – as opposed to relaxed – serializability), the effects of an incomplete transaction might not even be visible to another transaction.

Durability

Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.

Active NN

The Active NameNode is responsible for all client operations in the cluster. Only one active NameNode could be present in a cluster.

Big ‘O’ notation

Big ‘O’ notation is a way to express the speed of algorithms.

O(1) notation means that, no matter how much data, it will execute in constant time.
O(n) notation means it takes an amount of time linear with the size of the data set, so increase the data set size double will take double the time.

CAP theorem

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency – all nodes always give the same result.
Availability – a guarantee that nodes always answer queries and accept updated. System is always on, no downtime.
Partition tolerance – system continues working even if one or more nodes become silent or not responsive.

A distributed system can satisfy any two of the above guarantees at the same time but not all three.

But I am confused that how a system can give atomic consistency and perfect availability without partition tolerance in a distributed system. We cannot, however, choose both consistency and availability in a distributed system.

As a thought experiment, imagine a distributed system which keeps track of a single piece of data using three nodes—A, B, and C—and which claims to be both consistent and available in the face of network partitions. Misfortune strikes, and that system is partitioned into two components: {A,B} and {C}. In this state, a write request arrives at node C to update the single piece of data.

That node only has two options:

Accept the write, knowing that neither A nor B will know about this new data until the partition heals.
Refuse the write, knowing that the client might not be able to contact A or B until the partition heals.

You either choose availability (option #1) or you choose consistency (option #2). You cannot choose both.

To claim to do so is claiming either that the system operates on a single node (and is therefore not distributed) or that an update applied to a node in one component of a network partition will also be applied to another node in a different partition component magically.

This is, as you might imagine, rarely true.

So the answer is, Consistency, Availability, and Partition Tolerance are the Platonic ideals of a distributed system–we can partake of them enough to meet business requirements, but the nature of reality is such that there will always be compromises.

Checkpointing

Checkpointing is a process that takes an fsimage and edit log and packed them into a new fsimage. This way, instead of replaying a potentially unbounded edit log, the NameNode can load the final in-memory state directly from the fsimage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is a very I/O and network intensive operation and can be time taken operation. So, rather than pausing the active NameNode to perform a checkpoint operation, HDFS defers this job to either the SecondaryNameNode or Standby NameNode depending HA configuration.

Checkpointing creates a new fsimage from an old fsimage and edit log.

Checkpointing process is triggered by one of two conditions: if enough time has elapsed since the last checkpoint (dfs.namenode.checkpoint.period), or if enough new edit log transactions have accumulated (dfs.namenode.checkpoint.txns). The checkpointing node periodically checks if either of these conditions are met (dfs.namenode.checkpoint.check.period), and if so, kicks off the checkpointing process.

Circular Buffer

A circular buffer, also known as a circular queue or ring buffer, is a high performance, first in, first out (FIFO) queue. As with any other type of queue, values can be added to the end of the queue and extracted from the start, so that items are dequeued in the same order that they were enqueued.

In some queue structures, when items are added or removed, the contents of the queue are physically moved in memory. With a circular buffer the positions are fixed. The head and tail of the queue are identified using two pointers that are updated when items are added or removed. In addition, the buffer spaces can be thought of as cyclic. After the last space in the buffer is used, the next item enqueued is stored in the first space. The diagram below shows a conceptual model of a circular buffer.

The diagram represents a circular buffer with sixteen spaces, thirteen of which are in use. In this case, the values one to thirteen have been added in order. The head of the queue is pointing at the latest value, thirteen, and the tail is at the space containing the number one. When the next item is enqueued it will be added in the next available empty space after the thirteen and the head pointer will be shifted to point to this position. When the next dequeue operation takes place the number one at the tail pointer will be extracted and the tail will move clockwise one space. When dequeuing there is no need to clear the old buffer space as the value will be overwritten at some point in the future.

Cluster

Cluster is a group of computers and other resources connected together in such a way that they behave like a single computer. Clustering is used for parallel processing, load balancing and fault tolerance. See figure

Cluster Split Brain

Split-brain can occur due to private communication fail between nodes but public communication still serving client requests in which case the cluster becomes partitioned into subclusters, and each subcluster believes that it is the only partition in the cluster and promote one of the node as master and serving client request independently. In this scenario, there would be two or more master nodes and try to access a shared resource independently which could cause a conflict in shared resources as data corruption. In hadoop, when communication fail between two name nodes (active and standby) and each think that is the only node present in the cluster and standby name node promote itself to active name node, which both think they are active and try to write shared edit log. See figure1, figure2

Cluster Amnesia

An example is a two-node cluster with nodes A and B. If node A goes down, the configuration data in the cluster is updated on node B only, and not node A. If node B goes down at a later time, and if node A is rebooted, node A will be running with old contents of the cluster. This state is called amnesia and might lead to running a cluster with old configuration information.

Cluster Quorum

Quorum is a voting algorithm used to prevent split-brain (split into two or more active subclusters or partitions) problem in a cluster by assign each node one vote and mandating a majority of votes for an operational cluster. A partition with the majority of votes has a quorum and is enabled to operate and other partition will become inactive. Quorum consists of a simple majority (number of nodes / 2 + 1 extra). It is recommended to make odd number of nodes in a cluster to make a quorum otherwise in case of even number of nodes, we may need to tiebreak (if equal number of nodes in each partition) – maybe by giving a member a “super vote” or by disabling one member’s vote.

Cluster Fencing

Fencing is the process of isolating a node of a computer cluster or protecting shared resources when a node appears to be malfunctioning.

DataNode (DN)

DataNodes (DN) stores the actual Data in HDFS.

Data Warehouse vs Data Mart

A data warehouse is the place where all the data of a company is stored. It contains data from all the departments of the company. Often, it is called a central or enterprise data warehouse. Typically, a data warehouse assembles data from multiple source systems (RDBMS/CRM/ERP etc). Data can be model using Star Data or Snowflake schema.

A data mart provides the primary access to the data stored in the data warehouse. A data mart is a simple form of a data warehouse (mini data warehouse) or subset of a data warehouse that supports the requirements of a particular department or business function, such as Sales, Finance, or Marketing. Data can be model using Star Data or Snowflake schema.

Figure shows dependent data mart: data mart exists with data warehouse

Figure shows independent data mart: data mart exists without data warehouse

Data Cleansing

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. E.g., due to misspellings during data entry, missing information or other invalid data. When multiple data sources (probability that some of the sources contain “dirty data” is high) need to be integrated in Data Warehouses, the need for data cleaning increases significantly. In Data Warehouses, data cleaning is a major part of the so-called ETL process. Furthermore, data warehouses are used for decision making, so that the correctness of their data is vital to avoid wrong conclusions. Examples of such invalid/inconsistent data are showing in below tables.

DDL & DML

Data Definition Language (DDL) statements are used to build and modify the tables and other objects in the database. Some examples:

CREATE – to create objects in the database
ALTER – alters the structure of the database
DROP – delete objects from the database
TRUNCATE – remove all records from a table, including all spaces allocated for the records are removed
COMMENT – add comments to the data dictionary
RENAME – rename an object

Data Manipulation Language (DML) statements are used to retrieve, store, modify, delete, insert and update data in the database. Some examples:

SELECT – retrieve data from the a database
INSERT – insert data into a table
UPDATE – updates existing data within a table
DELETE – deletes all records from a table, the space for the records remain
MERGE – UPSERT operation (insert or update)
CALL – call a PL/SQL or Java subprogram
EXPLAIN PLAN – explain access path to data
LOCK TABLE – control concurrency

Deadlock

A simple example is when Process 1 has obtained an exclusive lock on Resource A, and Process 2 has obtained an exclusive lock on Resource B. If Process 1 then tries to lock Resource B, it will have to wait for Process 2 to release it. But if Process 2 then tries to lock Resource A, both processes will wait forever for each other.

Namespace

It consists of files, directories, file blocks etc. Namespace operations are creation of files, deletion of files, renaming of files etc.

DFS Namespaces

A namespace is a logical layer that is inserted between clients (users and applications) and file systems. A DFS (Distributed File System) namespace is basically a place where you will have links to all your file shares. As it means an administrator can use a namespace to logically arrange and present data to users, irrespective of where the data is physically located. Your users will see it as a single share with many folders and they will have no idea that they are navigating across a set of servers to get to the subfolders and files.

ETL (Extract-Transform-Load)

Most Data Warehouse systems are front-ended with an ETL system. The purpose of this system is to:

Extract data from outside source systems to a staging area.
Transform it to fit operational needs with business calculations applied.
Load it from the staging area into the target data warehouse.

Popular ETL tools are DataStage, Informatica, or SQL Server Integration Services (SSIS).

File Edit Log

A file edit log (transaction log or operations log) tracks changes to the namespace of a file system. A file edit log will record changes such as creates, links, unlinks (deletions), renamed files, data changes, and metadata changes which can be applied to fsimage file through checkpointing process.

Filesystem Metadata

Filesystem Metadata is the data that describes the structure of a file system, such as hierarchical namespace, file and directory attributes and mapping of file to data chunks.

roughly, 1 GB metadata = 1 PB physical storage

HA Cluster

The HDFS High Availability (HA) feature addresses the below problems by providing the option of running two redundant NameNodes in the same cluster in
an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes.

1. In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator manually restarted the NameNode.
2. Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in periods of cluster downtime.

Hadoop Daemon

Daemons in computing terms is a process that runs in the background. Under Windows such process are called service. In Hadoop, each of the daemon run in its own JVM.

Hadoop Distributed File System (HDFS)

HDFS is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS stores file system metadata and application data separately. HDFS architecture consists of NameNode, DataNode, and HDFS Client. NameNode stores file system metadata and DataNode stores actual application data. HDFS client always first communicate with NameNode and then it will directly communicate with DataNode for any read and write operation on actual data.

HDFS Client

Client program developed in Java or other language, which writes and reads files to and from hadoop cluster. User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem interface.

Internet of Things (IoT)

Simply put this is the concept of basically connecting any device with an on and off switch to the Internet (and/or to each other). This includes everything from cell phones, coffee makers, washing machines, headphones, lamps, wearable devices and almost anything else you can think of. This also applies to components of machines, for example a jet engine of an airplane or the drill of an oil rig. As I mentioned, if it has an on and off switch then chances are it can be a part of the IoT. The analyst firm Gartner says that by 2020 there will be over 26 billion connected devices…that’s a lot of connections (some even estimate this number to be much higher, over 100 billion). The IoT is a giant network of connected “things” (which also includes people). The relationship will be between people-people, people-things, and things-things.

How does this impact you?

The new rule for the future is going to be, “anything that can be connected, will be connected.” But why on earth would you want so many connected devices talking to each other? There are many examples for what this might look like or what the potential value might be. Say for example you are on your way to a meeting, your car could have access to your calendar and already know the best route to take, if the traffic is heavy your car might send a text to the other party notifying them that you will be late. What if your alarm clock wakes up you at 6 am and then notifies your coffee maker to start brewing coffee for you? What if your office equipment knew when it was running low on supplies and automatically re-ordered more? What if the wearable device you used in the workplace could tell you when and where you were most active and productive and shared that information with other devices that you used while working?

Join

INNER JOIN – This produces a set of records which match in both the user (id, name, course) left table and course (id, name) right table, i.e. all users who are enrolled on a course:

SELECT user.name, course.name
FROM `user`
INNER JOIN `course` on user.course = course.id;

OUTER JOIN – This produces all the records from both left and right tables whether they match or not. It fill with NULL value if it does not match.

LEFT JOIN – This produces all the records from left side table whether they match or not. It fill with NULL value if it does not match.

RIGHT JOIN – This produces all the records from right side table whether they match or not. It fill with NULL value if it does not match.

Log Processing

There is a large amount of “log” data generated at any sizeable internet company. This data typically includes (1) user activity events corresponding to logins, page views, clicks, “likes”, sharing, comments, and search queries; (2) operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine.

OLTP vs OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

OLTP (On-line Transaction Processing) deals with operational/transactional data and is designed to efficiently process short high volumes of transactions (INSERT, UPDATE, and DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second. In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually 3NF). Example of OLTP systems are Operational Systems (RDBMS/ERP/CRM etc).

OLAP (On-line Analitical Processing) deals with historical data and is designed for analysis and decision support. It is characterized by relatively low volume of transactions. In addition, the Queries needed for these systems are often very complex and involve aggregations as for OLAP systems the response time is an effectiveness measure. In OLAP database there is aggregated, historical data, stored in multi-dimensional schemas (usually star/snowflake schema). Example of OLAP system are Data Warehouse Systems (Teradata/Hadoop).

Page Cache

If data is written, it is first written to the Page Cache (in an area of memory) and managed as one of its dirty pages. Dirty means that the data is stored in the Page
Cache, but needs to be written to the underlying storage device first. The content of these dirty pages is periodically transferred (the flusher threads then periodically write back to disk any dirty pages) to the underlying storage device.

During a page I/O operation, such as read(), the kernel checks whether the data resides in the page cache. If the data is in the page cache, the kernel can quickly return the requested page rather than read the data off the disk.

MapReduce

Distributed huge data batch processing framework written in Java.

MD5 Hash

MD5 which stands for Message Digest algorithm 5 is a widely used cryptographic hash function. The idea behind this algorithm is to take up input data (.txt/.zip/.tar or binary) and generate a fixed size “hash value” (e.g. e4d9 09c2 90d0 fb1c a068 ffad df22 cbd0) as the output. The input data can be of any size or length, but the output “hash value” size is always fixed (32 digit hexadecimal value).

Whenever you download any valuable data from the Internet, it is completely necessary that you check the integrity of the downloaded file. That is, you need to ensure that the downloaded file is exactly the same as that of the original one. The file can be corrupted due to any of the following reasons:

Data loss during the download process, due to instability in the Internet connection/server.
The file can be tampered due to virus infections or
Due to Hacker attacks.

In this scenario, the MD5 hash can become handy. All you have to do is generate MD5 hash (or MD5 check-sum) for the intended file on your server. After you download the file onto your PC, again generate MD5 hash for the downloaded file. Compare these two hash values and if they match, that means the file is downloaded perfectly without any data loss. Otherwise it means that the file is corrupt.

You can even try this on your own using the MD5 hash generator tool here.

NameNode (NN)

NameNode stores meta data (No of Blocks, On Which Rack which DataNode the actual data is stored and other details) about the actual data being stored in HDFS. Sometimes it is called Metadata Server.

Namespace Vs Metadata

The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. Essentially, Namespace means a container. In this context is means the file name grouping or hierarchy structure.

Metadata contains things like the owners of files, file permission, no of blocks, block locations, size etc.

NAS

Network Attached Storage (NAS) is a single shared storage device which is a convenient and centralised place to store and share data among multiple computers in a network. See figure

NFS

Network File System (NFS) is a distributed file system protocol , allowing a user to access files over a network much like local storage (hard drive) is accessed. For example, if you were using a computer linked to a second computer via NFS, you could access files on the second computer as if they resided in a directory on the first computer.

Node

Any system (computer) or device connected to a network is called a node.

On-Line Transaction Processing (OLTP)

On-Line Transaction Processing (OLTP) databases are designed for large numbers of concurrent transactions, where each transaction is a relatively simple operation processing a small amount of data.

Persistence data structure

Immutable (does not allow modification) data structure with history. No in-place modification. Operations on it create new data versions.Older versions are always available. Making update to a persistent data structure instance always creates a new instance that contains this update. We want persistence data structures to be space and time efficient. Two of most popular persistence data structure are Btree and Queue.

Figure: History of updates

Persistence queue

Represent a queue using two lists:

the “front part” of the queue
the “back part” of the queue in reverse order
E.g.:
([1,2,3],[7,6,5,4]) represents the queue with elements 1,2,3,4,5,6,7

To enqueue (Write operation) an element, just cons it onto the back list.
To dequeue (Read operation) an element, just remove it from the front list.

So, there is no inter-dependency between the writes and reads, and they can occur simultaneously.

Rack

Rack is a standardized frame to hold collection of network equipments (switches, routers, UTM appliances, servers, patch panels, cables, modems, etc). See figure

Race Condition

A race condition occurs when two threads access a shared variable at the same time and they try to change it at the same time. Because the thread scheduling algorithm can swap between threads at any time, you don’t know the order in which the threads will attempt to access the shared data. Therefore, the result of the change in data is dependent on the thread scheduling algorithm, i.e. both threads are “racing” to access/change the data.

“Real-time” mean

Real-time is often used to describe two different ideas:

Data “freshness” – how much time has passed since the data was written into the database until we can include it in analytics results?. Terms that fit this quick analytical access to the data are real-time (immediate), near real-time (within seconds), low latency (minutes) or high latency (hours).
Query response time – how long does it take before the user receives a response – the classic “how quick is the query” question. Terms that fit here are immediate (real-time event response), interactive/ad-hoc (seconds), “quick” (minutes) and batch (hours).

Ad-hoc query example: An Ad-hoc Query is a query which is created on the fly, it’s not predefined. As the basic structure of an SQL statement consist of SELECT keyword FROM table WHERE conditions, an ad hoc query dynamically supplies the keyword, data source and the conditions without the user knowing it.

UseCase-1: Stock trading – uses real-time data that is fresh, and requires an immediate response time.

UseCase-2: Dashboards – are typically near real-time, and requires an immediate response time. If you’ve invested time and money to create a dashboard, you’d want it to show what was going on minutes ago, not yesterday.

SAN

Storage Area Network (SAN) is a high-speed collection (pool) of shared storage devices interconnected through Fibre Channel like LAN. See figure

Secondary NameNode (SNN)

SNN is a checkpoint node which stores the image of primary NameNode at certain checkpoint and is used as backup to restore NameNode.

Standby NN

The Standby NameNode is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

Stream Data

A stream is a continuous flow of data. Ex. A video is a stream of still images.

Stream Processing

Stream processing refer to a method of continuous computation that happens as data is flowing through the system. A real time processing system can be achieved by using a software architecture that utilizes continuous (stream) processing. In stream processing, sometimes data might be awaiting processing. The only constraint on such a stream processing system is that its output rate (consumtion) should be faster or atleast equal to the input rate (producing). Otherwise the storage requirements of the system grow without bound. Additionally, it must have enough memory to store queued inputs should it be stuck while processing any item in the input stream.

Switch

A switch is a computer networking device that connects devices together on a computer network. Each rack of nodes is interconnected using a rack-level switch. Each rack-level switch is connected to a cluster-level switch. See figure

Tuple

The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be any type.

Zero Copy

Applications which take data copy from disk to socket require multiple copies of same data. So it not only consumes CPU cycles and memory bandwidth, also I/O performance is going down. Fortunately, you can eliminate these copies through a technique called — zero copy.

Traditional Approach: Copying bytes from a file to a socket

Zero-Copy Approach: Copying bytes from a file to a socket