1. Filesystem counters
Filesystem counters are used to analyze experimental results. The following are the typical built-in filesystem counters.
| File system | Read counter | Write counter |
| --- | --- | --- |
| Local file system | FILE_BYTES_READ | FILE_BYTES_WRITTEN |
| HDFS | HDFS_BYTES_READ | HDFS_BYTES_WRITTEN |
FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all the map input comes from HDFS, FILE_BYTES_READ should be zero in the map phase. The reducers, on the other hand, read their input from reduce-side local disks, where it was fetched from the map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by reducers.
FILE_BYTES_WRITTEN consists of two parts. The first part comes from mappers: every mapper spills intermediate output to disk, and all the bytes mappers write to disk are included in FILE_BYTES_WRITTEN. The second part comes from reducers: in the shuffle phase, reducers fetch intermediate data from mappers, then merge and spill it to reducer-side disks. All the bytes reducers write to disk are also included in FILE_BYTES_WRITTEN.
HDFS_BYTES_READ denotes the bytes mappers read from HDFS when the job starts. It includes not only the contents of the source files but also metadata about the splits.
HDFS_BYTES_WRITTEN denotes the bytes written to HDFS, i.e., the size of the final output.
Note that since HDFS and the local file system are different file systems, the data from the two will never overlap.
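As a quick sanity check: the Total column for each counter is just Map + Reduce, and because the local and HDFS counters are disjoint, per-filesystem traffic can be summed without double counting. A minimal sketch, using the figures from the no-compression run below:

```python
# Per-phase counter values (bytes) from the no-compression run.
counters = {
    "FILE_BYTES_READ":    {"map": 0,             "reduce": 4_579_057_545},
    "FILE_BYTES_WRITTEN": {"map": 6_645_450_502, "reduce": 6_110_995_198},
    "HDFS_BYTES_READ":    {"map": 6_616_478_456, "reduce": 0},
    "HDFS_BYTES_WRITTEN": {"map": 0,             "reduce": 4_584_208_671},
}

# The Total column is simply map + reduce for each counter.
totals = {name: v["map"] + v["reduce"] for name, v in counters.items()}

# Local and HDFS traffic are tracked by disjoint counters, so summing
# per filesystem never counts the same byte twice.
local_io = totals["FILE_BYTES_READ"] + totals["FILE_BYTES_WRITTEN"]
hdfs_io = totals["HDFS_BYTES_READ"] + totals["HDFS_BYTES_WRITTEN"]
```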
2. Comparison of compression in three places
1) No compression
2) Only compressing input
We can see that HDFS_BYTES_READ is significantly reduced, i.e., the total bytes mappers read from HDFS drops substantially.
3) Only compressing intermediate map output
We can see that FILE_BYTES_READ and FILE_BYTES_WRITTEN are significantly reduced, which means data transfer between the nodes' local file systems drops substantially.
4) Only compressing output
We can see that HDFS_BYTES_WRITTEN is significantly reduced, i.e., the final output written to HDFS is much smaller.
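For reference, these three places map onto standard Hadoop job configuration properties. A sketch assuming Hadoop 2.x property names; the codec choices here are illustrative. Input compression (place 1) needs no property at all: the input codec is inferred from the file extension (e.g. .gz, .lzo).

```xml
<!-- mapred-site.xml (or per-job -D flags); Hadoop 2.x property names -->

<!-- Place 2: compress intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Place 3: compress the final job output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```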
1) No compression
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 0 | 4,579,057,545 | 4,579,057,545 |
| FILE_BYTES_WRITTEN | 6,645,450,502 | 6,110,995,198 | 12,756,445,700 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 4,584,208,671 | 4,584,208,671 |
2) Only compressing input
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 11,989,663,380 | 6,645,432,358 | 18,635,095,738 |
| FILE_BYTES_WRITTEN | 18,633,824,537 | 6,645,432,358 | 25,279,256,895 |
| HDFS_BYTES_READ | 1,049,256,588 | 0 | 1,049,256,588 |
| HDFS_BYTES_WRITTEN | 0 | 6,653,296,922 | 6,653,296,922 |
3) Only compressing intermediate map output
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 0 | 1,020,775,438 | 1,020,775,438 |
| FILE_BYTES_WRITTEN | 990,084,052 | 1,020,775,438 | 2,010,859,490 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 6,653,296,922 | 6,653,296,922 |
4) Only compressing output
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 0 | 6,645,432,490 | 6,645,432,490 |
| FILE_BYTES_WRITTEN | 6,645,450,502 | 6,645,432,490 | 13,290,882,992 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 997,479,368 | 997,479,368 |
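The pattern can be summarized numerically. A small sketch, using the Total figures from these tables, showing how far the targeted counter shrinks relative to the no-compression baseline:

```python
# Total counter values (bytes) taken from the tables above;
# baseline is the no-compression run.
baseline = {
    "HDFS_BYTES_READ":    6_616_478_456,
    "FILE_BYTES_WRITTEN": 12_756_445_700,
    "HDFS_BYTES_WRITTEN": 4_584_208_671,
}

# The counter each compression place is expected to shrink,
# and its value in the corresponding compressed run.
compressed = {
    "input":      ("HDFS_BYTES_READ",    1_049_256_588),
    "map output": ("FILE_BYTES_WRITTEN", 2_010_859_490),
    "output":     ("HDFS_BYTES_WRITTEN",   997_479_368),
}

for place, (counter, value) in compressed.items():
    ratio = value / baseline[counter]
    print(f"compressing {place}: {counter} shrinks to {ratio:.0%} of baseline")
```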
3. Comparison of different compression formats: gzip, LZO, Snappy
| Compression | File | Size (GB) | Compression time (s) | Decompression time (s) |
| --- | --- | --- | --- | --- |
| None | logs | 8.0 | - | - |
| Gzip | logs.gz | 1.3 | 241 | 72 |
| LZO | logs.lzo | 2.0 | 55 | 35 |
| Snappy | - | 4.2 | 40 | 27 |
We can also see that the Snappy file is larger than the corresponding LZO file, but still about half the size of the original. In addition, Snappy compresses and decompresses even faster than LZO. In sum, Snappy is faster in compression and decompression time but less efficient in terms of compression ratio.
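To make the trade-off concrete, the ratios and throughputs implied by the table can be computed directly. A sketch using the table's figures; throughput here means megabytes of original input processed per second:

```python
# Size (GB) and timing (s) figures taken from the table above.
ORIGINAL_GB = 8.0
formats = {
    "gzip":   {"size_gb": 1.3, "compress_s": 241, "decompress_s": 72},
    "lzo":    {"size_gb": 2.0, "compress_s": 55,  "decompress_s": 35},
    "snappy": {"size_gb": 4.2, "compress_s": 40,  "decompress_s": 27},
}

for name, f in formats.items():
    ratio = ORIGINAL_GB / f["size_gb"]             # higher = smaller output
    speed = ORIGINAL_GB * 1024 / f["compress_s"]   # MB of input per second
    print(f"{name}: ratio {ratio:.1f}x, compresses ~{speed:.0f} MB/s")
```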