1. Filesystem counters

Filesystem counters are used to analyze experimental results. The following are the typical built-in filesystem counters.
Local file system
  FILE_BYTES_READ
  FILE_BYTES_WRITTEN
HDFS file system
  HDFS_BYTES_READ
  HDFS_BYTES_WRITTEN
FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all the map input data comes from HDFS, FILE_BYTES_READ should be zero in the map phase. The input of the reducers, on the other hand, is data on the reduce-side local disks, fetched from the map-side disks. Therefore, on the reduce side FILE_BYTES_READ denotes the total bytes read by reducers from local disk.

FILE_BYTES_WRITTEN consists of two parts. The first part comes from mappers: all mappers spill intermediate output to disk, and every byte they write to local disk is counted in FILE_BYTES_WRITTEN. The second part comes from reducers: in the shuffle phase, reducers fetch intermediate data from mappers, then merge and spill it to reduce-side disks, and every byte they write to local disk is also counted in FILE_BYTES_WRITTEN.

HDFS_BYTES_READ denotes the bytes mappers read from HDFS when the job starts. This includes not only the content of the source files but also metadata about the splits.

HDFS_BYTES_WRITTEN denotes the bytes written to HDFS, i.e., the number of bytes of the final output.

Note that since HDFS and the local file system are different file systems, the counters from the two never overlap.
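
These counters can also be inspected programmatically once a job completes. Below is a minimal sketch, assuming the Hadoop 2.x API, where filesystem counters are grouped by URI scheme ("FILE" for the local file system, "HDFS" for HDFS); the class name is illustrative.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.FileSystemCounter;
import org.apache.hadoop.mapreduce.Job;

public class FsCounterReport {
    // Print the four counters discussed above for a completed job.
    public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();
        for (String scheme : new String[] {"FILE", "HDFS"}) {
            long read = counters.findCounter(scheme, FileSystemCounter.BYTES_READ).getValue();
            long written = counters.findCounter(scheme, FileSystemCounter.BYTES_WRITTEN).getValue();
            System.out.printf("%s_BYTES_READ=%,d  %s_BYTES_WRITTEN=%,d%n",
                              scheme, read, scheme, written);
        }
    }
}
```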

2. Comparison of compression in three places

1) No compression
Counter             Map            Reduce         Total
FILE_BYTES_READ     0              4,579,057,545  4,579,057,545
FILE_BYTES_WRITTEN  6,645,450,502  6,110,995,198  12,756,445,700
HDFS_BYTES_READ     6,616,478,456  0              6,616,478,456
HDFS_BYTES_WRITTEN  0              4,584,208,671  4,584,208,671
2) Only compressing input
Counter             Map             Reduce         Total
FILE_BYTES_READ     11,989,663,380  6,645,432,358  18,635,095,738
FILE_BYTES_WRITTEN  18,633,824,537  6,645,432,358  25,279,256,895
HDFS_BYTES_READ     1,049,256,588   0              1,049,256,588
HDFS_BYTES_WRITTEN  0               6,653,296,922  6,653,296,922
We can see that HDFS_BYTES_READ is significantly reduced, from 6,616,478,456 bytes in the baseline to 1,049,256,588: the mappers read far fewer bytes from HDFS because the input is compressed. Note, however, that the local file system counters grow substantially in this run, so the savings are confined to the HDFS read path.
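
As a side note, compressed input usually requires no job-level switch: Hadoop's input formats ask CompressionCodecFactory to map the file extension (.gz, .lzo, ...) to a codec and decompress transparently. A minimal sketch, assuming Hadoop 2.x and a hypothetical input path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecProbe {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // GzipCodec is returned for ".gz"; null means the file is read as plain text.
        CompressionCodec codec = factory.getCodec(new Path("/data/logs/part-0000.gz"));
        System.out.println(codec == null ? "no codec" : codec.getClass().getName());
    }
}
```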
3) Only compressing intermediate map output
Counter             Map            Reduce         Total
FILE_BYTES_READ     0              1,020,775,438  1,020,775,438
FILE_BYTES_WRITTEN  990,084,052    1,020,775,438  2,010,859,490
HDFS_BYTES_READ     6,616,478,456  0              6,616,478,456
HDFS_BYTES_WRITTEN  0              6,653,296,922  6,653,296,922
We can see that FILE_BYTES_READ and FILE_BYTES_WRITTEN are significantly reduced. This means that the intermediate data spilled to, and shuffled between, the nodes' local file systems is significantly reduced.
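
Intermediate compression is a pure configuration change. A minimal sketch of the driver-side setup, assuming the Hadoop 2.x property names (Hadoop 1.x used mapred.compress.map.output); the choice of SnappyCodec is illustrative, since the post does not say which codec was used:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Compress only the intermediate map output, not the input or final output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}
```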
4) Only compressing output
Counter             Map            Reduce         Total
FILE_BYTES_READ     0              6,645,432,490  6,645,432,490
FILE_BYTES_WRITTEN  6,645,450,502  6,645,432,490  13,290,882,992
HDFS_BYTES_READ     6,616,478,456  0              6,616,478,456
HDFS_BYTES_WRITTEN  0              997,479,368    997,479,368
We can see that HDFS_BYTES_WRITTEN is significantly reduced, i.e., the final output written to HDFS is much smaller.
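
Output compression can be enabled through the standard FileOutputFormat helpers. A minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API; the job name and the gzip codec are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputCompression {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");
        // Compress only the final job output written to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}
```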

3. Comparison of different compression formats: gzip, LZO, Snappy

Compression  File      Size (GB)  Compression time (s)  Decompression time (s)
None         logs      8.0        -                     -
Gzip         logs.gz   1.3        241                   72
LZO          logs.lzo  2.0        55                    35
Snappy       -         4.2        40                    27
As we can see, the LZO file is slightly larger than the corresponding gzip file, but both are much smaller than the original uncompressed file. Additionally, LZO compressed nearly five times faster and decompressed over two times faster.

We can also see that the Snappy file is larger than the corresponding LZO file, but still about half the size of the original. In addition, Snappy compresses and decompresses even faster than LZO. In sum, Snappy wins on compression and decompression speed but is less efficient in terms of compression ratio.
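
For anyone who wants to reproduce these timings, below is a rough sketch using Hadoop's codec API. Only gzip is shown because GzipCodec ships with Hadoop, whereas SnappyCodec needs the native snappy library and LZO needs the separate hadoop-lzo add-on; the file names match the table above.

```java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressTimer {
    public static void main(String[] args) throws IOException {
        CompressionCodec codec =
                ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
        long start = System.nanoTime();
        try (InputStream in = new BufferedInputStream(new FileInputStream("logs"));
             OutputStream out = codec.createOutputStream(
                     new BufferedOutputStream(new FileOutputStream("logs.gz")))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // bytes are compressed as they stream through
            }
        }
        System.out.printf("compression took %.1f s%n", (System.nanoTime() - start) / 1e9);
    }
}
```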