1. Filesystem counters
Filesystem counters are used to analyze experimental results. The following are the typical built-in filesystem counters.
| File system | Read counter | Write counter |
| --- | --- | --- |
| Local file system | FILE_BYTES_READ | FILE_BYTES_WRITTEN |
| HDFS | HDFS_BYTES_READ | HDFS_BYTES_WRITTEN |
FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all the map input comes from HDFS, FILE_BYTES_READ should be zero in the map phase. The reducers, on the other hand, read their input from reduce-side local disks, where it was fetched from the map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by reducers.
FILE_BYTES_WRITTEN consists of two parts. The first part comes from mappers: every mapper spills intermediate output to disk, and all the bytes mappers write to disk are included in FILE_BYTES_WRITTEN. The second part comes from reducers: in the shuffle phase, reducers fetch intermediate data from mappers, then merge and spill it to reducer-side disks. All the bytes reducers write to disk are also included in FILE_BYTES_WRITTEN.
HDFS_BYTES_READ denotes the bytes mappers read from HDFS when the job starts. It includes not only the contents of the source files but also metadata about the splits.
HDFS_BYTES_WRITTEN denotes the bytes written to HDFS, i.e., the size of the final output.
Note that since HDFS and the local file system are different file systems, the data from the two will never overlap.
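As a quick sanity check: the Total column for each counter is just Map + Reduce, and because the local and HDFS counters are disjoint, per-filesystem traffic can be summed without double counting. A minimal sketch, using the figures from the no-compression run below:

```python
# Per-phase counter values (bytes) from the no-compression run.
counters = {
    "FILE_BYTES_READ":    {"map": 0,             "reduce": 4_579_057_545},
    "FILE_BYTES_WRITTEN": {"map": 6_645_450_502, "reduce": 6_110_995_198},
    "HDFS_BYTES_READ":    {"map": 6_616_478_456, "reduce": 0},
    "HDFS_BYTES_WRITTEN": {"map": 0,             "reduce": 4_584_208_671},
}

# The Total column is simply map + reduce for each counter.
totals = {name: v["map"] + v["reduce"] for name, v in counters.items()}

# Local and HDFS traffic are tracked by disjoint counters, so summing
# per filesystem never counts the same byte twice.
local_io = totals["FILE_BYTES_READ"] + totals["FILE_BYTES_WRITTEN"]
hdfs_io = totals["HDFS_BYTES_READ"] + totals["HDFS_BYTES_WRITTEN"]
```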
2. Comparison of compression in three places
1) No compression
2) Only compressing input
We can see that HDFS_BYTES_READ is significantly reduced, i.e., the total bytes mappers read from HDFS drops substantially.
3) Only compressing intermediate map output
We can see that FILE_BYTES_READ and FILE_BYTES_WRITTEN are significantly reduced, which means data transfer between the nodes' local file systems drops substantially.
4) Only compressing output
We can see that HDFS_BYTES_WRITTEN is significantly reduced, i.e., the final output written to HDFS is much smaller.
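For reference, these three places map onto standard Hadoop job configuration properties. A sketch assuming Hadoop 2.x property names; the codec choices here are illustrative. Input compression (place 1) needs no property at all: the input codec is inferred from the file extension (e.g. .gz, .lzo).

```xml
<!-- mapred-site.xml (or per-job -D flags); Hadoop 2.x property names -->

<!-- Place 2: compress intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Place 3: compress the final job output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```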
1) No compression
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 0 | 4,579,057,545 | 4,579,057,545 |
| FILE_BYTES_WRITTEN | 6,645,450,502 | 6,110,995,198 | 12,756,445,700 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 4,584,208,671 | 4,584,208,671 |
2) Only compressing input
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 11,989,663,380 | 6,645,432,358 | 18,635,095,738 |
| FILE_BYTES_WRITTEN | 18,633,824,537 | 6,645,432,358 | 25,279,256,895 |
| HDFS_BYTES_READ | 1,049,256,588 | 0 | 1,049,256,588 |
| HDFS_BYTES_WRITTEN | 0 | 6,653,296,922 | 6,653,296,922 |
3) Only compressing intermediate map output
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 0 | 1,020,775,438 | 1,020,775,438 |
| FILE_BYTES_WRITTEN | 990,084,052 | 1,020,775,438 | 2,010,859,490 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 6,653,296,922 | 6,653,296,922 |
4) Only compressing output
| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| FILE_BYTES_READ | 0 | 6,645,432,490 | 6,645,432,490 |
| FILE_BYTES_WRITTEN | 6,645,450,502 | 6,645,432,490 | 13,290,882,992 |
| HDFS_BYTES_READ | 6,616,478,456 | 0 | 6,616,478,456 |
| HDFS_BYTES_WRITTEN | 0 | 997,479,368 | 997,479,368 |
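The pattern can be summarized numerically. A small sketch, using the Total figures from these tables, showing how far the targeted counter shrinks relative to the no-compression baseline:

```python
# Total counter values (bytes) taken from the tables above;
# baseline is the no-compression run.
baseline = {
    "HDFS_BYTES_READ":    6_616_478_456,
    "FILE_BYTES_WRITTEN": 12_756_445_700,
    "HDFS_BYTES_WRITTEN": 4_584_208_671,
}

# The counter each compression place is expected to shrink,
# and its value in the corresponding compressed run.
compressed = {
    "input":      ("HDFS_BYTES_READ",    1_049_256_588),
    "map output": ("FILE_BYTES_WRITTEN", 2_010_859_490),
    "output":     ("HDFS_BYTES_WRITTEN",   997_479_368),
}

for place, (counter, value) in compressed.items():
    ratio = value / baseline[counter]
    print(f"compressing {place}: {counter} shrinks to {ratio:.0%} of baseline")
```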
3. Comparison of different compression formats: gzip, LZO, Snappy
| Compression | File | Size (GB) | Compression time (s) | Decompression time (s) |
| --- | --- | --- | --- | --- |
| None | logs | 8.0 | - | - |
| Gzip | logs.gz | 1.3 | 241 | 72 |
| LZO | logs.lzo | 2.0 | 55 | 35 |
| Snappy | - | 4.2 | 40 | 27 |
We can also see that the Snappy file is larger than the corresponding LZO file, but still about half the size of the original. In addition, Snappy compresses and decompresses even faster than LZO. In sum, Snappy is faster in compression and decompression time but less efficient in terms of compression ratio.
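To make the trade-off concrete, the ratios and throughputs implied by the table can be computed directly. A sketch using the table's figures; throughput here means megabytes of original input processed per second:

```python
# Size (GB) and timing (s) figures taken from the table above.
ORIGINAL_GB = 8.0
formats = {
    "gzip":   {"size_gb": 1.3, "compress_s": 241, "decompress_s": 72},
    "lzo":    {"size_gb": 2.0, "compress_s": 55,  "decompress_s": 35},
    "snappy": {"size_gb": 4.2, "compress_s": 40,  "decompress_s": 27},
}

for name, f in formats.items():
    ratio = ORIGINAL_GB / f["size_gb"]             # higher = smaller output
    speed = ORIGINAL_GB * 1024 / f["compress_s"]   # MB of input per second
    print(f"{name}: ratio {ratio:.1f}x, compresses ~{speed:.0f} MB/s")
```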