Hive performance optimization is a large topic in its own right and is very
specific to the queries you are running. In fact, each query in a query file
needs separate performance tuning to get the best results.
I'll try to list a few general approaches used for performance optimization.
Limit the data flow down the queries
When you run a Hive query, the volume of data that flows down each level is
the factor that decides performance. So if you are executing a script that
contains a sequence of HiveQL statements, make sure that the data filtering
happens in the first few stages rather than carrying unwanted data to the
bottom. This can give a significant performance gain, as the queries further
down will have much less data to crunch.
This is a common bottleneck when existing SQL jobs are ported to Hive: we just
try to execute the same sequence of SQL steps in Hive as well, which becomes a
bottleneck on performance. Understand the requirement or the existing SQL
script and design your Hive job with the data flow in mind.
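As a rough illustration of filtering early, the sketch below reduces the data in the first stage so that later stages only touch what they need. The table and column names (orders, recent_orders, order_date and so on) are hypothetical and used only for this example.

```sql
-- Hypothetical tables and columns, for illustration only.

-- Stage 1: filter rows and project only the needed columns up front.
CREATE TABLE recent_orders AS
SELECT order_id, customer_id, amount
FROM orders
WHERE order_date >= '2012-01-01';

-- Stage 2: later queries only see the reduced data set.
SELECT customer_id, SUM(amount) AS total_amount
FROM recent_orders
GROUP BY customer_id;
```

The same idea applies inside a single query: push the WHERE conditions and column selection into the earliest subquery rather than applying them at the end.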
Use hive merge files
Hive queries are compiled into map-only and MapReduce jobs. A Hive script will
usually contain lots of queries. Assume one of your queries is compiled to a
MapReduce job and the output files from that job are very small, say 10 MB. In
such a case the subsequent query that consumes this data may generate a larger
number of map tasks and would be inefficient. If more jobs run on the same
data set, all of them become inefficient. In such scenarios, if you enable
merge files in Hive, the first query runs a merge job at the end, thereby
merging the small files into larger ones. This is controlled using the
following parameters:
hive.merge.mapredfiles=true
hive.merge.mapfiles=true (true by default in hive)
For more control over merge files you can tweak these properties as well:
hive.merge.size.per.task (the max final size of a file after the merge task)
hive.merge.smallfiles.avgsize (the merge job is triggered only if the average
output file size is less than the specified value)
The default values for the above properties are
hive.merge.size.per.task=256000000
hive.merge.smallfiles.avgsize=16000000
When you enable merge, an extra map-only job is triggered. Whether this job gives you an optimization or an overhead depends entirely on your use case and your queries.
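Put together, the merge settings could be placed at the top of a Hive script roughly as follows. This is only a sketch: the size values are simply the defaults quoted above, not tuned recommendations.

```sql
-- Merge small output files of map-only jobs (already the default).
SET hive.merge.mapfiles=true;
-- Also merge small output files of map-reduce jobs.
SET hive.merge.mapredfiles=true;

-- Optional tuning; these are the default values, shown for reference.
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;
```

Settings made with SET apply to the current session only, so they can be enabled just for the scripts that actually produce many small files.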
Join Optimizations
Joins are very expensive. Avoid them if possible.
If a join is required, try to use join optimizations such as map joins,
bucketed map joins, etc.
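A minimal sketch of how a map join might be requested is shown below. The tables big_fact and small_dim and their columns are hypothetical. Depending on the Hive version, the explicit hint may be ignored in favour of automatic conversion, so treat this as an assumption to check against your installation.

```sql
-- Let Hive convert a join to a map join when one side is small enough.
SET hive.auto.convert.join=true;

-- For bucketed map joins, both tables must be bucketed on the join key.
SET hive.optimize.bucketmapjoin=true;

-- Explicit hint form: the small table (alias d) is loaded into memory
-- on each mapper, so no reduce phase is needed for the join.
SELECT /*+ MAPJOIN(d) */ f.id, f.amount, d.name
FROM big_fact f
JOIN small_dim d ON (f.dim_id = d.id);
```

A map join only pays off when the smaller table comfortably fits in memory. Otherwise a regular reduce-side join is the safer choice.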