The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL syntax. To learn how to use Hive, start with the official Apache Hive documentation.
In this post I will cover techniques you can use to tune Hive SQL performance.
Hive Architecture Overview
It is important to understand the Hive architecture if you want to tune overall Hive performance, so I will begin with a brief description of the architecture.
As shown in the reference diagram above, the Hive architecture can be broken down into the following components:
- Hive Client: This is the interface used to submit queries to Hive. Some of the clients are the Hive CLI, Beeline, the Hue Hive editor, and PyHive.
- Driver/Compiler: This component maintains the Hive session while a query executes. The submitted SQL query is compiled here after consulting the metastore, and the query plan for execution is generated.
- Execution Engine: The engine converts the execution plan into a physical plan and sends the tasks to the compute nodes.
- Metastore: This is the database that holds all the metadata for the Hive data warehouse. It contains all the schema information for tables, databases, functions, etc.
- Hadoop: This is the layer that holds the actual data/files, e.g., in HDFS or AWS S3. It also incorporates the resource manager, such as YARN or an AWS EMR master node.
Right Hive Client
Hive became a top-level Apache project in 2010 and has evolved considerably since then. It is important to choose the right client in order to take advantage of recent advancements. The most important point to remember is:
HiveServer2 (HS2) is the successor to HiveServer1, which has been deprecated. You should use a client that talks to HS2: it supports multi-client concurrency and authentication, and it also provides better support for JDBC and ODBC clients. Some of the HS2 clients you can use are:
- PyHive: https://github.com/dropbox/PyHive
- Oracle SQL Developer: https://community.cloudera.com/t5/Community-Articles/Connect-Oracle-SQL-Developer-to-Hive/ta-p/244491
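As a quick illustration, submitting a query through PyHive against HiveServer2 might look like the sketch below. This is not runnable as-is: the host, port, username, and table name are placeholders, and it assumes a HiveServer2 instance is reachable on your network.

```python
from pyhive import hive  # pip install pyhive

# Connect to a running HiveServer2 instance (host/port/user are placeholders)
conn = hive.Connection(host="hs2.example.com", port=10000, username="etl_user")
cursor = conn.cursor()

# Run a query through HS2 and fetch the results
cursor.execute("SELECT col1, col2 FROM my_table LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```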
Right Execution Engine
- MapReduce: By default, Hive queries run on the MapReduce execution engine. MapReduce is a framework that divides a job into tasks that can execute in parallel. It works reasonably well; however, there are a few settings you will want to set to take maximum advantage of MapReduce execution:
- hive.execution.engine=mr : This sets the execution engine to MapReduce.
- hive.exec.parallel=true : This defaults to false ¯\_(ツ)_/¯. Setting it to true allows tasks that can run in parallel to do so. For example, jobs processing different source tables before a join will run in parallel.
- hive.exec.parallel.thread.number=8 : You can tune this value according to your job and available resources.
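Put together, these MapReduce-related settings can be applied per session, for example:

```sql
-- Run this session on the MapReduce engine
SET hive.execution.engine=mr;

-- Allow independent stages of a query to run concurrently
SET hive.exec.parallel=true;

-- Cap the number of stages that may run in parallel (tune per cluster)
SET hive.exec.parallel.thread.number=8;
```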
- Spark/Tez: According to performance benchmarks, using Spark or Tez as the execution engine can give roughly 100% better performance than MapReduce. Which of the two is better, you may ask? Honestly, I don't know. Hortonworks would claim Tez; Databricks might claim Spark. There is a little politics between the projects. I prefer Spark, as it is a more developer-friendly framework and is widely adopted and supported. Tez was purpose-built to run on Hadoop YARN, so it has better integration with Hadoop-based projects; Spark, on the other hand, can even run standalone. Keep in mind that both frameworks run queries in memory, so they will compete for the available resources. The properties below help in setting up the engine:
- hive.execution.engine=spark [OR] hive.execution.engine=tez : This selects the execution engine for the session.
- hive.prewarm.enabled=true : This prewarms the containers and allows quicker execution.
- hive.spark.client.connect.timeout : If this is not set properly, you may experience intermittent issues when running Hive on Spark. The client disconnects if the application master does not connect back to the Hive client within the configured time. This is critical especially when the cluster is busy.
- hive.spark.client.server.connect.timeout : Similar to the above property, this controls the handshake timeout between the client and Spark. Again, if not configured correctly, it can cause abrupt timeout issues.
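For example, a session configured for Hive on Spark might look like the following. The timeout values here are illustrative only; tune them for your cluster.

```sql
-- Switch the execution engine for this session (use tez for Hive on Tez)
SET hive.execution.engine=spark;

-- Prewarm containers so tasks start faster
SET hive.prewarm.enabled=true;

-- Illustrative timeout values; increase them on busy clusters
SET hive.spark.client.connect.timeout=30000ms;
SET hive.spark.client.server.connect.timeout=300000ms;
```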
Leave a comment below if you have experienced a performance improvement by switching from MR to Spark or Tez.
Right Data Structure
Hive query performance depends on how the underlying data is stored and how much Hive knows about that data. Some useful tips:
Choosing the right file format is important for faster retrieval of the data needed by Hive jobs. There are several file formats to choose from: TextFile (the default), SequenceFile, Avro, ORC, and Parquet. Each of these formats has its own characteristics. TextFiles are simple and the most commonly found in Hadoop. SequenceFile is a binary format, so files are generally smaller than TextFiles; SequenceFiles, however, are not a good fit for Hive SQL types. ORC and Parquet are columnar formats, which are the hottest file formats at the moment: instead of laying rows of data together, data for each column is placed together.
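For instance, a table can be created in a columnar format simply by naming the format in the DDL. The table and column names below are made up for illustration.

```sql
-- Store the table as ORC, a columnar format well suited to analytic scans
CREATE TABLE page_views_orc (
  user_id   BIGINT,
  page_url  STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

-- Parquet works the same way: STORED AS PARQUET
```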
Compression is also an important factor when data is saved to disk. Splittable compressed data, such as bzip2-compressed files, can be processed in parallel. However, a non-splittable format like snappy or gzip needs to be loaded into a container as a whole and processed there, which may cause OOM exceptions.
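As a sketch, compressed output with the splittable bzip2 codec can be enabled per session as shown below. These are the standard Hadoop/Hive properties, but verify the codec choice against your own workload, since bzip2 trades CPU for compression ratio.

```sql
-- Compress the final job output
SET hive.exec.compress.output=true;

-- Use the splittable bzip2 codec so downstream jobs can read in parallel
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
```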
A large Hive table is difficult to query, simply because a query needs to read all of the data behind the table. To avoid this, the table can be partitioned into sub-directories. It is important to choose a partition key with low cardinality; otherwise you end up with many very small partitions, making it more resource-consuming to read the data. You can also have multiple levels of partitions in a table.
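A partitioned table, together with the settings commonly needed to insert into it dynamically, might look like the sketch below. The table and column names are illustrative.

```sql
-- Each distinct (country, dt) pair becomes its own sub-directory on disk
CREATE TABLE page_views (
  user_id  BIGINT,
  page_url STRING
)
PARTITIONED BY (country STRING, dt STRING)
STORED AS ORC;

-- Allow Hive to create partitions from the values being inserted
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE page_views PARTITION (country, dt)
SELECT user_id, page_url, country, dt FROM staging_page_views;
```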
Bucketing is another way Hive can distribute data. Joins can become quicker if the join columns and the bucketing columns are the same. To enable bucketing, set hive.enforce.bucketing=true in the Hive config (on Hive 2.0 and later, this behavior is always on):
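A bucketed table might be declared as follows. The names and the bucket count are illustrative; pick a bucket count that suits your data volume.

```sql
-- Rows are assigned to one of 8 files per partition by hash(user_id) % 8
CREATE TABLE page_views_bucketed (
  user_id  BIGINT,
  page_url STRING
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;

-- Required on Hive versions before 2.0 so inserts honor the bucket definition
SET hive.enforce.bucketing=true;
```

If two tables are bucketed the same way on the join key, Hive can join them bucket by bucket instead of shuffling the full tables.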
Vectorization and CBO
Vectorization allows processing a batch of rows instead of a single row of data at a time. This greatly reduces CPU usage for queries involving scanning, filtering, joins, etc. You can find the detailed design in the Hive wiki.
Please note that, at the time of writing, Parquet is not fully supported with vectorization; it is still a work in progress (https://issues.apache.org/jira/browse/HIVE-14826). The ORC file format is recommended for now.
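Vectorized execution is controlled by session properties; a minimal setup is:

```sql
-- Process rows in batches (the default batch size is 1024 rows)
SET hive.vectorized.execution.enabled=true;

-- Optionally vectorize the reduce side as well
SET hive.vectorized.execution.reduce.enabled=true;
```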
Cost-Based Optimization (CBO), if enabled, modifies the execution plan based on the cost it computes for the query plan. It can perform join reordering and join algorithm selection based on the query.
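Enabling CBO and feeding it statistics typically looks like the following; the table name is a placeholder.

```sql
-- Turn on the cost-based optimizer
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- CBO is only as good as its statistics, so gather them
ANALYZE TABLE page_views COMPUTE STATISTICS;
ANALYZE TABLE page_views COMPUTE STATISTICS FOR COLUMNS;
```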
There may be scenarios/use cases where multiple processes want to insert data into a Hive table. However, when data is inserted into a Hive table, the inserting process acquires an exclusive lock on the table. This causes the other processes to wait until the lock holder finishes the insertion and releases the lock.
You can facilitate parallel insertion by writing records to different partitions. Since an inserting process locks only the partition it is writing to, the other processes can write their data to their own partitions concurrently.
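For example, two processes can safely write at the same time as long as they target different partitions of the same table. The table and staging names below are illustrative.

```sql
-- Process A locks only partition dt='2020-01-01'
INSERT INTO TABLE page_views PARTITION (dt='2020-01-01')
SELECT user_id, page_url FROM staging_a;

-- Process B, running concurrently, locks only dt='2020-01-02'
INSERT INTO TABLE page_views PARTITION (dt='2020-01-02')
SELECT user_id, page_url FROM staging_b;
```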
To read more on locking follow the link here: https://cwiki.apache.org/confluence/display/Hive/Locking
I hope you find these tips helpful. Please leave a comment below if you follow some other practice that I may have missed, and let me know if you need more details on any tip I have outlined above. Thanks for reading.