Timestamp values in parquet files are saved as int96 values by data processing frameworks like Hive and Impala. However, there is a subtle difference between the way Hive and Impala write and read these int96 values, and this difference can cause inconsistencies and issues when managing timestamp data. In this post we will do a deep dive into this difference and also look at how int96 values are converted to a date and time.
Int96 to Date Time
It's important to understand how an int96 value is read back as a date and time. So let's walk through an example of converting an int96 value as stored in a parquet file by big data tools like Hive, Spark and Impala.
Step 1: Convert to Hex value
-17158328637768917883033344768 = -0x3771096B43B3FFFF2679DB00
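Step 1 can be checked directly with Python's built-in hex() (which prints the digits in lowercase):

```python
value = -17158328637768917883033344768
print(hex(value))  # -0x3771096b43b3ffff2679db00
```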
Step 2: Get the 2's complement – Note this is only required for negative values like in this example. If the value is positive, skip to step 3.
-0x3771096B43B3FFFF2679DB00 as an unsigned 96-bit two's complement value = 2^96 - 0x3771096B43B3FFFF2679DB00 = 0xC88EF694BC4C0000D9862500
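In Python, the two's complement of step 2 falls out of a single modulo operation, since reducing a negative integer modulo 2^96 gives its unsigned 96-bit representation:

```python
value = -17158328637768917883033344768
# Reduce modulo 2**96 to get the unsigned 96-bit (two's complement) form
twos_complement = value % 2**96
print(hex(twos_complement))  # 0xc88ef694bc4c0000d9862500
```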
Step 3: Separate the date and time part from the hex value
The 12 bytes of the int96 value are split so that the last 4 bytes are the date and the first 8 bytes are the time value. For this example, 0xC88EF694BC4C0000D9862500: Date part = D9 86 25 00, Time part = C8 8E F6 94 BC 4C 00 00
Step 4: Calculate date
Date part = D9 86 25 00. Reverse the bytes (the value is stored little-endian) = 00 25 86 D9. 0x002586D9 = 2459353 in decimal. 2459353 is a Julian Day Number.
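The Julian day can be turned into a calendar date with Python's datetime module, using the well-known fact that Julian Day Number 2440588 is 1970-01-01 (the Unix epoch):

```python
from datetime import date, timedelta

julian_day = 0x002586D9  # 2459353
# Julian Day Number 2440588 corresponds to 1970-01-01 (Unix epoch)
d = date(1970, 1, 1) + timedelta(days=julian_day - 2440588)
print(d)  # 2021-05-18
```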
Step 5: Calculate time
Time part = C8 8E F6 94 BC 4C 00 00. Reverse the bytes = 00 00 4C BC 94 F6 8E C8. Hexadecimal to decimal = 84372836749000. This is the time of day in nanoseconds.
84372836749000 nanoseconds translates to 23:26:12 (plus a fraction of a second) in 24HR time.
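A few divmod calls turn the nanosecond count from step 5 into a clock time:

```python
nanos = 84372836749000  # time of day in nanoseconds (from step 5)
seconds, frac_ns = divmod(nanos, 10**9)
hours, rem = divmod(seconds, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours:02d}:{minutes:02d}:{seconds:02d}")  # 23:26:12
```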
Step 6: Club the date and time
The example int96 value -17158328637768917883033344768 = 18th May 2021, 23:26:12
These conversion steps are important because they let you look at any parquet file and work out the human-readable date and time a value depicts. You can then compare the values stored by data processing tools like Spark/Hive with those stored by Impala.
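The six steps above can be collected into one small Python helper (the name int96_to_datetime is my own, not from any library):

```python
from datetime import datetime, timedelta

def int96_to_datetime(value: int) -> datetime:
    # Steps 1-2: interpret the decimal as an unsigned 96-bit (two's complement) value
    raw = (value & (2**96 - 1)).to_bytes(12, byteorder="big")
    # Step 3: first 8 bytes = time of day, last 4 bytes = Julian day (both little-endian)
    nanos = int.from_bytes(raw[:8], byteorder="little")
    julian_day = int.from_bytes(raw[8:], byteorder="little")
    # Steps 4-6: Julian Day Number 2440588 is 1970-01-01 (Unix epoch);
    # timedelta then adds the time of day (truncated to microseconds)
    return datetime(1970, 1, 1) + timedelta(days=julian_day - 2440588,
                                            microseconds=nanos // 1000)

print(int96_to_datetime(-17158328637768917883033344768))
# 2021-05-18 23:26:12.836749
```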
Spark and Hive Int96
Spark and Hive normalize timestamps from the local timezone to UTC, and this normalized value is written to the parquet file as an int96 value. When Spark and Hive read an int96 timestamp, they treat it as a UTC value and adjust it back, so the stored UTC value is shown in the local timezone when queried.
For example: if the current time in the Europe/London timezone (which is GMT+1 during daylight saving) is 19/06/2021 15:00:00 and you use Spark to write the current time to a parquet table like below:
spark.sql("INSERT INTO PARQUET_TABLE VALUES ( now() )")
Inspect the parquet file written above using a command like parquet-tools dump <file_name> and check the int96 value. You will find it actually decodes to 19/06/2021 14:00:00. The timezone information is lost and only the UTC-normalized value is stored.
Now, when you read the same parquet file using Spark, the output will show the correct local time, i.e. 19/06/2021 15:00:00.
Hive behaves similarly to Spark when reading and writing int96 values from and to parquet files.
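To make the Spark/Hive write path concrete, here is a minimal Python sketch of it (the function name datetime_to_int96 is my own, not a Spark API): a local timestamp is normalized to UTC, then packed into the 12 int96 bytes described earlier.

```python
from datetime import datetime, timedelta, timezone

def datetime_to_int96(dt_utc: datetime) -> bytes:
    """Encode a UTC datetime as the 12 raw bytes of a parquet int96 timestamp."""
    # Julian Day Number 2440588 corresponds to 1970-01-01 (Unix epoch)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    julian_day = 2440588 + (dt_utc - epoch).days
    midnight = dt_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    nanos = int((dt_utc - midnight).total_seconds()) * 1_000_000_000
    # First 8 bytes: nanoseconds since midnight; last 4 bytes: Julian day (both little-endian)
    return nanos.to_bytes(8, "little") + julian_day.to_bytes(4, "little")

# 19/06/2021 15:00:00 in Europe/London (BST = UTC+1) ...
bst = timezone(timedelta(hours=1))
local_time = datetime(2021, 6, 19, 15, 0, 0, tzinfo=bst)
# ... is normalized to UTC (14:00:00) before writing, as Spark/Hive do
utc_time = local_time.astimezone(timezone.utc)
print(datetime_to_int96(utc_time).hex())
```

Decoding these bytes with the earlier steps gives back 14:00:00 UTC, which is exactly why the raw file appears to be "one hour off" from the local time that was inserted.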
Impala stores and retrieves TIMESTAMP values verbatim, with no adjustment for the timezone. Going back to the example in the earlier section: if Impala wrote the current time to the parquet file and you inspected the int96 value, you would find it holds 19/06/2021 15:00:00. So the local time is saved as-is in the parquet file, and when you use Impala to read the file back, the time shown is 19/06/2021 15:00:00.
In other words, Impala makes no adjustment to the time value when reading or writing int96 values in a parquet file.
Due to the difference in the way these tools treat int96 values in parquet files, issues can arise in reporting and data processing. For example, a time value written by Impala as 19/06/2021 23:00:00 is read by Spark as 20/06/2021 00:00:00, because Spark assumes the stored value is UTC and adds the local offset. Vice-versa, if Spark writes 20/06/2021 00:00:00 BST, the parquet file stores 19/06/2021 23:00:00, and Impala will read the value as 19/06/2021 23:00:00.
Data processing applications need to keep this difference in mind, and workarounds are needed in Impala and Spark/Hive applications to offset these time discrepancies. There are also some useful configurations that can be used on both Spark and Impala. These configurations work because the parquet file metadata stores the creator name: a parquet file written by Impala will have creator = impala version <version and build info>, while a file written by Spark will have creator = parquet-mr version <version and build info>. Each tool can therefore distinguish parquet files written by Impala from those written by Spark and, if configured, treat the int96 values differently. Refer to the properties below to work around this issue.
- Spark:
- spark.sql.parquet.int96TimestampConversion=true: Set this config to change the interpretation of TIMESTAMP values read from parquet files that were written by Impala, to match the Impala behavior. After this property is set, Spark will not adjust the time value to the local timezone and will read the time value as-is.
- Impala:
- convert_legacy_hive_parquet_utc_timestamps=true: Set this config to make Impala recognize the parquet data files written by Hive and Spark, and apply the same UTC-to-local-timezone conversion logic during the query as Hive and Spark do.
I hope this article was helpful in understanding the int96 timestamp and the edge cases caused by the incompatibility between different tools. Thanks for taking the time to read it.