INT96 Timestamps
Timestamp values in Parquet files are saved as INT96 values by data processing frameworks like Hive and Impala. But there is a slight difference between the way these INT96 values…
Spark Streaming for Batch Job
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Here I will be talking about and demonstrating structured…
Spark and JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. We can define complex nested…
Spark AI Summit Europe – The Experience
I attended the Spark AI Summit 2018 in London from 3rd – 4th October. Well, it was in London and I couldn’t have missed the opportunity to know/meet the latest…
Monitoring Spark Application with Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit. Its data model is essentially a set of time series identified by key-value pairs. In this blog post, we will use Prometheus to monitor a…
Apache Spark Unit Testing
Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. It is included in almost all Hadoop distributions. Apache Spark is the hottest…