int96 banner

INT96 Timestamps

Timestamp values in parquet files are saved as int96 values by data processing frameworks like Hive and Impala. But there is a slight difference between the way these int96 values…

Read more »
spark-streaming-banner

Spark Streaming for Batch Job

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Here I will be talking about and demonstrating structured…

Read more »
Hive Performance Tuning

Hive Performance Tuning

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL syntax. To know how to use Hive please read https://cwiki.apache.org/confluence/display/Hive/Tutorial…

Read more »

Spark ML Classification

Quoting from wiki, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership…

Read more »

Spark and JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. We can define complex nested…

Read more »

Spark AI Summit Europe – The Experience

I attended the Spark AI summit 2018 in London from 3rd – 04th October. Well it’s was in London and I couldn’t have missed the opportunity to know/meet the latest…

Read more »

Monitoring Spark Application with Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit. It is basically a key value pair time series data model. In this blog post, we will use prometheus to monitor a…

Read more »

Apache Spark Unit Testing

Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. Apache Spark is included in almost all of the Hadoop distributions. Apache Spark is the hottest…

Read more »
global hadoop

Hadoop to explore data

Big data by definition denotes datasets that are so large or complex that traditional data processing application frameworks and software are inadequate to deal with them. Hadoop is the answer…

Read more »