Sanjay Mishra, Author at Devrats Journal

INT96 Timestamps

Sanjay Mishra 20th June 2021

Timestamp values in Parquet files are saved as INT96 values by data processing frameworks like Hive and Impala. But there is a slight difference between the way these INT96 values…

Apache Spark

Spark Streaming for Batch Job

Sanjay Mishra 26th April 2020

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Here I will be talking about and demonstrating structured…

Analytics

Hive Performance Tuning

Sanjay Mishra 6th October 2019

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL syntax. To learn how to use Hive, please read https://cwiki.apache.org/confluence/display/Hive/Tutorial…

Machine Learning

Spark ML Classification

Sanjay Mishra 28th January 2019

Quoting from wiki, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership…

Apache Spark

Spark and JSON

Sanjay Mishra 13th November 2018 4 Comments

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. We can define complex nested…

Apache Spark

Spark AI Summit Europe – The Experience

Sanjay Mishra 7th October 2018

I attended the Spark AI Summit 2018 in London on 3rd – 4th October. Well, it was in London and I couldn’t have missed the opportunity to know/meet the latest…

Apache Spark

Monitoring Spark Application with Prometheus

Sanjay Mishra 11th September 2018 3 Comments

Prometheus is an open-source systems monitoring and alerting toolkit. Its data model is essentially time series identified by key-value pairs. In this blog post, we will use Prometheus to monitor a…

Apache Spark

Apache Spark Unit Testing

Sanjay Mishra 14th June 2018 2 Comments

Apache Spark is a lightning-fast unified analytics engine for big data and machine learning. Apache Spark is included in almost all of the Hadoop distributions. Apache Spark is the hottest…

Analytics

Hadoop to explore data

Sanjay Mishra 18th November 2017 3 Comments

Big data by definition denotes datasets that are so large or complex that traditional data processing application frameworks and software are inadequate to deal with them. Hadoop is the answer…

Author: Sanjay Mishra