I attended the Spark + AI Summit 2018 in London from 3rd to 4th October. Well, it was in London, and I couldn't miss the opportunity to meet the latest and the greatest of the Apache Spark world. It was held at ExCeL London, a massive and fantastic venue — I guess just perfect for such an international event. We all know Apache Spark has a massive community, judging from the GitHub stars, Stack Overflow posts, etc. The same was evident from the hustle and bustle at the registration desk and the keynote auditorium. I grabbed a nice comfy seat in the auditorium, eager to hear from Matei, Ali and company.
The keynotes made absolutely clear what the name of the summit had already indicated: Data + Spark + AI would be the focus throughout. Some notable points from the keynotes were as follows:
- Matei talked about his new pet project, MLflow. MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. Follow the link to read more on MLflow. It will soon be available to customers of the Databricks platform. There were other sessions during the conference with MLflow deep dives.
- Reynold talked about Project Hydrogen. This opening keynote pinpointed the need to enhance Spark to support distributed deep learning frameworks via a new scheduling mode, i.e. gang scheduling. I think Spark wants to be the unified engine for machine learning, just as it is for big data parallel analytical processing. It's a work in progress, so let's wait, watch and contribute. 🙂
There were many other talks during the keynote sessions showcasing how an AI-first approach has helped grow businesses. Shell also had a demo to show how AI has helped improve safety at their pump stations. The demo was cool. I am looking forward to using PyTorch, which was presented by Soumith, the creator of PyTorch.
There were several interesting talks on different topics related to Apache Spark and AI. The sessions were broken down into several streams, like developer, deep dives, machine learning, research and use cases. You could say there were definitely sessions to interest everyone. Well, I didn't follow any stream-specific sessions and went for a mixed bag.
Tech Deep Dives
I attended the tech deep dive sessions on Query Execution and Bucketing by Jacek Laskowski. He explained the internals of Spark SQL execution: the roles of the analyzer and of the logical and physical plans in query execution. Bucketing was quite interesting and great to know about. Bucketing can be applied on top of partitioning to avoid a costly shuffle operation in some scenarios. Leave a comment if you want me to do a detailed post on either of these topics.
There was a nice talk on how to develop your Spark code in smaller pieces and test them individually. I have already done a blog post on unit testing Spark code — click here to read more. In the session, I got to know another library for testing your Spark transformations, spark-testing-base, provided by Holden Karau.
There was a session describing the test approaches taken to ensure the correctness and performance of Spark SQL. Fuzz testing is used to generate test SQL queries, which are run against the Spark master branch and older versions to catch regressions. The demo of using flame graphs to investigate performance degradation was quite cool.
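The fuzzing idea can be sketched in a few lines: generate random but valid SQL, run the same query against two Spark versions, and compare the results. This toy generator is my own illustration, not the actual test harness:

```python
import random

def random_query(table="t", columns=("a", "b", "c"), seed=None):
    """Generate a small random SQL query (toy sketch of fuzz testing)."""
    rng = random.Random(seed)
    cols = rng.sample(list(columns), rng.randint(1, len(columns)))
    query = f"SELECT {', '.join(cols)} FROM {table}"
    if rng.random() < 0.5:
        # Occasionally add a filter to vary the query shape.
        query += f" WHERE {rng.choice(columns)} > {rng.randint(0, 100)}"
    return query

# Seeding makes a failing query reproducible, so the same SQL can be
# replayed against both Spark versions when results diverge.
print(random_query(seed=42))
```

The real harness generates far richer queries (joins, aggregates, subqueries), but the seed-and-replay principle is the same.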
Another interesting framework I came to know of is Apache Spark Serving, an engine for deploying Spark jobs as distributed web services. I am looking forward to diving deeper into the framework. If I manage to get a good hold of it, I will let you all know. ;)
There was a very nice session by Brooke Wenig and Jules Damji quickly comparing the three hot deep learning frameworks: TensorFlow, Keras and PyTorch. Keras sounded like the winner for a Python-loving deep learning newbie. :) There was also an insightful session on how machine learning can be used to measure the effectiveness of your social media posts.
There were a few sessions on Apache Spark on Kubernetes (K8s). Running Spark on Kubernetes is still experimental, but some teams are already using it successfully. Developing Spark libraries was another good session. Best practices, like the impact of caching and broadcasting data while building libraries, were put forward with some nice examples — especially relevant since we have no idea how end users may actually use the libraries. Lastly, I attended the session on Spark lineage. Spline is a project for data lineage tracking and visualization of Apache Spark jobs. I also had a quick chat with Marek on the future work planned for Spline and discussed some existing issues with Atlas integration and table saving.
And that was a wrap! It was overall a very unique experience. The next summit in Europe will be in Amsterdam on 15th Oct. I will be there and will catch you if you plan to have a couple of Spark + AI days. 😀