This blog has moved

Thanks for visiting this blog. rohitsm.com has moved to GitHub Pages. All the content hosted here will remain accessible at the URL oldblog.rohitsm.com.

Thursday, July 4, 2019

Serializing Spark dataframes to Avro using KafkaAvroSerializer

I recently worked on a project that used Apache Spark Structured Streaming, Confluent Schema Registry and Apache Kafka. Due to some versioning constraints between the various components, I had to write a custom implementation of the KafkaAvroSerializer class to serialize Spark DataFrames into Avro format. The serialized data was then published to Kafka. This post is based on the examples in the Confluent documentation here.

In newer versions of Confluent Schema Registry, a lot of the implementation detailed below has been simplified and is much easier to use. The standard recommended usage of the Confluent KafkaAvroSerializer is fairly simple. You set it as one of the Kafka properties used when initializing a KafkaProducer:

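import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

val props = new Properties()
props.put(...)
...
...
// Use the Confluent Avro serializer for record values
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[io.confluent.kafka.serializers.KafkaAvroSerializer])
val producer = new KafkaProducer[AnyRef, AnyRef](props)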

This abstracts away many of the implementation details. When the object to be published to Kafka is sent using the KafkaProducer, the KafkaAvroSerializer internally does the following:
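One part of what the serializer produces is the Confluent wire format: a single magic byte (0), a 4-byte big-endian schema ID, and then the Avro-encoded payload. The snippet below is a minimal sketch of that framing only, not the serializer's actual code. `ConfluentFraming`, `frame` and `unframe` are hypothetical names chosen for illustration, and the sketch assumes the schema ID has already been obtained from Schema Registry and the record has already been Avro-encoded into bytes.

```scala
import java.nio.ByteBuffer

// Hypothetical helper illustrating the Confluent wire format framing.
object ConfluentFraming {
  // Magic byte that prefixes every Confluent-framed message
  val MagicByte: Byte = 0x0

  // Prepend the magic byte and the 4-byte schema ID to an Avro-encoded payload.
  // ByteBuffer writes integers in big-endian order by default.
  def frame(schemaId: Int, avroPayload: Array[Byte]): Array[Byte] = {
    val buffer = ByteBuffer.allocate(1 + 4 + avroPayload.length)
    buffer.put(MagicByte)
    buffer.putInt(schemaId)
    buffer.put(avroPayload)
    buffer.array()
  }

  // Split a framed message back into its schema ID and Avro payload.
  def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
    val buffer = ByteBuffer.wrap(message)
    require(buffer.get() == MagicByte, "Unknown magic byte")
    val schemaId = buffer.getInt()
    val payload = new Array[Byte](buffer.remaining())
    buffer.get(payload)
    (schemaId, payload)
  }
}
```

For example, `ConfluentFraming.frame(42, payload)` returns a 5-byte header followed by the payload, and `unframe` recovers the schema ID so a consumer can look up the matching schema before decoding.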

Thursday, January 17, 2019

My 2018 Reading Challenge



Towards the end of 2017, I found myself falling behind on my habit of reading books. It wasn't that I was reading less than usual or not reading at all. It was just that I spent most of my time reading newspapers and magazines, mostly the latter, which I enjoyed very much. As a result, when it came to reading books, I did not have much to be happy about.

It was during this time that I came across a friend using Goodreads to keep track of his "to read" list and his progress through it over the course of the year. I decided to follow suit, joined the Goodreads 2018 Reading Challenge and set myself a target of 20 books. During the challenge, I came across a wide variety of books across various genres, from inspiring memoirs, spellbinding narratives and political thrillers to some that were slow and painful to get through and had to be dropped halfway. By the end of the year, I had read a total of 17 books and had a couple of abandoned ones in various stages of progress.

This post, after a rather long time, lists some of my favourite books from last year's reading challenge. In no particular order, they are:

Bad Blood by John Carreyrou


If I had to pick a favourite among the 17 books that I read last year, it would have to be Bad Blood, written by the WSJ journalist (and Pulitzer Prize winner) John Carreyrou, on the rise and fall of the infamous startup Theranos. When I mentioned "spellbinding narratives" above, I had this book in mind. I stumbled across it while browsing Bill Gates' reading recommendations. I was sold on his review and gave the book a try and boy, did I have a hard time putting it down. It was a spectacular read. I won't be able to give this book the review it deserves, so I recommend that you read Bill Gates' review of it here.