Apache Kafka basics part-1

Venkateswaran
3 min read · May 3, 2021


What is the story of Kafka? What is the purpose of Kafka? What kind of problems will it solve? In a nutshell, Apache Kafka is a distributed event streaming platform. Before we dive into Kafka you need to understand what a distributed system is and what a streaming platform is.

The problem with the existing centralized system

Consider an application with several modules that act as data sources. For example, an online shopping application has a transaction-processing module that sends transaction details to a target system, and an invoice module that sends invoice details to the same target. What if the source application gains another module? Can the target handle that much load? One answer is vertical scaling (adding more CPU, memory or storage to the single target machine), but vertical scaling is not a good solution for a huge amount of data, since it is neither cost-effective nor easy to manage at the infrastructure level.

https://www.youtube.com/watch?v=eDaap984M5g&list=PLlBQ_B5-H_xj_123v_xKFAAqEBSY5cIEE&index=1
Thanks to the Data Engineering Minds YouTube channel for this picture

No scalability: because it is a monolithic system, we cannot scale it as we wish.

No fault tolerance: if the system goes down, there is no other place to save data, because there is no master/slave type architecture.

Hard to maintain: if the system goes down, there is no way to migrate its data to some other target.

Low throughput: there is no parallel processing for reads and writes, which reduces throughput considerably.

Distributed Systems

https://www.youtube.com/watch?v=eDaap984M5g&list=PLlBQ_B5-H_xj_123v_xKFAAqEBSY5cIEE&index=1
Thanks to the Data Engineering Minds YouTube channel for this picture

In this case, an application has multiple data processing modules that send data to the target system, and we do “horizontal scaling” by forming a cluster of resources (a group of computers).

Machines can be added on demand, and the cluster can be scaled up or down according to what we need, which makes horizontal scaling a very simple process.

A distributed system is highly fault-tolerant because if one of the nodes fails, other nodes take over its responsibilities.

Because the data is replicated across nodes, the system stays highly available, with little or no downtime when a single node fails. Because reads and writes are spread across nodes, it also has high throughput.
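To make replication and failover concrete, here is a minimal sketch in Python. It is a toy model, not how Kafka or any real cluster is implemented: the `Node` and `Cluster` classes, the node names, and the simple "next node in the ring" replica choice are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member holding an in-memory key/value store."""
    name: str
    alive: bool = True
    store: dict = field(default_factory=dict)


class Cluster:
    """Toy cluster: every write is copied to `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor=2):
        self.nodes = [Node(name) for name in node_names]
        self.replication_factor = replication_factor

    def replicas_for(self, key):
        # Derive a starting node from the key, then take the following nodes.
        start = sum(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replication_factor)]

    def write(self, key, value):
        # Store the value on every live replica for this key.
        for node in self.replicas_for(key):
            if node.alive:
                node.store[key] = value

    def read(self, key):
        # Try each replica in order; skip nodes that are down.
        for node in self.replicas_for(key):
            if node.alive and key in node.store:
                return node.store[key], node.name
        raise KeyError(key)


cluster = Cluster(["node-a", "node-b", "node-c"], replication_factor=2)
cluster.write("order-1", "paid")

value, served_by = cluster.read("order-1")

# Simulate a failure of the node that just served the read.
cluster.replicas_for("order-1")[0].alive = False

# The read still succeeds because a second node holds a copy.
value_after, served_after = cluster.read("order-1")
print(value, served_by, "->", value_after, served_after)
```

With a replication factor of 1 the second read would raise `KeyError`, which is roughly the situation of the centralized system described earlier: one copy of the data, and nothing to fall back on.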

Streaming platform

Thanks to the Data Engineering Minds YouTube channel for this picture

Suppose user data (a database) feeds a system used for analytics (a data warehouse), and the same source also feeds another target, an ML system that predicts user behaviour. Two targets are now connected to the same source, and as more sources and targets are added, every source ends up wired to every target. Ideally, we would have one central place to store all the data: every source would publish its data there, and every target would consume from it. A distributed streaming platform provides that central place for both sources and targets, with less wiring and fewer pipelines. Kafka is well suited to this approach, as it works as both a pub/sub and a queue-based messaging system.
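The difference between pub/sub and queue behaviour is easier to see in a small sketch. The `Topic` class below is a hypothetical in-memory model written for this post, not Kafka's client API; the group names (`warehouse`, `ml`, `billing`) are made up, and real Kafka shares work inside a group through partitions rather than through one shared position as this toy does.

```python
from collections import defaultdict


class Topic:
    """Toy append-only log with one read position per consumer group."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # group name -> next index to read

    def publish(self, message):
        # Sources only ever append to the central log.
        self.log.append(message)

    def poll(self, group):
        """Return the next unread message for this group, or None."""
        position = self.offsets[group]
        if position >= len(self.log):
            return None
        self.offsets[group] = position + 1
        return self.log[position]


topic = Topic()
for event in ["user-1 login", "user-2 login", "user-1 purchase"]:
    topic.publish(event)

# Pub/sub: two different groups each receive every message.
warehouse = [topic.poll("warehouse") for _ in range(3)]
ml = [topic.poll("ml") for _ in range(3)]

# Queue: two workers polling under the same group split the messages.
worker_1 = [topic.poll("billing")]
worker_2 = [topic.poll("billing"), topic.poll("billing")]

print(warehouse == ml)      # both targets saw the full stream
print(worker_1, worker_2)   # each message went to exactly one worker
```

The source publishes once and does not know who is reading. Adding a third target means polling under a new group name, with no change on the source side, which is the "less wiring" benefit described above.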

Apache Kafka is a distributed event streaming platform

Now that we know what a centralized system, a distributed system and a streaming platform are, we can say that Apache Kafka is a distributed system plus a streaming platform.

An overview of Kafka’s life:

  • An open-source distributed streaming and messaging platform, originally developed at LinkedIn.
  • Written in Scala and Java.
  • Clients are available for various languages and platforms, such as C++, Java, Go, and .NET.
  • Typical applications include real-time data pipelines and streaming applications.
  • Used at LinkedIn, Netflix, Twitter, and Apple, among others.

Stay tuned for an upcoming post about Kafka’s architecture.
