Data streams between different apps with Confluent & Apache Kafka

Kafka™ is an open-source distributed streaming platform built on the concept of a transaction log: different processes communicate through messages that are published to and processed by a cluster, the core of the service, running on one or more servers. The cluster stores streams of records written by producers in categories called “topics”, each split into “partitions”. Every record consists of a key, a value and a timestamp. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one or many consumers that subscribe to the data written to it. Furthermore, Kafka clients and servers communicate over a custom binary TCP-based protocol.
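To make these terms concrete, here is a minimal in-memory sketch in Python. It is not Kafka's API: the `Topic`, `Record` and `Consumer` classes are hypothetical names made up for illustration, and `crc32` is only a stand-in for the hash that Kafka's default partitioner actually uses. It shows records carrying a key, a value and a timestamp, a topic split into partitions, and two consumers reading the same data independently.

```python
import time
import zlib
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str
    value: str
    timestamp: float


@dataclass
class Topic:
    name: str
    num_partitions: int = 3
    partitions: list = field(default_factory=list)

    def __post_init__(self):
        # One append-only list (a log) per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def publish(self, key, value):
        # Records with the same key always land in the same partition.
        # Assumption: crc32 stands in for Kafka's real partitioning hash.
        index = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        self.partitions[index].append(Record(key, value, time.time()))
        return index


class Consumer:
    """Each consumer keeps its own read position (offset) per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * topic.num_partitions

    def poll(self):
        records = []
        for i, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[i]:])
            self.offsets[i] = len(log)
        return records


topic = Topic("tweets")
first, second = Consumer(topic), Consumer(topic)
topic.publish("user-1", "hello")
topic.publish("user-2", "world")
topic.publish("user-1", "again")

# Both subscribers see all three records: reading does not remove data.
print(len(first.poll()), len(second.poll()))  # prints "3 3"
# Polling again returns nothing new for that consumer.
print(len(first.poll()))  # prints "0"
```

Keeping the offset on the consumer side rather than deleting records on read is what lets a topic serve zero, one or many subscribers at once.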

Once upon a time…

Written in Scala and Java, Kafka was originally developed as an internal project at LinkedIn to solve the problem of communication between infrastructures with continuous streams of data (see details here).
Kafka's creator Jay Kreps and his co-founders are considered pioneers: they improved the project, saw it adopted as an independent project by the Apache Software Foundation, and recognized the business opportunity in commercializing it. Within a few years Kafka has become fundamental infrastructure for thousands of companies. The importance of this project led Jay Kreps and his team to focus on it by founding Confluent (see details here), a new streaming platform that works with the Kafka ecosystem and improves it by adding open connectors, the REST Proxy service and the Schema Registry. The REST Proxy provides a RESTful interface to a Kafka cluster, while the Schema Registry provides serializers that plug into Kafka clients and handle schema storage and retrieval for Kafka messages sent in the Avro format. All these functions should now be available in Confluent Cloud™: the new platform lets you run Kafka and Confluent services on public cloud servers.
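As a rough illustration of what "schema storage and retrieval" means for a serializer, the Python sketch below frames each message with a small header that points at a stored schema. Several things here are assumptions for the sake of a runnable example: `InMemorySchemaRegistry` is a hypothetical stand-in for the real service, and JSON replaces the Avro binary encoding. The 5-byte header (a zero magic byte followed by a 4-byte big-endian schema ID) follows the wire format Confluent documents for its serializers.

```python
import json
import struct

MAGIC_BYTE = 0


class InMemorySchemaRegistry:
    """Hypothetical stand-in for the Schema Registry service."""

    def __init__(self):
        self._ids = {}
        self._schemas = {}

    def register(self, schema):
        # Registering the same schema twice returns the same ID.
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[canonical] = schema_id
            self._schemas[schema_id] = schema
        return self._ids[canonical]

    def get(self, schema_id):
        return self._schemas[schema_id]


def serialize(registry, schema, record):
    schema_id = registry.register(schema)
    # Assumption: JSON stands in for the Avro binary encoding.
    payload = json.dumps(record).encode("utf-8")
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def deserialize(registry, message):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    schema = registry.get(schema_id)
    return schema, json.loads(message[5:].decode("utf-8"))


registry = InMemorySchemaRegistry()
tweet_schema = {
    "type": "record",
    "name": "Tweet",
    "fields": [{"name": "text", "type": "string"}],
}
message = serialize(registry, tweet_schema, {"text": "hello"})
schema, record = deserialize(registry, message)
print(message[:5], record)
```

The point of the design is that the message itself carries only a schema ID, so producers and consumers can agree on the structure of the data without shipping the full schema with every record.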

Build your streaming platform

In this article I’ll explain how to get up and running with a very simple streaming platform on a single server (standalone mode) using Kafka Connect and two supported open source connectors: the Twitter source connector and the MongoDB sink connector. Together they read data from Twitter, process it and store it in a MongoDB database. Source connectors import data from another system into Kafka, and sink connectors export data from Kafka to another system.
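The source and sink roles described above can be sketched in a few lines of Python. This is not the Kafka Connect API: `FakeTwitterSource`, `FakeMongoSink` and `run_once` are hypothetical names, a plain list stands in for the Kafka topic, and the "processing" step is an arbitrary text normalization chosen only to have something to show.

```python
class FakeTwitterSource:
    """Hypothetical source: hands data from an external system to a topic."""

    def __init__(self, tweets):
        self._tweets = list(tweets)

    def poll(self):
        # Return everything fetched since the last call.
        batch, self._tweets = self._tweets, []
        return batch


class FakeMongoSink:
    """Hypothetical sink: writes topic records into named collections."""

    def __init__(self):
        self.collections = {}

    def put(self, topic, records):
        # The collection is named after the topic the records came from.
        self.collections.setdefault(topic, []).extend(records)


def run_once(source, sink, topic_name, topic_log):
    # Source side: external system -> topic.
    start = len(topic_log)
    topic_log.extend(source.poll())
    new_records = topic_log[start:]
    # Processing step (assumption): normalize the tweet text.
    processed = [{"text": r["text"].strip().lower()} for r in new_records]
    # Sink side: topic -> external system.
    sink.put(topic_name, processed)


source = FakeTwitterSource([{"text": "  Hello Kafka "}, {"text": "Streaming!"}])
sink = FakeMongoSink()
log = []  # stands in for the Kafka topic
run_once(source, sink, "tweets", log)
print(sink.collections["tweets"])
```

Neither side knows about the other: the source only writes to the topic and the sink only reads from it, which is what makes connectors interchangeable.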

The Kafka deployment used here will not work without ZooKeeper. It plays an important role in the Kafka ecosystem, as described in detail in this Quora question. ZooKeeper is used for:

  1. Electing a broker as controller, which manages the leader/follower relationships of the partitions;
  2. Tracking cluster membership, that is, which brokers are alive and part of the cluster;
  3. Managing the topic configuration information;
  4. Quotas: how much data each client is allowed to read and write;
  5. ACLs: which processes are allowed to read from and write to which topics.

This test runs on Linux, where Confluent is officially supported, with a correctly configured JVM. The following list contains every tool you need:

  1. Apache Kafka;
  2. Confluent;
  3. MongoDB;
  4. Twitter.

Running Confluent Platform

Start ZooKeeper, the Kafka broker and the Schema Registry, each service in its own terminal (see here for details). Assuming that $CONFLUENT_HOME refers to the root of your Confluent Platform installation:

Running Kafka-source-connect-twitter

Build the cloned source code, for example with Maven (see here for details):

Add the JAR file location to your CLASSPATH.

Visit https://apps.twitter.com/ and Create a New App (if not already done) to obtain the required access keys, then set up the connector properties in twitter-source.properties. Here is an example:

Then start a Kafka Connect source instance from a new terminal opened in the Twitter source root directory.

Verify that everything works by opening a consumer instance to receive the data:

Running MongoDB-sink-connector

This time, use Gradle to build the cloned source code.

Export the JAR location in CLASSPATH, keeping the previous entry as well.

Modify the sink.properties file by following this example.

Then run the MongoDB-Sink-Connector, for example in standalone mode.

You can view the new data in MongoDB: it is stored in a collection with the same name as the chosen topic, alongside an additional OFFSETS collection.