Kafka Connect tutorial: how connectors, sinks, and sources work (2023)

Acquiring data from external systems

Kafka Connect is an Apache Kafka® component that is used to perform streaming integration between Kafka and other systems such as databases, cloud services, search indexes, file systems, and key-value stores.

If you're new to Kafka, you might want to take a look at the Apache Kafka 101 course before you start with this one.

Kafka Connect makes it easy to stream data from multiple sources to Kafka and stream data from Kafka to multiple destinations. The diagram you see here shows a small sample of these sources and sinks (targets). There are literally hundreds of different connectors available for Kafka Connect. Some of the most popular are:

  • RDBMS (Oracle, SQL Server, Db2, Postgres, MySQL)
  • Cloud Object Storage (Amazon S3, Azure Blob Storage, Google Cloud Storage)
  • Message queues (ActiveMQ, IBM MQ, RabbitMQ)
  • NoSQL and document stores (Elasticsearch, MongoDB, Cassandra)
  • Cloud data warehouses (Snowflake, Google BigQuery, Amazon Redshift)

Cloud-managed Confluent connectors

In the next modules of the course, we will focus more on getting Kafka Connect running, but for now, you should know that one of the nice features of Kafka Connect is that it is flexible.

You can run Kafka Connect yourself or take advantage of the numerous fully managed connectors available in Confluent Cloud for a completely cloud-based integration solution. In addition to managed connectors, Confluent provides fully managed Apache Kafka, Schema Registry, and ksqlDB stream processing.

How Kafka Connect works

Kafka Connect runs in its own process, independent of Kafka brokers. It is distributed, scalable and fault tolerant, providing the same features you know and love in Kafka itself.

But the best thing about Kafka Connect is that it requires no programming to use. It is completely configuration-based, making it accessible to a wide range of users, not just developers. In addition to data ingestion and output, Kafka Connect can also perform lightweight transformations on the data as it is transferred.
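
To make "configuration-based" concrete, here is a minimal sketch that registers a hypothetical connector with the Kafka Connect REST API from Python; the worker URL, file path, topic name, and connector name are assumptions for illustration, not values from this course. It also attaches a Single Message Transform (SMT) to show a light in-flight transformation.

    import json
    import requests  # third-party HTTP client, assumed to be installed

    CONNECT_URL = "http://localhost:8083"  # hypothetical Connect worker address

    connector_config = {
        "name": "example-file-source",  # hypothetical connector name
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": "/tmp/example-input.txt",   # hypothetical input file
            "topic": "example-topic",           # hypothetical destination topic
            # A lightweight Single Message Transform applied as data flows through:
            "transforms": "addSource",
            "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
            "transforms.addSource.static.field": "source_system",
            "transforms.addSource.static.value": "file-demo",
        },
    }

    # Submitting the configuration is all it takes; no application code is written.
    response = requests.post(
        f"{CONNECT_URL}/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector_config),
    )
    response.raise_for_status()
    print(response.json())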

Whenever you want to stream data to Kafka from another system or stream data from Kafka to another place, Kafka Connect should be the first thing that comes to mind. Let's look at some common Kafka Connect use cases.

Streaming pipelines

Kafka Connect can be used to ingest real-time event streams from a data source and stream them to a target system for analysis. In this particular example, our data source is a transactional database.

We have a Kafka connector that queries the database for updates and translates the information into real-time events that it produces to Kafka.
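
As a hedged sketch of what such a connector configuration might look like, the following registers the Confluent JDBC source connector against a hypothetical Postgres database; the connection details, column names, topic prefix, and worker URL are placeholders, not values from this course.

    import requests  # the Connect worker URL below is an assumption

    jdbc_source = {
        "name": "orders-db-source",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "tasks.max": "1",
            "connection.url": "jdbc:postgresql://db.example.com:5432/orders",
            "connection.user": "connect_user",
            "connection.password": "secret",
            # Poll for new and updated rows using an id column plus a timestamp column.
            "mode": "timestamp+incrementing",
            "incrementing.column.name": "id",
            "timestamp.column.name": "updated_at",
            "topic.prefix": "db-",        # events land in topics such as db-<table>
            "poll.interval.ms": "5000",
        },
    }

    requests.post("http://localhost:8083/connectors", json=jdbc_source).raise_for_status()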

That in itself is great, but there are some other useful things we get by adding Kafka to the mix:

  • First, putting Kafka between the source and target systems means that we are building a loosely coupled system. In other words, it is relatively easy to change the source or target without affecting the others.
  • Additionally, Kafka acts as a data buffer, applying back pressure as needed.
  • Also, because we use Kafka, we know that the system as a whole is scalable and fault tolerant.

Because Kafka stores data for a configurable span of time per topic, it is possible to stream the same original data to multiple destinations. This means you only need to move data into Kafka once; it can then be consumed by many different downstream technologies to meet different business requirements, and even shared across different areas of the company.

Writing to Datastores from Kafka

As another use case, you might want to write data created by an application to a target system. This can of course apply to many different application use cases, but suppose we have an application that generates a series of logging events, and we would like those events to also be stored in a document store or persisted to a relational database.

Imagine you added this logic directly to your application. You'd have to write a decent amount of boilerplate code for that to happen, and whatever code you add to your application to accomplish this will have nothing to do with the application's business logic. You'd also have to maintain that extra code, work out how to scale it with your application, how to deal with crashes, restarts, and so on.

Instead, you can add a few simple lines of code to produce data directly to Kafka and let Kafka Connect do the rest. As we saw in the last example, once data is in Kafka, we can freely configure Kafka connectors to move it to whatever downstream datastore we need, and this is fully decoupled from the application itself.
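
A hedged sketch of those "few simple lines": the broker address, topic name, and event shape are assumptions, and the example assumes the confluent-kafka Python client is installed.

    import json
    from confluent_kafka import Producer  # assumes the confluent-kafka package is installed

    producer = Producer({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

    def log_event(level, message):
        """Produce an application log event to Kafka; connectors handle the rest."""
        event = {"level": level, "message": message}
        producer.produce("app-logs", value=json.dumps(event).encode("utf-8"))

    log_event("INFO", "user 42 logged in")
    producer.flush()  # block until outstanding messages have been delivered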

Evolve processing from old systems to new ones

Before the advent of newer technologies (such as NoSQL stores, event streaming platforms, and microservices), relational databases (RDBMSs) were the de facto place where all application data was stored. These data stores still play an extremely important role in the systems we build, but not always the central one. Sometimes you will want Kafka to serve as the messaging backbone between independent services as well as a permanent system of record. The two approaches are very different, but unlike technology changes in the past, there is a smooth transition path between them.

Using Change Data Capture (CDC), it is possible to extract every INSERT, UPDATE, and even DELETE from the database to the Kafka event stream. And we can do it in near real time. Using the underlying database transaction logs and lightweight queries, CDC has very little impact on the source database, meaning an existing application can continue to run without any changes, while new applications can be built, driven by a stream of events captured from the underlying database. When the original application writes something to the database - for example, an order is accepted - any application subscribed to the Kafka event stream will be able to take action based on those events - for example, a new order fulfillment service.
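
As a sketch of what a CDC source connector configuration can look like, here is a hypothetical Debezium Postgres connector; the database details, table names, and worker URL are placeholders, and the exact property names vary between Debezium versions (this follows recent releases).

    import requests  # the Connect worker URL below is an assumption

    cdc_source = {
        "name": "orders-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "db.example.com",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "secret",
            "database.dbname": "orders",
            "topic.prefix": "shop",                 # change topics look like shop.public.orders
            "table.include.list": "public.orders",  # capture only the orders table
        },
    }

    requests.post("http://localhost:8083/connectors", json=cdc_source).raise_for_status()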

For more information, see Ingesting Kafka data with CDC.

Make systems work in real time

Making systems real time is extremely valuable, because many organizations have data at rest in databases and will continue to do so.

But the real value of data is that we can access it as close to the moment it is generated as possible. By using Kafka Connect to capture data shortly after it is stored in the database and turn it into a stream of events, you can create much more value. This unlocks the data so it can be moved elsewhere, such as adding to a search index or analytics cluster. Alternatively, the event stream can be used to run applications when data in the database changes, such as recalculating an account balance or making recommendations.
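
For illustration, here is a hedged sketch of an event-driven application that reacts to such a change stream; the broker address, consumer group, topic name, and the balance-recalculation placeholder are all assumptions.

    import json
    from confluent_kafka import Consumer  # assumes the confluent-kafka package is installed

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",   # hypothetical broker
        "group.id": "balance-recalculator",      # hypothetical consumer group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["shop.public.orders"])   # topic produced by a CDC connector

    try:
        while True:
            msg = consumer.poll(1.0)             # wait up to one second for an event
            if msg is None or msg.error():
                continue
            order = json.loads(msg.value())
            # React as soon as the database changes, e.g. recalculate a balance or
            # update a search index (placeholder logic for illustration only).
            print(f"recalculating balance for account {order.get('account_id')}")
    finally:
        consumer.close()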

Why not write your own integrations?

All of this sounds great, but you're probably asking, "Why Kafka Connect? Why not write your own integrations?"

Apache Kafka has its own high-performance producer and consumer APIs and client libraries available in many languages, including C/C++, Java, Python, and Go. So why not just write your own code to get the data out of the system and write it to Kafka - wouldn't it make sense to write a short piece of consumer code to read from the topic and push it to the target system?
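
That hand-rolled approach might start out looking something like the sketch below (the broker, topic, and the write_to_target stub are assumptions); it works, until the realities described next set in.

    import json
    from confluent_kafka import Consumer  # assumes the confluent-kafka package is installed

    def write_to_target(record):
        ...  # imagine an HTTP call or a database insert here (hypothetical stub)

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # hypothetical broker
        "group.id": "hand-rolled-sink",
    })
    consumer.subscribe(["app-logs"])

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        write_to_target(json.loads(msg.value()))
        # Not handled here: retries, restarts from the right offsets, schema and
        # serialization changes, scaling across nodes, monitoring, logging...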

The problem is that if you're going to do it right, you need to be able to consider and handle failures, restarts, logging, elastic scaling up and down, and running across multiple nodes. And that's before you even think about serialization and data formats. Of course, once you've done all these things, you've written something that probably resembles Kafka Connect, but without the many years of development, testing, production validation, and community that exist around Kafka Connect. Even if you built a better mousetrap, is all the time spent writing this code to solve this problem worth it? Would the effort result in something that significantly differentiates your company from others dealing with similar integration problems?

The bottom line is that integrating external data systems into Kafka is a problem solved. There may be a few extreme cases where a bespoke solution is appropriate, but overall, you'll find that Kafka Connect will be the first thing you think of when you need to integrate a data system with Kafka.

Get Kafka up and running in minutes with Confluent Cloud

In this course, we will introduce you to Kafka Connect through hands-on exercises that will have you producing and consuming data with Confluent Cloud. If you haven't signed up for Confluent Cloud yet, sign up now so that when the first exercise asks you to log in, you'll be ready to do so.

  1. Go to the registration page: https://www.confluent.io/confluent-cloud/tryfree/ and enter your contact details and password. Then click the Start Free button and wait for the verification email.

  2. Click the link in the confirmation email, then follow (or skip) the prompts until you reach the Create Cluster page. Here you can see the different types of clusters that are available along with their costs. For this course, a basic cluster will suffice and will maximize your free usage credits. With Basic selected, click the Start Setup button.

  3. Select your preferred cloud provider and region, then click Continue.

  4. Review your selections and give your cluster a name, then click Start Cluster. This may take several minutes.

  5. While waiting for your cluster to be provisioned, be sure to add the promo code 101CALL for an additional $25 of free usage (Details). From the menu in the upper right corner, select Administration | Billing & Payments, then click the Payment Details tab. From there, click on the + Promo Code link and enter the code.

You can now complete upcoming exercises as well as take advantage of everything Confluent Cloud has to offer!

FAQs

How does a Kafka sink connector work? ›

The Kafka Connect JDBC Sink connector allows you to export data from Apache Kafka® topics to any relational database with a JDBC driver. This connector can support a wide variety of databases. The connector polls data from Kafka to write to the database based on the topics subscription.
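
For example, a JDBC sink configuration might look like the hedged sketch below; the worker URL, topic, key handling, and database details are placeholders for illustration.

    import requests  # the Connect worker URL below is an assumption

    jdbc_sink = {
        "name": "orders-jdbc-sink",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "tasks.max": "1",
            "topics": "orders",  # topic(s) to export to the database
            "connection.url": "jdbc:postgresql://db.example.com:5432/warehouse",
            "connection.user": "connect_user",
            "connection.password": "secret",
            "insert.mode": "upsert",   # insert or update based on the primary key
            "pk.mode": "record_key",   # take the primary key from the record key
            "pk.fields": "id",
            "auto.create": "true",     # create the target table if it does not exist
        },
    }

    requests.post("http://localhost:8083/connectors", json=jdbc_sink).raise_for_status()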

What are source and sink connectors in Kafka? ›

A connector that takes data from an external source system and feeds it into a topic is called a source connector. A connector that takes data from a topic and delivers it to an external target system is called a sink connector.

How does a Kafka source connector work? ›

Worker model: A Kafka Connect cluster consists of a set of worker processes that act as containers executing connectors and tasks. Workers automatically coordinate with each other to distribute work and provide scalability and fault tolerance.

What is a connector in Kafka Connect? ›

Kafka Connectors are pluggable components responsible for interfacing with external Data Systems to facilitate data sharing between them and Kafka. They simplify the process of importing data from external systems to Kafka and exporting data from Kafka to external systems.

What are sinks in Kafka? ›

A sink connector is an application for reading data from Kafka; under the hood it creates and uses Kafka consumer client code. As an example, a File Sink Connector reads the desired data from a topic and saves it to an external file, as in the sketch below.
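
A hedged sketch of registering that File Sink Connector; the worker URL, topic, and file path are placeholders.

    import requests  # the Connect worker URL below is an assumption

    file_sink = {
        "name": "example-file-sink",  # hypothetical connector name
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "tasks.max": "1",
            "topics": "example-topic",          # topic to read from
            "file": "/tmp/example-output.txt",  # local file to append records to
        },
    }

    requests.post("http://localhost:8083/connectors", json=file_sink).raise_for_status()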

What is the difference between Kafka connector and Kafka stream? ›

Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker. Kafka Connect is an API for moving data into and out of Kafka.

How do I stop a Kafka sink connector? ›

You can stop a specific connector by deleting the connector using the REST API [1]. You would need to make this REST call for every connector. If you have a lot of connectors running, you could write a little script that fetches the list of connectors [2] and deletes them one at a time, in a loop.
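
A minimal sketch of such a script, using the standard Connect REST endpoints (the worker URL is an assumption):

    import requests  # the Connect worker URL below is an assumption

    CONNECT_URL = "http://localhost:8083"

    # Fetch the list of connector names, then delete each one in turn.
    for name in requests.get(f"{CONNECT_URL}/connectors").json():
        requests.delete(f"{CONNECT_URL}/connectors/{name}").raise_for_status()
        print(f"deleted connector {name}")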

What is the difference between Kafka Connect and Kafka REST? ›

Kafka Streams is the streams API to transform, aggregate, and process records from a stream and produce derivative streams. Kafka Connect is the connector API to create reusable producers and consumers (e.g., a stream of changes from DynamoDB). The Kafka REST Proxy is used to produce and consume over REST (HTTP).

Where does Kafka connector run? ›

But where do the tasks actually run? Kafka Connect runs under the Java virtual machine (JVM) as a process known as a worker. Each worker can execute multiple connectors. When you look to see if Kafka Connect is running, or want to look at its log file, it's the worker process that you're looking at.

What is the limit of Kafka Connect connector? ›

A maximum of 16384 GiB of storage per broker. A cluster that uses IAM access control can have up to 3000 TCP connections per broker at any given time.

How does Kafka work internally? ›

Kafka depends on the operating system page cache: data written to disk has its pages cached in RAM by the operating system, and Kafka takes advantage of this. Internally, Kafka gets a message from the producer, writes it to memory (the page cache), and then writes it to disk.

When should I use a Kafka connector? ›

A common Kafka Connect use case is orchestrating real-time streams of events from a data source to a target for analytics. By having Kafka sit between the systems, the total system becomes loosely coupled, meaning that you can easily switch out the source or target, or stream to multiple targets, for example.

How many tasks are there in a Kafka connector? ›

For example, with 4 topics of 5 partitions each (20 partitions in total) and tasks.max set to 10, the connector will spawn 10 tasks, each handling data from 2 topic partitions.

How do I add connectors to Kafka connect? ›

Kafka Connect isolates each plugin so that the plugin libraries do not conflict with each other. To manually install a connector: Find your connector on Confluent Hub and download the connector ZIP file. Extract the ZIP file contents and copy the contents to the desired location.

How do I configure a connector in Kafka? ›

Installation Prerequisites
  1. The Kafka connector supports the following package versions: ...
  2. The Kafka connector is built for use with Kafka Connect API 3.2.3. ...
  3. Configure Kafka with the desired data retention time and/or storage limit.
  4. Install and configure the Kafka Connect cluster.

What is source vs sink? ›

Sink and Source are terms used to define the flow of direct current in an electric circuit. A sinking input or output circuit provides a path to ground for the electric load. A sourcing input or output provides the voltage source for the electric load.

What is sink vs source data? ›

Source and sink are used in a data flow analysis. The source is where data comes from, the sink is where it ends. With regards to application security, source and sink are frequently used for taint analysis. Data is "tainted" if it comes from an insecure source such as a file, the network, or the user.

Which is source and sink? ›

The photosynthetically active parts of a plant are referred to as the source. The areas of active growth and areas of storage are referred to as sinks. However, a source is not always a source, and a sink is not always a sink.

What are the types of connectors in Kafka? ›

Kafka Connect includes two types of connectors:
  • Source connector: Source connectors ingest entire databases and stream table updates to Kafka topics. ...
  • Sink connector: Sink connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or batch systems such as Hadoop for offline analysis.

What is connector vs task in Kafka Connect? ›

Tasks are the main actor in the data model for Connect. Each connector instance coordinates a set of tasks that actually copy the data. By allowing the connector to break a single job into many tasks, Kafka Connect provides built-in support for parallelism and scalable data copying with very little configuration.

What is the alternative to Kafka connector? ›

Known for its speed, ease of use, reliability, and capability of cross-platform replication, Amazon Kinesis is one of the most popular Kafka Alternatives. It is used for many purposes, including geospatial data connected to users, social networking data, and IoT sensors.

How do I know if a Kafka connector is running? ›

You can use the REST API to view the current status of a connector and its tasks, including the ID of the worker to which each was assigned. Connectors and their tasks publish status updates to a shared topic (configured with status.storage.topic), which all workers in the cluster monitor.
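
For example, querying the status endpoint from Python might look like this hedged sketch (the worker URL and connector name are assumptions):

    import requests  # the Connect worker URL and connector name below are assumptions

    CONNECT_URL = "http://localhost:8083"

    # The status response includes the connector state plus the state and
    # worker id of every task.
    status = requests.get(f"{CONNECT_URL}/connectors/orders-jdbc-sink/status").json()
    print(status["connector"]["state"])          # e.g. RUNNING or FAILED
    for task in status["tasks"]:
        print(task["id"], task["state"], task["worker_id"])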

How do I delete all connectors in Kafka connect? ›

You can delete all of the defined connectors currently loaded in Kafka Connect with a short script using the Confluent CLI (available as part of Confluent Platform 3.3 or later), or by looping over the REST API as sketched earlier.

Is Kafka Connect push or pull? ›

Once you have Kafka up and running, it's time to feed it with the data. When collecting data, there are two fundamental choices to make: Are we going to poll the data periodically (pull), or will the data be sent to us (push)? The answer is both!

What is the difference between ZooKeeper and Kafka Connect? ›

In general, ZooKeeper provides an in-sync view of the Kafka cluster. Kafka, on the other hand, is dedicated to handling the actual connections from the clients (producers and consumers) as well as managing the topic logs, topic log partitions, consumer groups ,and individual offsets.

What is the difference between dataflow and Kafka Connect? ›

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. Google Cloud Dataflow belongs to "Real-time Data Processing" category of the tech stack, while Kafka can be primarily classified under "Message Queue".

Does Kafka connect need zookeeper? ›

In Kafka architecture, Zookeeper serves as a centralized controller for managing all the metadata information about Kafka producers, brokers, and consumers. However, you can install and run Kafka without Zookeeper.

How do you increase throughput in Kafka connect? ›

Increasing the number of partitions and the number of brokers in a cluster will lead to increased parallelism of message consumption, which in turn improves the throughput of a Kafka cluster; however, the time required to replicate data across replica sets will also increase.

How many writes per second can Kafka handle? ›

To learn more about Kafka's performance, benchmarking, and tuning: Benchmark Your Dedicated Apache Kafka® Cluster on Confluent Cloud. Benchmarking Apache Kafka®: 2 Million Writes Per Second (On Three Cheap Machines)

How many connections does a Kafka producer have? ›

The max.in.flight.requests.per.connection Kafka producer config represents the maximum number of unacknowledged requests that the client will send on a single connection before blocking. The default value is 5.

How much memory does Kafka Connect need? ›

Memory. Kafka uses heap space very carefully and does not require setting heap sizes more than 6 GB. It can run optimally with 6 GB of RAM for heap space. This will result in a file system cache of up to 28–30 GB on a 32 GB machine.

How does Kafka ingest data? ›

The first step in Kafka for Data Ingestion requires producing data to Kafka. There are multiple components reading from external sources such as Queues, WebSockets, or REST Services. Consequently, multiple Kafka Producers are deployed, each delivering data to a distinct topic, which will comprise the source's raw data.

How data is transferred in Kafka? ›

With Kafka, all messages to consumers are sent via the broker by using a KafkaProducer . As they are sent, messages are assigned to topics by the sending process. Within topics, messages are assigned to partitions either explicitly or implicitly via the hashcode of the key associated with the message.

How data is stored in Kafka? ›

Kafka brokers split each partition into segments. Each segment is stored in a single data file on the disk attached to the broker. By default, each segment contains either 1 GB of data or a week of data, whichever limit is reached first.

Why not to use Kafka Connect? ›

Why Not Kafka Connect? There are a few situations in which Kafka Connect may not be the most appropriate solution for data integration: Complex data transformation: Kafka Connect is primarily designed for moving data between systems, so it may not be well-suited for complex data transformation tasks.

Can Kafka message be consumed multiple times? ›

If you set enable.auto.commit to true, it means that Kafka advances the offset as soon as it sends batched messages to the consumer and does not take care of whether the consumer actually handled the messages or not. In that case, each message is delivered at most once.

What are the four major components of Kafka? ›

Overview of Kafka Architecture

The compute layer consists of four core components—the producer, consumer, streams, and connector APIs, which allow Kafka to scale applications across distributed systems.

How do I know how many brokers are running in Kafka? ›

The output of the kafka-broker-api-versions.sh script will provide you with a list of available brokers along with their supported API versions. Each broker will be listed on a line in the output, starting with the broker ID, the host and port, and supported API versions.

How do I check Kafka connector log? ›

Use the Connect API. After you have Confluent Platform, Kafka Connect, and your connectors running, you can check and change log levels using Connect API endpoints. Changes made through the API are not permanent; that is, changes made using the API do not modify properties in the connect-log4j.properties file.
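
A hedged sketch of using those endpoints (the worker URL and the specific logger name are assumptions):

    import requests  # the Connect worker URL below is an assumption

    CONNECT_URL = "http://localhost:8083"

    # Inspect the current log levels...
    print(requests.get(f"{CONNECT_URL}/admin/loggers").json())

    # ...and raise one logger to TRACE at runtime (not persisted across restarts).
    requests.put(
        f"{CONNECT_URL}/admin/loggers/org.apache.kafka.connect.runtime.WorkerSourceTask",
        json={"level": "TRACE"},
    ).raise_for_status()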

How do I run a Kafka connector locally? ›

  1. Download Kafka from apache kafka. ...
  2. Build KCQuickstartSinkConnector jar from source code given below. ...
  3. Get console setup as shown in (to get a better view of commands running per screen)
  4. Copy the KCQuickstartSinkConnector jar to the folder "kafka_2.10-0.10.2.1/libs/"
  5. Start Kafka zookeeper.
  6. Start Kafka server.
Jan 29, 2020

What is the default port for Kafka connector? ›

Since Kafka Connect is intended to be run as a service, it also supports a REST API for managing connectors. By default this service runs on port 8083 .

How does MongoDB Kafka connector work? ›

Overview. The MongoDB Connector for Apache Kafka is a Confluent-verified connector that persists data from Apache Kafka topics as a data sink into MongoDB as well as publishes changes from MongoDB into Kafka topics as a data source.

How to install Kafka sink connector? ›

Install Kafka Connector manually
  1. Navigate to the Kafka Connect Scylladb Sink github page and clone the repository.
  2. Using a terminal, open the source code (src) folder.
  3. Run the command mvn clean install .
  4. Run the Integration Tests in an IDE. If tests fail run mvn clean install -DskipTests .

How does Debezium connector work? ›

Debezium is a set of source connectors for Apache Kafka Connect. Each connector ingests changes from a different database by using that database's features for change data capture (CDC).

What is the difference between Debezium connector and Kafka connector? ›

The Debezium platform has a vast set of CDC connectors, while Kafka Connect comprises various JDBC connectors to interact with external or downstream applications. However, Debezium's CDC connectors can only be used as source connectors that capture real-time event change records from external database systems.

How do I write a Kafka source connector? ›

In the following sections, we'll cover the essential components that will get you up and running with your new Kafka connector.
  1. Step 1: Define your configuration properties. ...
  2. Step 2: Pass configuration properties to tasks. ...
  3. Step 3: Task polling. ...
  4. Step 4: Create a monitoring thread.
Oct 23, 2019

How do I run Kafka connectors? ›

How to Use Kafka Connect - Get Started
  1. Install a Connect Plugin.
  2. Configure and Run Workers.
  3. Configuring Key and Value Converters.
  4. Connect Producers and Consumers.
  5. Source Connector Auto Topic Creation.
  6. Connect Reporter.
  7. ConfigProvider Interface.
  8. Shut Down Kafka Connect.

Is Debezium a sink connector? ›

The Debezium JDBC connector is a Kafka Connect sink connector, and therefore requires the Kafka Connect runtime.

What happens if Debezium goes down? ›

If the application crashes unexpectedly, then upon restart the application's consumer will look up the last recorded offsets for each topic and start consuming events from the last offset for each topic.

How do data connectors work? ›

A data connector collects data from a variety of sources, which could include different databases, files, software, CRMs, analytics platforms, and more, and delivers it to a single predetermined destination.

What is the difference between Debezium and JDBC source connector? ›

The main differences between Debezium and JDBC Connector are: Debezium is used only as a Kafka source and JDBC Connector can be used as Kafka source and sink.

Does Debezium use Kafka Connect? ›

Debezium is built on top of Apache Kafka and provides a set of Kafka Connect compatible connectors. Each of the connectors works with a specific database management system (DBMS).
