
CDC and Data Streaming: Capture Database Changes in Real Time with Debezium PostgreSQL Connector

Introduction

In today's data-driven world, staying ahead means acting on the most up-to-date information. That's where change data capture (CDC) comes in.

CDC is a design pattern that tracks your database tables, capturing every row-level insert, update, and delete as it happens. This real-time monitoring allows downstream systems to react to changes instantly, without batch-based updates and resource-intensive full scans. As the complexity of your data architecture grows, CDC is the key to keeping data systems synchronized across an organization efficiently.

A lot has been written about this topic, so in this blog we'll dive deeper into CDC connectors, exploring their important role in data streaming, followed by an in-depth example implementation of Confluent's fully managed Debezium CDC connector. Continue reading to learn more about the power of real-time data!

CDC connectors in data streaming

In its early days, CDC was the go-to solution for database replication and migration. Traditional methods involved batch processes, causing delays and data inconsistency. CDC introduced an efficient way to replicate databases and stores in real time. However, with the rising demands of real-time analytics from internal stakeholders and event-driven applications from customers, CDC has catapulted itself into a new role. Beyond database replication, CDC is now a key component of data streaming platforms, providing real-time insights and triggering event-driven workflows.

Imagine an e-commerce application where a new order instantly triggers a series of automatic actions: updates to inventory, a confirmation email to the customer, and scheduling of delivery. Or imagine updates to customer behavior data enabling marketing teams to make personalized recommendations in real time. Both of these scenarios rely on a CDC connector to deliver real-time changes.

So the question is: Why is CDC such a perfect fit for data streaming with Apache Kafka®? Let's break it down:

  • Real-time data flow: CDC seamlessly captures and streams data changes from source systems directly into Kafka, powering real-time use cases that enhance user experiences and provide businesses with instant operational insights. This continuous, low-latency data flow enables organizations to make rapid data-driven decisions.

  • Data integrity and fault tolerance: CDC connectors preserve the exact sequence of changes, ensuring consistency between source databases and Kafka topics. Combined with Kafka's immutable log, this approach makes it easy to audit and replay data. More importantly, CDC and Kafka fundamentally decouple source and sink systems, reducing the risk of downtime and breakage from brittle point-to-point integrations.

  • Scalable and flexible data delivery: CDC enables a scalable approach to data integration by efficiently capturing database changes into Kafka topics. Developers can then use stream processing tools like Flink to transform and route this data in real time, creating reusable data products. This flexibility allows a single data stream to support multiple downstream systems and use cases without additional load on source systems.

At Confluent, we're committed to making real-time data integration powerful and seamless. Today, we offer over 80 pre-built, fully managed connectors, all built on the robust foundation of the open source Kafka Connect framework. With our fully managed connectors, you can spin up a connector within a few minutes, reducing the typical 3-6 month cycle of designing, building, and testing.

Implementing CDC with Debezium

Let's look at how the Debezium PostgreSQL CDC connector can be leveraged to kick-start your data streaming pipeline. Before diving into our hands-on example, let's explore what makes Debezium critical for modern data architectures.

Debezium is an open source distributed platform for CDC. It's designed to monitor databases and detect real-time changes, enabling you to capture them as events and stream them to other systems or applications. Debezium has gained popularity for several reasons:

  • Real-time data synchronization: Enables near real-time synchronization between databases and other systems.

  • Microservices architecture support: Facilitates event-driven architectures and microservices communication.

  • Data integration: Simplifies complex ETL processes by streaming changes directly from source systems.

  • Variety of supported databases: Works with multiple databases including MySQL, PostgreSQL, MongoDB, and others.

  • Open source and community-driven: Benefits from continuous improvements and wide adoption.

Debezium connectors integrate seamlessly with Confluent Cloud for several reasons:

  • Kafka integration: Integrates seamlessly with Apache Kafka, the de facto data streaming technology.

  • CDC capabilities: Provides robust change data capture capabilities for various databases.

  • Complementary technology: Enhances Confluent's data streaming platform by enabling real-time data ingestion from popular databases.

Confluent's v2 Debezium connectors introduced several improvements that simplify connector usage. While some changes were necessary due to breaking changes in Debezium v1.9.6 and beyond, the v2 connectors offer significant improvements over v1 in the following areas:

  • Enhanced performance: Optimized for higher throughput and reduced resource consumption.

  • Improved error handling: Enhanced management of network issues and other failures.

  • New configuration options: More flexibility in connector setup and behavior.

  • Expanded metadata: Additional metadata fields for easier tracing and debugging.

  • Kafka Connect framework alignment: Better adherence to Kafka Connect standards for improved compatibility.

Now let's continue with the example implementation of a Debezium PostgreSQL CDC connector.

The first step is to get the transactional data changes into Kafka. This is done by setting up a connector specific to the source database. The "Connectors" tab in Confluent Cloud displays a list of available connectors based on the cloud provider of your Kafka cluster. Select the PostgreSQL CDC Source V2 connector, which captures all transactions in your Postgres database and routes them to Kafka topics. Note that the connector automatically creates and configures the Kafka topics it needs for the captured tables. Before you begin, ensure that Confluent Cloud has the necessary permissions to access your system, and refer to the private networking options if needed.

Continue with the following steps:

  1. Select the authentication mechanism: choose between creating a new API key or using an existing one. You can also specify a service account to connect to the Kafka broker.

  2. Configure the PostgreSQL source database: enter the hostname, port, and credentials needed to access the database.

  3. Configure the connector to capture changes and send them to the right topics: the output record format must be specified in the connector configuration. When a topic is created, a schema can be associated with it (if you use Avro, Protobuf, or JSON Schema). You also have the option to specify advanced configurations, such as single message transforms (SMTs).

  4. Ensure the connector sizing meets your requirements: note that the Postgres connector supports only one task.

  5. Review the configuration and launch the connector.

Here is a sample Postgres connector creation workflow:
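In the Cloud Console this is a guided, form-based workflow. For illustration, below is a minimal sketch of the kind of configuration it produces for the PostgreSQL CDC Source V2 connector; the property names follow the connector's documented settings, but every value here is a placeholder or assumption, so check them against the connector documentation for your environment.

    {
      "name": "PostgresCdcSourceV2Connector_0",
      "connector.class": "PostgresCdcSourceV2",
      "kafka.auth.mode": "KAFKA_API_KEY",
      "kafka.api.key": "<kafka-api-key>",
      "kafka.api.secret": "<kafka-api-secret>",
      "database.hostname": "postgres.example.com",
      "database.port": "5432",
      "database.user": "cdc_user",
      "database.password": "**********",
      "database.dbname": "shop",
      "database.sslmode": "require",
      "topic.prefix": "shop",
      "table.include.list": "public.orders,public.customers",
      "output.data.format": "AVRO",
      "tasks.max": "1"
    }

A configuration like this can also be supplied as a file to the Confluent CLI instead of the Console wizard, which is handy once you move from experimentation to automated deployments.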

Once the connector is created, you can view the processing metrics on the connector dashboard.

Common mistakes while configuring the connector

Although the configuration of CDC connectors is no different from that of other Confluent Cloud connectors, here are some of the most common mistakes our users encounter during connector creation. Most of these are related to source database configuration and access controls.

  • Incorrect database permissions: Not granting the necessary permissions to the database user the connector uses, including replication privileges

    • Resolution: Give the connector user the REPLICATION and LOGIN role attributes, grant CONNECT on the database, and grant SELECT on the captured tables (see the sketch after this list).

  • Misconfigured PostgreSQL settings: DB settings needed for replication

    • Resolution: Set wal_level to logical in the PostgreSQL configuration, and set max_wal_senders and max_replication_slots to appropriate values (see the sketch after this list).

  • Improper SSL configuration: SSL is not set up correctly for the connector's access to the database

    • Resolution:

      • Enable SSL in postgresql.conf: ssl = on

      • Properly configure SSL certificates and keys

      • In connector config, set database.sslmode to require

      • Provide correct paths to SSL certificates and keys

  • Insufficient monitoring leading to suboptimal performance: No monitoring in place to detect and rectify operational issues

    • Resolution:

      • Set up monitoring using the Confluent Cloud Metrics API

      • Implement alerts for connector status changes

      • Use JMX metrics for detailed performance monitoring
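To make the database-side fixes above concrete, here is a minimal sketch of the corresponding SQL, assuming a connector user named cdc_user and a database named shop. The role name, the value 10 for the replication limits, and the public schema are illustrative only, and the settings marked as requiring a restart will not take effect until the server is restarted.

    -- Replication settings (these three require a PostgreSQL server restart)
    ALTER SYSTEM SET wal_level = 'logical';
    ALTER SYSTEM SET max_wal_senders = 10;
    ALTER SYSTEM SET max_replication_slots = 10;

    -- Enable SSL (takes effect on reload; certificates and keys must also be configured)
    ALTER SYSTEM SET ssl = 'on';

    -- Connector user with replication and login privileges
    CREATE ROLE cdc_user WITH LOGIN REPLICATION PASSWORD 'change-me';
    GRANT CONNECT ON DATABASE shop TO cdc_user;
    GRANT USAGE ON SCHEMA public TO cdc_user;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;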

Best practices: PostgreSQL v2 source connector

Table selection and filtered publications

When implementing a connector setup where you need to:

  • Capture only specific tables from your PostgreSQL database

  • Configure different tables for different connectors

  • Optimize replication performance

It is strongly recommended that you use filtered publications. This approach offers several advantages (a minimal example follows the list):

  1. Improved efficiency by limiting change data capture to only the required tables.

  2. Reduced network bandwidth usage.

  3. Better resource utilization on both the database and connector sides.

  4. More granular control over which tables are replicated by each connector.
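As a minimal example, a filtered publication that restricts capture to two tables might be created as shown below. The publication and table names are illustrative, and the connector is then pointed at the publication through its publication-name setting, so check the connector documentation for the exact property.

    -- Create a publication covering only the tables this connector should capture
    -- (run as a superuser or as the owner of the listed tables)
    CREATE PUBLICATION shop_cdc_publication FOR TABLE public.orders, public.customers;

    -- Tables can later be added to or removed from the publication as requirements change
    ALTER PUBLICATION shop_cdc_publication ADD TABLE public.payments;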

Transaction management

Long-running transactions can significantly impact connector performance and reliability.

Potential issues include:

  • Extended delays in streaming changes

  • Increased memory usage on the database server

  • Growing WAL (Write-Ahead Log) size

  • Potential connector restart complications

We recommend taking the following measures (a monitoring sketch follows this list):

  • Avoid keeping transactions open for prolonged durations

  • Implement appropriate transaction timeouts

  • Monitor for long-running transactions

  • Consider implementing application-level safeguards against prolonged transactions
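One way to monitor for long-running transactions on the source database is a query against pg_stat_activity, and idle_in_transaction_session_timeout offers a simple server-side safeguard. The 5-minute value below is an arbitrary example, not a recommendation for every workload.

    -- List open transactions ordered by how long they have been running
    SELECT pid,
           now() - xact_start AS xact_duration,
           state,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_duration DESC;

    -- Example safeguard: end sessions that sit idle inside a transaction for too long
    ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';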

Replication slot management

Manually modifying replication slots that are actively being used by a running connector is strongly discouraged. Doing so can lead to:

  • Data loss or duplication

  • Connector instability

  • Inconsistent streaming behavior

  • Potential system crashes

Regular maintenance of inactive replication slots is crucial (see the query sketch after this list):

  • Inactive slots continue to retain WAL files

  • Retained WAL files consume disk space

  • Excessive WAL retention can lead to disk space issues

  • WAL buildup can also impact database performance and backup strategies
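A query against pg_replication_slots makes it easy to spot inactive slots and the WAL they are holding back. Slots should only be dropped once you are certain no connector still depends on them; the slot name below is illustrative.

    -- Show each replication slot, whether it is active, and how much WAL it retains
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;

    -- Drop a slot that no connector uses anymore (fails if the slot is still active)
    SELECT pg_drop_replication_slot('old_connector_slot');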

Summary

In this blog post, we explored the inception of change data capture (CDC) and examined how it has become an ideal fit for data streaming, while also reviewing how Confluent simplifies both data ingestion and distribution. We then detailed an example implementation of a PostgreSQL CDC connector, showcasing how straightforward it is and sharing best practices to follow when integrating your data source.

However, if you are not using PostgreSQL as your source system, our fully managed connector portfolio provides a range of CDC connectors for diverse database systems: MySQL CDC, Oracle CDC, SQL Server CDC, DynamoDB CDC, Salesforce CDC, and MongoDB CDC.

By leveraging Confluent's fully managed CDC connectors, you're not just simplifying your data integration; you're also ensuring high throughput and performance, allowing you to focus on your core business instead of managing the infrastructure.

Ready to build data streaming pipelines with CDC? Sign up for a free trial of Confluent Cloud to explore new features. New sign-ups receive $400 to spend within Confluent Cloud during their first 30 days. Use the code CL60BLOG for an additional $60 of free usage.*

The preceding outlines our general product direction and is not a commitment to deliver any material, code, or functionality. The development, release, timing, and pricing of any features or functionality described may change. Customers should make their purchase decisions based on services, features, and functions that are currently available.

Confluent and associated marks are trademarks or registered trademarks of Confluent, Inc.

Apache®, Apache Kafka®, and Apache Flink® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.