The Cloud-Native Kafka: Boosting Data Synchronization Reliability and Consistency
2023-09-08 15:12
Chainbase

Background

In the current blockchain landscape, many business scenarios require real-time access to both on-chain and off-chain data for further data value processing. Examples include SmartMoney transaction monitoring based on data filtering, and real-time TVL (Total Value Locked) decision-making based on data statistics. These scenarios demand support for rich streaming datasets, and Chainbase provides an open, dataset-based real-time Sync module to address such challenges.

Challenges

Currently, most real-time message-streaming solutions on the market are based on WebSocket or Webhook models, and neither can guarantee "exactly once" semantics, i.e. that each message is consumed exactly once.

WebSocket enables real-time communication, but network instability and the complexity of message transmission can lead to misordered, lost, or duplicated messages.

Furthermore, HTTP requests are stateless, so by themselves they guarantee neither reliability nor ordering. If errors occur during network transmission, or if the server does not handle retries and idempotence correctly, requests may be duplicated or lost, again failing to achieve "exactly once" semantics.
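
To make this concrete, the sketch below shows one common mitigation on the receiving side: deduplicating deliveries by a unique message id so that retries do not cause double-processing. It is a minimal illustration rather than Chainbase's implementation; the `item_id` field name and the in-memory set are assumptions (a real service would use a durable store).

```python
# Minimal sketch of idempotent delivery handling: deduplicate by a unique
# message id so retried deliveries are processed only once. "item_id" is an
# assumed field name; a production service would persist seen ids in a
# database rather than in memory. Lost deliveries still cannot be recovered.
processed_ids: set[str] = set()

def process(payload: dict) -> None:
    # Placeholder for real business logic.
    print("processing", payload.get("item_id"))

def handle_delivery(payload: dict) -> None:
    item_id = payload["item_id"]
    if item_id in processed_ids:
        return                      # duplicate caused by a retry; skip it
    process(payload)
    processed_ids.add(item_id)      # mark as done only after processing succeeds

# Example: the second call is recognized as a duplicate and ignored.
handle_delivery({"item_id": "tx-1", "value": 100})
handle_delivery({"item_id": "tx-1", "value": 100})
```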

In contrast, Kafka supports message replay. As a message queue, Kafka persists messages to disk, so in addition to guaranteeing "at least once" delivery it allows messages to be replayed, which makes "effectively once" processing and standby replay possible.

This is particularly important for streaming applications built on the premise of no data loss. Only a system with message-replay capability can ensure the reliability and consistency of data processing and meet "exactly once" semantics. For example, in real-time computation and in log collection and analysis, processing can simply be restarted and the data replayed from an earlier offset if errors occur. This makes Kafka better suited to complex real-time data-processing requirements.
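
As a rough illustration of that replay property, the sketch below (using the open-source kafka-python client) rewinds a consumer to the earliest retained offset and reprocesses from there. The broker address and topic name are placeholders, not Chainbase-specific values.

```python
# Sketch: replaying persisted messages by seeking a consumer back to the
# beginning of a partition (kafka-python client). Broker and topic names are
# illustrative placeholders.
import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",             # placeholder broker
    value_deserializer=lambda b: json.loads(b),
    enable_auto_commit=False,
)

tp = TopicPartition("ethereum.transactions", 0)     # hypothetical single-partition topic
consumer.assign([tp])
consumer.seek_to_beginning(tp)                      # rewind to the earliest retained message

for msg in consumer:
    # Reprocessing is safe as long as the handler is idempotent.
    print(msg.offset, msg.value.get("hash"))
```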

Key Concepts

Basic Introduction

Kafka is a distributed stream processing platform designed for high-capacity, low-latency data transfer and persistent storage. It employs a publish-subscribe message queue model and relies on distributed log storage to process messages. Topics serve as destinations for message publication and can be divided into multiple partitions distributed across different storage nodes. Producers are responsible for publishing messages to specific topics, while consumers fetch messages from Kafka based on subscribed topics and partitions. Kafka uses offsets to track consumption state and ensure message order, while also utilizing distributed log storage for message persistence.
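
As a minimal sketch of the publish side, the snippet below sends a JSON message to a topic with the kafka-python client. The broker address, topic name, key, and payload are illustrative assumptions rather than Chainbase's actual configuration.

```python
# Sketch: publishing a JSON message to a Kafka topic (kafka-python client).
# Broker, topic, key, and payload are illustrative placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # placeholder broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"type": "transaction", "block_number": 17684799, "hash": "0xd310..."}
producer.send("ethereum.transactions", key=str(event["block_number"]), value=event)
producer.flush()   # block until the broker acknowledges the message
```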


In our managed Kafka service, to ensure end-to-end data sequencing in blockchain scenarios, we restrict topics to a single partition.
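
A matching consumer sketch, under the same assumptions (placeholder broker, hypothetical topic name, kafka-python client): with a single-partition topic, messages arrive strictly in offset order, and committing offsets only after processing gives at-least-once delivery, which becomes effectively-once when the handler is idempotent.

```python
# Sketch: consuming a single-partition topic in order, committing offsets only
# after each message has been processed (at-least-once; effectively-once when
# processing is idempotent). Broker, group id, and topic are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ethereum.transactions",                    # hypothetical topic name
    bootstrap_servers="localhost:9092",         # placeholder broker
    group_id="chain-sync-demo",
    enable_auto_commit=False,                   # commit manually, after processing
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b),
)

for msg in consumer:
    event = msg.value                           # arrives in offset order within the partition
    print(event.get("type"), event.get("hash"))
    consumer.commit()                           # record progress only after success
```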

Blockchain Data ETL to Kafka


Currently, we support data acquisition from multiple chains (Ethereum, Binance Smart Chain, Polygon, etc.) and data types (logs, transactions, transfers, etc.) via Kafka topics.

  • Raw Data

    We have built our own RPC nodes for multiple chains, providing stable and reliable raw data whose field formatting is consistent with leading ETL tools; you can refer to [2] for the specific schemas. Blockchain data is transparent and highly composable, and we provide raw data types such as blocks, logs, transactions, traces, and contracts, which users can freely combine to fit their own business scenarios. Below is an example of a transaction message.

        {
          "type": "transaction",
          "hash": "0xd310285f40c898ab6da873f4a4b769fb838cc618be495cf880c3a9e3f2e6dea4",
          "nonce": 24719,
          "transaction_index": 1,
          "from_address": "0x759ec1b3326de6fd4ba316f65a6f689c4e4c3092",
          "to_address": "0x759ec1b3326de6fd4ba316f65a6f689c4e4c3092",
          "value": 100000,
          "gas": 31000,
          "gas_price": 449542706643,
          "input": "0x",
          "block_timestamp": 1689253991,
          "block_number": 17684799,
          "block_hash": "0x2b2e1e5cfce445a1ff5227eaff0ec8d6c335cf6b7e00e0bc05abc468bcdc9b89",
          "max_fee_per_gas": 526493200000,
          "max_priority_fee_per_gas": 426493200000,
          "transaction_type": 2,
          "receipt_cumulative_gas_used": 141613,
          "receipt_gas_used": 21000,
          "receipt_contract_address": null,
          "receipt_root": null,
          "receipt_status": 1,
          "receipt_effective_gas_price": 449542706643,
          "item_id": "transaction_0xd310285f40c898ab6da873f4a4b769fb838cc618be495cf880c3a9e3f2e6dea4",
          "item_timestamp": "2023-07-13T13:13:11Z"
        }

  • Processed Data

    To better meet the needs of blockchain transaction monitoring and similar scenarios, we enrich and aggregate data such as Transfers to provide a more comprehensive and insightful view of transaction behavior on the chain. In the future we will add more enriched data types to help customers obtain real-time on-chain data faster and more accurately. Below is an example of an enriched transfer message; a minimal consumer sketch that filters such messages follows this list.

        {
          "value": "41063984478246667122325862",
          "block_number": 17862713,
          "block_timestamp": 1691407127,
          "block_hash": "0x5a134e7a3da9c51581f15f962f2beded4f87449dc708bd79df2ef148a4a219fc",
          "transaction_index": 59,
          "transaction_hash": "0x4ff499e45269101ef2444975013c78abafd612dfef7f603e62dca884d4f79158",
          "contract_address": "0x6d23e40e776c5d3505f0a8e2b434425121d9818e",
          "log_index": 159,
          "from_address": "0xb6b6768a21cc3354dbc413739432c8d8213c1628",
          "to_address": "0xe38b9f6d7a58aca5531e383d4357825a075c65ec",
          "token_metas": {
            "symbol": "DCI",
            "name": "DeltaCore Integrations",
            "decimals": 18
          }
        }
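
As an illustration of how the enriched transfer stream above might be consumed, here is a minimal sketch that flags unusually large transfers, in the spirit of the SmartMoney monitoring mentioned earlier. The topic name, broker address, and threshold are assumptions; the field names follow the example payload.

```python
# Sketch: watching the enriched transfer stream for large transfers.
# Topic, broker, and threshold are illustrative; field names follow the
# example message above ("value" is a decimal string in raw token units).
import json
from kafka import KafkaConsumer

WHALE_THRESHOLD = 10**24        # raw units; the real cutoff depends on "decimals"

consumer = KafkaConsumer(
    "ethereum.token_transfers",                 # hypothetical topic name
    bootstrap_servers="localhost:9092",         # placeholder broker
    group_id="transfer-monitor-demo",
    value_deserializer=lambda b: json.loads(b),
)

for msg in consumer:
    transfer = msg.value
    if int(transfer["value"]) >= WHALE_THRESHOLD:
        meta = transfer.get("token_metas", {})
        print(f'{transfer["value"]} {meta.get("symbol")}: '
              f'{transfer["from_address"]} -> {transfer["to_address"]}')
```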

Kafka Operation and Maintenance

Our professional operations team has years of experience operating large-scale data systems.

With our managed Kafka service, customers can get started immediately and benefit from simplified management, automated deployment, and comprehensive monitoring and maintenance. Specifically:

  • Simplified Management: Managed Kafka relieves customers of management burdens. It offers automated cluster configuration, deployment, monitoring, and maintenance, reducing the operational team’s workload.
  • High Reliability: Managed Kafka provides a high availability and redundancy mechanism, ensuring message persistence and reliability. With backup and fault recovery strategies, it can effectively address hardware failures or other unexpected situations.
  • Elastic Scalability: Managed Kafka dynamically adjusts capacity and throughput according to demand, providing elasticity and scalability. Through a simple control interface or API, you can scale your Kafka cluster up or down as the workload changes.
  • Security Hardening: Managed Kafka offers data encryption, authentication, and access control. This ensures data confidentiality and integrity and helps organizations meet compliance requirements.

Benefits

  1. End-to-End Data Consistency

     With protocols like WebSocket or Webhook, end-to-end data consistency cannot be guaranteed. For example, when a client sends a data packet to the server, the packet traverses multiple routers and network devices, and at any one of them network congestion or technical issues may cause it to be dropped. Lost packets never reach the server, resulting in data loss; the server may wait for them within a specific time interval and, if they do not arrive, consider the data lost. In contrast, Kafka gives our servers a practical path to end-to-end data consistency.

  2. Rich Ecosystem

     Integrating Kafka into ETL (Extract, Transform, Load) tools, whether open-source or commercial, is common practice. Kafka, as a high-performance, scalable distributed message queue, provides reliable data transfer and persistent storage, and integrating it into ETL pipelines offers several advantages.

     Firstly, Kafka can serve as a reliable data pipeline, acting as either a source or a destination in the ETL process, and its message queue model ensures data ordering and reliability. Secondly, Kafka's high throughput and low latency let ETL jobs process and transfer large volumes of data more quickly. Moreover, Kafka integrates with other components and tools, such as real-time processing engines (like Apache Flink and Apache Spark Streaming), open-source ETL tools like Airbyte, and streaming databases like RisingWave. Through Kafka, ETL tools can implement real-time data processing, data streaming, and data exchange, helping organizations manage and analyze massive datasets more effectively and supporting real-time decision-making and insights. (A minimal integration sketch follows this list.)

  3. Open Dataset

     Combining open datasets with Kafka's real-time subscription and synchronization provides a variety of data sources, analysis models, and real-time analysis scenarios for blockchain data analysis systems. It also enables accurate data verification and promotes data sharing and openness. This combination yields a more comprehensive and accurate view of blockchain data, drives technological advancement, and accelerates problem-solving.
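
For instance, a downstream engine such as Spark Structured Streaming can subscribe to a topic directly through its Kafka connector. The sketch below is a minimal PySpark example under assumed broker and topic names (it requires the spark-sql-kafka connector package); Flink, Airbyte, or RisingWave would be wired up similarly through their own Kafka sources.

```python
# Sketch: reading a Kafka topic from Spark Structured Streaming (PySpark).
# Requires the spark-sql-kafka connector on the classpath; broker and topic
# names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("chainbase-kafka-demo").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
    .option("subscribe", "ethereum.transactions")          # hypothetical topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string so it can be
# parsed as JSON and aggregated downstream.
decoded = stream.select(col("value").cast("string").alias("json"))

query = decoded.writeStream.format("console").start()
query.awaitTermination()
```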

About Chainbase

Chainbase is an open Web3 data infrastructure for accessing, organizing, and analyzing on-chain data at scale. It helps people make better use of on-chain data by combining rich datasets with open computing technologies in one data platform. Chainbase's ultimate goal is to increase data freedom in crypto.

More than 5,000 developers actively use our platform as their data backend, issuing over 200Mn data requests per day and integrating Chainbase into their main workflows. Additionally, we work with ~10 top-tier public chains as a first-tier validator, managing over US $500Mn in tokens non-custodially as a validator provider. Find out more at: chainbase.com

Want to learn more about Chainbase?

Sign up for a free account and check out our documentation.

Website · Blog · Twitter · Discord · Link3

The Original Link: https://chainbase.com/blog/article/the-cloud-native-kafka-boosting-data-synchronization-reliability-and-consistency

