[CARBONDATA-4305] Support Carbondata Streamer tool for incremental fetch and merge...
author    akashrn5 <akashnilugal@gmail.com>
          Wed, 1 Sep 2021 13:54:24 +0000 (19:24 +0530)
committer kunal642 <kunalkapoor642@gmail.com>
          Fri, 26 Nov 2021 05:48:43 +0000 (11:18 +0530)
commit    18840af9c1f7154b58e3c397dfc5a4440674bcee
tree      176cc39b13b3d20c7521a6de61906dd2324464b6
parent    3be05d2a44d805cf763df05cbeacce2d90a44da0
[CARBONDATA-4305] Support Carbondata Streamer tool for incremental fetch and merge from kafka and DFS Sources

Why is this PR needed?
In the current Carbondata CDC solution, a user who wants to integrate with a streaming source
needs to write a separate Spark application to capture changes, which is an overhead. We should be able to
incrementally capture the data changes from primary databases and incrementally ingest
them into the data lake so that the overall latency decreases. The first part is handled by
log-based CDC systems like Maxwell and Debezium. This PR provides a solution for the second part using Apache Carbondata.

What changes were proposed in this PR?
The Carbondata Streamer tool is a Spark streaming application which enables users to incrementally ingest data
into their data lakes from various sources, such as Kafka (a standard pipeline would be MYSQL => Debezium => (Kafka + Schema Registry) => Carbondata Streamer tool)
and DFS. The tool comes with out-of-the-box support for most schema
evolution use cases. In this version only the add-column case is supported; drop column and
other schema changes are planned for upcoming releases. Please refer to the design document for
more details on the usage and working of the tool.
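As a toy illustration of the merge semantics such a CDC pipeline applies (this is not the tool's actual code; the event shape and names below are hypothetical), change events captured by a log-based CDC system like Debezium can be folded into a target table keyed by primary key, where inserts and updates become upserts and deletes remove the row:

```scala
// Toy sketch of incremental CDC merge semantics: a batch of change events
// (as they might arrive from Kafka) is folded into the current table state.
// ChangeEvent, Op, and merge are illustrative names, not the tool's API.
object CdcMergeSketch {
  sealed trait Op
  case object Insert extends Op
  case object Update extends Op
  case object Delete extends Op

  // One change event: primary key, new row value, and the operation type.
  final case class ChangeEvent(key: Int, value: String, op: Op)

  // Apply a batch of change events to the table, keyed by primary key.
  def merge(table: Map[Int, String], batch: Seq[ChangeEvent]): Map[Int, String] =
    batch.foldLeft(table) { (state, ev) =>
      ev.op match {
        case Insert | Update => state + (ev.key -> ev.value) // upsert the row
        case Delete          => state - ev.key               // drop the row
      }
    }
}

// Example: start from {1 -> "a"}, insert key 2, update key 1, delete key 2.
object CdcMergeSketchDemo extends App {
  import CdcMergeSketch._
  val result = merge(
    Map(1 -> "a"),
    Seq(ChangeEvent(2, "b", Insert), ChangeEvent(1, "a2", Update), ChangeEvent(2, "", Delete)))
  println(result) // the surviving state is Map(1 -> "a2")
}
```

In the real tool this fold happens per micro-batch over the streaming source, with the merge executed against the Carbondata table (see CarbonMergeDataSetCommand.scala in the file list below).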

This closes #4235
15 files changed:
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
integration/spark/pom.xml
integration/spark/src/main/scala/org/apache/carbondata/streamer/AvroDFSSource.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/AvroKafkaSource.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/CarbonDStream.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/CarbonDataStreamer.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/CarbonDataStreamerException.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/CarbonStreamerConfig.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/SchemaSource.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/Source.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/carbondata/streamer/SourceFactory.scala [new file with mode: 0644]
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala
pom.xml