[CARBONDATA-4296]: schema evolution, enforcement and deduplication utilities added
author pratyakshsharma <pratyaksh13@gmail.com>
Wed, 27 Oct 2021 08:24:37 +0000 (13:54 +0530)
committer kunal642 <kunalkapoor642@gmail.com>
Mon, 15 Nov 2021 13:25:18 +0000 (18:55 +0530)
commit 3be05d2a44d805cf763df05cbeacce2d90a44da0
tree 88a6c3281176648235518f14b28c5f1268101e42
parent 07b41a5382f554646f231e192cf39c8f28302a05

Why is this PR needed?
This PR adds schema enforcement, schema evolution and deduplication capabilities
specifically for the carbondata streamer tool. The existing IUD scenarios need some
further work before they are handled completely, for example -
1. Passing default values and storing them in table properties.

Changes proposed for phase 2 -
1. Handling delete use cases within the upsert operation/command itself. Right now an
update is treated as delete + insert. With the new streamer tool, a user may set upsert
as the operation type while the incoming stream contains delete records as well.

What changes were proposed in this PR?
Configs and utility methods are added for the following use cases -
1. Schema enforcement
2. Schema evolution - add column, delete column and data type change scenarios
3. Deduplicate the incoming dataset against itself. This is useful when the incoming
stream of data has multiple updates for the same record and we want to pick the latest.
4. Deduplicate the incoming dataset against the existing target dataset. This is useful
when the operation type is set to INSERT and the user does not want to insert duplicate
records.
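
The four use cases above can be sketched in plain Python. This is only an illustration
of the ideas, not the code added in this PR: every function name, the dict-based record
and schema representation, and the use of ValueError in place of CarbonSchemaException
are assumptions made for the example. It also assumes that enforcement means rejecting
any schema difference, and that "latest" is decided by a user-chosen ordering column.

```python
from typing import Dict, Iterable, List, Set

Record = Dict[str, object]
Schema = Dict[str, str]  # column name -> data type name (hypothetical representation)


def classify_schema_changes(target: Schema, incoming: Schema) -> Dict[str, List[str]]:
    """Report the three evolution cases: added, deleted and type-changed columns."""
    return {
        "added": sorted(c for c in incoming if c not in target),
        "deleted": sorted(c for c in target if c not in incoming),
        "type_changed": sorted(
            c for c in incoming if c in target and incoming[c] != target[c]
        ),
    }


def enforce_schema(target: Schema, incoming: Schema) -> None:
    """Assumed enforcement rule: fail if the incoming schema differs at all."""
    changes = classify_schema_changes(target, incoming)
    if any(changes.values()):
        # ValueError stands in for a dedicated schema exception.
        raise ValueError(f"schema mismatch: {changes}")


def dedup_latest(records: Iterable[Record], key: str, ordering: str) -> List[Record]:
    """Deduplicate a batch against itself, keeping the latest record per key."""
    latest: Dict[object, Record] = {}
    for rec in records:
        seen = latest.get(rec[key])
        # On ties, the record that appears later in the batch wins.
        if seen is None or rec[ordering] >= seen[ordering]:
            latest[rec[key]] = rec
    return list(latest.values())


def drop_existing(records: Iterable[Record], existing_keys: Set[object], key: str) -> List[Record]:
    """Deduplicate a batch against the target: drop records whose key already exists."""
    return [rec for rec in records if rec[key] not in existing_keys]


if __name__ == "__main__":
    batch = [
        {"id": 1, "ts": 10, "value": "a"},
        {"id": 1, "ts": 20, "value": "b"},
        {"id": 2, "ts": 5, "value": "c"},
    ]
    unique = dedup_latest(batch, key="id", ordering="ts")
    print(unique)
    print(drop_existing(unique, existing_keys={2}, key="id"))
    print(classify_schema_changes({"id": "int", "age": "int"}, {"id": "long", "city": "string"}))
```

In the real tool these steps would operate on Spark datasets rather than Python lists;
the sketch only shows the order of decisions: compare schemas first, then deduplicate
within the batch, then (for INSERT) filter out keys already present in the target.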

This closes #4227
common/src/main/java/org/apache/carbondata/common/exceptions/sql/CarbonSchemaException.java [new file with mode: 0644]
core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetCommand.scala
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/merge/CarbonMergeDataSetUtil.scala
integration/spark/src/main/scala/org/apache/spark/sql/execution/strategy/DDLHelper.scala
integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala