hudi.git
5 months ago[HUDI-3451] Delete metadata table when the write client disables MDT (#5186)
YueZhang [Sat, 2 Apr 2022 11:01:06 +0000 (19:01 +0800)] 
[HUDI-3451] Delete metadata table when the write client disables MDT (#5186)

* Add checks for metadata table init to avoid possible out-of-sync

* Revise the logic to reuse existing table config

* Revise docs and naming

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
5 months ago[HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check...
Y Ethan Guo [Sat, 2 Apr 2022 03:17:02 +0000 (20:17 -0700)] 
[HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check (#5204)

5 months ago[HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209)
Y Ethan Guo [Sat, 2 Apr 2022 03:14:07 +0000 (20:14 -0700)] 
[HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209)

5 months ago[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark...
xiarixiaoyao [Fri, 1 Apr 2022 20:20:24 +0000 (04:20 +0800)] 
[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)

* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManger and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since some compilation fails

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly, instead of using metaclient, which is time-consuming
address some comments

* fixed all comments

* fix new comments

* rebase code, fix failed UT

* fixed mistake

* rebase code, fixed new comments

* rebase code, and prepare to address new comments

* address commits

* address new comments

* fix new issues

* control fallback to the original write logic

5 months ago[HUDI-3468][RFC-49] Support sync with DataHub (#5022)
Raymond Xu [Fri, 1 Apr 2022 19:27:01 +0000 (12:27 -0700)] 
[HUDI-3468][RFC-49] Support sync with DataHub (#5022)

5 months ago[HUDI-3225] [RFC-45] for async metadata indexing (#4640)
Sagar Sumit [Fri, 1 Apr 2022 18:49:23 +0000 (00:19 +0530)] 
[HUDI-3225] [RFC-45] for async metadata indexing (#4640)

* Add RFC for async metadata indexing

Add more details

* Add changes since last discussion

* Add another race condition handling

* Update rfc

6 months ago[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194)
Sivabalan Narayanan [Fri, 1 Apr 2022 15:27:40 +0000 (08:27 -0700)] 
[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194)

6 months ago[HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200)
董可伦 [Fri, 1 Apr 2022 13:17:49 +0000 (21:17 +0800)] 
[HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200)

6 months ago[HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880)
Danny Chan [Fri, 1 Apr 2022 12:46:51 +0000 (20:46 +0800)] 
[HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880)

6 months ago[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957)
ForwardXu [Fri, 1 Apr 2022 02:01:41 +0000 (10:01 +0800)] 
[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957)

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* fix comments

* fix comments

* fix comments

6 months ago[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)
Sagar Sumit [Fri, 1 Apr 2022 01:29:17 +0000 (06:59 +0530)] 
[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)

In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer.

- Add a new API in HoodieTableMetadataWriter
- Currently only supported for Spark metadata writer

6 months ago[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
Sagar Sumit [Thu, 31 Mar 2022 20:03:12 +0000 (01:33 +0530)] 
[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)

- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
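
The three writer APIs listed above (scheduleIndex, index, dropIndex) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Hudi code: the class `ToyMetadataWriter`, its fields, and the zero-padded string instants are all hypothetical, and the real state transitions are the ones described in the RFC.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetadataWriter:
    """Hypothetical in-memory stand-in for the three index APIs; not Hudi code."""

    # Completed data-table instants, as zero-padded strings so they sort by time.
    completed_instants: list = field(default_factory=list)
    # partition -> instant the index plan was based on (a "requested" INDEX action).
    pending_index: dict = field(default_factory=dict)
    # Stand-ins for the two table configs tracking inflight and completed indexes.
    inflight_partitions: set = field(default_factory=set)
    completed_partitions: set = field(default_factory=set)
    # partition -> last instant covered by the index.
    indexed_upto: dict = field(default_factory=dict)

    def schedule_index(self, partition):
        # Plan against the latest completed instant and record the request.
        if not self.completed_instants:
            raise ValueError("no completed instant to base the index plan on")
        base = self.completed_instants[-1]
        self.pending_index[partition] = base
        self.inflight_partitions.add(partition)
        return base

    def index(self, partition):
        # Execute the plan (raises KeyError if nothing was scheduled), then
        # catch up on writes that completed after indexing was requested.
        base = self.pending_index.pop(partition)
        caught_up = [i for i in self.completed_instants if i > base]
        self.indexed_upto[partition] = caught_up[-1] if caught_up else base
        self.inflight_partitions.discard(partition)
        self.completed_partitions.add(partition)
        return caught_up

    def drop_index(self, partition):
        # Remove the metadata partition and forget any state tied to it.
        self.pending_index.pop(partition, None)
        self.indexed_upto.pop(partition, None)
        self.inflight_partitions.discard(partition)
        self.completed_partitions.discard(partition)
```

In this model, a commit that lands between `schedule_index` and `index` is returned by `index` as part of the catch-up set, which mirrors the "writes that happened after indexing was requested" wording above.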

6 months ago[HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187)
liuhe0702 [Thu, 31 Mar 2022 19:48:47 +0000 (03:48 +0800)] 
[HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187)

6 months ago[HUDI-3020] Utility to create manifest file (#5153)
codejoyan [Thu, 31 Mar 2022 14:22:03 +0000 (19:52 +0530)] 
[HUDI-3020] Utility to create manifest file (#5153)

Co-authored-by: joyan <joyan.sil@walmart.com>
6 months ago[HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet...
xiarixiaoyao [Thu, 31 Mar 2022 12:09:26 +0000 (20:09 +0800)]
[HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file (#5168)

* [MINOR][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file

* address comments

* add perf result

6 months ago[HUDI-3732] Fixing rollback validation (#5157)
Sivabalan Narayanan [Thu, 31 Mar 2022 11:55:24 +0000 (04:55 -0700)] 
[HUDI-3732] Fixing rollback validation (#5157)

* Fixing rollback validation

* Adding tests

6 months ago[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)
ForwardXu [Thu, 31 Mar 2022 07:35:39 +0000 (15:35 +0800)] 
[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)

Currently, delete partitions ensures that all file groups are deleted, but the partition itself is not. As a result, listing all partitions may still return the deleted partitions, although no data is served from them since all their file groups are gone. This patch fixes that by letting the cleaner delete a partition once all file groups pertaining to it have been deleted.

- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- When applying clean metadata, the metadata table checks for partitions to be deleted and updates the "all_partitions" record for the deleted partitions.

Co-authored-by: sivabalan <n.siva.b@gmail.com>
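
The clean-planner rule described in HUDI-3135 above (a partition is slated for deletion once it has no valid file groups) can be sketched as follows. The function names and the dict-of-lists input are assumptions made for illustration; they do not correspond to the actual CleanPlanActionExecutor or HoodieCleanPlan structures.

```python
def partitions_to_delete(file_groups_by_partition):
    """Return partitions that have no valid file groups left (sorted for stable output)."""
    return sorted(
        partition
        for partition, file_groups in file_groups_by_partition.items()
        if not file_groups
    )


def apply_clean(all_partitions, deleted_partitions):
    """Drop deleted partitions from a tracked partition list, keeping the original order."""
    gone = set(deleted_partitions)
    return [p for p in all_partitions if p not in gone]
```

The second helper loosely mirrors the last bullet: once the clean metadata names the deleted partitions, the tracked list of all partitions is updated so they are no longer returned.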
6 months ago[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158)
Sivabalan Narayanan [Thu, 31 Mar 2022 07:30:49 +0000 (00:30 -0700)] 
[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158)

* Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
6 months ago[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110)
Yuwei XIAO [Thu, 31 Mar 2022 06:24:59 +0000 (14:24 +0800)] 
[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110)

6 months ago[HUDI-3713] Guarding archival for multi-writer (#5138)
Sivabalan Narayanan [Thu, 31 Mar 2022 05:44:31 +0000 (22:44 -0700)] 
[HUDI-3713] Guarding archival for multi-writer (#5138)

6 months ago[MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184)
Y Ethan Guo [Thu, 31 Mar 2022 04:48:54 +0000 (21:48 -0700)] 
[MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184)

6 months ago[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173)
YueZhang [Thu, 31 Mar 2022 03:26:37 +0000 (11:26 +0800)] 
[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
6 months ago[HUDI-3750] Fix NPE when build HoodieFileIndex (#5134)
KnightChess [Thu, 31 Mar 2022 02:19:05 +0000 (10:19 +0800)] 
[HUDI-3750] Fix NPE when build HoodieFileIndex (#5134)

Co-authored-by: wulingqi <wulingqi@baijiahulian.com>
6 months ago[MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithD...
Sivabalan Narayanan [Thu, 31 Mar 2022 02:07:22 +0000 (19:07 -0700)] 
[MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit (#5183)

6 months ago[HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules...
Y Ethan Guo [Thu, 31 Mar 2022 01:08:35 +0000 (18:08 -0700)] 
[HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules (#5176)

6 months ago[HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171)
Y Ethan Guo [Thu, 31 Mar 2022 00:35:56 +0000 (17:35 -0700)] 
[HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171)

6 months ago[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers...
xiarixiaoyao [Wed, 30 Mar 2022 22:54:25 +0000 (06:54 +0800)] 
[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962)

6 months ago[HUDI-3736] Fix null pointer when key not specified (#5167)
Nicolas Paris [Wed, 30 Mar 2022 22:11:26 +0000 (00:11 +0200)] 
[HUDI-3736] Fix null pointer when key not specified (#5167)

6 months ago[HUDI-3536] Add hudi-datahub-sync implementation (#5155)
Raymond Xu [Wed, 30 Mar 2022 21:38:02 +0000 (14:38 -0700)] 
[HUDI-3536] Add hudi-datahub-sync implementation (#5155)

6 months ago[MINOR] Repeated execution of update status (#5089)
Bo Cui [Wed, 30 Mar 2022 21:30:06 +0000 (05:30 +0800)] 
[MINOR] Repeated execution of update status (#5089)

6 months ago[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path...
YueZhang [Wed, 30 Mar 2022 21:23:37 +0000 (05:23 +0800)] 
[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
6 months ago[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152)
YueZhang [Wed, 30 Mar 2022 21:18:08 +0000 (05:18 +0800)] 
[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
6 months ago[HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062)
Alexey Kudinkin [Wed, 30 Mar 2022 17:01:43 +0000 (10:01 -0700)] 
[HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062)

6 months ago[MINOR] Fix dates as per UTC in TestDataSkippingUtils (#5166)
Sagar Sumit [Wed, 30 Mar 2022 14:33:14 +0000 (20:03 +0530)] 
[MINOR] Fix dates as per UTC in TestDataSkippingUtils (#5166)

* Fix timezone in test

6 months ago[minor] Follow 3178, fix the flink metadata table compaction (#5175)
Danny Chan [Wed, 30 Mar 2022 12:45:29 +0000 (20:45 +0800)] 
[minor] Follow 3178, fix the flink metadata table compaction (#5175)

6 months ago[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170)
harshal [Wed, 30 Mar 2022 05:34:49 +0000 (11:04 +0530)] 
[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170)

6 months ago[HUDI-3485] Adding scheduler pool configs for async clustering (#5043)
Sivabalan Narayanan [Wed, 30 Mar 2022 01:27:45 +0000 (18:27 -0700)] 
[HUDI-3485] Adding scheduler pool configs for async clustering (#5043)

6 months ago[HUDI-3741] Fix flink bucket index bulk insert generates too many small files (#5164)
Danny Chan [Wed, 30 Mar 2022 00:18:36 +0000 (08:18 +0800)] 
[HUDI-3741] Fix flink bucket index bulk insert generates too many small files (#5164)

6 months ago[HUDI-2520] Fix CTAS statement issue when sync to hive (#5145)
ForwardXu [Tue, 29 Mar 2022 19:25:31 +0000 (03:25 +0800)]
[HUDI-2520] Fix CTAS statement issue when sync to hive (#5145)

6 months ago[HUDI-3549] Removing dependency on "spark-avro" (#4955)
Alexey Kudinkin [Tue, 29 Mar 2022 18:44:47 +0000 (11:44 -0700)] 
[HUDI-3549] Removing dependency on "spark-avro" (#4955)

Hudi commits to keeping its bundles compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc.).

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will copy over the classes Hudi depends on and maintain them alongside the Hudi code-base, so that we can provide the aforementioned guarantee. To work around arising compatibility issues we will apply local patches that keep Hudi bundles compatible w/in the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

"hudi-spark3" -> 3.2.x
"hudi-spark3.1.x" -> 3.1.x
"hudi-spark2" -> 2.4.x

The following class hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with the respective minor version branches:

AvroSerializer
AvroDeserializer
AvroUtils

Each of these classes has been copied from Spark 3.2.1 (for the 3.2.x branch), 3.1.2 (for the 3.1.x branch), and 2.4.4 (for the 2.4.x branch) into its respective module.

The SchemaConverters class, in turn, is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen, ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
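
The module-to-branch mapping listed in HUDI-3549 above amounts to a lookup keyed on the Spark minor version, with every patch release of a minor branch resolving to the same module. The helper below is a hypothetical sketch of that rule; `SPARK_MODULES` and `module_for` are illustrative names and are not part of the Hudi build.

```python
# Mapping copied from the commit message: Spark minor branch -> Hudi module.
SPARK_MODULES = {
    "3.2": "hudi-spark3",
    "3.1": "hudi-spark3.1.x",
    "2.4": "hudi-spark2",
}


def module_for(spark_version):
    """Resolve a full Spark version such as '3.2.1' to the module for its minor branch."""
    minor = ".".join(spark_version.split(".")[:2])
    if minor not in SPARK_MODULES:
        raise ValueError(f"no module listed for Spark {minor}.x")
    return SPARK_MODULES[minor]
```

For example, `module_for("3.2.0")` and `module_for("3.2.1")` both resolve to "hudi-spark3", which is the compatibility promise the message describes for patch versions.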

6 months ago[HUDI-2520] Fix drop partition issue when sync to hive (#5147)
ForwardXu [Tue, 29 Mar 2022 18:28:19 +0000 (02:28 +0800)] 
[HUDI-2520] Fix drop partition issue when sync to hive (#5147)

6 months ago[HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName...
Alexey Kudinkin [Tue, 29 Mar 2022 15:39:56 +0000 (08:39 -0700)] 
[HUDI-3731] Fixing Column Stats Index record Merging sequence missing `columnName` (#5159)

* Added `DataSkippingFailureMode` to control how data skipping handles failures in the flow (either "strict", where an exception is thrown, or "fallback", where it falls back to a full scan)

* Make sure tests execute in `DataSkippingFailureMode.Strict`

* Fixed Column Stats Index record merging sequence missing `columnName`
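
The two failure modes named in HUDI-3731 above can be sketched as a small wrapper around an index lookup. This is an illustration only: `FailureMode`, `candidate_files`, and the `lookup_index` callable are hypothetical names and do not reflect the actual `DataSkippingFailureMode` implementation.

```python
from enum import Enum


class FailureMode(Enum):
    STRICT = "strict"      # propagate the failure to the caller
    FALLBACK = "fallback"  # ignore the failure and scan every file


def candidate_files(all_files, lookup_index, mode):
    """Prune files via the index; on failure either re-raise or return all files."""
    try:
        return lookup_index(all_files)
    except Exception:
        if mode is FailureMode.STRICT:
            raise
        return list(all_files)
```

Running tests in the strict mode, as the second bullet says, makes index failures visible instead of being silently masked by the full-scan path.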

6 months ago[MINOR] Move Experimental to javadoc (#5161)
Raymond Xu [Tue, 29 Mar 2022 04:07:59 +0000 (21:07 -0700)]
[MINOR] Move Experimental to javadoc (#5161)

6 months ago[HUDI-3736] Fix default dynamodblock url default value (#4967)
Nicolas Paris [Tue, 29 Mar 2022 03:31:46 +0000 (05:31 +0200)] 
[HUDI-3736] Fix default dynamodblock url default value (#4967)

6 months ago[HUDI-2520] Fix drop table issue when sync to Hive (#5143)
leesf [Tue, 29 Mar 2022 02:34:12 +0000 (10:34 +0800)] 
[HUDI-2520] Fix drop table issue when sync to Hive (#5143)

6 months ago[HUDI-3728] Set the sort operator parallelism for flink bucket bulk insert (#5154)
Danny Chan [Tue, 29 Mar 2022 01:52:35 +0000 (09:52 +0800)] 
[HUDI-3728] Set the sort operator parallelism for flink bucket bulk insert (#5154)

6 months ago[HUDI-3722] Fix truncate hudi table's error (#5140)
ForwardXu [Tue, 29 Mar 2022 01:44:18 +0000 (09:44 +0800)] 
[HUDI-3722] Fix truncate hudi table's error (#5140)

6 months ago[HUDI-2566] Adding multi-writer test support to integ test (#5065)
Sivabalan Narayanan [Mon, 28 Mar 2022 21:05:00 +0000 (14:05 -0700)] 
[HUDI-2566] Adding multi-writer test support to integ test (#5065)

6 months ago[HUDI-2757] Implement Hudi AWS Glue sync (#5076)
Raymond Xu [Mon, 28 Mar 2022 18:54:59 +0000 (11:54 -0700)] 
[HUDI-2757] Implement Hudi AWS Glue sync (#5076)

6 months ago[HUDI-3720] Fix the logic of reattempting pending rollback (#5148)
Y Ethan Guo [Mon, 28 Mar 2022 18:54:31 +0000 (11:54 -0700)] 
[HUDI-3720] Fix the logic of reattempting pending rollback (#5148)

6 months ago[HUDI-3539] Flink bucket index bucketID bootstrap optimization. (#5093)
Shawy Geng [Mon, 28 Mar 2022 11:50:36 +0000 (19:50 +0800)] 
[HUDI-3539] Flink bucket index bucketID bootstrap optimization. (#5093)

* [HUDI-3539] Flink bucket index bucketID bootstrap optimization.

Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>
6 months ago[HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark...
huberylee [Mon, 28 Mar 2022 06:11:35 +0000 (14:11 +0800)] 
[HUDI-3538] Support Compaction Command Based on Call Procedure Command for Spark SQL (#4945)

* Support Compaction Command Based on Call Procedure Command for Spark SQL

* Addressed review comments

6 months ago[MINOR] Fix call command parser use spark3.2 (#5144)
ForwardXu [Mon, 28 Mar 2022 03:13:44 +0000 (11:13 +0800)] 
[MINOR] Fix call command parser use spark3.2 (#5144)

6 months ago[HUDI-3724] Fixing closure of ParquetReader (#5141)
Sivabalan Narayanan [Mon, 28 Mar 2022 01:36:15 +0000 (18:36 -0700)] 
[HUDI-3724] Fixing closure of ParquetReader (#5141)

6 months ago[HUDI-3719] High performance costs of AvroSerializer in DataSource wr… (#5137)
xiarixiaoyao [Sun, 27 Mar 2022 18:01:43 +0000 (02:01 +0800)]
[HUDI-3719] High performance costs of AvroSerializer in DataSource wr… (#5137)

* [HUDI-3719] High performance costs of AvroSerializer in DataSource writing

* add benchmark framework adapted from Spark
add avroSerDerBenchmark

6 months ago[MINOR] Relaxing cleaner and archival configs (#5142)
Sivabalan Narayanan [Sun, 27 Mar 2022 16:26:24 +0000 (09:26 -0700)] 
[MINOR] Relaxing cleaner and archival configs (#5142)

6 months ago[HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114)
Y Ethan Guo [Sun, 27 Mar 2022 05:37:44 +0000 (22:37 -0700)] 
[HUDI-3604] Adjust the order of timeline changes in rollbacks (#5114)

6 months ago[HUDI-3716] OOM occurred when use bulk_insert cow table with flink BUCKET index ...
Danny Chan [Sun, 27 Mar 2022 01:13:58 +0000 (09:13 +0800)] 
[HUDI-3716] OOM occurred when use bulk_insert cow table with flink BUCKET index (#5135)

6 months ago[HUDI-3709] Fixing `ParquetWriter` impls not respecting Parquet Max File Size limit...
Alexey Kudinkin [Sat, 26 Mar 2022 21:51:36 +0000 (14:51 -0700)] 
[HUDI-3709] Fixing `ParquetWriter` impls not respecting Parquet Max File Size limit (#5129)

6 months ago[HUDI-3612] Clustering strategy should create new TypedProperties when modifying...
RexAn [Sat, 26 Mar 2022 10:46:03 +0000 (18:46 +0800)] 
[HUDI-3612] Clustering strategy should create new TypedProperties when modifying it (#5027)

6 months ago[HUDI-3435] Do not throw exception when instant to rollback does not exist in metadat...
Danny Chan [Sat, 26 Mar 2022 03:42:54 +0000 (11:42 +0800)] 
[HUDI-3435] Do not throw exception when instant to rollback does not exist in metadata table active timeline (#4821)

6 months ago[HUDI-3396] Refactoring `MergeOnReadRDD` to avoid duplication, fetch only projected...
Alexey Kudinkin [Fri, 25 Mar 2022 16:32:03 +0000 (09:32 -0700)] 
[HUDI-3396] Refactoring `MergeOnReadRDD` to avoid duplication, fetch only projected columns (#4888)

6 months ago[MINOR] fix QuickstartUtils move (#5133)
ForwardXu [Fri, 25 Mar 2022 14:34:35 +0000 (22:34 +0800)] 
[MINOR] fix QuickstartUtils move (#5133)

6 months ago[HUDI-3563] Make quickstart examples covered by CI tests (#5082)
ForwardXu [Fri, 25 Mar 2022 08:37:17 +0000 (16:37 +0800)] 
[HUDI-3563] Make quickstart examples covered by CI tests (#5082)

6 months ago[HUDI-3711] Fix typo in MaxwellJsonKafkaSourcePostProcessor.Config#PRECOMBINE_FIELD_T...
wangxianghu [Fri, 25 Mar 2022 07:02:54 +0000 (11:02 +0400)] 
[HUDI-3711] Fix typo in MaxwellJsonKafkaSourcePostProcessor.Config#PRECOMBINE_FIELD_TYPE_PROP (#5096)

6 months ago[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping...
Alexey Kudinkin [Fri, 25 Mar 2022 05:27:15 +0000 (22:27 -0700)] 
[HUDI-3594] Supporting Composite Expressions over Data Table Columns in Data Skipping flow (#4996)

6 months ago[HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true ...
Danny Chan [Fri, 25 Mar 2022 03:48:50 +0000 (11:48 +0800)] 
[HUDI-3678] Fix record rewrite of create handle when 'preserveMetadata' is true (#5088)

6 months ago[HUDI-3580] Claim RFC number 48 for LogCompaction action RFC (#5128)
Surya Prasanna [Fri, 25 Mar 2022 03:26:04 +0000 (20:26 -0700)] 
[HUDI-3580] Claim RFC number 48 for LogCompaction action RFC (#5128)

6 months ago[HUDI-3703] Reset taskID in restoreWriteMetadata (#5122)
Zhaojing Yu [Fri, 25 Mar 2022 02:18:28 +0000 (10:18 +0800)] 
[HUDI-3703] Reset taskID in restoreWriteMetadata (#5122)

6 months ago[HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Y Ethan Guo [Fri, 25 Mar 2022 02:04:53 +0000 (19:04 -0700)] 
[HUDI-1180] Upgrade HBase to 2.4.9 (#5004)

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
6 months ago[HUDI-3701] Flink bulk_insert support bucket hash index (#5118)
Danny Chan [Fri, 25 Mar 2022 01:01:42 +0000 (09:01 +0800)] 
[HUDI-3701] Flink bulk_insert support bucket hash index (#5118)

6 months ago[HUDI-3638] Make ZookeeperBasedLockProvider serializable (#5112)
Y Ethan Guo [Fri, 25 Mar 2022 00:59:47 +0000 (17:59 -0700)] 
[HUDI-3638] Make ZookeeperBasedLockProvider serializable (#5112)

6 months ago[HUDI-3624] Check all instants before starting a commit in metadata table (#5098)
Y Ethan Guo [Fri, 25 Mar 2022 00:13:58 +0000 (17:13 -0700)] 
[HUDI-3624] Check all instants before starting a commit in metadata table (#5098)

6 months ago[HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127)
Y Ethan Guo [Thu, 24 Mar 2022 23:42:44 +0000 (16:42 -0700)] 
[HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer (#5127)

6 months ago[HUDI-3689] Fix delta streamer tests (#5124)
Raymond Xu [Thu, 24 Mar 2022 21:19:53 +0000 (14:19 -0700)] 
[HUDI-3689] Fix delta streamer tests (#5124)

6 months ago[HUDI-3706] Downgrade maven surefire and failsafe version (#5123)
Y Ethan Guo [Thu, 24 Mar 2022 16:31:46 +0000 (09:31 -0700)] 
[HUDI-3706] Downgrade maven surefire and failsafe version (#5123)

6 months ago[HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120)
Raymond Xu [Thu, 24 Mar 2022 16:10:33 +0000 (09:10 -0700)] 
[HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer (#5120)

6 months ago[HUDI-3689] Remove Azure CI cache (#5121)
Raymond Xu [Thu, 24 Mar 2022 12:39:11 +0000 (05:39 -0700)] 
[HUDI-3689] Remove Azure CI cache (#5121)

6 months ago[HUDI-3684] Fixing NPE in `ParquetUtils` (#5102)
Alexey Kudinkin [Thu, 24 Mar 2022 12:07:38 +0000 (05:07 -0700)] 
[HUDI-3684] Fixing NPE in `ParquetUtils` (#5102)

* Make sure nulls are properly handled in `HoodieColumnRangeMetadata`

6 months ago[HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
Sagar Sumit [Thu, 24 Mar 2022 10:18:35 +0000 (15:48 +0530)] 
[HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)

* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config

for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
6 months ago[minor] Checks the data block type for archived timeline (#5106)
Danny Chan [Thu, 24 Mar 2022 06:10:43 +0000 (14:10 +0800)] 
[minor] Checks the data block type for archived timeline (#5106)

6 months agoFixing non partitioned all files record in MDT (#5108)
Sivabalan Narayanan [Thu, 24 Mar 2022 02:26:39 +0000 (19:26 -0700)] 
Fixing non partitioned all files record in MDT (#5108)

6 months ago[HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090)
Sagar Sumit [Wed, 23 Mar 2022 19:13:02 +0000 (00:43 +0530)] 
[HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090)

6 months ago[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize config...
Rajesh Mahindra [Tue, 22 Mar 2022 02:56:31 +0000 (19:56 -0700)] 
[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)

- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
6 months ago[HUDI-3640] Set SimpleKeyGenerator as default in 2to3 table upgrade for Spark engine...
Y Ethan Guo [Tue, 22 Mar 2022 00:35:06 +0000 (17:35 -0700)] 
[HUDI-3640] Set SimpleKeyGenerator as default in 2to3 table upgrade for Spark engine (#5075)

6 months ago[HUDI-1436]: Provide an option to trigger clean every nth commit (#4385)
Pratyaksh Sharma [Tue, 22 Mar 2022 00:06:30 +0000 (05:36 +0530)] 
[HUDI-1436]: Provide an option to trigger clean every nth commit (#4385)

- Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected.
Co-authored-by: sivabalan <n.siva.b@gmail.com>
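
The trigger rule in HUDI-1436 above (clean every nth commit, with a default of 1 so existing behavior is unchanged) reduces to a simple counter check. The function and parameter names below are hypothetical and are not the actual Hudi config keys.

```python
def should_trigger_clean(commits_since_last_clean, max_commits=1):
    """Return True once at least max_commits commits have happened since the last clean."""
    if max_commits < 1:
        raise ValueError("max_commits must be at least 1")
    return commits_since_last_clean >= max_commits
```

With the default of 1, every commit triggers a clean, which is why existing users are not affected by the new option.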
6 months ago[HUDI-3559] Flink bucket index with COW table throws NoSuchElementException
wxp4532 [Fri, 11 Mar 2022 06:07:52 +0000 (14:07 +0800)] 
[HUDI-3559] Flink bucket index with COW table throws NoSuchElementException

The method FlinkWriteHelper#deduplicateRecords does not guarantee the record sequence, but there is an
implicit constraint: all the records in one bucket should have the same bucket type (instant time here).
The BucketStreamWriteFunction breaks this constraint.

close apache/hudi#5018

6 months ago[MINOR] Fixing sparkUpdateNode for record generation (#5079)
Sivabalan Narayanan [Mon, 21 Mar 2022 04:56:30 +0000 (21:56 -0700)] 
[MINOR] Fixing sparkUpdateNode for record generation (#5079)

6 months ago[HUDI-3665] Support flink multiple versions (#5072)
Danny Chan [Mon, 21 Mar 2022 02:34:50 +0000 (10:34 +0800)] 
[HUDI-3665] Support flink multiple versions (#5072)

6 months ago[MINOR] Remove flaky assert in TestInLineFileSystem (#5069)
Y Ethan Guo [Sun, 20 Mar 2022 22:58:30 +0000 (15:58 -0700)] 
[MINOR] Remove flaky assert in TestInLineFileSystem (#5069)

6 months ago[HUDI-3663] Fixing Column Stats index to properly handle first Data Table commit...
Alexey Kudinkin [Sun, 20 Mar 2022 04:54:13 +0000 (21:54 -0700)] 
[HUDI-3663] Fixing Column Stats index to properly handle first Data Table commit (#5070)

* Fixed metadata conversion util to extract schema from `HoodieCommitMetadata`

* Fixed failure to fetch columns to index in empty table

* Abort indexing seq in case there are no columns to index

* Fallback to index at least primary key columns, in case no writer schema could be obtained to index all columns

* Fixed `getRecordFields` incorrectly ignoring default value

* Make sure Hudi metadata fields are also indexed

6 months ago[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877)
Alexey Kudinkin [Sat, 19 Mar 2022 05:32:16 +0000 (22:32 -0700)] 
[HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877)

Refactoring Spark DataSource Relations to avoid code duplication.

The following Relations were in scope:

- BaseFileOnlyViewRelation
- MergeOnReadSnapshotRelation
- MergeOnReadIncrementalRelation

6 months ago[HUDI-3659] Reducing the validation frequency with integ tests (#5067)
Sivabalan Narayanan [Fri, 18 Mar 2022 16:45:33 +0000 (09:45 -0700)] 
[HUDI-3659] Reducing the validation frequency with integ tests (#5067)

6 months ago[HUDI-3656] Adding medium sized dataset for clustering and minor fixes to integ tests...
Sivabalan Narayanan [Fri, 18 Mar 2022 16:44:56 +0000 (09:44 -0700)] 
[HUDI-3656] Adding medium sized dataset for clustering and minor fixes to integ tests (#5063)

6 months ago[HUDI-3598] Row Data to Hoodie Record Operator parallelism needs to always be consist...
JerryYue-M [Fri, 18 Mar 2022 02:47:29 +0000 (10:47 +0800)] 
[HUDI-3598] Row Data to Hoodie Record Operator parallelism needs to always be consistent with input operator (#5049)

for chaining purpose

Co-authored-by: jerryyue <jerryyue@didiglobal.com>
6 months ago[MINOR] HoodieFileScanRDD could print null path (#5056)
RexAn [Thu, 17 Mar 2022 19:53:45 +0000 (03:53 +0800)] 
[MINOR] HoodieFileScanRDD could print null path (#5056)

Co-authored-by: Rex An <bonean131@gmail.com>
6 months ago[HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors...
Raymond Xu [Thu, 17 Mar 2022 11:17:56 +0000 (19:17 +0800)] 
[HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856)

- Adopt HoodieData in Spark action commit executors
- Make DeleteHelper, WriteHelper, MergeHelper in hudi-client-common independent of Spark
- Make HoodieTable in WriteClient APIs have raw type to decouple it from the Client's generic types

6 months ago[HUDI-3645] Fix NPE caused by multiple threads accessing non-thread-safe HashMap...
冯健 [Thu, 17 Mar 2022 08:50:28 +0000 (16:50 +0800)] 
[HUDI-3645] Fix NPE caused by multiple threads accessing non-thread-safe HashMap (#5028)

- Change HashMap in HoodieROTablePathFilter to ConcurrentHashMap

6 months ago[HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974)
Y Ethan Guo [Thu, 17 Mar 2022 05:28:11 +0000 (22:28 -0700)] 
[HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974)

6 months ago[HUDI-3404] Automatically adjust write configs based on metadata table and write...
Y Ethan Guo [Thu, 17 Mar 2022 05:25:04 +0000 (22:25 -0700)] 
[HUDI-3404] Automatically adjust write configs based on metadata table and write concurrency mode (#4975)