hudi.git
5 months ago[MINOR] Inline the partition path logic into the builder (#5310)
Danny Chan [Wed, 13 Apr 2022 11:24:39 +0000 (19:24 +0800)] 
[MINOR] Inline the partition path logic into the builder (#5310)

5 months ago[HUDI-3868] Disable the sort input for flink streaming append mode (#5309)
Danny Chan [Wed, 13 Apr 2022 06:21:08 +0000 (14:21 +0800)] 
[HUDI-3868] Disable the sort input for flink streaming append mode (#5309)

5 months ago[HUDI-3867] Disable Data Skipping by default (#5306)
Alexey Kudinkin [Wed, 13 Apr 2022 05:51:12 +0000 (22:51 -0700)] 
[HUDI-3867] Disable Data Skipping by default (#5306)

5 months ago[HUDI-3855] Fixing `FILENAME_METADATA_FIELD` not being correctly updated in `HoodieMe...
Alexey Kudinkin [Wed, 13 Apr 2022 00:42:15 +0000 (17:42 -0700)] 
[HUDI-3855] Fixing `FILENAME_METADATA_FIELD` not being correctly updated in `HoodieMergeHandle` (#5296)

Fixing FILENAME_METADATA_FIELD not being correctly updated in HoodieMergeHandle in cases when the old record is carried over from the existing file as is.

- Revisited HoodieFileWriter API to accept HoodieKey instead of HoodieRecord
- Fixed FILENAME_METADATA_FIELD not being overridden in cases when the old record is simply carried over
- Exposing the standard JVM debugger ports in Docker setup

5 months ago[HUDI-3859] Fix spark profiles and utilities-slim dep (#5297)
Raymond Xu [Tue, 12 Apr 2022 22:33:08 +0000 (15:33 -0700)] 
[HUDI-3859] Fix spark profiles and utilities-slim dep (#5297)

5 months ago[HUDI-3838] Moved the getPartitionColumns logic to driver. (#5303)
Vinoth Govindarajan [Tue, 12 Apr 2022 22:03:00 +0000 (15:03 -0700)] 
[HUDI-3838] Moved the getPartitionColumns logic to driver. (#5303)

5 months ago[MINOR] Integ Test Reducing partitions for log running multi partition yaml (#5300)
satishm [Tue, 12 Apr 2022 16:15:17 +0000 (21:45 +0530)] 
[MINOR] Integ Test Reducing partitions for log running multi partition yaml (#5300)

5 months ago[HUDI-3843] Make flink profiles build with scala-2.11 (#5279)
Raymond Xu [Tue, 12 Apr 2022 15:33:48 +0000 (08:33 -0700)] 
[HUDI-3843] Make flink profiles build with scala-2.11 (#5279)

5 months ago[HUDI-3838] Implemented drop partition column feature for delta streamer code path...
Vinoth Govindarajan [Tue, 12 Apr 2022 12:40:30 +0000 (05:40 -0700)] 
[HUDI-3838] Implemented drop partition column feature for delta streamer code path (#5294)

* [HUDI-3838] Implemented drop partition column feature for delta streamer code path

* Ensure drop partition table config is updated in hoodie.props

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
5 months ago[HUDI-3839] Fixing incorrect selection of MT partitions to be updated (#5274)
Alexey Kudinkin [Tue, 12 Apr 2022 08:07:52 +0000 (01:07 -0700)] 
[HUDI-3839] Fixing incorrect selection of MT partitions to be updated (#5274)

* Fixing incorrect selection of MT partitions to be updated

* Ensure that metadata partitions table config is inherited correctly

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
5 months ago[HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261)
Sivabalan Narayanan [Tue, 12 Apr 2022 04:02:43 +0000 (21:02 -0700)] 
[HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261)

5 months ago[HUDI-3844] Update props in indexer based on table config (#5293)
Sagar Sumit [Mon, 11 Apr 2022 22:16:06 +0000 (03:46 +0530)] 
[HUDI-3844] Update props in indexer based on table config (#5293)

5 months ago[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution (#5275)
Alexey Kudinkin [Mon, 11 Apr 2022 19:45:53 +0000 (12:45 -0700)] 
[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution (#5275)

Currently, Data Skipping does not correctly handle the case where column stats are not aligned, for example when some of the (column, file) combinations are missing from the CSI.

This can occur in different scenarios (schema evolution, CSI config changes) and has to be handled properly when composing the CSI projection for Data Skipping. This PR addresses that.

- Added appropriate aligning for the transposed CSI projection

5 months ago[MINOR] fixing timeline server for integ tests (#5289)
Sivabalan Narayanan [Mon, 11 Apr 2022 14:14:51 +0000 (07:14 -0700)] 
[MINOR] fixing timeline server for integ tests (#5289)

5 months ago[HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle (#5250)
RexXiong [Mon, 11 Apr 2022 12:44:46 +0000 (20:44 +0800)] 
[HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle (#5250)

Co-authored-by: lvshuang.xjs <lvshuang.xjs@alibaba-inc.com>
5 months ago[HUDI-3798] Fixing ending of a transaction by different owner and removing some extra...
Sivabalan Narayanan [Mon, 11 Apr 2022 04:46:07 +0000 (21:46 -0700)] 
[HUDI-3798] Fixing ending of a transaction by different owner and removing some extraneous methods in trxn manager (#5255)

5 months ago[HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284)
Y Ethan Guo [Mon, 11 Apr 2022 00:59:29 +0000 (17:59 -0700)] 
[HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284)

5 months ago[HUDI-3842] Integ tests for non partitioned datasets (#5276)
Sivabalan Narayanan [Mon, 11 Apr 2022 00:09:48 +0000 (17:09 -0700)] 
[HUDI-3842] Integ tests for non partitioned datasets (#5276)

- Adding non-partitioned support to integ tests
- Fixing some of the test yamls and properties

5 months ago[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs...
Alexey Kudinkin [Sun, 10 Apr 2022 17:43:47 +0000 (10:43 -0700)] 
[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs (#5244)

Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ between the write and read paths. More details can be found in HUDI-3812.

- Fixing Data Skipping configuration to respect MT configs (on the Read path)
- Tightening up DS handling of cases when no top-level columns are in the target query
- Enhancing tests to cover all possible cases

5 months ago[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266)
Alexey Kudinkin [Sun, 10 Apr 2022 17:42:06 +0000 (10:42 -0700)] 
[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266)

Fixing performance hits in reading Column Stats Index:

[HUDI-3834] There is a substantial performance degradation in the Builder classes that Avro 1.10 generates by default: they rely on SpecificData.getForSchema, which loads the corresponding model's class using reflection and takes a hit when executed on the hot path (this brought the overall runtime to read the full Column Stats Index of 800k records to 60s, whereas it now takes a mere 3s).

Addressing memory churn caused by overused creation of Hadoop's Path: the Path ctor is not a lightweight sequence and produces quite a bit of memory churn, adding pressure on GC. Cleaning up such avoidable allocations to make sure there is no unnecessary added pressure on GC.

5 months ago[MINOR] Fix typos in the comments of HoodieMergeHandle (#5271)
董可伦 [Sun, 10 Apr 2022 00:51:58 +0000 (08:51 +0800)] 
[MINOR] Fix typos in the comments of HoodieMergeHandle (#5271)

5 months ago[HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex...
Y Ethan Guo [Sat, 9 Apr 2022 19:30:11 +0000 (12:30 -0700)] 
[HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex (#5268)

5 months ago[HUDI-3837] Fix license and rat check settings (#5273)
Raymond Xu [Sat, 9 Apr 2022 18:01:18 +0000 (11:01 -0700)] 
[HUDI-3837] Fix license and rat check settings (#5273)

- add missing licenses
- fix CI setting to run rat plugin
- fix deploy script to include integ test modules

5 months ago[HUDI-3825] Fixing Column Stats Index updating sequence (#5267)
Alexey Kudinkin [Sat, 9 Apr 2022 06:14:08 +0000 (23:14 -0700)] 
[HUDI-3825] Fixing Column Stats Index updating sequence (#5267)

5 months ago[MINOR] Update README of docker build setup (#5256)
Y Ethan Guo [Fri, 8 Apr 2022 23:12:25 +0000 (16:12 -0700)] 
[MINOR] Update README of docker build setup (#5256)

5 months ago[HUDI-3571] Spark datasource continuous checkpoint should have own fs variable (...
satishm [Fri, 8 Apr 2022 11:16:01 +0000 (16:46 +0530)] 
[HUDI-3571] Spark datasource continuous checkpoint should have own fs variable (#5265)

5 months ago[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259)
Alexey Kudinkin [Fri, 8 Apr 2022 10:28:31 +0000 (03:28 -0700)] 
[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259)

* Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record

* Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME`

* Cleaned up `HoodieBackedTableMetadataWriter`

* Make sure REPLACE_COMMITS are handled as well

5 months ago[HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname...
Danny Chan [Fri, 8 Apr 2022 06:33:56 +0000 (14:33 +0800)] 
[HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname (#5260)

5 months ago[HUDI-3781] fix spark delete sql can not delete record (#5215)
KnightChess [Fri, 8 Apr 2022 06:26:40 +0000 (14:26 +0800)] 
[HUDI-3781] fix spark delete sql can not delete record (#5215)

5 months ago[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252)
Sagar Sumit [Fri, 8 Apr 2022 04:29:36 +0000 (09:59 +0530)] 
[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252)

* Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition.

* If the list of log file paths in the split is empty, then fallback to usual behaviour.
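
The idea of deriving the partition from a log file's parent directory relative to the table base path can be sketched as follows. This is only an illustration using `java.nio.file`, not the Hadoop `Path` based `FSUtils` code the commit refers to; the class name, method name, and sample paths are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativePartitionPathSketch {

    // Hypothetical helper: returns the partition as the log file's parent
    // directory expressed relative to the table base path.
    public static String relativePartition(String basePath, String logFilePath) {
        Path base = Paths.get(basePath);
        Path parent = Paths.get(logFilePath).getParent();
        // Normalize separators so the result looks the same on every platform.
        return base.relativize(parent).toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // Sample paths are made up for illustration.
        System.out.println(
            relativePartition("/tables/trips", "/tables/trips/2022/04/08/.f1.log.1"));
    }
}
```

For a log file that sits directly under the base path (a non-partitioned layout), this sketch yields an empty string.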

5 months ago[HUDI-3823] Fix hudi-hive-sync-bundle to include HBase dependencies and shading ...
Y Ethan Guo [Fri, 8 Apr 2022 00:30:33 +0000 (17:30 -0700)] 
[HUDI-3823] Fix hudi-hive-sync-bundle to include HBase dependencies and shading (#5257)

5 months ago[HUDI-3810] Fixing lazy read for metadata log record readers (#5241)
Sivabalan Narayanan [Thu, 7 Apr 2022 22:40:51 +0000 (15:40 -0700)] 
[HUDI-3810] Fixing lazy read for metadata log record readers (#5241)

5 months ago[HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234)
Y Ethan Guo [Thu, 7 Apr 2022 20:03:03 +0000 (13:03 -0700)] 
[HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234)

5 months ago[HUDI-3571] Spark datasource continuous ingestion tool (#5156)
Sivabalan Narayanan [Thu, 7 Apr 2022 18:13:46 +0000 (11:13 -0700)] 
[HUDI-3571] Spark datasource continuous ingestion tool (#5156)

5 months ago[HUDI-3643] Fix hive count exception when the table is empty and the path depth is...
董可伦 [Thu, 7 Apr 2022 11:21:03 +0000 (19:21 +0800)] 
[HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051)

5 months ago[HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245)
Y Ethan Guo [Thu, 7 Apr 2022 10:02:34 +0000 (03:02 -0700)] 
[HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245)

5 months ago[HUDI-3096] fixed the bug that the cow table(contains decimalType) write by flink...
xiarixiaoyao [Thu, 7 Apr 2022 09:21:25 +0000 (17:21 +0800)] 
[HUDI-3096] fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark. (#4421)

5 months ago[HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236)
Danny Chan [Thu, 7 Apr 2022 07:17:39 +0000 (15:17 +0800)] 
[HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236)

5 months ago[HUDI-3739] Fix handling of the `isNotNull` predicate in Data Skipping (#5224)
Alexey Kudinkin [Wed, 6 Apr 2022 19:17:36 +0000 (12:17 -0700)] 
[HUDI-3739] Fix handling of the `isNotNull` predicate in Data Skipping (#5224)

- Fix handling of the isNotNull predicate in Data Skipping

5 months ago[HUDI-3340] Fix deploy_staging_jars command (#5243)
Raymond Xu [Wed, 6 Apr 2022 19:14:23 +0000 (12:14 -0700)] 
[HUDI-3340] Fix deploy_staging_jars command (#5243)

5 months ago[HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw...
rkkalluri [Wed, 6 Apr 2022 17:35:32 +0000 (12:35 -0500)] 
[HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205)

5 months ago[HUDI-3340] Fix deploy_staging_jars for different profiles (#5240)
Raymond Xu [Wed, 6 Apr 2022 16:42:11 +0000 (09:42 -0700)] 
[HUDI-3340] Fix deploy_staging_jars for different profiles (#5240)

5 months ago[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208)
Alexey Kudinkin [Wed, 6 Apr 2022 16:11:08 +0000 (09:11 -0700)] 
[HUDI-3760] Adding capability to fetch Metadata Records by prefix  (#5208)

- Adding capability to fetch Metadata Records by key prefix so that Data Skipping can fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader: a few code paths use the cached scanner if available, while other code paths use their own HFileScanner w/ positional read.

Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookup t/h LogRecordScanner impls
- Cleaning up HoodieHFileReader impl
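
As a rough illustration of why a key-prefix lookup avoids reading the whole index, the sketch below runs a prefix range query over a sorted in-memory map. It is not the HFile-based implementation from this change; the class name, the `column/file` key layout, and the sample values are assumptions made up for the example.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class KeyPrefixLookupSketch {

    // Returns the entries whose key starts with the prefix by seeking into
    // the sorted key space instead of scanning every record.
    // Assumes keys never contain the character U+FFFF.
    public static SortedMap<String, String> byPrefix(
            NavigableMap<String, String> index, String prefix) {
        // Exclusive upper bound that sorts after every key starting with the prefix.
        String upperBound = prefix + Character.MAX_VALUE;
        return index.subMap(prefix, true, upperBound, false);
    }

    public static void main(String[] args) {
        // Hypothetical key layout: "<column>/<file>".
        NavigableMap<String, String> index = new TreeMap<>();
        index.put("colA/file1", "stats-a1");
        index.put("colA/file2", "stats-a2");
        index.put("colB/file1", "stats-b1");
        System.out.println(byPrefix(index, "colA/").keySet());
    }
}
```

Because the keys are sorted, only the range that shares the prefix is visited, which is the property the commit relies on when Data Skipping asks for the stats of a few columns.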

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
5 months ago[MINOR] Fixing build failure when using flink-1.13 (#5214)
BruceLin [Wed, 6 Apr 2022 08:07:20 +0000 (16:07 +0800)] 
[MINOR] Fixing build failure when using flink-1.13 (#5214)

5 months ago[HUDI-3800] Fixed preserve commit metadata for compaction for untouched records ...
Sivabalan Narayanan [Wed, 6 Apr 2022 07:56:53 +0000 (00:56 -0700)] 
[HUDI-3800] Fixed preserve commit metadata for compaction for untouched records (#5232)

5 months agoMoving to 0.12.0-SNAPSHOT on master branch.
Raymond Xu [Wed, 6 Apr 2022 07:24:10 +0000 (15:24 +0800)] 
Moving to 0.12.0-SNAPSHOT on master branch.

5 months ago[HUDI-3723] Fixed stack overflows in Record Iterators (#5235)
Alexey Kudinkin [Wed, 6 Apr 2022 03:12:13 +0000 (20:12 -0700)] 
[HUDI-3723] Fixed stack overflows in Record Iterators (#5235)

5 months ago[HUDI-3782] Fixing table config when any of the index is disabled (#5222)
Sagar Sumit [Wed, 6 Apr 2022 03:06:52 +0000 (08:36 +0530)] 
[HUDI-3782] Fixing table config when any of the index is disabled (#5222)

5 months ago[HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)
Vinoth Govindarajan [Tue, 5 Apr 2022 15:58:13 +0000 (08:58 -0700)] 
[HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)

* dbt example models to demonstrate hudi dbt integration

* Fixed readme text

5 months ago[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop...
Yann Byron [Tue, 5 Apr 2022 08:31:41 +0000 (16:31 +0800)] 
[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201)

5 months ago[HUDI-3795] Fix hudi-examples checkstyle and maven enforcer error (#5221)
ForwardXu [Tue, 5 Apr 2022 08:10:11 +0000 (16:10 +0800)] 
[HUDI-3795] Fix hudi-examples checkstyle and maven enforcer error (#5221)

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
5 months ago[HUDI-3780] improve drop partitions (#5178)
ForwardXu [Tue, 5 Apr 2022 03:52:33 +0000 (11:52 +0800)] 
[HUDI-3780] improve drop partitions (#5178)

5 months ago[HUDI-3290] Different file formats for the partition metadata file. (#5179)
Prashant Wason [Mon, 4 Apr 2022 15:08:20 +0000 (08:08 -0700)] 
[HUDI-3290] Different file formats for the partition metadata file. (#5179)

* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()
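
A minimal sketch of how the new table config might be carried in a properties object is shown below. Only the key `hoodie.partition.metafile.use.data.format` and its default of false come from the message above; the class and method names are hypothetical and this is not Hudi's own config API.

```java
import java.util.Properties;

public class PartitionMetafileConfigSketch {

    // Config key quoted from the commit message.
    public static final String USE_DATA_FORMAT_KEY =
        "hoodie.partition.metafile.use.data.format";

    // Hypothetical helper that builds table properties with the flag set.
    public static Properties tableProps(boolean useDataFormat) {
        Properties props = new Properties();
        props.setProperty(USE_DATA_FORMAT_KEY, Boolean.toString(useDataFormat));
        return props;
    }

    // Reads the flag, falling back to false when it is absent
    // (the default stated in the commit message).
    public static boolean usesDataFormat(Properties props) {
        return Boolean.parseBoolean(props.getProperty(USE_DATA_FORMAT_KEY, "false"));
    }

    public static void main(String[] args) {
        System.out.println(usesDataFormat(tableProps(true)));
        System.out.println(usesDataFormat(new Properties()));
    }
}
```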

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
5 months ago[HUDI-3534] [RFC-34] Added the implementation details for the BigQuery integration...
Vinoth Govindarajan [Sun, 3 Apr 2022 10:53:25 +0000 (03:53 -0700)] 
[HUDI-3534] [RFC-34] Added the implementation details for the BigQuery integration (#4503)

5 months ago[MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217)
Y Ethan Guo [Sun, 3 Apr 2022 10:42:14 +0000 (03:42 -0700)] 
[MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217)

5 months ago[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207)
Sivabalan Narayanan [Sun, 3 Apr 2022 06:44:10 +0000 (23:44 -0700)] 
[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207)

5 months ago[HUDI-3664] Fixing Column Stats Index composition (#5181)
Alexey Kudinkin [Sun, 3 Apr 2022 00:15:52 +0000 (17:15 -0700)] 
[HUDI-3664] Fixing Column Stats Index composition  (#5181)

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
5 months ago[HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations...
Sagar Sumit [Sat, 2 Apr 2022 22:22:57 +0000 (03:52 +0530)] 
[HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213)

5 months ago[HUDI-3357] MVP implementation of BigQuerySyncTool (#5125)
Vinoth Govindarajan [Sat, 2 Apr 2022 20:18:06 +0000 (13:18 -0700)] 
[HUDI-3357] MVP implementation of BigQuerySyncTool (#5125)

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
5 months ago[HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216)
Y Ethan Guo [Sat, 2 Apr 2022 20:16:17 +0000 (13:16 -0700)] 
[HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216)

5 months ago[HUDI-3771] flink supports sync table information to aws glue (#5202)
todd5167 [Sat, 2 Apr 2022 13:16:10 +0000 (21:16 +0800)] 
[HUDI-3771] flink supports sync table information to aws glue (#5202)

5 months ago[HUDI-3451] Delete metadata table when the write client disables MDT (#5186)
YueZhang [Sat, 2 Apr 2022 11:01:06 +0000 (19:01 +0800)] 
[HUDI-3451] Delete metadata table when the write client disables MDT (#5186)

* Add checks for metadata table init to avoid possible out-of-sync

* Revise the logic to reuse existing table config

* Revise docs and naming

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
5 months ago[HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check...
Y Ethan Guo [Sat, 2 Apr 2022 03:17:02 +0000 (20:17 -0700)] 
[HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check (#5204)

5 months ago[HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209)
Y Ethan Guo [Sat, 2 Apr 2022 03:14:07 +0000 (20:14 -0700)] 
[HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209)

5 months ago[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark...
xiarixiaoyao [Fri, 1 Apr 2022 20:20:24 +0000 (04:20 +0800)] 
[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)

* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManager and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since some compiles fail

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly, instead of using metaclient, which is time-consuming; address some comments

* fixed all comments

* fix new comments

* rebase code,fix UT failed

* fixed mistake

* rebase code ,fixed new comments

* rebase code , and prepare for address new comments

* address commits

* address new comments

* fix new issues

* control fallback to the original write logic

5 months ago[HUDI-3468][RFC-49] Support sync with DataHub (#5022)
Raymond Xu [Fri, 1 Apr 2022 19:27:01 +0000 (12:27 -0700)] 
[HUDI-3468][RFC-49] Support sync with DataHub (#5022)

5 months ago[HUDI-3225] [RFC-45] for async metadata indexing (#4640)
Sagar Sumit [Fri, 1 Apr 2022 18:49:23 +0000 (00:19 +0530)] 
[HUDI-3225] [RFC-45] for async metadata indexing (#4640)

* Add RFC for async metadata indexing

Add more details

* Add changes since last discussion

* Add another race condition handling

* Update rfc

5 months ago[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194)
Sivabalan Narayanan [Fri, 1 Apr 2022 15:27:40 +0000 (08:27 -0700)] 
[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194)

5 months ago[HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200)
董可伦 [Fri, 1 Apr 2022 13:17:49 +0000 (21:17 +0800)] 
[HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200)

5 months ago[HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880)
Danny Chan [Fri, 1 Apr 2022 12:46:51 +0000 (20:46 +0800)] 
[HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880)

5 months ago[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957)
ForwardXu [Fri, 1 Apr 2022 02:01:41 +0000 (10:01 +0800)] 
[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957)

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* fix comments

* fix comments

* fix comments

5 months ago[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)
Sagar Sumit [Fri, 1 Apr 2022 01:29:17 +0000 (06:59 +0530)] 
[HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)

In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer.

- Add a new API in HoodieTableMetadataWriter
- Currently only supported for Spark metadata writer

5 months ago[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
Sagar Sumit [Thu, 31 Mar 2022 20:03:12 +0000 (01:33 +0530)] 
[HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)

- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up

5 months ago[HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187)
liuhe0702 [Thu, 31 Mar 2022 19:48:47 +0000 (03:48 +0800)] 
[HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187)

5 months ago[HUDI-3020] Utility to create manifest file (#5153)
codejoyan [Thu, 31 Mar 2022 14:22:03 +0000 (19:52 +0530)] 
[HUDI-3020] Utility to create manifest file (#5153)

Co-authored-by: joyan <joyan.sil@walmart.com>
5 months ago[HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet...
xiarixiaoyao [Thu, 31 Mar 2022 12:09:26 +0000 (20:09 +0800)]
[HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file (#5168)

* [MINOR][SPARK] fixed the perf regression by enabling vectorizeReader for parquet file

* address comments

* add perf result

5 months ago[HUDI-3732] Fixing rollback validation (#5157)
Sivabalan Narayanan [Thu, 31 Mar 2022 11:55:24 +0000 (04:55 -0700)] 
[HUDI-3732] Fixing rollback validation (#5157)

* Fixing rollback validation

* Adding tests

5 months ago[HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)
ForwardXu [Thu, 31 Mar 2022 07:35:39 +0000 (15:35 +0800)] 
[HUDI-3135] Make delete partitions lazy to be executed by the cleaner  (#4489)

As of now, delete partitions ensures that all file groups are deleted, but the partition itself is not deleted. So, getting all partitions might return the deleted partitions as well, although no data will be served since all file groups are deleted. With this patch, we are fixing it: we let the cleaner take care of deleting a partition when all file groups pertaining to that partition are deleted.

- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- Metadata table when applying clean metadata, will check for partitions to be deleted and will update the "all_partitions" record for the deleted partitions.
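
The planner rule described above (a partition with no valid file groups is scheduled for deletion) can be sketched roughly as follows. This is a simplified illustration rather than the CleanPlanActionExecutor code; the class name, method name, and input shape are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionsToDeleteSketch {

    // Hypothetical input: partition path -> ids of file groups that are still valid.
    // A partition whose list is empty has nothing left to serve, so the clean
    // plan would list it for deletion.
    public static List<String> partitionsToDelete(
            Map<String, List<String>> validFileGroupsByPartition) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : validFileGroupsByPartition.entrySet()) {
            if (entry.getValue().isEmpty()) {
                toDelete.add(entry.getKey());
            }
        }
        // Sort for a deterministic plan regardless of map iteration order.
        Collections.sort(toDelete);
        return toDelete;
    }

    public static void main(String[] args) {
        Map<String, List<String>> state = new HashMap<>();
        state.put("2022/03/30", new ArrayList<>());
        state.put("2022/03/31", List.of("fg-1", "fg-2"));
        System.out.println(partitionsToDelete(state));
    }
}
```

Keeping the decision in the planner means the executor only has to act on the list it is given, which matches the split between the clean plan and the clean action described in the bullets.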

Co-authored-by: sivabalan <n.siva.b@gmail.com>
5 months ago[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158)
Sivabalan Narayanan [Thu, 31 Mar 2022 07:30:49 +0000 (00:30 -0700)] 
[HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158)

* Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
5 months ago[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110)
Yuwei XIAO [Thu, 31 Mar 2022 06:24:59 +0000 (14:24 +0800)] 
[HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110)

5 months ago[HUDI-3713] Guarding archival for multi-writer (#5138)
Sivabalan Narayanan [Thu, 31 Mar 2022 05:44:31 +0000 (22:44 -0700)] 
[HUDI-3713] Guarding archival for multi-writer  (#5138)

5 months ago[MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184)
Y Ethan Guo [Thu, 31 Mar 2022 04:48:54 +0000 (21:48 -0700)] 
[MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184)

5 months ago[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173)
YueZhang [Thu, 31 Mar 2022 03:26:37 +0000 (11:26 +0800)] 
[HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
5 months ago[HUDI-3750] Fix NPE when build HoodieFileIndex (#5134)
KnightChess [Thu, 31 Mar 2022 02:19:05 +0000 (10:19 +0800)] 
[HUDI-3750] Fix NPE when build HoodieFileIndex (#5134)

Co-authored-by: wulingqi <wulingqi@baijiahulian.com>
5 months ago[MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithD...
Sivabalan Narayanan [Thu, 31 Mar 2022 02:07:22 +0000 (19:07 -0700)] 
[MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit (#5183)

5 months ago[HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules...
Y Ethan Guo [Thu, 31 Mar 2022 01:08:35 +0000 (18:08 -0700)] 
[HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules (#5176)

5 months ago[HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171)
Y Ethan Guo [Thu, 31 Mar 2022 00:35:56 +0000 (17:35 -0700)] 
[HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171)

5 months ago[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers...
xiarixiaoyao [Wed, 30 Mar 2022 22:54:25 +0000 (06:54 +0800)] 
[HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy (#4962)

5 months ago[HUDI-3736] Fix null pointer when key not specified (#5167)
Nicolas Paris [Wed, 30 Mar 2022 22:11:26 +0000 (00:11 +0200)] 
[HUDI-3736] Fix null pointer when key not specified (#5167)

5 months ago[HUDI-3536] Add hudi-datahub-sync implementation (#5155)
Raymond Xu [Wed, 30 Mar 2022 21:38:02 +0000 (14:38 -0700)] 
[HUDI-3536] Add hudi-datahub-sync implementation (#5155)

5 months ago[MINOR] Repeated execution of update status (#5089)
Bo Cui [Wed, 30 Mar 2022 21:30:06 +0000 (05:30 +0800)] 
[MINOR] Repeated execution of update status (#5089)

5 months ago[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path...
YueZhang [Wed, 30 Mar 2022 21:23:37 +0000 (05:23 +0800)] 
[HUDI-3635] Fix HoodieMetadataTableValidator around comparison of partition path listing (#5100)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
5 months ago[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152)
YueZhang [Wed, 30 Mar 2022 21:18:08 +0000 (05:18 +0800)] 
[HUDI-3647] HoodieMetadataTableValidator: check MDT was initialized at first (#5152)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
5 months ago[HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062)
Alexey Kudinkin [Wed, 30 Mar 2022 17:01:43 +0000 (10:01 -0700)] 
[HUDI-3653] Cleaning up bespoke Column Stats Index implementation (#5062)

5 months ago[MINOR] Fix dates as per UTC in TestDataSkippingUtils (#5166)
Sagar Sumit [Wed, 30 Mar 2022 14:33:14 +0000 (20:03 +0530)] 
[MINOR] Fix dates as per UTC in TestDataSkippingUtils (#5166)

* Fix timezone in test

5 months ago[minor] Follow 3178, fix the flink metadata table compaction (#5175)
Danny Chan [Wed, 30 Mar 2022 12:45:29 +0000 (20:45 +0800)] 
[minor] Follow 3178, fix the flink metadata table compaction (#5175)

5 months ago[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170)
harshal [Wed, 30 Mar 2022 05:34:49 +0000 (11:04 +0530)] 
[HUDI-3745] Support for spark datasource options in S3EventsHoodieIncrSource (#5170)

5 months ago[HUDI-3485] Adding scheduler pool configs for async clustering (#5043)
Sivabalan Narayanan [Wed, 30 Mar 2022 01:27:45 +0000 (18:27 -0700)] 
[HUDI-3485] Adding scheduler pool configs for async clustering (#5043)

5 months ago[HUDI-3741] Fix flink bucket index bulk insert generates too many small files (#5164)
Danny Chan [Wed, 30 Mar 2022 00:18:36 +0000 (08:18 +0800)] 
[HUDI-3741] Fix flink bucket index bulk insert generates too many small files (#5164)

5 months ago[HUDI-2520] Fix CTAS statement issue when sync to hive (#5145)
ForwardXu [Tue, 29 Mar 2022 19:25:31 +0000 (03:25 +0800)]
[HUDI-2520] Fix CTAS statement issue when sync to hive (#5145)