hudi.git
6 months ago[HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile...
YueZhang [Fri, 25 Feb 2022 11:16:27 +0000 (19:16 +0800)] 
[HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() (#4810)

6 months ago[HUDI-3474] Add more document to Pipelines for the usage of this tool to build a...
Danny Chan [Fri, 25 Feb 2022 11:08:51 +0000 (19:08 +0800)] 
[HUDI-3474] Add more document to Pipelines for the usage of this tool to build a write pipeline (#4906)

6 months ago[HUDI-3401] fix NPE caused by incorrect beforeKeyGenClassName validation (#4774)
todd5167 [Fri, 25 Feb 2022 04:31:29 +0000 (12:31 +0800)] 
[HUDI-3401] fix NPE caused by incorrect  beforeKeyGenClassName validation (#4774)

6 months ago[HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering...
YueZhang [Fri, 25 Feb 2022 04:28:38 +0000 (12:28 +0800)] 
[HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering-cli Tests (#4817)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
6 months ago[HUDI-3493] Not table to get execution plan (#4894)
ForwardXu [Fri, 25 Feb 2022 01:04:44 +0000 (09:04 +0800)] 
[HUDI-3493] Not table to get execution plan (#4894)

6 months ago[HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
Alexey Kudinkin [Thu, 24 Feb 2022 21:23:13 +0000 (13:23 -0800)] 
[HUDI-1296] Support Metadata Table in Spark Datasource (#4789)

* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)

6 months ago[HUDI-3161] Add Call Produce Command for Spark SQL (#4535)
ForwardXu [Thu, 24 Feb 2022 15:45:37 +0000 (23:45 +0800)] 
[HUDI-3161] Add Call Produce Command for Spark SQL (#4535)

6 months ago[HUDI-3488] The flink small file list should exclude file slices with pending compact...
yanenze [Thu, 24 Feb 2022 06:45:03 +0000 (14:45 +0800)] 
[HUDI-3488] The flink small file list should exclude file slices with pending compaction (#4893)

# this happens when the async-compaction has been configured

Co-authored-by: yanenze <yanenze@keytop.com.cn>
6 months ago[HUDI-3480][HUDI-3481] Enchancements to integ test suite (#4884)
Sivabalan Narayanan [Wed, 23 Feb 2022 20:56:35 +0000 (15:56 -0500)] 
[HUDI-3480][HUDI-3481] Enchancements to integ test suite (#4884)

7 months ago[HUDI-3489] Unify config to avoid duplicate code (#4883)
leesf [Wed, 23 Feb 2022 13:14:30 +0000 (21:14 +0800)] 
[HUDI-3489] Unify config to avoid duplicate code (#4883)

7 months ago[HUDI-3486] Fix wrong field order for constructing HoodieMetadataColumnStats (#4875)
Y Ethan Guo [Wed, 23 Feb 2022 04:57:02 +0000 (20:57 -0800)] 
[HUDI-3486] Fix wrong field order for constructing HoodieMetadataColumnStats (#4875)

7 months ago[HUDI-3420] Remove duplicates type in HoodieClusteringGroup.avsc (#4808)
yuzhaojing [Wed, 23 Feb 2022 02:49:47 +0000 (10:49 +0800)] 
[HUDI-3420] Remove duplicates type in HoodieClusteringGroup.avsc (#4808)

Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
7 months agoAdd hive-standalone-metastore dependency to hudi-flink-bundle module (#4870)
从大数据到人工智能 [Wed, 23 Feb 2022 01:16:21 +0000 (09:16 +0800)] 
Add hive-standalone-metastore dependency to hudi-flink-bundle module (#4870)

7 months ago[MINOR] Fixing checkpoint management in S3IncrSource (#4871)
Sivabalan Narayanan [Tue, 22 Feb 2022 14:15:16 +0000 (09:15 -0500)] 
[MINOR] Fixing checkpoint management in S3IncrSource (#4871)

7 months ago[HUDI-3476] Remove the shade pattern for parquet for flink bundle jar (#4869)
Danny Chan [Tue, 22 Feb 2022 11:21:57 +0000 (19:21 +0800)] 
[HUDI-3476] Remove the shade pattern for parquet for flink bundle jar (#4869)

7 months ago[HUDI-3461] The archived timeline for flink streaming reader should not be reused...
Danny Chan [Tue, 22 Feb 2022 07:54:29 +0000 (15:54 +0800)] 
[HUDI-3461] The archived timeline for flink streaming reader should not be reused (#4861)

* Before the patch, the flink streaming reader cached the meta client and thus the archived timeline;
  when fetching the instant details from the reused timeline, an exception was thrown
* Add a method in HoodieTableMetaClient to return a fresh new archived timeline each time

7 months ago[HUDI-3464] Fix wrong exception thrown from HiveSchemaProvider (#4865)
wangxianghu [Tue, 22 Feb 2022 06:20:20 +0000 (10:20 +0400)] 
[HUDI-3464] Fix wrong exception thrown from HiveSchemaProvider (#4865)

7 months ago[HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787)
Sivabalan Narayanan [Tue, 22 Feb 2022 05:01:30 +0000 (00:01 -0500)] 
[HUDI-2189] Adding delete partitions support to DeltaStreamer (#4787)

7 months ago[MINOR] Fix typos and improve docs in HoodieMetadataConfig (#4867)
Y Ethan Guo [Tue, 22 Feb 2022 03:36:20 +0000 (19:36 -0800)] 
[MINOR] Fix typos and improve docs in HoodieMetadataConfig (#4867)

7 months ago[HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations...
Prashant Wason [Tue, 22 Feb 2022 02:53:03 +0000 (18:53 -0800)] 
[HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present using a config. (#4212)

Co-authored-by: sivabalan <n.siva.b@gmail.com>
7 months ago[HUDI-3423] upgrade spark to 3.2.1 (#4815)
Yann Byron [Tue, 22 Feb 2022 00:52:21 +0000 (08:52 +0800)] 
[HUDI-3423] upgrade spark to 3.2.1 (#4815)

7 months ago[HUDI-3042] Abstract Spark update Strategy to make code more clean and remove duplica...
RexAn [Mon, 21 Feb 2022 14:53:09 +0000 (22:53 +0800)] 
[HUDI-3042] Abstract Spark update Strategy to make code more clean and remove duplicates (#4845)

Co-authored-by: Hui An <hui.an@shopee.com>
7 months ago[HUDI-349]: Added new cleaning policy based on number of hours (#3646)
Pratyaksh Sharma [Mon, 21 Feb 2022 14:04:42 +0000 (19:34 +0530)] 
[HUDI-349]: Added new cleaning policy based on number of hours  (#3646)

7 months ago[HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850)
Sivabalan Narayanan [Mon, 21 Feb 2022 13:19:57 +0000 (08:19 -0500)] 
[HUDI-3455] Fixing checkpoint management in hoodie incr source (#4850)

7 months ago[HUDI-3432] Fixing restore with metadata enabled (#4849)
Sivabalan Narayanan [Mon, 21 Feb 2022 12:55:30 +0000 (07:55 -0500)] 
[HUDI-3432] Fixing restore with metadata enabled (#4849)

* Fixing restore with metadata enabled

* Fixing test failures

7 months ago[HUDI-2732][RFC-38] Spark Datasource V2 Integration (#3964)
leesf [Mon, 21 Feb 2022 12:14:07 +0000 (20:14 +0800)] 
[HUDI-2732][RFC-38] Spark Datasource V2 Integration (#3964)

7 months ago[HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)
YueZhang [Sun, 20 Feb 2022 20:31:31 +0000 (04:31 +0800)] 
[HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
7 months ago[HUDI-3458] Fix BulkInsertPartitioner generic type (#4854)
Raymond Xu [Sun, 20 Feb 2022 18:51:58 +0000 (10:51 -0800)] 
[HUDI-3458] Fix BulkInsertPartitioner generic type (#4854)

7 months ago[MINOR] Moving spark scheduling configs out of DataSourceOptions (#4843)
Sivabalan Narayanan [Sun, 20 Feb 2022 18:49:18 +0000 (13:49 -0500)] 
[MINOR] Moving spark scheduling configs out of DataSourceOptions (#4843)

7 months ago[HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)
Bo Cui [Sat, 19 Feb 2022 13:21:48 +0000 (21:21 +0800)] 
[HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)

* [HUDI-3446] Supports batch Reader in BootstrapOperator#loadRecords

7 months ago[HUDI-3389] fix ColumnarArrayData ClassCastException issue (#4842)
stayrascal [Sat, 19 Feb 2022 02:56:41 +0000 (10:56 +0800)] 
[HUDI-3389] fix ColumnarArrayData ClassCastException issue (#4842)

* [HUDI-3389] fix ColumnarArrayData ClassCastException issue

* [HUDI-3389] remove MapColumnVector.java, RowColumnVector.java, and add test case for array<int> field

7 months ago[HUDI-3438] Avoid getSmallFiles if hoodie.parquet.small.file.limit is 0 (#4823)
RexAn [Fri, 18 Feb 2022 13:57:04 +0000 (21:57 +0800)] 
[HUDI-3438] Avoid getSmallFiles if hoodie.parquet.small.file.limit is 0 (#4823)

Co-authored-by: Hui An <hui.an@shopee.com>
7 months ago[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure (#4824)
Y Ethan Guo [Fri, 18 Feb 2022 13:44:56 +0000 (05:44 -0800)] 
[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure (#4824)

7 months agoHoodieSortedMergeHandle#close write data disorder (#4841)
luokey [Fri, 18 Feb 2022 09:31:38 +0000 (17:31 +0800)] 
HoodieSortedMergeHandle#close write data disorder (#4841)

Co-authored-by: 854194341@qq.com <loukey_7821>
7 months ago[HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)
Sagar Sumit [Fri, 18 Feb 2022 04:47:06 +0000 (10:17 +0530)] 
[HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)

Fix dependency conflict

Fix repairs command

Implement putIfAbsent for DDB lock provider

Add upgrade step and validate while fetching configs

Validate checksum for latest table version only while fetching config

Move generateChecksum to BinaryUtil

Rebase and resolve conflict

Fix table version check

7 months ago[HUDI-3439] Remove the hive shade pattern for flink bundle jar (#4833)
Danny Chan [Thu, 17 Feb 2022 14:42:39 +0000 (22:42 +0800)] 
[HUDI-3439] Remove the hive shade pattern for flink bundle jar (#4833)

7 months ago[HUDI-3442]Duplicate code calls for 'FlinkOptions.flatOptions' (#4832)
zhangxiang17 [Thu, 17 Feb 2022 03:04:09 +0000 (11:04 +0800)] 
[HUDI-3442]Duplicate code calls for 'FlinkOptions.flatOptions' (#4832)

7 months ago[HUDI-3426] Sync datasource clustering config (#4828)
Sagar Sumit [Thu, 17 Feb 2022 00:02:49 +0000 (05:32 +0530)] 
[HUDI-3426] Sync datasource clustering config (#4828)

7 months ago[HUDI-3280] Cleaning up Hive-related hierarchies after refactoring (#4743)
Alexey Kudinkin [Wed, 16 Feb 2022 23:36:37 +0000 (15:36 -0800)] 
[HUDI-3280] Cleaning up Hive-related hierarchies after refactoring (#4743)

7 months ago[HUDI-3394] Check isWriteLockedByCurrentThread before unlock for InProcessLockProvide...
YueZhang [Wed, 16 Feb 2022 06:41:25 +0000 (14:41 +0800)] 
[HUDI-3394] Check isWriteLockedByCurrentThread before unlock for InProcessLockProvider (#4819)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
7 months ago[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792)
Y Ethan Guo [Tue, 15 Feb 2022 21:41:47 +0000 (13:41 -0800)] 
[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests (#4792)

7 months ago[HUDI-2931] Add config to disable table services (#4777)
Raymond Xu [Tue, 15 Feb 2022 14:49:53 +0000 (06:49 -0800)] 
[HUDI-2931] Add config to disable table services (#4777)

7 months agofix build & ci (#4822)
Yann Byron [Tue, 15 Feb 2022 11:40:40 +0000 (19:40 +0800)] 
fix build & ci (#4822)

7 months ago[HUDI-3204] fix problem that spark on TimestampKeyGenerator has no re… (#4714)
Yann Byron [Tue, 15 Feb 2022 04:38:38 +0000 (12:38 +0800)] 
[HUDI-3204] fix problem that spark on TimestampKeyGenerator has no re… (#4714)

7 months ago[HUDI-1576] Make archiving an async service (#4795)
Raymond Xu [Tue, 15 Feb 2022 02:15:06 +0000 (18:15 -0800)] 
[HUDI-1576] Make archiving an async service (#4795)

7 months ago[HUDI-3200] deprecate hoodie.file.index.enable and unify to use BaseFileOnlyViewRelat...
Yann Byron [Tue, 15 Feb 2022 01:38:01 +0000 (09:38 +0800)] 
[HUDI-3200] deprecate hoodie.file.index.enable and unify to use BaseFileOnlyViewRelation to handle (#4798)

7 months ago[HUDI-3398] Fix TableSchemaResolver for all file formats and metadata table (#4782)
YueZhang [Tue, 15 Feb 2022 00:02:47 +0000 (08:02 +0800)] 
[HUDI-3398] Fix TableSchemaResolver for all file formats and metadata table (#4782)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
7 months ago[HUDI-1657] Fix the build on aarch64, Fedora 33 (#4617)
Yuqi Gu [Mon, 14 Feb 2022 23:10:18 +0000 (07:10 +0800)] 
[HUDI-1657] Fix the build on aarch64, Fedora 33 (#4617)

7 months ago[MINOR] Prevent async service from starting twice (#4801)
Raymond Xu [Mon, 14 Feb 2022 19:06:31 +0000 (11:06 -0800)] 
[MINOR] Prevent async service from starting twice (#4801)

7 months ago[HUDI-3254] Introduce HoodieCatalog to manage tables for Spark Datasource V2 (#4611)
leesf [Mon, 14 Feb 2022 14:26:58 +0000 (22:26 +0800)] 
[HUDI-3254] Introduce HoodieCatalog to manage tables for Spark Datasource V2 (#4611)

7 months ago[HUDI-3417] Switch AbstractTableFileSystemView#filterBaseFileAfterPendingCompaction...
yuzhaojing [Mon, 14 Feb 2022 08:18:34 +0000 (16:18 +0800)] 
[HUDI-3417] Switch AbstractTableFileSystemView#filterBaseFileAfterPendingCompaction log level to debug (#4805)

Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
7 months ago[HUDI-3272] If `mode==ignore && tableExists`, do not execute write logic and sync...
董可伦 [Mon, 14 Feb 2022 03:52:00 +0000 (11:52 +0800)] 
[HUDI-3272] If `mode==ignore && tableExists`, do not execute write logic and sync hive (#4632)

7 months ago[HUDI-3412] TypedProperties no need to create new set when check key exist or not...
RexAn [Mon, 14 Feb 2022 03:33:29 +0000 (11:33 +0800)] 
[HUDI-3412] TypedProperties no need to create new set when check key exist or not (#4791)

Co-authored-by: Hui An <hui.an@shopee.com>
7 months ago[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR...
YueZhang [Mon, 14 Feb 2022 03:12:52 +0000 (11:12 +0800)] 
[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction (#4753)

* use HoodieCommitMetadata to replace writeStatuses computation

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
7 months ago[HUDI-2413] fix Sql source's checkpoint issue (#3648)
冯健 [Mon, 14 Feb 2022 02:37:48 +0000 (10:37 +0800)] 
[HUDI-2413] fix Sql source's checkpoint issue (#3648)

* [HUDI-2413] fix Sql source's checkpoint

* Fixing sql source checkpoint handling

* Fixing docs

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
7 months ago[MINOR] Fix typos in Spark client related classes (#4781)
Y Ethan Guo [Sun, 13 Feb 2022 14:41:58 +0000 (06:41 -0800)] 
[MINOR] Fix typos in Spark client related classes (#4781)

7 months ago[MINOR] unused import (#4799)
wangxianghu [Sat, 12 Feb 2022 09:11:37 +0000 (13:11 +0400)] 
[MINOR] unused import (#4799)

7 months ago[HUDI-3413]fix jackson parse error when empty message from JsonKafkaSource Using...
zhangxiang17 [Sat, 12 Feb 2022 07:37:29 +0000 (15:37 +0800)] 
[HUDI-3413]fix jackson parse error when empty message from JsonKafkaSource Using HoodieDeltaStreamer (#4794)

7 months ago[HUDI-3362] Fix restore to rollback pending clustering operations followed by other...
satishkotha [Fri, 11 Feb 2022 19:12:45 +0000 (11:12 -0800)] 
[HUDI-3362] Fix restore to rollback pending clustering operations followed by other rolling back other commits (#4772)

7 months ago[HUDI-3338] Custom relation instead of HadoopFsRelation (#4709)
Yann Byron [Fri, 11 Feb 2022 18:48:44 +0000 (02:48 +0800)] 
[HUDI-3338] Custom relation instead of HadoopFsRelation (#4709)

Currently, HadoopFsRelation uses the value from the actual partition path as the value of the partition field. However, unlike a normal table, Hudi persists the partition value in the parquet file, and in some cases the value in the partition path differs from the value of the partition field.
So here we implement BaseFileOnlyViewRelation, which lets Hudi manage its own relation.

7 months ago[HUDI-3402] Set TIMESTAMP_MICROS as the default value for hoodie.parquet.outputtimest...
Yann Byron [Fri, 11 Feb 2022 17:23:55 +0000 (01:23 +0800)] 
[HUDI-3402] Set TIMESTAMP_MICROS as the default value for hoodie.parquet.outputtimestamptype (#4749)

7 months ago[HUDI-2987] Update all deprecated calls to new apis in HoodieRecordPayload (#4681)
Sivabalan Narayanan [Fri, 11 Feb 2022 00:19:33 +0000 (19:19 -0500)] 
[HUDI-2987] Update all deprecated calls to new apis in HoodieRecordPayload (#4681)

7 months ago[HUDI-2610] pass the spark version when sync the table created by spark (#4758)
Yann Byron [Thu, 10 Feb 2022 15:35:28 +0000 (23:35 +0800)] 
[HUDI-2610] pass the spark version when sync the table created by spark (#4758)

* [HUDI-2610] pass the spark version when sync the table created by spark

* [MINOR] sync spark version in DataSourceUtils#buildHiveSyncConfig

7 months ago[HUDI-3395] Allow pass rollbackUsingMarkers to Hudi CLI rollback command (#4557)
wenningd [Thu, 10 Feb 2022 14:41:22 +0000 (06:41 -0800)] 
[HUDI-3395] Allow pass rollbackUsingMarkers to Hudi CLI rollback command (#4557)

Co-authored-by: Wenning Ding <wenningd@amazon.com>
7 months ago[HUDI-3333] fix that getNestedFieldVal breaks with Spark 3.2 (#4783)
Yann Byron [Thu, 10 Feb 2022 14:12:16 +0000 (22:12 +0800)] 
[HUDI-3333] fix that getNestedFieldVal breaks with Spark 3.2 (#4783)

7 months ago[HUDI-2432] Adding restore.requested instant and restore plan for restore action...
Sivabalan Narayanan [Thu, 10 Feb 2022 13:06:23 +0000 (08:06 -0500)] 
[HUDI-2432] Adding restore.requested instant and restore plan for restore action (#4605)

- This adds a restore plan and serializes it to restore.requested meta file in timeline. This also means that we are introducing schedule and execution phases for restore which was not present before.

7 months ago[HUDI-1847] Adding inline scheduling support for spark datasource path for compaction...
Sivabalan Narayanan [Thu, 10 Feb 2022 13:04:55 +0000 (08:04 -0500)] 
[HUDI-1847] Adding inline scheduling support for spark datasource path for compaction and clustering (#4420)

- This adds support in spark-datasource to just schedule table services inline so that users can leverage async execution w/o the need for lock service providers.

7 months ago[HUDI-3389] Bump flink version to 1.14.3 (#4776)
Danny Chan [Thu, 10 Feb 2022 03:32:01 +0000 (11:32 +0800)] 
[HUDI-3389] Bump flink version to 1.14.3 (#4776)

7 months ago[HUDI-3239] Convert `BaseHoodieTableFileIndex` to Java (#4669)
Alexey Kudinkin [Wed, 9 Feb 2022 23:42:08 +0000 (15:42 -0800)] 
[HUDI-3239] Convert `BaseHoodieTableFileIndex` to Java (#4669)

Converting BaseHoodieTableFileIndex to Java, removing Scala as a dependency from "hudi-common"

7 months ago[HUDI-3276] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParq...
Alexey Kudinkin [Tue, 8 Feb 2022 20:21:45 +0000 (12:21 -0800)] 
[HUDI-3276] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat` (#4667)

Rebased Parquet-based FileInputFormat impls to inherit from MapredParquetInputFormat, to make sure that Hive is appropriately recognizing those impls and applying corresponding optimizations.

- Converted HoodieRealtimeFileInputFormatBase and HoodieFileInputFormatBase into standalone implementations that could be instantiated as standalone objects (which could be used for delegation)
- Renamed HoodieFileInputFormatBase > HoodieCopyOnWriteTableInputFormat, HoodieRealtimeFileInputFormatBase > HoodieMergeOnReadTableInputFormat
- Scaffolded HoodieParquetFileInputFormatBase for all Parquet impls to inherit from
- Rebased Parquet impls onto HoodieParquetFileInputFormatBase

7 months ago[HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull (#4755)
Sivabalan Narayanan [Tue, 8 Feb 2022 17:03:07 +0000 (12:03 -0500)] 
[HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull (#4755)

7 months ago[HUDI-3091] Making SIMPLE index as the default index type (#4659) 4786/head
Sivabalan Narayanan [Tue, 8 Feb 2022 09:32:18 +0000 (04:32 -0500)] 
[HUDI-3091] Making SIMPLE index as the default index type (#4659)

* [HUDI-3091] Making SIMPLE index as the default index type

* Fixing tests

* Triaging timeouts

* disable SIMPLE index for bootstrap tests

* removing test run start and end log statements

* Fixing simple index parallelism for some tests

* Disabling failing test for now

* reverting previous disable

* Reverting all changes

* fixing azure pipeline script

7 months agoAdding support for custom scheduler configs with streaming sink (#4762)
Sivabalan Narayanan [Tue, 8 Feb 2022 09:14:10 +0000 (04:14 -0500)] 
Adding support for custom scheduler configs with streaming sink (#4762)

7 months ago[HUDI-3320] Hoodie metadata table validator (#4721)
YueZhang [Tue, 8 Feb 2022 08:29:44 +0000 (16:29 +0800)] 
[HUDI-3320] Hoodie metadata table validator (#4721)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
7 months ago[HUDI-3312] Fixing spark yaml and adding hive validation to integ test suite (#4731)
Sivabalan Narayanan [Tue, 8 Feb 2022 05:40:36 +0000 (00:40 -0500)] 
[HUDI-3312] Fixing spark yaml and adding hive validation to integ test suite (#4731)

7 months ago[HUDI-3373] Add zero value metrics for empty data source and PROMETHEUS_PUSHGATEWAY...
Vinish Reddy [Mon, 7 Feb 2022 20:17:46 +0000 (01:47 +0530)] 
[HUDI-3373] Add zero value metrics for empty data source and PROMETHEUS_PUSHGATEWAY reporter (#4760)

7 months ago[HUDI-3058] Simplify Precommit file system view (#4570)
satishkotha [Mon, 7 Feb 2022 20:16:50 +0000 (12:16 -0800)] 
[HUDI-3058] Simplify Precommit file system view (#4570)

7 months ago[HUDI-3206] Unify Hive's MOR implementations to avoid duplication (#4559)
Alexey Kudinkin [Mon, 7 Feb 2022 19:06:28 +0000 (11:06 -0800)] 
[HUDI-3206] Unify Hive's MOR implementations to avoid duplication (#4559)

Unify Hive's MOR implementations to avoid duplication across implementations for different file-formats (Parquet, HFile, etc)

- Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base)
- Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase
- Tidying up

7 months ago[HUDI-2941] Show _hoodie_operation in spark sql results (#4649)
ForwardXu [Mon, 7 Feb 2022 14:28:13 +0000 (22:28 +0800)] 
[HUDI-2941] Show _hoodie_operation in spark sql results (#4649)

7 months ago[HUDI-3360] Adding retries to deltastreamer for source errors (#4744)
Sivabalan Narayanan [Mon, 7 Feb 2022 13:10:06 +0000 (08:10 -0500)] 
[HUDI-3360] Adding retries to deltastreamer for source errors (#4744)

7 months ago[HUDI-2491] Expose HMS mode metastore uri config option for spark writer (#3962)
ehui [Mon, 7 Feb 2022 12:43:51 +0000 (20:43 +0800)] 
[HUDI-2491] Expose HMS mode metastore uri config option for spark writer (#3962)

7 months ago[HUDI-3369] New ScheduleAndExecute mode for HoodieCompactor and hudi-cli (#4750)
YueZhang [Mon, 7 Feb 2022 09:31:34 +0000 (17:31 +0800)] 
[HUDI-3369] New ScheduleAndExecute mode for HoodieCompactor and hudi-cli (#4750)

Schedule and execute compaction plan in one single mode.

7 months ago[HUDI-3344] Standard format for HoodieDataSourceExample.scala (#4717)
Qian.Sun [Mon, 7 Feb 2022 03:27:44 +0000 (11:27 +0800)] 
[HUDI-3344] Standard format for HoodieDataSourceExample.scala (#4717)

7 months ago[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Y Ethan Guo [Fri, 4 Feb 2022 04:24:04 +0000 (20:24 -0800)] 
[HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
7 months ago[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputForma...
Alexey Kudinkin [Thu, 3 Feb 2022 22:01:41 +0000 (14:01 -0800)] 
[HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR `FileInputFormat`s (#4556)

7 months ago[HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index...
Manoj Govindassamy [Thu, 3 Feb 2022 12:42:48 +0000 (04:42 -0800)] 
[HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups (#4352)

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

- Today, base files store a bloom filter in their footers, so index lookups
  have to load the base file to perform any bloom lookup. Even with
  interval-tree-based file pruning, a significant number of base files are
  still read just for their bloom filters during key lookups. Index lookups
  can be made faster by keeping all the bloom filters in a new metadata
  partition and doing point lookups based on keys.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Adding indexing support for clean, restore and rollback operations.
   Each of these operations will now be converted to index records for
   bloom filter and column stats additionally.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Making hoodie key consistent for both column stats and bloom index by
   including fileId instead of fileName, in both read and write paths.

 - Performance optimization for looking up records in the metadata table.

 - Avoiding multi column sorting needed for HoodieBloomMetaIndexBatchCheckFunction

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - HoodieBloomMetaIndexBatchCheckFunction cleanup to remove unused classes

 - Base file checking before reading the file footer for bloom or column stats

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Updating the bloom index and column stats index to have full file name
   included in the key instead of just file id.

 - Minor test fixes.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Fixed flink commit method to handle metadata table all partition update records

 - TestBloomIndex fixes

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - SparkHoodieBloomIndexHelper code simplification for various config modes

 - Signature change for getBloomFilters() and getColumnStats(). Callers can
   just pass in interested partition and file names, the index key is then
   constructed internally based on the passed in parameters.

 - KeyLookupHandle and KeyLookupResults code refactoring

 - Metadata schema changes - removed the reserved field

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Removing HoodieColumnStatsMetadata and using HoodieColumnRangeMetadata instead.
   Fixed the users of the removed class.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Extending meta index test to cover deletes, compactions, clean
   and restore table operations. Also, fixed the getBloomFilters()
   and getColumnStats() to account for deleted entries.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Addressing review comments - java doc for new classes, keys sorting for
   lookup, index methods renaming.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Consolidated the bloom filter checking for keys into one
   HoodieMetadataBloomIndexCheckFunction instead of separate batch
   and lazy modes. Removed all the configs around it.

 - Made the metadata table partition file group count configurable.

 - Fixed the HoodieKeyLookupHandle to have auto closable file reader
   when checking bloom filter and range keys.

 - Config property renames. Test fixes.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Enabling column stats indexing for all columns by default

 - Handling column stat generation errors and test update

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Metadata table partition file group count taken from the slices when
   the table is bootstrapped.

 - Prep records for the commit refactored to the base class

 - HoodieFileReader interface changes for filtering keys

 - Multi column and data types support for column stats index

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - rebase to latest master and merge fixes for the build and test failures

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Extending the metadata column stats type payload schema to include
   more statistics about the column ranges to help query integration.

* [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

 - Addressing review comments
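
The idea in the first HUDI-1295 bullet above (keeping every base file's bloom filter in one keyed store and doing point lookups, instead of opening each file footer) can be sketched roughly as follows. This is a toy illustration, not Hudi's implementation: `BloomFilter`, `bloom_partition`, `index_file` and `candidate_files` are hypothetical names, and a plain dict stands in for the metadata table partition.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python int

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical stand-in for the bloom filter metadata partition:
# one filter per (partition path, file name) key.
bloom_partition = {}


def index_file(partition, file_name, record_keys):
    """Build a filter for one base file and store it under its key."""
    bloom = BloomFilter()
    for record_key in record_keys:
        bloom.add(record_key)
    bloom_partition[(partition, file_name)] = bloom


def candidate_files(partition, record_key):
    """Return files in the partition whose filter may contain the record key."""
    return sorted(
        name
        for (part, name), bloom in bloom_partition.items()
        if part == partition and bloom.might_contain(record_key)
    )
```

Only the files returned by `candidate_files` would then need to be opened to confirm whether the key is really present.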

7 months ago[HUDI-3337] Fixing Parquet Column Range metadata extraction (#4705)
Alexey Kudinkin [Thu, 3 Feb 2022 01:58:05 +0000 (17:58 -0800)] 
[HUDI-3337] Fixing Parquet Column Range metadata extraction (#4705)

- Parquet Column Range metadata extraction utility was simplistically assuming that Decimal types are only represented by INT32, while their representation varies depending on precision.

- More details could be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#DECIMAL
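
As a rough illustration of why assuming INT32 is not enough, the sketch below lists which Parquet physical types may carry a DECIMAL of a given precision, following the rules in the linked LogicalTypes document. The function names are hypothetical and this is not the code from the fix itself.

```python
def decimal_physical_types(precision):
    """Physical types allowed for a Parquet DECIMAL of the given precision."""
    if precision < 1:
        raise ValueError("DECIMAL precision must be at least 1")
    # Byte-array encodings are always allowed.
    types = ["FIXED_LEN_BYTE_ARRAY", "BYTE_ARRAY"]
    if precision <= 18:
        types.insert(0, "INT64")  # INT64 covers 1 <= precision <= 18
    if precision <= 9:
        types.insert(0, "INT32")  # INT32 covers 1 <= precision <= 9
    return types


def min_fixed_len_bytes(precision):
    """Smallest two's-complement byte width that fits any unscaled value."""
    max_unscaled = 10 ** precision - 1
    num_bytes = 1
    while max_unscaled > 2 ** (8 * num_bytes - 1) - 1:
        num_bytes += 1
    return num_bytes
```

A column-range reader therefore has to branch on the physical type actually found in the file footer rather than decoding every decimal as a 32-bit integer.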

7 months ago[HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716)
Alexey Kudinkin [Wed, 2 Feb 2022 21:10:51 +0000 (13:10 -0800)] 
[HUDI-3322][HUDI-3343] Fixing Metadata Table Records Duplication Issues (#4716)

This change addresses issues where the Metadata Table ingested duplicated records, leading it to persist incorrect file sizes for the files referred to in those records.

There are multiple issues that were leading to that:

- [HUDI-3322] Incorrect Rollback Plan generation: Rollback Plan generated for MOR tables was overly expansively listing all log-files with the latest base-instant as the ones that have been affected by the rollback, leading to invalid MT records being ingested referring to those.
- [HUDI-3343] Metadata Table including Uncommitted Log Files during Bootstrap: Since MT is bootstrapped at the end of the commit operation execution (after FS activity, but before committing to the timeline), it was actually incorrectly ingesting some files that were part of the intermediate state of the operation being committed.

This change will unblock Stack of PRs based off #4556

7 months ago[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333)
Alexey Kudinkin [Wed, 2 Feb 2022 19:35:05 +0000 (11:35 -0800)] 
[HUDI-431] Adding support for Parquet in MOR `LogBlock`s (#4333)

- Adding support for Parquet in MOR tables Log blocks

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
7 months ago[HUDI-3330] Remove fixture test tables for multi writer tests (#4704)
Raymond Xu [Wed, 2 Feb 2022 12:20:10 +0000 (04:20 -0800)] 
[HUDI-3330] Remove fixture test tables for multi writer tests (#4704)

7 months ago[HUDI-2589] RFC-37: Metadata table based bloom index (#3989)
Manoj Govindassamy [Tue, 1 Feb 2022 23:38:20 +0000 (15:38 -0800)] 
[HUDI-2589] RFC-37: Metadata table based bloom index (#3989)

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
7 months ago[HUDI-3368] Revert "[HUDI-3306] Upgrade rocksdb version (#4663)" (#4733)
Sivabalan Narayanan [Tue, 1 Feb 2022 19:18:38 +0000 (14:18 -0500)] 
[HUDI-3368] Revert "[HUDI-3306] Upgrade rocksdb version (#4663)" (#4733)

This reverts commit 6f1010799861b9e4b331d0bf5a473211200c15d1.

7 months ago[HUDI-3293] Fixing default value for clustering small file config to 300MB (#4662)
Sivabalan Narayanan [Tue, 1 Feb 2022 13:22:37 +0000 (08:22 -0500)] 
[HUDI-3293] Fixing default value for clustering small file config  to 300MB (#4662)

7 months ago[HUDI-3346] Fixing non existant marker dir handling in TwoToOnedowngrade (#4726)
Sivabalan Narayanan [Tue, 1 Feb 2022 13:21:55 +0000 (08:21 -0500)] 
[HUDI-3346] Fixing non existant marker dir handling in TwoToOnedowngrade (#4726)

7 months ago[HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files...
jsbali [Tue, 1 Feb 2022 04:03:18 +0000 (09:33 +0530)] 
[HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files have been cleared or moved by cleaner (#3946)

Co-authored-by: sivabalan <n.siva.b@gmail.com>
7 months ago[HUDI-3292] Enabling lazy read by default for log blocks during compaction (#4661)
Sivabalan Narayanan [Tue, 1 Feb 2022 03:36:17 +0000 (22:36 -0500)] 
[HUDI-3292] Enabling lazy read by default for log blocks during compaction (#4661)

7 months ago[HUDI-3318] [RFC-46] Optimize Record Payload handling (#4697)
Alexey Kudinkin [Tue, 1 Feb 2022 02:33:35 +0000 (18:33 -0800)] 
[HUDI-3318] [RFC-46] Optimize Record Payload handling (#4697)

7 months ago[HUDI-3253] preferred to use the table's own location (#4608)
Yann Byron [Sat, 29 Jan 2022 08:39:42 +0000 (16:39 +0800)] 
[HUDI-3253] preferred to use the table's own location (#4608)

7 months ago[MINOR] Added log to debug checkpoint resumption when set to 0 (#4650)
Harsha Teja Kanna [Sat, 29 Jan 2022 04:08:25 +0000 (22:08 -0600)] 
[MINOR] Added log to debug checkpoint resumption when set to 0 (#4650)

7 months ago[HUDI-1977] Fix Hudi CLI tempview query issue (#4626)
peanut-chenzhong [Sat, 29 Jan 2022 02:39:08 +0000 (10:39 +0800)] 
[HUDI-1977] Fix Hudi CLI tempview query issue (#4626)