Ryan Blue [Mon, 1 Nov 2021 21:27:07 +0000 (14:27 -0700)]
Add version.txt for release 0.12.1
Kyle Bendickson [Sun, 31 Oct 2021 18:56:56 +0000 (11:56 -0700)]
Spark 3.2: Remove extra parens to fix checkstyle (#3386)
Kyle Bendickson [Fri, 29 Oct 2021 19:02:24 +0000 (12:02 -0700)]
ORC: Fix importing ORC files with float and double columns and test (#3332)
Chen Zhang [Tue, 26 Oct 2021 23:46:09 +0000 (07:46 +0800)]
Spark: Fix ClassCastException when using bucket UDF (#3368)
pvary [Tue, 26 Oct 2021 22:21:43 +0000 (00:21 +0200)]
Hive: Fix Catalogs.hiveCatalog method for default catalogs (#3338)
Kyle Bendickson [Sun, 10 Oct 2021 17:53:21 +0000 (10:53 -0700)]
Build: Fix ErrorProne NewHashMapInt warnings (#3260)
This should improve performance by allocating the correct size directly rather than reallocating later.
Szehon Ho [Wed, 20 Oct 2021 15:09:10 +0000 (08:09 -0700)]
Avro: Fix file import with correct row count (#3273)
Anton Okolnychyi [Fri, 1 Oct 2021 18:48:43 +0000 (11:48 -0700)]
Core: Validate concurrently added delete files in OverwriteFiles (#3199)
Anton Okolnychyi [Tue, 28 Sep 2021 19:34:39 +0000 (12:34 -0700)]
Core: Validate concurrently added delete files in RowDelta (#3195)
jshmchenxi [Mon, 11 Oct 2021 10:27:50 +0000 (18:27 +0800)]
Hive: Ensure tableLevelMutex is unlocked when uncommitted metadata delete fails (#3264)
Anton Okolnychyi [Mon, 13 Sep 2021 21:25:49 +0000 (11:25 -1000)]
Core: Optimize check for referenced data files in BaseRowDelta (#3071)
This change optimizes our check for referenced data files in BaseRowDelta by pushing down the conflict detection filter. Previously, we would open manifests even if they belonged to partitions we were not interested in.
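As a rough illustration of the pushdown described above, the sketch below skips manifests whose partition range cannot match the conflict detection filter. The `Manifest` type, its `min_part`/`max_part` fields, and `manifests_to_open` are hypothetical names chosen for this example; they are not Iceberg's actual classes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest with a partition value summary."""
    path: str
    min_part: int  # lowest partition value covered by the manifest
    max_part: int  # highest partition value covered by the manifest


def manifests_to_open(manifests: List[Manifest], wanted: Tuple[int, int]) -> List[Manifest]:
    """Keep only manifests whose partition range overlaps the filter range."""
    lo, hi = wanted
    return [m for m in manifests if m.max_part >= lo and m.min_part <= hi]


manifests = [
    Manifest("m1.avro", 0, 9),
    Manifest("m2.avro", 10, 19),
    Manifest("m3.avro", 20, 29),
]

# Only partitions 12..15 are of interest, so m1 and m3 are never opened.
selected = manifests_to_open(manifests, (12, 15))
print([m.path for m in selected])
```

Without the filter, all three manifests would be read before any of their entries could be discarded.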
Ryan Blue [Tue, 19 Oct 2021 15:05:48 +0000 (08:05 -0700)]
Parquet: Fix map projection after map to key_value rename (#3309)
Ryan Blue [Tue, 19 Oct 2021 18:34:11 +0000 (11:34 -0700)]
Hotfix: Fix Flink test imports. (#3319)
Ryan Blue [Tue, 19 Oct 2021 11:56:26 +0000 (04:56 -0700)]
Flink: Fix CDC validation errors (#3258)
Omar Al-Safi [Tue, 12 Oct 2021 09:10:41 +0000 (11:10 +0200)]
Hive: Fix NoSuchMethodError of OrcTail with Hive3.x and Vectorized ORC (#3155)
Rajarshi Sarkar [Tue, 28 Sep 2021 19:46:24 +0000 (01:16 +0530)]
AWS: Add check to create staging directory if not exists for S3OutputStream (#3175)
xloya [Tue, 28 Sep 2021 18:05:01 +0000 (02:05 +0800)]
Data: Fix equality deletes with date/time types (#3135)
Co-authored-by: xiaojiebao <xiaojiebao@xiaomi.com>
Đặng Minh Dũng [Sun, 26 Sep 2021 22:55:37 +0000 (05:55 +0700)]
Core: Fix JDBC properties, only keep keys with jdbc. prefix (#3078)
Bijan Houle [Mon, 13 Sep 2021 15:38:16 +0000 (09:38 -0600)]
AWS: Fix DynamoDbCatalog.dropNamespace attr check (#3035)
Aman Rawat [Mon, 30 Aug 2021 17:51:47 +0000 (23:21 +0530)]
Core: Fix null value check for table properties (#3052)
Co-authored-by: rawataaryan9 <rawataaryan9@github.com>
Carl Steinbach [Mon, 9 Aug 2021 23:21:23 +0000 (16:21 -0700)]
Add version.txt for release 0.12.0
Szehon Ho [Mon, 9 Aug 2021 22:58:51 +0000 (15:58 -0700)]
Core: Add predicate pushdown for files metadata table (#2926)
Szehon Ho [Mon, 9 Aug 2021 21:55:41 +0000 (14:55 -0700)]
Spark: Fix random test failures in TestDeleteReachableFilesAction (#2951)
ismail simsek [Mon, 9 Aug 2021 20:58:25 +0000 (22:58 +0200)]
API: Validate identifier fields in Schema (#2943)
Wing Yew Poon [Mon, 9 Aug 2021 20:55:05 +0000 (13:55 -0700)]
Spark: Fix broken RepartitionByExpression in RewriteDelete for 3.1 (#2954)
The constructor of RepartitionByExpression changed between Spark 3.0 and 3.1.
There was an instance of constructing RepartitionByExpression that was missed in the original commit (#2512).
Jack Ye [Mon, 9 Aug 2021 19:58:27 +0000 (12:58 -0700)]
AWS: Fix concurrent modification integration test (#2948)
Jack Ye [Fri, 6 Aug 2021 22:43:31 +0000 (15:43 -0700)]
Doc: update README local build instructions (#2949)
Kyle Bendickson [Fri, 6 Aug 2021 21:14:26 +0000 (14:14 -0700)]
Docs: Add Hadoop conf overrides in Spark (#2922)
Steven Zhen Wu [Fri, 6 Aug 2021 14:53:40 +0000 (07:53 -0700)]
Flink: Add uidPrefix to operator name so that Flink web UI can show names for different iceberg sinks in a job (#2886)
Russell Spitzer [Thu, 5 Aug 2021 17:52:11 +0000 (12:52 -0500)]
Core: Fix nested schema projection in AllDataFilesTable (#2941)
Russell Spitzer [Thu, 5 Aug 2021 12:23:07 +0000 (07:23 -0500)]
Spark: Fix nested struct pruning (#2877)
* Spark: Support Nested Struct Pruning in DataTasks
Previously, DataTasks would return full schemas for some tables and pruned schemas for others, and would rely on the underlying framework to do the actual projection. This moves projection and pruning into the core responsibility of the task. This fixes an issue where Spark would be able to push down some nested struct predicates to a metadata table, but we wouldn't recognize this when trying to do the projection in the framework. StaticDataTasks now support projection in their creation, but only if it does not require pruning fields from within a struct which is an element of a List or Map.
Carl Steinbach [Thu, 5 Aug 2021 02:51:33 +0000 (19:51 -0700)]
Add list of Github collaborators to asf.yaml (#2909)
Anton Okolnychyi [Wed, 4 Aug 2021 14:35:22 +0000 (04:35 -1000)]
Flink: Add FlinkWriterFactory (#2924)
Carl Steinbach [Wed, 4 Aug 2021 01:21:53 +0000 (18:21 -0700)]
Add thenewstack.io article to list of blog posts (#2930)
Jack Ye [Tue, 3 Aug 2021 22:27:58 +0000 (15:27 -0700)]
Core: Allow creating v2 tables through table property (#2887)
Anton Okolnychyi [Tue, 3 Aug 2021 19:47:39 +0000 (09:47 -1000)]
Core: Support nulls in StructLike collections (#2929)
Anton Okolnychyi [Tue, 3 Aug 2021 17:29:44 +0000 (07:29 -1000)]
Flink: Switch to using SerializableTable (#2923)
Piotr Findeisen [Tue, 3 Aug 2021 16:21:16 +0000 (18:21 +0200)]
Parquet: Annotate UUID fields (#2913)
The spec mandates that UUID fields in Parquet have logical type "UUID"
(https://iceberg.apache.org/spec/#parquet). This is possible to fulfill
after 236615497bdc2c6fbedbd3acc41a4ed85c4a8bfd, as
`LogicalTypeAnnotation.uuidType` was added in Parquet 1.12.0.
Ryan Blue [Tue, 3 Aug 2021 14:39:58 +0000 (07:39 -0700)]
Core: Fix partition field IDs in table replacement (#2906)
Co-authored-by: Jun He <jun-he@users.noreply.github.com>
Ryan Blue [Sun, 1 Aug 2021 23:23:03 +0000 (16:23 -0700)]
Docs: Update Slack invite link (#2904)
* Docs: Update Slack invite link.
* Update the intro paragraph as well.
Ryan Blue [Sun, 1 Aug 2021 18:28:03 +0000 (11:28 -0700)]
Build: Fix site publishing in .asf.yaml.
Anton Okolnychyi [Sat, 31 Jul 2021 07:18:17 +0000 (21:18 -1000)]
Core: Add WriterFactory (#2873)
Saurabh Agarwal [Fri, 30 Jul 2021 11:48:22 +0000 (17:18 +0530)]
Api#2880: Close the underlying iterator in ClosingIterator in hasNext() call (#2881)
Flyangz [Thu, 29 Jul 2021 10:40:05 +0000 (18:40 +0800)]
Core: Add includeColumnStats option in FindFiles API (#2875)
openinx [Thu, 29 Jul 2021 07:35:07 +0000 (15:35 +0800)]
Core: Fix the NPE in DataFiles.Builder#copy (#2852)
Ted Gooch [Wed, 28 Jul 2021 23:50:51 +0000 (16:50 -0700)]
[python] Updating pyarrow dependencies (#2888)
Co-authored-by: tgooch <tgooch@netflix.com>
jun-he [Wed, 28 Jul 2021 19:34:22 +0000 (12:34 -0700)]
[Python] support BucketByteBuffer and BucketUUID (#2836)
* [Python] support BucketByteBuffer and BucketUUID
* Add additional unit tests for bucket hash methods.
Eduard Tudenhöfner [Wed, 28 Jul 2021 15:20:39 +0000 (17:20 +0200)]
Docs: Update Slack invite link (#2882)
Piotr Findeisen [Wed, 28 Jul 2021 11:42:29 +0000 (13:42 +0200)]
Docs: Avoid insinuating other file format is supported (#2883)
Daniel Weeks [Wed, 28 Jul 2021 01:26:25 +0000 (18:26 -0700)]
Docs: Add Tencent blog - Flink + Iceberg: How to Construct a Whole-scenario Real-time Data Warehouse (#2876)
Ryan Blue [Tue, 27 Jul 2021 15:59:53 +0000 (08:59 -0700)]
Core: Add validation for row-level deletes with rewrites (#2865)
Russell Spitzer [Tue, 27 Jul 2021 03:11:46 +0000 (22:11 -0500)]
Build: Run tests against Spark 3.0 and Spark 3.1
Due to an issue in build.gradle, the Spark test and test31 modules were both running with Spark 3.1. Removing the interdependency fixes the issue, and both Gradle tasks now run with their respective Spark versions.
Ryan Blue [Tue, 27 Jul 2021 00:11:01 +0000 (17:11 -0700)]
Spec: Make contains_nan partition summary field optional in v2. (#2864)
Ryan Blue [Tue, 27 Jul 2021 00:10:44 +0000 (17:10 -0700)]
Spec: Add back distinct_counts in data_file metadata (#2805)
* Spec: Add back distinct_counts in data_file metadata.
* Update for review comments.
Carl Steinbach [Mon, 26 Jul 2021 22:30:09 +0000 (15:30 -0700)]
Docs: Add security page to ASF site (#2813)
This patch adds a page to the site docs which describes the process for
reporting security vulnerabilities found in Apache Iceberg.
Anton Okolnychyi [Mon, 26 Jul 2021 21:57:38 +0000 (11:57 -1000)]
Core: Support multiple specs in OutputFileFactory (#2858)
Dave Nielsen [Mon, 26 Jul 2021 02:26:35 +0000 (19:26 -0700)]
Docs: Add new blog - Apache Iceberg: An Architectural Look Under the Covers (#2856)
Anton Okolnychyi [Sat, 24 Jul 2021 01:03:46 +0000 (15:03 -1000)]
Core: Add DataWriter builders (#2857)
Anton Okolnychyi [Fri, 23 Jul 2021 21:28:02 +0000 (11:28 -1000)]
Core: Add table properties for Avro and Parquet delete files (#2851)
Piotr Findeisen [Thu, 22 Jul 2021 19:24:52 +0000 (21:24 +0200)]
API: Fix string bucketing with non-BMP characters (#2849)
Russell Spitzer [Wed, 21 Jul 2021 13:47:00 +0000 (08:47 -0500)]
Core: Adds SortRewriteStrategy (#2609)
A rewrite strategy for data files which aims to reorder data within data files to optimally lay them out
in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
by column x and originally has files File A (x: 0 - 50), File B (x: 10 - 40) and File C (x: 30 - 60),
this strategy will attempt to rewrite those files into File A' (x: 0 - 20), File B' (x: 21 - 40),
File C' (x: 41 - 60).
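The effect described above can be sketched in a few lines: merge the rows of all input files, sort them by x, and cut the sorted run into new files whose ranges no longer overlap. This is only an illustration under simplifying assumptions. `sort_rewrite` and `rows_per_file` are hypothetical names, and a real strategy would target a file size rather than a row count.

```python
from typing import List


def sort_rewrite(files: List[List[int]], rows_per_file: int) -> List[List[int]]:
    """Merge all rows, sort them by x, and cut the result into new files."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    rows = sorted(x for f in files for x in f)
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


# Input files with overlapping x ranges: 0-50, 10-40 and 30-60.
file_a = [0, 25, 50]
file_b = [10, 20, 40]
file_c = [30, 45, 60]

rewritten = sort_rewrite([file_a, file_b, file_c], rows_per_file=3)
for f in rewritten:
    # Each output file now covers its own disjoint slice of x.
    print(f"x: {f[0]} - {f[-1]}")
```

Because the output ranges are disjoint, a filter on x can skip whole files using their min/max column statistics.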
Jack Ye [Wed, 21 Jul 2021 01:56:57 +0000 (18:56 -0700)]
Doc: add documentation for JDBC and DynamoDB catalogs (#2831)
Piotr Findeisen [Tue, 20 Jul 2021 21:00:34 +0000 (23:00 +0200)]
Spec: Fix missing negative in binary/fixed hash examples (#2840)
Eduard Tudenhöfner [Tue, 20 Jul 2021 08:32:40 +0000 (10:32 +0200)]
Nessie: Bump Nessie to 0.8.3 / Rename auth_type to auth-type (#2834)
Kyle Bendickson [Mon, 19 Jul 2021 22:09:03 +0000 (15:09 -0700)]
SPARK: Allow spark catalogs to have hadoop configuration overrides p… (#2792)
Previously, Iceberg catalogs loaded into Spark would always use the Hadoop Configuration owned by the underlying Spark Session. This made it impossible to use a different set of configuration values, which may be required to connect to a remote catalog. This patch allows Spark catalogs to have Hadoop configuration overrides per catalog, permitting different configuration for different underlying Iceberg catalogs.
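A minimal sketch of the per-catalog override idea follows. It assumes the overrides are given as Spark properties of the form `spark.sql.catalog.<catalog-name>.hadoop.<key>`. The function name, the plain dictionaries standing in for the configurations, and the endpoint values are illustrative assumptions, not the actual implementation.

```python
from typing import Dict


def hadoop_conf_for_catalog(
    session_conf: Dict[str, str], spark_conf: Dict[str, str], catalog: str
) -> Dict[str, str]:
    """Copy the session-wide Hadoop settings and apply one catalog's overrides."""
    prefix = f"spark.sql.catalog.{catalog}.hadoop."
    merged = dict(session_conf)  # never mutate the shared session configuration
    for key, value in spark_conf.items():
        if key.startswith(prefix):
            merged[key[len(prefix):]] = value
    return merged


session_conf = {"fs.s3a.endpoint": "http://default:9000", "fs.defaultFS": "hdfs://nn"}
spark_conf = {
    "spark.sql.catalog.remote.hadoop.fs.s3a.endpoint": "http://remote:9000",
    "spark.sql.catalog.local.type": "hadoop",
}

remote_conf = hadoop_conf_for_catalog(session_conf, spark_conf, "remote")
local_conf = hadoop_conf_for_catalog(session_conf, spark_conf, "local")
print(remote_conf["fs.s3a.endpoint"], local_conf["fs.s3a.endpoint"])
```

Copying the session configuration before applying overrides keeps one catalog's settings from leaking into another.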
Eduard Tudenhöfner [Mon, 19 Jul 2021 10:57:09 +0000 (12:57 +0200)]
Move Assert.assertTrue(..) instance checks to AssertJ assertions (#2756)
Kyle Bendickson [Fri, 16 Jul 2021 15:34:38 +0000 (08:34 -0700)]
ORC: Upgrade ORC dependency to 1.6.9 (#2781)
Robert Stupp [Thu, 15 Jul 2021 22:21:27 +0000 (00:21 +0200)]
Bump Nessie to 0.8.2 + related changes (#2588)
* Bump Nessie to 0.8.2 + replace Gradle plugin with new JUnit extension
More changes in this PR in following commits.
Replace Gradle plugin with new JUnit extension.
See [Add JAX-RS tests and add JUnit/Jupyter extension](https://github.com/projectnessie/nessie/pull/1566)
* Changes required by Nessie-API changes
Apply changes to Iceberg required by API changes in Nessie:
* [Re-introduce wrapper classes for query params of CommitLog/Entries](https://github.com/projectnessie/nessie/pull/1595)
* [Server-side commit range filtering](https://github.com/projectnessie/nessie/pull/1596)
* [Add hashOnRef query param to support time travel on a named ref](https://github.com/projectnessie/nessie/pull/1589)
* [Only accept NamedRefs in REST API](https://github.com/projectnessie/nessie/pull/1583)
* Bugfix: must send the Contents.id of the existing table
Nessie's `Contents.id` is a random ID generated when the `Contents.Key` is first used (think:
CREATE TABLE) and must not be changed. This change addresses a bug in the Iceberg-Nessie code
that generated a new ID for every change.
* Throw `CommitStateUnknownException` for `renameTable` as well
Follow-up of #2515
* Fix race-condition & save one roundtrip to Nessie during "commit"
When committing a change, the Nessie-API now returns the hash of the commit for the change.
This returned hash should then be used as the "expected hash" for the next commit.
The previous approach was to commit the change to Nessie and then do another request to
retrieve the new hash of HEAD.
This old approach is prone to a race condition, namely when another commit happens after
"this" commit but before retrieving the "new HEAD", so "this" instance would wrongly
ignore the other commit's changes during conflict checks.
See [Let VersionStore.create()+commit() return the current hash](https://github.com/projectnessie/nessie/pull/1089)
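The race condition described above can be illustrated with a toy in-memory branch that accepts a commit only when the caller's expected hash matches the current HEAD. `InMemoryBranch` and `ConflictError` are hypothetical names for this sketch. This is not the Nessie client API.

```python
import hashlib


class ConflictError(Exception):
    """Raised when the caller's expected hash is not the current HEAD."""


class InMemoryBranch:
    """Toy branch: a commit succeeds only if expected_hash equals HEAD."""

    def __init__(self) -> None:
        self.head = "root"

    def commit(self, expected_hash: str, change: str) -> str:
        if expected_hash != self.head:
            raise ConflictError(f"expected {expected_hash}, HEAD is {self.head}")
        self.head = hashlib.sha256((self.head + change).encode("utf-8")).hexdigest()
        return self.head  # the caller keeps this as its next expected hash


branch = InMemoryBranch()
a_expected = branch.commit("root", "A: create table")  # A keeps the returned hash
branch.commit(a_expected, "B: append")                 # another writer commits next

# Old approach: A re-reads HEAD only now, so it would pick up B's hash and
# its next commit would pass the check without ever seeing B's change.
stale_view = branch.head

# New approach: A reuses the hash returned by its own commit, so B's
# intervening commit is detected as a conflict.
try:
    branch.commit(a_expected, "A: second change")
    conflict_detected = False
except ConflictError:
    conflict_detected = True
print(conflict_detected)
```

Returning the hash from `commit` also saves the extra round trip that was needed to fetch the new HEAD.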
Russell Spitzer [Thu, 15 Jul 2021 17:45:24 +0000 (12:45 -0500)]
Spark: Remove unused FileRewriteCoordinator code (#2819)
Since we changed our implementation of Spark3BinPackStrategy, we no longer need some
of the functionality that was previously in FileRewriteCoordinator. Here we remove
those functions and related test code.
sshkvar [Thu, 15 Jul 2021 17:06:44 +0000 (20:06 +0300)]
Add support for reading/writing timestamps without timezone. (#2757)
Previously Spark could not handle Iceberg tables which contained Timestamp.withoutTimeZone. New parameters are introduced to allow Timestamp without TimeZone to be treated as Timestamp with Timezone.
Co-authored-by: bkahloon <kahlonbakht@gmail.com>
Co-authored-by: shardulm94
Fokko Driesprong [Tue, 13 Jul 2021 15:59:22 +0000 (17:59 +0200)]
Core: Use Avro 1.10.1 (#1648)
Co-authored-by: Fokko Driesprong <fdriesprong@ebay.com>
Szehon Ho [Tue, 13 Jul 2021 14:58:04 +0000 (07:58 -0700)]
Spark: Add Files perf improvement by pushing down partition filter to Spark/Hive catalog (#2777)
Pushes down partition filters in Spark/Hive Import to underlying catalog instead of retrieving all partitions and then filtering.
Eduard Tudenhöfner [Tue, 13 Jul 2021 13:40:28 +0000 (15:40 +0200)]
Docs: Fix link to intellij-java-palantir-style.xml (#2817)
Anton Okolnychyi [Tue, 13 Jul 2021 05:49:37 +0000 (19:49 -1000)]
Spark: Add missing deprecation annotations for old actions (#2811)
Anton Okolnychyi [Tue, 13 Jul 2021 02:28:51 +0000 (16:28 -1000)]
Spark: Use JavaSparkContext.fromSparkContext instead of constructor (#2812)
Anton Okolnychyi [Tue, 13 Jul 2021 02:20:32 +0000 (16:20 -1000)]
API: Use delete instead of remove in action names (#2810)
daksha121 [Tue, 13 Jul 2021 00:16:36 +0000 (17:16 -0700)]
Spark: Add table property to skip delete snapshots in streaming (#2752)
jshmchenxi [Mon, 12 Jul 2021 17:24:19 +0000 (01:24 +0800)]
Spark: Parallelize task init when fetching locality info (#2800)
Marton Bod [Mon, 12 Jul 2021 11:31:58 +0000 (13:31 +0200)]
Upgrade to Tez 0.10.1 (#2790)
Eduard Tudenhoefner [Mon, 28 Jun 2021 10:27:03 +0000 (12:27 +0200)]
Reduce code duplication in VectorizedParquetDefinitionLevelReader
Eduard Tudenhoefner [Mon, 28 Jun 2021 08:23:19 +0000 (10:23 +0200)]
Reduce code duplication in VectorizedPageIterator
Eduard Tudenhoefner [Fri, 25 Jun 2021 16:48:11 +0000 (18:48 +0200)]
Reduce code duplication in VectorizedDictionaryEncodedParquetValuesReader
Eduard Tudenhoefner [Fri, 25 Jun 2021 15:58:09 +0000 (17:58 +0200)]
Reduce code duplication in VectorizedColumnIterator
Eduard Tudenhoefner [Fri, 25 Jun 2021 14:52:51 +0000 (16:52 +0200)]
Refactor VectorizedArrowReader
Eduard Tudenhoefner [Fri, 25 Jun 2021 14:43:28 +0000 (16:43 +0200)]
Don't use deprecated methods
Russell Spitzer [Sun, 11 Jul 2021 23:50:36 +0000 (18:50 -0500)]
Spark: Reimplement RewriteDatafilesAction with partial progress (#2591)
Eduard Tudenhöfner [Sat, 10 Jul 2021 23:15:27 +0000 (01:15 +0200)]
Build: Upgrade to JUnit 5 (#2797)
Russell Spitzer [Fri, 9 Jul 2021 18:50:17 +0000 (13:50 -0500)]
Docs: Fixes broken links to old spark doc page (#2801)
Russell Spitzer [Fri, 9 Jul 2021 14:45:03 +0000 (09:45 -0500)]
Build: Change Spark Versions to Support M1 Processors (#2795)
Spark's Snappy native lib support is missing M1 support in
our current build. Upgrading Spark upgrades Snappy to a version
which has these native libs. This has no effect on actual runtime
Spark for end users since we do not include Spark with our
release jars.
Ward Harris [Thu, 8 Jul 2021 07:30:18 +0000 (15:30 +0800)]
Core: Fix JdbcCatalog CATALOG_TABLE_NAME to be lowercase (#2778)
Jack Ye [Tue, 6 Jul 2021 20:50:37 +0000 (13:50 -0700)]
Build: bump up DiffPlug Spotless version (#2776)
Ryan Blue [Tue, 6 Jul 2021 19:57:10 +0000 (12:57 -0700)]
Spec: Update v2 change summary (#2762)
southernriver [Tue, 6 Jul 2021 14:59:29 +0000 (22:59 +0800)]
Style: Delete blank line of CachedClientPool.java (#2787)
Eduard Tudenhöfner [Mon, 5 Jul 2021 17:59:22 +0000 (19:59 +0200)]
Nessie: Properly format code in Nessie module (#2733)
Eduard Tudenhöfner [Fri, 2 Jul 2021 12:04:29 +0000 (14:04 +0200)]
Docs: Describe available Benchmarks and how to run them (#2767)
Karuppayya [Fri, 2 Jul 2021 00:25:48 +0000 (17:25 -0700)]
Spark: RemoveReachableFiles action should fail if GC is disabled (#2763)
Co-authored-by: Karuppayya Rajendran <karuppayya.rajendran@apple.com>
Ada Wong [Fri, 2 Jul 2021 00:22:39 +0000 (08:22 +0800)]
Docs: Fix typo in flink.md (#2772)
Ryan Blue [Thu, 1 Jul 2021 20:36:51 +0000 (13:36 -0700)]
Docs: Update for mkdocs 1.2 (#2747)
* Docs: Fix mkdocs use_directory_urls in 1.2.
* Fix broken links and update redirects.
Samarth Jain [Thu, 1 Jul 2021 17:00:48 +0000 (10:00 -0700)]
Spark: Add limited support for vectorized reads for Parquet V2 (#2749)
With this change, we have added support for Parquet data written in V2 format.
The only data encodings we support are dictionary and plain.
Vectorized reads against data written using Delta/RLE and other encodings are
not supported. Note that, as of this commit, Spark's own Parquet vectorized reads don't
support such encodings either.
Eduard Tudenhöfner [Thu, 1 Jul 2021 16:38:36 +0000 (18:38 +0200)]
Docs: Describe how to configure Code formatter for IntelliJ IDEA (#2766)