incubator-iceberg.git (master)
mccheah [Fri, 14 Dec 2018 18:04:17 +0000 (10:04 -0800)] 
Allow custom hadoop properties to be loaded in the Spark data source (#7)

Properties that start with iceberg.hadoop are copied into the Hadoop Configuration used in the Spark source. These may be set in table properties or in read and write options passed to the Spark operation. Read and write options take precedence over the table properties.

Other Iceberg integrations should also support these custom Hadoop properties in subsequent patches.
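
The precedence rule in this commit message can be sketched as follows. This is a minimal illustration, not the patch itself: plain maps stand in for the Hadoop Configuration, the class and method names are hypothetical, and stripping the `iceberg.hadoop.` prefix (with a trailing dot) before copying each key is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class HadoopPropertyMergeSketch {
  // Assumed prefix; the commit message only says keys "start with iceberg.hadoop".
  static final String PREFIX = "iceberg.hadoop.";

  /** Returns a copy of base with prefixed table properties, then prefixed options, applied. */
  static Map<String, String> merge(Map<String, String> base,
                                   Map<String, String> tableProperties,
                                   Map<String, String> options) {
    Map<String, String> conf = new HashMap<>(base);
    copyPrefixed(tableProperties, conf); // lower precedence
    copyPrefixed(options, conf);         // read/write options overwrite table properties
    return conf;
  }

  private static void copyPrefixed(Map<String, String> source, Map<String, String> target) {
    for (Map.Entry<String, String> entry : source.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        // Assumption: the prefix is removed so the remainder is the Hadoop key.
        target.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
  }
}
```

Applying the table properties first and the options second is what gives the options the final word when both define the same key.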

Ryan Blue [Thu, 13 Dec 2018 21:18:37 +0000 (13:18 -0800)] 
Fix commit retry with manifest lists. (#48)

A manifest list is created for every commit attempt. Before this update,
the same file was used, which caused retries to fail trying to create
the same list file. This uses a new location for every manifest list,
keeps track of old lists, and cleans up unused lists after a commit
succeeds.
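
A rough sketch of the bookkeeping described above: each attempt gets a fresh location, every location is remembered, and the lists that did not end up in the committed snapshot are reported for cleanup. The class, the method names, and the `snap-<id>-<attempt>.avro` naming scheme are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ManifestListTrackerSketch {
  private final String metadataDir;
  private final long snapshotId;
  private final AtomicInteger attempt = new AtomicInteger(0);
  private final List<String> written = new ArrayList<>();

  ManifestListTrackerSketch(String metadataDir, long snapshotId) {
    this.metadataDir = metadataDir;
    this.snapshotId = snapshotId;
  }

  /** Returns a new, unique manifest list location for the next commit attempt. */
  String nextLocation() {
    String location = String.format(
        "%s/snap-%d-%d.avro", metadataDir, snapshotId, attempt.incrementAndGet());
    written.add(location);
    return location;
  }

  /** After a successful commit, returns every tracked list except the committed one. */
  List<String> unusedAfterCommit(String committedLocation) {
    List<String> unused = new ArrayList<>(written);
    unused.remove(committedLocation);
    return unused;
  }
}
```

Because a retry never reuses a path, creating the list file cannot collide with a file left behind by an earlier attempt.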

Ryan Blue [Thu, 13 Dec 2018 16:40:17 +0000 (08:40 -0800)] 
Fix type handling in Spark and Pig. (#49)

Ryan Blue [Wed, 12 Dec 2018 21:50:01 +0000 (13:50 -0800)] 
Do not scan manifests with no deletes when expiring. (#46)

mccheah [Tue, 11 Dec 2018 20:36:59 +0000 (12:36 -0800)] 
Spark: Support custom data location (#6)

This adds a new table property, write.folder-storage.path, that controls the location of new data files.

mccheah [Tue, 11 Dec 2018 17:14:45 +0000 (09:14 -0800)] 
Add pluggable file I/O submodule in TableOperations (#14)

This adds FileIO that is returned by TableOperations and used to delete paths and to create InputFile and OutputFile instances. FileIO is Serializable so that it can be sent to tasks running in different JVMs and used for all file-related tasks for a table.

Ryan Blue [Mon, 10 Dec 2018 17:56:38 +0000 (09:56 -0800)] 
Add Javadoc link.

Ryan Blue [Mon, 10 Dec 2018 17:35:15 +0000 (09:35 -0800)] 
Update to Spark 2.4 (#30)

* Update to the Spark 2.4 API.
* Remove ORC support from iceberg-spark.
* Use Filter instead of Expression.

Ryan Blue [Mon, 10 Dec 2018 17:18:50 +0000 (09:18 -0800)] 
Update community doc with repo location.

Ryan Blue [Mon, 10 Dec 2018 17:16:05 +0000 (09:16 -0800)] 
Add doc about hidden partitioning.

Ratandeep Ratti [Thu, 6 Dec 2018 17:49:59 +0000 (09:49 -0800)] 
Return FileAppender from Avro.WriteBuilder, not a package-private implementation. (#27)

Ryan Blue [Thu, 6 Dec 2018 17:45:39 +0000 (09:45 -0800)] 
Update ExpireSnapshots to avoid commit when no snapshots are removed. (#22)

Ryan Blue [Thu, 6 Dec 2018 01:20:06 +0000 (17:20 -0800)] 
Add manifest listing files (#21)

* Add ManifestFile and migrate Snapshot to return it.
* Optionally write manifest lists to separate files.
    This adds a new table property, write.manifest-lists.enabled, that
    defaults to false. When enabled, new snapshot manifest lists will be
    written into separate files. The file location will be stored in the
    snapshot metadata as "manifest-list".
* Aggregate partition field summaries when writing manifests.
* Add InclusiveManifestEvaluator.
    This expression evaluator determines whether a manifest needs to be
    scanned or whether it cannot contain data files matching a partition
    predicate.
* Add file length to ManifestFile.
* Ensure files in manifest lists have helpful metadata.
    This modifies SnapshotUpdate when writing a snapshot with a manifest
    list file. If files for the manifest list do not have full metadata,
    then this will scan the manifests to add metadata, including snapshot
    ID, added/existing/deleted count, and partition field summaries.
* Add partitions name mapping when reading Snapshot manifest list.
* Update ScanSummary and FileHistory to use ManifestFile metadata.
    This optimizes ScanSummary and FileHistory to ignore manifests that
    cannot have changes in the configured time range.
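
The idea behind an inclusive manifest evaluator, as described in the list above, can be sketched with a single integer partition field. The types and method names here are hypothetical stand-ins, and the sketch assumes the partition field summary is a lower and upper bound for that field across the manifest; it is not the InclusiveManifestEvaluator implementation.

```java
class ManifestPruningSketch {
  /** Hypothetical summary of one partition field: the smallest and largest value in a manifest. */
  static final class FieldSummary {
    final int lower;
    final int upper;

    FieldSummary(int lower, int upper) {
      this.lower = lower;
      this.upper = upper;
    }
  }

  /** "field == value" might match only if value lies inside [lower, upper]. */
  static boolean mightContainEq(FieldSummary summary, int value) {
    return summary.lower <= value && value <= summary.upper;
  }

  /** "field > value" might match only if the largest value in the manifest exceeds value. */
  static boolean mightContainGt(FieldSummary summary, int value) {
    return summary.upper > value;
  }
}
```

The checks are inclusive: a `true` result means the manifest must be scanned, not that it is guaranteed to contain a match, while `false` means it can be skipped safely.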

Ryan Blue [Wed, 5 Dec 2018 19:53:06 +0000 (11:53 -0800)] 
Store multiple partition specs in table metadata. (#3)

The purpose of this change is to enable future partition spec changes
and to assign IDs to specs that can be easily encoded in an Avro file
that tracks a snapshot's manifests.

This updates TableMetadata and the metadata parser to support multiple
partition specs. This change is forward-compatible for older readers
because the "partition-spec" field in table metadata is still set to the
default spec.

Multiple specs are now stored in an array in table metadata called
"partition-specs". Each entry in the array is an object with two fields,
a "spec-id" field with an integer ID value, and a "partition-spec"
field with a partition spec value (an array of partition fields). This
also adds "default-spec-id" that points to the spec that should be used
when writing.

mccheah [Wed, 5 Dec 2018 16:43:38 +0000 (08:43 -0800)] 
Set derby log system property earlier (#19)

Otherwise an empty hive/derby.log file is created when running the tests.

Ryan Blue [Wed, 5 Dec 2018 01:02:33 +0000 (17:02 -0800)] 
Fix Javadoc link in README.

Ryan Blue [Wed, 28 Nov 2018 23:31:26 +0000 (15:31 -0800)] 
Remove filter and iterable methods from Snapshot. (#17)

These are not used.

Ryan Blue [Wed, 28 Nov 2018 23:30:53 +0000 (15:30 -0800)] 
Support dateCreated expressions in ScanSummary. (#2)

mccheah [Tue, 27 Nov 2018 19:22:16 +0000 (11:22 -0800)] 
Allow custom FileSystem logic in HadoopTableOperations. (#15)

mccheah [Mon, 26 Nov 2018 20:14:54 +0000 (12:14 -0800)] 
Fix gitignore for IDEA project files. (#5)

Ryan Blue [Mon, 26 Nov 2018 17:37:44 +0000 (09:37 -0800)] 
Allow Tables implementations to override table paths. (#1)

Ryan Blue [Tue, 20 Nov 2018 17:16:23 +0000 (09:16 -0800)] 
Docs: Update mailing lists and add Travis CI badge to README.

Ryan Blue [Mon, 19 Nov 2018 22:02:26 +0000 (14:02 -0800)] 
Update README for Apache.

Ryan Blue [Mon, 19 Nov 2018 21:31:41 +0000 (13:31 -0800)] 
Update headers and licensing for Apache.

* Use Apache headers
* Add check-licenses script from Apache Spark
* Add license headers to markdown files
* Remove javadocs, which will be published on the site

Ryan Blue [Mon, 19 Nov 2018 19:21:00 +0000 (11:21 -0800)] 
Add Apache site.

Ryan Blue [Tue, 13 Nov 2018 19:35:16 +0000 (11:35 -0800)] 
Add snapshot timestamp filtering to ScanSummary.

Ryan Blue [Tue, 13 Nov 2018 19:28:01 +0000 (11:28 -0800)] 
Add accessor for data timestamp to ScanSummary.PartitionMetrics.

Ryan Blue [Tue, 13 Nov 2018 17:58:43 +0000 (09:58 -0800)] 
Add task dependency on shadowJar to install.

This fixes #96.

Ryan Blue [Mon, 12 Nov 2018 21:29:23 +0000 (13:29 -0800)] 
Update TableScan.select to select data columns, not manifest columns. (#95)

* Update TableScan.select to select data columns, not manifest columns.
* Add TableScan#project to set a projection without select.
* Add projection schema to scan events.

mccheah [Mon, 12 Nov 2018 21:14:24 +0000 (13:14 -0800)] 
Apply the idea plugin for all projects. (#70)

Ryan Blue [Sat, 10 Nov 2018 01:31:01 +0000 (17:31 -0800)] 
Add ScanEvent and Listeners.

Ryan Blue [Fri, 9 Nov 2018 21:02:13 +0000 (13:02 -0800)] 
Add optional number of retries when refreshing metadata.

Ryan Blue [Fri, 9 Nov 2018 20:49:18 +0000 (12:49 -0800)] 
Add exception trace when logging a task retry.

Ryan Blue [Wed, 7 Nov 2018 20:56:50 +0000 (12:56 -0800)] 
Cache results in schema and partition spec parsers. (#89)

Ryan Blue [Thu, 8 Nov 2018 21:32:23 +0000 (13:32 -0800)] 
Add ignoreDeleted to ManifestGroup.

Ryan Blue [Wed, 7 Nov 2018 19:49:37 +0000 (11:49 -0800)] 
Do not filter data files if filter is null or always true.

Ryan Blue [Wed, 7 Nov 2018 19:25:06 +0000 (11:25 -0800)] 
Fix NPE in ScanSummary.

Ryan Blue [Wed, 7 Nov 2018 18:08:10 +0000 (10:08 -0800)] 
Add ManifestGroup to scan a set of manifests.

This allows filtering files using a data filter to prune based on
partition and statistics, as well as a file filter to prune based on
data file properties like location or row count.

FileHistory and ScanSummary have been updated to use ManifestGroup.

Ryan Blue [Wed, 7 Nov 2018 17:36:42 +0000 (09:36 -0800)] 
Add FilteredManifest.entries to get filtered manifest entries.

Ryan Blue [Wed, 7 Nov 2018 17:35:33 +0000 (09:35 -0800)] 
Add factory method withNoopClose to CloseableIterable.

Ryan Blue [Wed, 7 Nov 2018 17:34:03 +0000 (09:34 -0800)] 
Use a concurrent queue when parallelizing manifest scans.

Ryan Blue [Wed, 7 Nov 2018 17:32:44 +0000 (09:32 -0800)] 
Implement StructLike in GenericDataFile.

This enables filtering data files with evaluators.

Ryan Blue [Tue, 6 Nov 2018 00:42:32 +0000 (16:42 -0800)] 
Update partition metrics toString.

Ryan Blue [Mon, 5 Nov 2018 23:56:17 +0000 (15:56 -0800)] 
Remove Closeable from TableScan interface. (#87)

TableScan should be immutable so it can be shared between threads. If
TableScan is Closeable, then it can't be shared because resources in use
from one thread can be closed by another when the scan is closed.

Instead, the iterables created by a scan should be closeable.
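
The ownership model described above can be illustrated roughly as follows: the scan object holds only immutable state and can be shared, while each iterable it hands out owns its own resources and is closed by the caller that requested it. All names are hypothetical stand-ins rather than the Iceberg interfaces, and the open-resource counter exists only to make the behavior observable.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ScanSketch {
  /** Counts iterables that are currently open; for demonstration only. */
  static final AtomicInteger OPEN = new AtomicInteger(0);

  /** Stand-in for a closeable iterable whose close() throws no checked exception. */
  interface ClosingIterable<T> extends Iterable<T>, Closeable {
    @Override
    void close();
  }

  /** Immutable scan: safe to share because it has nothing to close. */
  static final class Scan {
    private final List<String> files;

    Scan(List<String> files) {
      this.files = Collections.unmodifiableList(new ArrayList<>(files));
    }

    /** Each call returns a separate iterable that the caller must close. */
    ClosingIterable<String> planFiles() {
      OPEN.incrementAndGet();
      return new ClosingIterable<String>() {
        @Override
        public Iterator<String> iterator() {
          return files.iterator();
        }

        @Override
        public void close() {
          OPEN.decrementAndGet();
        }
      };
    }
  }
}
```

A caller would typically wrap `planFiles()` in try-with-resources, so closing one thread's iterable cannot affect resources in use by another thread sharing the same scan.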

Ryan Blue [Mon, 5 Nov 2018 22:27:03 +0000 (14:27 -0800)] 
Add system property to turn off parallel scan planning. (#86)

The system property is iceberg.scan.plan-in-worker-pool.
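
A minimal sketch of how such a switch might be read. Only the property name comes from the commit message; the helper class is hypothetical, and treating an unset property as "parallel planning enabled" is an assumption inferred from the phrase "turn off".

```java
class ScanPlanningConfigSketch {
  static final String PLAN_IN_WORKER_POOL = "iceberg.scan.plan-in-worker-pool";

  /** Returns true unless the system property is set to something other than "true". */
  static boolean planInWorkerPool() {
    String value = System.getProperty(PLAN_IN_WORKER_POOL);
    // Assumption: parallel planning is the default when the property is absent.
    return value == null || Boolean.parseBoolean(value);
  }
}
```

Under this sketch, passing `-Diceberg.scan.plan-in-worker-pool=false` on the JVM command line would disable parallel planning.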

Ryan Blue [Mon, 5 Nov 2018 22:26:11 +0000 (14:26 -0800)] 
Add total partition size to ScanSummary.

Ryan Blue [Mon, 5 Nov 2018 22:25:48 +0000 (14:25 -0800)] 
Avoid projecting all manifest columns in ScanSummary.

Ryan Blue [Mon, 5 Nov 2018 22:08:50 +0000 (14:08 -0800)] 
Add hard limit to ScanSummary.

When the limit is reached, throw IllegalStateException.

Ryan Blue [Mon, 5 Nov 2018 22:07:53 +0000 (14:07 -0800)] 
Close scans in ScanSummary.

Ryan Blue [Mon, 5 Nov 2018 18:55:36 +0000 (10:55 -0800)] 
Fix default value for parent snapshot ID in SnapshotUpdate.

Ryan Blue [Fri, 2 Nov 2018 17:51:33 +0000 (10:51 -0700)] 
Do not refresh tables in commit.

Refreshing table metadata in the commit method creates an error case
that is indistinguishable from a failed commit. This can be handled by
not deleting data that was written, but Iceberg cannot return an unknown
commit status to engines like Spark. Spark needs a clear success or
failure to return its success code that determines whether the job will
be retried.

Ryan Blue [Wed, 31 Oct 2018 20:08:40 +0000 (13:08 -0700)] 
Add snapshot ID and operation info log.

Ryan Blue [Wed, 31 Oct 2018 16:18:06 +0000 (09:18 -0700)] 
Fix GenericDataFile when partition is not projected.

Ryan Blue [Tue, 30 Oct 2018 23:38:49 +0000 (16:38 -0700)] 
Fix snapshot log order check causing flaky tests.

Ryan Blue [Tue, 30 Oct 2018 23:12:29 +0000 (16:12 -0700)] 
Add FileHistory helper.

Ryan Blue [Thu, 4 Oct 2018 16:57:27 +0000 (09:57 -0700)] 
Remove mapred exception from catch in HadoopOutputFile.

Ryan Blue [Wed, 31 Oct 2018 18:48:14 +0000 (11:48 -0700)] 
Add parent ID to snapshots. (#85)

Ryan Blue [Wed, 31 Oct 2018 18:36:37 +0000 (11:36 -0700)] 
Update ReplaceFiles to use MergingSnapshotUpdate. (#84)

This changes the implementation of ReplaceFiles. Previously,
ReplaceFiles used BaseReplaceFiles, which was only used by ReplaceFiles.
Now it uses MergingSnapshotUpdate, the same base class that is used for
deletes, merge appends, and overwrites.

The new implementation adds automatic merging when replacing files and
takes advantage of caching that makes retries much faster.

To use MergingSnapshotUpdate for ReplaceFiles, this adds a mode that
will fail when any specific paths to delete are not found in the table's
current manifests. Filtered manifests are cached and reused in this mode
by tracking the files that were deleted in a filtered manifest.

Parth Brahmbhatt [Tue, 30 Oct 2018 23:46:35 +0000 (16:46 -0700)] 
Add metadata file compression. (#79)

This adds support for reading and writing the metadata file using gzip compression when the file path ends in ".gz". Hadoop and metastore tables have been updated to write compressed metadata files when the Hadoop Configuration option iceberg.compress.metadata is set to true.
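
The path-based behavior can be sketched with in-memory streams. This is only an illustration under stated assumptions: the class and method names are hypothetical, byte arrays stand in for files, and the `iceberg.compress.metadata` option that decides whether a ".gz" path is chosen is not modeled. It uses `InputStream.readAllBytes`, which requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class MetadataCodecSketch {
  static boolean isCompressed(String path) {
    return path.endsWith(".gz");
  }

  /** Encodes metadata JSON, gzip-compressing it only when the path ends in ".gz". */
  static byte[] write(String path, String json) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (OutputStream out = isCompressed(path) ? new GZIPOutputStream(bytes) : bytes) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The stream is closed at this point, so the gzip trailer has been written.
    return bytes.toByteArray();
  }

  /** Decodes metadata JSON, decompressing only when the path ends in ".gz". */
  static String read(String path, byte[] data) {
    try (InputStream in = isCompressed(path)
        ? new GZIPInputStream(new ByteArrayInputStream(data))
        : new ByteArrayInputStream(data)) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Keying the codec on the file name means a reader needs no extra configuration to open either kind of metadata file.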

Daniel Weeks [Thu, 25 Oct 2018 22:05:16 +0000 (15:05 -0700)] 
Update the runtime jar dependencies to not pull in hadoop. Add hadoop version variable (#64)

Omer van Kloeten [Mon, 22 Oct 2018 16:30:48 +0000 (19:30 +0300)] 
Add gitter link to README (#74)

Having people know a gitter channel exists is a good way to get them to join :)

mccheah [Thu, 18 Oct 2018 21:43:53 +0000 (14:43 -0700)] 
Use nebula publishing and git version (#69)

Ryan Blue [Thu, 18 Oct 2018 20:48:41 +0000 (13:48 -0700)] 
Spark: Fix NPE in PartitionKey accessors. (#73)

Atul Felix Payapilly [Thu, 18 Oct 2018 16:44:40 +0000 (09:44 -0700)] 
Fix iceberg-spark and iceberg-pig descriptions in README.md (#68)

The iceberg-spark and iceberg-pig module descriptions were mixed up.

Parth Brahmbhatt [Thu, 4 Oct 2018 21:36:31 +0000 (14:36 -0700)] 
Update the ReplaceFiles implementation to allow replacing a data file that is duplicated in manifests.

Ryan Blue [Wed, 3 Oct 2018 18:29:30 +0000 (11:29 -0700)] 
Update version to 0.3.1 for iceberg-presto-runtime.

Ryan Blue [Wed, 3 Oct 2018 18:28:52 +0000 (11:28 -0700)] 
Update README module descriptions.

Parth Brahmbhatt [Wed, 3 Oct 2018 18:23:14 +0000 (11:23 -0700)] 
Add descriptions of the iceberg-hive and iceberg-presto modules.

Parth Brahmbhatt [Tue, 2 Oct 2018 22:39:50 +0000 (15:39 -0700)] 
Move off of Guava's emptyIterator to Collections.emptyIterator(), because a Guava version mismatch results in failures with exceptions like: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class com.netflix.iceberg.avro.ValueReaders

Parth Brahmbhatt [Mon, 1 Oct 2018 23:57:37 +0000 (16:57 -0700)] 
Add a Presto runtime jar that shades dependencies that conflict with Presto. We still need to see whether we can exclude the Hadoop dependencies that are being brought in by both iceberg-core and hive-standalone-metastore, so that we can avoid specifying a laundry list of Hadoop packages.

Ryan Blue [Tue, 2 Oct 2018 17:11:44 +0000 (10:11 -0700)] 
Fix Spark literal conversion in In expressions.

Ryan Blue [Fri, 28 Sep 2018 19:45:37 +0000 (12:45 -0700)] 
Allow IcebergSource subclasses to access SparkSession.

Ryan Blue [Fri, 28 Sep 2018 19:21:51 +0000 (12:21 -0700)] 
Remove public constructor for Expressions.

Ryan Blue [Sun, 30 Sep 2018 23:38:17 +0000 (16:38 -0700)] 
Update LICENSE.

Parth Brahmbhatt [Sun, 30 Sep 2018 23:30:07 +0000 (16:30 -0700)] 
Add iceberg-hive to track Iceberg tables in the Hive Metastore (#60)

Ryan Blue [Fri, 28 Sep 2018 19:06:34 +0000 (12:06 -0700)] 
Use filtered manifests from cache in snapshot updates.

Ryan Blue [Fri, 28 Sep 2018 16:37:17 +0000 (09:37 -0700)] 
Fix float to double promotion in Parquet readers.

Ryan Blue [Wed, 26 Sep 2018 18:56:34 +0000 (11:56 -0700)] 
Update README.md for 0.3.0.

Ryan Blue [Wed, 26 Sep 2018 18:46:05 +0000 (11:46 -0700)] 
Update Javadoc for 0.3.0.

Ryan Blue [Wed, 26 Sep 2018 18:41:42 +0000 (11:41 -0700)] 
Fix nebula plugin and Travis CI.

Ryan Blue [Wed, 26 Sep 2018 18:25:37 +0000 (11:25 -0700)] 
Bump version to 0.3.0.

Owen O'Malley [Wed, 26 Sep 2018 18:23:26 +0000 (11:23 -0700)] 
Update Spark dependency to 2.3.2. (#62)

Daniel Weeks [Wed, 26 Sep 2018 17:49:19 +0000 (10:49 -0700)] 
Exclude Hadoop from iceberg-runtime dependencies. (#63)

Update the runtime jar dependencies so that they do not pull in Hadoop. Add a Hadoop version variable.

Ryan Blue [Wed, 26 Sep 2018 17:24:02 +0000 (10:24 -0700)] 
Add toString to ScanSummary.PartitionMetrics.

Ryan Blue [Fri, 21 Sep 2018 17:44:31 +0000 (10:44 -0700)] 
Add tests for replace partitions with unpartitioned tables.

Ryan Blue [Mon, 24 Sep 2018 17:15:22 +0000 (10:15 -0700)] 
Fix delete operation bug.

Deletes were un-deleting files in manifests that were marked as deleted
because the original code assumed deleted files would be removed before
returning.

Ryan Blue [Fri, 21 Sep 2018 23:48:39 +0000 (16:48 -0700)] 
Fix TableScanIterable next when called without hasNext.

Ryan Blue [Fri, 21 Sep 2018 20:41:36 +0000 (13:41 -0700)] 
Add ScanSummary.

Ryan Blue [Fri, 21 Sep 2018 18:47:23 +0000 (11:47 -0700)] 
Lazily load file status to avoid remote calls.

Ryan Blue [Fri, 21 Sep 2018 18:15:57 +0000 (11:15 -0700)] 
Fix SparkExpressions.convert recursion with nulls.

Conversion returns null to signal that the expression is not supported,
but in recursive calls it wasn't checking that the wrapped expression
was converted to null. This caused some expressions like not(isnan(d))
to be converted and pushed to Iceberg even though they are not
supported. Iceberg later hit an NPE because an expression was null.

This also checks for null expressions in the Expressions factory
methods.
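
The bug pattern and its fix can be shown on a toy expression tree. Everything here is a hypothetical stand-in for illustration (string output instead of Iceberg expressions, a two-operator language), not the SparkExpressions code: the point is only that a wrapper such as `not` must check whether converting its child returned null before building its own result.

```java
class ConvertSketch {
  /** Toy source expression: an operator name and an optional child. */
  static final class Expr {
    final String op;
    final Expr child;

    Expr(String op, Expr child) {
      this.op = op;
      this.child = child;
    }
  }

  /** Returns the converted expression, or null if any part of it is unsupported. */
  static String convert(Expr expr) {
    switch (expr.op) {
      case "isnull":
        return "is_null";
      case "not": {
        String child = convert(expr.child);
        if (child == null) {
          // Propagate "unsupported" instead of wrapping a null child.
          return null;
        }
        return "not(" + child + ")";
      }
      default:
        // Unsupported operators (for example "isnan" in this sketch) signal null.
        return null;
    }
  }
}
```

Without the null check in the `not` branch, `not(isnan(d))` would be converted into an expression wrapping null, which is the situation the commit message describes.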

Ryan Blue [Mon, 17 Sep 2018 20:52:11 +0000 (13:52 -0700)] 
Fix manifest file names correctness bug.

When manifest files were written in parallel, multiple threads could use
the same manifest file name because the threads were using the manifest
cache sizes to create the next file name. Instead, this now uses an
atomic counter to ensure all file names are unique.
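
A minimal sketch of the atomic-counter approach. The class name and the `<commitId>-m<N>.avro` naming pattern are hypothetical; the technique is simply that `AtomicInteger.getAndIncrement()` hands each caller a distinct number even when many threads ask at once.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ManifestNameSketch {
  private final String commitId;
  private final AtomicInteger manifestCount = new AtomicInteger(0);

  ManifestNameSketch(String commitId) {
    this.commitId = commitId;
  }

  /** Returns a file name that is unique within this instance, even under concurrent calls. */
  String nextName() {
    return commitId + "-m" + manifestCount.getAndIncrement() + ".avro";
  }
}
```

Deriving the number from a cache size, by contrast, lets two threads read the same size before either adds its entry, which is how duplicate names arise.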

Ryan Blue [Wed, 19 Sep 2018 22:29:51 +0000 (15:29 -0700)] 
Implement replace table transaction. (#61)

Ryan Blue [Tue, 11 Sep 2018 21:22:34 +0000 (14:22 -0700)] 
Add Parquet file support to iceberg-data.

Ryan Blue [Mon, 10 Sep 2018 23:41:17 +0000 (16:41 -0700)] 
Add Iceberg generic object model for local reads. (#36)

* Add generic object model for Avro serialization.
* Add single-message encoders and decoders for generic data.
* Fix copyright headers for single-message encoding.
* Implement size and add copy to generic Record interface.
* Add IcebergGenerics.read(Table) for direct table reads using generics.

Ryan Blue [Mon, 10 Sep 2018 20:01:58 +0000 (13:01 -0700)] 
Add HasTableOperations to expose operations for creating input files.

Ryan Blue [Mon, 10 Sep 2018 20:02:45 +0000 (13:02 -0700)] 
Fix HadoopTables default constructor.

Ryan Blue [Mon, 10 Sep 2018 20:03:10 +0000 (13:03 -0700)] 
Fix overwrite test file metrics.

Ryan Blue [Sun, 9 Sep 2018 03:55:08 +0000 (20:55 -0700)] 
Add ReplacePartitions implementation.

This also adds validateAppendOnly to ReplacePartitions to fail any
attempt that would delete data from the table. This is used to ensure
that a write is append-only.

Ryan Blue [Sun, 9 Sep 2018 03:52:34 +0000 (20:52 -0700)] 
Add StructLikeWrapper to compare partition tuples.

This also adds size to the StructLike interface to implement the
comparison.

Ryan Blue [Sun, 9 Sep 2018 00:48:57 +0000 (17:48 -0700)] 
Add Overwrite operation.

This commit combines the implementations from StreamingDelete and
MergeAppend to create MergingSnapshotUpdate that will delete files,
append, and merge the final manifests. All 3 operations now use this
common base.

Overwrite is now supported by tables and in transactions.