parquet-mr.git
23 hours ago[maven-release-plugin] prepare for next development iteration master
Zoltan Ivanfi [Thu, 13 Dec 2018 19:45:06 +0000 (20:45 +0100)] 
[maven-release-plugin] prepare for next development iteration

23 hours ago[maven-release-plugin] prepare release apache-parquet-1.11.0 apache-parquet-1.11.0
Zoltan Ivanfi [Thu, 13 Dec 2018 19:44:42 +0000 (20:44 +0100)] 
[maven-release-plugin] prepare release apache-parquet-1.11.0

25 hours agoUpdate CHANGES.md for 1.11.0 release candidate 2.
Zoltan Ivanfi [Thu, 13 Dec 2018 18:34:26 +0000 (19:34 +0100)] 
Update CHANGES.md for 1.11.0 release candidate 2.

26 hours agoPARQUET-1476: Don't emit a warning message for files without new logical type (#577)
nandorKollar [Thu, 13 Dec 2018 17:28:11 +0000 (18:28 +0100)] 
PARQUET-1476: Don't emit a warning message for files without new logical type (#577)

27 hours agoPARQUET-1474: Less verbose and lower level logging for missing column/offset indexes...
Gabor Szadovszky [Thu, 13 Dec 2018 15:58:07 +0000 (16:58 +0100)] 
PARQUET-1474: Less verbose and lower level logging for missing column/offset indexes (#563)

30 hours agoPARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY (#562)
Gabor Szadovszky [Thu, 13 Dec 2018 13:37:41 +0000 (14:37 +0100)] 
PARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY (#562)

10 days agoPARQUET-1462: Allow specifying new development version in prepare-release.sh (#557) encryption
Zoltan Ivanfi [Tue, 4 Dec 2018 13:28:00 +0000 (14:28 +0100)] 
PARQUET-1462: Allow specifying new development version in prepare-release.sh (#557)

Before this change, prepare-release.sh only took the release version as a
parameter. The new development version was asked for interactively for each
individual pom.xml file, which made answering the prompts tedious.

3 weeks ago[maven-release-plugin] prepare for next development iteration
Zoltan Ivanfi [Fri, 23 Nov 2018 13:05:01 +0000 (14:05 +0100)] 
[maven-release-plugin] prepare for next development iteration

3 weeks ago[maven-release-plugin] prepare release apache-parquet-1.11.0
Zoltan Ivanfi [Fri, 23 Nov 2018 13:04:43 +0000 (14:04 +0100)] 
[maven-release-plugin] prepare release apache-parquet-1.11.0

3 weeks agoPARQUET-1258: Update scm developer connection to github
Zoltan Ivanfi [Fri, 23 Nov 2018 12:54:36 +0000 (13:54 +0100)] 
PARQUET-1258: Update scm developer connection to github

3 weeks agoPARQUET-1434: Update CHANGES.md for 1.11.0 release candidate 2.
Zoltan Ivanfi [Fri, 23 Nov 2018 12:03:34 +0000 (13:03 +0100)] 
PARQUET-1434: Update CHANGES.md for 1.11.0 release candidate 2.

3 weeks agoPARQUET-1461: Third party code does not compile after parquet-mr minor version update...
Gabor Szadovszky [Fri, 23 Nov 2018 11:59:50 +0000 (12:59 +0100)] 
PARQUET-1461: Third party code does not compile after parquet-mr minor version update (#556)

3 weeks ago[maven-release-plugin] prepare for next development iteration
Zoltan Ivanfi [Wed, 21 Nov 2018 17:39:42 +0000 (18:39 +0100)] 
[maven-release-plugin] prepare for next development iteration

3 weeks ago[maven-release-plugin] prepare release apache-parquet-1.11.0
Zoltan Ivanfi [Wed, 21 Nov 2018 17:39:31 +0000 (18:39 +0100)] 
[maven-release-plugin] prepare release apache-parquet-1.11.0

3 weeks agoPARQUET-1434: Update CHANGES.md for 1.11.0 release.
Nandor Kollar [Wed, 21 Nov 2018 12:51:00 +0000 (13:51 +0100)] 
PARQUET-1434: Update CHANGES.md for 1.11.0 release.

3 weeks agoRevert "Experiment."
Zoltan Ivanfi [Wed, 21 Nov 2018 17:27:42 +0000 (18:27 +0100)] 
Revert "Experiment."

This reverts commit 97a880cfc4fc3c2c74ff1302bc6e4aab1582b6df.

3 weeks agoExperiment.
Zoltan Ivanfi [Fri, 26 Oct 2018 13:08:18 +0000 (15:08 +0200)] 
Experiment.

3 weeks agoPARQUET-1460: Fix javadoc errors and include javadoc checking in Travis checks (...
Gabor Szadovszky [Wed, 21 Nov 2018 16:12:00 +0000 (17:12 +0100)] 
PARQUET-1460: Fix javadoc errors and include javadoc checking in Travis checks (#554)

3 weeks agoPARQUET-1407: Avro: Fix binary values returned from dictionary encoding (#552)
nandorKollar [Mon, 19 Nov 2018 22:07:55 +0000 (23:07 +0100)] 
PARQUET-1407: Avro: Fix binary values returned from dictionary encoding (#552)

* PARQUET-1407: Add test case for PARQUET-1407 to demonstrate the issue
* PARQUET-1407: Fix binary values from dictionary encoding.

Closes #551.

3 weeks agoPARQUET-1456: Use page index, ParquetFileReader throw ArrayIndexOutOfBoundsException...
Gabor Szadovszky [Mon, 19 Nov 2018 12:18:28 +0000 (13:18 +0100)] 
PARQUET-1456: Use page index, ParquetFileReader throw ArrayIndexOutOfBoundsException (#548)

The usage of static caching in the page index implementation did not allow using multiple readers at the same time.

3 weeks agoPARQUET-1365: Don't write page level statistics (#549)
Gabor Szadovszky [Mon, 19 Nov 2018 12:15:39 +0000 (13:15 +0100)] 
PARQUET-1365: Don't write page level statistics (#549)

Page level statistics were never used in production and became pointless after adding column indexes.

4 weeks agoPARQUET-1435: Benchmark filtering column-indexes (#536)
Gabor Szadovszky [Mon, 12 Nov 2018 10:13:31 +0000 (11:13 +0100)] 
PARQUET-1435: Benchmark filtering column-indexes (#536)

5 weeks agoPARQUET-1414: Simplify next row count check calculation (#537)
Gabor Szadovszky [Thu, 8 Nov 2018 17:23:32 +0000 (18:23 +0100)] 
PARQUET-1414: Simplify next row count check calculation (#537)

5 weeks agoPARQUET-1452: Deprecate old logical types API (#535)
nandorKollar [Wed, 7 Nov 2018 11:31:43 +0000 (12:31 +0100)] 
PARQUET-1452: Deprecate old logical types API (#535)

5 weeks agoPARQUET-1305: Backward incompatible change introduced in 1.8 (#483)
nandorKollar [Wed, 7 Nov 2018 10:19:10 +0000 (11:19 +0100)] 
PARQUET-1305: Backward incompatible change introduced in 1.8 (#483)

5 weeks agoPARQUET-1414: Limit page size based on maximum row count (#531)
Gabor Szadovszky [Wed, 7 Nov 2018 08:44:06 +0000 (09:44 +0100)] 
PARQUET-1414: Limit page size based on maximum row count (#531)

8 weeks agoPARQUET-1201: Column indexes (#527) 570/head
Gabor Szadovszky [Thu, 18 Oct 2018 12:08:13 +0000 (14:08 +0200)] 
PARQUET-1201: Column indexes (#527)

This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch.

* PARQUET-1211: Column indexes: read/write API (#456)
* PARQUET-1212: Column indexes: Show indexes in tools (#479)
* PARQUET-1213: Column indexes: Limit index size (#480)
* PARQUET-1214: Column indexes: Truncate min/max values (#481)
* PARQUET-1364: Invalid row indexes for pages starting with nulls (#507)
* PARQUET-1310: Column indexes: Filtering (#509)
* PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515)
* PARQUET-1389: Improve value skipping at page synchronization (#514)
* PARQUET-1381: Fix missing endRecord after merging columnIndex

8 weeks agoPARQUET-1383: Parquet tools should indicate UTC parameter for time/timestamp types...
nandorKollar [Mon, 15 Oct 2018 13:06:58 +0000 (15:06 +0200)] 
PARQUET-1383: Parquet tools should indicate UTC parameter for time/timestamp types (#513)

2 months agoPARQUET-1440: Parquet-tools: Parse int32 or int64 decimal values to big decimals...
Ryan Gardner [Wed, 10 Oct 2018 15:54:54 +0000 (11:54 -0400)] 
PARQUET-1440: Parquet-tools: Parse int32 or int64 decimal values to big decimals with the proper scale (#530)

2 months agoPARQUET-1436: TimestampMicrosStringifier shows wrong microseconds for timestamps...
nandorKollar [Fri, 5 Oct 2018 09:16:42 +0000 (11:16 +0200)] 
PARQUET-1436: TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970 (#529)
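
The commit title does not say what caused the wrong output. A common source of this kind of bug is splitting a negative microsecond count with `/` and `%`, which round toward zero. The sketch below is an illustration under that assumption, not the code from the fix; `MicrosSplitSketch` and `split` are hypothetical names.

```java
/** Hypothetical helper for illustration; not part of parquet-mr. */
final class MicrosSplitSketch {
    private static final long MICROS_PER_SECOND = 1_000_000L;

    /**
     * Splits microseconds since the epoch into whole seconds and a
     * non-negative microsecond fraction, also for values before 1970.
     */
    static long[] split(long micros) {
        // floorDiv/floorMod round toward negative infinity, so the
        // fraction always stays in the range [0, 999_999].
        long seconds = Math.floorDiv(micros, MICROS_PER_SECOND);
        long fraction = Math.floorMod(micros, MICROS_PER_SECOND);
        return new long[] {seconds, fraction};
    }
}
```

With plain `%`, a value of -1 microsecond would yield a fraction of -1 instead of 999999 in the preceding second.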

2 months agoPARQUET-1388: Nanosecond precision time and timestamp - parquet-mr (#519)
nandorKollar [Thu, 4 Oct 2018 13:26:49 +0000 (15:26 +0200)] 
PARQUET-1388: Nanosecond precision time and timestamp - parquet-mr (#519)

2 months agoRevert "PARQUET-1353: Fix random data generator. (#504)" bloom-filter
Zoltan Ivanfi [Tue, 25 Sep 2018 12:38:00 +0000 (14:38 +0200)] 
Revert "PARQUET-1353: Fix random data generator. (#504)"

This reverts commit 1f79f9bd0ba61b8ec0bae1dec71ef7249d41eacd because of
concerns raised in the code review after the pull request was merged.

2 months agoPARQUET-1399: Move parquet-mr related code from parquet-format
Gabor Szadovszky [Wed, 22 Aug 2018 12:36:12 +0000 (14:36 +0200)] 
PARQUET-1399: Move parquet-mr related code from parquet-format

2 months agoMerge commit '344b56803fea37af84b9c01c9b6dcff586779683' into merge_PARQUET-1399
Gabor Szadovszky [Mon, 24 Sep 2018 11:57:52 +0000 (13:57 +0200)] 
Merge commit '344b56803fea37af84b9c01c9b6dcff586779683' into merge_PARQUET-1399

2 months agoPARQUET-1418: Run integration tests in Travis (#524)
Zoltan Ivanfi [Fri, 21 Sep 2018 15:08:39 +0000 (17:08 +0200)] 
PARQUET-1418: Run integration tests in Travis (#524)

2 months agoPARQUET-1421: InternalParquetRecordWriter logs debug messages at the INFO level ...
Zoltan Ivanfi [Thu, 20 Sep 2018 14:23:29 +0000 (16:23 +0200)] 
PARQUET-1421: InternalParquetRecordWriter logs debug messages at the INFO level (#526)

Reduced log level of said messages to DEBUG.

2 months agoPARQUET-1417: BINARY_AS_SIGNED_INTEGER_COMPARATOR fails with IOBE for the same arrays...
Volodymyr Vysotskyi [Thu, 20 Sep 2018 09:13:18 +0000 (12:13 +0300)] 
PARQUET-1417: BINARY_AS_SIGNED_INTEGER_COMPARATOR fails with IOBE for the same arrays with the different length (#522)

2 months agoPARQUET-1353: Fix random data generator. (#504)
Zoltan Ivanfi [Tue, 18 Sep 2018 11:38:56 +0000 (13:38 +0200)] 
PARQUET-1353: Fix random data generator. (#504)

The random data generator used for tests used to repeat the same value
over and over again.

3 months agoPARQUET-1410: Refactor modules to use the new logical type API (#520)
nandorKollar [Wed, 12 Sep 2018 12:14:20 +0000 (14:14 +0200)] 
PARQUET-1410: Refactor modules to use the new logical type API (#520)

3 months agoPARQUET-1381: Add merge blocks command to parquet-tools (#512)
Ekaterina Galieva [Tue, 11 Sep 2018 11:56:57 +0000 (04:56 -0700)] 
PARQUET-1381: Add merge blocks command to parquet-tools (#512)

The existing merge command in parquet-tools didn't merge row groups; it just placed them one after the other. This commit adds an API and a command option to merge small blocks into larger ones, up to a specified size limit.
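
The commit message does not describe the merging strategy. As a rough sketch of the idea only, the following assumes a simple greedy pass over block sizes in file order; `MergeBlocksSketch`, `mergeSizes` and `maxSize` are hypothetical names and not the parquet-tools API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of size-limited block merging. */
final class MergeBlocksSketch {

    /**
     * Greedily combines consecutive block sizes (assumed positive) so that
     * each merged block stays within maxSize where possible. A single block
     * that is already larger than maxSize is kept as it is.
     */
    static List<Long> mergeSizes(long[] blockSizes, long maxSize) {
        List<Long> merged = new ArrayList<>();
        long current = 0;
        for (long size : blockSizes) {
            // Close the current merged block if adding this one would overflow it.
            if (current > 0 && current + size > maxSize) {
                merged.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            merged.add(current);
        }
        return merged;
    }
}
```

For example, sizes 10, 20, 30, 50 and 5 with a limit of 60 end up as two blocks of 60 and 55.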

3 months agoPARQUET-1399: Move files to the module directory
Gabor Szadovszky [Wed, 22 Aug 2018 11:11:52 +0000 (13:11 +0200)] 
PARQUET-1399: Move files to the module directory

3 months ago[maven-release-plugin] prepare for next development iteration
Zoltan Ivanfi [Thu, 29 Mar 2018 13:47:20 +0000 (15:47 +0200)] 
[maven-release-plugin] prepare for next development iteration

3 months ago[maven-release-plugin] prepare release apache-parquet-format-2.5.0
Zoltan Ivanfi [Thu, 29 Mar 2018 13:47:01 +0000 (15:47 +0200)] 
[maven-release-plugin] prepare release apache-parquet-format-2.5.0

3 months agoRevert "[maven-release-plugin] prepare release apache-parquet-format-2.5.0"
Zoltan Ivanfi [Thu, 29 Mar 2018 13:44:08 +0000 (15:44 +0200)] 
Revert "[maven-release-plugin] prepare release apache-parquet-format-2.5.0"

This reverts commit a5b842613309a60b59d07af5d02a76c00e9ef2ac.

3 months ago[maven-release-plugin] prepare release apache-parquet-format-2.5.0
Zoltan Ivanfi [Thu, 29 Mar 2018 13:24:10 +0000 (15:24 +0200)] 
[maven-release-plugin] prepare release apache-parquet-format-2.5.0

3 months agoPARQUET-1258: Update scm developer connection to github (#90)
Gabor Szadovszky [Wed, 28 Mar 2018 13:57:37 +0000 (15:57 +0200)] 
PARQUET-1258: Update scm developer connection to github (#90)

After moving to gitbox the old apache repo is not working anymore.
The pom.xml had to be updated accordingly.

3 months agoPARQUET-1236: Align version of slf4j-api
1028332163 [Wed, 21 Mar 2018 15:26:58 +0000 (16:26 +0100)] 
PARQUET-1236: Align version of slf4j-api

https://issues.apache.org/jira/browse/PARQUET-1236

Author: 1028332163 <1028332163@qq.com>

Closes #85 from PandaMonkey/master and squashes the following commits:

158f082 [1028332163] align version of slf4j-api

3 months agoPARQUET-1201: Implement page indexes
Gabor Szadovszky [Tue, 13 Feb 2018 16:08:44 +0000 (17:08 +0100)] 
PARQUET-1201: Implement page indexes

Added helper methods to read/write ColumnIndex and OffsetIndex objects.

Author: Gabor Szadovszky <gabor.szadovszky@cloudera.com>

Closes #81 from gszadovszky/PARQUET-1201 and squashes the following commits:

573dada [Gabor Szadovszky] PARQUET-1201: Implement page indexes

3 months agoPARQUET-1197: Log rat failures
Gabor Szadovszky [Thu, 18 Jan 2018 16:05:11 +0000 (17:05 +0100)] 
PARQUET-1197: Log rat failures

Author: Gabor Szadovszky <gabor.szadovszky@cloudera.com>

Closes #80 from gszadovszky/PARQUET-1197 and squashes the following commits:

c97db9d [Gabor Szadovszky] PARQUET-1197: Log rat failures

3 months agoPARQUET-1145: Add license to .gitignore
Lars Volker [Mon, 13 Nov 2017 12:56:08 +0000 (13:56 +0100)] 
PARQUET-1145: Add license to .gitignore

Also removes .gitignore from the RAT whitelist.

Author: Lars Volker <lv@cloudera.com>

Closes #75 from lekv/license and squashes the following commits:

04523ef [Lars Volker] Also add license to .travis.yml
ce471fd [Lars Volker] PARQUET-1145: Add license to .gitignore

3 months ago[maven-release-plugin] prepare for next development iteration
Ryan Blue [Tue, 17 Oct 2017 19:25:34 +0000 (12:25 -0700)] 
[maven-release-plugin] prepare for next development iteration

3 months ago[maven-release-plugin] prepare release apache-parquet-format-2.4.0
Ryan Blue [Tue, 17 Oct 2017 19:25:18 +0000 (12:25 -0700)] 
[maven-release-plugin] prepare release apache-parquet-format-2.4.0

3 months agoPARQUET-1144: Remove slf4j-nop.
Ryan Blue [Tue, 17 Oct 2017 19:21:05 +0000 (12:21 -0700)] 
PARQUET-1144: Remove slf4j-nop.

Author: Ryan Blue <blue@apache.org>

Closes #74 from rdblue/PARQUET-1144-remove-slf4j-nop and squashes the following commits:

d5d5639 [Ryan Blue] PARQUET-1144: Remove slf4j-nop.

3 months ago[maven-release-plugin] prepare for next development iteration
Ryan Blue [Tue, 17 Oct 2017 00:07:13 +0000 (17:07 -0700)] 
[maven-release-plugin] prepare for next development iteration

3 months ago[maven-release-plugin] prepare release apache-parquet-format-2.4.0
Ryan Blue [Tue, 17 Oct 2017 00:06:58 +0000 (17:06 -0700)] 
[maven-release-plugin] prepare release apache-parquet-format-2.4.0

3 months agoPARQUET-906: Add LogicalType annotation.
Ryan Blue [Tue, 10 Oct 2017 19:37:15 +0000 (12:37 -0700)] 
PARQUET-906: Add LogicalType annotation.

This commit adds a `LogicalType` union and a field for this logical type to `SchemaElement`. Adding a new structure for logical types is needed for a few reasons:

1. Adding to the ConvertedType enum is not forward-compatible. Adding new types to the `LogicalType` union is forward-compatible.
2. Using a struct for each type allows additional metadata, like `isAdjustedToUTC`, without adding more fields to `SchemaElement` that don't apply to all types.
3. Types without additional metadata can be updated later. For example, adding an `encoding` field to `StringType` when it is needed.

Author: Ryan Blue <blue@apache.org>

Closes #51 from rdblue/PARQUET-906-add-timestamp-adjustment-metadata and squashes the following commits:

ad8e91d [Ryan Blue] PARQUET-906: Clarify the use of NullType.
7cc29f7 [Ryan Blue] PARQUET-906: Rename NULL to UNKNOWN.
02f3868 [Ryan Blue] PARQUET-906: Update from comments on the PR.
c0386e9 [Ryan Blue] PARQUET-906: Remove NULL ConvertedType.
190bd8a [Ryan Blue] PARQUET-906: Update for review comments.
8203b21 [Ryan Blue] PARQUET-906: Add copyright header to LogicalTypes.
993102e [Ryan Blue] PARQUET-906: Remove the unreleased NULL ConvertedType.
86a22b4 [Ryan Blue] PARQUET-906: Add LogicalType annotation.

3 months agoPARQUET-1049: Make thrift version a property in pom.xml
Zoltan Ivanfi [Mon, 31 Jul 2017 16:23:20 +0000 (09:23 -0700)] 
PARQUET-1049: Make thrift version a property in pom.xml

Author: Zoltan Ivanfi <zi@cloudera.com>

Closes #57 from zivanfi/PARQUET-1049 and squashes the following commits:

8efc7a3 [Zoltan Ivanfi] PARQUET-1049: Make thrift version a property in pom.xml

3 months agoPARQUET-371: update thrift dependency to 0.9.3; do not shade slf4j
Julien Le Dem [Sat, 29 Jul 2017 23:30:22 +0000 (16:30 -0700)] 
PARQUET-371: update thrift dependency to 0.9.3; do not shade slf4j

Author: Julien Le Dem <julien@dremio.com>
Author: Julien Le Dem <julien@apache.org>

Closes #50 from julienledem/update_thrift and squashes the following commits:

f5db375 [Julien Le Dem] update travis
e30ad8f [Julien Le Dem] update thrift dependency; do not shade slf4j

3 months agoPARQUET-609: Add Brotli to parquet's thrift definition
Ryan Blue [Mon, 11 Jul 2016 18:00:45 +0000 (11:00 -0700)] 
PARQUET-609: Add Brotli to parquet's thrift definition

Author: Ryan Blue <blue@apache.org>

Closes #40 from rdblue/PARQUET-609-add-brotli and squashes the following commits:

061dcbc [Ryan Blue] PARQUET-609: Add Brotli compression to the format.
4eb1ff0 [Ryan Blue] PARQUET-608: Add thrift.executable property.

3 months agoPARQUET-450: Fix several typos in Parquet format documentation
Laurent Goujon [Fri, 29 Jan 2016 18:30:06 +0000 (10:30 -0800)] 
PARQUET-450: Fix several typos in Parquet format documentation

It also changes the parquet.thrift location to conform to the maven layout and adds a link to it from README.md

Author: Laurent Goujon <lgoujon@twitter.com>

Closes #36 from laurentgo/update-format-specification and squashes the following commits:

244c119 [Laurent Goujon] Fix several typos/errors in Parquet documentation
90a2be4 [Laurent Goujon] Fix thrift source path to match maven layout

3 months ago[maven-release-plugin] prepare for next development iteration
Ryan Blue [Mon, 14 Dec 2015 17:34:38 +0000 (09:34 -0800)] 
[maven-release-plugin] prepare for next development iteration

3 months ago[maven-release-plugin] prepare release apache-parquet-format-2.3.1
Ryan Blue [Mon, 14 Dec 2015 17:34:24 +0000 (09:34 -0800)] 
[maven-release-plugin] prepare release apache-parquet-format-2.3.1

3 months agoPARQUET-1390: Upgrade Arrow to 0.10.0
Andy Grove [Sun, 19 Aug 2018 09:12:36 +0000 (11:12 +0200)] 
PARQUET-1390: Upgrade Arrow to 0.10.0

This upgrades arrow from 0.8.0 to 0.10.0.

This required adding new SchemaConverter visitor methods for the fixedSizeBinary data type. I pretty much guessed at how to implement those, so I would appreciate a review of that.

Author: Andy Grove <andy.grove@rms.com>

Closes #516 from agrove-rms/arrow_upgrade and squashes the following commits:

4a922876 [Andy Grove] Add new visitor methods
9535a162 [Andy Grove] Upgrade Arrow from 0.8.0 to 0.10.0

4 months agoPARQUET-1368: ParquetFileReader should close its input stream for the failure in...
Hyukjin Kwon [Tue, 7 Aug 2018 16:35:38 +0000 (00:35 +0800)] 
PARQUET-1368: ParquetFileReader should close its input stream for the failure in constructor (#510)

4 months agoPARQUET-1371: Time/Timestamp UTC normalization parameter doesn't work (#511)
nandorKollar [Tue, 7 Aug 2018 15:56:42 +0000 (17:56 +0200)] 
PARQUET-1371: Time/Timestamp UTC normalization parameter doesn't work (#511)

5 months agoPARQUET-1335: Logical type names in parquet-mr are not consistent with parquet-format...
nandorKollar [Mon, 9 Jul 2018 08:10:24 +0000 (10:10 +0200)] 
PARQUET-1335: Logical type names in parquet-mr are not consistent with parquet-format (#503)

Add test case for STRING annotation and revert UTF8 annotations removed in PR#496

5 months agoPARQUET-1344: Type builders don't honor new logical types (#500)
nandorKollar [Wed, 4 Jul 2018 11:58:33 +0000 (13:58 +0200)] 
PARQUET-1344: Type builders don't honor new logical types (#500)

* PARQUET-1344: Type builders don't honor new logical types

Call the proper constructor when the builder is called with a new logical type,
and call the deprecated OriginalType version otherwise.

* Use static imports in test

5 months agoPARQUET-1341: Fix null count stats in unsigned-sort columns. (#499)
Ryan Blue [Tue, 3 Jul 2018 22:24:53 +0000 (15:24 -0700)] 
PARQUET-1341: Fix null count stats in unsigned-sort columns. (#499)

* Fix null count stats in unsigned-sort columns.
* Fix test case for old min/max values and unsigned ordering.

5 months agoPARQUET-1336: PrimitiveComparator should implements Serializable (#497) 495/head
Yuming Wang [Tue, 26 Jun 2018 07:38:23 +0000 (15:38 +0800)] 
PARQUET-1336: PrimitiveComparator should implements Serializable (#497)

5 months agoPARQUET-1335: Logical type names in parquet-mr are not consistent with parquet-format...
nandorKollar [Mon, 25 Jun 2018 06:24:15 +0000 (08:24 +0200)] 
PARQUET-1335: Logical type names in parquet-mr are not consistent with parquet-format (#496)

5 months agoPARQUET-952: Avro union with single type fails with 'is not a group' (#459)
nandorKollar [Mon, 18 Jun 2018 07:47:25 +0000 (09:47 +0200)] 
PARQUET-952: Avro union with single type fails with 'is not a group' (#459)

6 months agoPARQUET-1321: LogicalTypeAnnotation.LogicalTypeAnnotationVisitor#visit methods should...
nandorKollar [Tue, 12 Jun 2018 09:47:43 +0000 (11:47 +0200)] 
PARQUET-1321: LogicalTypeAnnotation.LogicalTypeAnnotationVisitor#visit methods should have a return value (#493)

6 months ago[PARQUET-1135][FOLLOW-UP] Update thrift and protoc version in README.md (#488)
Yuming Wang [Wed, 6 Jun 2018 09:14:13 +0000 (17:14 +0800)] 
[PARQUET-1135][FOLLOW-UP] Update thrift and protoc version in README.md (#488)

6 months agoPARQUET-1309: Parquet Java uses incorrect stats and dictionary filter properties...
Gabor Szadovszky [Tue, 5 Jun 2018 09:16:05 +0000 (11:16 +0200)] 
PARQUET-1309: Parquet Java uses incorrect stats and dictionary filter properties (#490)

6 months agoPARQUET-1317: Fix ParquetMetadataConverter throw NPE (#491)
nandorKollar [Mon, 4 Jun 2018 16:49:17 +0000 (18:49 +0200)] 
PARQUET-1317: Fix ParquetMetadataConverter throw NPE (#491)

New test case in TestParquetMetadataConverter to reproduce NPE and ensure backward compatibility

6 months agoPARQUET-1317: Fix ParquetMetadataConverter throw NPE (#489)
Yuming Wang [Mon, 4 Jun 2018 15:52:28 +0000 (23:52 +0800)] 
PARQUET-1317: Fix ParquetMetadataConverter throw NPE (#489)

6 months agoPARQUET-1311: Update README.md (#487)
nandorKollar [Mon, 4 Jun 2018 15:35:47 +0000 (17:35 +0200)] 
PARQUET-1311: Update README.md (#487)

parquet-mr documentation was not up to date:
- pointed to broken URLs
- instructed to install old Thrift version
- current version was stated as 1.8.1, although 1.10.0 is already released

6 months agoPARQUET-1304: Release 1.10 contains breaking changes for Hive (#485)
Gabor Szadovszky [Thu, 31 May 2018 14:38:43 +0000 (16:38 +0200)] 
PARQUET-1304: Release 1.10 contains breaking changes for Hive (#485)

6 months agoPARQUET-1253: Support for new logical type representation (#463)
nandorKollar [Thu, 24 May 2018 11:46:11 +0000 (13:46 +0200)] 
PARQUET-1253: Support for new logical type representation (#463)

6 months agoPARQUET-1294: Update release scripts for the new Apache policy (#475)
Gabor Szadovszky [Thu, 17 May 2018 10:56:15 +0000 (12:56 +0200)] 
PARQUET-1294: Update release scripts for the new Apache policy (#475)

7 months agoPARQUET-1296: Travis kills build after 10 minutes, because "no output was received"
nandorKollar [Tue, 15 May 2018 19:19:56 +0000 (21:19 +0200)] 
PARQUET-1296: Travis kills build after 10 minutes, because "no output was received"

Use pv to periodically print progress for Travis to prevent timeout due to lack of output.

7 months agoPARQUET-1297: SchemaConverter should not convert from Timestamp(TimeUnit.SECOND)...
Masayuki Takahashi [Sun, 13 May 2018 17:31:02 +0000 (02:31 +0900)] 
PARQUET-1297: SchemaConverter should not convert from Timestamp(TimeUnit.SECOND) and Timestamp(TimeUnit.NANOSECOND) of Arrow (#477)

Arrow's 'Timestamp' definition is below:
{
  "name" : "timestamp",
  "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
}
http://arrow.apache.org/docs/metadata.html

But Parquet only supports 'TIMESTAMP_MILLIS' and 'TIMESTAMP_MICROS'.
https://github.com/Apache/parquet-format/blob/master/LogicalTypes.md

Therefore SchemaConverter should not convert from Timestamp(TimeUnit.SECOND) and Timestamp(TimeUnit.NANOSECOND) of Arrow to Parquet.
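
The rule above can be sketched as a small mapping. This is an illustration only: `TimestampUnitSketch`, `ArrowUnit` and `toParquetAnnotation` are hypothetical names, and how the real SchemaConverter reports an unsupported unit is not shown in this message.

```java
import java.util.Optional;

/** Hypothetical sketch of the unit mapping; not the SchemaConverter code. */
final class TimestampUnitSketch {

    /** Stand-in for Arrow's time unit values listed in the message above. */
    enum ArrowUnit { SECOND, MILLISECOND, MICROSECOND, NANOSECOND }

    /**
     * Returns the matching Parquet timestamp annotation name, or empty when
     * Parquet (as described in this commit) has no equivalent.
     */
    static Optional<String> toParquetAnnotation(ArrowUnit unit) {
        switch (unit) {
            case MILLISECOND:
                return Optional.of("TIMESTAMP_MILLIS");
            case MICROSECOND:
                return Optional.of("TIMESTAMP_MICROS");
            default:
                // SECOND and NANOSECOND have no counterpart, so nothing is converted.
                return Optional.empty();
        }
    }
}
```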

Related:
https://issues.apache.org/jira/browse/PARQUET-1285

Author: Masayuki Takahashi <masayuki038@gmail.com>

7 months agoPARQUET-1293: Build failure when using Java 8 lambda expressions
Nandor Kollar [Wed, 9 May 2018 10:07:11 +0000 (12:07 +0200)] 
PARQUET-1293: Build failure when using Java 8 lambda expressions

7 months agoPARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND and...
Masayuki Takahashi [Mon, 7 May 2018 08:11:58 +0000 (17:11 +0900)] 
PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND and TimeUnit.NANOSECOND of Arrow (#469)

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Arrow's 'Time' definition is below:

{ "name" : "time", "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND", "bitWidth": /* integer: 32 or 64 */ }
http://arrow.apache.org/docs/metadata.html

But Parquet only supports 'TIME_MILLIS' and 'TIME_MICROS'.
https://github.com/Apache/parquet-format/blob/master/LogicalTypes.md

Therefore SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow to Parquet.

Author: Masayuki Takahashi <masayuki038@gmail.com>

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Since the import statements were collected, I restored them.

Author: Masayuki Takahashi <masayuki038@gmail.com>

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Remove unnecessary updates.

Author: Masayuki Takahashi <masayuki038@gmail.com>

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Remove unnecessary package name

Author: Masayuki Takahashi <masayuki038@gmail.com>

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Add a conversion pattern from Parquet's TIME_MICROS to Arrow's MICROSECOND

Author: Masayuki Takahashi <masayuki038@gmail.com>

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Fix to specify `expected` positions in assertEquals

Author: Masayuki Takahashi <masayuki038@gmail.com>

* PARQUET-1285: [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow

Add a test to convert from Parquet's TIME_MICROS to Arrow's MICROSECOND

Author: Masayuki Takahashi <masayuki038@gmail.com>

7 months agoPARQUET-968 Add Hive/Presto support in ProtoParquet
Constantin Muraru [Thu, 26 Apr 2018 12:48:08 +0000 (08:48 -0400)] 
PARQUET-968 Add Hive/Presto support in ProtoParquet

This PR adds Hive (https://github.com/apache/hive) and Presto (https://github.com/prestodb/presto) support for parquet messages written with ProtoParquetWriter. Hive and other tools, such as Presto (used by AWS Athena), rely on specific LIST/MAP wrappers (as defined in the parquet spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). These wrappers are currently missing from the ProtoParquet schema. AvroParquet works just fine, because it adds these wrappers when it deals with arrays and maps. This PR brings these wrappers in parquet-proto, providing the same functionality that already exists in parquet-avro.

This is backward compatible. Messages written without the extra LIST/MAP wrappers are still being read successfully using the updated ProtoParquetReader.

Regarding the change, given the following protobuf schema:

```
message ListOfPrimitives {
    repeated int64 my_repeated_id = 1;
}
```

Old parquet schema was:
```
message ListOfPrimitives {
  repeated int64 my_repeated_id = 1;
}
```

New parquet schema is:
```
message ListOfPrimitives {
  required group my_repeated_id (LIST) = 1 {
    repeated group list {
      required int64 element;
    }
  }
}
```
---

For list of messages, the changes look like this:

Protobuf schema:
```
message ListOfMessages {
    string top_field = 1;
    repeated MyInnerMessage first_array = 2;
}

message MyInnerMessage {
    int32 inner_field = 1;
}
```

Old parquet schema was:
```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  repeated group first_array = 2 {
    optional int32 inner_field = 1;
  }
}
```

The expected parquet schema, compatible with Hive (and similar to parquet-avro) is the following (notice the LIST wrapper):

```
message TestProto3.ListOfMessages {
  optional binary top_field (UTF8) = 1;
  required group first_array (LIST) = 2 {
    repeated group list {
      optional group element {
        optional int32 inner_field = 1;
      }
    }
  }
}
```

---

Similar for maps. Protobuf schema:
```
message TopMessage {
    map<int64, MyInnerMessage> myMap = 1;
}

message MyInnerMessage {
    int32 inner_field = 1;
}
```

Old parquet schema:
```
message TestProto3.TopMessage {
  repeated group myMap = 1 {
    optional int64 key = 1;
    optional group value = 2 {
      optional int32 inner_field = 1;
    }
  }
}
```

New parquet schema (notice the `MAP` wrapper):
```
message TestProto3.TopMessage {
  required group myMap (MAP) = 1 {
    repeated group key_value {
      required int64 key;
      optional group value {
        optional int32 inner_field = 1;
      }
    }
  }
}
```

Jira: https://issues.apache.org/jira/browse/PARQUET-968

Author: Constantin Muraru <cmuraru@adobe.com>
Author: Benoît Hanotte <BenoitHanotte@users.noreply.github.com>

Closes #411 from costimuraru/PARQUET-968 and squashes the following commits:

16eafcb6 [Benoît Hanotte] PARQUET-968 add proto flag to enable writing using specs-compliant schemas (#2)
a8bd7041 [Constantin Muraru] Pick up commit from @andredasilvapinto
5cf92487 [Constantin Muraru] PARQUET-968 Add Hive support in ProtoParquet

7 months agoPARQUET-1128: [Java] Upgrade the Apache Arrow version to 0.8.0 for SchemaConverter
Masayuki Takahashi [Sat, 21 Apr 2018 13:58:35 +0000 (14:58 +0100)] 
PARQUET-1128: [Java] Upgrade the Apache Arrow version to 0.8.0 for SchemaConverter

When I converted a parquet (1.9.1-SNAPSHOT) schema to arrow (0.4.0) with SchemaConverter, this exception was raised.
```
java.lang.NoClassDefFoundError: org/apache/arrow/vector/types/pojo/ArrowType$Struct_

at net.wrap_trap.parquet_arrow.ParquetToArrowConverter.convertToArrow(ParquetToArrowConverter.java:67)
at net.wrap_trap.parquet_arrow.ParquetToArrowConverter.convertToArrow(ParquetToArrowConverter.java:40)
at net.wrap_trap.parquet_arrow.ParquetToArrowConverterTest.parquetToArrowConverterTest(ParquetToArrowConverterTest.java:27)
```

The reason is that SchemaConverter refers to Apache Arrow 0.1.0.
I upgraded the Apache Arrow version to 0.8.0 (latest) for SchemaConverter.

Author: Masayuki Takahashi <masayuki038@gmail.com>

Closes #443 from masayuki038/PARQUET-1128 and squashes the following commits:

8ba47813 [Masayuki Takahashi] PARQUET-1128: [Java] Upgrade the Apache Arrow version to 0.8.0 for SchemaConverter
b80d793a [Masayuki Takahashi] PARQUET-1128: [Java] Upgrade the Apache Arrow version to 0.8.0 for SchemaConverter

8 months ago[maven-release-plugin] prepare for next development iteration
Ryan Blue [Thu, 5 Apr 2018 20:36:32 +0000 (13:36 -0700)] 
[maven-release-plugin] prepare for next development iteration

8 months ago[maven-release-plugin] prepare release apache-parquet-1.10.0 apache-parquet-1.10.0
Ryan Blue [Thu, 5 Apr 2018 20:36:12 +0000 (13:36 -0700)] 
[maven-release-plugin] prepare release apache-parquet-1.10.0

8 months agoPARQUET-1258: Update scm developer connection to github HTTPS.
Ryan Blue [Thu, 5 Apr 2018 20:17:12 +0000 (13:17 -0700)] 
PARQUET-1258: Update scm developer connection to github HTTPS.

8 months agoPARQUET-1264: Fix javadoc warnings for Java 8.
Ryan Blue [Thu, 5 Apr 2018 19:38:24 +0000 (12:38 -0700)] 
PARQUET-1264: Fix javadoc warnings for Java 8.

8 months agoPARQUET-1264: Fix javadoc 8 problem in VersionGenerator.
Ryan Blue [Sat, 31 Mar 2018 00:53:49 +0000 (17:53 -0700)] 
PARQUET-1264: Fix javadoc 8 problem in VersionGenerator.

8 months agoPARQUET-1264: Fix javadoc warnings for Java 8.
Ryan Blue [Sat, 31 Mar 2018 00:51:23 +0000 (17:51 -0700)] 
PARQUET-1264: Fix javadoc warnings for Java 8.

8 months agoPARQUET-1189: Update CHANGES.md for 1.10.0 release.
Ryan Blue [Fri, 30 Mar 2018 22:33:43 +0000 (15:33 -0700)] 
PARQUET-1189: Update CHANGES.md for 1.10.0 release.

8 months agoPARQUET-1263: If file has a config, use it for ParquetReadOptions. (#464)
Ryan Blue [Fri, 30 Mar 2018 22:31:01 +0000 (15:31 -0700)] 
PARQUET-1263: If file has a config, use it for ParquetReadOptions. (#464)

8 months agoPARQUET-1183: Add Avro builders using InputFile and OutputFile. (#460)
Ryan Blue [Fri, 30 Mar 2018 22:24:17 +0000 (15:24 -0700)] 
PARQUET-1183: Add Avro builders using InputFile and OutputFile. (#460)

* PARQUET-1183: Add Avro builders using InputFile and OutputFile.
* PARQUET-1183: Add deprecation warnings to Avro read builder.

Closes #446

8 months agoPARQUET-1258: Update scm developer connection to github (#462)
Gabor Szadovszky [Thu, 29 Mar 2018 15:39:57 +0000 (17:39 +0200)] 
PARQUET-1258: Update scm developer connection to github (#462)

8 months agoPARQUET-1246: Ignore float/double statistics in case of NaN
Gabor Szadovszky [Mon, 19 Mar 2018 13:43:12 +0000 (14:43 +0100)] 
PARQUET-1246: Ignore float/double statistics in case of NaN

Because of the ambiguous sort order of float/double values, the following changes were made to the read path of the related statistics:
- Ignore the statistics if they contain a NaN value.
- Use -0.0 as the min value and +0.0 as the max value, regardless of which 0.0 value was saved in the statistics.
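
The two rules above can be illustrated with a minimal sketch. This is not the parquet-mr implementation: the class `FloatStatsSketch`, the holder `Range`, and the method `sanitize` are hypothetical names chosen for illustration, and the sketch only covers `double` min/max values.

```java
import java.util.Optional;

public class FloatStatsSketch {
    /** Hypothetical holder for a min/max pair; not a parquet-mr class. */
    public static final class Range {
        public final double min;
        public final double max;

        Range(double min, double max) {
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns an empty Optional when either bound is NaN (the statistics are
     * ignored). Otherwise widens zero bounds: a zero min becomes -0.0 and a
     * zero max becomes +0.0, whichever zero was stored.
     */
    public static Optional<Range> sanitize(double min, double max) {
        if (Double.isNaN(min) || Double.isNaN(max)) {
            return Optional.empty();
        }
        // (x == 0.0) is true for both -0.0 and +0.0 under IEEE 754 comparison.
        double safeMin = (min == 0.0) ? -0.0 : min;
        double safeMax = (max == 0.0) ? 0.0 : max;
        return Optional.of(new Range(safeMin, safeMax));
    }
}
```

Widening the zero bounds keeps a filter on either `-0.0` or `+0.0` from wrongly excluding data when the writer stored the other zero.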

Author: Gabor Szadovszky <gabor.szadovszky@cloudera.com>

Closes #461 from gszadovszky/PARQUET-1246 and squashes the following commits:

20e9332 [Gabor Szadovszky] PARQUET-1246: Changes according to zi's comments
3447938 [Gabor Szadovszky] PARQUET-1246: Ignore float/double statistics in case of NaN

9 months agoPARQUET-1135: upgrade thrift and protobuf dependencies
Julien Le Dem [Sat, 10 Mar 2018 00:14:11 +0000 (16:14 -0800)] 
PARQUET-1135: upgrade thrift and protobuf dependencies

Author: Julien Le Dem <julien.ledem@wework.com>
Author: Julien Le Dem <julien@ledem.net>

Closes #427 from julienledem/PARQUET_1135_thrift_PB and squashes the following commits:

f23b32d9 [Julien Le Dem] remove double install
78cbf734 [Julien Le Dem] remove running check on protobuf build
4bc2b8f7 [Julien Le Dem] add timing; upgrade proto version
e17ca956 [Julien Le Dem] without-nodejs
d15e523d [Julien Le Dem] PARQUET-1135: upgrade thrift and protobuf dependencies

9 months agoPARQUET-1217: Incorrect handling of missing values in Statistics
Gabor Szadovszky [Tue, 27 Feb 2018 13:19:14 +0000 (14:19 +0100)] 
PARQUET-1217: Incorrect handling of missing values in Statistics

In parquet-format every value in Statistics is optional, but parquet-mr does not properly handle these scenarios:
- null_count is set but min/max or min_value/max_value are not: filtering may fail with an NPE, or incorrect filtering occurs
  fix: check if min/max is set before comparing to the related values
- null_count is not set: filtering handles null_count as if it were 0 -> incorrect filtering may occur
  fix: introduce a new method in the Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering
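
A minimal sketch of the checks described above, under stated assumptions: `OptionalStatsSketch`, `ColumnStats`, and the two `canDrop...` methods are hypothetical names and do not reproduce the parquet-mr `Statistics` API. Unset fields are modelled as `null`, and a row group is dropped only when the statistics that are actually present prove that no row can match.

```java
public class OptionalStatsSketch {
    /** Hypothetical statistics holder; a null field means "not set in the file". */
    public static final class ColumnStats {
        final Long nullCount;
        final Integer min;
        final Integer max;

        public ColumnStats(Long nullCount, Integer min, Integer max) {
            this.nullCount = nullCount;
            this.min = min;
            this.max = max;
        }

        boolean hasMinMax() {
            return min != null && max != null;
        }

        boolean isNullCountSet() {
            return nullCount != null;
        }
    }

    /** "col = value": drop only if min/max are set and exclude the value. */
    public static boolean canDropForEquals(ColumnStats stats, int value) {
        if (!stats.hasMinMax()) {
            return false; // nothing can be proven, so keep the row group
        }
        return value < stats.min || value > stats.max;
    }

    /** "col IS NULL": drop only if the null count is set and equals zero. */
    public static boolean canDropForIsNull(ColumnStats stats) {
        return stats.isNullCountSet() && stats.nullCount == 0;
    }
}
```

Treating an unset field as "unknown" rather than as zero or as a comparable value is the conservative choice: the filter may read more data than strictly needed, but it does not skip rows that match.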

Author: Gabor Szadovszky <gabor.szadovszky@cloudera.com>

Closes #458 from gszadovszky/PARQUET-1217 and squashes the following commits:

9d14090 [Gabor Szadovszky] Updates according to rdblue's comments
116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments
c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount
2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics

9 months agoPARQUET-787: Limit read allocation size
Ryan Blue [Wed, 21 Feb 2018 17:40:07 +0000 (09:40 -0800)] 
PARQUET-787: Limit read allocation size

WIP: This updates the `ParquetFileReader` to use multiple buffers when reading a row group, instead of a single humongous allocation. As a consequence, many classes needed to be updated to accept a stream backed by multiple buffers, instead of using a single buffer directly. Assuming a single contiguous buffer would require too many copies.
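
The idea of replacing one large allocation with several bounded ones can be sketched as follows. This is an illustration only, not the code from this change: `ChunkedAllocationSketch` and `allocate` are hypothetical names, and the sketch uses plain heap `ByteBuffer`s rather than the reader's real allocator.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ChunkedAllocationSketch {
    /**
     * Allocates totalBytes as a list of buffers, none larger than
     * maxAllocSize. Only the last buffer may be smaller than the limit.
     */
    public static List<ByteBuffer> allocate(long totalBytes, int maxAllocSize) {
        if (totalBytes < 0 || maxAllocSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and the limit positive");
        }
        List<ByteBuffer> buffers = new ArrayList<>();
        long remaining = totalBytes;
        while (remaining > 0) {
            // The result never exceeds maxAllocSize, so the cast to int is safe.
            int size = (int) Math.min(remaining, maxAllocSize);
            buffers.add(ByteBuffer.allocate(size));
            remaining -= size;
        }
        return buffers;
    }
}
```

For example, `allocate(20, 8)` yields three buffers with capacities 8, 8, and 4. Consumers then read through the list in order instead of assuming one contiguous buffer.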

Author: Ryan Blue <blue@apache.org>

Closes #390 from rdblue/PARQUET-787-limit-read-allocation-size and squashes the following commits:

4abba3e7a [Ryan Blue] PARQUET-787: Update byte buffer input streams for review comments.
e7c6c5dd2 [Ryan Blue] PARQUET-787: Fix problems from Zoltan's review.
be52b59fa [Ryan Blue] PARQUET-787: Update tests for both ByteBufferInputStreams.
b0b614748 [Ryan Blue] PARQUET-787: Update encodings to use ByteBufferInputStream.
a4fa05ac5 [Ryan Blue] Refactor ByteBufferInputStream implementations.
56b22a6a1 [Ryan Blue] Make allocation size configurable.
103ed3d86 [Ryan Blue] Add tests for ByteBufferInputStream and fix bugs.
614a2bbc8 [Ryan Blue] Limit allocation size to 8MB chunks for better garbage collection.