18 min agoARROW-16281: [R] [CI] Bump versions with the release of 4.2 master
Dragoș Moldovan-Grünfeld [Wed, 18 May 2022 20:28:23 +0000 (13:28 -0700)] 
ARROW-16281: [R] [CI] Bump versions with the release of 4.2

Update the hard-coded versions of R in our CI after the release of R 4.2.

Closes #12980 from dragosmg/r_42_ci_update

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Jonathan Keane <>
41 min agoMINOR: [FlightRPC] Document assumption about catalogs support on SQL_CATALOG_TERM
Rafael Telles [Wed, 18 May 2022 20:04:47 +0000 (16:04 -0400)] 
MINOR: [FlightRPC] Document assumption about catalogs support on SQL_CATALOG_TERM

To indicate that a Flight SQL server does not support catalogs, we assume that sources will return an empty string for `SQL_CATALOG_TERM` in Flight SQL's CommandGetSqlInfo response.
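
A minimal sketch of how a client might apply this assumption. The dictionary `sql_info` and the function name are hypothetical stand-ins for a decoded CommandGetSqlInfo response, not part of the Flight SQL API, and treating a missing entry like an empty string is an additional assumption of this sketch.

```python
def supports_catalogs(sql_info):
    """Return True when the server reports a non-empty catalog term.

    `sql_info` is a hypothetical dict of decoded SqlInfo values keyed by name.
    A missing key is treated like an empty string (an assumption of this sketch).
    """
    return sql_info.get("SQL_CATALOG_TERM", "") != ""
```

For example, `supports_catalogs({"SQL_CATALOG_TERM": ""})` reports that catalogs are not supported.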

Closes #13175 from rafael-telles/document-catalog-support

Authored-by: Rafael Telles <>
Signed-off-by: David Li <>
2 hours agoARROW-16427: [Java] Provide explicit column type mapping
Todd Farmer [Wed, 18 May 2022 17:52:18 +0000 (13:52 -0400)] 
ARROW-16427: [Java] Provide explicit column type mapping

Closes #13166 from toddfarmer/toddfarmer/arrow-16427

Authored-by: Todd Farmer <>
Signed-off-by: David Li <>
5 hours agoMINOR: Fix wrongly redefining pytestmark for parquet encryption tests (#13189)
Raúl Cumplido [Wed, 18 May 2022 15:39:32 +0000 (17:39 +0200)] 
MINOR: Fix wrongly redefining pytestmark for parquet encryption tests (#13189)

11 hours agoARROW-16516: [R] Implement ym() my() and yq() parsers
Dragoș Moldovan-Grünfeld [Wed, 18 May 2022 09:44:53 +0000 (10:44 +0100)] 
ARROW-16516: [R] Implement ym() my() and yq() parsers

The `ym()`, `my()` and `yq()` bindings will make the following possible (and identical):

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

test_df <- tibble::tibble(
  ym_string = c("2022-05", "2022/02", "22.03", NA)
)

test_df %>%
  mutate(ym_date = ym(ym_string))
#> # A tibble: 4 × 2
#>   ym_string ym_date
#>   <chr>     <date>
#> 1 2022-05   2022-05-01
#> 2 2022/02   2022-02-01
#> 3 22.03     2022-03-01
#> 4 <NA>      NA

test_df %>%
  arrow_table() %>%
  mutate(ym_date = ym(ym_string)) %>%
  collect()
#> # A tibble: 4 × 2
#>   ym_string ym_date
#>   <chr>     <date>
#> 1 2022-05   2022-05-01
#> 2 2022/02   2022-02-01
#> 3 22.03     2022-03-01
#> 4 <NA>      NA
```

<sup>Created on 2022-05-16 by the reprex package (v2.0.1)</sup>

I've implemented this with the following steps:
* add `"-01"` to the end of the strings we're trying to parse, and then
* use one of the supported `orders` (`"ymd"` or `"myd"`)
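
The two steps above can be sketched in Python for the `ym()` case. This is only an illustration of the idea, not the R binding: the function name `parse_ym`, the separator normalisation, and the fallback to two-digit years are assumptions of this sketch.

```python
import re
from datetime import datetime


def parse_ym(text):
    """Parse a year-month string by appending a day and parsing as year-month-day."""
    if text is None:
        return None
    # Normalise separators so "2022/02" and "22.03" use "-" like "2022-05".
    normalised = re.sub(r"[^0-9]+", "-", text.strip())
    # Step 1: add "-01" so that the string carries a day component.
    with_day = normalised + "-01"
    # Step 2: parse in year-month-day order, trying 4-digit years first.
    for fmt in ("%Y-%m-%d", "%y-%m-%d"):
        try:
            return datetime.strptime(with_day, fmt).date()
        except ValueError:
            continue
    return None
```

With this sketch, `parse_ym("2022/02")` yields the first day of February 2022, and a missing value stays missing.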

Closes #13163 from dragosmg/ym_my_yq_parsers

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
16 hours agoARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins
Sasha Krassovsky [Wed, 18 May 2022 04:01:46 +0000 (18:01 -1000)] 
ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins

This adds Bloom filter pushdown between hash join nodes.
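
As a rough illustration of the general technique (not the C++ implementation in this PR), a Bloom filter built from the build-side keys of a join can drop probe-side rows early. All names below are hypothetical, and the hashing scheme is chosen only for simplicity.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def prefilter_probe_rows(build_keys, probe_rows, key_of):
    """Keep only probe rows whose key might appear on the build side."""
    bloom = BloomFilter()
    for key in build_keys:
        bloom.add(key)
    return [row for row in probe_rows if bloom.might_contain(key_of(row))]
```

Rows that pass the filter still have to go through the real join, because a Bloom filter can report false positives.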

Closes #12289 from save-buffer/sasha_bloom_pushdown

Lead-authored-by: Sasha Krassovsky <>
Co-authored-by: michalursa <>
Signed-off-by: Weston Pace <>
17 hours agoARROW-16601: [C++][FlightRPC] Don't enforce static link with static GoogleTest...
Sutou Kouhei [Wed, 18 May 2022 03:20:06 +0000 (12:20 +0900)] 
ARROW-16601: [C++][FlightRPC] Don't enforce static link with static GoogleTest for arrow_flight_testing (#13180)

We can remove this because #13169/ARROW-16588 solved the link problem.

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
18 hours agoARROW-16478: [C++] Refine cpu info detection
Yibo Cai [Wed, 18 May 2022 02:10:03 +0000 (02:10 +0000)] 
ARROW-16478: [C++] Refine cpu info detection

This patch separates OS and ARCH dependent code and removes CPU
frequency detection (cycles_per_ms()) which is brittle and not very
useful in practice.

There are still many caveats, especially for Arm platform. It's better
to adopt a mature library if we want more complete functionalities.

Below are examples of cpu info detected on various platforms (some
from virtual machines).

Intel, Linux
Vendor: Intel
Model: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Features (set bits):  0  1  2  3  4  5  6  7  8  9  10  11  12
Cache sizes: 32768 1048576 37486592

AMD, Linux
Vendor: AMD
Model: AMD EPYC 7251 8-Core Processor
Features (set bits):  0  1  2  3  4  5  11  12
Cache sizes: 32768 524288 33554432

Intel, MacOS
Vendor: Unknown
Model: Unknown
Features (set bits):  0  1  2  3  4
Cache sizes: 32768 262144 12582912

Intel, Windows
Vendor: Intel
Model: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\0\0
Features (set bits):  0  1  2  3  4  5  6  7  8  9  10  11  12
Cache sizes: 131072 2097152 37486592

Intel, MinGW
Vendor: Intel
Model: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz\0\0\0\0\0\0\0
Features (set bits):  0  1  2  3  4  5  11  12
Cache sizes: 131072 524288 52428800

Arm, Linux
Vendor: Unknown
Model: Unknown
Features (set bits):  32
Cache sizes: 65536 1048576 Unknown

Arm, MacOS
Vendor: Unknown
Model: Unknown
Features (set bits):  32
Cache sizes: 65536 4194304 Unknown

Closes #13112 from cyb70289/cpuinfo-refine

Authored-by: Yibo Cai <>
Signed-off-by: Yibo Cai <>
18 hours agoMINOR: [C++] cpp/parquet/Statistics: clarify that num_values() is the number of non...
Even Rouault [Wed, 18 May 2022 01:54:48 +0000 (01:54 +0000)] 
MINOR: [C++] cpp/parquet/Statistics: clarify that num_values() is the number of non-null values

The current documentation of Statistics::num_values() is a bit
ambiguous as it mentions the 'total number of values' and my initial
understanding is that it also included null values. But experimentation
and documentation of
shows that it is the number of non-null values.

Closes #13164 from rouault/statistics_num_values

Authored-by: Even Rouault <>
Signed-off-by: Yibo Cai <>
23 hours agoARROW-16570: [R] Make pkg-config commands find all of the libs
Neal Richardson [Tue, 17 May 2022 20:54:27 +0000 (05:54 +0900)] 
ARROW-16570: [R] Make pkg-config commands find all of the libs

See discussion at

We don't currently have any CI that triggers the case @rvernica reported there.

Closes #13151 from nealrichardson/r-pkg-libs

Authored-by: Neal Richardson <>
Signed-off-by: Sutou Kouhei <>
24 hours agoARROW-16588: [C++][FlightRPC] Don't subclass GTest in test helpers
David Li [Tue, 17 May 2022 20:10:52 +0000 (05:10 +0900)] 
ARROW-16588: [C++][FlightRPC] Don't subclass GTest in test helpers

Also, don't link every Arrow library to UCX when enabled.

Closes #13169 from lidavidm/arrow-16588

Authored-by: David Li <>
Signed-off-by: Sutou Kouhei <>
29 hours agoARROW-16555: [Go][Parquet] Lift BitBlockCounter and VisitBitBlocks into shared intern...
Matthew Topol [Tue, 17 May 2022 15:00:17 +0000 (11:00 -0400)] 
ARROW-16555: [Go][Parquet] Lift BitBlockCounter and VisitBitBlocks into shared internal utils

Closes #13135 from zeroshade/arrow-16555-shared-utils

Authored-by: Matthew Topol <>
Signed-off-by: Matthew Topol <>
29 hours agoARROW-16552: [Go] Improve decimal128 utilities
Matthew Topol [Tue, 17 May 2022 14:58:52 +0000 (10:58 -0400)] 
ARROW-16552: [Go] Improve decimal128 utilities

Adding new utilities for decimal128.Num for rescaling and for converting to and from float32/64
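
A minimal sketch of what rescaling and float conversion mean for a decimal stored as an unscaled integer plus a scale. The function names are hypothetical and do not mirror the Go API; raising an error when rescaling would lose digits is an assumption of this sketch.

```python
def rescale(unscaled, old_scale, new_scale):
    """Re-express an unscaled decimal integer at a different scale."""
    if new_scale >= old_scale:
        return unscaled * 10 ** (new_scale - old_scale)
    quotient, remainder = divmod(unscaled, 10 ** (old_scale - new_scale))
    if remainder != 0:
        # Assumption of this sketch: refuse to drop non-zero digits.
        raise ValueError("rescaling would lose precision")
    return quotient


def to_float(unscaled, scale):
    """Convert an unscaled decimal integer to a float."""
    return unscaled / 10 ** scale


def from_float(value, scale):
    """Convert a float to an unscaled decimal integer, rounding to the nearest."""
    return round(value * 10 ** scale)
```

Here 123.45 at scale 2 is stored as 12345, and rescaling it to scale 4 gives 1234500.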

Closes #13134 from zeroshade/arrow-16552-decimals

Authored-by: Matthew Topol <>
Signed-off-by: Matthew Topol <>
29 hours agoARROW-16530: [Go] Added concurrency in key places that are always serial, regardless...
Robert Purdom [Tue, 17 May 2022 14:57:04 +0000 (10:57 -0400)] 
ARROW-16530: [Go] Added concurrency in key places that are always serial, regardless if parallel=true or not

Added concurrency to field readers. Even when parallel=true, there are times when the default behavior is serial, which causes very slow performance when dealing with many columns and structures with many columns.

I'm working with very complex parquet files that have 500+ columns and lists of structures with hundreds of columns. In the original code, getting the field readers is always done serially, regardless of whether parallel is true. This is also true when the readers retrieve the 'next batch' of records. I modified the code to perform concurrent 'read' operations in three places in two files. The performance impact is especially heavy on high-latency files, e.g., cloud storage.

The original version required just over an hour to read 600+ columns from GCS.  The revised version completes the same read in ~ 11 minutes.
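
The pattern can be sketched in Python with a thread pool. This is an illustration of the idea only, not the Go change: `read_column` is a hypothetical stand-in for a high-latency column read, and the worker count is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_column(name):
    """Hypothetical stand-in for a slow, I/O-bound column read."""
    time.sleep(0.01)
    return name, len(name)


def read_columns(names, parallel=True, max_workers=8):
    """Read all columns, concurrently when `parallel` is set."""
    if not parallel:
        return [read_column(name) for name in names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `names`.
        return list(pool.map(read_column, names))
```

Because the work is dominated by waiting on storage, overlapping the reads is where the time savings come from.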

Closes #13120 from raceordie690/master

Authored-by: Robert Purdom <>
Signed-off-by: Matthew Topol <>
34 hours agoARROW-16548: [Python] Add pytest.mark.parquet to all tests under tests/parquet package
Raúl Cumplido [Tue, 17 May 2022 09:58:12 +0000 (11:58 +0200)] 
ARROW-16548: [Python] Add pytest.mark.parquet to all tests under tests/parquet package

The implementation marks all the individual tests that are on this structure with the parquet dataset mark correctly.

Closes #13147 from raulcd/ARROW-16548

Authored-by: Raúl Cumplido <>
Signed-off-by: Joris Van den Bossche <>
35 hours agoARROW-16541: [R] [CI] Reduce the number of times lintr is run
Dragoș Moldovan-Grünfeld [Tue, 17 May 2022 09:17:53 +0000 (10:17 +0100)] 
ARROW-16541: [R] [CI] Reduce the number of times lintr is run

This PR will reduce the number of times `lintr::lint_package()` is being run to 1 (run only on the current release). At present it runs on each branch of the Windows CI workflows.

Closes #13162 from dragosmg/run_lintr_once

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
37 hours agoARROW-16507: [CI][C++] Use system gtest with mamba/conda
Sutou Kouhei [Tue, 17 May 2022 07:12:38 +0000 (09:12 +0200)] 
ARROW-16507: [CI][C++] Use system gtest with mamba/conda

The gtest package for Windows provided by conda-forge doesn't provide `GTestConfig.cmake`.
See also:

And `FindGTest.cmake` provided by CMake can't find `gtest_dll.dll` that is a shared
library version of GoogleTest.
See also:

It means that we can find only static version of GoogleTest on Windows with Conda
without a custom `FindGTestAlt.cmake`.

Shared library version `arrow_flight_testing` requires shared library version GoogleTest on
Windows because it defines `arrow::flight::FlightTest` that inherits `testing::Test`.
See also:

We must use the same library type for them on Windows.

With `ARROW_BUILD_TESTS=ON` and a static version of GoogleTest, we need to build
`arrow_flight_testing` as a static library, not a shared library.

Closes #13101 from assignUser/ARROW-16507-fix-gtest2

Lead-authored-by: Sutou Kouhei <>
Co-authored-by: Jacob Wujciak-Jens <>
Signed-off-by: Antoine Pitrou <>
44 hours agoARROW-16571: [Java] Update .gitignore to exclude JNI-related binaries
Larry White [Tue, 17 May 2022 00:06:58 +0000 (09:06 +0900)] 
ARROW-16571: [Java] Update .gitignore to exclude JNI-related binaries

Adds three lines to gitignore to exclude three folders containing binaries and other build output produced by the JNI build process. The folders:
- java-dist/
- java-native-c/
- java-native-cpp/

are created in the root arrow directory when cmake is run

The command line build is documented here:

I followed the macOS instructions:

Building JNI Libraries on MacOS

To build only the C Data Interface library:

    $ cd arrow
    $ brew bundle --file=cpp/Brewfile
    Homebrew Bundle complete! 25 Brewfile dependencies now installed.
    $ export JAVA_HOME=<absolute path to your java home>
    $ mkdir -p java-dist java-native-c
    $ cd java-native-c
    $ cmake \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=../java-dist \
    $ cmake --build . --target install

To build other JNI libraries:

    $ cd arrow
    $ brew bundle --file=cpp/Brewfile
    Homebrew Bundle complete! 25 Brewfile dependencies now installed.
    $ export JAVA_HOME=<absolute path to your java home>
    $ mkdir -p java-dist java-native-cpp
    $ cd java-native-cpp
    $ cmake \
        -DARROW_JNI=ON \
        -DARROW_ORC=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=../java-dist \
        -Dre2_SOURCE=BUNDLED \
        -DBoost_SOURCE=BUNDLED \
        -Dutf8proc_SOURCE=BUNDLED \
        -DSnappy_SOURCE=BUNDLED \
    $ cmake --build . --target install

Building Arrow JNI Modules

To compile the JNI bindings, use the arrow-c-data Maven profile:

    $ cd arrow/java
    $ mvn -Darrow.c.jni.dist.dir=../java-dist/lib -Parrow-c-data clean install

To compile the JNI bindings for ORC / Gandiva / Dataset, use the arrow-jni Maven profile:

    $ cd arrow/java
    $ mvn -Parrow-jni clean install

Closes #13153 from lwhite1/update-gitignore-to-include-folders-where-binaries-are-created

Authored-by: Larry White <>
Signed-off-by: Sutou Kouhei <>
2 days agoMINOR: [CI] Fix centos cmake path (#13167)
Jacob Wujciak-Jens [Mon, 16 May 2022 19:37:07 +0000 (21:37 +0200)] 
MINOR: [CI] Fix centos cmake path (#13167)

* fix centos cmake path

* extract to standardized dir

2 days agoARROW-16568: [Java] Enable skip BOUNDS_CHECKING with setBytes and getBytes of ArrowBuf
stczwd [Mon, 16 May 2022 16:41:21 +0000 (12:41 -0400)] 
ARROW-16568: [Java] Enable skip BOUNDS_CHECKING with setBytes and getBytes of ArrowBuf

ArrowBuf.setByte and ArrowBuf.getByte honor BOUNDS_CHECKING_SKIP, which removes unwanted bounds checks. However, ArrowBuf.setBytes and ArrowBuf.getBytes do not, so bounds checking costs about 10% of CPU time in our environment.

Closes #13161 from jackylee-ch/skip_bounds_check_for_set_or_get_bytes

Authored-by: stczwd <>
Signed-off-by: David Li <>
2 days agoARROW-16525: [C++] Tee node not properly marking node finished
Weston Pace [Mon, 16 May 2022 15:32:24 +0000 (11:32 -0400)] 
ARROW-16525: [C++] Tee node not properly marking node finished

Closes #13117 from westonpace/feature/ARROW-16525--tee-node-not-marking-finished

Lead-authored-by: Weston Pace <>
Signed-off-by: Benjamin Kietzman <>
2 days agoARROW-16531: [Dev] Update pre-commit to use latest flake8 and remove unsupported...
Raúl Cumplido [Mon, 16 May 2022 09:07:46 +0000 (11:07 +0200)] 
ARROW-16531: [Dev] Update pre-commit to use latest flake8 and remove unsupported cython linting

This PR updates our Python linting and makes it consistent between archery and pre-commit by removing the checks for Cython files. Flake8 does not support Cython linting; see this conversation for more details.

Closes #13129 from raulcd/ARROW-16531

Authored-by: Raúl Cumplido <>
Signed-off-by: Antoine Pitrou <>
2 days agoARROW-16581: [C++][Java] Upgrade ORC to 1.7.4
William Hyun [Mon, 16 May 2022 07:59:07 +0000 (09:59 +0200)] 
ARROW-16581: [C++][Java] Upgrade ORC to 1.7.4

Bump ORC to 1.7.4.

Apache ORC 1.7.4 is a maintenance release with the following bug fixes.


Closes #13159 from williamhyun/orc174

Authored-by: William Hyun <>
Signed-off-by: Antoine Pitrou <>
3 days agoARROW-16504 [Go][CSV] Add arrow.TimestampType support to the reader
Mark Wolfe [Sun, 15 May 2022 17:35:30 +0000 (13:35 -0400)] 
ARROW-16504 [Go][CSV] Add arrow.TimestampType support to the reader

There is already a helper to convert strings to arrow.Timestamp so incorporate this into the CSV reader.

The CSV files I am currently working with have RFC3339 timestamps, so I followed some of the JSON code and stuck with the millisecond default.

Was really easy to add this using the existing functions and structure.
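
For illustration, converting an RFC3339 string to a millisecond timestamp can be sketched as below. This is a Python sketch rather than the Go helper mentioned above; the function name is hypothetical, and treating a value without an offset as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def rfc3339_to_millis(text):
    """Convert an RFC3339 timestamp string to milliseconds since the Unix epoch."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    moment = datetime.fromisoformat(text)
    if moment.tzinfo is None:
        # Assumption of this sketch: values without an offset are UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    delta = moment - EPOCH
    # Integer arithmetic avoids float rounding on the fractional seconds.
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000
```

Sub-millisecond digits are truncated, which matches a millisecond unit.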

Closes #13098 from wolfeidau/ARROW-16504-add-timestamp-support-to-reader

Authored-by: Mark Wolfe <>
Signed-off-by: Matt Topol <>
4 days agoARROW-16572: [C++] Fix LZ4 build for external projects
Hamish Nicholson [Sat, 14 May 2022 20:05:41 +0000 (05:05 +0900)] 
ARROW-16572: [C++] Fix LZ4 build for external projects

The use of CMAKE_SOURCE_DIR is not compatible with projects that build arrow from source as a CMake project dependency.

Closes #13154 from Shamazo/master

Authored-by: Hamish Nicholson <>
Signed-off-by: Sutou Kouhei <>
4 days agoARROW-16579: [Go][CI] Fix Flakey Struct Test
Matt Topol [Sat, 14 May 2022 18:01:16 +0000 (14:01 -0400)] 
ARROW-16579: [Go][CI] Fix Flakey Struct Test

The tests for ARROW-16502 reused the same `StructBuilder` between cases. Because one case tests for a panic, it leaves the StructBuilder in a bad state, which is understandable after a panic. Because the iteration order of Go maps is non-deterministic, the test failed only intermittently: it failed if the panic case ran first and succeeded if it ran second.

By moving the StructBuilder creation inside the test function, the two test cases no longer share it and the test is no longer flakey.

Closes #13158 from zeroshade/arrow-16579-flakeytest

Authored-by: Matt Topol <>
Signed-off-by: Matt Topol <>
4 days agoARROW-16561 [Go][Parquet] test for parquet root node configuration
Mark Wolfe [Sat, 14 May 2022 17:08:06 +0000 (13:08 -0400)] 
ARROW-16561 [Go][Parquet] test for parquet root node configuration

As requested in #13139 I have added a test with some example configuration to verify it works as intended.

I added it to the schema test as well as that is where I am using it, hopefully that is fine @zeroshade .

Closes #13156 from wolfeidau/ARROW-16561-customise-root-node-test

Authored-by: Mark Wolfe <>
Signed-off-by: Matt Topol <>
4 days agoARROW-16563: [Go][Parquet] Fix broken parquet plain boolean decoder
Matt DePero [Sat, 14 May 2022 16:43:38 +0000 (12:43 -0400)] 
ARROW-16563: [Go][Parquet] Fix broken parquet plain boolean decoder

While reading parquet files using this library, we discovered that boolean fields with `PLAIN` encoding were not being read properly. It was discovered that this was due to how `bitOffset` was managed in the boolean decoder; this PR patches the decoder and also adds test coverage for reading plain boolean pages.
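
To show why the bit offset matters, here is a small Python sketch of a PLAIN boolean decoder. It is not the Go decoder from this PR; it only illustrates that PLAIN booleans are bit-packed, least significant bit first, so the offset has to carry over between calls. The class name is hypothetical.

```python
class PlainBooleanDecoder:
    """Decode bit-packed booleans, least significant bit first."""

    def __init__(self, data, num_values):
        self.data = data
        self.remaining = num_values
        self.bit_offset = 0  # absolute bit position, kept across decode() calls

    def decode(self, count):
        count = min(count, self.remaining)
        out = []
        for _ in range(count):
            byte = self.data[self.bit_offset // 8]
            out.append(bool((byte >> (self.bit_offset % 8)) & 1))
            self.bit_offset += 1
        self.remaining -= count
        return out
```

Reading three values and then seven more must continue from bit 3 of the first byte; resetting the offset at each call would repeat or skip values.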

Closes #13141 from mdepero/depero/boolencodingfix

Authored-by: Matt DePero <>
Signed-off-by: Matt Topol <>
4 days agoARROW-16569: [CI] Update checkout actions to newer version
Raúl Cumplido [Sat, 14 May 2022 06:54:25 +0000 (15:54 +0900)] 
ARROW-16569: [CI] Update checkout actions to newer version

This PR aims to update the github checkout actions to the newest version.

Closes #13152 from raulcd/ARROW-16569

Authored-by: Raúl Cumplido <>
Signed-off-by: Sutou Kouhei <>
4 days agoARROW-16498: [C++] Fix potential deadlock in arrow::compute::TaskScheduler
Weston Pace [Fri, 13 May 2022 23:10:20 +0000 (13:10 -1000)] 
ARROW-16498: [C++] Fix potential deadlock in arrow::compute::TaskScheduler

Closes #13091 from westonpace/bugfix/ARROW-16498--task-scheduler-deadlock

Authored-by: Weston Pace <>
Signed-off-by: Weston Pace <>
4 days agoARROW-16534: [Java] update Gandiva protobuf library to enable builds on M1
Larry White [Fri, 13 May 2022 21:29:11 +0000 (17:29 -0400)] 
ARROW-16534: [Java] update Gandiva protobuf library to enable builds on M1

Currently used protobuf library 2.5.0 does not include support for M1, causing tests to fail after compiling from source on Apple Silicon. Version 3.20.1 does provide M1 support. This PR updates the library version.

Closes #13121 from lwhite1/update-protobuf-dependenc

Authored-by: Larry White <>
Signed-off-by: David Li <>
5 days agoARROW-16539: [C++] Bump bundled thrift to 0.16.0
Sutou Kouhei [Fri, 13 May 2022 20:02:18 +0000 (16:02 -0400)] 
ARROW-16539: [C++] Bump bundled thrift to 0.16.0

Closes #13122 from nealrichardson/bump-thrift

Lead-authored-by: Sutou Kouhei <>
Co-authored-by: Neal Richardson <>
Signed-off-by: Neal Richardson <>
5 days agoARROW-16561 [Go][Parquet] add option to customise parquet root node
Mark Wolfe [Fri, 13 May 2022 17:17:21 +0000 (13:17 -0400)] 
ARROW-16561 [Go][Parquet] add option to customise parquet root node

Currently the root node's name and repetition are hard coded, and the current settings need to be changed to work with some other tools.

Closes #13139 from wolfeidau/ARROW-16561-customise-root-node

Authored-by: Mark Wolfe <>
Signed-off-by: Matthew Topol <>
5 days agoARROW-16538: [Java] Adding flexibility to mock ResultSets
Todd Farmer [Fri, 13 May 2022 12:13:02 +0000 (08:13 -0400)] 
ARROW-16538: [Java] Adding flexibility to mock ResultSets

The minimum required to support existing use cases of FakeResultSet has been implemented here - every other method throws a SQLException, and support can be added in the future as specific methods of MockResultSet are referenced.

Closes #13123 from toddfarmer/toddfarmer/arrow-16427

Authored-by: Todd Farmer <>
Signed-off-by: David Li <>
5 days agoARROW-16402: [R][CI] Create new Archery Tasks
Jacob Wujciak-Jens [Fri, 13 May 2022 07:50:39 +0000 (09:50 +0200)] 
ARROW-16402: [R][CI] Create new Archery Tasks

This PR introduces two new archery docker tasks for future use in building the R nightlies in Crossbow.

Closes #13131 from assignUser/ARROW-16402

Lead-authored-by: Jacob Wujciak-Jens <>
Co-authored-by: Neal Richardson <>
Signed-off-by: Alessandro Molina <>
6 days agoARROW-16425: [C++] Add compute kernel test for scalar array timestamp comparison
Yaron Gvili [Thu, 12 May 2022 20:03:47 +0000 (16:03 -0400)] 
ARROW-16425: [C++] Add compute kernel test for scalar array timestamp comparison


Closes #13037 from rtpsw/ARROW-16425

Authored-by: Yaron Gvili <>
Signed-off-by: David Li <>
6 days agoARROW-16183: [C++][FlightRPC] Support bundled UCX
Yibo Cai [Thu, 12 May 2022 19:56:42 +0000 (15:56 -0400)] 
ARROW-16183: [C++][FlightRPC] Support bundled UCX

Closes #12881 from cyb70289/16183-bundled-ucx

Authored-by: Yibo Cai <>
Signed-off-by: David Li <>
6 days agoARROW-16502: [Go] Accept missing optional fields when unmarshalling JSON in StructBuilder
Przemysław Kowolik [Thu, 12 May 2022 17:47:56 +0000 (13:47 -0400)] 
ARROW-16502: [Go] Accept missing optional fields when unmarshalling JSON in StructBuilder

Calling array.StructBuilder.UnmarshalJSON with a JSON object that is missing optional fields fails to decode the object properly and panics, even though dropping empty/null fields from JSON is common practice.

Fix this by filling all missing optional fields with null values so that the builder does not panic.
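
The idea of the fix can be sketched in Python. This is not the Go code: `fill_missing_fields` and its arguments are hypothetical, and the sketch treats every listed field as optional.

```python
import json


def fill_missing_fields(field_names, json_text):
    """Decode a JSON object and fill absent fields with None (null)."""
    obj = json.loads(json_text)
    # dict.get returns None for keys that the JSON object dropped.
    return {name: obj.get(name) for name in field_names}
```

For a struct with fields `a` and `b`, the input `{"a": 1}` decodes to `a = 1` and `b = null` instead of failing.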

Closes #13097 from Kowol/bufix/struct-unmarshal-optional-fields

Authored-by: Przemysław Kowolik <>
Signed-off-by: Matthew Topol <>
6 days agoARROW-16486: [Go] Implement bit_packing functions with Arm64 GoLang Assembly
Yuqi Gu [Thu, 12 May 2022 17:46:08 +0000 (13:46 -0400)] 
ARROW-16486: [Go] Implement bit_packing functions with Arm64 GoLang Assembly

Implement the functions of unpack32_neon in the module: parquet-bitpacking.

Tests passed:
go test ./...

Closes #13080 from guyuqi/ARROW-16486

Authored-by: Yuqi Gu <>
Signed-off-by: Matthew Topol <>
6 days agoARROW-16394: [R] Implement lubridate's parsers with year, month and date components
Dragoș Moldovan-Grünfeld [Thu, 12 May 2022 14:41:40 +0000 (15:41 +0100)] 
ARROW-16394: [R] Implement lubridate's parsers with year, month and date components

This PR adds bindings for lubridate's parsers with **y**ear, **m**onth, and **d**ay components, allowing the following to work correctly:
``` r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

test_df <- tibble::tibble(
  ymd_string = c("2022-05-11", "2022/05/12", "22.05-13")
)

test_df %>%
  mutate(ymd_date = ymd(ymd_string))
#> # A tibble: 3 × 2
#>   ymd_string ymd_date
#>   <chr>      <date>
#> 1 2022-05-11 2022-05-11
#> 2 2022/05/12 2022-05-12
#> 3 22.05-13   2022-05-13

test_df %>%
  arrow_table() %>%
  mutate(ymd_date = ymd(ymd_string)) %>%
  collect()
#> # A tibble: 3 × 2
#>   ymd_string ymd_date
#>   <chr>      <date>
#> 1 2022-05-11 2022-05-11
#> 2 2022/05/12 2022-05-12
#> 3 22.05-13   2022-05-13
```

<sup>Created on 2022-05-11 by the reprex package (v2.0.1)</sup>

Closes #13118 from dragosmg/ymd_parsers

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
6 days agoMINOR: [Java] Indicate absolute path is required in docs
Larry White [Thu, 12 May 2022 13:43:35 +0000 (09:43 -0400)] 
MINOR: [Java] Indicate absolute path is required in docs

Because Maven is set up to run the tests in a temp folder, the path provided to the native libraries must be absolute or the tests will fail.

Closes #13114 from lwhite1/patch-1

Lead-authored-by: Larry White <>
Co-authored-by: Larry White <>
Co-authored-by: David Li <>
Signed-off-by: David Li <>
6 days agoARROW-16526: [Python] test_partitioned_dataset fails when building with PARQUET but...
Weston Pace [Thu, 12 May 2022 11:25:26 +0000 (13:25 +0200)] 
ARROW-16526: [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET

One of the legacy parquet dataset tests was not properly passing use_legacy_dataset and this caused the test to attempt to use the new datasets module even if it wasn't enabled

Closes #13116 from westonpace/bugfix/MINOR--missing-dataset-mark

Authored-by: Weston Pace <>
Signed-off-by: Joris Van den Bossche <>
6 days agoARROW-16468: [Python] Test Table filter feature with complex exprs and add Expression...
Alessandro Molina [Thu, 12 May 2022 10:01:41 +0000 (12:01 +0200)] 
ARROW-16468: [Python] Test Table filter feature with complex exprs and add Expression.apply method

Depends on

Closes #13099 from amol-/ARROW-16468

Authored-by: Alessandro Molina <>
Signed-off-by: Alessandro Molina <>
6 days agoARROW-16508: [Archery][Dev] Add possibility to extend chat report message based on...
Raúl Cumplido [Thu, 12 May 2022 09:58:45 +0000 (11:58 +0200)] 
ARROW-16508: [Archery][Dev] Add possibility to extend chat report message based on success or failures of jobs

Closes #13102 from raulcd/ARROW-16508

Authored-by: Raúl Cumplido <>
Signed-off-by: Alessandro Molina <>
6 days agoMINOR: Add feedback information output when building in case of skipping pyarrow...
Raúl Cumplido [Thu, 12 May 2022 09:28:30 +0000 (11:28 +0200)] 
MINOR: Add feedback information output when building in case of skipping pyarrow build

This minor PR tries to give the user a better hint on what is happening in case of their build being skipped because `cachedir != build_temp`. I faced the issue and had to ask and debug to understand that I had to clean my previous build:
(pyarrow-dev) ~/arrow/python (master)  $ python build_ext --inplace
running build_ext
(pyarrow-dev) ~/arrow/python (master)  $
In that case, the new output will be:
(pyarrow-dev) ~/arrow/python (master)  $ python build_ext --inplace
running build_ext
-- Skipping build. Temp build /home/raulcd/arrow/python/build/temp.linux-x86_64-3.10 does not match cached dir /arrow/python/build/temp.linux-x86_64-3.10
(pyarrow-dev) ~/arrow/python (master)  $

Closes #13119 from raulcd/minor-feedback-improvement

Authored-by: Raúl Cumplido <>
Signed-off-by: Joris Van den Bossche <>
7 days agoARROW-16168: [C++][CMake] Use target to add include paths
Sutou Kouhei [Wed, 11 May 2022 20:24:28 +0000 (05:24 +0900)] 
ARROW-16168: [C++][CMake] Use target to add include paths

This lets us remove "include_directories(SYSTEM)".

Closes #12861 from kou/cpp-target-include-path

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
7 days agoARROW-16473: [Go] fixing memory leak in serializedPageReader
Min-Young Wu [Wed, 11 May 2022 19:55:38 +0000 (15:55 -0400)] 
ARROW-16473: [Go] fixing memory leak in serializedPageReader

`parquet/file.serializedPageReader` has a `memory.Buffer` attribute (presumably to reuse across page reads). But at the end of `serializedPageReader.Next` (in the non-error case), a new `memory.Buffer` is created without releasing the pre-existing `p.buf`, thus resulting in a leak.

Existing tests were updated to test for and catch this (`parquet/file` now uses `CheckedAllocator`).
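
The leak and the way a checked allocator exposes it can be sketched in Python. The classes below are hypothetical stand-ins that only borrow names from the description above; they are not the Go types.

```python
class Buffer:
    """A buffer that reports its release back to the allocator."""

    def __init__(self, allocator, size):
        self._allocator = allocator
        self.data = bytearray(size)
        self._released = False

    def release(self):
        if not self._released:
            self._released = True
            self._allocator.live -= 1


class CheckedAllocator:
    """Counts live buffers so that a test can assert none were leaked."""

    def __init__(self):
        self.live = 0

    def allocate(self, size):
        self.live += 1
        return Buffer(self, size)


class PageReader:
    def __init__(self, allocator, page_sizes):
        self._allocator = allocator
        self._page_sizes = list(page_sizes)
        self._buf = allocator.allocate(0)

    def next(self):
        if not self._page_sizes:
            return False
        size = self._page_sizes.pop(0)
        # Release the previous buffer before replacing it; skipping this leaks it.
        self._buf.release()
        self._buf = self._allocator.allocate(size)
        return True

    def close(self):
        self._buf.release()
```

If the `release()` call in `next()` is removed, the live count grows with every page, which is the kind of failure a checked allocator makes visible in tests.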

Closes #13068 from minyoung/user/minyoung/0504-serialized-page-reader-leak

Authored-by: Min-Young Wu <>
Signed-off-by: Matthew Topol <>
7 days agoARROW-16529: [Java] Fix ArrowVectorIterator.hasNext()
Todd Farmer [Wed, 11 May 2022 18:30:55 +0000 (14:30 -0400)] 
ARROW-16529: [Java] Fix ArrowVectorIterator.hasNext()

Calls to ArrowVectorIterator.hasNext() should return true until has been called, and the underlying ResultSet has been fully consumed.

Closes #13107 from toddfarmer/tofarmer/fix-hasnext-for-empty-resultset

Authored-by: Todd Farmer <>
Signed-off-by: David Li <>
7 days agoARROW-602: [C++] Provide iterator access to primitive elements inside an Array
Alvin Chunga [Wed, 11 May 2022 16:54:31 +0000 (18:54 +0200)] 
ARROW-602: [C++] Provide iterator access to primitive elements inside an Array

Create Iterator method in stl for Array and ChunkedArray
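
Conceptually, iterating over a chunked array walks each chunk in turn and yields one element at a time. The Python sketch below illustrates that shape only; representing a chunk as a `(values, validity)` pair and surfacing nulls as `None` are assumptions of this sketch, not the C++ API.

```python
def iter_array(values, validity):
    """Yield each element of one array, or None where the element is null."""
    for value, is_valid in zip(values, validity):
        yield value if is_valid else None


def iter_chunked_array(chunks):
    """Yield the elements of every chunk in order, as one flat sequence."""
    for values, validity in chunks:
        yield from iter_array(values, validity)
```

Callers can then loop over the elements without tracking chunk boundaries themselves.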

Closes #13009 from AlvinJ15/ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays

Lead-authored-by: Alvin Chunga <>
Co-authored-by: Antoine Pitrou <>
Signed-off-by: Antoine Pitrou <>
7 days agoARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables...
Alessandro Molina [Wed, 11 May 2022 11:53:01 +0000 (13:53 +0200)] 
ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression

The function is focused on Tables, as it will be the foundation for `Table.filter`, but as the extra work was minimal I added support for Dataset too.

Closes #13075 from amol-/ARROW-16467

Authored-by: Alessandro Molina <>
Signed-off-by: Joris Van den Bossche <>
7 days agoARROW-13052: [Gandiva][C++] Add regexp_extract function
Johnnathan [Wed, 11 May 2022 10:06:53 +0000 (15:36 +0530)] 
ARROW-13052: [Gandiva][C++] Add regexp_extract function

Implements the REGEXP_EXTRACT function based on the Hive implementation.
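
A rough Python sketch of what a REGEXP_EXTRACT-style function does: find the first match and return the requested capture group. Returning an empty string when nothing matches and defaulting the index to 1 are assumptions of this sketch, and Python's `re` syntax differs in places from the regex engines used by Hive and Gandiva.

```python
import re


def regexp_extract(subject, pattern, index=1):
    """Return capture group `index` of the first match, or "" if there is none."""
    match = re.search(pattern, subject)
    if match is None:
        return ""
    # A group that did not participate in the match is reported as "".
    return match.group(index) or ""
```

Index 0 refers to the whole match, and higher indexes refer to the parenthesised groups in order.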

Closes #13015 from Johnnathanalmeida/feature/add-regexp-extract

Authored-by: Johnnathan <>
Signed-off-by: Pindikura Ravindra <>
7 days agoMINOR: [R] correct NEWS heading
Dragoș Moldovan-Grünfeld [Wed, 11 May 2022 09:45:43 +0000 (10:45 +0100)] 
MINOR: [R] correct NEWS heading

Closes #13106 from dragosmg/minor_news_update

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
7 days agoARROW-16253: [R] Helper function for casting from float to duration via int64()
Dragoș Moldovan-Grünfeld [Wed, 11 May 2022 09:44:33 +0000 (10:44 +0100)] 
ARROW-16253: [R] Helper function for casting from float to duration via int64()

Closes #13055 from dragosmg/duration_casting_helper

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
8 days agoARROW-16414: [R] Remove ARROW_R_WITH_ARROW and arrow_available()
Neal Richardson [Tue, 10 May 2022 16:48:22 +0000 (12:48 -0400)] 
ARROW-16414: [R] Remove ARROW_R_WITH_ARROW and arrow_available()

The diff looks bigger than that because

* Sometimes those changes just resulted in reducing indentation
* I moved arrow_info() and related functions to their own file, and did the same with ArrowObject while I was there
* The way we were wrapping testthat::test_that to check whether arrow was available had a side effect of creating a closure that stored intermediate objects that we reused across tests, and that broke when I removed it.
* I didn't have styler configured correctly in vscode when I started because I had upgraded R to 4.2, so to fix what I had already committed that was unstyled, I ran `make style-all` across everything, which reformatted a bunch of unrelated code.

I tried to pull on all threads I noticed where we were doing things an unnatural way because we couldn't assume that arrow was present, but there may be more.

Closes #13086 from nealrichardson/arrow-is-available

Lead-authored-by: Neal Richardson <>
Co-authored-by: Jonathan Keane <>
Signed-off-by: Neal Richardson <>
8 days agoARROW-15587: [C++] Add support for all options specified by substrait::ReadRel::Local...
Ariana Villegas [Tue, 10 May 2022 16:10:57 +0000 (18:10 +0200)] 
ARROW-15587: [C++] Add support for all options specified by substrait::ReadRel::LocalFiles::FileOrFiles

The Substrait read operator defines files with LocalFiles::FileOrFiles. These elements can take one of several forms:

- uri_path (can be a file or a folder)
- uri_path_glob (a glob expression)
- uri_file (file only)
- uri_folder (folder only)

The C++ Substrait consumer currently only supports uri_file. This PR adds support for the other options.

- [x] uri_path (can be a file or a folder)
- [x] uri_path_glob (a glob expression)
- [x] uri_folder (folder only)
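
As a rough illustration of how the four forms differ, the sketch below resolves each one to a list of files. It uses only the Python standard library. `resolve_files` and its `kind` strings are hypothetical names for this example, local paths stand in for URIs, and this is not the C++ consumer's implementation.

```python
import glob
import os
import tempfile


def resolve_files(kind, path):
    """Resolve one FileOrFiles-style entry to a sorted list of file paths.

    Hypothetical helper: `kind` mirrors the Substrait field names, and plain
    local paths are used in place of URIs.
    """
    if kind == "uri_file":
        if not os.path.isfile(path):
            raise ValueError(f"not a file: {path}")
        return [path]
    if kind == "uri_folder":
        if not os.path.isdir(path):
            raise ValueError(f"not a folder: {path}")
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if os.path.isfile(os.path.join(path, name))
        )
    if kind == "uri_path_glob":
        return sorted(p for p in glob.glob(path) if os.path.isfile(p))
    if kind == "uri_path":
        # May be either a file or a folder, so dispatch on what is on disk.
        sub_kind = "uri_folder" if os.path.isdir(path) else "uri_file"
        return resolve_files(sub_kind, path)
    raise ValueError(f"unknown kind: {kind}")


# Small demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name in ("a.parquet", "b.parquet", "notes.txt"):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("")

by_folder = resolve_files("uri_folder", demo_dir)
by_glob = resolve_files("uri_path_glob", os.path.join(demo_dir, "*.parquet"))
by_path = resolve_files("uri_path", os.path.join(demo_dir, "a.parquet"))
```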

Closes #12625 from ArianaVillegas/ARROW-15587

Lead-authored-by: Ariana Villegas <>
Co-authored-by: ArianaVillegas <>
Co-authored-by: Antoine Pitrou <>
Co-authored-by: Weston Pace <>
Signed-off-by: Antoine Pitrou <>
8 days agoARROW-16489: [R] wrong encoding causes parsing error
Jacob Wujciak-Jens [Tue, 10 May 2022 13:58:04 +0000 (14:58 +0100)] 
ARROW-16489: [R] wrong encoding causes parsing error

Closes #13082 from assignUser/ARROW-16489-fix-encoding

Authored-by: Jacob Wujciak-Jens <>
Signed-off-by: Nic Crane <>
8 days agoARROW-16297: [R] Improve detection of ARROW_*_URL variables for offline build
karldw [Tue, 10 May 2022 13:24:54 +0000 (09:24 -0400)] 
ARROW-16297: [R] Improve detection of ARROW_*_URL variables for offline build

As Neal mentioned, the current code in nixlibs.R doesn't handle URL variable name components that have multiple words (because of the way it parses variable names from filenames). Until now, we've had a special case for the AWS variables, but `ARROW_GOOGLE_CLOUD_CPP_URL` and `ARROW_NLOHMANN_JSON_URL` also need handling. Instead of adding special cases, we can provide the correct `ARROW_*_URL` values with the new bash script added as part of ARROW-15092 (in PR #12849).

Please let me know what you think!

Closes #12973 from karldw/fix-16297

Lead-authored-by: karldw <>
Co-authored-by: Neal Richardson <>
Signed-off-by: Neal Richardson <>
8 days agoARROW-16501: [Docs][C++][R] Migrate to Matomo from Google Analytics
Sutou Kouhei [Tue, 10 May 2022 08:15:35 +0000 (10:15 +0200)] 
ARROW-16501: [Docs][C++][R] Migrate to Matomo from Google Analytics

This is needed to comply with the GDPR (the European Union's
General Data Protection Regulation).

See also:


Closes #13096 from kou/docs-matomo

Authored-by: Sutou Kouhei <>
Signed-off-by: Joris Van den Bossche <>
8 days agoARROW-16500: [Release][R] Don't use GNU sed extension for r/ update
Sutou Kouhei [Tue, 10 May 2022 07:07:17 +0000 (16:07 +0900)] 
ARROW-16500: [Release][R] Don't use GNU sed extension for r/ update

We should not use the '0,/.../' syntax for updating r/ because it's a
GNU sed extension. If we use it, our release script doesn't work on
macOS, which uses BSD sed.
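
For comparison, the intent behind `0,/.../` (change only the first matching line) can be expressed without relying on any sed dialect. The following Python sketch is an illustration only and is not what the release script does. `replace_first_heading` and the sample text are made up for this example.

```python
import re


def replace_first_heading(text, version):
    """Replace only the first line that is exactly '# arrow' (hypothetical helper)."""
    # count=1 limits the substitution to the first match; re.MULTILINE makes
    # ^ and $ match at line boundaries.
    return re.sub(
        r"^# arrow$", f"# arrow {version}", text, count=1, flags=re.MULTILINE
    )


sample = "# arrow\n\n# arrow\n"
updated = replace_first_heading(sample, "9.0.0")
```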

Closes #13095 from kou/release-r-news-update

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
8 days agoARROW-16484: [Go][Parquet] Update parquet writer version
Matthew Topol [Tue, 10 May 2022 06:44:52 +0000 (15:44 +0900)] 
ARROW-16484: [Go][Parquet] Update parquet writer version

This updates the default `Created By` string of the Go Parquet Writer, and updates the release script to correctly update that string with releases in the future.

Closes #13103 from zeroshade/arrow-16484-parq-writerversion

Lead-authored-by: Matthew Topol <>
Co-authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
8 days agoARROW-16487: [C++][Parquet] Fix parquet::Statistics::Equals() with minmax
Sutou Kouhei [Tue, 10 May 2022 06:42:40 +0000 (15:42 +0900)] 
ARROW-16487: [C++][Parquet] Fix parquet::Statistics::Equals() with minmax

The following cases return the wrong result:

* statistics_no_minmax.Equals(statistics_no_minmax):
  This must return true but false is returned.
* statistics_minmax.Equals(statistics_minmax):
  This must return true but false is returned.
* statistics_minmax1.Equals(statistics_minmax2) where
  statistics_minmax1 and statistics_minmax2 have different minmax:
  This must return false but true is returned.
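
The expected behaviour in the three cases above can be summarised with a small Python sketch. `Stats` is a toy stand-in written for this example and does not reproduce the `parquet::Statistics` API or the actual fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stats:
    """Toy stand-in for column statistics; not the parquet::Statistics API."""

    has_min_max: bool
    min: Optional[int] = None
    max: Optional[int] = None

    def equals(self, other):
        # Both sides must agree on whether min/max is set at all.
        if self.has_min_max != other.has_min_max:
            return False
        # Without min/max there is nothing further to compare.
        if not self.has_min_max:
            return True
        return self.min == other.min and self.max == other.max


no_minmax = Stats(False)
minmax1 = Stats(True, 1, 10)
minmax2 = Stats(True, 2, 20)
```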

Note that parquet::Statistics::Equals() was introduced by .

Closes #13087 from kou/cpp-parquet-statistics-equal

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
8 days agoARROW-16426: [C++] Add TeeNode to execution engine
Yaron Gvili [Tue, 10 May 2022 01:52:16 +0000 (15:52 -1000)] 
ARROW-16426: [C++] Add TeeNode to execution engine

The existing write node is a consuming one while the proposed tee node is a pass-through one.
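
The difference between the two kinds of node can be sketched with Python generators. This is a conceptual illustration only: `write_sink` and `tee` are hypothetical names, lists of integers stand in for batches, and nothing here reflects the execution engine's actual interfaces.

```python
def write_sink(batches, storage):
    """Consuming node: writes every batch and passes nothing downstream."""
    for batch in batches:
        storage.append(batch)


def tee(batches, storage):
    """Pass-through node: writes every batch and also yields it downstream."""
    for batch in batches:
        storage.append(batch)
        yield batch


written = []
# A downstream consumer still sees every batch that went through the tee.
downstream = [sum(batch) for batch in tee([[1, 2], [3, 4]], written)]
```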

Closes #13040 from rtpsw/ARROW-16426

Authored-by: Yaron Gvili <>
Signed-off-by: Weston Pace <>
9 days agoARROW-16511: [R] Preserve schema metadata in write_dataset()
Neal Richardson [Mon, 9 May 2022 19:48:22 +0000 (15:48 -0400)] 
ARROW-16511: [R] Preserve schema metadata in write_dataset()

Closes #13105 from nealrichardson/write-dataset-metadata

Authored-by: Neal Richardson <>
Signed-off-by: Neal Richardson <>
9 days agoARROW-14848: [R] Implement bindings for lubridate's parse_date_time
Dragoș Moldovan-Grünfeld [Mon, 9 May 2022 19:30:19 +0000 (14:30 -0500)] 
ARROW-14848: [R] Implement bindings for lubridate's parse_date_time

This PR adds a partial implementation of `parse_date_time()`:
* only parses the year, month, and date components (no hours, minutes and seconds yet)
* does not support parsing of strings without separators (e.g. `"220912"` to `2022-09-12`)
* `lubridate::parse_date_time()` infers the most likely `format` given `orders` (via `guess_formats()`), while the Arrow binding does not do any inference.
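
To make "no inference" concrete, here is a minimal Python sketch that tries one explicit format per order and gives up if none fits. The `ORDER_FORMATS` mapping, the separator normalisation, and `parse_with_orders` are assumptions made for this illustration. It is not the R binding's code.

```python
import re
from datetime import date, datetime

# Hypothetical mapping from an order string to one explicit format.
ORDER_FORMATS = {"ymd": "%Y-%m-%d", "dmy": "%d-%m-%Y", "mdy": "%m-%d-%Y"}


def parse_with_orders(text, orders):
    """Try each order's explicit format in turn; return None if none fits."""
    # Normalise separators so that one format per order is enough.
    normalised = re.sub(r"[^0-9A-Za-z]+", "-", text.strip())
    for order in orders:
        try:
            return datetime.strptime(normalised, ORDER_FORMATS[order]).date()
        except ValueError:
            continue
    return None


with_slashes = parse_with_orders("2022/09/12", ["ymd"])
second_order = parse_with_orders("12.09.2022", ["ymd", "dmy"])
no_separators = parse_with_orders("220912", ["ymd"])
```

In this sketch a string without separators is simply not parsed, which is analogous to the second limitation listed above.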

Closes #12589 from dragosmg/parse_date_time

Lead-authored-by: Dragoș Moldovan-Grünfeld <>
Co-authored-by: Jonathan Keane <>
Signed-off-by: Jonathan Keane <>
9 days agoMINOR: [R] Move tzdb loading out of .onLoad() to avoid a check NOTE
Neal Richardson [Mon, 9 May 2022 19:11:12 +0000 (15:11 -0400)] 
MINOR: [R] Move tzdb loading out of .onLoad() to avoid a check NOTE

`R CMD check` now raises a NOTE after my previous fix (f49fbda3dffeaead7d192ec64bfd2a7cfc4172a3):

* checking R code for possible problems ... NOTE
File ‘arrow/R/arrow-package.R’:
  .onLoad calls:
    packageStartupMessage("The tzdb package is not installed. Timezones will not be available.")
See section ‘Good practice’ in '?.onAttach'.

Interestingly, the docs they point to say to use `packageStartupMessage()` like we are doing here. In any case, if we move it to a function that `.onLoad()` calls rather than having it directly in .onLoad, `check` doesn't find it 🤷

Closes #13104 from nealrichardson/tzdb-msg-2

Authored-by: Neal Richardson <>
Signed-off-by: Neal Richardson <>
9 days agoARROW-16085: [C++][R] InMemoryDataset::ReplaceSchema does not alter scan output
Will Jones [Mon, 9 May 2022 17:35:46 +0000 (13:35 -0400)] 
ARROW-16085: [C++][R] InMemoryDataset::ReplaceSchema does not alter scan output

Feels a little funny that deleting this code just makes it work, so I added a decent number of tests to make sure differing schemas are handled. LMK if you think I missed something.

Closes #13088 from wjones127/ARROW-16085-unify-inmemory-datasets

Authored-by: Will Jones <>
Signed-off-by: David Li <>
9 days agoMINOR: [Python][Docs] Improving sentence on docs:python/parquet
alexdesiqueira [Mon, 9 May 2022 13:14:51 +0000 (15:14 +0200)] 
MINOR: [Python][Docs] Improving sentence on docs:python/parquet

Just a small improvement on `docs/source/python/parquet.rst`:
> _We need not use_


> _We do not need to use_

Closes #13093 from alexdesiqueira/small_wording-python_parquet

Authored-by: alexdesiqueira <>
Signed-off-by: Joris Van den Bossche <>
9 days agoARROW-16499: [Release][Ruby] Add missing export
Sutou Kouhei [Mon, 9 May 2022 02:44:27 +0000 (11:44 +0900)] 
ARROW-16499: [Release][Ruby] Add missing export

We need to export GEM_HOST_OTP_CODE so that "gem push" can use it.

Closes #13092 from kou/release-ruby-missing-export

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
9 days agoARROW-16497: [R] Update version in
Sutou Kouhei [Mon, 9 May 2022 02:24:30 +0000 (11:24 +0900)] 
ARROW-16497: [R] Update version in

This solves the following CI failure:

    Failure: test_version_pre_tag(PrepareTest)
    /home/runner/work/arrow/arrow/dev/release/01-prepare-test.rb:226:in `test_version_pre_tag'
         223:     end
         225:     prepare("VERSION_PRE_TAG")
      => 226:     assert_equal(expected_changes.sort_by {|diff| diff[:path]},
         227:                  parse_patch(git("log", "-n", "1", "-p")))
         228:   end
         229: end
    ?  {:hunks=>[["-# arrow", "+# arrow 9.0.0"]], :path=>"r/"},
    ?                       7

Closes #13089 from kou/r-news-update

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
9 days agoARROW-15671: [GLib] Add support for Vala
Sutou Kouhei [Mon, 9 May 2022 01:48:02 +0000 (10:48 +0900)] 
ARROW-15671: [GLib] Add support for Vala

Closes #12993 from kou/glib-vapi

Lead-authored-by: Sutou Kouhei <>
Co-authored-by: Tao Zuhong <>
Signed-off-by: Sutou Kouhei <>
11 days agoARROW-16474: [C++][Packaging] Require Python 3.7 or later
Sutou Kouhei [Sat, 7 May 2022 20:17:05 +0000 (05:17 +0900)] 
ARROW-16474: [C++][Packaging] Require Python 3.7 or later

Closes #13079 from kou/packaging-linux-drop-python-3.6-support

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
11 days agoARROW-16228: [CI][Packaging][Conan] Add a job to test minimum build
Sutou Kouhei [Fri, 6 May 2022 22:33:51 +0000 (07:33 +0900)] 
ARROW-16228: [CI][Packaging][Conan] Add a job to test minimum build

ci/conan/ is based on

We'll update ci/conan/ to follow the latest cpp/ changes such as new
build options and send a pull request the updates to when we release a new

Closes #12918 from kou/ci-conan

Authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
11 days ago[Release] Update .deb/.rpm changelogs for 8.0.0
Krisztián Szűcs [Fri, 6 May 2022 21:05:25 +0000 (23:05 +0200)] 
[Release] Update .deb/.rpm changelogs for 8.0.0

11 days ago[Release] Update .deb package names for 9.0.0
Krisztián Szűcs [Fri, 6 May 2022 21:05:24 +0000 (23:05 +0200)] 
[Release] Update .deb package names for 9.0.0

11 days ago[Release] Update versions for 9.0.0-SNAPSHOT
Krisztián Szűcs [Fri, 6 May 2022 21:05:24 +0000 (23:05 +0200)] 
[Release] Update versions for 9.0.0-SNAPSHOT

12 days agoARROW-16490: [C++][Windows] Don't force to use bundled GoogleTest
Jacob Wujciak-Jens [Fri, 6 May 2022 19:56:31 +0000 (04:56 +0900)] 
ARROW-16490: [C++][Windows] Don't force to use bundled GoogleTest

It seems that conda-forge's GoogleTest package for Windows solved the problem we encountered.

We can use GoogleTest installed by vcpkg by removing the workaround.

Closes #13083 from assignUser/ARROW-16490-fix-gtest

Authored-by: Jacob Wujciak-Jens <>
Signed-off-by: Sutou Kouhei <>
12 days agoARROW-16494: [C++] Add missing include that is making some packaging jobs fail
Raúl Cumplido [Fri, 6 May 2022 19:46:42 +0000 (04:46 +0900)] 
ARROW-16494: [C++] Add missing include that is making some packaging jobs fail

This PR tries to fix the issue that has been raised on some packaging builds like:
- conda-linux-gcc-py310-cuda
- conda-linux-gcc-py310-ppc64le
- conda-linux-gcc-py37-arm64
- conda-linux-gcc-py37-cpu-r40

Closes #13084 from raulcd/ARROW-16494

Authored-by: Raúl Cumplido <>
Signed-off-by: Sutou Kouhei <>
12 days agoMINOR: [R] Don't use warning() in .onLoad()
Neal Richardson [Fri, 6 May 2022 19:17:23 +0000 (15:17 -0400)] 
MINOR: [R] Don't use warning() in .onLoad()

According to the docs, `packageStartupMessage()` is preferred. With `warning()`, it can show up in the `R CMD INSTALL` output like this:

installing to D:/a/_temp/Library/00LOCK-arrow/00new/arrow/libs/x64
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Warning in fun(libname, pkgname) :
  The tzdb package is not installed. Timezones will not be available.
** testing if installed package can be loaded from final location
Warning in fun(libname, pkgname) :
** testing if installed package keeps a record of temporary installation path
  The tzdb package is not installed. Timezones will not be available.
* MD5 sums
packaged installation of 'arrow' as
* DONE (arrow)

I will cherry-pick when preparing the 8.0.0 CRAN submission

Closes #13085 from nealrichardson/load-warning

Authored-by: Neal Richardson <>
Signed-off-by: Neal Richardson <>
12 days agoARROW-16488: [Archery][Dev] Allow extra message to be sent on chat report
Raúl Cumplido [Fri, 6 May 2022 18:23:55 +0000 (20:23 +0200)] 
ARROW-16488: [Archery][Dev] Allow extra message to be sent on chat report

This PR allows an extra CLI argument `--extra-message` to be used when sending a chat report in order to append content to the message being sent.

Closes #13081 from raulcd/ARROW-16488

Authored-by: Raúl Cumplido <>
Signed-off-by: Krisztián Szűcs <>
12 days agoARROW-16448: [CI][Archery] Refactor EmailReport to be a JinjaReport
Raúl Cumplido [Fri, 6 May 2022 18:20:32 +0000 (20:20 +0200)] 
ARROW-16448: [CI][Archery] Refactor EmailReport to be a JinjaReport

This PR tries to move the current `EmailReport` used for the Nightly report emails to a `JinjaReport` and adds a test for the report generation.
I have also successfully tested sending an email locally.

Closes #13074 from raulcd/ARROW-16448

Authored-by: Raúl Cumplido <>
Signed-off-by: Krisztián Szűcs <>
12 days agoARROW-16327: [Java][CI] Add Java 17 to CI matrix for java workflows
Raúl Cumplido [Thu, 5 May 2022 21:26:01 +0000 (06:26 +0900)] 
ARROW-16327: [Java][CI] Add Java 17 to CI matrix for java workflows

This PR aims to add support for java 17 on CI as required on
We probably should cherrypick the commit here on that PR.

Closes #13021 from raulcd/ARROW-16327

Authored-by: Raúl Cumplido <>
Signed-off-by: Sutou Kouhei <>
13 days agoARROW-16241: [Python] Suppress warnings in tests when using use_legacy_dataset=True
Alenka Frim [Thu, 5 May 2022 15:50:32 +0000 (17:50 +0200)] 
ARROW-16241: [Python] Suppress warnings in tests when using use_legacy_dataset=True

Closes #12954 from AlenkaF/ARROW-16241

Authored-by: Alenka Frim <>
Signed-off-by: Joris Van den Bossche <>
13 days agoARROW-16450: [Go][Docs] Include error handling in csv examples
Mark Wolfe [Thu, 5 May 2022 13:57:51 +0000 (09:57 -0400)] 
ARROW-16450: [Go][Docs] Include error handling in csv examples

Following the tests, I have added error checks to the examples, as this tripped me up while dealing with nulls in my CSVs.

Closes #13059 from wolfeidau/arrow-16450-csv-example-godoc

Authored-by: Mark Wolfe <>
Signed-off-by: Matthew Topol <>
13 days agoARROW-16461: [C++] Fix sporadic Thread Sanitizer failure
Antoine Pitrou [Thu, 5 May 2022 13:43:17 +0000 (15:43 +0200)] 
ARROW-16461: [C++] Fix sporadic Thread Sanitizer failure

In debug mode the `ThreadedTaskGroup::finished_` member would be read unlocked, so make it atomic.

Note that the failure should be harmless, but still deserves fixing to improve CI reliability.

Closes #13067 from pitrou/ARROW-16461-atomic-finished

Authored-by: Antoine Pitrou <>
Signed-off-by: Antoine Pitrou <>
13 days agoARROW-14114: [C++][Parquet] Fix multi-threaded read of PME files
Maya Anderson [Thu, 5 May 2022 12:31:32 +0000 (14:31 +0200)] 
ARROW-14114: [C++][Parquet] Fix multi-threaded read of PME files

Change AesDecryptor to be per Decryptor, instead of shared.
This solves the problem of reading with PME using multiple threads.

It was discovered when exposing high-level PME in PyArrow that reading an encrypted Parquet file in PyArrow intermittently fails decryption finalization and sometimes fails with a segmentation fault. The same happens in C++ when reading an encrypted Parquet file with FileReader.ReadTable() using multiple threads (with set_use_threads(true)).
The current implementation uses two caches, meta_decryptor_ and data_decryptor_, for AesDecryptors, and every Decryptor gets the same AesDecryptor with AesDecryptorImpl from these caches.
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() use ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which shouldn't be used from multiple threads concurrently.
So, instead of sharing the same AesDecryptor between all Decryptors, an AesDecryptor will be created per Decryptor, which is per column.
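
The hazard of sharing one stateful context can be shown with a toy Python sketch. No real cryptography or threading is involved: `CipherContext` is a made-up stand-in for a stateful object such as `EVP_CIPHER_CTX`, and the two decryptions are interleaved by hand to imitate what two threads might do.

```python
class CipherContext:
    """Toy stateful context standing in for EVP_CIPHER_CTX; no real crypto."""

    def __init__(self):
        self.buffer = None

    def begin(self):
        self.buffer = []

    def update(self, chunk):
        self.buffer.append(chunk)

    def finish(self):
        result, self.buffer = "".join(self.buffer), None
        return result


class Decryptor:
    """One decryptor per column; `ctx` may be shared or private."""

    def __init__(self, ctx=None):
        # Creating a private context when none is passed mirrors the idea of
        # one AesDecryptor per Decryptor.
        self.ctx = ctx if ctx is not None else CipherContext()


def interleaved_decrypt(first, second):
    """Interleave two decryptions the way two threads might."""
    first.ctx.begin()
    second.ctx.begin()  # With a shared context this resets the first one's state.
    first.ctx.update("col-a")
    second.ctx.update("col-b")
    out_first = first.ctx.finish()
    # A shared context was already finalized above, so nothing is left here.
    out_second = second.ctx.finish() if second.ctx.buffer is not None else None
    return out_first, out_second


shared = CipherContext()
shared_result = interleaved_decrypt(Decryptor(shared), Decryptor(shared))
private_result = interleaved_decrypt(Decryptor(), Decryptor())
```

With the shared context the two outputs are mixed together and the second finalization has nothing to work on, whereas private contexts keep each column's data separate.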

Co-authored-by: Gidon Gershinsky <>
CC @thamht4190 @pitrou @revit13

Closes #12778 from andersonm-ibm/multithreaded_read

Lead-authored-by: Maya Anderson <>
Co-authored-by: Gidon Gershinsky <>
Signed-off-by: Antoine Pitrou <>
13 days agoARROW-16458: [CI][Python] Run dask S3 tests on nightly integration
Raúl Cumplido [Thu, 5 May 2022 12:17:15 +0000 (14:17 +0200)] 
ARROW-16458: [CI][Python] Run dask S3 tests on nightly integration

This PR adds coverage for running dask parquet tests that use S3 filesystem.

Closes #13071 from raulcd/ARROW-16458

Authored-by: Raúl Cumplido <>
Signed-off-by: Joris Van den Bossche <>
13 days agoMINOR: [R][Doc] encoding setting of `read_csv_arrow`
SHIMA Tatsuya [Thu, 5 May 2022 10:23:08 +0000 (10:23 +0000)] 
MINOR: [R][Doc] encoding setting of `read_csv_arrow`

The documentation listed the `encoding` argument under `CsvConvertOptions`, but it actually belongs to `CsvReadOptions`.

I also corrected some formatting for automatic linking in the doc.

Closes #13038 from eitsupi/read-option-doc

Authored-by: SHIMA Tatsuya <>
Signed-off-by: Nic Crane <>
13 days agoARROW-16445: [R] [Doc] Add a short summary for the Installing the Arrow package on...
Dragoș Moldovan-Grünfeld [Thu, 5 May 2022 10:20:46 +0000 (10:20 +0000)] 
ARROW-16445: [R] [Doc] Add a short summary for the Installing the Arrow package on Linux article

Happy to rephrase if the wording is too colloquial.

Closes #13056 from dragosmg/intro_to_install_on_linux

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
2 weeks agoARROW-16456: [Go] Fix RecordBuilder UnmarshalJSON when extra fields are present
Phillip LeBlanc [Wed, 4 May 2022 16:38:26 +0000 (12:38 -0400)] 
ARROW-16456: [Go] Fix RecordBuilder UnmarshalJSON when extra fields are present

If RecordBuilder.UnmarshalJSON encountered an unknown field in the JSON document, it wouldn't consume the next JSON token representing the value.

Fix this by manually calling `dec.Token()` and allowing the next call to `dec.Token()` to return the following key.

Note that this only handles the case where the ignored value is a simple type, and not a nested array or object. I'm not sure what the correct fix for that would be.
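
One possible way to also skip nested values is to track nesting depth while discarding tokens. The Python sketch below illustrates that idea on a pre-tokenized list. It is not the Go fix: `skip_value`, `read_known_fields`, and the token representation are all assumptions made for this example, and string values that look like delimiters are not handled.

```python
def skip_value(tokens):
    """Consume exactly one JSON value from an iterator of tokens.

    Scalars are a single token; arrays and objects are skipped by tracking
    nesting depth until the matching close delimiter.
    """
    depth = 0
    for token in tokens:
        if token in ("{", "["):
            depth += 1
        elif token in ("}", "]"):
            depth -= 1
        if depth == 0:
            return


def read_known_fields(tokens, known):
    """Read a token stream for one object, keeping only known keys."""
    it = iter(tokens)
    if next(it) != "{":
        raise ValueError("expected start of object")
    record = {}
    for key in it:
        if key == "}":
            break
        if key in known:
            record[key] = next(it)  # Known fields are scalars in this sketch.
        else:
            skip_value(it)  # Unknown field: discard its whole value.
    return record


tokens = ["{", "a", 1, "extra", "[", 1, "{", "x", 2, "}", "]", "b", 2, "}"]
record = read_known_fields(tokens, {"a", "b"})
```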

Closes #13065 from phillipleblanc/phillip/unmarshal-json-bug

Authored-by: Phillip LeBlanc <>
Signed-off-by: Matthew Topol <>
2 weeks agoMINOR: [C++] Trivial improvements to execution plan example
Vibhatha Abeykoon [Wed, 4 May 2022 16:10:16 +0000 (18:10 +0200)] 
MINOR: [C++] Trivial improvements to execution plan example

This is a minor PR that includes a fix for the streaming execution plan code. An issue was raised in a PR that is currently under review, and this PR applies the suggested change to this example.

cc @pitrou

Closes #12767 from vibhatha/minor-cpp-example-improvement

Authored-by: Vibhatha Abeykoon <>
Signed-off-by: Antoine Pitrou <>
2 weeks agoARROW-16441: [Go][Flight][Java] Update flight integration test to wait for io.EOF...
Matthew Topol [Wed, 4 May 2022 13:59:25 +0000 (09:59 -0400)] 
ARROW-16441: [Go][Flight][Java] Update flight integration test to wait for io.EOF after DoPut

Closes #13058 from zeroshade/arrow-16441-integration

Authored-by: Matthew Topol <>
Signed-off-by: Matthew Topol <>
2 weeks agoARROW-16436: [C++][Python] Datasets should not ignore CSV autogenerate_column_names
Raúl Cumplido [Wed, 4 May 2022 13:49:52 +0000 (15:49 +0200)] 
ARROW-16436: [C++][Python] Datasets should not ignore CSV autogenerate_column_names

The added test failed previously because the `autogenerate_column_names` was ignored:
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of/pytest-15/test_csv_format_options_genera1/test.csv': Could not open CSV input source '/tmp/pytest-of/pytest-15/test_csv_format_options_genera1/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
Use the same approach we use on `GenerateColumnNames` here
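
The effect of the option can be illustrated with a stdlib-only Python sketch. `read_rows` is a hypothetical helper written for this example and is not the dataset code. The `f0`, `f1`, ... naming follows the format Arrow documents for autogenerated column names.

```python
import csv
import io


def read_rows(text, autogenerate_column_names=False):
    """Return (column_names, data_rows) for CSV text (stdlib-only sketch)."""
    rows = list(csv.reader(io.StringIO(text)))
    if autogenerate_column_names:
        # Treat every row as data and name the columns f0, f1, ...
        names = [f"f{i}" for i in range(len(rows[0]))]
        return names, rows
    # Otherwise the first row supplies the names.
    return rows[0], rows[1:]


text = "1,1\n2,3\n"
generated_names, generated_rows = read_rows(text, autogenerate_column_names=True)
header_names, header_rows = read_rows(text)
```

When the option is ignored, the first data row is taken as the header, which is how two columns both named `1` can arise, as in the error above.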

Closes #13064 from raulcd/ARROW-16436

Authored-by: Raúl Cumplido <>
Signed-off-by: Antoine Pitrou <>
2 weeks agoARROW-16255: [R] Reorganise the datetime bindings
Dragoș Moldovan-Grünfeld [Wed, 4 May 2022 13:39:36 +0000 (13:39 +0000)] 
ARROW-16255: [R] Reorganise the datetime bindings

The purpose of this PR is to reorganise the datetime bindings.

* some are in files where one wouldn't think to look (e.g. in `R/dplyr-funcs-type.R`)
* there are a bunch of somewhat scattered helper functions
* some of the `register_bindings_...()` functions are too complex and trigger the cyclocomp lint

* create a separate file for the datetime helpers, called `R/dplyr-datetime-helpers.R`
* all bindings are in `dplyr-funcs-datetime.R` (with the exception of `leap_year`, which was moved to `expressions.R`)
* all tests are in `test-dplyr-funcs-datetime.R`

* cyclomatic complexity for the `dplyr-funcs-datetime.R` reduced to 21 (from 26)

<details><summary>&#9658; More details</summary>


| Binding | Old registering function | New registering function |
|---|---|---|
| `strptime` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `strftime` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `format_ISO8601` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `second` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `wday` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `week` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `month` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `is.Date` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `is.instant` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `is.timepoint` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `is.POSIXct` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `leap_year` | `register_bindings_datetime()` | |
| `am` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `pm` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `tz` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `semester` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `date` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `make_datetime` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `make_date` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `ISOdatetime` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `ISOdate` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `difftime` | `register_bindings_duration()` | `register_bindings_duration()` |
| `as.difftime` | `register_bindings_duration()` | `register_bindings_duration()` |
| `decimal_date` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `date_decimal` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `duration_helpers_map_factory` | `register_bindings_duration_helpers()` | `register_bindings_duration_helpers()` |
| `dpicoseconds` | `register_bindings_duration_helpers()` | `register_bindings_duration_helpers()` |
| `make_difftime` | `register_bindings_difftime_constructors()` | `register_bindings_duration_constructor()` |
| `as.Date` | `register_bindings_type_cast()` | `register_bindings_datetime_conversion()` |
| `as_date` | `register_bindings_type_cast()` | `register_bindings_datetime_conversion()` |
| `as_datetime` | `register_bindings_type_cast()` | `register_bindings_datetime_conversion()` |

</details>



Closes #13029 from dragosmg/datetime_bindings_reorg

Authored-by: Dragoș Moldovan-Grünfeld <>
Signed-off-by: Nic Crane <>
2 weeks agoARROW-16116: [C++] Handle non-nullable fields when reading Parquet
David Li [Wed, 4 May 2022 13:19:16 +0000 (15:19 +0200)] 
ARROW-16116: [C++] Handle non-nullable fields when reading Parquet

Closes #12829 from lidavidm/arrow-16116

Authored-by: David Li <>
Signed-off-by: Antoine Pitrou <>
2 weeks agoARROW-16455: [CI][Packaging] Add linux-ppc64le to the list of platforms to clean...
Raúl Cumplido [Wed, 4 May 2022 13:00:16 +0000 (15:00 +0200)] 
ARROW-16455: [CI][Packaging] Add linux-ppc64le to the list of platforms to clean on conda

This PR aims to fix the issue:
"[ERROR] ('Storage requirements exceeded (3221225472 bytes). Payment is required to add a file. Please go to to update your plan', 402)"

Closes #13066 from raulcd/ARROW-16455

Authored-by: Raúl Cumplido <>
Signed-off-by: Antoine Pitrou <>
2 weeks agoMINOR: [Release] Use CONDA_ENV instead of VENV_ENV when creating conda environments
Krisztián Szűcs [Wed, 4 May 2022 12:06:39 +0000 (08:06 -0400)] 
MINOR: [Release] Use CONDA_ENV instead of VENV_ENV when creating conda environments

Closes #13061 from kszucs/conda_env

Authored-by: Krisztián Szűcs <>
Signed-off-by: David Li <>
2 weeks agoARROW-16335: [Release][C++] Windows source verification runs C++ tests on a single...
Jacob Wujciak-Jens [Wed, 4 May 2022 09:30:56 +0000 (11:30 +0200)] 
ARROW-16335: [Release][C++] Windows source verification runs C++ tests on a single thread

Closes #13054 from assignUser/ARROW-16335-verify-mc

Authored-by: Jacob Wujciak-Jens <>
Signed-off-by: Krisztián Szűcs <>
2 weeks agoARROW-16357: [Archery][Dev] Add possibility to send nightly reports to Zulip/Slack
Raúl Cumplido [Wed, 4 May 2022 09:29:50 +0000 (11:29 +0200)] 
ARROW-16357: [Archery][Dev] Add possibility to send nightly reports to Zulip/Slack

This PR adds the possibility to send the reports via a webhook to Zulip / Slack.

Usage example:
$ archery crossbow -t $GITHUB_TOKEN report-chat --webhook $SLACK_OR_ZULIP_WEBHOOK_URL --send --no-fetch $JOB_NAME
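
For readers unfamiliar with chat webhooks, the general mechanism is an HTTP POST with a small JSON body. The Python sketch below builds such a request without sending it. It is not Archery's code: `build_webhook_request` and the example URL are hypothetical, and the `{"text": ...}` payload is the shape accepted by Slack incoming webhooks, so other services may expect something different.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, message):
    """Build (but do not send) a JSON POST for a chat webhook."""
    # Slack incoming webhooks accept a JSON body with a "text" field; other
    # services may expect a different payload.
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_webhook_request("https://example.invalid/hook", "Nightly report")
# Sending would be: urllib.request.urlopen(request)
```
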
The integration has been tested both in Zulip:

And Slack:

Closes #13031 from raulcd/ARROW-16357

Authored-by: Raúl Cumplido <>
Signed-off-by: Krisztián Szűcs <>
2 weeks agoMINOR: [Release] Don't install go if it can be located
Krisztián Szűcs [Tue, 3 May 2022 21:31:52 +0000 (06:31 +0900)] 
MINOR: [Release] Don't install go if it can be located

Closes #13048 from kszucs/dont-install-go

Lead-authored-by: Krisztián Szűcs <>
Co-authored-by: Sutou Kouhei <>
Signed-off-by: Sutou Kouhei <>
2 weeks agoARROW-16276: [R] Arrow 8.0 News
Will Jones [Tue, 3 May 2022 19:30:27 +0000 (15:30 -0400)] 
ARROW-16276: [R] Arrow 8.0 News

Let me know if I've missed anything important in this release!

Closes #13005 from wjones127/ARROW-16276-r-news-8

Lead-authored-by: Will Jones <>
Co-authored-by: Neal Richardson <>
Co-authored-by: Dewey Dunnington <>
Signed-off-by: Neal Richardson <>