Dragoș Moldovan-Grünfeld [Wed, 18 May 2022 20:28:23 +0000 (13:28 -0700)]
ARROW-16281: [R] [CI] Bump versions with the release of 4.2
Update hard-coded versions of R in our CI after the release of R 4.2.
Closes #12980 from dragosmg/r_42_ci_update
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>
Rafael Telles [Wed, 18 May 2022 20:04:47 +0000 (16:04 -0400)]
MINOR: [FlightRPC] Document assumption about catalogs support on SQL_CATALOG_TERM
To indicate that a Flight SQL server does not support catalogs, we assume that sources will return an empty string for `SQL_CATALOG_TERM` in Flight SQL's CommandGetSqlInfo response.
Closes #13175 from rafael-telles/document-catalog-support
Authored-by: Rafael Telles <rafael@telles.dev>
Signed-off-by: David Li <li.davidm96@gmail.com>
Todd Farmer [Wed, 18 May 2022 17:52:18 +0000 (13:52 -0400)]
ARROW-16427: [Java] Provide explicit column type mapping
Closes #13166 from toddfarmer/toddfarmer/arrow-16427
Authored-by: Todd Farmer <todd@fivefarmers.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Raúl Cumplido [Wed, 18 May 2022 15:39:32 +0000 (17:39 +0200)]
MINOR: Fix wrongly redefining pytestmark for parquet encryption tests (#13189)
Dragoș Moldovan-Grünfeld [Wed, 18 May 2022 09:44:53 +0000 (10:44 +0100)]
ARROW-16516: [R] Implement ym() my() and yq() parsers
The `ym()`, `my()` and `yq()` bindings will make the following possible (and identical):
``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
test_df <- tibble::tibble(
ym_string = c("2022-05", "2022/02", "22.03", NA)
)
test_df %>%
mutate(ym_date = ym(ym_string))
#> # A tibble: 4 × 2
#> ym_string ym_date
#> <chr> <date>
#> 1 2022-05 2022-05-01
#> 2 2022/02 2022-02-01
#> 3 22.03 2022-03-01
#> 4 <NA> NA
test_df %>%
arrow_table() %>%
mutate(ym_date = ym(ym_string)) %>%
collect()
#> # A tibble: 4 × 2
#> ym_string ym_date
#> <chr> <date>
#> 1 2022-05 2022-05-01
#> 2 2022/02 2022-02-01
#> 3 22.03 2022-03-01
#> 4 <NA> NA
```
<sup>Created on 2022-05-16 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
I've implemented this with the following steps:
* add `"-01"` to the end of the strings we're trying to parse, and then
* use one of the supported `orders` (`"ymd"` or `"myd"`)
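The two steps above can be sketched outside of R as well. The following Python snippet is only an illustration of the idea, not the code in this PR: `parse_ym` is a hypothetical helper, separators are normalised by hand, and two-digit years are assumed to fall in the 2000s.

```python
import re
from datetime import date, datetime


def parse_ym(text):
    """Parse a year-month string by padding it to a full date (hypothetical helper)."""
    if text is None:
        return None
    # Accept "-", "/" or "." between the year and the month (a simplification).
    year, month = re.split(r"[-/.]", text.strip())
    if len(year) == 2:
        year = "20" + year  # assumption: two-digit years belong to the 2000s
    padded = f"{year}-{month}-01"  # step 1: add "-01" to the end
    return datetime.strptime(padded, "%Y-%m-%d").date()  # step 2: parse as ymd


if __name__ == "__main__":
    for raw in ["2022-05", "2022/02", "22.03", None]:
        print(raw, parse_ym(raw))
```

Padding with the first day of the month means the year-month case can reuse the existing year-month-day parsing path instead of needing a parser of its own.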
Closes #13163 from dragosmg/ym_my_yq_parsers
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sasha Krassovsky [Wed, 18 May 2022 04:01:46 +0000 (18:01 -1000)]
ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins
This adds Bloom filter pushdown between hash join nodes.
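As a rough illustration of the general technique (not the C++ implementation in this PR), the Python sketch below builds a Bloom filter from the build-side keys of a hash join and applies it to the probe-side rows before they reach the hash table. `BloomFilter`, `hash_join_with_pushdown`, the filter size and the hash scheme are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits  # must be a multiple of 8 in this sketch
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def hash_join_with_pushdown(build_rows, probe_rows, key):
    """Inner join that filters probe rows with a Bloom filter built from the build side."""
    bloom = BloomFilter()
    table = {}
    for row in build_rows:
        bloom.add(row[key])
        table.setdefault(row[key], []).append(row)
    # The "pushdown": rows that cannot match are dropped before probing.
    candidates = [row for row in probe_rows if bloom.might_contain(row[key])]
    joined = [{**b, **p} for p in candidates for b in table.get(p[key], [])]
    return joined, len(candidates)
```

In a query plan with several joins, the benefit comes from applying the filter as early as possible on the probe input, so that rows with no possible match never flow through the intermediate nodes.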
Closes #12289 from save-buffer/sasha_bloom_pushdown
Lead-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
Co-authored-by: michalursa <michal@ursacomputing.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sutou Kouhei [Wed, 18 May 2022 03:20:06 +0000 (12:20 +0900)]
ARROW-16601: [C++][FlightRPC] Don't enforce static link with static GoogleTest for arrow_flight_testing (#13180)
We can remove this because #13169/ARROW-16588 solved the link problem.
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Yibo Cai [Wed, 18 May 2022 02:10:03 +0000 (02:10 +0000)]
ARROW-16478: [C++] Refine cpu info detection
This patch separates OS and ARCH dependent code and removes CPU
frequency detection (cycles_per_ms()) which is brittle and not very
useful in practice.
There are still many caveats, especially for Arm platform. It's better
to adopt a mature library if we want more complete functionalities.
E.g., github.com/pytorch/cpuinfo.
Below are examples of cpu info detected on various platforms (some
from virtual machines).
Intel, Linux
------------
Vendor: Intel
Model: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Features (set bits): 0 1 2 3 4 5 6 7 8 9 10 11 12
Cache sizes: 32768 1048576 37486592
AMD, Linux
----------
Vendor: AMD
Model: AMD EPYC 7251 8-Core Processor
Features (set bits): 0 1 2 3 4 5 11 12
Cache sizes: 32768 524288 33554432
Intel, MacOS
------------
Vendor: Unknown
Model: Unknown
Features (set bits): 0 1 2 3 4
Cache sizes: 32768 262144 12582912
Intel, Windows
--------------
Vendor: Intel
Model: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\0\0
Features (set bits): 0 1 2 3 4 5 6 7 8 9 10 11 12
Cache sizes: 131072 2097152 37486592
Intel, MinGW
------------
Vendor: Intel
Model: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz\0\0\0\0\0\0\0
Features (set bits): 0 1 2 3 4 5 11 12
Cache sizes: 131072 524288 52428800
Arm, Linux
----------
Vendor: Unknown
Model: Unknown
Features (set bits): 32
Cache sizes: 65536 1048576 Unknown
Arm, MacOS
----------
Vendor: Unknown
Model: Unknown
Features (set bits): 32
Cache sizes: 65536 4194304 Unknown
Closes #13112 from cyb70289/cpuinfo-refine
Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Even Rouault [Wed, 18 May 2022 01:54:48 +0000 (01:54 +0000)]
MINOR: [C++] cpp/parquet/Statistics: clarify that num_values() is the number of non-null values
The current documentation of Statistics::num_values() is a bit
ambiguous as it mentions the 'total number of values', and my initial
understanding was that it also included null values. But experimentation
and the documentation at https://arrow.apache.org/docs/python/generated/pyarrow.parquet.Statistics.html
show that it is the number of non-null values.
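To make the distinction concrete, here is a small Python sketch of column statistics in which nulls are counted separately. `column_statistics` is a hypothetical helper written for this note, not part of the Parquet C++ or pyarrow API.

```python
def column_statistics(values):
    """Compute simple statistics where num_values excludes nulls (illustrative only)."""
    non_null = [v for v in values if v is not None]
    return {
        "num_values": len(non_null),  # non-null values only
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


if __name__ == "__main__":
    print(column_statistics([3, None, 1, None, 7]))
```

Under this reading, the total number of rows covered by the statistics is `num_values + null_count`.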
Closes #13164 from rouault/statistics_num_values
Authored-by: Even Rouault <even.rouault@spatialys.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Neal Richardson [Tue, 17 May 2022 20:54:27 +0000 (05:54 +0900)]
ARROW-16570: [R] Make pkg-config commands find all of the libs
See discussion at https://lists.apache.org/thread/7brfhv6mzc392jqo04o1jp5vos7g7lnj
We don't currently have any CI that triggers the case @rvernica reported there.
Closes #13151 from nealrichardson/r-pkg-libs
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
David Li [Tue, 17 May 2022 20:10:52 +0000 (05:10 +0900)]
ARROW-16588: [C++][FlightRPC] Don't subclass GTest in test helpers
Also, don't link every Arrow library to UCX when enabled.
Closes #13169 from lidavidm/arrow-16588
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Matthew Topol [Tue, 17 May 2022 15:00:17 +0000 (11:00 -0400)]
ARROW-16555: [Go][Parquet] Lift BitBlockCounter and VisitBitBlocks into shared internal utils
Closes #13135 from zeroshade/arrow-16555-shared-utils
Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Matthew Topol [Tue, 17 May 2022 14:58:52 +0000 (10:58 -0400)]
ARROW-16552: [Go] Improve decimal128 utilities
Adding new utilities for decimal128.Num for rescaling and for converting to and from float32/64
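The kind of arithmetic involved can be sketched in Python using plain integers as the unscaled decimal value. The function names and the choice to raise an error when a rescale would drop non-zero digits are assumptions for illustration; they do not describe the Go API added here.

```python
def rescale(value, from_scale, to_scale):
    """Change the scale of an unscaled decimal integer without losing digits."""
    delta = to_scale - from_scale
    if delta >= 0:
        return value * 10 ** delta
    quotient, remainder = divmod(value, 10 ** -delta)
    if remainder != 0:
        # Assumption: refuse to silently truncate non-zero digits.
        raise ValueError("rescale would lose data")
    return quotient


def from_float(x, scale):
    """Convert a float to an unscaled integer at the given scale (rounds to nearest)."""
    return round(x * 10 ** scale)


def to_float(value, scale):
    """Convert an unscaled integer at the given scale back to a float."""
    return value / 10 ** scale
```

For example, the value 1.23 stored at scale 2 is the integer 123; rescaling it to scale 4 gives 12300, and rescaling back gives 123 again.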
Closes #13134 from zeroshade/arrow-16552-decimals
Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Robert Purdom [Tue, 17 May 2022 14:57:04 +0000 (10:57 -0400)]
ARROW-16530: [Go] Added concurrency in key places that are always serial, regardless if parallel=true or not
Added concurrency to field readers. Even when parallel=true, there are times when the default behavior is serial, which causes very slow performance when dealing with many columns and structures with many columns.
I'm working with very complex parquet files that have 500+ columns and lists of structures with hundreds of columns. In the original code, getting the field readers is always done serially, regardless of whether parallel is true. This is also true when the readers retrieve the 'next batch' of records. I modified the code to perform concurrent 'read' operations in three places in two files. The performance impact is especially heavy on high-latency files, e.g., cloud storage.
The original version required just over an hour to read 600+ columns from GCS. The revised version completes the same read in ~11 minutes.
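The effect of issuing per-column reads concurrently can be sketched in Python with a thread pool. This is not the Go change itself: `read_column`, its simulated latency and the worker count are placeholders chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_column(name, latency=0.05):
    """Stand-in for a high-latency column read (e.g. from cloud storage)."""
    time.sleep(latency)
    return name, [f"{name}-{i}" for i in range(3)]


def read_columns(names, parallel=True, max_workers=8):
    """Read all columns, either one after another or concurrently."""
    if not parallel:
        return dict(read_column(name) for name in names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the result matches the serial path.
        return dict(pool.map(read_column, names))


if __name__ == "__main__":
    columns = [f"col{i}" for i in range(16)]
    start = time.perf_counter()
    read_columns(columns, parallel=False)
    serial = time.perf_counter() - start
    start = time.perf_counter()
    read_columns(columns, parallel=True)
    print(f"serial {serial:.2f}s, concurrent {time.perf_counter() - start:.2f}s")
```

When each read is dominated by I/O latency rather than CPU work, overlapping the waits is where most of the saving comes from, which is consistent with the larger gains reported for cloud storage.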
Closes #13120 from raceordie690/master
Authored-by: Robert Purdom <50066327+raceordie690@users.noreply.github.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Raúl Cumplido [Tue, 17 May 2022 09:58:12 +0000 (11:58 +0200)]
ARROW-16548: [Python] Add pytest.mark.parquet to all tests under tests/parquet package
The implementation marks all the individual tests that are on this structure with the parquet dataset mark correctly.
Closes #13147 from raulcd/ARROW-16548
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Dragoș Moldovan-Grünfeld [Tue, 17 May 2022 09:17:53 +0000 (10:17 +0100)]
ARROW-16541: [R] [CI] Reduce the number of times lintr is run
This PR will reduce the number of times `lintr::lint_package()` is being run to 1 (run only on the current release). At present it runs on each branch of the Windows CI workflows.
Closes #13162 from dragosmg/run_lintr_once
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sutou Kouhei [Tue, 17 May 2022 07:12:38 +0000 (09:12 +0200)]
ARROW-16507: [CI][C++] Use system gtest with mamba/conda
The gtest package for Windows provided by conda-forge doesn't provide `GTestConfig.cmake`.
See also: https://github.com/conda-forge/gtest-feedstock/blob/main/recipe/bld.bat
And `FindGTest.cmake` provided by CMake can't find `gtest_dll.dll` that is a shared
library version of GoogleTest.
See also: https://gitlab.kitware.com/cmake/cmake/-/blob/master/Modules/FindGTest.cmake
It means that we can find only static version of GoogleTest on Windows with Conda
without a custom `FindGTestAlt.cmake`.
Shared library version `arrow_flight_testing` requires shared library version GoogleTest on
Windows because it defines `arrow::flight::FlightTest` that inherits `testing::Test`.
See also: https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/test_definitions.h
We must use the same library type for them on Windows.
To support `ARROW_FLIGHT=ON`/`ARROW_BUILD_SHARED=ON`/`ARROW_BUILD_STATIC=OFF`/
`ARROW_BUILD_TESTS=ON` with static version of GoogleTest, we need to build a static library not
shared library for `arrow_flight_testing`.
Closes #13101 from assignUser/ARROW-16507-fix-gtest2
Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Larry White [Tue, 17 May 2022 00:06:58 +0000 (09:06 +0900)]
ARROW-16571: [Java] Update .gitignore to exclude JNI-related binaries
Adds three lines to gitignore to exclude three folders containing binaries and other build output produced by the JNI build process. The folders:
- java-dist/
- java-native-c/
- java-native-cpp/
are created in the root arrow directory when cmake is run.
The command line build is documented here: https://arrow.apache.org/docs/dev/developers/java/building.html#
I followed the macOS instructions:
[Building JNI Libraries on MacOS](https://arrow.apache.org/docs/dev/developers/java/building.html#id7)
To build only the C Data Interface library:
$ cd arrow
$ brew bundle --file=cpp/Brewfile
Homebrew Bundle complete! 25 Brewfile dependencies now installed.
$ export JAVA_HOME=<absolute path to your java home>
$ mkdir -p java-dist java-native-c
$ cd java-native-c
$ cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_INSTALL_PREFIX=../java-dist \
../java/c
$ cmake --build . --target install
To build other JNI libraries:
$ cd arrow
$ brew bundle --file=cpp/Brewfile
Homebrew Bundle complete! 25 Brewfile dependencies now installed.
$ export JAVA_HOME=<absolute path to your java home>
$ mkdir -p java-dist java-native-cpp
$ cd java-native-cpp
$ cmake \
-DARROW_BOOST_USE_SHARED=OFF \
-DARROW_BROTLI_USE_SHARED=OFF \
-DARROW_BZ2_USE_SHARED=OFF \
-DARROW_GFLAGS_USE_SHARED=OFF \
-DARROW_GRPC_USE_SHARED=OFF \
-DARROW_LZ4_USE_SHARED=OFF \
-DARROW_OPENSSL_USE_SHARED=OFF \
-DARROW_PROTOBUF_USE_SHARED=OFF \
-DARROW_SNAPPY_USE_SHARED=OFF \
-DARROW_THRIFT_USE_SHARED=OFF \
-DARROW_UTF8PROC_USE_SHARED=OFF \
-DARROW_ZSTD_USE_SHARED=OFF \
-DARROW_JNI=ON \
-DARROW_PARQUET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_DATASET=ON \
-DARROW_GANDIVA_JAVA=ON \
-DARROW_GANDIVA_STATIC_LIBSTDCPP=ON \
-DARROW_GANDIVA=ON \
-DARROW_ORC=ON \
-DARROW_PLASMA_JAVA_CLIENT=ON \
-DARROW_PLASMA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_INSTALL_PREFIX=../java-dist \
-DCMAKE_UNITY_BUILD=ON \
-Dre2_SOURCE=BUNDLED \
-DBoost_SOURCE=BUNDLED \
-Dutf8proc_SOURCE=BUNDLED \
-DSnappy_SOURCE=BUNDLED \
-DORC_SOURCE=BUNDLED \
-DZLIB_SOURCE=BUNDLED \
../cpp
$ cmake --build . --target install
[Building Arrow JNI Modules](https://arrow.apache.org/docs/dev/developers/java/building.html#id8)
To compile the JNI bindings, use the arrow-c-data Maven profile:
$ cd arrow/java
$ mvn -Darrow.c.jni.dist.dir=../java-dist/lib -Parrow-c-data clean install
To compile the JNI bindings for ORC / Gandiva / Dataset, use the arrow-jni Maven profile:
$ cd arrow/java
$ mvn -Darrow.cpp.build.dir=../java-dist/lib -Parrow-jni clean install
Closes #13153 from lwhite1/update-gitignore-to-include-folders-where-binaries-are-created
Authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Jacob Wujciak-Jens [Mon, 16 May 2022 19:37:07 +0000 (21:37 +0200)]
MINOR: [CI] Fix centos cmake path (#13167)
* fix centos cmake path
* extract to standardized dir
stczwd [Mon, 16 May 2022 16:41:21 +0000 (12:41 -0400)]
ARROW-16568: [Java] Enable skip BOUNDS_CHECKING with setBytes and getBytes of ArrowBuf
We have BOUNDS_CHECKING_SKIP in ArrowBuf.setByte and ArrowBuf.getByte, which helps to remove unexpected bounds checks. However, it doesn't exist in ArrowBuf.setBytes or ArrowBuf.getBytes, so bounds checking costs about 10% of CPU time in our environment.
Closes #13161 from jackylee-ch/skip_bounds_check_for_set_or_get_bytes
Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Weston Pace [Mon, 16 May 2022 15:32:24 +0000 (11:32 -0400)]
ARROW-16525: [C++] Tee node not properly marking node finished
Closes #13117 from westonpace/feature/ARROW-16525--tee-node-not-marking-finished
Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Raúl Cumplido [Mon, 16 May 2022 09:07:46 +0000 (11:07 +0200)]
ARROW-16531: [Dev] Update pre-commit to use latest flake8 and remove unsupported cython linting
This PR tries to update and make consistent our linting for Python between archery and pre-commit by removing the checks for cython files. Flake8 does not support Cython linting, see [this conversation](https://github.com/PyCQA/flake8/issues/1482) for more details.
Closes #13129 from raulcd/ARROW-16531
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
William Hyun [Mon, 16 May 2022 07:59:07 +0000 (09:59 +0200)]
ARROW-16581: [C++][Java] Upgrade ORC to 1.7.4
Bump ORC to 1.7.4.
Apache ORC 1.7.4 is a maintenance release with the following bug fixes.
- https://github.com/apache/orc/releases/tag/v1.7.4
- https://orc.apache.org/news/2022/04/15/ORC-1.7.4/
Closes #13159 from williamhyun/orc174
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Mark Wolfe [Sun, 15 May 2022 17:35:30 +0000 (13:35 -0400)]
ARROW-16504 [Go][CSV] Add arrow.TimestampType support to the reader
There is already a helper to convert strings to arrow.Timestamp so incorporate this into the CSV reader.
The CSV files I am currently working with have RFC3339 timestamps, so I followed some of the JSON code and stuck with the millisecond default.
Was really easy to add this using the existing functions and structure.
Closes #13098 from wolfeidau/ARROW-16504-add-timestamp-support-to-reader
Authored-by: Mark Wolfe <mark@wolfe.id.au>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Hamish Nicholson [Sat, 14 May 2022 20:05:41 +0000 (05:05 +0900)]
ARROW-16572: [C++] Fix LZ4 build for external projects
The use of CMAKE_SOURCE_DIR is not compatible with projects that build arrow from source as a CMake project dependency.
Closes #13154 from Shamazo/master
Authored-by: Hamish Nicholson <hmnicholson12@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Matt Topol [Sat, 14 May 2022 18:01:16 +0000 (14:01 -0400)]
ARROW-16579: [Go][CI] Fix Flakey Struct Test
The tests for [ARROW-16502](https://issues.apache.org/jira/browse/ARROW-16502) reused the same `StructBuilder` between cases. But since one case was testing for a panic, it left the `StructBuilder` in a bad state, which is understandable given that it panicked. Because the iteration order of Go maps is non-deterministic, the test failed only intermittently: it failed if the panic case ran first and succeeded if it ran second.
By moving the creation of the `StructBuilder` inside the test function, the two test cases no longer share it and the test is no longer flakey.
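The failure mode and the fix can be illustrated with a small Python analogue. `StructLikeBuilder` and `run_cases` are hypothetical stand-ins for the Go builder and test loop; the point is only that each case constructs its own builder, so the order of the cases no longer matters.

```python
class StructLikeBuilder:
    """Toy builder that is left unusable after a failed append (illustrative)."""

    def __init__(self):
        self.rows = []
        self.broken = False

    def append(self, record):
        if self.broken:
            raise RuntimeError("builder was left in a bad state")
        if "required" not in record:
            self.broken = True
            raise ValueError("missing required field")
        self.rows.append(record)


def run_cases(cases):
    """Run each case against a fresh builder and report whether it behaved as expected."""
    results = {}
    for name, (record, expect_error) in cases.items():
        builder = StructLikeBuilder()  # created per case, never shared
        try:
            builder.append(record)
            failed = False
        except ValueError:
            failed = True
        results[name] = failed == expect_error
    return results
```

With a single builder shared across the loop, running the failing case first would poison the builder for the case that follows it.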
Closes #13158 from zeroshade/arrow-16579-flakeytest
Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Mark Wolfe [Sat, 14 May 2022 17:08:06 +0000 (13:08 -0400)]
ARROW-16561 [Go][Parquet] test for parquet root node configuration
As requested in #13139 I have added a test with some example configuration to verify it works as intended.
I added it to the schema test as well as that is where I am using it, hopefully that is fine @zeroshade .
Closes #13156 from wolfeidau/ARROW-16561-customise-root-node-test
Authored-by: Mark Wolfe <mark@wolfe.id.au>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Matt DePero [Sat, 14 May 2022 16:43:38 +0000 (12:43 -0400)]
ARROW-16563: [Go][Parquet] Fix broken parquet plain boolean decoder
While reading parquet files using this library, we discovered that boolean fields with `PLAIN` encoding were not being read properly. It was discovered that this was due to how `bitOffset` was managed in the boolean decoder; this PR patches the decoder and also adds test coverage for reading plain boolean pages.
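For context, Parquet's `PLAIN` encoding stores booleans bit-packed, least significant bit first, so a decoder that is called several times on the same page has to remember how far into the current byte it has read. The Python sketch below shows one way to track that position. It is an illustration of the bookkeeping, not the patched Go decoder, and `PlainBooleanDecoder` is a name invented for the example.

```python
class PlainBooleanDecoder:
    """Decode bit-packed booleans (LSB first) across multiple decode() calls."""

    def __init__(self, data):
        self.data = data
        self.bit_offset = 0  # absolute index of the next unread bit

    def decode(self, count):
        remaining = len(self.data) * 8 - self.bit_offset
        count = min(count, remaining)
        out = []
        for _ in range(count):
            byte = self.data[self.bit_offset // 8]
            out.append(bool((byte >> (self.bit_offset % 8)) & 1))
            self.bit_offset += 1  # carries over to the next call
        return out


if __name__ == "__main__":
    decoder = PlainBooleanDecoder(bytes([0b10110010, 0b00000001]))
    print(decoder.decode(3))
    print(decoder.decode(7))
```

Keeping a single absolute bit position avoids the class of bug where a second call restarts at a byte boundary and skips or repeats bits.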
Closes #13141 from mdepero/depero/boolencodingfix
Authored-by: Matt DePero <depero@neeva.co>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Raúl Cumplido [Sat, 14 May 2022 06:54:25 +0000 (15:54 +0900)]
ARROW-16569: [CI] Update checkout actions to newer version
This PR aims to update the github checkout actions to the newest version.
Closes #13152 from raulcd/ARROW-16569
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Weston Pace [Fri, 13 May 2022 23:10:20 +0000 (13:10 -1000)]
ARROW-16498: [C++] Fix potential deadlock in arrow::compute::TaskScheduler
Closes #13091 from westonpace/bugfix/ARROW-16498--task-scheduler-deadlock
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Larry White [Fri, 13 May 2022 21:29:11 +0000 (17:29 -0400)]
ARROW-16534: [Java] update Gandiva protobuf library to enable builds on M1
The currently used protobuf library, version 2.5.0, does not include support for M1, causing tests to fail after compiling from source on Apple Silicon. Version 3.20.1 does provide M1 support. This PR updates the library version.
Closes #13121 from lwhite1/update-protobuf-dependenc
Authored-by: Larry White <ljw1001@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sutou Kouhei [Fri, 13 May 2022 20:02:18 +0000 (16:02 -0400)]
ARROW-16539: [C++] Bump bundled thrift to 0.16.0
Closes #13122 from nealrichardson/bump-thrift
Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Mark Wolfe [Fri, 13 May 2022 17:17:21 +0000 (13:17 -0400)]
ARROW-16561 [Go][Parquet] add option to customise parquet root node
Currently the root node's name and repetition are hard-coded, and the current settings need to be changed to work with some other tools.
Closes #13139 from wolfeidau/ARROW-16561-customise-root-node
Authored-by: Mark Wolfe <mark@wolfe.id.au>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Todd Farmer [Fri, 13 May 2022 12:13:02 +0000 (08:13 -0400)]
ARROW-16538: [Java] Adding flexibility to mock ResultSets
The minimum required to support existing use cases of FakeResultSet has been implemented here - every other method throws a SQLException, and support can be added in the future as specific methods of MockResultSet are referenced.
Closes #13123 from toddfarmer/toddfarmer/arrow-16427
Authored-by: Todd Farmer <todd@fivefarmers.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Jacob Wujciak-Jens [Fri, 13 May 2022 07:50:39 +0000 (09:50 +0200)]
ARROW-16402: [R][CI] Create new Archery Tasks
This PR introduces two new archery docker tasks for future use in building the R nightlies in Crossbow.
Closes #13131 from assignUser/ARROW-16402
Lead-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Alessandro Molina <amol@turbogears.org>
Yaron Gvili [Thu, 12 May 2022 20:03:47 +0000 (16:03 -0400)]
ARROW-16425: [C++] Add compute kernel test for scalar array timestamp comparison
See https://issues.apache.org/jira/browse/ARROW-16425
Closes #13037 from rtpsw/ARROW-16425
Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Yibo Cai [Thu, 12 May 2022 19:56:42 +0000 (15:56 -0400)]
ARROW-16183: [C++][FlightRPC] Support bundled UCX
Closes #12881 from cyb70289/16183-bundled-ucx
Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Przemysław Kowolik [Thu, 12 May 2022 17:47:56 +0000 (13:47 -0400)]
ARROW-16502: [Go] Accept missing optional fields when unmarshalling JSON in StructBuilder
When calling array.StructBuilder.UnmarshalJSON with a JSON object that has missing optional fields, it fails to decode the JSON object properly and panics, even though dropping empty/null fields from JSON is common behavior.
Fix this by filling all missing optional fields with null values so that the builder does not panic.
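The behavior described above can be sketched in Python as follows. `unmarshal_struct` and the `(name, nullable)` field description are assumptions made for the example and do not mirror the Go `StructBuilder` API.

```python
import json


def unmarshal_struct(text, fields):
    """Decode a JSON object into a row, filling absent nullable fields with None.

    `fields` is a list of (name, nullable) pairs describing the struct type.
    """
    obj = json.loads(text)
    row = {}
    for name, nullable in fields:
        if name in obj:
            row[name] = obj[name]
        elif nullable:
            row[name] = None  # missing optional field becomes null
        else:
            raise ValueError(f"missing required field: {name}")
    return row


if __name__ == "__main__":
    schema = [("id", False), ("comment", True)]
    print(unmarshal_struct('{"id": 1}', schema))
```

Treating an absent optional key the same as an explicit `null` matches what many JSON producers emit when they omit empty fields.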
Closes #13097 from Kowol/bufix/struct-unmarshal-optional-fields
Authored-by: Przemysław Kowolik <przemyslaw.kowolik@gmail.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Yuqi Gu [Thu, 12 May 2022 17:46:08 +0000 (13:46 -0400)]
ARROW-16486: [Go] Implement bit_packing functions with Arm64 GoLang Assembly
Implement the functions of unpack32_neon in the module: parquet-bitpacking.
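For readers unfamiliar with what these routines compute, the Python sketch below packs and unpacks fixed-width integers in the least-significant-bit-first layout used by Parquet's bit-packed runs. It is a scalar reference written for illustration; the function names are invented and it says nothing about how the Arm64 assembly is structured.

```python
def pack(values, bit_width):
    """Pack non-negative integers into bytes, bit_width bits each, LSB first."""
    mask = (1 << bit_width) - 1
    buffer = 0
    for i, value in enumerate(values):
        buffer |= (value & mask) << (i * bit_width)
    num_bytes = (len(values) * bit_width + 7) // 8
    return buffer.to_bytes(num_bytes, "little")


def unpack(data, bit_width, count):
    """Inverse of pack(): extract `count` values of bit_width bits each."""
    mask = (1 << bit_width) - 1
    buffer = int.from_bytes(data, "little")
    return [(buffer >> (i * bit_width)) & mask for i in range(count)]


if __name__ == "__main__":
    packed = pack(list(range(8)), 3)
    print(packed.hex(), unpack(packed, 3, 8))
```

A vectorised implementation produces the same values but processes a block of 32 outputs per call with a specialised routine per bit width.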
Tests passed:
```
go test ./...
ok github.com/apache/arrow/go/v8/parquet/internal/utils
```
Closes #13080 from guyuqi/ARROW-16486
Authored-by: Yuqi Gu <yuqi.gu@arm.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Dragoș Moldovan-Grünfeld [Thu, 12 May 2022 14:41:40 +0000 (15:41 +0100)]
ARROW-16394: [R] Implement lubridate's parsers with year, month and date components
This PR adds bindings for lubridate's parsers with **y**ear, **m**onth, and **d**ay components, allowing the following to work correctly:
``` r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
test_df <- tibble::tibble(
ymd_string = c("2022-05-11", "2022/05/12", "22.05-13")
)
test_df %>%
mutate(ymd_date = ymd(ymd_string))
#> # A tibble: 3 × 2
#> ymd_string ymd_date
#> <chr> <date>
#> 1 2022-05-11 2022-05-11
#> 2 2022/05/12 2022-05-12
#> 3 22.05-13 2022-05-13
test_df %>%
arrow_table() %>%
mutate(ymd_date = ymd(ymd_string)) %>%
collect()
#> # A tibble: 3 × 2
#> ymd_string ymd_date
#> <chr> <date>
#> 1 2022-05-11 2022-05-11
#> 2 2022/05/12 2022-05-12
#> 3 22.05-13 2022-05-13
```
<sup>Created on 2022-05-11 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
Closes #13118 from dragosmg/ymd_parsers
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Larry White [Thu, 12 May 2022 13:43:35 +0000 (09:43 -0400)]
MINOR: [Java] Indicate absolute path is required in docs
Because Maven is set up to run the tests in a temp folder, the path provided to the native libraries must be absolute or the tests will fail.
Closes #13114 from lwhite1/patch-1
Lead-authored-by: Larry White <ljw1001@gmail.com>
Co-authored-by: Larry White <lwhite1@users.noreply.github.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Weston Pace [Thu, 12 May 2022 11:25:26 +0000 (13:25 +0200)]
ARROW-16526: [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET
One of the legacy parquet dataset tests was not properly passing use_legacy_dataset, which caused the test to attempt to use the new datasets module even if it wasn't enabled.
Closes #13116 from westonpace/bugfix/MINOR--missing-dataset-mark
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Alessandro Molina [Thu, 12 May 2022 10:01:41 +0000 (12:01 +0200)]
ARROW-16468: [Python] Test Table filter feature with complex exprs and add Expression.apply method
Depends on https://github.com/apache/arrow/pull/13075
Closes #13099 from amol-/ARROW-16468
Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Alessandro Molina <amol@turbogears.org>
Raúl Cumplido [Thu, 12 May 2022 09:58:45 +0000 (11:58 +0200)]
ARROW-16508: [Archery][Dev] Add possibility to extend chat report message based on success or failures of jobs
Closes #13102 from raulcd/ARROW-16508
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Alessandro Molina <amol@turbogears.org>
Raúl Cumplido [Thu, 12 May 2022 09:28:30 +0000 (11:28 +0200)]
MINOR: Add feedback information output when building in case of skipping pyarrow build
This minor PR tries to give the user a better hint on what is happening in case of their build being skipped because `cachedir != build_temp`. I faced the issue and had to ask and debug to understand that I had to clean my previous build:
```
(pyarrow-dev) ~/arrow/python (master) $ python setup.py build_ext --inplace
running build_ext
(pyarrow-dev) ~/arrow/python (master) $
```
The new output will show in that case:
```
(pyarrow-dev) ~/arrow/python (master) $ python setup.py build_ext --inplace
running build_ext
-- Skipping build. Temp build /home/raulcd/arrow/python/build/temp.linux-x86_64-3.10 does not match cached dir /arrow/python/build/temp.linux-x86_64-3.10
(pyarrow-dev) ~/arrow/python (master) $
```
Closes #13119 from raulcd/minor-feedback-improvement
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sutou Kouhei [Wed, 11 May 2022 20:24:28 +0000 (05:24 +0900)]
ARROW-16168: [C++][CMake] Use target to add include paths
This lets us remove "include_directories(SYSTEM)".
Closes #12861 from kou/cpp-target-include-path
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Min-Young Wu [Wed, 11 May 2022 19:55:38 +0000 (15:55 -0400)]
ARROW-16473: [Go] fixing memory leak in serializedPageReader
`parquet/file.serializedPageReader` has a [memory.Buffer](https://github.com/apache/arrow/blob/8bd5514f52bf9cc542a389edaf697cbc2c97b752/go/parquet/file/page_reader.go#L299) attribute (presumably to reuse across page reads). But at the end of `serializedPageReader.Next` (in the non-error case), a new `memory.Buffer` is [created](https://github.com/apache/arrow/blob/8bd5514f52bf9cc542a389edaf697cbc2c97b752/go/parquet/file/page_reader.go#L615) without releasing the pre-existing `p.buf`, thus resulting in a leak.
Existing tests updated to test and catch this (`parquet/file` now uses `CheckedAllocator`).
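The pattern of the leak and of its fix can be modelled in Python with an allocator that counts live buffers. The classes below are simplified stand-ins invented for this note (the Go `CheckedAllocator` works differently in detail); the key line is the `release()` call before the buffer is replaced.

```python
class CountingAllocator:
    """Tracks how many buffers are still alive, in the spirit of a checked allocator."""

    def __init__(self):
        self.live = 0

    def allocate(self, size):
        self.live += 1
        return Buffer(self, size)


class Buffer:
    def __init__(self, allocator, size):
        self.allocator = allocator
        self.size = size
        self.released = False

    def release(self):
        if not self.released:
            self.released = True
            self.allocator.live -= 1


class PageReader:
    def __init__(self, allocator):
        self.allocator = allocator
        self.buf = allocator.allocate(0)

    def next(self, page_size):
        # Release the previous buffer before replacing it; skipping this
        # step leaks one buffer per page read.
        self.buf.release()
        self.buf = self.allocator.allocate(page_size)

    def close(self):
        self.buf.release()
```

A test that asserts the live count returns to zero after `close()` catches a regression of this kind immediately.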
Closes #13068 from minyoung/user/minyoung/0504-serialized-page-reader-leak
Authored-by: Min-Young Wu <mwu368@gmail.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Todd Farmer [Wed, 11 May 2022 18:30:55 +0000 (14:30 -0400)]
ARROW-16529: [Java] Fix ArrowVectorIterator.hasNext()
Calls to ArrowVectorIterator.hasNext() should return true until ArrowVectorIterator.next() has been called, and the underlying ResultSet has been fully consumed.
Closes #13107 from toddfarmer/tofarmer/fix-hasnext-for-empty-resultset
Authored-by: Todd Farmer <todd@fivefarmers.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Alvin Chunga [Wed, 11 May 2022 16:54:31 +0000 (18:54 +0200)]
ARROW-602: [C++] Provide iterator access to primitive elements inside an Array
Create Iterator method in stl for Array and ChunkedArray
Closes #13009 from AlvinJ15/ARROW-602#Provide_iterator_access_to_primitive_elements_inside_chunked_arrays
Lead-authored-by: Alvin Chunga <alvinchma@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Alessandro Molina [Wed, 11 May 2022 11:53:01 +0000 (13:53 +0200)]
ARROW-16467: [Python] Add helper function _exec_plan._filter_table to filter tables based on Expression
The function is focused on Tables, as it will be the foundation for `Table.filter`, but as the extra work was minimal I added support for Dataset too.
Closes #13075 from amol-/ARROW-16467
Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Johnnathan [Wed, 11 May 2022 10:06:53 +0000 (15:36 +0530)]
ARROW-13052: [Gandiva][C++] Add regexp_extract function
Implements the REGEXP_EXTRACT function based on [the Hive implementation](https://www.revisitclass.com/hadoop/regexp_extract-function-in-hive-with-examples/).
Closes #13015 from Johnnathanalmeida/feature/add-regexp-extract
Authored-by: Johnnathan <johnnathanalmeida@gmail.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
Dragoș Moldovan-Grünfeld [Wed, 11 May 2022 09:45:43 +0000 (10:45 +0100)]
MINOR: [R] correct NEWS heading
Closes #13106 from dragosmg/minor_news_update
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Dragoș Moldovan-Grünfeld [Wed, 11 May 2022 09:44:33 +0000 (10:44 +0100)]
ARROW-16253: [R] Helper function for casting from float to duration via int64()
Closes #13055 from dragosmg/duration_casting_helper
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Neal Richardson [Tue, 10 May 2022 16:48:22 +0000 (12:48 -0400)]
ARROW-16414: [R] Remove ARROW_R_WITH_ARROW and arrow_available()
The diff looks bigger than that because
* Sometimes those changes just resulted in reducing indentation
* I moved arrow_info() and related functions to their own file, and did the same with ArrowObject while I was there
* The way we were wrapping testthat::test_that to check whether arrow was available had a side effect of creating a closure that stored intermediate objects that we reused across tests, and that broke when I removed it.
* I didn't have styler configured correctly in vscode when I started because I had upgraded R to 4.2, so to fix what I had already committed that was unstyled, I ran `make style-all` across everything, which reformatted a bunch of unrelated code.
I tried to pull on all threads I noticed where we were doing things an unnatural way because we couldn't assume that arrow was present, but there may be more.
Closes #13086 from nealrichardson/arrow-is-available
Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Ariana Villegas [Tue, 10 May 2022 16:10:57 +0000 (18:10 +0200)]
ARROW-15587: [C++] Add support for all options specified by substrait::ReadRel::LocalFiles::FileOrFiles
The Substrait read operator defines files with LocalFiles::FileOrFiles. These elements can take one of several forms:
- uri_path (can be a file or a folder)
- uri_path_glob (a glob expression)
- uri_file (file only)
- uri_folder (folder only)
The C++ Substrait consumer currently only supports uri_file. This PR adds support for the other options.
- [x] uri_path (can be a file or a folder)
- [x] uri_path_glob (a glob expression)
- [x] uri_folder (folder only)
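To make the difference between the four forms concrete, here is a minimal Python sketch of how each one might expand into a list of files. It is an illustration only, not the C++ consumer's logic: `resolve_file_or_files` is a hypothetical helper, and plain local paths stand in for URIs.

```python
import glob
import os


def resolve_file_or_files(kind, value):
    """Expand one FileOrFiles-style entry into a sorted list of file paths.

    `kind` is one of the four field names from the commit message. Plain
    filesystem paths stand in for URIs (an assumption of this sketch).
    """
    if kind == "uri_file":
        if not os.path.isfile(value):
            raise ValueError(f"{value!r} is not a file")
        return [value]
    if kind == "uri_folder":
        if not os.path.isdir(value):
            raise ValueError(f"{value!r} is not a folder")
        return sorted(
            os.path.join(value, name)
            for name in os.listdir(value)
            if os.path.isfile(os.path.join(value, name))
        )
    if kind == "uri_path_glob":
        # Keep only regular files among the glob matches.
        return sorted(p for p in glob.glob(value) if os.path.isfile(p))
    if kind == "uri_path":
        # May be either a file or a folder, so dispatch on what is found.
        inner = "uri_folder" if os.path.isdir(value) else "uri_file"
        return resolve_file_or_files(inner, value)
    raise ValueError(f"unknown kind: {kind!r}")
```

Note that `uri_path` is the only form that has to inspect the target before deciding how to expand it; the other three carry that decision in the field name.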
Closes #12625 from ArianaVillegas/ARROW-15587
Lead-authored-by: Ariana Villegas <ariana.villegas@utec.edu.pe>
Co-authored-by: ArianaVillegas <40250321+ArianaVillegas@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Jacob Wujciak-Jens [Tue, 10 May 2022 13:58:04 +0000 (14:58 +0100)]
ARROW-16489: [R] wrong encoding causes parsing error
Closes #13082 from assignUser/ARROW-16489-fix-encoding
Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
karldw [Tue, 10 May 2022 13:24:54 +0000 (09:24 -0400)]
ARROW-16297: [R] Improve detection of ARROW_*_URL variables for offline build
As Neal mentioned in https://github.com/apache/arrow/pull/12849#issuecomment-1101489333, the current code in nixlibs.R doesn't handle URL variable names whose components have multiple words (because of the way it parses variable names from filenames). Until now, we've had a special case for the AWS variables, but `ARROW_GOOGLE_CLOUD_CPP_URL` and `ARROW_NLOHMANN_JSON_URL` also need handling. Instead of adding special cases, we can provide the correct `ARROW_*_URL` values with the new bash script added as part of ARROW-15092 (in PR #12849).
Please let me know what you think!
Closes #12973 from karldw/fix-16297
Lead-authored-by: karldw <karldw@users.noreply.github.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sutou Kouhei [Tue, 10 May 2022 08:15:35 +0000 (10:15 +0200)]
ARROW-16501: [Docs][C++][R] Migrate to Matomo from Google Analytics
It's for becoming compliant with the GDPR (the European Union's
General Data Protection Regulation).
See also:
* https://privacy.apache.org/policies/privacy-policy-public.html
* https://privacy.apache.org/faq/committers.html
Closes #13096 from kou/docs-matomo
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sutou Kouhei [Tue, 10 May 2022 07:07:17 +0000 (16:07 +0900)]
ARROW-16500: [Release][R] Don't use GNU sed extension for r/NEWS.md update
We should not use the '0,/.../' syntax for updating r/NEWS.md because it's a
GNU sed extension. If we use it, our release script doesn't work on
macOS, which uses BSD sed.
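The intent of the '0,/.../' address is to touch only the first matching line. As a portable illustration of that "first occurrence only" behaviour, here is a small Python sketch. It is not the replacement used in the release script (which is not shown here); the heading pattern and the helper name are assumptions.

```python
import re


def replace_first_heading(text, new_heading):
    """Replace only the first line that starts with '# arrow '."""
    # count=1 stops after the first match; a callable replacement avoids
    # any backslash interpretation in new_heading.
    return re.sub(
        r"^# arrow .*$",
        lambda _match: new_heading,
        text,
        count=1,
        flags=re.MULTILINE,
    )
```

Older headings further down the file are left untouched, which is the property the sed expression was relied on for.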
Closes #13095 from kou/release-r-news-update
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Matthew Topol [Tue, 10 May 2022 06:44:52 +0000 (15:44 +0900)]
ARROW-16484: [Go][Parquet] Update parquet writer version
This updates the default `Created By` string of the Go Parquet Writer, and updates the release script to correctly update that string with releases in the future.
Closes #13103 from zeroshade/arrow-16484-parq-writerversion
Lead-authored-by: Matthew Topol <mtopol@factset.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sutou Kouhei [Tue, 10 May 2022 06:42:40 +0000 (15:42 +0900)]
ARROW-16487: [C++][Parquet] Fix parquet::Statistics::Equals() with minmax
The following cases return the wrong result:
* statistics_no_minmax.Equals(statistics_no_minmax):
This must return true but false is returned.
* statistics_minmax.Equals(statistics_minmax):
This must return true but false is returned.
* statistics_minmax1.Equals(statistics_minmax2) where
statistics_minmax1 and statistics_minmax2 have different minmax:
This must return false but true is returned.
Note that parquet::Statistics::Equals() was introduced by
https://github.com/apache/arrow/pull/8507 .
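The expected semantics for the three cases can be written down as a short Python sketch. This is a simplified model, not the `parquet::Statistics` implementation: the `Stats` class and its fields are hypothetical, and a real comparison would likely cover more than min and max.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stats:
    """Hypothetical stand-in for column statistics with optional min/max."""

    has_min_max: bool
    min: Optional[int] = None
    max: Optional[int] = None

    def equals(self, other):
        # Presence of min/max must agree before the values are compared.
        if self.has_min_max != other.has_min_max:
            return False
        if not self.has_min_max:
            return True
        return self.min == other.min and self.max == other.max
```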
Closes #13087 from kou/cpp-parquet-statistics-equal
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Yaron Gvili [Tue, 10 May 2022 01:52:16 +0000 (15:52 -1000)]
ARROW-16426: [C++] Add TeeNode to execution engine
The existing write node is a consuming one while the proposed tee node is a pass-through one.
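The consuming versus pass-through distinction can be sketched with plain Python iterables. This is an analogy only; `write_node` and `tee_node` are hypothetical functions and do not reflect the execution engine's node API.

```python
def write_node(batches, sink):
    """Consuming: writes every batch and produces nothing downstream."""
    for batch in batches:
        sink.append(batch)


def tee_node(batches, sink):
    """Pass-through: writes every batch and also forwards it downstream."""
    for batch in batches:
        sink.append(batch)
        yield batch
```

With the tee variant, further nodes can still be chained after the write, for example `list(tee_node(source, sink))`.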
Closes #13040 from rtpsw/ARROW-16426
Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Neal Richardson [Mon, 9 May 2022 19:48:22 +0000 (15:48 -0400)]
ARROW-16511: [R] Preserve schema metadata in write_dataset()
Closes #13105 from nealrichardson/write-dataset-metadata
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Dragoș Moldovan-Grünfeld [Mon, 9 May 2022 19:30:19 +0000 (14:30 -0500)]
ARROW-14848: [R] Implement bindings for lubridate's parse_date_time
This PR adds a partial implementation of `parse_date_time()`:
* only parses the year, month, and date components (no hours, minutes and seconds yet)
* does not support parsing of strings without separators (e.g. `"220912"` to `2022-09-12`)
* `lubridate::parse_date_time()` infers the most likely `format` given `orders` (via `guess_formats()`), while the Arrow binding does not do any inference.
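As a rough illustration of parsing without inference, the Python sketch below only tries the formats the caller supplies, in order, and never guesses one from the data. It is not how the Arrow binding maps `orders` to formats; `parse_date_with_formats` is a hypothetical helper.

```python
from datetime import date, datetime


def parse_date_with_formats(text, formats):
    """Try each explicit strptime format in order; return None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

A string without separators, such as `"220912"`, returns `None` here unless the caller passes a format that describes it.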
Closes #12589 from dragosmg/parse_date_time
Lead-authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>
Neal Richardson [Mon, 9 May 2022 19:11:12 +0000 (15:11 -0400)]
MINOR: [R] Move tzdb loading out of .onLoad() to avoid a check NOTE
`R CMD check` now raises a NOTE after my previous fix (f49fbda3dffeaead7d192ec64bfd2a7cfc4172a3):
```
* checking R code for possible problems ... NOTE
File ‘arrow/R/arrow-package.R’:
.onLoad calls:
packageStartupMessage("The tzdb package is not installed. Timezones will not be available.")
See section ‘Good practice’ in '?.onAttach'.
```
Interestingly, the docs they point to say to use `packageStartupMessage()` like we are doing here. In any case, if we move it to a function that `.onLoad()` calls rather than having it directly in .onLoad, `check` doesn't find it 🤷
Closes #13104 from nealrichardson/tzdb-msg-2
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Will Jones [Mon, 9 May 2022 17:35:46 +0000 (13:35 -0400)]
ARROW-16085: [C++][R] InMemoryDataset::ReplaceSchema does not alter scan output
Feels a little funny that deleting this code just makes it work, so I added a decent number of tests to make sure differing schemas are handled. LMK if you think I missed something.
Closes #13088 from wjones127/ARROW-16085-unify-inmemory-datasets
Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
alexdesiqueira [Mon, 9 May 2022 13:14:51 +0000 (15:14 +0200)]
MINOR: [Python][Docs] Improving sentence on docs:python/parquet
Just a small improvement on `docs/source/python/parquet.rst`:
> _We need not use_
becomes
> _We do not need to use_
Closes #13093 from alexdesiqueira/small_wording-python_parquet
Authored-by: alexdesiqueira <alex.desiqueira@igdore.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sutou Kouhei [Mon, 9 May 2022 02:44:27 +0000 (11:44 +0900)]
ARROW-16499: [Release][Ruby] Add missing export
We need to export GEM_HOST_OTP_CODE so that "gem push" can use it.
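The underlying rule is that a child process only sees variables that are part of the environment it is started with. The Python sketch below demonstrates this with a child Python process rather than `gem push`; `child_sees` is a hypothetical helper and the OTP value is a placeholder.

```python
import os
import subprocess
import sys


def child_sees(name, env):
    """Return True if a child process started with `env` has `name` set."""
    code = f"import os, sys; sys.exit(0 if {name!r} in os.environ else 1)"
    return subprocess.run([sys.executable, "-c", code], env=env).returncode == 0


# Start from the current environment, minus the variable of interest.
base_env = {k: v for k, v in os.environ.items() if k != "GEM_HOST_OTP_CODE"}
# Analogous to an exported variable: it is part of the child's environment.
exported_env = {**base_env, "GEM_HOST_OTP_CODE": "000000"}
```

`child_sees("GEM_HOST_OTP_CODE", base_env)` is false and `child_sees("GEM_HOST_OTP_CODE", exported_env)` is true, which mirrors the difference between an unexported and an exported shell variable.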
Closes #13092 from kou/release-ruby-missing-export
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sutou Kouhei [Mon, 9 May 2022 02:24:30 +0000 (11:24 +0900)]
ARROW-16497: [R] Update version in NEWS.md
This solves the following CI failure:
https://github.com/apache/arrow/runs/6329293119?check_suite_focus=true
Failure: test_version_pre_tag(PrepareTest)
/home/runner/work/arrow/arrow/dev/release/01-prepare-test.rb:226:in `test_version_pre_tag'
223: end
224:
225: prepare("VERSION_PRE_TAG")
=> 226: assert_equal(expected_changes.sort_by {|diff| diff[:path]},
227: parse_patch(git("log", "-n", "1", "-p")))
228: end
229: end
...
? {:hunks=>[["-# arrow 8.0.0.9000", "+# arrow 9.0.0"]], :path=>"r/NEWS.md"},
? 7
Closes #13089 from kou/r-news-update
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sutou Kouhei [Mon, 9 May 2022 01:48:02 +0000 (10:48 +0900)]
ARROW-15671: [GLib] Add support for Vala
Closes #12993 from kou/glib-vapi
Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Tao Zuhong <taozuhong@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sutou Kouhei [Sat, 7 May 2022 20:17:05 +0000 (05:17 +0900)]
ARROW-16474: [C++][Packaging] Require Python 3.7 or later
Closes #13079 from kou/packaging-linux-drop-python-3.6-support
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sutou Kouhei [Fri, 6 May 2022 22:33:51 +0000 (07:33 +0900)]
ARROW-16228: [CI][Packaging][Conan] Add a job to test minimum build
ci/conan/ is based on
https://github.com/conan-io/conan-center-index/tree/master/recipes/arrow/ .
We'll update ci/conan/ to follow the latest cpp/ changes such as new
build options and send a pull request with the updates to
https://github.com/conan-io/conan-center-index/ when we release a new
version.
Closes #12918 from kou/ci-conan
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Krisztián Szűcs [Fri, 6 May 2022 21:05:25 +0000 (23:05 +0200)]
[Release] Update .deb/.rpm changelogs for 8.0.0
Krisztián Szűcs [Fri, 6 May 2022 21:05:24 +0000 (23:05 +0200)]
[Release] Update .deb package names for 9.0.0
Krisztián Szűcs [Fri, 6 May 2022 21:05:24 +0000 (23:05 +0200)]
[Release] Update versions for 9.0.0-SNAPSHOT
Jacob Wujciak-Jens [Fri, 6 May 2022 19:56:31 +0000 (04:56 +0900)]
ARROW-16490: [C++][Windows] Don't force to use bundled GoogleTest
It seems that conda-forge's GoogleTest package for Windows solved the problem we encountered.
We can use GoogleTest installed by vcpkg by removing the workaround.
Closes #13083 from assignUser/ARROW-16490-fix-gtest
Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Raúl Cumplido [Fri, 6 May 2022 19:46:42 +0000 (04:46 +0900)]
ARROW-16494: [C++] Add missing include that is making some packaging jobs fail
This PR tries to fix the issue that has been raised on some packaging builds like:
- [conda-linux-gcc-py310-cuda](https://github.com/ursacomputing/crossbow/runs/6321523101)
- [conda-linux-gcc-py310-ppc64le](https://github.com/ursacomputing/crossbow/runs/6321445819)
- [conda-linux-gcc-py37-arm64](https://github.com/ursacomputing/crossbow/runs/6322824420)
- [conda-linux-gcc-py37-cpu-r40](https://github.com/ursacomputing/crossbow/runs/6321760095)
Closes #13084 from raulcd/ARROW-16494
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Neal Richardson [Fri, 6 May 2022 19:17:23 +0000 (15:17 -0400)]
MINOR: [R] Don't use warning() in .onLoad()
According to the [docs](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ns-hooks.html), `packageStartupMessage()` is preferred. With `warning()`, it can show up in the `R CMD INSTALL` output like this:
```
installing to D:/a/_temp/Library/00LOCK-arrow/00new/arrow/libs/x64
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Warning in fun(libname, pkgname) :
The tzdb package is not installed. Timezones will not be available.
** testing if installed package can be loaded from final location
Warning in fun(libname, pkgname) :
** testing if installed package keeps a record of temporary installation path
The tzdb package is not installed. Timezones will not be available.
* MD5 sums
packaged installation of 'arrow' as arrow_7.0.0.20220505.zip
* DONE (arrow)
```
I will cherry-pick when preparing the 8.0.0 CRAN submission
Closes #13085 from nealrichardson/load-warning
Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Raúl Cumplido [Fri, 6 May 2022 18:23:55 +0000 (20:23 +0200)]
ARROW-16488: [Archery][Dev] Allow extra message to be sent on chat report
This PR allows an extra CLI argument `--extra-message` to be used when sending a chat report in order to append content to the message being sent.
Closes #13081 from raulcd/ARROW-16488
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Raúl Cumplido [Fri, 6 May 2022 18:20:32 +0000 (20:20 +0200)]
ARROW-16448: [CI][Archery] Refactor EmailReport to be a JinjaReport
This PR moves the current `EmailReport` used for the Nightly report emails to a `JinjaReport` and adds a test for the report generation.
I have also tested sending an email locally, and it succeeded.
Closes #13074 from raulcd/ARROW-16448
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Raúl Cumplido [Thu, 5 May 2022 21:26:01 +0000 (06:26 +0900)]
ARROW-16327: [Java][CI] Add Java 17 to CI matrix for java workflows
This PR aims to add support for Java 17 in CI, as required by https://github.com/apache/arrow/pull/12941
We should probably cherry-pick this commit onto that PR.
Closes #13021 from raulcd/ARROW-16327
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Alenka Frim [Thu, 5 May 2022 15:50:32 +0000 (17:50 +0200)]
ARROW-16241: [Python] Suppress warnings in tests when using use_legacy_dataset=True
Closes #12954 from AlenkaF/ARROW-16241
Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Mark Wolfe [Thu, 5 May 2022 13:57:51 +0000 (09:57 -0400)]
ARROW-16450: [Go][Docs] Include error handling in csv examples
As in the tests, I have added error checks to the examples, as their absence tripped me up while dealing with nulls in my CSVs.
Closes #13059 from wolfeidau/arrow-16450-csv-example-godoc
Authored-by: Mark Wolfe <mark@wolfe.id.au>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Antoine Pitrou [Thu, 5 May 2022 13:43:17 +0000 (15:43 +0200)]
ARROW-16461: [C++] Fix sporadic Thread Sanitizer failure
In debug mode the `ThreadedTaskGroup::finished_` member would be read unlocked, so make it atomic.
Note that the failure should be harmless, but still deserves fixing to improve CI reliability.
Closes #13067 from pitrou/ARROW-16461-atomic-finished
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Maya Anderson [Thu, 5 May 2022 12:31:32 +0000 (14:31 +0200)]
ARROW-14114: [C++][Parquet] Fix multi-threaded read of PME files
Change AesDecryptor to be per Decryptor, instead of shared.
This solves the problem of reading with PME using multiple threads.
Details:
It was discovered when exposing high-level PME in PyArrow that reading an encrypted parquet file in PyArrow intermittently fails decryption finalization and sometimes fails with a segmentation fault. The same happens in C++ when reading an encrypted parquet file with FileReader.ReadTable() in multithreaded mode (with set_use_threads(true)).
The current implementation uses two caches: meta_decryptor_ and data_decryptor_ , for AesDecryptors, and every Decryptor gets the same AesDecryptor with AesDecryptorImpl from this cache.
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() use ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which shouldn't be used from multiple threads concurrently.
So, instead of sharing the same AesDecryptor between all Decryptors, an AesDecryptor will be created per Decryptor, which is per column.
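The ownership change can be sketched in Python as follows. This is a structural illustration only: `CipherContext` is a hypothetical stand-in that does no real cryptography, and the class and function names are not taken from the Parquet code.

```python
import threading


class CipherContext:
    """Hypothetical stateful context; unsafe to use from several threads."""

    def __init__(self):
        self._chunks = []

    def update(self, chunk):
        self._chunks.append(chunk)

    def finalize(self):
        out = b"".join(self._chunks)
        self._chunks = []
        return out


class Decryptor:
    def __init__(self):
        # Each decryptor owns its context instead of taking one from a shared cache.
        self._ctx = CipherContext()

    def decrypt(self, chunks):
        for chunk in chunks:
            self._ctx.update(chunk)
        return self._ctx.finalize()


def decrypt_columns(columns):
    """Process each column on its own thread with its own Decryptor."""
    results = {}

    def work(name, chunks):
        results[name] = Decryptor().decrypt(chunks)

    threads = [threading.Thread(target=work, args=item) for item in columns.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because no context is shared, the state of one column's decryption cannot be interleaved with another's, whatever the thread scheduling.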
Co-authored-by: Gidon Gershinsky <ggershinsky@apple.com>
CC @thamht4190 @pitrou @revit13
Closes #12778 from andersonm-ibm/multithreaded_read
Lead-authored-by: Maya Anderson <mayaa@il.ibm.com>
Co-authored-by: Gidon Gershinsky <ggershinsky@apple.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Raúl Cumplido [Thu, 5 May 2022 12:17:15 +0000 (14:17 +0200)]
ARROW-16458: [CI][Python] Run dask S3 tests on nightly integration
This PR adds coverage for running dask parquet tests that use S3 filesystem.
Closes #13071 from raulcd/ARROW-16458
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
SHIMA Tatsuya [Thu, 5 May 2022 10:23:08 +0000 (10:23 +0000)]
MINOR: [R][Doc] encoding setting of `read_csv_arrow`
In the documentation, the `encoding` argument was listed under `CsvConvertOptions`, but it actually belongs to `CsvReadOptions`.
I also corrected some formatting for automatic linking in the doc.
Closes #13038 from eitsupi/read-option-doc
Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Dragoș Moldovan-Grünfeld [Thu, 5 May 2022 10:20:46 +0000 (10:20 +0000)]
ARROW-16445: [R] [Doc] Add a short summary for the Installing the Arrow package on Linux article
Happy to rephrase if the wording is too colloquial.
Closes #13056 from dragosmg/intro_to_install_on_linux
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Phillip LeBlanc [Wed, 4 May 2022 16:38:26 +0000 (12:38 -0400)]
ARROW-16456: [Go] Fix RecordBuilder UnmarshalJSON when extra fields are present
If RecordBuilder.UnmarshalJSON encountered an unknown field in the JSON document, it wouldn't consume the next JSON token representing the value.
Fix this by manually calling `dec.Token()` and allowing the next call to `dec.Token()` to return the following key.
Note that this only handles the case where the ignored value is a simple type, and not a nested array or object. I'm not sure what the correct fix for that would be.
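The token-skipping idea can be modelled in Python with a flat stream of alternating key and value tokens. This is a simplified sketch in Python rather than Go, `read_known_fields` is a hypothetical helper, and like the fix described above it only covers scalar values, not nested arrays or objects.

```python
def read_known_fields(tokens, known):
    """Read alternating key/value tokens, keeping only keys in `known`."""
    it = iter(tokens)
    record = {}
    for key in it:
        # Always consume the value token, even when the key is unknown.
        # Skipping this step would make the value be read as the next key.
        value = next(it)
        if key in known:
            record[key] = value
    return record
```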
Closes #13065 from phillipleblanc/phillip/unmarshal-json-bug
Authored-by: Phillip LeBlanc <phillip@leblanc.tech>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Vibhatha Abeykoon [Wed, 4 May 2022 16:10:16 +0000 (18:10 +0200)]
MINOR: [C++] Trivial improvements to execution plan example
This is a minor PR including a fix for the streaming execution plan code. An issue was raised in a PR currently under review, and this change applies the suggested fix to this example.
cc @pitrou
Closes #12767 from vibhatha/minor-cpp-example-improvement
Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Matthew Topol [Wed, 4 May 2022 13:59:25 +0000 (09:59 -0400)]
ARROW-16441: [Go][Flight][Java] Update flight integration test to wait for io.EOF after DoPut
Closes #13058 from zeroshade/arrow-16441-integration
Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Raúl Cumplido [Wed, 4 May 2022 13:49:52 +0000 (15:49 +0200)]
ARROW-16436: [C++][Python] Datasets should not ignore CSV autogenerate_column_names
The added test failed previously because the `autogenerate_column_names` was ignored:
```
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of/pytest-15/test_csv_format_options_genera1/test.csv': Could not open CSV input source '/tmp/pytest-of/pytest-15/test_csv_format_options_genera1/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
```
Use the same approach we use on `GenerateColumnNames` here https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/reader.cc#L637-L646
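The effect of the option can be illustrated with a small Python sketch using only the standard `csv` module. It is not Arrow's reader: `read_csv_autonames` is a hypothetical helper, and the `f0`, `f1`, ... naming is an assumption about the generated names.

```python
import csv
import io


def read_csv_autonames(text):
    """Treat every row as data and generate positional column names."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    # Names depend only on the column position, so they cannot collide.
    names = [f"f{i}" for i in range(len(rows[0]))]
    return names, rows
```

For the input `"1,1\n2,3\n"`, reading the first row as a header would yield two columns both named `1`, which matches the error quoted above; generating names avoids that and keeps the first row as data.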
Closes #13064 from raulcd/ARROW-16436
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Dragoș Moldovan-Grünfeld [Wed, 4 May 2022 13:39:36 +0000 (13:39 +0000)]
ARROW-16255: [R] Reorganise the datetime bindings
The purpose of this PR is to reorganise the datetime bindings.
Why?
* some are in files where one wouldn't think to look (e.g. in `R/dplyr-funcs-type.R`)
* there are a bunch of somewhat scattered helper functions
* some of the `register_bindings_...()` functions are too complex and trigger the cyclocomp lint
What?
* create a separate file for the datetime helpers, called `R/dplyr-datetime-helpers.R`
* all bindings are in `dplyr-funcs-datetime.R` (with the exception of `leap_year`, which was moved to `expressions.R`)
* all tests are in `test-dplyr-funcs-datetime.R`
Results
* cyclomatic complexity for the `dplyr-funcs-datetime.R` reduced to 21 (from 26)
<details><summary>► More details</summary>
<p>
| Binding | Old registering function | New registering function |
|--- |--- |--- |
| `strptime` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `strftime` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `format_ISO8601`| `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `second` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `wday` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `week` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `month` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `is.Date` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `is.instant` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `is.timepoint` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `is.POSIXct` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `leap_year` | `register_bindings_datetime()` | |
| `am` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `pm` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `tz` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `semester` | `register_bindings_datetime()` | `register_bindings_datetime_components()` |
| `date` | `register_bindings_datetime()` | `register_bindings_datetime_utility()` |
| `make_datetime` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `make_date` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `ISOdatetime` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `ISOdate` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `difftime` | `register_bindings_duration()` | `register_bindings_duration()` |
| `as.difftime` | `register_bindings_duration()` | `register_bindings_duration()` |
| `decimal_date` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `date_decimal` | `register_bindings_duration()` | `register_bindings_datetime_conversion()` |
| `duration_helpers_map_factory` | `register_bindings_duration_helpers()` | `register_bindings_duration_helpers()` |
| `dpicoseconds` | `register_bindings_duration_helpers()` | `register_bindings_duration_helpers()` |
| `make_difftime` | `register_bindings_difftime_constructors()` | `register_bindings_duration_constructor()` |
| `as.Date` | `register_bindings_type_cast()` | `register_bindings_datetime_conversion()` |
| `as_date` | `register_bindings_type_cast()` | `register_bindings_datetime_conversion()` |
| `as_datetime` | `register_bindings_type_cast()` | `register_bindings_datetime_conversion()` |
</p>
</details>
Closes #13029 from dragosmg/datetime_bindings_reorg
Authored-by: Dragoș Moldovan-Grünfeld <dragos.mold@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
David Li [Wed, 4 May 2022 13:19:16 +0000 (15:19 +0200)]
ARROW-16116: [C++] Handle non-nullable fields when reading Parquet
Closes #12829 from lidavidm/arrow-16116
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Raúl Cumplido [Wed, 4 May 2022 13:00:16 +0000 (15:00 +0200)]
ARROW-16455: [CI][Packaging] Add linux-ppc64le to the list of platforms to clean on conda
This PR aims to fix the issue:
```
"[ERROR] ('Storage requirements exceeded (3221225472 bytes). Payment is required to add a file. Please go to https://anaconda.org/binstar.settings/billing to update your plan', 402)"
```
Closes #13066 from raulcd/ARROW-16455
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Krisztián Szűcs [Wed, 4 May 2022 12:06:39 +0000 (08:06 -0400)]
MINOR: [Release] Use CONDA_ENV instead of VENV_ENV when creating conda environments
Closes #13061 from kszucs/conda_env
Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Jacob Wujciak-Jens [Wed, 4 May 2022 09:30:56 +0000 (11:30 +0200)]
ARROW-16335: [Release][C++] Windows source verification runs C++ tests on a single thread
Closes #13054 from assignUser/ARROW-16335-verify-mc
Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Raúl Cumplido [Wed, 4 May 2022 09:29:50 +0000 (11:29 +0200)]
ARROW-16357: [Archery][Dev] Add possibility to send nightly reports to Zulip/Slack
This PR adds the possibility to send the reports via a webhook to Zulip / Slack.
Usage example:
```
$ archery crossbow -t $GITHUB_TOKEN report-chat --webhook $SLACK_OR_ZULIP_WEBHOOK_URL --send --no-fetch $JOB_NAME
```
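For readers unfamiliar with incoming webhooks, the Python sketch below builds (but does not send) the kind of HTTP request such a report would result in. It is not Archery's implementation: `build_webhook_request` is a hypothetical helper, and the `{"text": ...}` payload follows the Slack incoming-webhook convention, which is an assumption here.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, message):
    """Build a POST request carrying `message` as a JSON text payload."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually deliver the message; keeping construction separate makes the payload easy to inspect in a test.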
The integration has been tested in both Zulip and Slack.
Closes #13031 from raulcd/ARROW-16357
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Krisztián Szűcs [Tue, 3 May 2022 21:31:52 +0000 (06:31 +0900)]
MINOR: [Release] Don't install go if it can be located
Closes #13048 from kszucs/dont-install-go
Lead-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Will Jones [Tue, 3 May 2022 19:30:27 +0000 (15:30 -0400)]
ARROW-16276: [R] Arrow 8.0 News
Let me know if I've missed anything important in this release!
Closes #13005 from wjones127/ARROW-16276-r-news-8
Lead-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>