Andrew Lamb [Thu, 8 Jul 2021 15:58:24 +0000 (11:58 -0400)]
Update .asf.yaml (#2)
Jorge C. Leitao [Sat, 19 Jun 2021 05:40:27 +0000 (05:40 +0000)]
Removed all.
This allows a project to be started from scratch without losing
contribution stats and similar history.
Jorge C. Leitao [Sat, 19 Jun 2021 05:35:54 +0000 (05:35 +0000)]
Removed tooling related to arrow crate.
Jorge C. Leitao [Sat, 19 Jun 2021 05:22:12 +0000 (05:22 +0000)]
Kept parquet
Gary Pennington [Wed, 16 Jun 2021 16:37:02 +0000 (17:37 +0100)]
parquet: improve BOOLEAN writing logic and report error on encoding fail (#443)
* improve BOOLEAN writing logic and report error on encoding fail
When writing BOOLEAN data, writing more than 2048 rows of data will
overflow the hard-coded 256-byte buffer set for the bit-writer in the
PlainEncoder. Once this occurs, further attempts to write to the encoder
fail because capacity is exceeded, but the errors are silently ignored.
This fix improves the error detection and reporting at the point of
encoding and modifies the logic for bit_writing (BOOLEANS). The
bit_writer is initially allocated 256 bytes (as at present), then each
time the capacity is exceeded the capacity is incremented by another
256 bytes.
This certainly resolves the current problem, but it's not exactly a
great fix because the capacity of the bit_writer could now grow
substantially.
Other data types seem to have a more sophisticated mechanism for writing
data which doesn't involve growing or having a fixed size buffer. It
would be desirable to make the BOOLEAN type use this same mechanism if
possible, but that level of change is more intrusive and probably
requires greater knowledge of the implementation than I possess.
resolves: #349
* only manipulate the bit_writer for BOOLEAN data
Tacky, but I can't think of a better way to do this without
specialization.
* better isolation of changes
Remove the byte tracking from the PlainEncoder and use the existing
bytes_written() method in BitWriter.
This is neater.
* add test for boolean writer
The test ensures that we can write > 2048 rows to a parquet file and
that when we read the data back, it finishes without hanging (defined as
taking < 5 seconds).
If we don't want that extra complexity, we could remove the
thread/channel stuff and just try to read the file and let the test
runner terminate hanging tests.
* fix capacity calculation error in bool encoding
The values.len() reports the number of values to be encoded and so must
be divided by 8 (bits in a byte) to determine the effect on the byte
capacity of the bit_writer.
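
The growth rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the parquet crate's code: `GrowableBitWriter`, `GROWTH_STEP_BYTES`, and every method name are hypothetical, and rounding the byte count up (rather than a plain division by 8) is an assumption made so that partial bytes are covered.

```rust
/// Growth increment in bytes (the commit message mentions 256).
const GROWTH_STEP_BYTES: usize = 256;

/// Hypothetical stand-in for a bit writer; not the parquet crate's BitWriter.
pub struct GrowableBitWriter {
    buffer: Vec<u8>,     // backing storage, grown in GROWTH_STEP_BYTES steps
    bits_written: usize, // number of boolean values stored so far
}

impl GrowableBitWriter {
    pub fn new() -> Self {
        Self { buffer: vec![0; GROWTH_STEP_BYTES], bits_written: 0 }
    }

    /// Bytes needed for `num_values` booleans at one bit each, rounded up.
    pub fn bytes_for(num_values: usize) -> usize {
        (num_values + 7) / 8
    }

    pub fn capacity(&self) -> usize {
        self.buffer.len()
    }

    pub fn bytes_written(&self) -> usize {
        Self::bytes_for(self.bits_written)
    }

    /// Append booleans, extending the buffer by another step whenever the
    /// required byte count exceeds the current capacity.
    pub fn put_bools(&mut self, values: &[bool]) {
        let needed = Self::bytes_for(self.bits_written + values.len());
        while self.buffer.len() < needed {
            let new_len = self.buffer.len() + GROWTH_STEP_BYTES;
            self.buffer.resize(new_len, 0);
        }
        for &v in values {
            if v {
                self.buffer[self.bits_written / 8] |= 1 << (self.bits_written % 8);
            }
            self.bits_written += 1;
        }
    }

    /// Read back a previously written value, or None if out of range.
    pub fn get(&self, index: usize) -> Option<bool> {
        if index >= self.bits_written {
            return None;
        }
        Some(self.buffer[index / 8] & (1 << (index % 8)) != 0)
    }
}
```

In this sketch 2048 values fill the initial 256 bytes exactly, and the next value triggers one 256-byte extension, which matches the threshold quoted above.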
Krisztián Szűcs [Wed, 16 Jun 2021 04:39:19 +0000 (06:39 +0200)]
Unvendor Archery (#459)
Navin [Mon, 14 Jun 2021 13:40:39 +0000 (23:40 +1000)]
Doctests for DecimalArray. (#414)
* Doctests for DecimalArray.
* fixup! Doctests for DecimalArray.
* fixup! fixup! Doctests for DecimalArray.
Jiayu Liu [Sun, 13 Jun 2021 10:30:51 +0000 (18:30 +0800)]
use iterator for partition kernel implementation (#438)
Andrew Lamb [Sun, 13 Jun 2021 10:24:14 +0000 (06:24 -0400)]
Update docs + email template (#450)
Laurent Mazare [Sun, 13 Jun 2021 00:22:38 +0000 (08:22 +0800)]
Implement the Iterator trait for the json Reader. (#451)
* Implement the Iterator trait for the json Reader.
* Use transpose.
Ádám Lippai [Sun, 13 Jun 2021 00:20:08 +0000 (02:20 +0200)]
Add Decimal to CsvWriter and improve debug display (#406)
* Add Decimal to CsvWriter and improve debug display
* Measure CSV writer instead of file and data creation
* Re-use decimal formatting
Jiayu Liu [Sun, 13 Jun 2021 00:00:35 +0000 (08:00 +0800)]
remove unnecessary wraps in sortk (#445)
Jiayu Liu [Sat, 12 Jun 2021 12:59:35 +0000 (20:59 +0800)]
remove clippy unnecessary wraps (#449)
Jörn Horstmann [Sat, 12 Jun 2021 12:46:27 +0000 (14:46 +0200)]
Remove DictionaryArray::keys_array method and replace usages by the keys method (#419)
Yordan Pavlov [Thu, 10 Jun 2021 22:10:53 +0000 (23:10 +0100)]
Implement faster arrow array reader (#384)
* implement ArrowArrayReader
* change StringArrayConverter to use push_unchecked for offsets
* add ASF license header to new files
* fix clippy issues
* cleanup arrow_array_reader benches
* cleanup arrow_array_reader
* change util module to limit public exports from test_common sub-module
* fix rustfmt issues
Jiayu Liu [Wed, 9 Jun 2021 18:16:42 +0000 (02:16 +0800)]
refactor lexico sort (#424)
Andrew Lamb [Wed, 9 Jun 2021 18:12:17 +0000 (14:12 -0400)]
Update release readme.md (#436)
Don't start search on page 2, make link nicer looking
Andrew Lamb [Wed, 9 Jun 2021 18:11:38 +0000 (14:11 -0400)]
Reenable MIRI check (#421)
Jiayu Liu [Tue, 8 Jun 2021 21:54:46 +0000 (05:54 +0800)]
window::shift to work for all array types (#388)
* add more doc test for window::shift
* use Ok(make_array(array.data_ref().clone()))
* shift array for not only primitive cases
* include more test cases
* add back copied
* fix renaming
Ritchie Vink [Tue, 8 Jun 2021 21:16:18 +0000 (23:16 +0200)]
make sure that only concat preallocates buffers (#382)
* MutableArrayData::with_capacities
* better pattern matching
* add binary capacities
* add list child data
* add struct capacities
* add panic for dictionary type
* change dictionary capacity enum variant
Jiayu Liu [Tue, 8 Jun 2021 21:09:44 +0000 (05:09 +0800)]
refactor lexico sort (#423)
Michael Edwards [Tue, 8 Jun 2021 20:55:17 +0000 (22:55 +0200)]
Sort by float lists (#420)
Jörn Horstmann [Tue, 8 Jun 2021 20:51:01 +0000 (22:51 +0200)]
Fix bug with null buffer offset in boolean not kernel (#418)
Raphael Taylor-Davies [Tue, 8 Jun 2021 17:27:30 +0000 (18:27 +0100)]
Derive Eq and PartialEq for SortOptions (#425)
Jörn Horstmann [Tue, 8 Jun 2021 15:34:39 +0000 (17:34 +0200)]
Fix out of bounds read in bit chunk iterator (#416)
* Fix out of bounds read in bit chunk iterator
* Add comment why reading one additional byte is enough
Boaz [Tue, 8 Jun 2021 07:13:39 +0000 (10:13 +0300)]
Add set_bit to BooleanBufferBuilder to allow mutating bit in index (#383)
* Add set_bit to BooleanBufferBuilder to allow mutating bits in the builder
* Fix tests
* Update builder.rs
* Update builder.rs
* Fix clippy failures
Co-authored-by: Boaz Berman <boaz@codota.com>
Andrew Lamb [Sat, 5 Jun 2021 13:14:20 +0000 (09:14 -0400)]
Add labels to cherry pick scripts + writeup (#409)
* Add labels when cherry picking with script
* fixup
* document tags
* add note
* prettier
Jiayu Liu [Sat, 5 Jun 2021 05:01:58 +0000 (13:01 +0800)]
use prettier to auto format md files (#398)
Ádám Lippai [Sat, 5 Jun 2021 04:54:32 +0000 (06:54 +0200)]
MINOR: update install instruction (#400)
We release frequently and honor semver, so minor and patch version pinning was removed
Gang Liao [Fri, 4 Jun 2021 15:17:07 +0000 (08:17 -0700)]
Add (simd) modulus op (#317)
* Add (simd) modulus op
* fix typo
* fix feature = "simd"
* revert ModulusByZero
Jiayu Liu [Thu, 3 Jun 2021 21:54:00 +0000 (05:54 +0800)]
add more tests for window::shift and handle boundary cases (#386)
* add more doc test for window::shift
* handle i64::MIN first
* use Ok(make_array(array.data_ref().clone()))
Andrew Lamb [Thu, 3 Jun 2021 21:50:43 +0000 (17:50 -0400)]
Automatic cherry-pick script (#339)
* Automatic cherry-pick script
* switch from alamb to apache
* autopep8
* flake8
* add rat
* tweaks
* Add some docs to the README
Wakahisa [Wed, 2 Jun 2021 19:31:50 +0000 (21:31 +0200)]
Respect max rowgroup size in Arrow writer (#381)
* Respect max rowgroup size in Arrow writer
* simplify while loop
* address review feedback
Andrew Lamb [Sun, 30 May 2021 06:25:18 +0000 (02:25 -0400)]
Fix typo in release script, update release location (#380)
* Fix typo in release script
* release to `arrow-rs-{version}` directory
Ádám Lippai [Sat, 29 May 2021 11:02:21 +0000 (13:02 +0200)]
Add doctest for ArrayBuilder (#367)
Navin [Sat, 29 May 2021 10:54:13 +0000 (20:54 +1000)]
Doctests for FixedSizeBinaryArray (#378)
* Doctests for BooleanArray.
* Update arrow/src/array/array_boolean.rs
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Doctests for FixedSizeBinaryArray.
* Fix formatting.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Ritchie Vink [Sat, 29 May 2021 10:45:55 +0000 (12:45 +0200)]
Reduce memory usage of concat (large)utf8 (#348)
* reduce memory needed for concat
* reuse code for str allocation buffer
Daniël Heres [Sat, 29 May 2021 07:42:49 +0000 (09:42 +0200)]
Simplify window using null array (#370)
Co-authored-by: Daniel Heres <danielheres@MBP-van-Olaf.home>
Dominik Moritz [Thu, 27 May 2021 21:34:15 +0000 (14:34 -0700)]
Fix version in readme (#365)
Closes #364
Dominik Moritz [Wed, 26 May 2021 20:23:30 +0000 (13:23 -0700)]
Remove superfluous space (#363)
Raphael Taylor-Davies [Wed, 26 May 2021 20:22:50 +0000 (21:22 +0100)]
Only register Flight.proto with cargo if it exists (#351)
Dominik Moritz [Wed, 26 May 2021 20:20:04 +0000 (13:20 -0700)]
Add crate badges (#362)
* Add crate badges
* Format markdown
Ritchie Vink [Wed, 26 May 2021 20:07:19 +0000 (22:07 +0200)]
Fix filter UB and add fast path (#341)
* fix ub in filter record_batch
* filter fast path
* add all false fast path
* use new_empty_array
* rename filter kernel argument
rename argument: 'filter' to 'predicate'
to reduce name collisions.
Marco Neumann [Wed, 26 May 2021 20:00:25 +0000 (22:00 +0200)]
allow `SliceableCursor` to be constructed from an `Arc` directly (#369)
This is backwards-compatible since we change the argument from `Vec<u8>`
to `impl Into<Arc<Vec<u8>>>` and the following implementations exist in
std:
- `impl<T, U> Into<U> for T where U: From<T>` (reverse direction)
- `impl<T> From<T> for Arc<T>` (create `Arc` from any type)
Furthermore `Arc<Vec<u8>>` can be passed directly now because the following
implementation exists:
- `impl<T> From<T> for T` (identity)
Closes #368.
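
The compatibility argument above can be checked with a small sketch. `Cursor` here is a hypothetical stand-in, not the real `SliceableCursor`; only the constructor signature `impl Into<Arc<Vec<u8>>>` is taken from the commit message, and the helper methods are assumptions added for illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `SliceableCursor`; only the constructor shape matters.
pub struct Cursor {
    data: Arc<Vec<u8>>,
}

impl Cursor {
    /// Accepts a plain `Vec<u8>` (wrapped via `From<T> for Arc<T>`) as well as
    /// an existing `Arc<Vec<u8>>` (passed through via the identity `From<T> for T`).
    pub fn new(content: impl Into<Arc<Vec<u8>>>) -> Self {
        Self { data: content.into() }
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// True if this cursor points at the same allocation as `other`.
    pub fn shares_buffer_with(&self, other: &Arc<Vec<u8>>) -> bool {
        Arc::ptr_eq(&self.data, other)
    }
}
```

Existing callers that write `Cursor::new(vec![1, 2, 3])` keep compiling, while `Cursor::new(Arc::clone(&shared))` reuses the shared buffer without copying the bytes.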
Andrew Lamb [Wed, 26 May 2021 19:59:32 +0000 (15:59 -0400)]
Disable MIRI check until it runs cleanly on CI (#360)
Marco Neumann [Tue, 25 May 2021 22:03:01 +0000 (00:03 +0200)]
ensure null-counts are written for all-null columns (#307)
Fixes #306.
kazuhiko kikuchi [Tue, 25 May 2021 21:44:03 +0000 (06:44 +0900)]
allow to read non-standard CSV (#326)
* refactor Reader::from_reader
split into build_csv_reader, from_csv_reader
add escape, quote, terminator arg to build_csv_reader
* add escape,quote,terminator field to ReaderBuilder
schema inference support for non-standard CSV
add fn infer_file_schema_with_csv_options
add fn infer_reader_schema_with_csv_options
ReaderBuilder support for non-standard CSV
add escape, quote, terminator field
add fn with_escape, with_quote, with_terminator
change ReaderBuilder::build for non-standard CSV
* minimize API change
* add tests
add #[test] fn test_non_std_quote
add #[test] fn test_non_std_escape
add #[test] fn test_non_std_terminator
* apply cargo fmt
Navin [Mon, 24 May 2021 21:43:01 +0000 (07:43 +1000)]
Doctests for BooleanArray. (#338)
* Doctests for BooleanArray.
* Update arrow/src/array/array_boolean.rs
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Andrew Lamb [Mon, 24 May 2021 18:02:34 +0000 (14:02 -0400)]
Document and automate new release process (#299)
* Add Release README and Scripts to create and release tarballs
* Suggestions from Andy Grove
Ritchie Vink [Mon, 24 May 2021 12:44:27 +0000 (14:44 +0200)]
respect offset in utf8 and list casts (#335)
Ritchie Vink [Mon, 24 May 2021 12:43:07 +0000 (14:43 +0200)]
feature gate ipc reader/writer (#336)
Kornelijus Survila [Mon, 24 May 2021 01:00:42 +0000 (19:00 -0600)]
parquet: Speed up `BitReader`/`DeltaBitPackDecoder` (#325)
* parquet: Avoid temporary `BufferPtr`s in `BitReader`
From a quick test, this speeds up reading delta-packed int columns by
over 30%.
* parquet: Avoid some allocations in `DeltaBitPackDecoder`
From a quick test, it seems to decode around 10% faster overall.
Roee Shlomo [Sun, 23 May 2021 11:11:22 +0000 (14:11 +0300)]
Enable wasm32 as a target architecture for the SIMD feature (#324)
* Add wasm32 as target_arch for simd
Signed-off-by: roee88 <roee88@gmail.com>
* Allow wasm32 as a target arch for SIMD
Signed-off-by: roee88 <roee88@gmail.com>
Raphael Taylor-Davies [Sun, 23 May 2021 09:48:03 +0000 (10:48 +0100)]
fix comparison of dictionaries with different values arrays (#332) (#333)
Roee Shlomo [Sun, 23 May 2021 09:46:33 +0000 (12:46 +0300)]
Fix undefined behavior in FFI (#323)
- Fix UB due to aliasing
- Enable MIRI in CI for most tests in arrow crate
Signed-off-by: roee88 <roee88@gmail.com>
Wes McKinney [Sat, 22 May 2021 11:06:25 +0000 (06:06 -0500)]
Add ported Rust release verification script (#331)
* Add ported Rust release verification script
* Minor simplifications. (#1)
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
Raphael Taylor-Davies [Fri, 21 May 2021 18:30:23 +0000 (19:30 +0100)]
return reference from DictionaryArray::values() (#313) (#314)
Signed-off-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
Ritchie Vink [Fri, 21 May 2021 18:30:07 +0000 (20:30 +0200)]
feature gate csv functionality (#312)
* feature gate csv functionality
* mock read_csv example
* clippy
* mock read_csv_infer_schema example
* add tests of --no-default-features to CI
Ritchie Vink [Fri, 21 May 2021 12:31:17 +0000 (14:31 +0200)]
fix invalid null handling in filter (#296)
* fix invalid null handling in filter
* take offset into account
* remove incorrect UB warning
Navin [Thu, 20 May 2021 15:40:05 +0000 (01:40 +1000)]
Doctests for StringArray and LargeStringArray. (#330)
Ritchie Vink [Thu, 20 May 2021 15:30:03 +0000 (17:30 +0200)]
inline PrimitiveArray::value (#329)
Ritchie Vink [Wed, 19 May 2021 13:37:57 +0000 (15:37 +0200)]
Mutablebuffer::shrink_to_fit (#318)
* Mutablebuffer::shrink_to_fit
* add shrink_to_fit explicit test
Roee Shlomo [Mon, 17 May 2021 11:08:24 +0000 (14:08 +0300)]
Fix FFI and add support for Struct type (#287)
* fix: support nested types in FFI
Ported from https://github.com/jorgecarleitao/arrow2
Fix #20
Fix #251
Signed-off-by: roee88 <roee88@gmail.com>
* Removed Clone from FFI_ArrowArray
Signed-off-by: roee88 <roee88@gmail.com>
* Add nesting to FFI struct test
Signed-off-by: roee88 <roee88@gmail.com>
Jorge Leitao [Mon, 17 May 2021 11:07:27 +0000 (13:07 +0200)]
Added changelog generator script and configuration. (#289)
Max Meldrum [Mon, 17 May 2021 10:45:11 +0000 (12:45 +0200)]
Add Send to the ArrayBuilder trait (#291)
Daniël Heres [Mon, 17 May 2021 06:09:38 +0000 (08:09 +0200)]
Version upgrades (#304)
Andrew Lamb [Sun, 16 May 2021 11:36:39 +0000 (07:36 -0400)]
Remove old release scripts (#293)
* Remove old release scripts
* Add rat files back in
Wakahisa [Sat, 15 May 2021 07:21:39 +0000 (09:21 +0200)]
manually bump development version (#288)
Manish Gill [Fri, 14 May 2021 18:42:47 +0000 (20:42 +0200)]
Added Decimal support to pretty-print display utility (#230) (#273)
* Added Decimal support to pretty-print display utility (#230)
* Applied cargo fmt to fix linting errors
* Added proper printing for decimals based on scale, moved tests to pretty.rs
* Applied cargo fmt on pretty test
Co-authored-by: Manish Gill <manish.gill@tomtom.com>
Michael Edwards [Thu, 13 May 2021 11:28:46 +0000 (13:28 +0200)]
Fix subtraction underflow when sorting string arrays with many nulls (#285)
Wakahisa [Tue, 11 May 2021 05:42:41 +0000 (07:42 +0200)]
Fix null struct and list roundtrip (#270)
* fix null struct and list inconsistencies in writer
* fix list reader null and empty slot calculation
* remove stray TODOs
Daniël Heres [Tue, 11 May 2021 05:35:05 +0000 (07:35 +0200)]
Speed up bound checking in `take` (#281)
* WIP improve take performance
* WIP
* Bound checking speed
* Simplify
* fmt
* Improve formatting
Wakahisa [Mon, 10 May 2021 22:26:36 +0000 (00:26 +0200)]
Update PR template by commenting out instructions (#278)
Some contributors don't remove the guidelines when creating PRs, so it might be more convenient if we hide them behind comments.
The comments are still visible when editing, but are not displayed when the markdown is rendered
Marco Neumann [Mon, 10 May 2021 16:44:58 +0000 (18:44 +0200)]
support full u32 and u64 roundtrip through parquet (#258)
* re-export arity kernels in `arrow::compute`
Seems logical since all other kernels are re-exported as well under this
flat hierarchy.
* return file from `parquet::arrow::arrow_writer::tests::[one_column]_roundtrip`
* support full arrow u64 through parquet
- updates arrow to parquet type mapping to use reinterpret/overflow cast
for u64<->i64 similar to what the C++ stack does
- changes statistics calculation to account for the fact that u64 should
be compared unsigned (as per spec)
Fixes #254.
* avoid copying array when reading u64 from parquet
* support full arrow u32 through parquet
This is identical to the solution we now have for u64.
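
A rough sketch of the reinterpret cast and the unsigned statistics comparison described in this commit. The function names are hypothetical and this is not the parquet crate's code; it only illustrates why min/max must be computed on the unsigned view of the stored values.

```rust
/// Store a u64 in a signed 64-bit physical slot by reinterpreting its bits.
/// (Hypothetical helper; `as` between same-width integers keeps the bit pattern.)
pub fn to_physical(v: u64) -> i64 {
    v as i64
}

/// Recover the original u64 from the signed physical value.
pub fn from_physical(v: i64) -> u64 {
    v as u64
}

/// Min/max statistics over physical values, compared as unsigned integers.
/// Returns None for an empty slice.
pub fn unsigned_min_max(physical: &[i64]) -> Option<(u64, u64)> {
    let mut it = physical.iter().map(|&v| from_physical(v));
    let first = it.next()?;
    Some(it.fold((first, first), |(lo, hi), v| (lo.min(v), hi.max(v))))
}
```

With signed comparison, `u64::MAX` (stored as `-1`) would be reported as the minimum; comparing the unsigned view reports it as the maximum instead.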
Wakahisa [Fri, 7 May 2021 18:46:24 +0000 (20:46 +0200)]
1.52 clippy fixes (#267)
Dominik Moritz [Fri, 7 May 2021 05:36:56 +0000 (22:36 -0700)]
Fix typo in csv/reader.rs (#265)
hulunbier [Fri, 7 May 2021 05:32:32 +0000 (13:32 +0800)]
Fix empty Schema::metadata deserialization error (#260)
* Fix empty Schema::metadata deserialization error
Hope this fixes issue #241
* Rename UT name to `test_ser_de_metadata`
Co-authored-by: hulunbier <hulunbier>
Jiayu Liu [Thu, 6 May 2021 07:43:09 +0000 (15:43 +0800)]
update datafusion and ballista links (#259)
Jorge Leitao [Wed, 5 May 2021 04:47:26 +0000 (06:47 +0200)]
Added env to run rust in integration. (#253)
Marco Neumann [Wed, 5 May 2021 04:46:24 +0000 (06:46 +0200)]
fix NaN handling in parquet statistics (#256)
Closes #255.
Wakahisa [Tue, 4 May 2021 04:43:57 +0000 (06:43 +0200)]
fix parquet max_definition for non-null structs (#246)
* fix parquet max_definition for non-null structs
* clippy: needless reference
* Update parquet/src/arrow/levels.rs
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Jorge Leitao [Mon, 3 May 2021 14:42:45 +0000 (16:42 +0200)]
Made integration tests always run. (#248)
Andrew Lamb [Mon, 3 May 2021 14:41:30 +0000 (10:41 -0400)]
Improve docs for NullArray, new_null_array and new_empty_array (#240)
* Update docs in null_array.rs so they are discoverable
* Add doc examples for new_null_array and new_empty_array
Michael Edwards [Sat, 1 May 2021 10:19:22 +0000 (12:19 +0200)]
sort_primitive result is capped to the min of limit or values.len (#236)
* sort_primitive result is capped to the min of limit or values.len
fixes #235
* Fixed length calculation of nulls to include
* Add more sort_primitive tests for sorts /w limit
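
The capping rule in this commit can be illustrated with a small standalone sketch. `sorted_indices_with_limit` is a hypothetical function, not the arrow `sort_primitive` kernel, and it ignores nulls entirely; it only shows the result length being the minimum of the limit and the number of values.

```rust
/// Return the indices that sort `values` ascending, keeping at most `limit`
/// of them. Hypothetical helper; null handling is deliberately left out.
pub fn sorted_indices_with_limit(values: &[i32], limit: Option<usize>) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..values.len()).collect();
    indices.sort_by_key(|&i| values[i]);
    // The result length is min(limit, values.len()), so a limit larger than
    // the input never asks for more elements than exist.
    let len = limit.map_or(values.len(), |l| l.min(values.len()));
    indices.truncate(len);
    indices
}
```

A limit of 10 on a three-element input therefore yields three indices, not a length of 10.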
Jorge Leitao [Fri, 30 Apr 2021 11:26:41 +0000 (13:26 +0200)]
Disabled rebase-needed bot until it is demonstrated to work. (#243)
Ritchie Vink [Thu, 29 Apr 2021 17:42:08 +0000 (19:42 +0200)]
pin flatbuffers to 0.8.4 (#239)
* pin flatbuffer to 0.8.4
* =0.8.2
Wakahisa [Thu, 29 Apr 2021 16:26:40 +0000 (18:26 +0200)]
[Parquet] Read list field correctly (#234)
Andrew Lamb [Tue, 27 Apr 2021 12:48:46 +0000 (08:48 -0400)]
Fix code examples for RecordBatch::try_from_iter (#231)
Raphael Taylor-Davies [Tue, 27 Apr 2021 11:25:18 +0000 (12:25 +0100)]
Support string dictionaries in csv reader (#228) (#229)
Andrew Lamb [Tue, 27 Apr 2021 10:44:22 +0000 (06:44 -0400)]
ARROW-12411: [Rust] Create RecordBatches from Iterators (#7)
Ritchie Vink [Mon, 26 Apr 2021 21:49:52 +0000 (23:49 +0200)]
support LargeUtf8 in sort kernel (#26)
Jorge Leitao [Mon, 26 Apr 2021 21:48:56 +0000 (23:48 +0200)]
Removed unused files (#22)
* Removed unused files.
* Removed un-used files.
Daniël Heres [Fri, 23 Apr 2021 07:59:34 +0000 (09:59 +0200)]
Support auto-vectorization for min/max using multiversion (#9)
Andy Grove [Thu, 22 Apr 2021 16:08:23 +0000 (10:08 -0600)]
Add GitHub templates (#17)
Jorge Leitao [Thu, 22 Apr 2021 13:57:26 +0000 (15:57 +0200)]
Added rebase-needed bot (#13)
Raphael Taylor-Davies [Thu, 22 Apr 2021 13:09:59 +0000 (14:09 +0100)]
Buffer::from_slice_ref set correct capacity (#18)
Fixed ARROW-12504
Raphael Taylor-Davies [Thu, 22 Apr 2021 12:49:14 +0000 (13:49 +0100)]
ARROW-12493: Add support for writing dictionary arrays to CSV and JSON (#16)
Raphael Taylor-Davies [Thu, 22 Apr 2021 12:44:34 +0000 (13:44 +0100)]
ARROW-12426: [Rust] Fix concatenation of arrow dictionaries (#15)
Jorge Leitao [Wed, 21 Apr 2021 22:08:43 +0000 (00:08 +0200)]
Added Integration tests against arrow (#10)
* fix indent
* Made CI run on any change. (#5)
* Removed bot comment about title and JIRA. (#4)
* Allow creating issues. (#6)
* Trying running integration tests.
Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: Andy Grove <andygrove@users.noreply.github.com>
Daniël Heres [Wed, 21 Apr 2021 18:19:58 +0000 (20:19 +0200)]
Update URLs (#14)