Jacob Quinn [Sat, 9 Apr 2022 20:35:46 +0000 (14:35 -0600)]
Bump version to 2.3.0 (#312)
Ben Baumgold [Sat, 9 Apr 2022 17:33:17 +0000 (13:33 -0400)]
refactor Arrow.write to support incremental writes (#277)
* refactor Arrow.write to support incremental writes
* bump julia compat due to dependency on interpolation in Base.Threads.@spawn
* PR feedback
* add Arrow.Writer-specific tests and in-code/manual documentation
Co-authored-by: Ben Baumgold <ben.baumgold@mavensecurities.com>
Jarrett Revels [Fri, 18 Mar 2022 21:54:26 +0000 (17:54 -0400)]
add missing arrowtype(b, ::Type{<:Period}) method to enable roundtripping of Period types (#306)
Jarrett Revels [Fri, 18 Mar 2022 20:41:51 +0000 (16:41 -0400)]
fix #232 (incorrectly serialized non-concrete Dict type) by disallowing non-concrete map-like types (#305)
Sutou Kouhei [Tue, 8 Mar 2022 00:07:18 +0000 (09:07 +0900)]
Fix wrong release artifacts URL (#302)
fix #301
We should not add "apache-" prefix in
https://dist.apache.org/repos/dist/release/arrow/ because other
releases don't have "apache-" prefix.
Jacob Quinn [Sun, 6 Mar 2022 12:04:45 +0000 (05:04 -0700)]
Bump version for release (#299)
Sutou Kouhei [Fri, 25 Feb 2022 21:20:08 +0000 (06:20 +0900)]
Add verification script (#292)
fix #288
dev/release/verify_rc.sh verifies RC.
CI jobs for our RC related scripts are also added.
Sutou Kouhei [Fri, 25 Feb 2022 16:41:01 +0000 (01:41 +0900)]
Add release scripts (#290)
* Add release scripts
fix #287
* dev/release/release_rc.sh: This prepares RC related artifacts and uploads
them to https://dist.apache.org/repos/dist/dev/arrow .
* dev/release/release.sh: This publishes voted RC to
https://dist.apache.org/repos/dist/release/arrow .
* dev/release/README.md: This describes how to use the above scripts.
* .github/*: Add missing license headers.
They have been verified locally as far as possible, but some steps
can't be verified locally. There may still be some problems, but they
will be fixed in the next release process.
* Describe how to use JuliaTagBot
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
* Fix a typo
* Fix form
* Fix wrong author
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Sutou Kouhei [Tue, 22 Feb 2022 23:39:22 +0000 (08:39 +0900)]
Introduce Release audit tool (Rat) (#289)
fix #286
* dev/release/: Add scripts for Rat related
* *: Add missing license header
* scripts/update_apache_arrow_code.jl: Remove because it's no longer needed
Add a CI job that audits licenses.
Jacob Quinn [Sat, 22 Jan 2022 05:26:20 +0000 (22:26 -0700)]
Fix case where metadata is provided but empty (#276)
Fixes #253. Just a simple fix to the `toidict` utility function to
account for the empty case. This would happen when metadata was
_provided_ for a table/column, but was empty, which would probably be a
common case in programmatic environments for reading/writing arrow.
Nathan Daly [Sat, 22 Jan 2022 04:23:18 +0000 (23:23 -0500)]
Proposal: change `@scopedenum` to make modules to avoid type piracy (#267)
* Try creating module instead of overloading type getproperty
* Apply renaming from `@scopedenum` throughout the Package
Example:
- `Meta.UnionMode.Sparse` => `Meta.UnionModes.Sparse`
* Return the module since it's now meant to be user-visible:
```julia
julia> Arrow.FlatBuffers.@scopedenum MyEnum X=1 Y=2
Main.MyEnums
```
Sutou Kouhei [Sat, 1 Jan 2022 05:59:41 +0000 (14:59 +0900)]
Send issue comments notification to github@arrow.apache.org (#274)
#271
Sutou Kouhei [Wed, 29 Dec 2021 20:29:41 +0000 (05:29 +0900)]
Configure repository metadata (#272)
fix #271
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Curtis Vogt [Tue, 2 Nov 2021 06:12:02 +0000 (01:12 -0500)]
Remove use of symlinks in CI matrix (#256)
Curtis Vogt [Fri, 29 Oct 2021 15:43:31 +0000 (10:43 -0500)]
Support `AbstractPath` where file paths are used (#255)
* Support AbstractPath where file paths are used
* Set package to version 2.2.0
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Denis Barucic [Sat, 23 Oct 2021 00:06:12 +0000 (02:06 +0200)]
Replace `Base.n_waiters` with `isempty` (#254)
KronosTheLate [Tue, 12 Oct 2021 15:20:27 +0000 (17:20 +0200)]
Add Avro reference, add links (#252)
* Add Avro reference, add links
Based on [this issue](https://github.com/JuliaData/Arrow.jl/issues/251) I made.
I also added links to the relevant packages.
* Update README.md
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Kristoffer Carlsson [Mon, 4 Oct 2021 15:54:58 +0000 (17:54 +0200)]
add DOCUMENTER_KEY secret to TagBot (#250)
Jacob Quinn [Sun, 3 Oct 2021 05:17:43 +0000 (23:17 -0600)]
Update arrowjson.jl usage of Tables.Columns
Jacob Quinn [Tue, 28 Sep 2021 23:32:39 +0000 (17:32 -0600)]
bump version
Jacob Quinn [Tue, 28 Sep 2021 23:32:21 +0000 (17:32 -0600)]
Add ability to pass directory of inputs to Arrow.Table/Arrow.Stream (#246)
* Add ability to pass directory of inputs to Arrow.Table/Arrow.Stream
Implements #235. The proposed implementation here isn't too complicated;
it introduces a new `ArrowBlob` type that we'll convert any
`IO`/`String`/`Vector{UInt8}` input to, and the main
`Arrow.Table`/`Arrow.Stream` methods take a `Vector{ArrowBlob}` to
operate on. The schema of each input must match in order to be treated
as a single "table". "Must match" in this PR means `==(sch1, sch2)`,
which does include `custom_metadata` and dictionaries, which might not
be desirable, but I haven't quite had the chance to think it through all
the way yet. Otherwise, I think the rest here is pretty straightforward
and non-breaking.
* update docs
Nick Robinson [Mon, 27 Sep 2021 15:14:33 +0000 (16:14 +0100)]
Remove unused line (#245)
Jacob Quinn [Wed, 22 Sep 2021 18:42:32 +0000 (12:42 -0600)]
revert version bump
Jarrett Revels [Wed, 22 Sep 2021 17:58:27 +0000 (13:58 -0400)]
delete deprecations in preparation for v2.0 release (#241)
Jacob Quinn [Wed, 22 Sep 2021 03:30:25 +0000 (21:30 -0600)]
bump version
Jon Alm Eriksen [Thu, 16 Sep 2021 05:23:12 +0000 (07:23 +0200)]
bugfix reading arrays (#234)
* bugfix reading arrays
* add tests
Co-authored-by: Jon Alm Eriksen <jon.alm.eriksen@novelda.com>
Jarrett Revels [Tue, 14 Sep 2021 19:27:57 +0000 (15:27 -0400)]
remove global metadata cache, refactor custom_metadata API (#238)
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Jacob Quinn [Wed, 8 Sep 2021 05:43:05 +0000 (23:43 -0600)]
Update doc deploy actions
Eric Hanson [Tue, 3 Aug 2021 11:59:50 +0000 (13:59 +0200)]
fix ambiguity error (#219)
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Jarrett Revels [Sat, 24 Jul 2021 00:38:07 +0000 (20:38 -0400)]
custom struct getindex should deserialize based on the shape of the actual ArrowType, not the target JuliaType (#229)
Curtis Vogt [Fri, 23 Jul 2021 04:16:08 +0000 (23:16 -0500)]
Support Julia 1.0 for ArrowTypes package (#223)
Eric Hanson [Mon, 19 Jul 2021 13:18:11 +0000 (15:18 +0200)]
use `_id` in warn logging (#225)
Eric Hanson [Thu, 8 Jul 2021 21:49:40 +0000 (23:49 +0200)]
Add `maxlog=1` to not spam logs (#224)
Curtis Vogt [Wed, 7 Jul 2021 15:20:16 +0000 (10:20 -0500)]
Set project version to 1.6.0 (#222)
Curtis Vogt [Wed, 7 Jul 2021 13:44:17 +0000 (08:44 -0500)]
Use standalone ArrowTypes package (#212)
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Kristoffer Carlsson [Tue, 6 Jul 2021 20:54:56 +0000 (22:54 +0200)]
fix writing an empty table (#221)
Eric Hanson [Tue, 22 Jun 2021 13:41:10 +0000 (15:41 +0200)]
add metadata to `show` method (#217)
Curtis Vogt [Thu, 17 Jun 2021 16:29:27 +0000 (11:29 -0500)]
Remove old arrowtypes.jl file (#215)
Curtis Vogt [Thu, 17 Jun 2021 05:03:06 +0000 (00:03 -0500)]
Set ArrowTypes to version 1.1.0 (#213)
Simeon Schaub [Mon, 7 Jun 2021 20:18:53 +0000 (22:18 +0200)]
Rename LICENSE.md to LICENSE (#195)
Curtis Vogt [Mon, 7 Jun 2021 19:54:57 +0000 (14:54 -0500)]
Test expected log records (#204)
* Test/suppress error log issue #126 test
* Test misc warnings
Jarrett Revels [Mon, 31 May 2021 18:51:50 +0000 (14:51 -0400)]
bump Project.toml from v1.4.1 to v1.5.0 (#208)
Curtis Vogt [Mon, 31 May 2021 17:17:25 +0000 (12:17 -0500)]
Support `VersionNumber` (#205)
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Curtis Vogt [Mon, 31 May 2021 16:33:27 +0000 (11:33 -0500)]
Handle empty tuple (#201)
Pietro Vertechi [Mon, 31 May 2021 16:31:14 +0000 (18:31 +0200)]
support Date with type parameter ms (#207)
Jacob Quinn [Sat, 24 Apr 2021 03:26:48 +0000 (21:26 -0600)]
Add global metadata lock to ensure thread safety of global metadata (#183)
* Add global metadata lock to ensure thread safety of global metadata
store
Follow up to #90, based on discussions in that issue.
* fix
Jacob Quinn [Fri, 23 Apr 2021 02:19:47 +0000 (20:19 -0600)]
Ensure requested List type is requested on List getindex (#182)
* Ensure requested List type is requested on List getindex
Fixes #167. Not tested yet.
* add test
Tanmay Mohapatra [Fri, 23 Apr 2021 00:34:00 +0000 (06:04 +0530)]
ability to append partitions to existing arrow files (#160)
* ability to append partitions to an arrow file
This adds a method to `append` partitions to existing arrow files. Partitions to append to are supplied in the form of any [Tables.jl](https://github.com/JuliaData/Tables.jl)-compatible table.
Multiple record batches will be written based on the number of `Tables.partitions(tbl)` that are provided.
Each partition being appended must have the same `Tables.Schema` as the destination arrow file that is being appended to.
Other parameters that `append` accepts are similar to what `write` accepts.
* remove unused methods
* add more tests and some fixes
* allow appends to both seekable IO and files
* few changes to Stream, avoid duplication for append
store few additional stream properties in the `Stream` data type and avoid duplicating code for append functionality
* call Tables.schema on result of Tables.columns
Jarrett Revels [Fri, 23 Apr 2021 00:28:14 +0000 (20:28 -0400)]
fix propagation of maxdepth kwarg (#181)
* fix propagation of maxdepth kwarg
* bump Project.toml
Jacob Quinn [Fri, 16 Apr 2021 18:37:50 +0000 (12:37 -0600)]
Bump version
Jacob Quinn [Thu, 15 Apr 2021 15:17:53 +0000 (09:17 -0600)]
Fix case when ipc stream has no record batches, only schema (#175)
* Fix case when ipc stream has no record batches, only schema
Fixes #158. While the Julia implementation currently doesn't provide
a way to avoid writing any record batches, the pyarrow implementation has
more fine-grained control over writing and allows closing an ipc stream
without writing any record batches. In that case, on the Julia side when
reading, we just need to check for this case specifically and if so,
populate some empty columns, since we're currently relying on them being
populated when record batches are read.
* fix metadata
Jacob Quinn [Thu, 15 Apr 2021 15:17:37 +0000 (09:17 -0600)]
Fix slight perf hit when checking validity bitmap (#176)
Fixes #131. Dip me in mustard and call me a hotdog, cuz I can't tell
how/why `divrem(i - 1, 8) .+ (1, 1)` ends up being ~30% faster than
`fldmod1(i, 8)`. It'd probably be worth looking into it more, but it
works for now. The divrem code is @expandingman 's code from his
arrow/feather code.
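A minimal sketch in plain Julia (not the Arrow.jl source) of why the two expressions are interchangeable: both map a 1-based element index to a 1-based (byte, bit) position in a bitmap that packs 8 flags per byte. The function names are hypothetical, and the sketch checks only equivalence; it does not reproduce the timing difference mentioned above.

```julia
# Map a 1-based element index to a (byte, bit) position, both 1-based,
# in a validity bitmap that packs 8 validity flags per byte.
bitpos_divrem(i) = divrem(i - 1, 8) .+ (1, 1)   # formulation adopted in this commit
bitpos_fldmod1(i) = fldmod1(i, 8)               # formulation it replaced

# Both formulations agree for every positive index checked here.
agree = all(bitpos_divrem(i) == bitpos_fldmod1(i) for i in 1:10_000)
```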
Étienne Tétreault-Pinard [Wed, 14 Apr 2021 15:57:55 +0000 (11:57 -0400)]
fix () -> {} typo (#174)
Jacob Quinn [Wed, 14 Apr 2021 14:35:56 +0000 (08:35 -0600)]
Introduce Arrow.ToTimestamp for performant ZonedDateTime encoding (#173)
Fixes #95 by allowing users to wrap `ZonedDateTime` columns in
`Arrow.ToTimestamp`, which will allow the writing process to skip costly
check/conversion by assuming each element has the same timezone;
`ToTimestamp` uses the timezone of the first element.
Jacob Quinn [Tue, 13 Apr 2021 14:56:27 +0000 (08:56 -0600)]
Warn when converting Arrow.Timestamps to Dates.DateTime or ZonedDateTime (#172)
Fixes #166. The problem OP saw in the original issue was that we didn't
have a proper `ArrowTypes.fromarrow` method defined for `Dates.DateTime`
from `Arrow.Timestamp` with nanosecond precision, which is accurate in
one sense because `Dates.DateTime` only supports up to millisecond
precision. But better than just erroring when trying to access these
values later, we now do the conversion anyway, which may be lossy, and
issue a warning about the potentially lossy conversion. If > millisecond
precision is needed, then users should pass `convert=false` and operate
on the `Arrow.Timestamp` values directly for now.
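A rough illustration of the lossy conversion described above, using only the `Dates` standard library. `ns_to_datetime` is a hypothetical helper, not an Arrow.jl function; it assumes the timestamp is a count of nanoseconds since the Unix epoch and truncates anything finer than a millisecond.

```julia
using Dates

# Hypothetical helper: convert nanoseconds since the Unix epoch to a DateTime.
# DateTime only has millisecond resolution, so the remainder is dropped
# and a warning is emitted when that remainder is nonzero.
function ns_to_datetime(ns::Integer)
    ms, leftover = divrem(ns, 1_000_000)
    leftover == 0 || @warn "dropping sub-millisecond precision" leftover
    return DateTime(1970) + Millisecond(ms)
end

dt = ns_to_datetime(1_618_325_787_123_456_789)
```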
Eric Hanson [Tue, 13 Apr 2021 04:21:29 +0000 (05:21 +0100)]
use actual deprecation (#171)
Jarrett Revels [Wed, 7 Apr 2021 05:55:57 +0000 (01:55 -0400)]
add missing setmetadata! method for Arrow.Table (#170)
Jarrett Revels [Wed, 7 Apr 2021 05:54:16 +0000 (01:54 -0400)]
document guarantee that `getmetadata` returns alias not copy (#169)
Jacob Quinn [Sun, 4 Apr 2021 03:24:10 +0000 (21:24 -0600)]
Don't store table metadata globally (#165)
* Don't store table metadata globally
Fixes #90. There's no need to store table metadata globally when we can
just store it in the Table type itself and overload the `getmetadata`.
This should avoid metadata bloat in the global store.
Jacob Quinn [Fri, 2 Apr 2021 16:47:05 +0000 (10:47 -0600)]
Restructure ArrowTypes so it can be registered as its own package (#163)
* Restructure ArrowTypes so it can be registered as its own package
This will facilitate dependants wishing to overload ArrowTypes interface
functions for their custom types. This is basically a "free" dependency
they can take on to avoid all the extra dependencies that Arrow.jl
itself has.
* Run ArrowTypes tests when we do CI for Arrow.jl
* fix
* fix
Jacob Quinn [Fri, 2 Apr 2021 16:02:16 +0000 (10:02 -0600)]
DataAPI methods (#164)
* Add refpool, refarray and levels for DictEncoded
* Apply tests to the correct variable name.
Co-authored-by: Douglas Bates <dmbates@gmail.com>
Douglas Bates [Thu, 1 Apr 2021 22:13:50 +0000 (17:13 -0500)]
Add refpool, refarray and levels for DictEncoded (#161)
Jacob Quinn [Thu, 1 Apr 2021 21:55:58 +0000 (15:55 -0600)]
Tweak promoteunion to always avoid abstract types (#162)
This makes the 2nd `concretecheck` unnecessary in the `ToArrow`
constructor, but we'll leave it for now as a type of assert. This came
from a discussion with @iamed on slack, where it was pointed out that
values always have a concrete type at runtime (yay Julia!), so we should
_always_ be able to get a `Union{...}` of concrete types. This could
potentially get crazy if someone has, like, thousands of unique concrete
types in a single array, but I guess we'll cross that bridge when we
come to it.
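The idea that runtime values always yield a `Union` of concrete types can be sketched in a few lines of plain Julia. `concrete_union` is a hypothetical stand-in for illustration only; it is not the package's `promoteunion` and ignores any promotion rules the real function may apply.

```julia
# Hypothetical helper: fold the runtime type of every element into one Union.
# Each `typeof(x)` is concrete, so the result is a Union of concrete types
# even when the container's declared eltype is abstract (here `Any`).
concrete_union(xs) = mapreduce(typeof, (S, T) -> Union{S, T}, xs; init=Union{})

U = concrete_union(Any[1, 2.0, missing])
```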
Jacob Quinn [Thu, 1 Apr 2021 16:18:27 +0000 (10:18 -0600)]
Allow avoiding the feather file format when writing arrow data to a file name; that means allowing passing `file=false` when providing a filename::String to Arrow.write
Jacob Quinn [Mon, 29 Mar 2021 13:23:06 +0000 (07:23 -0600)]
Bump version
Jacob Quinn [Mon, 29 Mar 2021 13:22:44 +0000 (07:22 -0600)]
Overhaul type serialization/deserialization machinery (#156)
* Start work on overhauling type serialization architecture
* More work; serialization is pretty much done but not tested
* fix timetype ArrowTypes definitions
* more work to get tests passing
* get tests passing?
* fix
* Fix #75 by supporting Set serialization/deserialization
* Fix #85 by supporting tuple serialization/deserialization
* Lots of cleanup
* few more fixes
* Update src/arrowtypes.jl
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* Update src/arrowtypes.jl
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* fix NullKind reading
* Fix #134 by requiring concrete or union of concrete element types for
all columns when serializing
* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata that can be used in JuliaType
* Update manual
* tests
Jacob Quinn [Thu, 18 Mar 2021 06:22:40 +0000 (00:22 -0600)]
Better handle errors when something goes wrong writing partitions (#154)
* Better handle errors when something goes wrong writing partitions
Follow up to #108. There were actually a few different issues all coming
together in @kjanosz's comment. The first being that `ZipFile.Reader`
doesn't like non-main threads touching its files _at all_. The Arrow.jl
problem there is when processing non-first partitions, we were just
`Threads.@spawn`-ing a task which then went off and sometimes became the
proverbial unheard tree falling in the forest. Like really, no one heard
these tasks and their poor exceptions.
The solution there is to be better shepherds of our spawned tasks; we
introduce an `anyerror` atomic bool that threads can set if they run into
issues and then in the main partition handling loop we'll check that. If
a thread ran into something, we'll log out the thread-specific
exception, then throw a generic writing error. This prevents the
"hanging" behavior people were seeing because the felled threads were
actually causing later tasks to hang on `put!`-ing into the
OrderedChannel.
After addressing this, we have the multithreaded ZipFile issue. With the
new `ntasks` keyword, it seems to make sense to me that if you pass
`ntasks=1`, you're really saying, I don't want any concurrency. So
anywhere we were `Threads.@spawn`ing, we now check if `ntasks == 1` and
if so, do an `@async` instead.
* Fix
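The error-flag pattern described above can be sketched roughly as follows. This is an assumption-laden illustration in plain Julia, not the code from this PR: `run_partitions` and its `fail_at` keyword are hypothetical, and the "work" is just a simulated failure.

```julia
# Hypothetical sketch: spawned tasks record failures in a shared atomic flag
# instead of failing silently; the caller inspects the flag after waiting.
function run_partitions(n; fail_at=0)
    anyerror = Threads.Atomic{Bool}(false)
    tasks = map(1:n) do i
        Threads.@spawn begin
            try
                # Simulated work: one chosen partition fails.
                i == fail_at && error("partition $i failed")
            catch e
                # Log the task-specific exception, then raise the shared flag.
                @error "error writing partition" partition=i exception=e
                anyerror[] = true
            end
        end
    end
    foreach(wait, tasks)
    return anyerror[]
end

failed = run_partitions(4; fail_at=3)
```

A caller would then check the returned flag and throw a generic writing error when it is set, so a failed task cannot leave later tasks waiting indefinitely.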
Jacob Quinn [Thu, 18 Mar 2021 04:35:12 +0000 (22:35 -0600)]
Add ntasks keyword to limit # of tasks allowed to write at a time (#106)
* Add ntasks keyword to limit # of tasks allowed to write at a time
* fix
Jarrett Revels [Wed, 17 Mar 2021 23:11:21 +0000 (19:11 -0400)]
add unexported tobuffer utility for interactive testing/development (#153)
Jarrett Revels [Sun, 14 Mar 2021 03:32:19 +0000 (22:32 -0500)]
revert setting Arrow.write debug message threshold to -1 (#152)
Jacob Quinn [Fri, 12 Mar 2021 16:07:04 +0000 (09:07 -0700)]
Ensure serializing Arrow.DictEncoded writes dictionary messages (#149)
Fixes #126. The issue here was when `Arrow.write` was faced with the
task of serializing an `Arrow.DictEncoded`. For most arrow array types,
if the input array is already an arrow array type, it's a no-op (e.g. if
you're writing out an `Arrow.Table`). The problem comes from
`Arrow.DictEncoded`, where there is still no conversion required, but we
do need to make a note of the dict encoded column to ensure a dictionary
message is written before the record batch. In addition, we also add
some code for handling delta dictionary messages if required from
multiple record batches that contain `Arrow.DictEncoded`s, which is a
valid use-case where you may have multiple arrow files, with the same
schema, that you wish to serialize as a single arrow file w/ each file
as a separate record batch.
Slightly unrelated, but there's also a fix here in our use of Lockable.
We actually had a race condition I ran into once where the locking was
on the Lockable object, but inside the locked region, we replaced the
entire Lockable instead of the _contents_ of the Lockable. This meant
anyone who started waiting on the Lockable's lock didn't see updates
when unlocked because the entire Lockable had been updated.
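The locking fix in the last paragraph can be illustrated with a small sketch. `Guarded` is a hypothetical wrapper defined here for the example; it is not the `Lockable` type the package uses. The point is that the locked region mutates the guarded contents in place rather than rebinding the wrapper, so every waiter keeps synchronizing on the same lock and sees the update.

```julia
# Hypothetical wrapper pairing a value with the lock that guards it.
struct Guarded{T}
    value::T
    lock::ReentrantLock
end
Guarded(value) = Guarded(value, ReentrantLock())

cache = Guarded(Dict{Int,String}())

# Correct pattern: hold the lock and mutate the *contents* in place.
# The racy pattern would instead build a brand-new wrapper inside the locked
# region, leaving other tasks waiting on a lock that no longer guards the data.
function update!(g::Guarded, key, val)
    lock(g.lock) do
        g.value[key] = val
    end
    return g
end

update!(cache, 1, "a")
```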
Jacob Quinn [Fri, 12 Mar 2021 05:16:34 +0000 (22:16 -0700)]
Ensure dict encoded index types match from record batch to record batch (#148)
Fixes #144. The core issue here was the initial record batch had a
dict-encoded column that ended up having an index type of Int8. However,
in a subsequent record batch, we use a different code path for dict
encoded columns because we need to check if a dictionary delta message
needs to be sent (i.e. there are new pooled values that need to be
serialized). The problem was in this code path, the index type was
computed from the total length of the input column instead of matching
what was already serialized in the initial schema message.
This does open up the question of another possible failure: if an
initial dict encoded column is serialized with an index type of Int8,
yet subsequent record batches end up including enough unique values that
this index type will be overflowed. I've added in an error check for
this case. Currently it's a fatal error that will stop the `Arrow.write`
process completely. I'm not quite sure what the best recommendation
would be in that case; ultimately the user needs to widen the
first record batch column index type, but perhaps we should allow
passing a dict-encoded index type to the overall `Arrow.write` function
so users can easily specify what that type should be.
The other change that had to be made in this PR is on the reading side,
since we're now tracking the index type in the DictEncoding type itself,
which probably not coincidentally is what the arrow-json struct already
does. For reading, we already have access to the dictionary field, so
it's just a matter of deserializing the index type before constructing
the DictEncoding struct.
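A minimal sketch of the kind of overflow check described above. Both helpers are hypothetical and are not taken from the Arrow.jl source; the sketch assumes 0-based dictionary indices, so a pool of `npool` unique values needs indices up to `npool - 1`.

```julia
# Hypothetical helper: can a pool of `npool` unique values be addressed by the
# signed index type `I`, given 0-based indices?
fits_index(::Type{I}, npool::Integer) where {I<:Signed} = npool - 1 <= typemax(I)

# Hypothetical check: keep the index type fixed by the first record batch and
# fail loudly once a later batch would overflow it.
function check_index_type(::Type{I}, npool::Integer) where {I<:Signed}
    fits_index(I, npool) ||
        error("dictionary has $npool unique values, too many for index type $I")
    return I
end

ok = check_index_type(Int8, 128)
```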
Jacob Quinn [Wed, 10 Mar 2021 23:40:26 +0000 (16:40 -0700)]
Introduce new `maxdepth` keyword argument for setting a limit on nesting level (#147)
Alternative fix for #143. This is a more general fix than just
specializing CategoricalArrays. This should prevent more general cases
of the same issue: i.e. someone accidentally passes a recursive data
structure and `Arrow.write` gets stuck trying to recursively serialize.
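How a depth limit stops runaway recursion can be sketched in plain Julia. `nesting_depth` is a hypothetical function written for this illustration; it only walks nested vectors and is not how `Arrow.write` implements `maxdepth`.

```julia
# Hypothetical sketch: compute how deeply vectors are nested, erroring out once
# the depth exceeds `maxdepth` instead of recursing forever.
function nesting_depth(x, maxdepth::Integer, depth::Integer=0)
    x isa AbstractVector || return depth      # scalars add no nesting level
    depth += 1
    depth > maxdepth && error("nesting level $depth exceeds maxdepth=$maxdepth")
    isempty(x) && return depth
    return maximum(nesting_depth(el, maxdepth, depth) for el in x)
end

shallow = nesting_depth([[1, 2], [3]], 6)

# A self-referential vector would recurse without end if there were no limit.
recursive = Any[]
push!(recursive, recursive)
```

Calling `nesting_depth(recursive, 6)` raises an error after a handful of levels rather than hanging.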
Damien Drix [Fri, 5 Mar 2021 06:06:42 +0000 (06:06 +0000)]
implement Base.IteratorSize for Stream, fixes #141 (#142)
Jarrett Revels [Fri, 5 Feb 2021 04:52:00 +0000 (23:52 -0500)]
fix accidental invocation of _unsafe_load_tuple (#124)
* fix accidental invocation of _unsafe_load_tuple
* bump Project.toml
Jacob Quinn [Thu, 4 Feb 2021 06:51:37 +0000 (23:51 -0700)]
Bump version
Douglas Bates [Thu, 4 Feb 2021 06:51:14 +0000 (00:51 -0600)]
Use pool length in signed int conversion (#122)
Jacob Quinn [Sun, 31 Jan 2021 06:20:11 +0000 (23:20 -0700)]
Bump version
Jacob Quinn [Sun, 31 Jan 2021 06:18:44 +0000 (23:18 -0700)]
Rework dict encoding of PooledArray/CategoricalArray to fix outstandi… (#119)
* Rework dict encoding of PooledArray/CategoricalArray to fix outstanding issues
Fixes #117, #116, and #113. For #116, we just need to special case if user happens to pass in a DictEncoded themselves. We need to pass it through to the `toarrowvector` method that no-ops. For #113, we require the new functionality in PooledArrays that allows passing the `signed` and `compress` keyword arguments to ensure we get signed refs for our dict encoding. For #117, we add CategoricalArrays as a test dependency and ensure that if it contains any `missing` value, we *don't* recode the indices values down by 1, since the `missing` ref is 0, so other refs can already be considered "offsets". If there are no `missing`, then we still need to recode down since refs should always start from 0 in arrow format.
* PooledArrays 1.0 compat
* Update src/arraytypes/dictencoding.jl
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
* Check refpool
* Fix test
Jacob Quinn [Sat, 30 Jan 2021 07:04:34 +0000 (00:04 -0700)]
Make compressed writing threadsafe (#118)
Fixes #82. The problem when trying to write arrow using multiple threads and compression was there was only a single compressor object that each thread was simultaneously trying to use. This PR ensures there is a compressor object per thread that will be used per thread.
Jacob Quinn [Mon, 25 Jan 2021 18:39:49 +0000 (11:39 -0700)]
Bump version
Jacob Quinn [Mon, 25 Jan 2021 18:39:08 +0000 (11:39 -0700)]
Fix copy on DictEncode (#111)
Fixes #102. The issue comes up because DataFrames constructor tries to
make a copy of input columns by default when constructing; for
DictEncode, it's just a wrapper to signal that a column should be
copied, so we just make a shallow copy.
Jacob Quinn [Sat, 23 Jan 2021 04:27:09 +0000 (21:27 -0700)]
Don't use ChainedVector as DictEncoding data array unless necessary (#110)
Fixes #109. The issue here was when reading arrow record batches with
dict encoded columns, we eagerly used `ChainedVector` for the underlying
array backing the `DictEncoding` in case there were subsequent
record batches that added additional elements to the dict encoding. This
is too eager though, since it's probably common, like for "feather"
files, where the dict encoding values are always known and provided in
the first record batch. In fact, several language implementations don't
even support these kinds of "delta" dict updates in subsequent record
batches. This PR, therefore, uses a regular array for the dict encoding
backing for the first record batch, and only promotes to a ChainedVector
if we happen to get a delta update.
Jarrett Revels [Tue, 19 Jan 2021 18:12:53 +0000 (13:12 -0500)]
bump Project.toml to v1.2.0 (#107)
Jarrett Revels [Tue, 12 Jan 2021 06:29:34 +0000 (01:29 -0500)]
add isbitstype optimized path for FixedSizeList getindex (#104)
* add isbitstype optimized path for FixedSizeList getindex
* reuse _unsafe_cast strategy from #103
* rebase + DRY _unsafe_cast
Jarrett Revels [Mon, 11 Jan 2021 22:51:20 +0000 (17:51 -0500)]
change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary (#103)
* change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary
* fix tests
* optimize UInt128 <-> NTuple{16,UInt8} casting
Co-authored-by: SimonDanisch <sdanisch@protonmail.com>
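A rough sketch of a UUID to 16-byte round trip using only Base and the `UUIDs` standard library. `to_bytes` and `from_bytes` are hypothetical names, the byte order is simply the machine's native order, and this is not the optimized cast from the PR.

```julia
using UUIDs

# Hypothetical helpers: view the UUID's 128-bit value as 16 raw bytes and back.
# Going through a one-element array keeps this valid on older Julia versions.
to_bytes(u::UUID) = Tuple(reinterpret(UInt8, [UInt128(u)]))
from_bytes(b::NTuple{16,UInt8}) = UUID(first(reinterpret(UInt128, collect(b))))

u = UUID("123e4567-e89b-12d3-a456-426614174000")
bytes = to_bytes(u)
roundtripped = from_bytes(bytes)
```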
Jacob Quinn [Thu, 7 Jan 2021 05:28:13 +0000 (22:28 -0700)]
Add missing license
Jacob Quinn [Thu, 7 Jan 2021 05:10:39 +0000 (22:10 -0700)]
Admin cleanup
Jacob Quinn [Wed, 6 Jan 2021 04:40:57 +0000 (21:40 -0700)]
Add BitIntegers compat (#100)
Jarrett Revels [Tue, 5 Jan 2021 23:59:32 +0000 (18:59 -0500)]
bump Project.toml to v1.1.0 (#94)
Jacob Quinn [Tue, 5 Jan 2021 23:59:19 +0000 (16:59 -0700)]
Fix copy on DictEncoding arrays with missing values (#99)
The copy code for DictEncoding erroneously tried to special-case
`missing` when copying, which is unnecessary since `missing` is just
treated like any other regular ref value. By removing the special-cased
branch of code, we avoid treating it differently and the code copies the
`DictEncoding` array as a `PooledArray` as expected.
Eric Hanson [Tue, 5 Jan 2021 23:51:06 +0000 (00:51 +0100)]
Update make.jl (#97)
Jarrett Revels [Tue, 5 Jan 2021 23:30:27 +0000 (18:30 -0500)]
convert Arrow-flavored eltypes to Julia-flavored eltypes on copy (#98)
* convert Arrow-flavored eltypes to Julia-flavored eltypes on copy
* Update src/arraytypes/primitive.jl
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Eric Hanson [Tue, 5 Jan 2021 05:43:27 +0000 (06:43 +0100)]
Add warning for `Arrow.ArrowTypes.registertype!` (#96)
Jarrett Revels [Wed, 30 Dec 2020 06:13:44 +0000 (01:13 -0500)]
add default UUID <-> UInt128 Arrow type mapping (#89)
* add default UUID <-> UInt128 Arrow type mapping
* add UUIDs to non-standard types table test state
* add test for deprecation path for old UUID autoconversion
Eric Hanson [Thu, 24 Dec 2020 06:03:29 +0000 (07:03 +0100)]
add `ArrowTypes.default` methods and tests for dates (#86)
Jacob Quinn [Wed, 16 Dec 2020 21:02:37 +0000 (14:02 -0700)]
Support new Decimal256 type (#79)
* Support new Decimal256 type
* Fix tests, update manual
Jacob Quinn [Sat, 12 Dec 2020 06:31:43 +0000 (23:31 -0700)]
Update README & ci
Jacob Quinn [Wed, 2 Dec 2020 06:22:20 +0000 (23:22 -0700)]
Bump version
Jacob Quinn [Wed, 2 Dec 2020 06:21:51 +0000 (23:21 -0700)]
Fix Union type deserialization (#77)
* Fix Union type deserialization
Fixes #76. I think this was just a relic of old code from early work on
the package, but `juliaeltype` is meant to return "julia" types, not
arrow types.
* remove unintended change