arrow-julia.git
7 months ago Send issue comments notification to github@arrow.apache.org (#274)
Sutou Kouhei [Sat, 1 Jan 2022 05:59:41 +0000 (14:59 +0900)] 
Send issue comments notification to github@arrow.apache.org (#274)

#271

7 months ago Configure repository metadata (#272)
Sutou Kouhei [Wed, 29 Dec 2021 20:29:41 +0000 (05:29 +0900)] 
Configure repository metadata (#272)

fix #271

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
9 months ago Remove use of symlinks in CI matrix (#256)
Curtis Vogt [Tue, 2 Nov 2021 06:12:02 +0000 (01:12 -0500)] 
Remove use of symlinks in CI matrix (#256)

9 months ago Support `AbstractPath` where file paths are used (#255) v2.2.0
Curtis Vogt [Fri, 29 Oct 2021 15:43:31 +0000 (10:43 -0500)] 
Support `AbstractPath` where file paths are used (#255)

* Support AbstractPath where file paths are used
* Set package to version 2.2.0

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
9 months ago Replace `Base.n_waiters` with `isempty` (#254)
Denis Barucic [Sat, 23 Oct 2021 00:06:12 +0000 (02:06 +0200)] 
Replace `Base.n_waiters` with `isempty` (#254)

9 months ago Add Avro reference, add links (#252)
KronosTheLate [Tue, 12 Oct 2021 15:20:27 +0000 (17:20 +0200)] 
Add Avro reference, add links (#252)

* Add Avro reference, add links

Based on [this issue](https://github.com/JuliaData/Arrow.jl/issues/251) I made.
I also added links to the relevant packages.

* Update README.md

Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
10 months ago add DOCUMENTER_KEY secret to TagBot (#250)
Kristoffer Carlsson [Mon, 4 Oct 2021 15:54:58 +0000 (17:54 +0200)] 
add DOCUMENTER_KEY secret to TagBot (#250)

10 months ago Update arrowjson.jl usage of Tables.Columns
Jacob Quinn [Sun, 3 Oct 2021 05:17:43 +0000 (23:17 -0600)] 
Update arrowjson.jl usage of Tables.Columns

10 months ago bump version v2.1.0
Jacob Quinn [Tue, 28 Sep 2021 23:32:39 +0000 (17:32 -0600)] 
bump version

10 months ago Add ability to pass directory of inputs to Arrow.Table/Arrow.Stream (#246)
Jacob Quinn [Tue, 28 Sep 2021 23:32:21 +0000 (17:32 -0600)] 
Add ability to pass directory of inputs to Arrow.Table/Arrow.Stream (#246)

* Add ability to pass directory of inputs to Arrow.Table/Arrow.Stream

Implements #235. The proposed implementation here isn't too complicated;
it introduces a new `ArrowBlob` type that we'll convert any
`IO`/`String`/`Vector{UInt8}` input to, and the main
`Arrow.Table`/`Arrow.Stream` methods take a `Vector{ArrowBlob}` to
operate on. The schema of each input must match in order to be treated
as a single "table". "Must match" in this PR means `==(sch1, sch2)`,
which does include `custom_metadata` and dictionaries, which might not
be desirable, but I haven't quite had the chance to think it all the way
through yet. Otherwise, I think the rest here is pretty straightforward
and non-breaking.

* update docs
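The schema-matching rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the package's code: `Schema` here is a hypothetical, simplified stand-in (the real schema also carries dictionaries), and `check_same_schema` is an invented helper name.

```julia
# Hypothetical, simplified schema used only for this sketch.
struct Schema
    names::Vector{Symbol}
    types::Vector{DataType}
    custom_metadata::Dict{String,String}
end

# Field-by-field equality, so metadata differences count as a mismatch.
Base.:(==)(a::Schema, b::Schema) =
    a.names == b.names && a.types == b.types && a.custom_metadata == b.custom_metadata

# Treat several inputs as one table only if every schema equals the first.
function check_same_schema(schemas::Vector{Schema})
    first_schema = first(schemas)
    for (i, sch) in enumerate(schemas)
        sch == first_schema || error("schema of input $i does not match the first input")
    end
    return first_schema
end
```

Because equality includes `custom_metadata`, two inputs with identical columns but different metadata are rejected, which is the caveat the message raises.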

10 months ago Remove unused line (#245)
Nick Robinson [Mon, 27 Sep 2021 15:14:33 +0000 (16:14 +0100)] 
Remove unused line (#245)

10 months ago revert version bump v2.0.0
Jacob Quinn [Wed, 22 Sep 2021 18:42:32 +0000 (12:42 -0600)] 
revert version bump

10 months ago delete deprecations in preparation for v2.0 release (#241)
Jarrett Revels [Wed, 22 Sep 2021 17:58:27 +0000 (13:58 -0400)] 
delete deprecations in preparation for v2.0 release (#241)

10 months ago bump version
Jacob Quinn [Wed, 22 Sep 2021 03:30:25 +0000 (21:30 -0600)] 
bump version

10 months ago bugfix reading arrays (#234)
Jon Alm Eriksen [Thu, 16 Sep 2021 05:23:12 +0000 (07:23 +0200)] 
bugfix reading arrays (#234)

* bugfix reading arrays

* add tests

Co-authored-by: Jon Alm Eriksen <jon.alm.eriksen@novelda.com>
10 months ago remove global metadata cache, refactor custom_metadata API (#238)
Jarrett Revels [Tue, 14 Sep 2021 19:27:57 +0000 (15:27 -0400)] 
remove global metadata cache, refactor custom_metadata API (#238)

Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
10 months ago Update doc deploy actions
Jacob Quinn [Wed, 8 Sep 2021 05:43:05 +0000 (23:43 -0600)] 
Update doc deploy actions

12 months ago fix ambiguity error (#219)
Eric Hanson [Tue, 3 Aug 2021 11:59:50 +0000 (13:59 +0200)] 
fix ambiguity error (#219)

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
12 months ago custom struct getindex should deserialize based on the shape of the actual ArrowType... v1.6.2
Jarrett Revels [Sat, 24 Jul 2021 00:38:07 +0000 (20:38 -0400)] 
custom struct getindex should deserialize based on the shape of the actual ArrowType, not the target JuliaType (#229)

12 months ago Support Julia 1.0 for ArrowTypes package (#223)
Curtis Vogt [Fri, 23 Jul 2021 04:16:08 +0000 (23:16 -0500)] 
Support Julia 1.0 for ArrowTypes package (#223)

12 months ago use `_id` in warn logging (#225) v1.6.1
Eric Hanson [Mon, 19 Jul 2021 13:18:11 +0000 (15:18 +0200)] 
use `_id` in warn logging (#225)

13 months ago Add `maxlog=1` to not spam logs (#224)
Eric Hanson [Thu, 8 Jul 2021 21:49:40 +0000 (23:49 +0200)] 
Add `maxlog=1` to not spam logs (#224)

13 months ago Set project version to 1.6.0 (#222) v1.6.0
Curtis Vogt [Wed, 7 Jul 2021 15:20:16 +0000 (10:20 -0500)] 
Set project version to 1.6.0 (#222)

13 months ago Use standalone ArrowTypes package (#212)
Curtis Vogt [Wed, 7 Jul 2021 13:44:17 +0000 (08:44 -0500)] 
Use standalone ArrowTypes package (#212)

Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
13 months ago fix writing an empty table (#221)
Kristoffer Carlsson [Tue, 6 Jul 2021 20:54:56 +0000 (22:54 +0200)] 
fix writing an empty table (#221)

13 months ago add metadata to `show` method (#217)
Eric Hanson [Tue, 22 Jun 2021 13:41:10 +0000 (15:41 +0200)] 
add metadata to `show` method (#217)

13 months ago Remove old arrowtypes.jl file (#215)
Curtis Vogt [Thu, 17 Jun 2021 16:29:27 +0000 (11:29 -0500)] 
Remove old arrowtypes.jl file (#215)

13 months ago Set ArrowTypes to version 1.1.0 (#213)
Curtis Vogt [Thu, 17 Jun 2021 05:03:06 +0000 (00:03 -0500)] 
Set ArrowTypes to version 1.1.0 (#213)

14 months ago Rename LICENSE.md to LICENSE (#195)
Simeon Schaub [Mon, 7 Jun 2021 20:18:53 +0000 (22:18 +0200)] 
Rename LICENSE.md to LICENSE (#195)

14 months ago Test expected log records (#204)
Curtis Vogt [Mon, 7 Jun 2021 19:54:57 +0000 (14:54 -0500)] 
Test expected log records (#204)

* Test/suppress error log issue #126 test

* Test misc warnings

14 months ago bump Project.toml from v1.4.1 to v1.5.0 (#208) v1.5.0
Jarrett Revels [Mon, 31 May 2021 18:51:50 +0000 (14:51 -0400)] 
bump Project.toml from v1.4.1 to v1.5.0 (#208)

14 months ago Support `VersionNumber` (#205)
Curtis Vogt [Mon, 31 May 2021 17:17:25 +0000 (12:17 -0500)] 
Support `VersionNumber` (#205)

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
14 months ago Handle empty tuple (#201)
Curtis Vogt [Mon, 31 May 2021 16:33:27 +0000 (11:33 -0500)] 
Handle empty tuple (#201)

14 months ago support Date with type parameter ms (#207)
Pietro Vertechi [Mon, 31 May 2021 16:31:14 +0000 (18:31 +0200)] 
support Date with type parameter ms (#207)

15 months ago Add global metadata lock to ensure thread safety of global metadata (#183)
Jacob Quinn [Sat, 24 Apr 2021 03:26:48 +0000 (21:26 -0600)] 
Add global metadata lock to ensure thread safety of global metadata (#183)

* Add global metadata lock to ensure thread safety of global metadata
store

Follow up to #90, based on discussions in that issue.

* fix

15 months ago Ensure requested List type is requested on List getindex (#182)
Jacob Quinn [Fri, 23 Apr 2021 02:19:47 +0000 (20:19 -0600)] 
Ensure requested List type is requested on List getindex (#182)

* Ensure requested List type is requested on List getindex

Fixes #167. Not tested yet.

* add test

15 months ago ability to append partitions to existing arrow files (#160)
Tanmay Mohapatra [Fri, 23 Apr 2021 00:34:00 +0000 (06:04 +0530)] 
ability to append partitions to existing arrow files (#160)

* ability to append partitions to an arrow file

This adds a method to `append` partitions to existing arrow files. Partitions to append are supplied in the form of any [Tables.jl](https://github.com/JuliaData/Tables.jl)-compatible table.

Multiple record batches will be written based on the number of `Tables.partitions(tbl)` that are provided.

Each partition being appended must have the same `Tables.Schema` as the destination arrow file that is being appended to.

Other parameters that `append` accepts are similar to what `write` accepts.

* remove unused methods

* add more tests and some fixes

* allow appends to both seekable IO and files

* few changes to Stream, avoid duplication for append

store few additional stream properties in the `Stream` data type and avoid duplicating code for append functionality

* call Tables.schema on result of Tables.columns

15 months ago fix propagation of maxdepth kwarg (#181) v1.4.1
Jarrett Revels [Fri, 23 Apr 2021 00:28:14 +0000 (20:28 -0400)] 
fix propagation of maxdepth kwarg (#181)

* fix propagation of maxdepth kwarg

* bump Project.toml

15 months ago Bump version v1.4.0
Jacob Quinn [Fri, 16 Apr 2021 18:37:50 +0000 (12:37 -0600)] 
Bump version

15 months ago Fix case when ipc stream has no record batches, only schema (#175)
Jacob Quinn [Thu, 15 Apr 2021 15:17:53 +0000 (09:17 -0600)] 
Fix case when ipc stream has no record batches, only schema (#175)

* Fix case when ipc stream has no record batches, only schema

Fixes #158. While the Julia implementation currently doesn't provide a
way to avoid writing any record batches, the pyarrow implementation has
more fine-grained control over writing and allows closing an ipc stream
without writing any record batches. In that case, on the Julia side when
reading, we just need to check for this case specifically and if so,
populate some empty columns, since we're currently relying on them being
populated when record batches are read.

* fix metadata
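As a rough illustration of "populate some empty columns" when a stream contains only a schema, the sketch below builds zero-length columns from name and element-type pairs. `empty_columns` and the pair-based schema representation are assumptions made for this example, not the package's internals.

```julia
# Build a NamedTuple of zero-length vectors from `name => eltype` pairs.
# `empty_columns` is a hypothetical helper for this sketch.
function empty_columns(schema)
    names = Tuple(first(p) for p in schema)
    cols = Tuple(Vector{last(p)}() for p in schema)
    return NamedTuple{names}(cols)
end

tbl = empty_columns([:id => Int, :label => String])
```

The result has the right column names and element types but no rows, so code that expects populated columns still finds them.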

15 months ago Fix slight perf hit when checking validity bitmap (#176)
Jacob Quinn [Thu, 15 Apr 2021 15:17:37 +0000 (09:17 -0600)] 
Fix slight perf hit when checking validity bitmap (#176)

Fixes #131. Dip me in mustard and call me a hotdog, cuz I can't tell
how/why `divrem(i - 1, 8) .+ (1, 1)` ends up being ~30% faster than
`fldmod1(i, 8)`. It'd probably be worth looking into it more, but it
works for now. The divrem code is @expandingman's code from his
arrow/feather code.
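The two index computations named above give the same 1-based (byte, bit) position, which the following sketch checks. The function names and the `isvalid_at` helper are hypothetical; only the two expressions come from the commit message, and no claim is made here about their relative speed.

```julia
# Two ways to map a 1-based element index to a 1-based (byte, bit) position
# in a bitmap that packs 8 flags per byte.
bitpos_divrem(i::Integer) = divrem(i - 1, 8) .+ (1, 1)
bitpos_fldmod1(i::Integer) = fldmod1(i, 8)

# Hypothetical helper: read flag `i`, assuming the least-significant bit of
# each byte holds the first flag of that byte.
function isvalid_at(bitmap::Vector{UInt8}, i::Integer)
    byte, bit = bitpos_divrem(i)
    return (bitmap[byte] >> (bit - 1)) & 0x01 == 0x01
end

# Both formulations agree for every index checked here.
agree = all(bitpos_divrem(i) == bitpos_fldmod1(i) for i in 1:1_000)
```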

15 months ago fix () -> {} typo (#174)
Étienne Tétreault-Pinard [Wed, 14 Apr 2021 15:57:55 +0000 (11:57 -0400)] 
fix () -> {} typo (#174)

15 months ago Introduce Arrow.ToTimestamp for performant ZonedDateTime encoding (#173)
Jacob Quinn [Wed, 14 Apr 2021 14:35:56 +0000 (08:35 -0600)] 
Introduce Arrow.ToTimestamp for performant ZonedDateTime encoding (#173)

Fixes #95 by allowing users to wrap `ZonedDateTime` columns in
`Arrow.ToTimestamp`, which will allow the writing process to skip costly
check/conversion by assuming each element has the same timezone;
`ToTimestamp` uses the timezone of the first element.

15 months ago Warn when converting Arrow.Timestamps to Dates.DateTime or ZonedDateTime (#172)
Jacob Quinn [Tue, 13 Apr 2021 14:56:27 +0000 (08:56 -0600)] 
Warn when converting Arrow.Timestamps to Dates.DateTime or ZonedDateTime (#172)

Fixes #166. The problem OP saw in the original issue was that we didn't
have a proper `ArrowTypes.fromarrow` method defined for `Dates.DateTime`
from `Arrow.Timestamp` with nanosecond precision, which is accurate in
one sense because `Dates.DateTime` only supports up to millisecond
precision. But better than just erroring when trying to access these
values later, we now do the conversion anyway, which may be lossy, and
issue a warning about the potentially lossy conversion. If > millisecond
precision is needed, then users should pass `convert=false` and operate
on the `Arrow.Timestamp` values directly for now.
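The lossy conversion can be pictured with the standard `Dates` library alone. The sketch below assumes a timestamp stored as nanoseconds since the Unix epoch; `lossy_datetime` is a hypothetical helper, and warning only when digits are actually dropped is a choice made for this example.

```julia
using Dates

# Convert nanoseconds since the Unix epoch to a DateTime.
# DateTime stores milliseconds, so any sub-millisecond remainder is dropped.
function lossy_datetime(ns::Int64)
    ms, rem_ns = fldmod(ns, 1_000_000)
    if rem_ns != 0
        @warn "nanosecond timestamp truncated to millisecond precision" maxlog=1
    end
    return DateTime(1970, 1, 1) + Millisecond(ms)
end

dt = lossy_datetime(1_500_000_000_123_456_789)
```

Here the trailing `456_789` nanoseconds cannot be represented in a `DateTime`, which is why the message recommends working with the raw timestamp values when that precision matters.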

15 months ago use actual deprecation (#171)
Eric Hanson [Tue, 13 Apr 2021 04:21:29 +0000 (05:21 +0100)] 
use actual deprecation (#171)

16 months ago add missing setmetadata! method for Arrow.Table (#170)
Jarrett Revels [Wed, 7 Apr 2021 05:55:57 +0000 (01:55 -0400)] 
add missing setmetadata! method for Arrow.Table (#170)

16 months ago document guarantee that `getmetadata` returns alias not copy (#169)
Jarrett Revels [Wed, 7 Apr 2021 05:54:16 +0000 (01:54 -0400)] 
document guarantee that `getmetadata` returns alias not copy (#169)

16 months ago Don't store table metadata globally (#165)
Jacob Quinn [Sun, 4 Apr 2021 03:24:10 +0000 (21:24 -0600)] 
Don't store table metadata globally (#165)

* Don't store table metadata globally

Fixes #90. There's no need to store table metadata globally when we can
just store it in the Table type itself and overload the `getmetadata`.
This should avoid metadata bloat in the global store.

16 months ago Restructure ArrowTypes so it can be registered as its own package (#163)
Jacob Quinn [Fri, 2 Apr 2021 16:47:05 +0000 (10:47 -0600)] 
Restructure ArrowTypes so it can be registered as its own package (#163)

* Restructure ArrowTypes so it can be registered as its own package

This will facilitate dependants wishing to overload ArrowTypes interface
functions for their custom types. This is basically a "free" dependency
they can take on to avoid all the extra dependencies that Arrow.jl
itself has.

* Run ArrowTypes tests when we do CI for Arrow.jl

* fix

* fix

16 months ago DataAPI methods (#164)
Jacob Quinn [Fri, 2 Apr 2021 16:02:16 +0000 (10:02 -0600)] 
DataAPI methods (#164)

* Add refpool, refarray and levels for DictEncoded

* Apply tests to the correct variable name.

Co-authored-by: Douglas Bates <dmbates@gmail.com>
16 months ago Add refpool, refarray and levels for DictEncoded (#161)
Douglas Bates [Thu, 1 Apr 2021 22:13:50 +0000 (17:13 -0500)] 
Add refpool, refarray and levels for DictEncoded (#161)

16 months ago Tweak promoteunion to always avoid abstract types (#162)
Jacob Quinn [Thu, 1 Apr 2021 21:55:58 +0000 (15:55 -0600)] 
Tweak promoteunion to always avoid abstract types (#162)

This makes the 2nd `concretecheck` unnecessary in the `ToArrow`
constructor, but we'll leave it for now as a type of assert. This came
from a discussion with @iamed on slack, where it was pointed out that
values always have a concrete type at runtime (yay Julia!), so we should
_always_ be able to get a `Union{...}` of concrete types. This could
potentially get crazy if someone has, like, thousands of unique concrete
types in a single array, but I guess we'll cross that bridge when we
come to it.
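The observation that runtime values always have concrete types can be shown in a few lines of Base Julia. `concrete_union` is a hypothetical name for this sketch and is not the package's `promoteunion`.

```julia
# Build a Union of the concrete runtime types found in a non-empty collection.
concrete_union(xs) = mapreduce(typeof, (A, B) -> Union{A, B}, xs)

mixed = Any[1, 2.5, "three", 4]
T = concrete_union(mixed)

# Every member of the Union is concrete even though eltype(mixed) is Any.
all_concrete = all(isconcretetype, Base.uniontypes(T))
```

With thousands of distinct element types the resulting `Union` would grow accordingly, which is the edge case the message defers.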

16 months ago Allow avoiding the feather file format when writing arrow data to file name; that...
Jacob Quinn [Thu, 1 Apr 2021 16:18:27 +0000 (10:18 -0600)] 
Allow avoiding the feather file format when writing arrow data to file name; that means allowing passing  when providing a filename::String to Arrow.write

16 months ago Bump version v1.3.0
Jacob Quinn [Mon, 29 Mar 2021 13:23:06 +0000 (07:23 -0600)] 
Bump version

16 months ago Overhaul type serialization/deserialization machinery (#156)
Jacob Quinn [Mon, 29 Mar 2021 13:22:44 +0000 (07:22 -0600)] 
Overhaul type serialization/deserialization machinery (#156)

* Start work on overhauling type serialization architecture

* More work; serialization is pretty much done but not tested

* fix timetype ArrowTypes definitions

* more work to get tests passing

* get tests passing?

* fix

* Fix #75 by supporting Set serialization/deserialization

* Fix #85 by supporting tuple serialization/deserialization

* Lots of cleanup

* few more fixes

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* fix NullKind reading

* Fix #134 by requiring concrete or union of concrete element types for
all columns when serializing

* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata that can be used in JuliaType

* Update manual

* tests

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
16 months ago Better handle errors when something goes wrong writing partitions (#154)
Jacob Quinn [Thu, 18 Mar 2021 06:22:40 +0000 (00:22 -0600)] 
Better handle errors when something goes wrong writing partitions (#154)

* Better handle errors when something goes wrong writing partitions

Follow up to #108. There were actually a few different issues all coming
together in @kjanosz's comment. The first being that `ZipFile.Reader`
doesn't like non-main threads touching its files _at all_. The Arrow.jl
problem there is when processing non-first partitions, we were just
`Threads.@spawn`-ing a task which then went off and sometimes became the
proverbial unheard tree falling in the forest. Like really, no one heard
these tasks and their poor exceptions.

The solution there is to be better shepherds of our spawned tasks; we
introduce an `anyerror` atomic bool that threads can set if they run into
issues and then in the main partition handling loop we'll check that. If
a thread ran into something, we'll log out the thread-specific
exception, then throw a generic writing error. This prevents the
"hanging" behavior people were seeing because the felled threads were
actually causing later tasks to hang on `put!`-ing into the
OrderedChannel.

After addressing this, we have the multithreaded ZipFile issue. With the
new `ntasks` keyword, it seems to make sense to me that if you pass
`ntasks=1`, you're really saying, I don't want any concurrency. So
anywhere we were `Threads.@spawn`ing, we now check if `ntasks == 1` and
if so, do an `@async` instead.

* Fix
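The error-handling pattern described in this commit, a shared atomic flag plus a serial fallback when `ntasks == 1`, might look roughly like the sketch below. `run_partitions` and `work` are hypothetical names, and the real code writes through an ordered channel rather than simply waiting on tasks.

```julia
# Run `work(i)` for each partition index and surface failures to the caller
# instead of letting spawned tasks fail silently.
function run_partitions(work, n::Integer; ntasks::Integer = Threads.nthreads())
    anyerror = Threads.Atomic{Bool}(false)
    errors = Channel{Any}(n)   # buffered so failing tasks never block on put!
    tasks = map(1:n) do i
        function body()
            try
                work(i)
            catch err
                anyerror[] = true
                put!(errors, (i, err))
            end
        end
        # ntasks == 1 is read as "no multithreading", so stay on this thread.
        ntasks == 1 ? (@async body()) : (Threads.@spawn body())
    end
    foreach(wait, tasks)
    close(errors)
    if anyerror[]
        for (i, err) in errors
            @error "partition failed" partition=i exception=err
        end
        error("fatal error writing arrow data")
    end
    return nothing
end
```

The key design point is that every task catches its own exception and records it, so the coordinating loop can log the specific failure and then throw one generic error.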

16 months ago Add ntasks keyword to limit # of tasks allowed to write at a time (#106)
Jacob Quinn [Thu, 18 Mar 2021 04:35:12 +0000 (22:35 -0600)] 
Add ntasks keyword to limit # of tasks allowed to write at a time (#106)

* Add ntasks keyword to limit # of tasks allowed to write at a time

* fix

16 months ago add unexported tobuffer utility for interactive testing/development (#153)
Jarrett Revels [Wed, 17 Mar 2021 23:11:21 +0000 (19:11 -0400)] 
add unexported tobuffer utility for interactive testing/development (#153)

16 months ago revert setting Arrow.write debug message threshold to -1 (#152)
Jarrett Revels [Sun, 14 Mar 2021 03:32:19 +0000 (22:32 -0500)] 
revert setting Arrow.write debug message threshold to -1 (#152)

16 months ago Ensure serializing Arrow.DictEncoded writes dictionary messages (#149)
Jacob Quinn [Fri, 12 Mar 2021 16:07:04 +0000 (09:07 -0700)] 
Ensure serializing Arrow.DictEncoded writes dictionary messages (#149)

Fixes #126. The issue here was when `Arrow.write` was faced with the
task of serializing an `Arrow.DictEncoded`. For most arrow array types,
if the input array is already an arrow array type, it's a no-op (e.g. if
you're writing out an `Arrow.Table`). The problem comes from
`Arrow.DictEncoded`, where there is still no conversion required, but we
do need to make a note of the dict encoded column to ensure a dictionary
message is written before the record batch. In addition, we also add
some code for handling delta dictionary messages if required from
multiple record batches that contain `Arrow.DictEncoded`s, which is a
valid use-case where you may have multiple arrow files, with the same
schema, that you wish to serialize as a single arrow file w/ each file
as a separate record batch.

Slightly unrelated, but there's also a fix here in our use of Lockable.
We actually had a race condition I ran into once where the locking was
on the Lockable object, but inside the locked region, we replaced the
entire Lockable instead of the _contents_ of the Lockable. This meant
anyone who started waiting on the Lockable's lock didn't see updates
when unlocked because the entire Lockable had been updated.
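The race described in the second paragraph can be reproduced in miniature. The `Lockable` type below is a hypothetical stand-in written for this sketch, and both update functions are illustrative names rather than code from the package.

```julia
# Hypothetical stand-in for a Lockable: a value guarded by its own lock.
mutable struct Lockable{T}
    value::T
    lock::ReentrantLock
end
Lockable(value) = Lockable(value, ReentrantLock())

# Racy pattern: lock the current Lockable, then swap in a brand-new one.
# Tasks already waiting on the old lock keep looking at the old object.
function racy_update!(holder::Ref, k, v)
    old = holder[]
    lock(old.lock) do
        holder[] = Lockable(merge(old.value, Dict(k => v)))
    end
end

# Fixed pattern: keep the same Lockable and mutate only its contents.
function safe_update!(holder::Ref, k, v)
    l = holder[]
    lock(l.lock) do
        l.value[k] = v
    end
end
```

After `safe_update!`, anyone holding a reference to the original object sees the new entry; after `racy_update!`, they do not.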

16 months ago Ensure dict encoded index types match from record batch to record batch (#148)
Jacob Quinn [Fri, 12 Mar 2021 05:16:34 +0000 (22:16 -0700)] 
Ensure dict encoded index types match from record batch to record batch (#148)

Fixes #144. The core issue here was the initial record batch had a
dict-encoded column that ended up having an index type of Int8. However,
in a subsequent record batch, we use a different code path for dict
encoded columns because we need to check if a dictionary delta message
needs to be sent (i.e. there are new pooled values that need to be
serialized). The problem was in this code path, the index type was
computed from the total length of the input column instead of matching
what was already serialized in the initial schema message.

This does open up the question of another possible failure: if an
initial dict encoded column is serialized with an index type of Int8,
yet subsequent record batches end up including enough unique values that
this index type will be overflowed. I've added in an error check for
this case. Currently it's a fatal error that will stop the `Arrow.write`
process completely. I'm not quite sure what the best recommendation
would be in that case; ultimately the user needs to widen the
first record batch column index type, but perhaps we should allow
passing a dict-encoded index type to the overall `Arrow.write` function
so users can easily specify what that type should be.

The other change that had to be made in this PR is on the reading side,
since we're now tracking the index type in the DictEncoding type itself,
which probably not coincidentally is what the arrow-json struct already
does. For reading, we already have access to the dictionary field, so
it's just a matter of deserializing the index type before constructing
the DictEncoding struct.
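A minimal sketch of the index-type issue, assuming 0-based signed indices: the type is chosen once from the first batch and later batches are checked against it. `index_type` and `check_index_type` are hypothetical helpers, not functions from the package.

```julia
# Smallest signed integer type able to index `n` dictionary values (0-based).
function index_type(n::Integer)
    for T in (Int8, Int16, Int32, Int64)
        n - 1 <= typemax(T) && return T
    end
    throw(ArgumentError("too many dictionary values: $n"))
end

# Later record batches must keep the index type fixed by the first batch,
# so growing past its range is treated as a fatal error.
function check_index_type(::Type{T}, total_unique::Integer) where {T<:Signed}
    total_unique - 1 <= typemax(T) ||
        error("dictionary grew to $total_unique values, which overflows $T")
    return T
end
```

For example, a first batch with 100 unique values selects `Int8`, and a later batch that pushes the total to 300 fails the check instead of silently switching to `Int16`.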

16 months ago Introduce new `maxdepth` keyword argument for setting a limit on nesting (#147)
Jacob Quinn [Wed, 10 Mar 2021 23:40:26 +0000 (16:40 -0700)] 
Introduce new `maxdepth` keyword argument for setting a limit on nesting (#147)

level limit

Alternative fix for #143. This is a more general fix than just
specializing CategoricalArrays. This should prevent more general cases
of the same issue: i.e. someone accidentally passes a recursive data
structure and `Arrow.write` gets stuck trying to recursively serialize.
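The idea of a nesting limit can be sketched with a depth counter that is threaded through the recursion. `nesting_depth` is a hypothetical helper, and the default of 6 is an arbitrary value chosen for this example, not the package's default.

```julia
# Walk a nested value, refusing to descend past `maxdepth` levels.
function nesting_depth(x; maxdepth::Integer = 6, depth::Integer = 1)
    depth > maxdepth &&
        error("reached nesting level $depth, deeper than maxdepth = $maxdepth")
    x isa Union{AbstractVector,Tuple,NamedTuple} || return depth
    isempty(x) && return depth
    return maximum(nesting_depth(el; maxdepth = maxdepth, depth = depth + 1) for el in x)
end

# A self-referential vector would recurse forever without the limit.
cyclic = Any[]
push!(cyclic, cyclic)
```

Calling `nesting_depth(cyclic)` throws once the limit is exceeded rather than looping indefinitely.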

17 months ago implement Base.IteratorSize for Stream, fixes #141 (#142)
Damien Drix [Fri, 5 Mar 2021 06:06:42 +0000 (06:06 +0000)] 
implement Base.IteratorSize for Stream, fixes #141 (#142)

18 months ago fix accidental invocation of _unsafe_load_tuple (#124) v1.2.4
Jarrett Revels [Fri, 5 Feb 2021 04:52:00 +0000 (23:52 -0500)] 
fix accidental invocation of _unsafe_load_tuple (#124)

* fix accidental invocation of _unsafe_load_tuple

* bump Project.toml

18 months ago Bump version v1.2.3
Jacob Quinn [Thu, 4 Feb 2021 06:51:37 +0000 (23:51 -0700)] 
Bump version

18 months ago Use pool length in signed int conversion (#122)
Douglas Bates [Thu, 4 Feb 2021 06:51:14 +0000 (00:51 -0600)] 
Use pool length in signed int conversion (#122)

18 months ago Bump version v1.2.2
Jacob Quinn [Sun, 31 Jan 2021 06:20:11 +0000 (23:20 -0700)] 
Bump version

18 months ago Rework dict encoding of PooledArray/CategoricalArray to fix outstandi… (#119)
Jacob Quinn [Sun, 31 Jan 2021 06:18:44 +0000 (23:18 -0700)] 
Rework dict encoding of PooledArray/CategoricalArray to fix outstandi… (#119)

* Rework dict encoding of PooledArray/CategoricalArray to fix outstanding issues

Fixes #117, #116, and #113. For #116, we just need to special case if user happens to pass in a DictEncoded themselves. We need to pass it through to the `toarrowvector` method that no-ops. For #113, we require the new functionality in PooledArrays that allows passing the `signed` and `compress` keyword arguments to ensure we get signed refs for our dict encoding. For #117, we add CategoricalArrays as a test dependency and ensure that if it contains any `missing` value, we *don't* recode the indices values down by 1, since the `missing` ref is 0, so other refs can already be considered "offsets". If there are no `missing`, then we still need to recode down since refs should always start from 0 in arrow format.

* PooledArrays 1.0 compat

* Update src/arraytypes/dictencoding.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
* Check refpool

* Fix test

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
18 months ago Make compressed writing threadsafe (#118)
Jacob Quinn [Sat, 30 Jan 2021 07:04:34 +0000 (00:04 -0700)] 
Make compressed writing threadsafe (#118)

Fixes #82. The problem when writing arrow data with multiple threads and compression was that there was only a single compressor object, which every thread tried to use simultaneously. This PR ensures each thread uses its own compressor object.

18 months ago Bump version v1.2.1
Jacob Quinn [Mon, 25 Jan 2021 18:39:49 +0000 (11:39 -0700)] 
Bump version

18 months ago Fix copy on DictEncode (#111)
Jacob Quinn [Mon, 25 Jan 2021 18:39:08 +0000 (11:39 -0700)] 
Fix copy on DictEncode (#111)

Fixes #102. The issue comes up because DataFrames constructor tries to
make a copy of input columns by default when constructing; for
DictEncode, it's just a wrapper to signal that a column should be
copied, so we just make a shallow copy.

18 months ago Don't use ChainedVector as DictEncoding data array unless necessary (#110)
Jacob Quinn [Sat, 23 Jan 2021 04:27:09 +0000 (21:27 -0700)] 
Don't use ChainedVector as DictEncoding data array unless necessary (#110)

Fixes #109. The issue here was when reading arrow record batches with
dict encoded columns, we eagerly used `ChainedVector` for the underlying
array backing the `DictEncoding` in case there were subsequent
record batches that added additional elements to the dict encoding. This
is too eager though, since it's probably common, as with "feather"
files, that the dict encoding values are all known and provided in
the first record batch. In fact, several language implementations don't
even support this kind of "delta" dict update in subsequent record
batches. This PR, therefore, uses a regular array for the dict encoding
backing for the first record batch, and only promotes to a ChainedVector
if we happen to get a delta update.

18 months ago bump Project.toml to v1.2.0 (#107) v1.2.0
Jarrett Revels [Tue, 19 Jan 2021 18:12:53 +0000 (13:12 -0500)] 
bump Project.toml to v1.2.0 (#107)

18 months ago add isbitstype optimized path for FixedSizeList getindex (#104)
Jarrett Revels [Tue, 12 Jan 2021 06:29:34 +0000 (01:29 -0500)] 
add isbitstype optimized path for FixedSizeList getindex (#104)

* add isbitstype optimized path for FixedSizeList getindex

* reuse _unsafe_cast strategy from #103

* rebase + DRY _unsafe_cast

18 months ago change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary (...
Jarrett Revels [Mon, 11 Jan 2021 22:51:20 +0000 (17:51 -0500)] 
change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary (#103)

* change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary

* fix tests

* optimize UInt128 <-> NTuple{16,UInt8} casting
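For readers unfamiliar with the mapping, a `UUID` wraps a `UInt128`, which can be split into 16 bytes and reassembled. The sketch below uses plain shifts and is not the optimized cast mentioned above; the helper names and the least-significant-byte-first order are assumptions made for illustration.

```julia
using UUIDs

# Split a UInt128 into 16 bytes (least-significant byte first) and back.
to_bytes(u::UInt128) = ntuple(i -> (u >> (8 * (i - 1))) % UInt8, 16)
from_bytes(b::NTuple{16,UInt8}) =
    foldl((acc, i) -> acc | (UInt128(b[i]) << (8 * (i - 1))), 1:16; init = UInt128(0))

id = UUID("123e4567-e89b-12d3-a456-426614174000")
bytes = to_bytes(id.value)          # NTuple{16,UInt8}
roundtrip = UUID(from_bytes(bytes))
```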

Co-authored-by: SimonDanisch <sdanisch@protonmail.com>
19 months ago Add missing license
Jacob Quinn [Thu, 7 Jan 2021 05:28:13 +0000 (22:28 -0700)] 
Add missing license

19 months ago Admin cleanup
Jacob Quinn [Thu, 7 Jan 2021 05:10:39 +0000 (22:10 -0700)] 
Admin cleanup

19 months ago Add BitIntegers compat (#100) v1.1.0
Jacob Quinn [Wed, 6 Jan 2021 04:40:57 +0000 (21:40 -0700)] 
Add BitIntegers compat (#100)

19 months ago bump Project.toml to v1.1.0 (#94)
Jarrett Revels [Tue, 5 Jan 2021 23:59:32 +0000 (18:59 -0500)] 
bump Project.toml to v1.1.0 (#94)

19 months ago Fix copy on DictEncoding arrays with missing values (#99)
Jacob Quinn [Tue, 5 Jan 2021 23:59:19 +0000 (16:59 -0700)] 
Fix copy on DictEncoding arrays with missing values (#99)

The copy code for DictEncoding erroneously tried to special-case
`missing` when copying, which is unnecessary since `missing` is just
treated like any other regular ref value. By removing the special-cased
branch of code, we avoid treating it differently and the code copies the
`DictEncoding` array as a `PooledArray` as expected.

19 months ago Update make.jl (#97)
Eric Hanson [Tue, 5 Jan 2021 23:51:06 +0000 (00:51 +0100)] 
Update make.jl (#97)

19 months ago convert Arrow-flavored eltypes to Julia-flavored eltypes on copy (#98)
Jarrett Revels [Tue, 5 Jan 2021 23:30:27 +0000 (18:30 -0500)] 
convert Arrow-flavored eltypes to Julia-flavored eltypes on copy (#98)

* convert Arrow-flavored eltypes to Julia-flavored eltypes on copy

* Update src/arraytypes/primitive.jl

Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
19 months ago Add warning for `Arrow.ArrowTypes.registertype!` (#96)
Eric Hanson [Tue, 5 Jan 2021 05:43:27 +0000 (06:43 +0100)] 
Add warning for `Arrow.ArrowTypes.registertype!` (#96)

19 months ago add default UUID <-> UInt128 Arrow type mapping (#89)
Jarrett Revels [Wed, 30 Dec 2020 06:13:44 +0000 (01:13 -0500)] 
add default UUID <-> UInt128 Arrow type mapping (#89)

* add default UUID <-> UInt128 Arrow type mapping

* add UUIDs to non-standard types table test state

* add test for deprecation path for old UUID autoconversion

19 months ago add `ArrowTypes.default` methods and tests for dates (#86)
Eric Hanson [Thu, 24 Dec 2020 06:03:29 +0000 (07:03 +0100)] 
add `ArrowTypes.default` methods and tests for dates (#86)

19 months ago Support new Decimal256 type (#79)
Jacob Quinn [Wed, 16 Dec 2020 21:02:37 +0000 (14:02 -0700)] 
Support new Decimal256 type (#79)

* Support new Decimal256 type

* Fix tests, update manual

19 months ago Update README & ci
Jacob Quinn [Sat, 12 Dec 2020 06:31:43 +0000 (23:31 -0700)] 
Update README & ci

20 months ago Bump version v1.0.3
Jacob Quinn [Wed, 2 Dec 2020 06:22:20 +0000 (23:22 -0700)] 
Bump version

20 months ago Fix Union type deserialization (#77)
Jacob Quinn [Wed, 2 Dec 2020 06:21:51 +0000 (23:21 -0700)] 
Fix Union type deserialization (#77)

* Fix Union type deserialization

Fixes #76. I think this was just a relic of old code from early work on
the package, but `juliaeltype` is meant to return "julia" types, not
arrow types.

* remove unintended change

20 months ago Bump version v1.0.2
Jacob Quinn [Sat, 28 Nov 2020 20:19:28 +0000 (13:19 -0700)] 
Bump version

20 months ago Finish support for automatic custom struct deserialization (#73)
Jacob Quinn [Sat, 28 Nov 2020 20:18:27 +0000 (13:18 -0700)] 
Finish support for automatic custom struct deserialization (#73)

* Finish support for automatic custom struct deserialization

As pointed out on a slack post, we were supporting automatic custom
struct _serialization_, but not deserialization; the custom structs were
just deserialized as `NamedTuple`s. In this PR, I propose using the
custom extension type machinery to ensure custom structs can be
deserialized. Currently this will all happen automatically for the
user, but I'd like to update the documentation around how users should
approach using arrow for custom types, because they _should_ get in the
habit of calling `ArrowTypes.registertype!` to ensure their serialized
custom struct can always be deserialized. For example, if a user
serializes and deserializes a custom struct in the same Julia session,
it will currently "just work", but if the custom struct column is
serialized in a session, then deserialized in a new session, where the
type hasn't been defined or registered, the column will be deserialized
as `NamedTuple`.

* remove unnecessary definition

* Switch CI to github actions and update docs for custom structs

* Update CI
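The fallback behaviour described in the first item of this commit (rebuild the struct when the type is known, otherwise return a `NamedTuple`) can be sketched with a plain dictionary as the registry. Everything here is hypothetical: `register!`, `deserialize`, and `Point` are invented for the example and do not mirror the `ArrowTypes.registertype!` API.

```julia
# Hypothetical registry mapping a serialized type name to a Julia type.
register!(registry::Dict{Symbol,DataType}, name::Symbol, T::DataType) =
    (registry[name] = T)

struct Point
    x::Float64
    y::Float64
end

# Rebuild the registered struct when known, else fall back to the NamedTuple.
function deserialize(registry::Dict{Symbol,DataType}, name::Symbol, fields::NamedTuple)
    T = get(registry, name, nothing)
    return T === nothing ? fields : T(values(fields)...)
end

registry = Dict{Symbol,DataType}()
unregistered = deserialize(registry, :Point, (x = 1.0, y = 2.0))
register!(registry, :Point, Point)
registered = deserialize(registry, :Point, (x = 1.0, y = 2.0))
```

In a fresh session where the type was never registered, only the first branch is available, which matches the `NamedTuple` result the message warns about.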

20 months ago Reword error message when input file doesn't exist; fixes #71
Jacob Quinn [Sun, 22 Nov 2020 05:50:38 +0000 (22:50 -0700)] 
Reword error message when input file doesn't exist; fixes #71

20 months ago Add custom extension type nullability test
Jacob Quinn [Fri, 20 Nov 2020 04:38:15 +0000 (21:38 -0700)] 
Add custom extension type nullability test

20 months ago Bump version v1.0.1
Jacob Quinn [Thu, 19 Nov 2020 17:37:34 +0000 (10:37 -0700)] 
Bump version

20 months ago Check field nullability for custom extension types (#69)
Jacob Quinn [Thu, 19 Nov 2020 17:36:57 +0000 (10:36 -0700)] 
Check field nullability for custom extension types (#69)

For custom extension types (currently automatically supported for `Char`
and `Symbol` types), we were failing to take into account whether the
field was nullable or not; this led to the case where a column might be
`['a', missing]`, but when deserializing, the column type was just
`Char` instead of `Union{Char, Missing}`. The fix is to enhance the
`ArrowTypes.extensiontype` function to also take the `field` argument
and check the nullability before returning.
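The nullability check amounts to widening the element type with `Missing` when the field allows nulls. The sketch below uses a hypothetical `Field` struct and `julia_eltype` function; it is not the signature of `ArrowTypes.extensiontype`.

```julia
# Hypothetical stand-in for an Arrow field: only the nullable flag matters here.
struct Field
    name::String
    nullable::Bool
end

# Return the Julia element type for logical type `T`, honoring nullability.
julia_eltype(::Type{T}, field::Field) where {T} =
    field.nullable ? Union{T,Missing} : T

nullable_chars = julia_eltype(Char, Field("c", true))
plain_chars = julia_eltype(Char, Field("c", false))
```

A column such as `['a', missing]` then gets the element type `Union{Char, Missing}` instead of `Char`.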

20 months ago Update Project.toml v1.0.0
Jacob Quinn [Thu, 19 Nov 2020 06:51:53 +0000 (23:51 -0700)] 
Update Project.toml

20 months ago Update README.md
Jacob Quinn [Thu, 19 Nov 2020 06:51:19 +0000 (23:51 -0700)] 
Update README.md

20 months ago Auto-convert DateTime to arrow Timestamp instead of millisecond Date; left over bug...
Jacob Quinn [Thu, 19 Nov 2020 06:50:09 +0000 (23:50 -0700)] 
Auto-convert DateTime to arrow Timestamp instead of millisecond Date; left over bug from when we switched to fully supporting Timestamps (#66)

20 months ago Add validity check for columns with different lengths; fixes #60 (#65)
Jacob Quinn [Thu, 19 Nov 2020 06:22:47 +0000 (23:22 -0700)] 
Add validity check for columns with different lengths; fixes #60 (#65)

20 months ago fixed docs link (#64)
ExpandingMan [Wed, 18 Nov 2020 06:25:25 +0000 (01:25 -0500)] 
fixed docs link (#64)