Simeon Schaub [Mon, 7 Jun 2021 20:18:53 +0000 (22:18 +0200)]
Rename LICENSE.md to LICENSE (#195)
Curtis Vogt [Mon, 7 Jun 2021 19:54:57 +0000 (14:54 -0500)]
Test expected log records (#204)
* Test/suppress error log in the issue #126 test
* Test misc warnings
Jarrett Revels [Mon, 31 May 2021 18:51:50 +0000 (14:51 -0400)]
bump Project.toml from v1.4.1 to v1.5.0 (#208)
Curtis Vogt [Mon, 31 May 2021 17:17:25 +0000 (12:17 -0500)]
Support `VersionNumber` (#205)
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Curtis Vogt [Mon, 31 May 2021 16:33:27 +0000 (11:33 -0500)]
Handle empty tuple (#201)
Pietro Vertechi [Mon, 31 May 2021 16:31:14 +0000 (18:31 +0200)]
support Date with type parameter ms (#207)
Jacob Quinn [Sat, 24 Apr 2021 03:26:48 +0000 (21:26 -0600)]
Add global metadata lock to ensure thread safety of global metadata (#183)
* Add global metadata lock to ensure thread safety of global metadata store
Follow up to #90, based on discussions in that issue.
* fix
Jacob Quinn [Fri, 23 Apr 2021 02:19:47 +0000 (20:19 -0600)]
Ensure requested List type is respected on List getindex (#182)
* Ensure requested List type is respected on List getindex
Fixes #167. Not tested yet.
* add test
Tanmay Mohapatra [Fri, 23 Apr 2021 00:34:00 +0000 (06:04 +0530)]
ability to append partitions to existing arrow files (#160)
* ability to append partitions to an arrow file
This adds a method to `append` partitions to existing arrow files. Partitions to append are supplied in the form of any [Tables.jl](https://github.com/JuliaData/Tables.jl)-compatible table.
Multiple record batches will be written based on the number of `Tables.partitions(tbl)` that are provided.
Each partition being appended must have the same `Tables.Schema` as the destination arrow file that is being appended to.
Other parameters that `append` accepts are similar to what `write` accepts.
* remove unused methods
* add more tests and some fixes
* allow appends to both seekable IO and files
* a few changes to Stream; avoid duplication for append
store few additional stream properties in the `Stream` data type and avoid duplicating code for append functionality
* call Tables.schema on result of Tables.columns
Jarrett Revels [Fri, 23 Apr 2021 00:28:14 +0000 (20:28 -0400)]
fix propagation of maxdepth kwarg (#181)
* fix propagation of maxdepth kwarg
* bump Project.toml
Jacob Quinn [Fri, 16 Apr 2021 18:37:50 +0000 (12:37 -0600)]
Bump version
Jacob Quinn [Thu, 15 Apr 2021 15:17:53 +0000 (09:17 -0600)]
Fix case when ipc stream has no record batches, only schema (#175)
* Fix case when ipc stream has no record batches, only schema
Fixes #158. While the Julia implementation currently doesn't provide a
way to avoid writing any record batches, the pyarrow implementation has
more fine-grained control over writing and allows closing an ipc stream
without writing any record batches. In that case, on the Julia side when
reading, we just need to check for this case specifically and if so,
populate some empty columns, since we're currently relying on them being
populated when record batches are read.
* fix metadata
Jacob Quinn [Thu, 15 Apr 2021 15:17:37 +0000 (09:17 -0600)]
Fix slight perf hit when checking validity bitmap (#176)
Fixes #131. Dip me in mustard and call me a hotdog, cuz I can't tell
how/why `divrem(i - 1, 8) .+ (1, 1)` ends up being ~30% faster than
`fldmod1(i, 8)`. It'd probably be worth looking into it more, but it
works for now. The divrem code is @expandingman 's code from his
arrow/feather code.
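For reference, the two indexing expressions compute the same result; a quick sketch (function names here are illustrative, not Arrow.jl's):

```julia
# Both expressions map a 1-based element index `i` to a 1-based
# (byte, bit) position within a validity bitmap; the commit swaps the
# second form for the first purely for performance.
bytebit_divrem(i)  = divrem(i - 1, 8) .+ (1, 1)
bytebit_fldmod1(i) = fldmod1(i, 8)

bytebit_divrem(10)  # (2, 2): element 10 lives at bit 2 of byte 2
```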
Étienne Tétreault-Pinard [Wed, 14 Apr 2021 15:57:55 +0000 (11:57 -0400)]
fix () -> {} typo (#174)
Jacob Quinn [Wed, 14 Apr 2021 14:35:56 +0000 (08:35 -0600)]
Introduce Arrow.ToTimestamp for performant ZonedDateTime encoding (#173)
Fixes #95 by allowing users to wrap `ZonedDateTime` columns in
`Arrow.ToTimestamp`, which will allow the writing process to skip costly
check/conversion by assuming each element has the same timezone;
`ToTimestamp` uses the timezone of the first element.
Jacob Quinn [Tue, 13 Apr 2021 14:56:27 +0000 (08:56 -0600)]
Warn when converting Arrow.Timestamps to Dates.DateTime or ZonedDateTime (#172)
Fixes #166. The problem OP saw in the original issue was that we didn't
have a proper `ArrowTypes.fromarrow` method defined for `Dates.DateTime`
from `Arrow.Timestamp` with nanosecond precision, which is accurate in
one sense because `Dates.DateTime` only supports up to millisecond
precision. But better than just erroring when trying to access these
values later, we now do the conversion anyway, which may be lossy, and
issue a warning about the potentially lossy conversion. If > millisecond
precision is needed, then users should pass `convert=false` and operate
on the `Arrow.Timestamp` values directly for now.
Eric Hanson [Tue, 13 Apr 2021 04:21:29 +0000 (05:21 +0100)]
use actual deprecation (#171)
Jarrett Revels [Wed, 7 Apr 2021 05:55:57 +0000 (01:55 -0400)]
add missing setmetadata! method for Arrow.Table (#170)
Jarrett Revels [Wed, 7 Apr 2021 05:54:16 +0000 (01:54 -0400)]
document guarantee that `getmetadata` returns alias not copy (#169)
Jacob Quinn [Sun, 4 Apr 2021 03:24:10 +0000 (21:24 -0600)]
Don't store table metadata globally (#165)
* Don't store table metadata globally
Fixes #90. There's no need to store table metadata globally when we can
just store it in the Table type itself and overload the `getmetadata`.
This should avoid metadata bloat in the global store.
Jacob Quinn [Fri, 2 Apr 2021 16:47:05 +0000 (10:47 -0600)]
Restructure ArrowTypes so it can be registered as its own package (#163)
* Restructure ArrowTypes so it can be registered as its own package
This will facilitate dependents wishing to overload ArrowTypes interface
functions for their custom types. This is basically a "free" dependency
they can take on to avoid all the extra dependencies that Arrow.jl
itself has.
* Run ArrowTypes tests when we do CI for Arrow.jl
* fix
* fix
Jacob Quinn [Fri, 2 Apr 2021 16:02:16 +0000 (10:02 -0600)]
DataAPI methods (#164)
* Add refpool, refarray and levels for DictEncoded
* Apply tests to the correct variable name.
Co-authored-by: Douglas Bates <dmbates@gmail.com>
Douglas Bates [Thu, 1 Apr 2021 22:13:50 +0000 (17:13 -0500)]
Add refpool, refarray and levels for DictEncoded (#161)
Jacob Quinn [Thu, 1 Apr 2021 21:55:58 +0000 (15:55 -0600)]
Tweak promoteunion to always avoid abstract types (#162)
This makes the 2nd `concretecheck` unnecessary in the `ToArrow`
constructor, but we'll leave it for now as a type of assert. This came
from a discussion with @iamed on slack, where it was pointed out that
values always have a concrete type at runtime (yay Julia!), so we should
_always_ be able to get a `Union{...}` of concrete types. This could
potentially get crazy if someone has, like, thousands of unique concrete
types in a single array, but I guess we'll cross that bridge when we
come to it.
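The idea can be sketched as follows; `promoteunion` and `eltype_for` here are hypothetical simplifications, not the package's actual definitions:

```julia
# Since every runtime value has a concrete type, we can accumulate a
# `Union` of the observed concrete types instead of letting
# `promote_type` widen to an abstract supertype.
promoteunion(T, S) = Union{T, S}

eltype_for(values) = mapreduce(typeof, promoteunion, values)

eltype_for(Any[1, 2.5, missing])  # == Union{Int, Float64, Missing}
```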
Jacob Quinn [Thu, 1 Apr 2021 16:18:27 +0000 (10:18 -0600)]
Allow avoiding the feather file format when writing arrow data to a file, i.e. when providing a filename::String to Arrow.write
Jacob Quinn [Mon, 29 Mar 2021 13:23:06 +0000 (07:23 -0600)]
Bump version
Jacob Quinn [Mon, 29 Mar 2021 13:22:44 +0000 (07:22 -0600)]
Overhaul type serialization/deserialization machinery (#156)
* Start work on overhauling type serialization architecture
* More work; serialization is pretty much done but not tested
* fix timetype ArrowTypes definitions
* more work to get tests passing
* get tests passing?
* fix
* Fix #75 by supporting Set serialization/deserialization
* Fix #85 by supporting tuple serialization/deserialization
* Lots of cleanup
* few more fixes
* Update src/arrowtypes.jl
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* Update src/arrowtypes.jl
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* fix NullKind reading
* Fix #134 by requiring concrete or union of concrete element types for
all columns when serializing
* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata that can be used in JuliaType
* Update manual
* tests
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Jacob Quinn [Thu, 18 Mar 2021 06:22:40 +0000 (00:22 -0600)]
Better handle errors when something goes wrong writing partitions (#154)
* Better handle errors when something goes wrong writing partitions
Follow up to #108. There were actually a few different issues all coming
together in @kjanosz's comment. The first being that `ZipFile.Reader`
doesn't like non-main threads touching its files _at all_. The Arrow.jl
problem there is when processing non-first partitions, we were just
`Threads.@spawn`-ing a task which then went off and sometimes became the
proverbial unheard tree falling in the forest. Like really, no one heard
these tasks and their poor exceptions.
The solution there is to be better shepherds of our spawned tasks; we
introduce an `anyerror` atomic bool that threads can set if they run into
issues and then in the main partition handling loop we'll check that. If
a thread ran into something, we'll log out the thread-specific
exception, then throw a generic writing error. This prevents the
"hanging" behavior people were seeing because the felled threads were
actually causing later tasks to hang on `put!`-ing into the
OrderedChannel.
After addressing this, we have the multithreaded ZipFile issue. With the
new `ntasks` keyword, it seems to make sense to me that if you pass
`ntasks=1`, you're really saying, I don't want any concurrency. So
anywhere we were `Threads.@spawn`ing, we now check if `ntasks == 1` and
if so, do an `@async` instead.
* Fix
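A minimal sketch of that `ntasks == 1` dispatch (names are illustrative, not Arrow.jl's internals):

```julia
# With `ntasks == 1` the caller has asked for no concurrency, so run
# the work as a plain `@async` task on the current thread; otherwise
# hand it to the threaded scheduler.
runtask(f, ntasks) = ntasks == 1 ? @async(f()) : Threads.@spawn(f())

fetch(runtask(() -> 1 + 1, 1))  # 2, computed without Threads.@spawn
```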
Jacob Quinn [Thu, 18 Mar 2021 04:35:12 +0000 (22:35 -0600)]
Add ntasks keyword to limit # of tasks allowed to write at a time (#106)
* Add ntasks keyword to limit # of tasks allowed to write at a time
* fix
Jarrett Revels [Wed, 17 Mar 2021 23:11:21 +0000 (19:11 -0400)]
add unexported tobuffer utility for interactive testing/development (#153)
Jarrett Revels [Sun, 14 Mar 2021 03:32:19 +0000 (22:32 -0500)]
revert setting Arrow.write debug message threshold to -1 (#152)
Jacob Quinn [Fri, 12 Mar 2021 16:07:04 +0000 (09:07 -0700)]
Ensure serializing Arrow.DictEncoded writes dictionary messages (#149)
Fixes #126. The issue here was when `Arrow.write` was faced with the
task of serializing an `Arrow.DictEncoded`. For most arrow array types,
if the input array is already an arrow array type, it's a no-op (e.g. if
you're writing out an `Arrow.Table`). The problem comes from
`Arrow.DictEncoded`, where there is still no conversion required, but we
do need to make a note of the dict encoded column to ensure a dictionary
message is written before the record batch. In addition, we also add
some code for handling delta dictionary messages if required from
multiple record batches that contain `Arrow.DictEncoded`s, which is a
valid use-case where you may have multiple arrow files, with the same
schema, that you wish to serialize as a single arrow file w/ each file
as a separate record batch.
Slightly unrelated, but there's also a fix here in our use of Lockable.
We actually had a race condition I ran into once where the locking was
on the Lockable object, but inside the locked region, we replaced the
entire Lockable instead of the _contents_ of the Lockable. This meant
anyone who started waiting on the Lockable's lock didn't see updates
when unlocked because the entire Lockable had been updated.
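A minimal sketch of the pattern and the race (this `Lockable` is a stand-in, not Arrow.jl's actual type):

```julia
struct Lockable{T}
    value::T
    lock::ReentrantLock
end
Lockable(v) = Lockable(v, ReentrantLock())

# Correct: mutate the *contents* while holding the lock, so any task
# waiting on `l.lock` sees the update once it acquires the lock.
function update!(l::Lockable, k, v)
    lock(l.lock) do
        l.value[k] = v
    end
    return l
end

# The race described above amounted to building a brand-new Lockable
# and rebinding it inside the locked region instead, so tasks blocked
# on the *old* lock never observed the new contents.
```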
Jacob Quinn [Fri, 12 Mar 2021 05:16:34 +0000 (22:16 -0700)]
Ensure dict encoded index types match from record batch to record batch (#148)
Fixes #144. The core issue here was the initial record batch had a
dict-encoded column that ended up having an index type of Int8. However,
in a subsequent record batch, we use a different code path for dict
encoded columns because we need to check if a dictionary delta message
needs to be sent (i.e. there are new pooled values that need to be
serialized). The problem was in this code path, the index type was
computed from the total length of the input column instead of matching
what was already serialized in the initial schema message.
This does open up the question of another possible failure: if an
initial dict encoded column is serialized with an index type of Int8,
yet subsequent record batches end up including enough unique values that
this index type will be overflowed. I've added in an error check for
this case. Currently it's a fatal error that will stop the `Arrow.write`
process completely. I'm not quite sure what the best recommendation
would be in that case; ultimately the user needs to widen the
first record batch column index type, but perhaps we should allow
passing a dict-encoded index type to the overall `Arrow.write` function
so users can easily specify what that type should be.
The other change that had to be made in this PR is on the reading side,
since we're now tracking the index type in the DictEncoding type itself,
which probably not coincidentally is what the arrow-json struct already
does. For reading, we already have access to the dictionary field, so
it's just a matter of deserializing the index type before constructing
the DictEncoding struct.
Jacob Quinn [Wed, 10 Mar 2021 23:40:26 +0000 (16:40 -0700)]
Introduce new `maxdepth` keyword argument for setting a nesting level limit (#147)
Alternative fix for #143. This is a more general fix than just
specializing CategoricalArrays. This should prevent more general cases
of the same issue: i.e. someone accidentally passes a recursive data
structure and `Arrow.write` gets stuck trying to recursively serialize.
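The guard can be sketched like this (a simplified stand-in for the real serializer recursion):

```julia
# Thread a depth counter through the type recursion and error once it
# exceeds `maxdepth`, instead of recursing forever on self-referential
# structures.
function nesting(::Type{T}, depth::Int=0; maxdepth::Int=6) where {T}
    depth > maxdepth && error("reached maximum nesting depth of $maxdepth")
    T <: AbstractVector || return depth
    return nesting(eltype(T), depth + 1; maxdepth=maxdepth)
end

nesting(Vector{Vector{Int}})  # 2
```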
Damien Drix [Fri, 5 Mar 2021 06:06:42 +0000 (06:06 +0000)]
implement Base.IteratorSize for Stream, fixes #141 (#142)
Jarrett Revels [Fri, 5 Feb 2021 04:52:00 +0000 (23:52 -0500)]
fix accidental invocation of _unsafe_load_tuple (#124)
* fix accidental invocation of _unsafe_load_tuple
* bump Project.toml
Jacob Quinn [Thu, 4 Feb 2021 06:51:37 +0000 (23:51 -0700)]
Bump version
Douglas Bates [Thu, 4 Feb 2021 06:51:14 +0000 (00:51 -0600)]
Use pool length in signed int conversion (#122)
Jacob Quinn [Sun, 31 Jan 2021 06:20:11 +0000 (23:20 -0700)]
Bump version
Jacob Quinn [Sun, 31 Jan 2021 06:18:44 +0000 (23:18 -0700)]
Rework dict encoding of PooledArray/CategoricalArray to fix outstandi… (#119)
* Rework dict encoding of PooledArray/CategoricalArray to fix outstanding issues
Fixes #117, #116, and #113. For #116, we just need to special-case when a user happens to pass in a DictEncoded themselves; we pass it through to the `toarrowvector` method that no-ops. For #113, we require the new functionality in PooledArrays that allows passing the `signed` and `compress` keyword arguments to ensure we get signed refs for our dict encoding. For #117, we add CategoricalArrays as a test dependency and ensure that if a column contains any `missing` value, we *don't* recode the indices values down by 1, since the `missing` ref is 0, so other refs can already be considered "offsets". If there are no `missing`s, then we still need to recode down since refs should always start from 0 in arrow format.
* PooledArrays 1.0 compat
* Update src/arraytypes/dictencoding.jl
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
* Check refpool
* Fix test
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
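The recode rule described for #117 can be sketched as (illustrative only, not the package's code):

```julia
# CategoricalArray refs are 1-based, with ref 0 reserved for `missing`;
# arrow dict indices are 0-based. With no missings present, shift every
# ref down by one; with missings present, the refs already behave as
# 0-based offsets and are left alone.
arrowindices(refs, hasmissing) = hasmissing ? refs : refs .- 1

arrowindices([1, 3, 2], false)  # [0, 2, 1]
```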
Jacob Quinn [Sat, 30 Jan 2021 07:04:34 +0000 (00:04 -0700)]
Make compressed writing threadsafe (#118)
Fixes #82. The problem when trying to write arrow using multiple threads and compression was that there was only a single compressor object that each thread was simultaneously trying to use. This PR ensures there is a compressor object per thread that will be used per thread.
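The shape of the fix, sketched with a stand-in compressor type (Arrow.jl's actual objects are LZ4/Zstd codec instances):

```julia
struct Compressor          # stand-in for an LZ4/Zstd compressor object
    buf::Vector{UInt8}     # mutable scratch state that must not be shared
end

# One compressor per thread, so concurrent writer tasks never touch the
# same mutable compression state.
const COMPRESSORS = [Compressor(UInt8[]) for _ in 1:Threads.nthreads()]

getcompressor() = COMPRESSORS[Threads.threadid()]
```

(With task migration on newer Julia versions, indexing by `threadid` needs extra care, but this matches the approach this commit describes.)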
Jacob Quinn [Mon, 25 Jan 2021 18:39:49 +0000 (11:39 -0700)]
Bump version
Jacob Quinn [Mon, 25 Jan 2021 18:39:08 +0000 (11:39 -0700)]
Fix copy on DictEncode (#111)
Fixes #102. The issue comes up because DataFrames constructor tries to
make a copy of input columns by default when constructing; for
DictEncode, it's just a wrapper to signal that a column should be
copied, so we just make a shallow copy.
Jacob Quinn [Sat, 23 Jan 2021 04:27:09 +0000 (21:27 -0700)]
Don't use ChainedVector as DictEncoding data array unless necessary (#110)
Fixes #109. The issue here was when reading arrow record batches with
dict encoded columns, we eagerly used `ChainedVector` for the underlying
array backing the `DictEncoding` in case there were subsequent
record batches that added additional elements to the dict encoding. This
is too eager though, since it's probably common, like for "feather"
files, where the dict encoding values are always known and provided in
the first record batch. In fact, several language implementations don't
even support these kinds of "delta" dict updates in subsequent record
batches. This PR, therefore, uses a regular array for the dict encoding
backing for the first record batch, and only promotes to a ChainedVector
if we happen to get a delta update.
Jarrett Revels [Tue, 19 Jan 2021 18:12:53 +0000 (13:12 -0500)]
bump Project.toml to v1.2.0 (#107)
Jarrett Revels [Tue, 12 Jan 2021 06:29:34 +0000 (01:29 -0500)]
add isbitstype optimized path for FixedSizeList getindex (#104)
* add isbitstype optimized path for FixedSizeList getindex
* reuse _unsafe_cast strategy from #103
* rebase + DRY _unsafe_cast
Jarrett Revels [Mon, 11 Jan 2021 22:51:20 +0000 (17:51 -0500)]
change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary (#103)
* change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary
* fix tests
* optimize UInt128 <-> NTuple{16,UInt8} casting
Co-authored-by: SimonDanisch <sdanisch@protonmail.com>
Co-authored-by: SimonDanisch <sdanisch@protonmail.com>
Jacob Quinn [Thu, 7 Jan 2021 05:28:13 +0000 (22:28 -0700)]
Add missing license
Jacob Quinn [Thu, 7 Jan 2021 05:10:39 +0000 (22:10 -0700)]
Admin cleanup
Jacob Quinn [Wed, 6 Jan 2021 04:40:57 +0000 (21:40 -0700)]
Add BitIntegers compat (#100)
Jarrett Revels [Tue, 5 Jan 2021 23:59:32 +0000 (18:59 -0500)]
bump Project.toml to v1.1.0 (#94)
Jacob Quinn [Tue, 5 Jan 2021 23:59:19 +0000 (16:59 -0700)]
Fix copy on DictEncoding arrays with missing values (#99)
The copy code for DictEncoding erroneously tried to special-case
`missing` when copying, which is unnecessary since `missing` is just
treated like any other regular ref value. By removing the special-cased
branch of code, we avoid treating it differently and the code copies the
`DictEncoding` array as a `PooledArray` as expected.
Eric Hanson [Tue, 5 Jan 2021 23:51:06 +0000 (00:51 +0100)]
Update make.jl (#97)
Jarrett Revels [Tue, 5 Jan 2021 23:30:27 +0000 (18:30 -0500)]
convert Arrow-flavored eltypes to Julia-flavored eltypes on copy (#98)
* convert Arrow-flavored eltypes to Julia-flavored eltypes on copy
* Update src/arraytypes/primitive.jl
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Eric Hanson [Tue, 5 Jan 2021 05:43:27 +0000 (06:43 +0100)]
Add warning for `Arrow.ArrowTypes.registertype!` (#96)
Jarrett Revels [Wed, 30 Dec 2020 06:13:44 +0000 (01:13 -0500)]
add default UUID <-> UInt128 Arrow type mapping (#89)
* add default UUID <-> UInt128 Arrow type mapping
* add UUIDs to non-standard types table test state
* add test for deprecation path for old UUID autoconversion
Eric Hanson [Thu, 24 Dec 2020 06:03:29 +0000 (07:03 +0100)]
add `ArrowTypes.default` methods and tests for dates (#86)
Jacob Quinn [Wed, 16 Dec 2020 21:02:37 +0000 (14:02 -0700)]
Support new Decimal256 type (#79)
* Support new Decimal256 type
* Fix tests, update manual
Jacob Quinn [Sat, 12 Dec 2020 06:31:43 +0000 (23:31 -0700)]
Update README & ci
Jacob Quinn [Wed, 2 Dec 2020 06:22:20 +0000 (23:22 -0700)]
Bump version
Jacob Quinn [Wed, 2 Dec 2020 06:21:51 +0000 (23:21 -0700)]
Fix Union type deserialization (#77)
* Fix Union type deserialization
Fixes #76. I think this was just a relic of old code from early work on
the package, but `juliaeltype` is meant to return "julia" types, not
arrow types.
* remove unintended change
Jacob Quinn [Sat, 28 Nov 2020 20:19:28 +0000 (13:19 -0700)]
Bump version
Jacob Quinn [Sat, 28 Nov 2020 20:18:27 +0000 (13:18 -0700)]
Finish support for automatic custom struct deserialization (#73)
* Finish support for automatic custom struct deserialization
As pointed out on a slack post, we were supporting automatic custom
struct _serialization_, but not deserialization; the custom structs were
just deserialized as `NamedTuple`s. In this PR, I propose using the
custom extension type machinery to ensure custom structs can be
deserialized. Currently this will all happen automatically for the
user, but I'd like to update the documentation around how users should
approach using arrow for custom types, because they _should_ get in the
habit of calling `ArrowTypes.registertype!` to ensure their serialized
custom struct can always be deserialized. For example, if a user
serializes and deserializes a custom struct in the same Julia session,
it will currently "just work", but if the custom struct column is
serialized in a session, then deserialized in a new session, where the
type hasn't been defined or registered, the column will be deserialized
as `NamedTuple`.
* remove unnecessary definition
* Switch CI to github actions and update docs for custom structs
* Update CI
Jacob Quinn [Sun, 22 Nov 2020 05:50:38 +0000 (22:50 -0700)]
Reword error message when input file doesn't exist; fixes #71
Jacob Quinn [Fri, 20 Nov 2020 04:38:15 +0000 (21:38 -0700)]
Add custom extension type nullability test
Jacob Quinn [Thu, 19 Nov 2020 17:37:34 +0000 (10:37 -0700)]
Bump version
Jacob Quinn [Thu, 19 Nov 2020 17:36:57 +0000 (10:36 -0700)]
Check field nullability for custom extension types (#69)
For custom extension types (currently automatically supported for `Char`
and `Symbol` types), we were failing to take into account whether the
field was nullable or not; this led to the case where a column might be
`['a', missing]`, but when deserializing, the column type was just
`Char` instead of `Union{Char, Missing}`. The fix is to enhance the
`ArrowTypes.extensiontype` function to also take the `field` argument
and check the nullability before returning.
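The gist of the fix (the `Field` type here is a hypothetical stand-in with just a `nullable` flag, not the actual flatbuffer Field):

```julia
struct Field
    nullable::Bool
end

# Widen the mapped Julia type with `Missing` whenever the arrow field
# is nullable, so a column like ['a', missing] round-trips as
# Union{Char, Missing} instead of plain Char.
extensiontype(::Type{T}, field::Field) where {T} =
    field.nullable ? Union{T, Missing} : T

extensiontype(Char, Field(true))  # Union{Char, Missing}
```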
Jacob Quinn [Thu, 19 Nov 2020 06:51:53 +0000 (23:51 -0700)]
Update Project.toml
Jacob Quinn [Thu, 19 Nov 2020 06:51:19 +0000 (23:51 -0700)]
Update README.md
Jacob Quinn [Thu, 19 Nov 2020 06:50:09 +0000 (23:50 -0700)]
Auto-convert DateTime to arrow Timestamp instead of millisecond Date; left over bug from when we switched to fully supporting Timestamps (#66)
Jacob Quinn [Thu, 19 Nov 2020 06:22:47 +0000 (23:22 -0700)]
Add validity check for columns with different lengths; fixes #60 (#65)
ExpandingMan [Wed, 18 Nov 2020 06:25:25 +0000 (01:25 -0500)]
fixed docs link (#64)
ExpandingMan [Mon, 16 Nov 2020 23:19:44 +0000 (18:19 -0500)]
get documentation going (#62)
* get documentation going
* fixed site name
Jacob Quinn [Tue, 10 Nov 2020 06:08:54 +0000 (23:08 -0700)]
Create TagBot.yml
Jacob Quinn [Wed, 4 Nov 2020 00:23:59 +0000 (17:23 -0700)]
Bump version
Jacob Quinn [Wed, 4 Nov 2020 00:23:25 +0000 (17:23 -0700)]
Several fixes for writing large Arrow tables (#57)
* Switch FlatBuffer Array type to use Int64 for byte position
Fixes #56. The issue was the underlying flatbuffer array type was using
an `UInt32` for the byte position where the flatbuffer array was
located; in super large files, this overflowed. As this Array field is a
Julia-side controlled type, we can easily switch to an Int64 to avoid
this issue altogether.
* A few optimizations when writing
The bitpack encoding algorithm was allocating, which caused very large
tables to slow down considerably with so much memory recycling.
Rewriting it to avoid allocations leads to drastically fewer allocations
and much faster writing performance. For non-optimized array writing, we
also switch to writing to a buffer first to avoid hitting the global IO
lock too much, which can also hurt performance on large files.
* A few fixes for dictionary encoding writing
Just a few minor cleanups to ensure dictionary encoding types are
consistent, and that variable names work correctly between writing
first-time dictionary encodings and deltas.
* a few fixes
Jacob Quinn [Wed, 28 Oct 2020 15:03:29 +0000 (09:03 -0600)]
Add TimeZones compat
Jacob Quinn [Wed, 28 Oct 2020 14:44:25 +0000 (08:44 -0600)]
Bump version
Jacob Quinn [Wed, 28 Oct 2020 07:11:20 +0000 (01:11 -0600)]
Compute and set our PooledArray ref type manually
Should fix #52. The issue here is a little tricky, and I'm not sure
we're 100% doing things correctly yet. Part of the problem, however, is
that when a user requests a column to be `DictEncode`-ed, we were just
calling `PooledArray(x)` to do the pooling for us, but PooledArrays uses
unsigned integer types as the ref type by default. The arrow format
encourages the use of *signed* integers, however, so when we were
serializing the *type* of the dict-encoded indices, we serialized the
*signed* version, even if the indices were unsigned. That's bad because
the indices may have "fit" in the unsigned type domain (like UInt8), but
not been valid in the signed domain (like 129 for Int8). So one change
made here is that we don't auto-convert to the signed type; we just use
whatever the indices' type is. The second change, to try and follow what
the arrow format encourages, is we'll compute our own ref type using the
`encodingtype` function, which produces a signed integer type, and pass
that to `PooledArray`.
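In the spirit of the `encodingtype` function mentioned above (the real implementation may differ), computing the smallest signed ref type might look like:

```julia
# Pick the narrowest signed integer type whose positive range can hold
# `n` pooled values, per the arrow format's preference for signed
# dict-encoding indices.
function encodingtype(n::Integer)
    for T in (Int8, Int16, Int32)
        n <= typemax(T) && return T
    end
    return Int64
end

encodingtype(129)  # Int16: 129 doesn't fit Int8's positive range (0..127)
```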
Jacob Quinn [Wed, 28 Oct 2020 07:08:25 +0000 (01:08 -0600)]
Check if input file is valid before reading arrow data
Fixes #49.
Jacob Quinn [Wed, 28 Oct 2020 06:49:08 +0000 (00:49 -0600)]
Ensure any AbstractString is serialized correctly
Fixes #53. The issue here is we had a couple of spots where `String` was
hard-coded instead of checking if a type was `<: AbstractString`. This
led to the original issue in #53 where `SubString{String}` was
serialized as "binary" instead of as a string.
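The distinction behind the fix, in a hypothetical dispatch helper (`arrowkind` is not Arrow.jl's actual function name):

```julia
# Check the abstract supertype, not the concrete `String` type, so
# wrappers like `SubString{String}` take the string path instead of
# falling through to the binary one.
arrowkind(::Type{T}) where {T} = T <: AbstractString ? :utf8 : :binary

arrowkind(SubString{String})  # :utf8, where the bug produced :binary
```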
Jacob Quinn [Wed, 28 Oct 2020 06:35:37 +0000 (00:35 -0600)]
Fix travis docs job
Jacob Quinn [Wed, 28 Oct 2020 06:31:19 +0000 (00:31 -0600)]
Docs update (#54)
* Big round of doc updates
* docs infra
* fix travis
Jacob Quinn [Mon, 26 Oct 2020 15:21:02 +0000 (09:21 -0600)]
Add TimeZones dependency for auto-converting Timestamp (#50)
Implements #17. We now convert all Timestamp objects to ZonedDateTime
via auto-conversion (can be turned off by `convert=false`). The one
piece of awkwardness here is that with a `Vector{ZonedDateTime}`, it is
currently assumed that each element will have the same timezone.
Probably ok in practice, but frankly it'd be nicer if
there was a `ZonedDateTime` type that could be parameterized by the
timezone itself so that could be enforced via type parameter.
Jacob Quinn [Fri, 23 Oct 2020 20:06:31 +0000 (14:06 -0600)]
Prevent infinite loop while writing CategoricalArrays (#44)
Supporting CategoricalArrays is a bit tricky. Ideally, we could use the
same interface as PooledArrays and just rely on `DataAPI.refarray` and
`DataAPI.refpool`, but alas, a `CategoricalArray` returns a
`CategoricalRefPool` from `DataAPI.refpool`, with `CategoricalValue`
elements. The core issue comes when we try to serialize the pool, which
is a recursive process: recursive until we reach a known "leaf" type, at
which point the recursion stops. Unfortunately, a `CategoricalValue`
isn't a known leaf type, so it's treated as a `StructType`, where each
of its fields are serialized. One of the fields is the
`CategoricalRefPool`, so we get stuck in a never-ending recursive loop
serializing `CategoricalValue`s and `CategoricalRefPool`s.
This PR proposes a quick hack where we check the `DataType` name for
`:CategoricalRefPool` or `:CategoricalArray` and if so, just unwrap the
values so the recursion will be broken. It's obviously a little hacky,
but also avoids taking on the always-problematic CategoricalArrays
dependency.
Jacob Quinn [Fri, 23 Oct 2020 06:58:06 +0000 (00:58 -0600)]
Use threaded tasks to read record batches in parallel for Arrow.Table (#47)
Jacob Quinn [Fri, 23 Oct 2020 04:20:07 +0000 (22:20 -0600)]
Pass "ArrowVector" through arrowvector calls (#46)
If users construct `ArrowVector` types themselves, or read, then write,
we can be a bit more efficient by not making copies.
Jacob Quinn [Thu, 22 Oct 2020 13:34:04 +0000 (07:34 -0600)]
Refactor nested dict encoding and isdelta dictionary batch support (#43)
Fixes #32 among other issues. Turns out the probably-rare-in-practice
data race mentioned in #32 was the least of the worries. While digging
into things, I realized we weren't doing isDelta dictionary batches
right at all. In particular, we were basically writing each record
batch/dictionary batch independent of each other, but using the same
dictionary batch ids. We didn't have tests failures because we weren't
testing isDelta batches anyway :P.
In this PR, everything is cleaned up quite a bit. We now generate a
dictionary encoding id based on the column index, nesting level, and
field index (in the case of structs and unions). This allows us to
re-use the same constructed dict encodings from batch to batch, and also
allows us to support the use-case of re-using a single dict encoding
across multiple columns if so desired (user would just pass in their own
`DictEncode` column pointing to the same id). We also avoid race
conditions by putting a lock around dict encodings so different threads
writing will have to take turns.
TheCedarPrince [Sun, 18 Oct 2020 03:59:11 +0000 (23:59 -0400)]
Added a Note about Large Numbers of Columns (#42)
* Added information about large numbers of columns
* Update README.md
Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
Jacob Quinn [Wed, 14 Oct 2020 19:32:50 +0000 (13:32 -0600)]
Introduce BoolVector for Bool column types (#40)
Fixes #38. From back in the original feather days, I remember that bool
columns were always bitpacked. Unfortunately, the arrow spec doesn't
really point this bitpacking out very obviously (it's mentioned in
passing as a possibility). This PR introduces a new BoolVector type and
corresponding `ArrowTypes.BoolType` that ensures Bool columns will be
written bitpacked, and read similarly.
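A minimal sketch of the bitpacking (LSB-first, as the arrow format lays out boolean and validity buffers):

```julia
# Pack a Bool column into bytes, least-significant bit first, padding
# the final byte with zeros.
function bitpack(bools::AbstractVector{Bool})
    bytes = zeros(UInt8, cld(length(bools), 8))
    for (i, b) in enumerate(bools)
        byte, bit = fldmod1(i, 8)           # 1-based byte and bit position
        b && (bytes[byte] |= UInt8(1) << (bit - 1))
    end
    return bytes
end

bitpack([true, false, true])  # [0x05], i.e. 0b00000101
```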
Jacob Quinn [Wed, 14 Oct 2020 14:19:17 +0000 (08:19 -0600)]
Rearrange code by array type (#39)
Code for the various array types was a little all over the place. This
PR consolidates arraytype-specific code into new per-arraytype files.
This also simplifies a few things I noticed reviewing the code: removing
unnecessary utils, stop passing eltypes to arraytype-specific
`arrowvector` methods, reuse `ToList` for `MapType` and get rid of
`ToMap`, and clarify a few of the ArrowTypes interfaces. This should
simplify things a bit for adding support for bitpacked Bool arrays and
supporting CategoricalArrays.
Jacob Quinn [Thu, 8 Oct 2020 04:12:06 +0000 (22:12 -0600)]
Switch license to apache-2 in preparation for code donation (#36)
Jacob Quinn [Tue, 6 Oct 2020 15:50:58 +0000 (09:50 -0600)]
Allow specifying custom alignment for buffer writing padding (#35)
* Allow specifying custom alignment for buffer writing padding
Implements #31. Pretty easy, just have to thread the new `alignment`
keyword down everywhere.
* Update docs and add test
Jacob Quinn [Tue, 6 Oct 2020 07:20:40 +0000 (01:20 -0600)]
Allow passing custom lz4 & zstd compressors to Arrow.write (#34)
* Allow passing custom lz4 & zstd compressors to Arrow.write
* update docs
John Myles White [Mon, 5 Oct 2020 02:45:19 +0000 (22:45 -0400)]
Typo fix in README (#33)
Jacob Quinn [Sat, 3 Oct 2020 22:05:29 +0000 (16:05 -0600)]
compat bounds for codecs
Jacob Quinn [Sat, 3 Oct 2020 21:38:29 +0000 (15:38 -0600)]
Fix docs
Jacob Quinn [Sat, 3 Oct 2020 21:37:18 +0000 (15:37 -0600)]
Try to fix travis badge again
Jacob Quinn [Sat, 3 Oct 2020 21:36:21 +0000 (15:36 -0600)]
Update travis badge
Jacob Quinn [Sat, 3 Oct 2020 21:34:41 +0000 (15:34 -0600)]
Update docs