incubator-hivemall.git
Makoto Yui [Thu, 22 Apr 2021 14:53:10 +0000 (23:53 +0900)] 
Implement Korean text tokenizer

## What changes were proposed in this pull request?

Implement Korean text tokenizer

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-307

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
-- show version of lucene-analyzers-nori
select tokenize_ko();
> 8.8.2

select tokenize_ko("소설 무궁화꽃이 피었습니다.");
> ["소설","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
> ["소설","무궁화","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
> ["소설","무궁","화","꽃","이"]

select tokenize_ko("Hello, world.", null, "none", array(), true);
> ["h","e","l","l","o","w","o","r","l","d"]

select tokenize_ko("Hello, world.", null, "none", array(), false);
> ["hello","world"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #237 from myui/korean_tokenizer.

Makoto Yui [Thu, 22 Apr 2021 12:24:33 +0000 (21:24 +0900)] 
Fixed gitbook build

## What changes were proposed in this pull request?

Fixed gitbook build

## What type of PR is it?

Documentation

Author: Makoto Yui <myui@apache.org>

Closes #236 from myui/fix_gitbook.

Makoto Yui [Thu, 22 Apr 2021 03:39:21 +0000 (12:39 +0900)] 
[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary

## What changes were proposed in this pull request?

Add a `tokenize_ja_neologd` UDF that uses the Neologd dictionary for Kuromoji tokenization.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-305

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
-- Signature: tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)

select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #235 from myui/neologd.

Makoto Yui [Mon, 19 Apr 2021 06:39:03 +0000 (15:39 +0900)] 
[HIVEMALL-304] Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)

## What changes were proposed in this pull request?

Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-304

## How was this patch tested?

unit tests

## How to use this feature?

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #234 from myui/lucene_version_up.

Makoto Yui [Thu, 15 Apr 2021 05:29:23 +0000 (14:29 +0900)] 
[HIVEMALL-303] Changed compilation target to Java 8

## What changes were proposed in this pull request?

Change compilation target to Java 8 from Java 7.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-303

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #233 from myui/HIVEMALL-303-java8.

Makoto Yui [Mon, 29 Mar 2021 07:42:58 +0000 (16:42 +0900)] 
[HIVEMALL-301] Remove macros and replace them with UDF

## What changes were proposed in this pull request?

Remove macros and replace them with UDF

## What type of PR is it?

Improvement, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-301

## How was this patch tested?

manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #232 from myui/HIVEMALL-301-tfidf.

Makoto Yui [Thu, 27 Aug 2020 09:06:58 +0000 (18:06 +0900)] 
Revised bagging doc entry

Makoto Yui [Fri, 21 Aug 2020 09:54:15 +0000 (18:54 +0900)] 
Trivial doc fix

Makoto Yui [Fri, 21 Aug 2020 06:21:45 +0000 (15:21 +0900)] 
Added user guide entry for bagging classifiers

Makoto Yui [Thu, 6 Aug 2020 07:05:37 +0000 (16:05 +0900)] 
[HIVEMALL-297] Fixed null element handling in feature vector

## What changes were proposed in this pull request?

Fixed null element handling in feature vector

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-297

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #231 from myui/HIVEMALL-297.

Makoto Yui [Mon, 27 Jul 2020 06:36:48 +0000 (15:36 +0900)] 
[HIVEMALL-296][BUGFIX] Fixed corner case NPE bug when count is zero

## What changes were proposed in this pull request?

Fixed corner case NPE bug when count is zero.

```
Caused by: java.lang.NullPointerException
at hivemall.GeneralLearnerBaseUDTF.forwardModel(GeneralLearnerBaseUDTF.java:763)
at hivemall.GeneralLearnerBaseUDTF.close(GeneralLearnerBaseUDTF.java:560)
at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
```

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-296

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #230 from myui/HIVEMALL-296.

Makoto Yui [Thu, 4 Jun 2020 05:21:05 +0000 (14:21 +0900)] 
[HIVEMALL-295][BUGFIX] transpose_and_dot throws UDFArgumentException for 0 rows input

## What changes were proposed in this pull request?

Fixed `transpose_and_dot` throwing UDFArgumentException for zero-row input. The query and stack trace below reproduce the failure.

```
WITH INPUT AS(
  SELECT
    ARRAY(1.0,2.0,3.0) AS X,
    ARRAY(0,1) AS Y
)
SELECT
  transpose_and_dot(Y,X) AS observed
FROM
  INPUT
WHERE false

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.exec.UDFArgumentException
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1126)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:464)
        ... 15 more
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at hivemall.utils.lang.Preconditions.checkNotNull(Preconditions.java:43)
        at hivemall.tools.matrix.TransposeAndDotUDAF$TransposeAndDotUDAFEvaluator.iterate(TransposeAndDotUDAF.java:172)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:192)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1117)
        ... 21 more
```

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-295

## How was this patch tested?

manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #229 from myui/HIVEMALL-295.

Makoto Yui [Fri, 29 May 2020 07:42:14 +0000 (16:42 +0900)] 
[HIVEMALL-294] Fix XGboost to report progress report for each iteration

## What changes were proposed in this pull request?

Fix XGBoost to report progress for each iteration.

## What type of PR is it?

Improvement, Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-294

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #228 from myui/HIVEMALL-294.

Makoto Yui [Tue, 3 Mar 2020 06:44:04 +0000 (15:44 +0900)] 
[HIVEMALL-291] Fixed dedup behavior of to_ordered_list UDAF

## What changes were proposed in this pull request?

Fixed dedup behavior of to_ordered_list UDAF

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-291

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #227 from myui/HIVEMALL-291-2.

Makoto Yui [Thu, 23 Jan 2020 10:20:55 +0000 (19:20 +0900)] 
[HIVEMALL-291] Support deduplication in to_ordered_list UDAF

## What changes were proposed in this pull request?

Add a `-dedup` option to `to_ordered_list`.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-291

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
WITH data as (
    SELECT 5 as key, 'apple' as value
    UNION ALL
    SELECT 3 as key, 'banana' as value
    UNION ALL
    SELECT 4 as key, 'candy' as value
    UNION ALL
    SELECT 1 as key, 'donut' as value
    UNION ALL
    SELECT 2 as key, 'egg' as value
    UNION ALL
    SELECT 4 as key, 'candy' as value -- both key and value duplicates
)
select
  to_ordered_list(value, key, '-k 4 -dedup -vk_map'),
  to_ordered_list(value, key, '-k 4 -vk_map'),
  to_ordered_list(value, key, '-k 4 -dedup'),
  to_ordered_list(value, key, '-k 4')
from
  data
```

> {"apple":5,"candy":4,"banana":3,"egg":2}        {"apple":5,"candy":4,"banana":3}        ["apple","candy","banana","egg"]        ["apple","candy","candy","banana"]

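The four results above can be reproduced with a small reference model. This is only a sketch of the observed behavior, not Hivemall's implementation: the function name `to_ordered_list_model` and its keyword arguments are hypothetical, and it assumes that `-dedup` drops rows whose key and value are both repeated and that rows are ranked by key in descending order.

```python
def to_ordered_list_model(pairs, k, dedup=False, vk_map=False):
    """Toy model of the top-k cases shown above (hypothetical helper).

    `pairs` is a list of (key, value) rows, ranked by key, largest first.
    """
    rows = list(pairs)
    if dedup:
        # Assumption: drop rows whose (key, value) pair was already seen.
        rows = list(dict.fromkeys(rows))
    # sorted() is stable, so rows with equal keys keep their input order.
    top = sorted(rows, key=lambda kv: kv[0], reverse=True)[:k]
    if vk_map:
        # A map holds one entry per value, so repeated values collapse here.
        return {value: key for key, value in top}
    return [value for _, value in top]


data = [(5, "apple"), (3, "banana"), (4, "candy"),
        (1, "donut"), (2, "egg"), (4, "candy")]

print(to_ordered_list_model(data, 4, dedup=True, vk_map=True))
print(to_ordered_list_model(data, 4, vk_map=True))
print(to_ordered_list_model(data, 4, dedup=True))
print(to_ordered_list_model(data, 4))
```

Without dedup, the duplicated `candy` row takes one of the four slots, which is why the map variant ends up with only three entries.
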
## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #226 from myui/HIVEMALL-291.

Makoto Yui [Tue, 21 Jan 2020 09:25:37 +0000 (18:25 +0900)] 
Fixed docs

Makoto Yui [Thu, 26 Dec 2019 07:48:02 +0000 (16:48 +0900)] 
[HIVEMALL-289] Add str_contains(string str, array<string> match, boolean or=true) UDF

## What changes were proposed in this pull request?

Add `str_contains(string str, array<string> match, boolean or=true)` UDF

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-289

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
select
  str_contains('There are apple and orange', array('apple')),
  str_contains('There are apple and orange', array('apple', 'banana'), true),
  str_contains('There are apple and orange', array('apple', 'banana'), false);
> true, true, false
```
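
The three results are consistent with a plain substring test combined by OR or AND. A minimal sketch under that assumption follows; `str_contains_model` and `or_query` are illustrative names, not the UDF's actual code.

```python
def str_contains_model(text, patterns, or_query=True):
    """Toy model: OR mode needs any pattern to occur, AND mode needs all."""
    hits = [pattern in text for pattern in patterns]
    return any(hits) if or_query else all(hits)


sentence = "There are apple and orange"
print(str_contains_model(sentence, ["apple"]))
print(str_contains_model(sentence, ["apple", "banana"], True))
print(str_contains_model(sentence, ["apple", "banana"], False))
```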

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #225 from myui/HIVEMALL-289.

Makoto Yui [Wed, 25 Dec 2019 09:30:06 +0000 (18:30 +0900)] 
Fixed docs for UDF preparation

Makoto Yui [Fri, 20 Dec 2019 15:45:22 +0000 (00:45 +0900)] 
Fixed a typo

Makoto Yui [Fri, 20 Dec 2019 13:05:12 +0000 (22:05 +0900)] 
Added ChangeLog

Makoto Yui [Thu, 19 Dec 2019 12:31:55 +0000 (21:31 +0900)] 
Replaced http with https and added verification procedure

Makoto Yui [Thu, 19 Dec 2019 10:18:04 +0000 (19:18 +0900)] 
Fixed links in doc

Makoto Yui [Thu, 19 Dec 2019 08:36:51 +0000 (17:36 +0900)] 
Updated download page

Makoto Yui [Thu, 19 Dec 2019 08:06:44 +0000 (17:06 +0900)] 
Merge remote-tracking branch 'origin/v0.6.0'

Makoto Yui [Thu, 19 Dec 2019 05:18:25 +0000 (14:18 +0900)] 
Updated copyrights holders

Makoto Yui [Thu, 12 Dec 2019 08:32:27 +0000 (17:32 +0900)] 
[HIVEMALL-288] mf_predict throws SemanticException No matching method with (array<double>, array<double>, int)

## What changes were proposed in this pull request?

Fixed `mf_predict` throwing SemanticException "No matching method" when called with `(array<double>, array<double>, int)`.

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-288

## How was this patch tested?

manual tests on EMR

```sql
select
  -- 3 arguments
  mf_predict(array(cast(1.0 as float),cast(2.0 as float),cast(3.0 as float)), array(cast(1.0 as float),cast(2.0 as float),cast(3.0 as float)), 1),
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 1),
  mf_predict(array(cast(1.0 as DOUBLE),cast(2.0 as DOUBLE),cast(3.0 as DOUBLE)), array(cast(1.0 as DOUBLE),cast(2.0 as DOUBLE),cast(3.0 as DOUBLE)), 1),
  -- 2 arguments
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0)),
  -- 4 arguments
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 0, 0),
  -- 5 arguments
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 0, 0, 1);
```

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #224 from myui/HIVEMALL-288.

Makoto Yui [Tue, 3 Dec 2019 06:21:35 +0000 (15:21 +0900)] 
Update date

Makoto Yui [Mon, 2 Dec 2019 10:25:54 +0000 (19:25 +0900)] 
[DOC] update titanic random forest doc for decision_path

Makoto Yui [Thu, 28 Nov 2019 18:26:51 +0000 (03:26 +0900)] 
Fixed release guide

Makoto Yui [Thu, 28 Nov 2019 16:43:53 +0000 (01:43 +0900)]
[maven-release-plugin] prepare for next development iteration (ref: v0.6.0)

Makoto Yui [Thu, 28 Nov 2019 16:43:43 +0000 (01:43 +0900)]
[maven-release-plugin] prepare release v0.6.0-rc1 (ref: v0.6.0-rc1)

Makoto Yui [Thu, 28 Nov 2019 16:41:45 +0000 (01:41 +0900)] 
Bumped version string to 0.6.0-incubating

Makoto Yui [Thu, 28 Nov 2019 07:46:02 +0000 (16:46 +0900)] 
Minor refactoring and fixed function docs

Makoto Yui [Thu, 28 Nov 2019 07:11:17 +0000 (16:11 +0900)] 
[HIVEMALL-159][DOC] Add documentation about One-hot encoding

## What changes were proposed in this pull request?

Add documentation about One-hot encoding

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-159

## How to use this feature?

See userguide

Author: Makoto Yui <myui@apache.org>

Closes #223 from myui/onehot_docs.

Makoto Yui [Wed, 27 Nov 2019 09:03:41 +0000 (18:03 +0900)] 
[HIVEMALL-56][DOC] Add documentation about Similarity/Distance functions

## What changes were proposed in this pull request?

Add documentation about Similarity/Distance functions

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-56

## Checklist

Author: Makoto Yui <myui@apache.org>

Closes #222 from myui/HIVEMALL-56.

Makoto Yui [Wed, 27 Nov 2019 07:42:34 +0000 (16:42 +0900)] 
[HIVEMALL-158][DOC] Refine deprecated userguide contents

## What changes were proposed in this pull request?

Refine deprecated userguide contents

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-158

Author: Makoto Yui <myui@apache.org>

Closes #221 from myui/HIVEMALL-158.

Makoto Yui [Wed, 27 Nov 2019 07:11:56 +0000 (16:11 +0900)] 
[HIVEMALL-285] Add -inspect_opts option to show hyperparameters

## What changes were proposed in this pull request?

Add `-inspect_opts` option to show hyperparameters

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-285

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
select train_regressor(array(), 0, '-inspect_opts -optimizer adam -reg elasticnet');

FAILED: UDFArgumentException Inspected Optimizer options ...
{disable_cvtest=false, regularization=ElasticNet, loss_function=SquaredLoss, eps=1.0E-8, decay=0.0, iterations=10, eta0=0.1, l1_ratio=0.5, lambda=1.0E-4, eta=Invscaling, optimizer=adam, beta1=0.9, beta2=0.999, alpha=1.0, cv_rate=0.005, power_t=0.1}
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #220 from myui/HIVEMALL-285.

Makoto Yui [Tue, 26 Nov 2019 06:43:09 +0000 (15:43 +0900)] 
Revised exception type

Makoto Yui [Tue, 26 Nov 2019 06:39:30 +0000 (15:39 +0900)] 
Minor refactoring

Makoto Yui [Tue, 26 Nov 2019 04:54:43 +0000 (13:54 +0900)] 
[HIVEMALL-283] Bump up netty version to 4.1.42.Final

## What changes were proposed in this pull request?

Bump up netty version to 4.1.42.Final

This closes #206 and closes #207

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-283

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #219 from myui/HIVEMALL-283.

Makoto Yui [Mon, 25 Nov 2019 18:58:42 +0000 (03:58 +0900)] 
[HIVEMALL-226] Move hivemall.fm and hivemall.mf packages to under hivemall.factorization

## What changes were proposed in this pull request?

Move hivemall.fm and hivemall.mf packages to under hivemall.factorization

## What type of PR is it?

Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-226

## How was this patch tested?

unit tests and manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #218 from myui/HIVEMALL-266.

Makoto Yui [Mon, 25 Nov 2019 17:05:56 +0000 (02:05 +0900)] 
Update javadoc and applied formatter

Makoto Yui [Mon, 25 Nov 2019 16:53:29 +0000 (01:53 +0900)] 
[HIVEMALL-165] Fixed to accept any primitive

## What changes were proposed in this pull request?

Fix a bug where the `array_remove` UDF throws an exception when the first argument is null.

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-165

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
WITH data4 as (
  select false as n, array(2.0, 3.0, 4.0) as nums
  union all
  select true as n, array(2.0, 3.0, 4.0) as nums
)
select
  array_remove(if(n = true, null, nums), 2.0) as c1,
  array_remove(if(n = true, null, nums), array(3.0,2.0)) as c2,
  array_remove(if(n = false, null, nums), 2.0) as c3
from
  data4;
> c1      c2      c3
> [3,4]   [4]     NULL
> NULL    NULL    [3,4]

select array_remove(array(2.0,2.1,3.0,4.0,2.0),2), array_remove(array(2.0,3.0,4.0),array(3,2.0));
> [2.1,3,4]       [4]

SELECT array_remove(array(1,null,3),null);
> [1,3]

SELECT array_remove(array(1,null,3,null,5),null);
> [1,3,5]

SELECT array_remove(array(1,null,3),array(null));
> [1,3]

SELECT array_remove(array('aaa','bbb'),'bbb');
> ["aaa"]

SELECT array_remove(array('aaa','bbb','ccc','bbb'), array('bbb','ccc'));
> ["aaa"]

select array_remove(array(null),null);
> []

select array_remove(array(null,'bbb'),'aaa');
> [null,"bbb"]
```
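
The cases above can be summarized by a short reference model. It is a sketch of the observed behavior only, with `array_remove_model` as a hypothetical name: a null array yields null, the second argument may be a single element or an array of elements, and numeric comparison ignores the int/double distinction (which Python's `==` happens to do as well).

```python
def array_remove_model(values, target):
    """Toy model of the array_remove cases listed above."""
    if values is None:
        # A null first argument now yields NULL instead of an exception.
        return None
    # Accept either one element or a list of elements to remove.
    targets = target if isinstance(target, list) else [target]
    return [v for v in values if v not in targets]


print(array_remove_model([2.0, 2.1, 3.0, 4.0, 2.0], 2))
print(array_remove_model([1, None, 3, None, 5], None))
print(array_remove_model(["aaa", "bbb", "ccc", "bbb"], ["bbb", "ccc"]))
print(array_remove_model(None, 2.0))
```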

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #217 from myui/HIVEMALL-165.

Makoto Yui [Mon, 25 Nov 2019 10:03:15 +0000 (19:03 +0900)] 
[HIVEMALL-121] Add -libsvm formatting option to feature_hashing UDF

## What changes were proposed in this pull request?

Add `-libsvm` formatting option for `feature_hashing`

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-121

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-libsvm');
> ["4063537:1.0","4063537:1","8459207:2.0"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10 -libsvm');
> ["1:2.0","7:1.0","7:1"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #216 from myui/HIVEMALL-121.

Makoto Yui [Mon, 25 Nov 2019 08:50:35 +0000 (17:50 +0900)] 
[HIVEMALL-249] Fix fmeasure UDAF to support any integers

## What changes were proposed in this pull request?

Fix fmeasure UDAF to support any integers

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-249

## How to use this feature?

```sql
create table data2 as
  select 1.1 as truth, 0 as predicted
union all
  select 0.0 as truth, 1 as predicted
union all
  select 0.0 as truth, 0 as predicted
union all
  select 1.0 as truth, 1 as predicted
union all
  select 0.0 as truth, 1 as predicted
union all
  select 0.0 as truth, 0 as predicted
;

select fmeasure(truth, predicted, '-average binary') from data2;
```
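
For reference, a binary F-measure over data of this shape can be sketched as follows. The sketch assumes beta = 1 and labels of exactly 0 or 1; the first truth value is written as 1 rather than the 1.1 used in the query, because the commit does not state how the UDAF treats non-integer labels. `binary_f1` is an illustrative name, not Hivemall code.

```python
def binary_f1(truth, predicted):
    """F1 for the positive class (label 1), assuming 0/1 labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = [1, 0, 0, 1, 0, 0]
predicted = [0, 1, 0, 1, 1, 0]
# One true positive, two false positives, one false negative.
print(binary_f1(truth, predicted))
```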

## How was this patch tested?

manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #215 from myui/HIVEMALL-249.

Makoto Yui [Fri, 22 Nov 2019 15:56:36 +0000 (00:56 +0900)] 
[HIVEMALL-276] Stable support for XGBoost v0.90

## What changes were proposed in this pull request?

- Fix xgboost module to create DMatrix from CSRMatrix
- Support xgboost v0.90 hyperparameters
- Replace xgboost4j with [xgboost-predictor](https://github.com/komiya-atsushi/xgboost-predictor-java) for prediction
- Add documentation about XGBoost

## What type of PR is it?

Refactoring, Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-276
https://issues.apache.org/jira/browse/HIVEMALL-275
https://issues.apache.org/jira/browse/HIVEMALL-279
https://issues.apache.org/jira/browse/HIVEMALL-272
https://issues.apache.org/jira/browse/HIVEMALL-27

## How to use this feature?

As described in the [user guide](http://hivemall.apache.org/userguide/index.html).

## How was this patch tested?

unit tests and manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #213 from myui/HIVEMALL-275-2.

Makoto Yui [Fri, 22 Nov 2019 14:17:11 +0000 (23:17 +0900)] 
[HIVEMALL-281] Support max_by, min_by, majority_vote UDAFs

## What changes were proposed in this pull request?

Support max_by, min_by, majority_vote UDAFs

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-281

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql

create table data1 as (
  select 'jake' as name, 18 as age
  union all
  select 'tom' as name, 64 as age
  union all
  select 'lisa' as name, 32 as age
);

select
  max_by(name, age) as max_name,
  min_by(name, age) as min_name
from
  data1;
> tom, jake

create table data2 as
  select
    explode(array('1', '2', '2', '2', '5', '4', '1', '2')) as k;

select
  majority_vote(k) as k
from
  data2;
> 2
```
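
The semantics of the three aggregates can be modeled in a few lines. This is a sketch that matches the results above, not the UDAF implementations: the `_model` function names are hypothetical, and tie-breaking is not specified by the commit, so the choice made here on ties is arbitrary.

```python
from collections import Counter


def max_by_model(rows):
    """Return the first field of the row whose second field is largest."""
    return max(rows, key=lambda row: row[1])[0]


def min_by_model(rows):
    """Return the first field of the row whose second field is smallest."""
    return min(rows, key=lambda row: row[1])[0]


def majority_vote_model(items):
    """Return the most frequent item."""
    return Counter(items).most_common(1)[0][0]


data1 = [("jake", 18), ("tom", 64), ("lisa", 32)]
data2 = ["1", "2", "2", "2", "5", "4", "1", "2"]

print(max_by_model(data1), min_by_model(data1))
print(majority_vote_model(data2))
```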

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #214 from myui/HIVEMALL-281.

Makoto Yui [Mon, 11 Nov 2019 05:38:54 +0000 (14:38 +0900)] 
[HOTFIX] bumped matrix4j version to 0.9.2

Makoto Yui [Fri, 1 Nov 2019 09:27:53 +0000 (18:27 +0900)] 
[HIVEMALL-278] Bumped matrix4j version to v0.9.1

## What changes were proposed in this pull request?

Bumped matrix4j version to v0.9.1, since matrix4j v0.9.0 had a bug when constructing a CSRMatrix from columns supplied in unsorted order.
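
The commit does not say what went wrong inside matrix4j. The sketch below only illustrates, in general terms, why column order can matter for a CSR matrix: if lookups binary-search the column indices of a row, the builder has to sort each row's entries first. `build_csr` and `csr_get` are hypothetical helpers, not matrix4j APIs.

```python
from bisect import bisect_left


def build_csr(rows):
    """Build (indptr, indices, data) from rows of (col, value) pairs.

    The pairs may arrive in any column order; each row is sorted by
    column index so that lookups can use binary search.
    """
    indptr, indices, data = [0], [], []
    for row in rows:
        for col, value in sorted(row, key=lambda entry: entry[0]):
            indices.append(col)
            data.append(value)
        indptr.append(len(indices))
    return indptr, indices, data


def csr_get(csr, i, j):
    """Return element (i, j), or 0.0 if it is not stored."""
    indptr, indices, data = csr
    start, end = indptr[i], indptr[i + 1]
    pos = bisect_left(indices, j, start, end)
    if pos < end and indices[pos] == j:
        return data[pos]
    return 0.0


# Row 0 is given with column 2 before column 0.
csr = build_csr([[(2, 3.0), (0, 1.0)], [(1, 2.0)]])
print(csr)
print(csr_get(csr, 0, 2), csr_get(csr, 0, 1), csr_get(csr, 1, 1))
```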

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-278

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #212 from myui/HIVEMALL-278.

Makoto Yui [Thu, 31 Oct 2019 10:58:20 +0000 (19:58 +0900)] 
add missing junit dependency

Makoto Yui [Thu, 31 Oct 2019 10:17:54 +0000 (19:17 +0900)]
Added SparseDMatrixBuilder (ref: 211/head)

Makoto Yui [Thu, 31 Oct 2019 10:17:31 +0000 (19:17 +0900)] 
Renamed XGBoostUDTF as XGBoostBaseUDTF

Aki Ariga [Thu, 31 Oct 2019 07:44:44 +0000 (16:44 +0900)] 
[HIVEMALL-274] Fix wrong column name of train_regressor() in tutorial

## What changes were proposed in this pull request?

Fix the documentation bug reported in HIVEMALL-274

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/projects/HIVEMALL/issues/HIVEMALL-274

## How was this patch tested?

N/A

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Aki Ariga <ariga@treasure-data.com>

Closes #210 from chezou/HIVEMALL-274.

Makoto Yui [Wed, 30 Oct 2019 08:59:49 +0000 (17:59 +0900)] 
Added document about xgboost_version() UDF

Makoto Yui [Wed, 30 Oct 2019 07:41:21 +0000 (16:41 +0900)] 
[HIVEMALL-273] Support xgboost v0.90

## What changes were proposed in this pull request?

Support xgboost v0.90

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-273

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

https://gist.github.com/myui/aa6e142a95ca8f995cc8e49146dbe2eb

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #209 from myui/HIVEMALL-273.

Makoto Yui [Tue, 29 Oct 2019 06:37:43 +0000 (15:37 +0900)] 
[HIVEMALL-260] Remove dependencies to Scala library in xgboost classifier

## What changes were proposed in this pull request?

Remove the dependency on the Scala library in the xgboost classifier

## What type of PR is it?

Bug Fix, Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-260

## How was this patch tested?

manual tests on EMR

## How to use this feature?

to appear

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #205 from myui/HIVEMALL-260.

2 years agoRemove rand_gid/rand_gid2 macro
Makoto Yui [Wed, 23 Oct 2019 09:44:41 +0000 (18:44 +0900)] 
Remove rand_gid/rand_gid2 macro

## What changes were proposed in this pull request?

Remove rand_gid/rand_gid2 macro

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-270

Author: Makoto Yui <myui@apache.org>

Closes #204 from myui/HIVEMALL-270.

2 years ago[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF
Makoto Yui [Wed, 23 Oct 2019 09:01:51 +0000 (18:01 +0900)] 
[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF

## What changes were proposed in this pull request?

Introduce argmin/argmax/argsort UDF

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-261
https://issues.apache.org/jira/browse/HIVEMALL-262

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
SELECT argmax(array(5,2,0,1));
> 0

SELECT array_slice(array(5,2,0,1), argmax(array(5,2,0,1)));
> 5

SELECT argmin(array(5,2,0,1));
> 2

SELECT argsort(array(5,2,0,1));
> 2, 3, 1, 0

SELECT array_slice(array(5,2,0,1), argsort(array(5,2,0,1)));
> 0, 1, 2, 5

SELECT argsort(argsort(array(5,2,0,1))), argrank(array(5,2,0,1));
> 3, 2, 0, 1

SELECT arange(5), arange(1, 5), arange(1, 5, 1), arange(0, 5, 1);
> [0,1,2,3,4]     [1,2,3,4]       [1,2,3,4]       [0,1,2,3,4]

SELECT arange(1, 6, 2);
> 1, 3, 5

SELECT arange(-1, -6, 2);
> -1, -3, -5

SELECT argsort(array(5, 2, 0, 1)), argrank(array(5, 2, 0, 1)), argsort(argsort(array(5, 2, 0, 1)));
> [2,3,1,0]       [3,2,0,1]       [3,2,0,1]
```
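
The relationship between these functions can be sketched outside Hive. The following is a minimal Python illustration of the semantics shown in the examples above, not the UDF implementation; tie-breaking by first occurrence is an assumption.

```python
def argsort(values):
    """Indices that would sort `values` in ascending order (stable)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def argrank(values):
    """Rank of each element, obtained by applying argsort twice."""
    return argsort(argsort(values))


def argmax(values):
    """Index of the largest element (first one on ties: an assumption)."""
    return max(range(len(values)), key=lambda i: values[i])


def argmin(values):
    """Index of the smallest element (first one on ties: an assumption)."""
    return min(range(len(values)), key=lambda i: values[i])


data = [5, 2, 0, 1]
print(argsort(data))               # [2, 3, 1, 0]
print(argrank(data))               # [3, 2, 0, 1]
print(argmax(data), argmin(data))  # 0 2
```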

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #197 from myui/argmax.

2 years ago[HIVEMALL-244] Support Java9, Java11(LTS)
Makoto Yui [Mon, 21 Oct 2019 07:22:05 +0000 (16:22 +0900)] 
[HIVEMALL-244] Support Java9, Java11(LTS)

## What changes were proposed in this pull request?

Support Java9, Java11(LTS)

## What type of PR is it?

Improvement | Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-244

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #203 from myui/HIVEMALL-244.

2 years ago[HIVEMALL-269] Modified to use matrix4j for matrix module
Makoto Yui [Fri, 18 Oct 2019 08:42:16 +0000 (17:42 +0900)] 
[HIVEMALL-269] Modified to use matrix4j for matrix module

## What changes were proposed in this pull request?

Use matrix4j for matrix module

## What type of PR is it?

Hot Fix | Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-269

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #202 from myui/HIVEMALL-269.

2 years agoFixed annotations
Makoto Yui [Tue, 8 Oct 2019 07:15:24 +0000 (16:15 +0900)] 
Fixed annotations

2 years agoMoved matrix/random package to utils/random
Makoto Yui [Mon, 7 Oct 2019 07:16:19 +0000 (16:16 +0900)] 
Moved matrix/random package to utils/random

2 years agoMerged ArrayUtilsTest
Makoto Yui [Mon, 7 Oct 2019 05:44:39 +0000 (14:44 +0900)] 
Merged ArrayUtilsTest

2 years ago[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)
Makoto Yui [Fri, 4 Oct 2019 05:28:49 +0000 (14:28 +0900)] 
[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)

## What changes were proposed in this pull request?

Drop Spark DataFrame support (SparkSQL remains supported).

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-267

## How was this patch tested?

unit tests, manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #201 from myui/HIVEMALL-267.

2 years ago[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines
Makoto Yui [Thu, 3 Oct 2019 08:34:10 +0000 (17:34 +0900)] 
[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines

## What changes were proposed in this pull request?

Fix the default vInit, eta initialization bug in FactorizationMachines

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-268

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #200 from myui/HIVEMALL-268.

3 years ago[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees
Makoto Yui [Fri, 27 Sep 2019 18:39:01 +0000 (03:39 +0900)] 
[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees

## What changes were proposed in this pull request?

Introduce `decision_path` UDF providing tracing of decision tree prediction paths

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

to be described in the user guide

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #199 from myui/HIVEMALL-171.

3 years ago[HIVEMALL-245] Refactor RandomForest for Sparse Data handling
Makoto Yui [Fri, 13 Sep 2019 09:23:00 +0000 (18:23 +0900)] 
[HIVEMALL-245] Refactor RandomForest for Sparse Data handling

## What changes were proposed in this pull request?

Refactor RandomForest for Sparse Data handling

## What type of PR is it?

Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-245
https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #198 from myui/HIVEMALL-245.

3 years agoFixed a documentation bug
Makoto Yui [Fri, 26 Jul 2019 07:33:22 +0000 (16:33 +0900)] 
Fixed a documentation bug

3 years agoAdd test of sparse input for randomforest classifier
Makoto Yui [Thu, 18 Jul 2019 07:51:33 +0000 (16:51 +0900)] 
Add test of sparse input for randomforest classifier

3 years agoFixed a minor typo in doc
Makoto Yui [Sat, 13 Jul 2019 14:45:52 +0000 (23:45 +0900)] 
Fixed a minor typo in doc

3 years agoAdded sanity checks for training data in RandomForest
Makoto Yui [Wed, 10 Jul 2019 07:17:20 +0000 (16:17 +0900)] 
Added sanity checks for training data in RandomForest

3 years agoRefactor Matrix module for NNZ and zero value handling
Makoto Yui [Wed, 10 Jul 2019 05:58:39 +0000 (14:58 +0900)] 
Refactor Matrix module for NNZ and zero value handling

## What changes were proposed in this pull request?

Refactor Matrix module for NNZ and zero value handling.

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

no JIRA issue

## How was this patch tested?

Unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #196 from myui/refactor_randomforest.

3 years agoFixed ToC
Makoto Yui [Fri, 28 Jun 2019 16:57:48 +0000 (01:57 +0900)] 
Fixed ToC

3 years agoAdded usage for feature_binning UDF
Makoto Yui [Fri, 28 Jun 2019 16:55:39 +0000 (01:55 +0900)] 
Added usage for feature_binning UDF

3 years agoFixed a doc
Makoto Yui [Fri, 28 Jun 2019 16:30:53 +0000 (01:30 +0900)] 
Fixed a doc

3 years agoFixed feature binning documentation
Makoto Yui [Fri, 28 Jun 2019 06:43:05 +0000 (15:43 +0900)] 
Fixed feature binning documentation

3 years ago[HIVEMALL-259][DOC] Refactor feature_binning UDF
Makoto Yui [Thu, 27 Jun 2019 18:02:38 +0000 (03:02 +0900)] 
[HIVEMALL-259][DOC] Refactor feature_binning UDF

## What changes were proposed in this pull request?

Refactor feature_binning UDF and update the function usage

## What type of PR is it?

Documentation, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-259

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
WITH extracted as (
  select
    extract_feature(feature) as index,
    extract_weight(feature) as value
  from
    input l
    LATERAL VIEW explode(features) r as feature
),
mapping as (
  select
    index,
    build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
  from
    extracted
  group by
    index
),
bins as (
   select
    to_map(index, quantiles) as quantiles
   from
    mapping
)
select
  l.features as original,
  feature_binning(l.features, r.quantiles) as features
from
  input l
  cross join bins r
```
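
The idea behind the query above (compute quantile cut points per feature, then map each value to a bin) can be sketched in Python. This is only an illustration under stated assumptions: `build_cut_points` and `assign_bin` are hypothetical names, the quantile method is Python's default rather than Hivemall's, and "auto bin shrinking" is interpreted here as dropping duplicate cut points.

```python
import bisect
import statistics


def build_cut_points(values, num_bins):
    """Quantile cut points for `num_bins` bins; duplicates are dropped
    (an assumed reading of "auto bin shrinking")."""
    cuts = statistics.quantiles(values, n=num_bins)
    return sorted(set(cuts))


def assign_bin(value, cuts):
    """0-based index of the bin that `value` falls into."""
    return bisect.bisect_right(cuts, value)


values = list(range(1, 11))
cuts = build_cut_points(values, 5)
print([assign_bin(v, cuts) for v in values])

# A constant column collapses to a single cut point instead of five equal ones.
print(len(build_cut_points([1, 1, 1, 1], 5)))
```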

see https://gist.github.com/myui/f943fa3ce1a7e1ac3f2dd9a7f9fa703b

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #195 from myui/HIVEMALL-259.

3 years agoFixed imports
Makoto Yui [Tue, 25 Jun 2019 12:52:12 +0000 (21:52 +0900)] 
Fixed imports

3 years ago[HIVEMALL-253-2] map_roulette UDF
Solodye [Tue, 25 Jun 2019 10:31:02 +0000 (19:31 +0900)] 
[HIVEMALL-253-2] map_roulette UDF

revise #192

Author: Makoto Yui <myui@apache.org>

Closes #193 from myui/HIVEMALL-253-2.

3 years ago[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format
Makoto Yui [Thu, 20 Jun 2019 10:35:42 +0000 (19:35 +0900)] 
[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format

## What changes were proposed in this pull request?

Add UDF to convert feature/label in Libsvm format

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-258

## How was this patch tested?

unit tests and manual tests

## How to use this feature?

```sql
-- Usage:
 select to_libsvm_format(array('apple:3.4','orange:2.1'))
 > 6284535:3.4 8104713:2.1
 select to_libsvm_format(array('apple:3.4','orange:2.1'), '-features 10')
 > 3:2.1 7:3.4
 select to_libsvm_format(array('7:3.4','3:2.1'), 5.0)
 > 5.0 3:2.1 7:3.4
```
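
For the case where feature names are already integer indices (the third example above), the output layout can be sketched as follows. This is a minimal Python illustration, not the UDF's code: `to_libsvm_line` is a hypothetical name, and the hashing of string names such as `apple` to indices is deliberately not reproduced.

```python
def to_libsvm_line(features, label=None):
    """Format 'index:value' strings as one LIBSVM line, sorted by index.

    Assumes every feature name is already an integer index.
    """
    pairs = []
    for feature in features:
        index, _, value = feature.partition(":")
        pairs.append((int(index), value))
    pairs.sort(key=lambda pair: pair[0])
    body = " ".join(f"{index}:{value}" for index, value in pairs)
    return body if label is None else f"{label} {body}"


print(to_libsvm_line(["7:3.4", "3:2.1"], 5.0))  # 5.0 3:2.1 7:3.4
```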

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #194 from myui/libsvm.

3 years agoFixed a bug in document
Makoto Yui [Thu, 20 Jun 2019 07:09:16 +0000 (16:09 +0900)] 
Fixed a bug in document

3 years agoFixed the usage of min-max scaling and zscore
Makoto Yui [Wed, 19 Jun 2019 10:12:03 +0000 (19:12 +0900)] 
Fixed the usage of min-max scaling and zscore

3 years agoIncreased write buffer from 1MB to 2MB
Makoto Yui [Wed, 12 Jun 2019 08:27:24 +0000 (17:27 +0900)] 
Increased write buffer from 1MB to 2MB

3 years agoUpdate doc
Makoto Yui [Fri, 19 Apr 2019 07:16:32 +0000 (16:16 +0900)] 
Update doc

3 years ago[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja
Makoto Yui [Fri, 19 Apr 2019 07:04:01 +0000 (16:04 +0900)] 
[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja

## What changes were proposed in this pull request?

Add option to return PartOfSpeech information for `tokenize_ja` UDF.

## What type of PR is it?

Feature, Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-251

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
  r.tokens,
  r.pos,
  r.tokens[0] as token0,
  r.pos[0] as pos0
from
  tmp;
```

| tokens |pos | token0 | pos0 |
|:-:|:-:|:-:|:-:|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #191 from myui/HIVEMALL-251.

3 years ago[HIVEMALL-246] Add feature name validation in feature UDF
Makoto Yui [Sat, 13 Apr 2019 21:24:42 +0000 (06:24 +0900)] 
[HIVEMALL-246] Add feature name validation in feature UDF

## What changes were proposed in this pull request?

This PR adds feature name validation in feature UDF

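`feature(name, value)` should validate that `name` does not contain ":". Fail-fast behavior is preferable, so an invalid name is rejected when the feature is built rather than surfacing later as a parsing problem.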
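
A fail-fast check of this kind can be sketched as below. This is a Python illustration of the idea only; the exception type and message are assumptions, not what the UDF actually raises.

```python
def feature(name, value):
    """Build a 'name:value' feature string, rejecting names that contain ':'."""
    if ":" in name:
        # Fail fast: a ':' in the name would make 'name:value' ambiguous to parse.
        raise ValueError(f"feature name must not contain ':': {name!r}")
    return f"{name}:{value}"


print(feature("apple", 3.4))  # apple:3.4
```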

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-246

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #190 from myui/HIVEMALL-246.

3 years ago[HIVEMALL-237-1] Add usage in ML function reference page
Makoto Yui [Sat, 13 Apr 2019 20:37:14 +0000 (05:37 +0900)] 
[HIVEMALL-237-1] Add usage in ML function reference page

## What changes were proposed in this pull request?

Add usage in ML function reference page

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-237

## How was this patch tested?

via CI

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?

Author: Makoto Yui <myui@apache.org>
Author: Makoto YUI <yuin405@gmail.com>

Closes #183 from myui/HIVEMALL-237.

3 years ago[HIVEMALL-248] UDF for Kuromoji stoptags
Makoto Yui [Sat, 13 Apr 2019 20:09:38 +0000 (05:09 +0900)] 
[HIVEMALL-248] UDF for Kuromoji stoptags

## What changes were proposed in this pull request?

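In `tokenize_ja`, the user needs to provide stoptags; tokens whose part-of-speech tag matches a stoptag are removed from the token stream. In other words, stoptags are an "exclusive" rule. This PR adds a UDF that builds such a stoptag list from the part-of-speech tags the user wants to keep.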

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-248

## How was this patch tested?

unit tests, functional test on EMR

## How to use this feature?

```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
```
> ["分かち書き","テスト"]

`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a useful UDF for getting the [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) with the given part-of-speech tags excluded, as seen below:

```sql
select stoptags_exclude(array("名詞-固有名詞"));
```
> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞","助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹","名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹","名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]
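
The behavior suggested by the output above can be sketched in Python. This is an illustration only: the prefix-matching rule (excluding a tag also excludes its sub-tags) is inferred from the examples rather than confirmed from the source, and `ALL_TAGS` is a small hypothetical subset of the full stoptag list.

```python
# Small illustrative subset of part-of-speech tags (not the full Kuromoji list).
ALL_TAGS = [
    "助詞",
    "動詞",
    "名詞",
    "名詞-一般",
    "名詞-固有名詞",
    "名詞-固有名詞-人名",
]


def stoptags_exclude(excluded, all_tags=ALL_TAGS):
    """Return every tag except the excluded ones and their sub-tags.

    Sub-tag matching by 'tag-' prefix is an assumption drawn from the examples.
    """
    def is_excluded(tag):
        return any(tag == e or tag.startswith(e + "-") for e in excluded)

    return [tag for tag in all_tags if not is_excluded(tag)]


print(stoptags_exclude(["名詞-固有名詞"]))  # ['助詞', '動詞', '名詞', '名詞-一般']
print(stoptags_exclude(["名詞"]))           # ['助詞', '動詞']
```

Tokens tagged with anything left in the returned list are dropped by the tokenizer, so excluding `名詞` keeps only nouns, as in the first example.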

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #189 from myui/HIVEMALL-248.

3 years ago[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tunin...
Makoto Yui [Fri, 12 Apr 2019 07:02:17 +0000 (16:02 +0900)] 
[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tuning tips

## What changes were proposed in this pull request?

Recommend `hive.optimize.cte.materialize.threshold=2` in Hive tuning tips

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-247

Author: Makoto Yui <myui@apache.org>

Closes #188 from myui/HIVEMALL-247.

3 years ago[HIVEMALL-250][DOC] Add tutorial for binarize_label
Makoto Yui [Fri, 12 Apr 2019 06:38:53 +0000 (15:38 +0900)] 
[HIVEMALL-250][DOC] Add tutorial for binarize_label

## What changes were proposed in this pull request?

Add tutorial for `binarize_label` UDTF

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-250

## How to use this feature?

as described in tutorial

Author: Makoto Yui <myui@apache.org>

Closes #187 from myui/HIVEMALL-250.

3 years agoAdded a unit test for PA regression
Makoto Yui [Mon, 25 Mar 2019 08:27:09 +0000 (17:27 +0900)] 
Added a unit test for PA regression

3 years agoFixed links
Makoto Yui [Mon, 18 Mar 2019 09:43:50 +0000 (18:43 +0900)] 
Fixed links

3 years agoworkaround for maven-project-info-reports-plugin errors on building site
Makoto Yui [Mon, 18 Mar 2019 09:37:04 +0000 (18:37 +0900)] 
workaround for maven-project-info-reports-plugin errors on building site

3 years agoUpdated scm tag
Makoto Yui [Mon, 18 Mar 2019 07:22:20 +0000 (16:22 +0900)] 
Updated scm tag

3 years agoExcluded JDK's tools.jar from Bytecode Version enforcer
Makoto Yui [Mon, 18 Mar 2019 06:40:35 +0000 (15:40 +0900)] 
Excluded JDK's tools.jar from Bytecode Version enforcer

3 years agoAdded Java API compatibility checks
Makoto Yui [Mon, 18 Mar 2019 05:51:58 +0000 (14:51 +0900)] 
Added Java API compatibility checks

3 years ago[HIVEMALL-242][HIVEMALL-241] Drop support for Spark 2.1 and Deprecate Java7 for packaging
Makoto Yui [Mon, 18 Mar 2019 05:14:14 +0000 (14:14 +0900)] 
[HIVEMALL-242][HIVEMALL-241] Drop support for Spark 2.1 and Deprecate Java7 for packaging

## What changes were proposed in this pull request?

- Drop support for Spark 2.1
- Require Java8 for packaging, deprecating Java7 (class file compatibility is Java7 or later)

Runtime Java compatibility: Java7 or later
Packaging/Compile-time Java compatibility: Java8 or later

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-242
https://issues.apache.org/jira/browse/HIVEMALL-241

## How was this patch tested?

unit tests, manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #186 from myui/drop-spark2_1.

3 years ago[HIVEMALL-243] Fix nominal variable handling in DecisionTree and RegressionTre
Makoto Yui [Wed, 13 Mar 2019 07:56:17 +0000 (16:56 +0900)] 
[HIVEMALL-243] Fix nominal variable handling in DecisionTree and RegressionTre

## What changes were proposed in this pull request?

For NOMINAL variables, the maximum attribute index 'm' is used for computing splits.

This causes performance issues for sparse nominal variables, where most indices up to 'm' never occur. So, this PR revises the handling for better performance.

https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L703
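
Why the maximum index is costly for sparse nominal variables can be shown with a small Python sketch. It is not the patch itself: both function names are hypothetical, and enumerating only observed values is just one possible way to avoid the cost.

```python
def split_candidates_dense(max_index):
    """Enumerate every value in 0..max_index, whether observed or not."""
    return list(range(max_index + 1))


def split_candidates_sparse(column):
    """Enumerate only the values that actually occur in the samples."""
    return sorted(set(column))


# A sparse nominal column: few distinct values, but a large maximum index.
column = [3, 3, 9000, 42, 3]
dense = split_candidates_dense(max(column))
sparse = split_candidates_sparse(column)
print(len(dense), len(sparse))  # 9001 3
```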

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-243

## How was this patch tested?

- [x] manual test on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #185 from myui/HIVEMALL-243.

3 years agoApplied refactoring
Makoto Yui [Thu, 21 Feb 2019 07:11:35 +0000 (16:11 +0900)] 
Applied refactoring

3 years agoApplied formatter
Makoto Yui [Thu, 21 Feb 2019 06:59:41 +0000 (15:59 +0900)] 
Applied formatter