incubator-hivemall.git
2 years agoAdded SparseDMatrixBuilder 211/head
Makoto Yui [Thu, 31 Oct 2019 10:17:54 +0000 (19:17 +0900)] 
Added SparseDMatrixBuilder

2 years agoRenamed XGBoostUDTF as XGBoostBaseUDTF
Makoto Yui [Thu, 31 Oct 2019 10:17:31 +0000 (19:17 +0900)] 
Renamed XGBoostUDTF as XGBoostBaseUDTF

2 years ago[HIVEMALL-274] Fix wrong column name of train_regressor() in tutorial
Aki Ariga [Thu, 31 Oct 2019 07:44:44 +0000 (16:44 +0900)] 
[HIVEMALL-274] Fix wrong column name of train_regressor() in tutorial

## What changes were proposed in this pull request?

Fix document bug reported in HIVEMALL-274

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/projects/HIVEMALL/issues/HIVEMALL-274

## How was this patch tested?

N/A

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Aki Ariga <ariga@treasure-data.com>

Closes #210 from chezou/HIVEMALL-274.

2 years agoAdded document about xgboost_version() UDF
Makoto Yui [Wed, 30 Oct 2019 08:59:49 +0000 (17:59 +0900)] 
Added document about xgboost_version() UDF

2 years ago[HIVEMALL-273] Support xgboost v0.90
Makoto Yui [Wed, 30 Oct 2019 07:41:21 +0000 (16:41 +0900)] 
[HIVEMALL-273] Support xgboost v0.90

## What changes were proposed in this pull request?

Support xgboost v0.90

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-273

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

https://gist.github.com/myui/aa6e142a95ca8f995cc8e49146dbe2eb

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #209 from myui/HIVEMALL-273.

2 years ago[HIVEMALL-260] Remove dependencies to Scala library in xgboost classifier
Makoto Yui [Tue, 29 Oct 2019 06:37:43 +0000 (15:37 +0900)] 
[HIVEMALL-260] Remove dependencies to Scala library in xgboost classifier

## What changes were proposed in this pull request?

Remove dependencies to Scala library in xgboost classifier

## What type of PR is it?

Bug Fix, Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-260

## How was this patch tested?

manual tests on EMR

## How to use this feature?

to appear

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #205 from myui/HIVEMALL-260.

2 years agoRemove rand_gid/rand_gid2 macro
Makoto Yui [Wed, 23 Oct 2019 09:44:41 +0000 (18:44 +0900)] 
Remove rand_gid/rand_gid2 macro

## What changes were proposed in this pull request?

Remove rand_gid/rand_gid2 macro

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-270

Author: Makoto Yui <myui@apache.org>

Closes #204 from myui/HIVEMALL-270.

2 years ago[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF
Makoto Yui [Wed, 23 Oct 2019 09:01:51 +0000 (18:01 +0900)] 
[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF

## What changes were proposed in this pull request?

Introduce argmin/argmax/argsort UDF

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-261
https://issues.apache.org/jira/browse/HIVEMALL-262

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
SELECT argmax(array(5,2,0,1));
> 0

SELECT array_slice(array(5,2,0,1), argmax(array(5,2,0,1)));
> 5

SELECT argmin(array(5,2,0,1));
> 2

SELECT argsort(array(5,2,0,1));
> 2, 3, 1, 0

SELECT array_slice(array(5,2,0,1), argsort(array(5,2,0,1)));
> 0, 1, 2, 5

SELECT argsort(argsort(array(5,2,0,1))), argrank(array(5,2,0,1));
> 3, 2, 0, 1

SELECT arange(5), arange(1, 5), arange(1, 5, 1), arange(0, 5, 1);
> [0,1,2,3,4]     [1,2,3,4]       [1,2,3,4]       [0,1,2,3,4]

SELECT arange(1, 6, 2);
> 1, 3, 5

SELECT arange(-1, -6, 2);
> -1, -3, -5

SELECT argsort(array(5, 2, 0, 1)), argrank(array(5, 2, 0, 1)), argsort(argsort(array(5, 2, 0, 1)));
> [2,3,1,0]       [3,2,0,1]       [3,2,0,1]
```
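
The relationship between these functions can be sketched outside Hive. The Python snippet below is a minimal illustration, not Hivemall's implementation: the function names only mirror the UDFs above, and the tie-breaking rule (stable, lowest index first) is an assumption.

```python
def argsort(values):
    """Indices that would sort `values` in ascending order (stable for ties)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def argrank(values):
    """Rank of each element, computed as argsort(argsort(values))."""
    return argsort(argsort(values))


def argmax(values):
    """Index of the first maximum element."""
    return max(range(len(values)), key=lambda i: values[i])


def argmin(values):
    """Index of the first minimum element."""
    return min(range(len(values)), key=lambda i: values[i])


xs = [5, 2, 0, 1]
print(argsort(xs))             # [2, 3, 1, 0]
print(argrank(xs))             # [3, 2, 0, 1]
print(argmax(xs), argmin(xs))  # 0 2
```

Applying `argsort` twice yields ranks because the first pass gives the sorting permutation and the second pass inverts it.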

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #197 from myui/argmax.

2 years ago[HIVEMALL-244] Support Java9, Java11(LTS)
Makoto Yui [Mon, 21 Oct 2019 07:22:05 +0000 (16:22 +0900)] 
[HIVEMALL-244] Support Java9, Java11(LTS)

## What changes were proposed in this pull request?

Support Java9, Java11(LTS)

## What type of PR is it?

Improvement | Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-244

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #203 from myui/HIVEMALL-244.

2 years ago[HIVEMALL-269] Modified to use matrix4j for matrix module
Makoto Yui [Fri, 18 Oct 2019 08:42:16 +0000 (17:42 +0900)] 
[HIVEMALL-269] Modified to use matrix4j for matrix module

## What changes were proposed in this pull request?

Use matrix4j for matrix module

## What type of PR is it?

Hot Fix | Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-269

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #202 from myui/HIVEMALL-269.

2 years agoFixed annotations
Makoto Yui [Tue, 8 Oct 2019 07:15:24 +0000 (16:15 +0900)] 
Fixed annotations

2 years agoMoved matrix/random package to utils/random
Makoto Yui [Mon, 7 Oct 2019 07:16:19 +0000 (16:16 +0900)] 
Moved matrix/random package to utils/random

2 years agoMerged ArrayUtilsTest
Makoto Yui [Mon, 7 Oct 2019 05:44:39 +0000 (14:44 +0900)] 
Merged ArrayUtilsTest

2 years ago[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remains supported)
Makoto Yui [Fri, 4 Oct 2019 05:28:49 +0000 (14:28 +0900)] 
[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remains supported)

## What changes were proposed in this pull request?

Drop Spark Dataframe support (SparkSQL remains supported).

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-267

## How was this patch tested?

unit tests, manual tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #201 from myui/HIVEMALL-267.

2 years ago[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines
Makoto Yui [Thu, 3 Oct 2019 08:34:10 +0000 (17:34 +0900)] 
[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines

## What changes were proposed in this pull request?

Fix the default vInit, eta initialization bug in FactorizationMachines

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-268

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #200 from myui/HIVEMALL-268.

2 years ago[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees
Makoto Yui [Fri, 27 Sep 2019 18:39:01 +0000 (03:39 +0900)] 
[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees

## What changes were proposed in this pull request?

Introduce `decision_path` UDF providing tracing of decision tree prediction paths

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

to be described in the user guide

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #199 from myui/HIVEMALL-171.

3 years ago[HIVEMALL-245] Refactor RandomForest for Sparse Data handling
Makoto Yui [Fri, 13 Sep 2019 09:23:00 +0000 (18:23 +0900)] 
[HIVEMALL-245] Refactor RandomForest for Sparse Data handling

## What changes were proposed in this pull request?

Refactor RandomForest for Sparse Data handling

## What type of PR is it?

Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-245
https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #198 from myui/HIVEMALL-245.

3 years agoFixed a documentation bug
Makoto Yui [Fri, 26 Jul 2019 07:33:22 +0000 (16:33 +0900)] 
Fixed a documentation bug

3 years agoAdd test of sparse input for randomforest classifier
Makoto Yui [Thu, 18 Jul 2019 07:51:33 +0000 (16:51 +0900)] 
Add test of sparse input for randomforest classifier

3 years agoFixed a minor typo in doc
Makoto Yui [Sat, 13 Jul 2019 14:45:52 +0000 (23:45 +0900)] 
Fixed a minor typo in doc

3 years agoAdded sanity checks for training data in RandomForest
Makoto Yui [Wed, 10 Jul 2019 07:17:20 +0000 (16:17 +0900)] 
Added sanity checks for training data in RandomForest

3 years agoRefactor Matrix module for NNZ and zero value handling
Makoto Yui [Wed, 10 Jul 2019 05:58:39 +0000 (14:58 +0900)] 
Refactor Matrix module for NNZ and zero value handling

## What changes were proposed in this pull request?

Refactor Matrix module for NNZ and zero value handling.

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

no JIRA issue

## How was this patch tested?

Unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #196 from myui/refactor_randomforest.

3 years agoFixed ToC
Makoto Yui [Fri, 28 Jun 2019 16:57:48 +0000 (01:57 +0900)] 
Fixed ToC

3 years agoAdded usage for feature_binning UDF
Makoto Yui [Fri, 28 Jun 2019 16:55:39 +0000 (01:55 +0900)] 
Added usage for feature_binning UDF

3 years agoFixed a doc
Makoto Yui [Fri, 28 Jun 2019 16:30:53 +0000 (01:30 +0900)] 
Fixed a doc

3 years agoFixed feature binning documentation
Makoto Yui [Fri, 28 Jun 2019 06:43:05 +0000 (15:43 +0900)] 
Fixed feature binning documentation

3 years ago[HIVEMALL-259][DOC] Refactor feature_binning UDF
Makoto Yui [Thu, 27 Jun 2019 18:02:38 +0000 (03:02 +0900)] 
[HIVEMALL-259][DOC] Refactor feature_binning UDF

## What changes were proposed in this pull request?

Refactor feature_binning UDF and update the function usage

## What type of PR is it?

Documentation, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-259

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
WITH extracted as (
  select
    extract_feature(feature) as index,
    extract_weight(feature) as value
  from
    input l
    LATERAL VIEW explode(features) r as feature
),
mapping as (
  select
    index,
    build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
  from
    extracted
  group by
    index
),
bins as (
   select
    to_map(index, quantiles) as quantiles
   from
    mapping
)
select
  l.features as original,
  feature_binning(l.features, r.quantiles) as features
from
  input l
  cross join bins r
```

see https://gist.github.com/myui/f943fa3ce1a7e1ac3f2dd9a7f9fa703b
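
The idea behind the query above, computing per-feature quantile boundaries and then mapping each value to a bin, can be sketched in plain Python. This is an illustration under stated assumptions, not Hivemall's implementation: the helper names `build_bins` and `bin_index` are borrowed or hypothetical, the boundaries are assumed to be padded with infinite sentinels, bins are assumed to be `(lower, upper]`, and the auto bin shrinking option is not modeled.

```python
import bisect
import statistics


def build_bins(values, num_bins):
    """Return num_bins + 1 boundaries: -inf, the inner quantile cut points, +inf."""
    cuts = statistics.quantiles(values, n=num_bins)  # num_bins - 1 cut points
    return [float("-inf")] + cuts + [float("inf")]


def bin_index(value, boundaries):
    """Map a value to the index of the (lower, upper] bin that contains it."""
    inner = boundaries[1:-1]  # drop the infinite sentinels
    return bisect.bisect_left(inner, value)


ages = [19, 23, 31, 37, 43, 52, 58, 64, 70, 81]
boundaries = build_bins(ages, 5)
binned = [bin_index(a, boundaries) for a in ages]
print(boundaries)
print(binned)
```

Keeping the boundaries as a sorted list makes the lookup a binary search, so binning a value costs O(log k) for k bins.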

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #195 from myui/HIVEMALL-259.

3 years agoFixed imports
Makoto Yui [Tue, 25 Jun 2019 12:52:12 +0000 (21:52 +0900)] 
Fixed imports

3 years ago[HIVEMALL-253-2] map_roulette UDF
Solodye [Tue, 25 Jun 2019 10:31:02 +0000 (19:31 +0900)] 
[HIVEMALL-253-2] map_roulette UDF

revise #192

Author: Makoto Yui <myui@apache.org>

Closes #193 from myui/HIVEMALL-253-2.

3 years ago[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format
Makoto Yui [Thu, 20 Jun 2019 10:35:42 +0000 (19:35 +0900)] 
[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format

## What changes were proposed in this pull request?

Add UDF to convert feature/label in Libsvm format

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-258

## How was this patch tested?

unit tests and manual tests

## How to use this feature?

```sql
-- Usage:
 select to_libsvm_format(array('apple:3.4','orange:2.1'))
 > 6284535:3.4 8104713:2.1
 select to_libsvm_format(array('apple:3.4','orange:2.1'), '-features 10')
 > 3:2.1 7:3.4
 select to_libsvm_format(array('7:3.4','3:2.1'), 5.0)
 > 5.0 3:2.1 7:3.4
```
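
The output layout in the last example (optional label first, then `index:value` pairs in ascending index order) can be sketched as follows. This is a hypothetical helper for illustration only: it handles features that already have integer indices and does not reproduce the hashing that Hivemall applies to string names such as `apple`.

```python
def to_libsvm_line(features, label=None):
    """Format 'index:value' strings as one LIBSVM line, sorted by index."""
    pairs = []
    for feature in features:
        index, _, value = feature.rpartition(":")
        pairs.append((int(index), value))
    pairs.sort(key=lambda pair: pair[0])  # LIBSVM expects ascending indices
    body = " ".join(f"{index}:{value}" for index, value in pairs)
    return body if label is None else f"{label} {body}"


print(to_libsvm_line(["7:3.4", "3:2.1"], 5.0))  # 5.0 3:2.1 7:3.4
```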

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #194 from myui/libsvm.

3 years agoFixed a bug in document
Makoto Yui [Thu, 20 Jun 2019 07:09:16 +0000 (16:09 +0900)] 
Fixed a bug in document

3 years agoFixed the usage of min-max scaling and zscore
Makoto Yui [Wed, 19 Jun 2019 10:12:03 +0000 (19:12 +0900)] 
Fixed the usage of min-max scaling and zscore

3 years agoIncreased write buffer from 1MB to 2MB
Makoto Yui [Wed, 12 Jun 2019 08:27:24 +0000 (17:27 +0900)] 
Increased write buffer from 1MB to 2MB

3 years agoUpdate doc
Makoto Yui [Fri, 19 Apr 2019 07:16:32 +0000 (16:16 +0900)] 
Update doc

3 years ago[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja
Makoto Yui [Fri, 19 Apr 2019 07:04:01 +0000 (16:04 +0900)] 
[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja

## What changes were proposed in this pull request?

Add option to return PartOfSpeech information for `tokenize_ja` UDF.

## What type of PR is it?

Feature, Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-251

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
  r.tokens,
  r.pos,
  r.tokens[0] as token0,
  r.pos[0] as pos0
from
  tmp;
```

| tokens |pos | token0 | pos0 |
|:-:|:-:|:-:|:-:|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #191 from myui/HIVEMALL-251.

3 years ago[HIVEMALL-246] Add feature name validation in feature UDF
Makoto Yui [Sat, 13 Apr 2019 21:24:42 +0000 (06:24 +0900)] 
[HIVEMALL-246] Add feature name validation in feature UDF

## What changes were proposed in this pull request?

This PR adds feature name validation in feature UDF

`feature(name, value)` should validate that `name` does not contain ":", because ":" separates the name from the value in the resulting feature string. Fail-fast behavior is preferable.
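
A fail-fast check of this kind might look like the following Python sketch. It is an illustration, not the UDF's actual code; the exception type and message are assumptions.

```python
def feature(name, value):
    """Build a 'name:value' feature string, rejecting names that contain ':'."""
    if ":" in name:
        # Fail immediately instead of emitting an ambiguous feature string.
        raise ValueError(f"feature name must not contain ':': {name!r}")
    return f"{name}:{value}"


print(feature("apple", 3.4))  # apple:3.4
```

Without the check, a name such as `a:b` would produce `a:b:1.0`, which a downstream parser could not split unambiguously.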

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-246

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #190 from myui/HIVEMALL-246.

3 years ago[HIVEMALL-237-1] Add usage in ML function reference page
Makoto Yui [Sat, 13 Apr 2019 20:37:14 +0000 (05:37 +0900)] 
[HIVEMALL-237-1] Add usage in ML function reference page

## What changes were proposed in this pull request?

Add usage in ML function reference page

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-237

## How was this patch tested?

via CI

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?

Author: Makoto Yui <myui@apache.org>
Author: Makoto YUI <yuin405@gmail.com>

Closes #183 from myui/HIVEMALL-237.

3 years ago[HIVEMALL-248] UDF for Kuromoji stoptags
Makoto Yui [Sat, 13 Apr 2019 20:09:38 +0000 (05:09 +0900)] 
[HIVEMALL-248] UDF for Kuromoji stoptags

## What changes were proposed in this pull request?

In `tokenize_ja`, users need to provide stoptags; tokens whose part-of-speech tag matches a stoptag are removed from the token stream. A stoptag is therefore an "exclusive" rule.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-248

## How was this patch tested?

unit tests, functional test on EMR

## How to use this feature?

```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
```
> ["分かち書き","テスト"]

`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a UDF that returns the [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) with the given part-of-speech tags excluded, as seen below:

```sql
select stoptags_exclude(array("名詞-固有名詞"));
```
> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞","助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹","名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹","名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]
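
The "exclusive" behavior can be sketched in Python. This is an illustration under assumptions, not the UDF's code: the tag list is a tiny hypothetical sample, and treating an excluded tag as also covering its sub-categories is inferred from the sample output above.

```python
def stoptags_exclude(all_tags, keep):
    """Return all_tags minus the tags in `keep` and their sub-categories."""
    def is_kept(tag):
        return any(tag == k or tag.startswith(k + "-") for k in keep)
    return [tag for tag in all_tags if not is_kept(tag)]


def remove_stopped(tokens_with_pos, stoptags):
    """Drop tokens whose part-of-speech tag exactly matches a stoptag."""
    stop = set(stoptags)
    return [token for token, pos in tokens_with_pos if pos not in stop]


# Hypothetical sample of part-of-speech tags, far shorter than the real list.
sample_tags = ["名詞", "名詞-一般", "名詞-固有名詞", "名詞-固有名詞-人名", "動詞", "助詞"]
stoptags = stoptags_exclude(sample_tags, ["名詞-固有名詞"])
print(stoptags)  # ['名詞', '名詞-一般', '動詞', '助詞']
```

Because the result is a stop list, every tag that remains in it is filtered out of the token stream, and only tokens carrying the excluded tags survive.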

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #189 from myui/HIVEMALL-248.

3 years ago[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tunin...
Makoto Yui [Fri, 12 Apr 2019 07:02:17 +0000 (16:02 +0900)] 
[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tuning tips

## What changes were proposed in this pull request?

Recommend `hive.optimize.cte.materialize.threshold=2` in Hive tuning tips

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-247

Author: Makoto Yui <myui@apache.org>

Closes #188 from myui/HIVEMALL-247.

3 years ago[HIVEMALL-250][DOC] Add tutorial for binarize_label
Makoto Yui [Fri, 12 Apr 2019 06:38:53 +0000 (15:38 +0900)] 
[HIVEMALL-250][DOC] Add tutorial for binarize_label

## What changes were proposed in this pull request?

Add tutorial for `binarize_label` UDTF

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-250

## How to use this feature?

as described in tutorial

Author: Makoto Yui <myui@apache.org>

Closes #187 from myui/HIVEMALL-250.

3 years agoAdded a unit test for PA regression
Makoto Yui [Mon, 25 Mar 2019 08:27:09 +0000 (17:27 +0900)] 
Added a unit test for PA regression

3 years agoFixed links
Makoto Yui [Mon, 18 Mar 2019 09:43:50 +0000 (18:43 +0900)] 
Fixed links

3 years agoworkaround for maven-project-info-reports-plugin errors on building site
Makoto Yui [Mon, 18 Mar 2019 09:37:04 +0000 (18:37 +0900)] 
workaround for maven-project-info-reports-plugin errors on building site

3 years agoUpdated scm tag
Makoto Yui [Mon, 18 Mar 2019 07:22:20 +0000 (16:22 +0900)] 
Updated scm tag

3 years agoExcluded JDK's tools.jar from Bytecode Version enforcer
Makoto Yui [Mon, 18 Mar 2019 06:40:35 +0000 (15:40 +0900)] 
Excluded JDK's tools.jar from Bytecode Version enforcer

3 years agoAdded Java API compatibility checks
Makoto Yui [Mon, 18 Mar 2019 05:51:58 +0000 (14:51 +0900)] 
Added Java API compatibility checks

3 years ago[HIVEMALL-242][HIVEMALL-241] Drop support for Spark 2.1 and Deprecate Java7 for packaging
Makoto Yui [Mon, 18 Mar 2019 05:14:14 +0000 (14:14 +0900)] 
[HIVEMALL-242][HIVEMALL-241] Drop support for Spark 2.1 and Deprecate Java7 for packaging

## What changes were proposed in this pull request?

- Drop support for Spark 2.1
- Require Java8 for packaging, deprecating Java7 (class file compatibility is Java7 or later)

Runtime Java compatibility: Java7 or later
Packaging/Compile-time Java compatibility: Java8 or later

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-242
https://issues.apache.org/jira/browse/HIVEMALL-241

## How was this patch tested?

unit tests, manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #186 from myui/drop-spark2_1.

3 years ago[HIVEMALL-243] Fix nominal variable handling in DecisionTree and RegressionTre
Makoto Yui [Wed, 13 Mar 2019 07:56:17 +0000 (16:56 +0900)] 
[HIVEMALL-243] Fix nominal variable handling in DecisionTree and RegressionTre

## What changes were proposed in this pull request?

For a NOMINAL variable, the maximum attribute index 'm' is used for computing splits.

This causes performance issues for sparse nominal variables, so this PR revises the handling for better performance.

https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L703

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-243

## How was this patch tested?

- [x] manual test on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #185 from myui/HIVEMALL-243.

3 years agoApplied refactoring
Makoto Yui [Thu, 21 Feb 2019 07:11:35 +0000 (16:11 +0900)] 
Applied refactoring

3 years agoApplied formatter
Makoto Yui [Thu, 21 Feb 2019 06:59:41 +0000 (15:59 +0900)] 
Applied formatter

3 years ago[HIVEMALL-238] Fixed from_json UDF to support top-level Map object
Makoto Yui [Thu, 21 Feb 2019 06:55:39 +0000 (15:55 +0900)] 
[HIVEMALL-238] Fixed from_json UDF to support top-level Map object

## What changes were proposed in this pull request?

Fixed from_json UDF to support top-level Map object

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-238

## How was this patch tested?

unit tests, manual tests

## How to use this feature?

```sql
select
  from_json(to_json(map('one',1,'two',2)), 'map<string,int>')
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #184 from myui/HIVEMALL-238.

3 years agoFixed scala test errors
Makoto Yui [Fri, 15 Feb 2019 06:10:56 +0000 (15:10 +0900)] 
Fixed scala test errors

3 years agoFixed CI error due to a bug in unit test
Makoto Yui [Thu, 14 Feb 2019 06:07:12 +0000 (15:07 +0900)] 
Fixed CI error due to a bug in unit test

3 years agoRefined tutorial documents
Makoto Yui [Fri, 8 Feb 2019 06:10:54 +0000 (15:10 +0900)] 
Refined tutorial documents

3 years agoApplied refactoring and documentation improvement
Makoto Yui [Fri, 8 Feb 2019 06:10:29 +0000 (15:10 +0900)] 
Applied refactoring and documentation improvement

3 years agoRenamed map_index UDF to map_get
Makoto Yui [Thu, 7 Feb 2019 06:12:39 +0000 (15:12 +0900)] 
Renamed map_index UDF to map_get

3 years agoAdded usages
Makoto Yui [Wed, 6 Feb 2019 08:16:24 +0000 (17:16 +0900)] 
Added usages

3 years agoModified to_string_array to be a generic UDF
Makoto Yui [Wed, 6 Feb 2019 08:15:47 +0000 (17:15 +0900)] 
Modified to_string_array to be a generic UDF

3 years ago[HIVEMALL-236] to_json/from_json cause KryoException/NullPointerException with ArrayL...
Makoto Yui [Tue, 5 Feb 2019 08:17:37 +0000 (17:17 +0900)] 
[HIVEMALL-236] to_json/from_json cause KryoException/NullPointerException with ArrayList due to Kryo bug

## What changes were proposed in this pull request?

Avoid NPE in Kryo serialization of List object created by `Arrays.asList`.

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-236

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #182 from myui/json_fix.

3 years ago[HIVEMALL-233-2] RandomForest regressor accepts sparse vector input
Takuya Kitazawa [Tue, 5 Feb 2019 04:55:55 +0000 (13:55 +0900)] 
[HIVEMALL-233-2] RandomForest regressor accepts sparse vector input

## What changes were proposed in this pull request?

Enable RandomForestRegressor to accept sparse vector input as RandomForestClassifier already does.

This closes #178

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-233

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
with customers as (
  select 1 as id, "male" as gender, 23 as age, "Japan" as country, 12 as num_purchases
  union all
  select 2 as id, "female" as gender, 43 as age, "US" as country, 4 as num_purchases
  union all
  select 3 as id, "other" as gender, 19 as age, "UK" as country, 2 as num_purchases
  union all
  select 4 as id, "male" as gender, 31 as age, "US" as country, 20 as num_purchases
  union all
  select 5 as id, "female" as gender, 37 as age, "Australia" as country, 9 as num_purchases
),
training as (
  select
    array_concat(
      quantitative_features(
        array("age"),
        age
      ),
      categorical_features(
        array("country", "gender"),
        country, gender
      )
    ) as features,
    num_purchases
  from
    customers
)
select
  train_randomforest_regressor(
    feature_hashing(features), -- feature vector
    num_purchases, -- target value
    '-trees 40 -seed 31' -- hyper-parameters
  )
from
  training
;
```
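
The role of `feature_hashing` in the query above, turning named features into integer indices so that a sparse vector can be built, can be sketched in Python. This is only an illustration: `zlib.crc32` stands in for whatever hash function Hivemall actually uses, and the index range `1..num_features` and its default size are assumptions.

```python
import zlib


def hash_feature(feature, num_features=2 ** 24):
    """Replace a feature name with a hashed index in 1..num_features."""
    name, sep, value = feature.rpartition(":")
    if not sep:
        # Categorical feature such as 'country#Japan' carries no explicit value.
        name, value = feature, None
    index = zlib.crc32(name.encode("utf-8")) % num_features + 1
    return str(index) if value is None else f"{index}:{value}"


features = ["age:23", "country#Japan", "gender#male"]
print([hash_feature(f) for f in features])
```

Hashing trades a small risk of index collisions for a fixed-size feature space that needs no dictionary of feature names.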

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Takuya Kitazawa <k.takuti@gmail.com>
Author: Makoto Yui <myui@apache.org>

Closes #181 from myui/HIVEMALL-233-2.

3 years ago[HIVEMALL-234] Define `EtaEstimator` default values as constants
Takuya Kitazawa [Wed, 30 Jan 2019 05:01:35 +0000 (14:01 +0900)] 
[HIVEMALL-234] Define `EtaEstimator` default values as constants

## What changes were proposed in this pull request?

Fix mismatched default values declared in `getOptions()` and `EtaEstimator`.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-234

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?

Author: Takuya Kitazawa <k.takuti@gmail.com>

Closes #179 from takuti/HIVEMALL-234.

3 years ago[HIVEMALL-235] Fix a bug in expansion of array where size is zero
Makoto Yui [Wed, 30 Jan 2019 04:50:25 +0000 (13:50 +0900)] 
[HIVEMALL-235] Fix a bug in expansion of array where size is zero

## What changes were proposed in this pull request?

Fix a bug in expansion of array where size is zero.

See the following commit for details:
https://github.com/apache/incubator-hivemall/pull/178/commits/d7695d461056b21eab25465e015c582edc2b57ce

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-235

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #180 from myui/HIVEMALL-235.

3 years ago[HIVEMALL-232][DOC] Fix typo in the Top-K document
Kengo Seki [Thu, 10 Jan 2019 18:33:49 +0000 (03:33 +0900)] 
[HIVEMALL-232][DOC] Fix typo in the Top-K document

## What changes were proposed in this pull request?

`DISTRIBUTE BY x CLASS SORT BY x` in the Top-K document looks like a typo, so fixing it.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-232

## How was this patch tested?

I think no test is needed since it's just a minor documentation fix.

Author: Kengo Seki <sekikn@apache.org>

Closes #177 from sekikn/HIVEMALL-232.

3 years agoFixed to update generic_func.md properly
Makoto Yui [Wed, 9 Jan 2019 07:00:53 +0000 (16:00 +0900)] 
Fixed to update generic_func.md properly

3 years ago[HIVEMALL-231] Replaced subarray UDF implementation with SubarrayUDF
Makoto Yui [Tue, 8 Jan 2019 11:02:07 +0000 (20:02 +0900)] 
[HIVEMALL-231] Replaced subarray UDF implementation with SubarrayUDF

## What changes were proposed in this pull request?

Replaced subarray UDF implementation with SubarrayUDF for backward compatibility.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-231

## How was this patch tested?

manual tests on EMR

## How to use this feature?

To be described in [userguide](http://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array).

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #176 from myui/subarray.

3 years agoMoved git repos to Gitbox
Makoto Yui [Tue, 8 Jan 2019 06:21:59 +0000 (15:21 +0900)] 
Moved git repos to Gitbox

3 years ago[HIVEMALL-214][DOC] Update userguide for General Classifier/Regressor example
Makoto Yui [Wed, 26 Dec 2018 10:15:43 +0000 (19:15 +0900)] 
[HIVEMALL-214][DOC] Update userguide for General Classifier/Regressor example

## What changes were proposed in this pull request?

Refine user guide for generic classifier/regressor and so on.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-214

## How to use this feature?

See user guide.

Author: Makoto Yui <myui@apache.org>

Closes #159 from myui/HIVEMALL-214.

3 years ago[HIVEMALL-230] Revise Optimizer Implementation
Makoto Yui [Wed, 26 Dec 2018 10:14:23 +0000 (19:14 +0900)] 
[HIVEMALL-230] Revise Optimizer Implementation

## What changes were proposed in this pull request?

Revise Optimizer implementation.

1. Revise default hyperparameters of AdaDelta and Adam.
2. Support AdamW, AMSGrad, AdamHD, Eve, and YellowFin optimizers.

- [x] Nesterov’s Accelerated Gradient
https://arxiv.org/abs/1212.0901
- [x] Rmsprop
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. 2014. Lecture 6e: Rmsprop: Divide the gradient by a running average of its recent magnitude
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- [x] RMSpropGraves - Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
- [x] Fixing Weight Decay Regularization in Adam
https://openreview.net/forum?id=rk6qdGgCZ
- [x] On the Convergence of Adam and Beyond
https://openreview.net/forum?id=ryQu7f-RZ
- [x] AdamHD (Adam with Hypergradient descent)
https://arxiv.org/pdf/1703.04782.pdf
- [x] Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
https://openreview.net/forum?id=r1WUqIceg
- [x] nadam: Adam with Nesterov momentum
https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ
http://cs229.stanford.edu/proj2015/054_report.pdf
http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
- [ ] ~YellowFin and the Art of Momentum Tuning~
https://openreview.net/forum?id=SyrGJYlRZ
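
As a rough illustration of two of the variants listed above, the following Python sketch applies one Adam-style update to a single scalar weight, with decoupled weight decay (the idea behind AdamW) and an optional AMSGrad maximum over the second-moment estimate. It is not taken from the Hivemall source: the function name `adamw_step`, its parameters, and the default hyperparameters are assumptions chosen for illustration, and the AMSGrad branch follows one common variant that bias-corrects after taking the maximum.

```python
import math


def adamw_step(w, g, m, v, v_max, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, amsgrad=False):
    """Apply one update to scalar weight w given gradient g at step t (t >= 1).

    Hypothetical helper for illustration only; returns (w, m, v, v_max).
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1.0 - beta1 ** t)
    bias2 = 1.0 - beta2 ** t

    if amsgrad:
        # AMSGrad: keep the running maximum of the second moment so the
        # effective step size cannot grow between iterations.
        v_max = max(v_max, v)
        denom = math.sqrt(v_max / bias2) + eps
    else:
        denom = math.sqrt(v / bias2) + eps

    # Decoupled weight decay: the decay term is added to the step directly
    # instead of being folded into the gradient g.
    w = w - lr * (m_hat / denom + weight_decay * w)
    return w, m, v, v_max


# Usage sketch: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v, v_max = 0.0, 0.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2.0 * (w - 3.0)
    w, m, v, v_max = adamw_step(w, grad, m, v, v_max, t,
                                lr=0.05, weight_decay=0.0, amsgrad=True)
print(w)
```

Keeping the decay term outside the adaptive ratio is what separates this from adding an L2 penalty to the gradient, which would be rescaled by the second-moment estimate.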

## What type of PR is it?

Improvement, Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-230

## How was this patch tested?

unit tests, emr

## How to use this feature?

Described in [tutorial](http://hivemall.incubator.apache.org/userguide/index.html)

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #175 from myui/adam_test.

3 years agoFixed ANN message and download page
Makoto Yui [Tue, 4 Dec 2018 07:13:25 +0000 (16:13 +0900)] 
Fixed ANN message and download page

3 years agoUpdate the project top page
Makoto Yui [Mon, 3 Dec 2018 09:27:36 +0000 (18:27 +0900)] 
Update the project top page

3 years agoUpdated release history
Makoto Yui [Mon, 3 Dec 2018 09:03:02 +0000 (18:03 +0900)] 
Updated release history

3 years agoMerge remote-tracking branch 'origin/v0.5.2'
Makoto Yui [Mon, 3 Dec 2018 07:32:03 +0000 (16:32 +0900)] 
Merge remote-tracking branch 'origin/v0.5.2'

3 years ago[DOC] Added workaround for a Surefire error
Makoto Yui [Wed, 21 Nov 2018 06:11:29 +0000 (15:11 +0900)] 
[DOC] Added workaround for a Surefire error

3 years ago[HIVEMALL-227-2] Updated release guide to use SHA-512
Makoto Yui [Mon, 19 Nov 2018 10:29:28 +0000 (19:29 +0900)] 
[HIVEMALL-227-2] Updated release guide to use SHA-512

3 years ago[maven-release-plugin] prepare for next development iteration v0.5.2
Makoto Yui [Mon, 19 Nov 2018 08:44:42 +0000 (17:44 +0900)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release v0.5.2-rc2 v0.5.2 v0.5.2-rc2
Makoto Yui [Mon, 19 Nov 2018 08:44:31 +0000 (17:44 +0900)] 
[maven-release-plugin] prepare release v0.5.2-rc2

3 years agoBumped up ASF parent pom version to 21 to use SHA-512 instead of SHA-1
Makoto Yui [Mon, 19 Nov 2018 08:34:02 +0000 (17:34 +0900)] 
Bumped up ASF parent pom version to 21 to use SHA-512 instead of SHA-1

3 years ago[HIVEMALL-227][DOC] Removed md5 and replace sha1 with sha512 following new ASF policy
Makoto Yui [Thu, 15 Nov 2018 09:39:58 +0000 (18:39 +0900)] 
[HIVEMALL-227][DOC] Removed md5 and replace sha1 with sha512 following new ASF policy

## What changes were proposed in this pull request?

Removed md5 and replaced sha1 with sha512, following the new ASF policy.
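
For readers who want to check a downloaded artifact against a SHA-512 checksum, a minimal Python sketch using the standard `hashlib` module might look like the following. The helper names `sha512_hex` and `matches_checksum` are hypothetical, and the assumption that the checksum text begins with the hex digest (optionally followed by a file name) may not hold for every checksum file layout.

```python
import hashlib


def sha512_hex(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large release archives fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, checksum_text):
    """Compare a file's digest with checksum text such as '<hex>  <name>'.

    Assumes the first whitespace-separated token is the hex digest.
    """
    tokens = checksum_text.split()
    if not tokens:
        return False
    return sha512_hex(path) == tokens[0].lower()
```

A caller would pass the path of the downloaded archive and the contents of its accompanying checksum file.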

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-227

Author: Makoto Yui <myui@apache.org>

Closes #173 from myui/HIVEMALL-227.

3 years agoBumped version string to 0.5.2-incubating
Makoto Yui [Thu, 15 Nov 2018 06:54:44 +0000 (15:54 +0900)] 
Bumped version string to 0.5.2-incubating

3 years agoPrepare for the next Snapshot release of v0.5.2
Makoto Yui [Thu, 15 Nov 2018 06:16:28 +0000 (15:16 +0900)] 
Prepare for the next Snapshot release of v0.5.2

3 years ago[SPARK][HOTFIX] Fix the existing test failures in spark-2.3
Takeshi Yamamuro [Wed, 14 Nov 2018 17:33:01 +0000 (02:33 +0900)] 
[SPARK][HOTFIX] Fix the existing test failures in spark-2.3

## What changes were proposed in this pull request?
This PR fixes the test failures in spark-2.3.

## How was this patch tested?
Run the existing tests.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #171 from maropu/HOTFIX-20181114.

3 years agoFix typo
Vladimir Kroz [Wed, 14 Nov 2018 06:25:19 +0000 (15:25 +0900)] 
Fix typo

## What changes were proposed in this pull request?

Fix minor typo in documentation

## What type of PR is it?

Documentation

## What is the Jira issue?

n/a

## How was this patch tested?

n/a

## How to use this feature?

n/a

## Checklist

n/a

Author: Vladimir Kroz <vkroz@users.noreply.github.com>

Closes #172 from vkroz/patch-1.

3 years agoFixed tutorial docs
Makoto Yui [Tue, 13 Nov 2018 09:29:07 +0000 (18:29 +0900)] 
Fixed tutorial docs

3 years ago[HIVEMALL-223] Add -kv_map and -vk_map option to to_ordered_list UDAF
Makoto Yui [Tue, 13 Nov 2018 09:18:35 +0000 (18:18 +0900)] 
[HIVEMALL-223] Add -kv_map and -vk_map option to to_ordered_list UDAF

## What changes were proposed in this pull request?

Add `-kv_map` and `-vk_map` option to `to_ordered_list` UDAF.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-223

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

Will be described in
http://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #170 from myui/HIVEMALL-223.

3 years agoFixed Travis CI bug [[: not found
Makoto Yui [Fri, 9 Nov 2018 06:39:35 +0000 (15:39 +0900)] 
Fixed Travis CI bug [[: not found

3 years agoFixed scalatest version used for Spark 2.3 to avoid scalatest version conflict
Makoto Yui [Wed, 7 Nov 2018 17:47:50 +0000 (02:47 +0900)] 
Fixed scalatest version used for Spark 2.3 to avoid scalatest version conflict

3 years agoFixed release guide for MAVEN_OPT
Makoto Yui [Wed, 7 Nov 2018 17:37:17 +0000 (02:37 +0900)] 
Fixed release guide for MAVEN_OPT

3 years agoUpdated Netty version to cope with NoSuchMethodError PooledByteBufAllocator.metric...
Makoto Yui [Wed, 7 Nov 2018 10:24:24 +0000 (19:24 +0900)] 
Updated Netty version to cope with NoSuchMethodError PooledByteBufAllocator.metric() for Spark v2.3

3 years agoFixed a bug introduced in the previous commit
Makoto Yui [Wed, 7 Nov 2018 09:42:09 +0000 (18:42 +0900)] 
Fixed a bug introduced in the previous commit

3 years agoRemoved unknown host
Makoto Yui [Wed, 7 Nov 2018 09:36:05 +0000 (18:36 +0900)] 
Removed unknown host

3 years agoFixed scala test for subarray UDF misusage
Makoto Yui [Wed, 7 Nov 2018 07:36:49 +0000 (16:36 +0900)] 
Fixed scala test for subarray UDF misusage

3 years agoFixed GeneralRegressorUDTFTest to cope with behavioral change where dloss is zero
Makoto Yui [Wed, 7 Nov 2018 06:41:24 +0000 (15:41 +0900)] 
Fixed GeneralRegressorUDTFTest to cope with behavioral change where dloss is zero

3 years agoFixed a bug in ArrayFlattenUDFTest
Makoto Yui [Tue, 6 Nov 2018 14:12:47 +0000 (23:12 +0900)] 
Fixed a bug in ArrayFlattenUDFTest

3 years agoFixed a possible Json deserialize bug caused by illegal Text use
Makoto Yui [Tue, 6 Nov 2018 10:42:28 +0000 (19:42 +0900)] 
Fixed a possible Json deserialize bug caused by illegal Text use

3 years agoFixed failing test
Makoto Yui [Tue, 6 Nov 2018 10:41:15 +0000 (19:41 +0900)] 
Fixed failing test

3 years agoUpdated release guide for SSL related workaround
Makoto Yui [Mon, 5 Nov 2018 08:51:13 +0000 (17:51 +0900)] 
Updated release guide for SSL related workaround

3 years agoAdded missing license header
Makoto Yui [Sat, 3 Nov 2018 07:54:04 +0000 (16:54 +0900)] 
Added missing license header

3 years agoAdded Koji to the Mentor list
Makoto Yui [Sat, 3 Nov 2018 07:46:08 +0000 (16:46 +0900)] 
Added Koji to the Mentor list

3 years agoFixed term vector space tutorial
Makoto Yui [Sat, 3 Nov 2018 07:38:47 +0000 (16:38 +0900)] 
Fixed term vector space tutorial

3 years agoFixed bm25() UDF for help message
Makoto Yui [Sat, 3 Nov 2018 07:38:13 +0000 (16:38 +0900)] 
Fixed bm25() UDF for help message