incubator-hivemall.git
2 years ago[HIVEMALL-269] Modified to use matrix4j for matrix module
Makoto Yui [Fri, 18 Oct 2019 08:42:16 +0000 (17:42 +0900)] 
[HIVEMALL-269] Modified to use matrix4j for matrix module

## What changes were proposed in this pull request?

Use matrix4j for the matrix module.

## What type of PR is it?

Hot Fix | Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-269

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #202 from myui/HIVEMALL-269.

2 years agoFixed annotations
Makoto Yui [Tue, 8 Oct 2019 07:15:24 +0000 (16:15 +0900)] 
Fixed annotations

2 years agoMoved matrix/random package to utils/random
Makoto Yui [Mon, 7 Oct 2019 07:16:19 +0000 (16:16 +0900)] 
Moved matrix/random package to utils/random

2 years agoMerged ArrayUtilsTest
Makoto Yui [Mon, 7 Oct 2019 05:44:39 +0000 (14:44 +0900)] 
Merged ArrayUtilsTest

2 years ago[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)
Makoto Yui [Fri, 4 Oct 2019 05:28:49 +0000 (14:28 +0900)] 
[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)

## What changes were proposed in this pull request?

Drop Spark DataFrame support (SparkSQL remains supported).

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-267

## How was this patch tested?

unit tests, manual tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #201 from myui/HIVEMALL-267.

2 years ago[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines
Makoto Yui [Thu, 3 Oct 2019 08:34:10 +0000 (17:34 +0900)] 
[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines

## What changes were proposed in this pull request?

Fix the default vInit, eta initialization bug in FactorizationMachines

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-268

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #200 from myui/HIVEMALL-268.

3 years ago[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees
Makoto Yui [Fri, 27 Sep 2019 18:39:01 +0000 (03:39 +0900)] 
[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees

## What changes were proposed in this pull request?

Introduce `decision_path` UDF providing tracing of decision tree prediction paths

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

to be described in the user guide

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #199 from myui/HIVEMALL-171.

3 years ago[HIVEMALL-245] Refactor RandomForest for Sparse Data handling
Makoto Yui [Fri, 13 Sep 2019 09:23:00 +0000 (18:23 +0900)] 
[HIVEMALL-245] Refactor RandomForest for Sparse Data handling

## What changes were proposed in this pull request?

Refactor RandomForest for Sparse Data handling

## What type of PR is it?

Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-245
https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #198 from myui/HIVEMALL-245.

3 years agoFixed a documentation bug
Makoto Yui [Fri, 26 Jul 2019 07:33:22 +0000 (16:33 +0900)] 
Fixed a documentation bug

3 years agoAdd test of sparse input for randomforest classifier
Makoto Yui [Thu, 18 Jul 2019 07:51:33 +0000 (16:51 +0900)] 
Add test of sparse input for randomforest classifier

3 years agoFixed a minor typo in doc
Makoto Yui [Sat, 13 Jul 2019 14:45:52 +0000 (23:45 +0900)] 
Fixed a minor typo in doc

3 years agoAdded sanity checks for training data in RandomForest
Makoto Yui [Wed, 10 Jul 2019 07:17:20 +0000 (16:17 +0900)] 
Added sanity checks for training data in RandomForest

3 years agoRefactor Matrix module for NNZ and zero value handling
Makoto Yui [Wed, 10 Jul 2019 05:58:39 +0000 (14:58 +0900)] 
Refactor Matrix module for NNZ and zero value handling

## What changes were proposed in this pull request?

Refactor Matrix module for NNZ and zero value handling.

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

no JIRA issue

## How was this patch tested?

Unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #196 from myui/refactor_randomforest.

3 years agoFixed ToC
Makoto Yui [Fri, 28 Jun 2019 16:57:48 +0000 (01:57 +0900)] 
Fixed ToC

3 years agoAdded usage for feature_binning UDF
Makoto Yui [Fri, 28 Jun 2019 16:55:39 +0000 (01:55 +0900)] 
Added usage for feature_binning UDF

3 years agoFixed a doc
Makoto Yui [Fri, 28 Jun 2019 16:30:53 +0000 (01:30 +0900)] 
Fixed a doc

3 years agoFixed feature binning documentation
Makoto Yui [Fri, 28 Jun 2019 06:43:05 +0000 (15:43 +0900)] 
Fixed feature binning documentation

3 years ago[HIVEMALL-259][DOC] Refactor feature_binning UDF
Makoto Yui [Thu, 27 Jun 2019 18:02:38 +0000 (03:02 +0900)] 
[HIVEMALL-259][DOC] Refactor feature_binning UDF

## What changes were proposed in this pull request?

Refactor feature_binning UDF and update the function usage

## What type of PR is it?

Documentation, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-259

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```
WITH extracted as (
  select
    extract_feature(feature) as index,
    extract_weight(feature) as value
  from
    input l
    LATERAL VIEW explode(features) r as feature
),
mapping as (
  select
    index,
    build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
  from
    extracted
  group by
    index
),
bins as (
   select
    to_map(index, quantiles) as quantiles
   from
    mapping
)
select
  l.features as original,
  feature_binning(l.features, r.quantiles) as features
from
  input l
  cross join bins r
```

see https://gist.github.com/myui/f943fa3ce1a7e1ac3f2dd9a7f9fa703b

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #195 from myui/HIVEMALL-259.

3 years agoFixed imports
Makoto Yui [Tue, 25 Jun 2019 12:52:12 +0000 (21:52 +0900)] 
Fixed imports

3 years ago[HIVEMALL-253-2] map_roulette UDF
Solodye [Tue, 25 Jun 2019 10:31:02 +0000 (19:31 +0900)] 
[HIVEMALL-253-2] map_roulette UDF

revise #192

Author: Makoto Yui <myui@apache.org>

Closes #193 from myui/HIVEMALL-253-2.

3 years ago[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format
Makoto Yui [Thu, 20 Jun 2019 10:35:42 +0000 (19:35 +0900)] 
[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format

## What changes were proposed in this pull request?

Add UDF to convert feature/label in Libsvm format

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-258

## How was this patch tested?

unit tests and manual tests

## How to use this feature?

```sql
-- Usage:
select to_libsvm_format(array('apple:3.4','orange:2.1'));
-- > 6284535:3.4 8104713:2.1
select to_libsvm_format(array('apple:3.4','orange:2.1'), '-features 10');
-- > 3:2.1 7:3.4
select to_libsvm_format(array('7:3.4','3:2.1'), 5.0);
-- > 5.0 3:2.1 7:3.4
```

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #194 from myui/libsvm.

3 years agoFixed a bug in document
Makoto Yui [Thu, 20 Jun 2019 07:09:16 +0000 (16:09 +0900)] 
Fixed a bug in document

3 years agoFixed the usage of min-max scaling and zscore
Makoto Yui [Wed, 19 Jun 2019 10:12:03 +0000 (19:12 +0900)] 
Fixed the usage of min-max scaling and zscore

3 years agoIncreased write buffer from 1MB to 2MB
Makoto Yui [Wed, 12 Jun 2019 08:27:24 +0000 (17:27 +0900)] 
Increased write buffer from 1MB to 2MB

3 years agoUpdate doc
Makoto Yui [Fri, 19 Apr 2019 07:16:32 +0000 (16:16 +0900)] 
Update doc

3 years ago[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja
Makoto Yui [Fri, 19 Apr 2019 07:04:01 +0000 (16:04 +0900)] 
[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja

## What changes were proposed in this pull request?

Add option to return PartOfSpeech information for `tokenize_ja` UDF.

## What type of PR is it?

Feature, Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-251

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
  r.tokens,
  r.pos,
  r.tokens[0] as token0,
  r.pos[0] as pos0
from
  tmp;
```

| tokens |pos | token0 | pos0 |
|:-:|:-:|:-:|:-:|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #191 from myui/HIVEMALL-251.

3 years ago[HIVEMALL-246] Add feature name validation in feature UDF
Makoto Yui [Sat, 13 Apr 2019 21:24:42 +0000 (06:24 +0900)] 
[HIVEMALL-246] Add feature name validation in feature UDF

## What changes were proposed in this pull request?

This PR adds feature name validation in feature UDF

`feature(name, value)` should validate that `name` does not include ":". Fail-fast behavior is preferable.
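
A minimal Java sketch of this kind of fail-fast check, assuming that ":" acts as the separator between a feature name and its value. The class `FeatureNames` and its methods are hypothetical names used only for illustration; this is not Hivemall's actual implementation.

```java
/** Hypothetical helper for illustration; not the actual Hivemall code. */
final class FeatureNames {
    private FeatureNames() {}

    /** Returns the name unchanged, or throws right away if it is unusable. */
    static String validate(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("feature name must not be null or empty");
        }
        // ':' separates name and value, so it must not appear inside the name
        if (name.indexOf(':') >= 0) {
            throw new IllegalArgumentException("feature name must not contain ':': " + name);
        }
        return name;
    }

    /** Builds a "name:value" string after validating the name. */
    static String feature(String name, double value) {
        return validate(name) + ":" + value;
    }

    public static void main(String[] args) {
        System.out.println(feature("price", 3.4));
        try {
            feature("a:b", 1.0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the name at construction time keeps a malformed `name:value` string from reaching later parsing steps, where the error would be harder to trace.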

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-246

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #190 from myui/HIVEMALL-246.

3 years ago[HIVEMALL-237-1] Add usage in ML function reference page
Makoto Yui [Sat, 13 Apr 2019 20:37:14 +0000 (05:37 +0900)] 
[HIVEMALL-237-1] Add usage in ML function reference page

## What changes were proposed in this pull request?

Add usage in ML function reference page

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-237

## How was this patch tested?

via CI

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?

Author: Makoto Yui <myui@apache.org>
Author: Makoto YUI <yuin405@gmail.com>

Closes #183 from myui/HIVEMALL-237.

3 years ago[HIVEMALL-248] UDF for Kuromoji stoptags
Makoto Yui [Sat, 13 Apr 2019 20:09:38 +0000 (05:09 +0900)] 
[HIVEMALL-248] UDF for Kuromoji stoptags

## What changes were proposed in this pull request?

In `tokenize_ja`, users need to provide stoptags, and tokens matching them are removed from the token stream. In other words, stoptags act as an "exclusive" rule.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-248

## How was this patch tested?

unit tests, functional test on EMR

## How to use this feature?

```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
```
> ["分かち書き","テスト"]

`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a useful UDF for getting the [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) with the given part-of-speech tags excluded, as seen below:

```sql
select stoptags_exclude(array("名詞-固有名詞"));
```
> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞","助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹","名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹","名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #189 from myui/HIVEMALL-248.

3 years ago[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tunin...
Makoto Yui [Fri, 12 Apr 2019 07:02:17 +0000 (16:02 +0900)] 
[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tuning tips

## What changes were proposed in this pull request?

Recommend `hive.optimize.cte.materialize.threshold=2` in Hive tuning tips

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-247

Author: Makoto Yui <myui@apache.org>

Closes #188 from myui/HIVEMALL-247.

3 years ago[HIVEMALL-250][DOC] Add tutorial for binarize_label
Makoto Yui [Fri, 12 Apr 2019 06:38:53 +0000 (15:38 +0900)] 
[HIVEMALL-250][DOC] Add tutorial for binarize_label

## What changes were proposed in this pull request?

Add tutorial for `binarize_label` UDTF

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-250

## How to use this feature?

as described in tutorial

Author: Makoto Yui <myui@apache.org>

Closes #187 from myui/HIVEMALL-250.

3 years agoAdded a unit test for PA regression
Makoto Yui [Mon, 25 Mar 2019 08:27:09 +0000 (17:27 +0900)] 
Added a unit test for PA regression

3 years agoFixed links
Makoto Yui [Mon, 18 Mar 2019 09:43:50 +0000 (18:43 +0900)] 
Fixed links

3 years agoworkaround for maven-project-info-reports-plugin errors on building site
Makoto Yui [Mon, 18 Mar 2019 09:37:04 +0000 (18:37 +0900)] 
workaround for maven-project-info-reports-plugin errors on building site

3 years agoUpdated scm tag
Makoto Yui [Mon, 18 Mar 2019 07:22:20 +0000 (16:22 +0900)] 
Updated scm tag

3 years agoExcluded JDK's tools.jar from Bytecode Version enforcer
Makoto Yui [Mon, 18 Mar 2019 06:40:35 +0000 (15:40 +0900)] 
Excluded JDK's tools.jar from Bytecode Version enforcer

3 years agoAdded Java API compatibility checks
Makoto Yui [Mon, 18 Mar 2019 05:51:58 +0000 (14:51 +0900)] 
Added Java API compatibility checks

3 years ago[HIVEMALL-242][HIVEMALL-241] Drop support for Spark 2.1 and Deprecate Java7 for packaging
Makoto Yui [Mon, 18 Mar 2019 05:14:14 +0000 (14:14 +0900)] 
[HIVEMALL-242][HIVEMALL-241] Drop support for Spark 2.1 and Deprecate Java7 for packaging

## What changes were proposed in this pull request?

- Drop support for Spark 2.1
- Require Java8 for packaging, deprecating Java7 (class file compatibility is Java7 or later)

Runtime Java compatibility: Java7 or later
Packaging/Compile-time Java compatibility: Java8 or later

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-242
https://issues.apache.org/jira/browse/HIVEMALL-241

## How was this patch tested?

unit tests, manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #186 from myui/drop-spark2_1.

3 years ago[HIVEMALL-243] Fix nominal variable handling in DecisionTree and RegressionTre
Makoto Yui [Wed, 13 Mar 2019 07:56:17 +0000 (16:56 +0900)] 
[HIVEMALL-243] Fix nominal variable handling in DecisionTree and RegressionTre

## What changes were proposed in this pull request?

For a NOMINAL variable, the maximum attribute index 'm' is used for computing splits.

This causes performance issues for sparse nominal variables, so this PR revises the handling for better performance.

https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L703
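
One way to avoid work proportional to the maximum attribute index is to count class labels only for the category values that actually occur. The Java sketch below illustrates that general idea under this assumption; `NominalCounts` and `countByValue` are hypothetical names, and the code is not taken from the patch itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not the code changed by this PR. */
final class NominalCounts {
    private NominalCounts() {}

    /**
     * Counts class labels per observed category value.
     * Cost scales with the number of distinct values seen,
     * not with the largest category index.
     */
    static Map<Integer, int[]> countByValue(int[] values, int[] labels, int numClasses) {
        Map<Integer, int[]> counts = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            // allocate a per-class counter only when a value is first seen
            counts.computeIfAbsent(values[i], v -> new int[numClasses])[labels[i]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] values = {3, 1_000_000, 3};
        int[] labels = {0, 1, 1};
        Map<Integer, int[]> counts = countByValue(values, labels, 2);
        System.out.println("distinct values: " + counts.size());
    }
}
```

With a dense `int[m][k]` table, the two distinct values in the example would still require a million rows. The map holds only two entries.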

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-243

## How was this patch tested?

- [x] manual test on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #185 from myui/HIVEMALL-243.

3 years agoApplied refactoring
Makoto Yui [Thu, 21 Feb 2019 07:11:35 +0000 (16:11 +0900)] 
Applied refactoring

3 years agoApplied formatter
Makoto Yui [Thu, 21 Feb 2019 06:59:41 +0000 (15:59 +0900)] 
Applied formatter

3 years ago[HIVEMALL-238] Fixed from_json UDF to support top-level Map object
Makoto Yui [Thu, 21 Feb 2019 06:55:39 +0000 (15:55 +0900)] 
[HIVEMALL-238] Fixed from_json UDF to support top-level Map object

## What changes were proposed in this pull request?

Fixed from_json UDF to support top-level Map object

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-238

## How was this patch tested?

unit tests, manual tests

## How to use this feature?

```sql
select
  from_json(to_json(map('one',1,'two',2)), 'map<string,int>')
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #184 from myui/HIVEMALL-238.

3 years agoFixed scala test errors
Makoto Yui [Fri, 15 Feb 2019 06:10:56 +0000 (15:10 +0900)] 
Fixed scala test errors

3 years agoFixed CI error due to a bug in unit test
Makoto Yui [Thu, 14 Feb 2019 06:07:12 +0000 (15:07 +0900)] 
Fixed CI error due to a bug in unit test

3 years agoRefined tutorial documents
Makoto Yui [Fri, 8 Feb 2019 06:10:54 +0000 (15:10 +0900)] 
Refined tutorial documents

3 years agoApplied refactoring and documentation improvement
Makoto Yui [Fri, 8 Feb 2019 06:10:29 +0000 (15:10 +0900)] 
Applied refactoring and documentation improvement

3 years agoRenamed map_index UDF to map_get
Makoto Yui [Thu, 7 Feb 2019 06:12:39 +0000 (15:12 +0900)] 
Renamed map_index UDF to map_get

3 years agoAdded usages
Makoto Yui [Wed, 6 Feb 2019 08:16:24 +0000 (17:16 +0900)] 
Added usages

3 years agoModified to_string_array to be a generic UDF
Makoto Yui [Wed, 6 Feb 2019 08:15:47 +0000 (17:15 +0900)] 
Modified to_string_array to be a generic UDF

3 years ago[HIVEMALL-236] to_json/from_json cause KryoException/NullPointerException with ArrayL...
Makoto Yui [Tue, 5 Feb 2019 08:17:37 +0000 (17:17 +0900)] 
[HIVEMALL-236] to_json/from_json cause KryoException/NullPointerException with ArrayList due to Kryo bug

## What changes were proposed in this pull request?

Avoid NPE in Kryo serialization of List object created by `Arrays.asList`.
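
For background, `Arrays.asList` does not return a `java.util.ArrayList`. It returns a private nested class, `java.util.Arrays$ArrayList`, which a serializer may be unable to handle by default. The Java sketch below shows one common workaround, copying into a plain `ArrayList` before serialization. It uses only the JDK and does not call Kryo; `ListCopyExample` and `toArrayList` are hypothetical names, and the actual patch may take a different approach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; the actual fix may differ. */
final class ListCopyExample {
    private ListCopyExample() {}

    /** Copies any List into a plain java.util.ArrayList; null stays null. */
    static <T> List<T> toArrayList(List<T> list) {
        return (list == null) ? null : new ArrayList<>(list);
    }

    public static void main(String[] args) {
        List<String> fixedSize = Arrays.asList("a", "b");
        // prints the private nested class name, not java.util.ArrayList
        System.out.println(fixedSize.getClass().getName());
        System.out.println(toArrayList(fixedSize).getClass().getName());
    }
}
```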

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-236

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #182 from myui/json_fix.

3 years ago[HIVEMALL-233-2] RandomForest regressor accepts sparse vector input
Takuya Kitazawa [Tue, 5 Feb 2019 04:55:55 +0000 (13:55 +0900)] 
[HIVEMALL-233-2] RandomForest regressor accepts sparse vector input

## What changes were proposed in this pull request?

Enable RandomForestRegressor to accept sparse vector input as RandomForestClassifier already does.

This closes #178

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-233

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
with customers as (
  select 1 as id, "male" as gender, 23 as age, "Japan" as country, 12 as num_purchases
  union all
  select 2 as id, "female" as gender, 43 as age, "US" as country, 4 as num_purchases
  union all
  select 3 as id, "other" as gender, 19 as age, "UK" as country, 2 as num_purchases
  union all
  select 4 as id, "male" as gender, 31 as age, "US" as country, 20 as num_purchases
  union all
  select 5 as id, "female" as gender, 37 as age, "Australia" as country, 9 as num_purchases
),
training as (
  select
    array_concat(
      quantitative_features(
        array("age"),
        age
      ),
      categorical_features(
        array("country", "gender"),
        country, gender
      )
    ) as features,
    num_purchases
  from
    customers
)
select
  train_randomforest_regressor(
    feature_hashing(features), -- feature vector
    num_purchases, -- target value
    '-trees 40 -seed 31' -- hyper-parameters
  )
from
  training
;
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Takuya Kitazawa <k.takuti@gmail.com>
Author: Makoto Yui <myui@apache.org>

Closes #181 from myui/HIVEMALL-233-2.

3 years ago[HIVEMALL-234] Define `EtaEstimator` default values as constants
Takuya Kitazawa [Wed, 30 Jan 2019 05:01:35 +0000 (14:01 +0900)] 
[HIVEMALL-234] Define `EtaEstimator` default values as constants

## What changes were proposed in this pull request?

Fix mismatched default values declared in `getOptions()` and `EtaEstimator`.
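
A mismatch like this typically arises when a default is written once in the option help text and again where the value is used. The Java sketch below shows the general remedy of a single shared constant. The names (`EtaDefaults`, `DEFAULT_ETA0`, `resolveEta0`) and the value `0.1f` are hypothetical and do not reflect Hivemall's actual defaults.

```java
/** Hypothetical illustration of sharing one default between help text and logic. */
final class EtaDefaults {
    private EtaDefaults() {}

    // Assumed value for illustration only; not Hivemall's actual default.
    static final float DEFAULT_ETA0 = 0.1f;

    /** Help text is derived from the constant, so it cannot drift from the logic. */
    static String helpText() {
        return "-eta0: initial learning rate [default: " + DEFAULT_ETA0 + "]";
    }

    /** Falls back to the same constant when the option is not given. */
    static float resolveEta0(String optionValue) {
        return (optionValue == null) ? DEFAULT_ETA0 : Float.parseFloat(optionValue);
    }

    public static void main(String[] args) {
        System.out.println(helpText());
        System.out.println(resolveEta0(null));
        System.out.println(resolveEta0("0.5"));
    }
}
```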

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-234

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?

Author: Takuya Kitazawa <k.takuti@gmail.com>

Closes #179 from takuti/HIVEMALL-234.

3 years ago[HIVEMALL-235] Fix a bug in expansion of array where size is zero
Makoto Yui [Wed, 30 Jan 2019 04:50:25 +0000 (13:50 +0900)] 
[HIVEMALL-235] Fix a bug in expansion of array where size is zero

## What changes were proposed in this pull request?

Fix a bug in expansion of array where size is zero.

See the following commit for details:
https://github.com/apache/incubator-hivemall/pull/178/commits/d7695d461056b21eab25465e015c582edc2b57ce
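
As a general illustration of this class of bug (an assumption about its nature, not a description of the linked commit): growing an array by multiplying its length never makes progress when the length is zero. The Java sketch below guards against that case; `ArrayGrowth` and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical illustration; not the code from the linked commit. */
final class ArrayGrowth {
    private ArrayGrowth() {}

    /** Doubling alone would keep 0 at 0, so start from at least 1. */
    static int newCapacity(int current, int minCapacity) {
        int doubled = Math.max(1, current) * 2;
        return Math.max(doubled, minCapacity);
    }

    /** Returns the same array if it is large enough, otherwise a grown copy. */
    static int[] ensureCapacity(int[] array, int minCapacity) {
        if (array.length >= minCapacity) {
            return array;
        }
        return Arrays.copyOf(array, newCapacity(array.length, minCapacity));
    }

    public static void main(String[] args) {
        int[] empty = new int[0];
        int[] grown = ensureCapacity(empty, 1);
        grown[0] = 42; // safe: the grown array has room for index 0
        System.out.println(grown.length);
    }
}
```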

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-235

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #180 from myui/HIVEMALL-235.

3 years ago[HIVEMALL-232][DOC] Fix typo in the Top-K document
Kengo Seki [Thu, 10 Jan 2019 18:33:49 +0000 (03:33 +0900)] 
[HIVEMALL-232][DOC] Fix typo in the Top-K document

## What changes were proposed in this pull request?

`DISTRIBUTE BY x CLASS SORT BY x` in the Top-K document looks like a typo, so fixing it.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-232

## How was this patch tested?

I think no test is needed since it's just a minor documentation fix.

Author: Kengo Seki <sekikn@apache.org>

Closes #177 from sekikn/HIVEMALL-232.

3 years agoFixed to update generic_func.md properly
Makoto Yui [Wed, 9 Jan 2019 07:00:53 +0000 (16:00 +0900)] 
Fixed to update generic_func.md properly

3 years ago[HIVEMALL-231] Replaced subarray UDF implementation with SubarrayUDF
Makoto Yui [Tue, 8 Jan 2019 11:02:07 +0000 (20:02 +0900)] 
[HIVEMALL-231] Replaced subarray UDF implementation with SubarrayUDF

## What changes were proposed in this pull request?

Replaced subarray UDF implementation with SubarrayUDF for backward compatibility.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-231

## How was this patch tested?

manual tests on EMR

## How to use this feature?

To be described in [userguide](http://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array).

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #176 from myui/subarray.

3 years agoMoved git repos to Gitbox
Makoto Yui [Tue, 8 Jan 2019 06:21:59 +0000 (15:21 +0900)] 
Moved git repos to Gitbox

3 years ago[HIVEMALL-214][DOC] Update userguide for General Classifier/Regressor example
Makoto Yui [Wed, 26 Dec 2018 10:15:43 +0000 (19:15 +0900)] 
[HIVEMALL-214][DOC] Update userguide for General Classifier/Regressor example

## What changes were proposed in this pull request?

Refine user guide for generic classifier/regressor and so on.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-214

## How to use this feature?

See user guide.

Author: Makoto Yui <myui@apache.org>

Closes #159 from myui/HIVEMALL-214.

3 years ago[HIVEMALL-230] Revise Optimizer Implementation
Makoto Yui [Wed, 26 Dec 2018 10:14:23 +0000 (19:14 +0900)] 
[HIVEMALL-230] Revise Optimizer Implementation

## What changes were proposed in this pull request?

Revise Optimizer implementation.

1. Revise default hyperparameters of AdaDelta and Adam.
2. Support the AdamW, Amsgrad, AdamHD, Eve, and YellowFin optimizers.

- [x] Nesterov’s Accelerated Gradient
https://arxiv.org/abs/1212.0901
- [x] Rmsprop
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. 2014. Lecture 6e: Rmsprop: Divide the gradient by a running average of its recent magnitude
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- [x] RMSpropGraves - Generating Sequences With Recurrent Neural Networks
https://arxiv.org/abs/1308.0850
- [x] Fixing Weight Decay Regularization in Adam
https://openreview.net/forum?id=rk6qdGgCZ
- [x] On the Convergence of Adam and Beyond
https://openreview.net/forum?id=ryQu7f-RZ
- [x] AdamHD (Adam with Hypergradient descent)
https://arxiv.org/pdf/1703.04782.pdf
- [x] Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
https://openreview.net/forum?id=r1WUqIceg
- [x] nadam: Adam with Nesterov momentum
https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ
http://cs229.stanford.edu/proj2015/054_report.pdf
http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
- [ ] ~YellowFin and the Art of Momentum Tuning~
https://openreview.net/forum?id=SyrGJYlRZ
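
To make the AdamW entry ("Fixing Weight Decay Regularization in Adam") concrete, here is a minimal Java sketch of one decoupled-weight-decay update for a single scalar weight. The class `AdamWStep` is hypothetical, and the hyperparameters in `main` are commonly used values, not necessarily Hivemall's revised defaults.

```java
/** Hypothetical single-weight AdamW update; not Hivemall's optimizer code. */
final class AdamWStep {
    private final double lr, beta1, beta2, eps, weightDecay;
    private double m; // first-moment estimate
    private double v; // second-moment estimate
    private int t;    // step counter

    AdamWStep(double lr, double beta1, double beta2, double eps, double weightDecay) {
        this.lr = lr;
        this.beta1 = beta1;
        this.beta2 = beta2;
        this.eps = eps;
        this.weightDecay = weightDecay;
    }

    /** Returns the updated weight for the given gradient. */
    double update(double w, double grad) {
        t++;
        m = beta1 * m + (1.0 - beta1) * grad;
        v = beta2 * v + (1.0 - beta2) * grad * grad;
        double mHat = m / (1.0 - Math.pow(beta1, t)); // bias correction
        double vHat = v / (1.0 - Math.pow(beta2, t));
        // weight decay is applied to w directly, not folded into the gradient
        return w - lr * (mHat / (Math.sqrt(vHat) + eps) + weightDecay * w);
    }

    public static void main(String[] args) {
        AdamWStep opt = new AdamWStep(0.001, 0.9, 0.999, 1e-8, 0.01);
        double w = 1.0;
        for (int i = 0; i < 3; i++) {
            w = opt.update(w, 2.0 * w); // gradient of w^2
        }
        System.out.println(w);
    }
}
```

The difference from plain Adam with L2 regularization is that the decay term `weightDecay * w` bypasses the adaptive scaling by `sqrt(vHat)`.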

## What type of PR is it?

Improvement, Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-230

## How was this patch tested?

unit tests, emr

## How to use this feature?

Described in [tutorial](http://hivemall.incubator.apache.org/userguide/index.html)

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #175 from myui/adam_test.

3 years agoFixed ANN message and download page
Makoto Yui [Tue, 4 Dec 2018 07:13:25 +0000 (16:13 +0900)] 
Fixed ANN message and download page

3 years agoUpdate the project top page
Makoto Yui [Mon, 3 Dec 2018 09:27:36 +0000 (18:27 +0900)] 
Update the project top page

3 years agoUpdated release history
Makoto Yui [Mon, 3 Dec 2018 09:03:02 +0000 (18:03 +0900)] 
Updated release history

3 years agoMerge remote-tracking branch 'origin/v0.5.2'
Makoto Yui [Mon, 3 Dec 2018 07:32:03 +0000 (16:32 +0900)] 
Merge remote-tracking branch 'origin/v0.5.2'

3 years ago[DOC] Added workaround for a Surefire error
Makoto Yui [Wed, 21 Nov 2018 06:11:29 +0000 (15:11 +0900)] 
[DOC] Added workaround for a Surefire error

3 years ago[HIVEMALL-227-2] Updated release guide to use SHA-512
Makoto Yui [Mon, 19 Nov 2018 10:29:28 +0000 (19:29 +0900)] 
[HIVEMALL-227-2] Updated release guide to use SHA-512

3 years ago[maven-release-plugin] prepare for next development iteration v0.5.2
Makoto Yui [Mon, 19 Nov 2018 08:44:42 +0000 (17:44 +0900)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release v0.5.2-rc2 v0.5.2 v0.5.2-rc2
Makoto Yui [Mon, 19 Nov 2018 08:44:31 +0000 (17:44 +0900)] 
[maven-release-plugin] prepare release v0.5.2-rc2

3 years agoBumped up ASF parent pom version to 21 to use SHA-512 instead of SHA-1
Makoto Yui [Mon, 19 Nov 2018 08:34:02 +0000 (17:34 +0900)] 
Bumped up ASF parent pom version to 21 to use SHA-512 instead of SHA-1

3 years ago[HIVEMALL-227][DOC] Removed md5 and replace sha1 with sha512 following new ASF policy
Makoto Yui [Thu, 15 Nov 2018 09:39:58 +0000 (18:39 +0900)] 
[HIVEMALL-227][DOC] Removed md5 and replace sha1 with sha512 following new ASF policy

## What changes were proposed in this pull request?

Removed md5 and replace sha1 with sha512 following new ASF policy
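
For readers who want to check a SHA-512 checksum programmatically, the Java sketch below computes one with the JDK's `MessageDigest`. It is only an illustration: `Sha512Example` is a hypothetical name, and the release guide itself may use command-line tools instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of computing a SHA-512 digest with the JDK. */
final class Sha512Example {
    private Sha512Example() {}

    /** Returns the SHA-512 digest of the data as a lowercase hex string. */
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 ships with standard JDK providers; treat absence as a setup error
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String hex = sha512Hex("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(hex.length() + " hex characters");
    }
}
```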

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-227

Author: Makoto Yui <myui@apache.org>

Closes #173 from myui/HIVEMALL-227.

3 years agoBumped version string to 0.5.2-incubating
Makoto Yui [Thu, 15 Nov 2018 06:54:44 +0000 (15:54 +0900)] 
Bumped version string to 0.5.2-incubating

3 years agoPrepare for the next Snapshot release of v0.5.2
Makoto Yui [Thu, 15 Nov 2018 06:16:28 +0000 (15:16 +0900)] 
Prepare for the next Snapshot release of v0.5.2

3 years ago[SPARK][HOTFIX] Fix the existing test failures in spark-2.3
Takeshi Yamamuro [Wed, 14 Nov 2018 17:33:01 +0000 (02:33 +0900)] 
[SPARK][HOTFIX] Fix the existing test failures in spark-2.3

## What changes were proposed in this pull request?
This PR fixes the existing test failures for spark-2.3.

## How was this patch tested?
Run the existing tests.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #171 from maropu/HOTFIX-20181114.

3 years agoFix typo
Vladimir Kroz [Wed, 14 Nov 2018 06:25:19 +0000 (15:25 +0900)] 
Fix typo

## What changes were proposed in this pull request?

Fix minor typo in documentation

## What type of PR is it?

Documentation

## What is the Jira issue?

n/a

## How was this patch tested?

n/a

## How to use this feature?

n/a

## Checklist

n/a

Author: Vladimir Kroz <vkroz@users.noreply.github.com>

Closes #172 from vkroz/patch-1.

3 years agoFixed tutorial docs
Makoto Yui [Tue, 13 Nov 2018 09:29:07 +0000 (18:29 +0900)] 
Fixed tutorial docs

3 years ago[HIVEMALL-223] Add -kv_map and -vk_map option to to_ordered_list UDAF
Makoto Yui [Tue, 13 Nov 2018 09:18:35 +0000 (18:18 +0900)] 
[HIVEMALL-223] Add -kv_map and -vk_map option to to_ordered_list UDAF

## What changes were proposed in this pull request?

Add `-kv_map` and `-vk_map` options to the `to_ordered_list` UDAF.
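
The intended effect of the two options can be sketched in plain Python. This is an illustration under assumptions, not the UDAF's implementation: it assumes the aggregate receives `(value, key)` pairs, orders them by key ascending, and that `-kv_map` returns a key-to-value map while `-vk_map` returns a value-to-key map. The function name `to_ordered` is hypothetical.

```python
def to_ordered(pairs, kv_map=False, vk_map=False):
    """Order (value, key) pairs by key and return a list or an ordered map.

    Assumed semantics: ascending key order; duplicate map keys keep the last entry.
    """
    ordered = sorted(pairs, key=lambda vk: vk[1])
    if kv_map:
        # key -> value, in key order (Python dicts preserve insertion order).
        return {k: v for v, k in ordered}
    if vk_map:
        # value -> key, in key order.
        return {v: k for v, k in ordered}
    # Default: just the values, ordered by key.
    return [v for v, _ in ordered]
```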

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-223

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

Will be described in
http://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #170 from myui/HIVEMALL-223.

3 years agoFixed Travis CI bug [[: not found
Makoto Yui [Fri, 9 Nov 2018 06:39:35 +0000 (15:39 +0900)] 
Fixed Travis CI bug [[: not found

3 years agoFixed scalatest version used for Spark 2.3 to avoid scalatest version conflict
Makoto Yui [Wed, 7 Nov 2018 17:47:50 +0000 (02:47 +0900)] 
Fixed scalatest version used for Spark 2.3 to avoid scalatest version conflict

3 years agoFixed release guide for MAVEN_OPT
Makoto Yui [Wed, 7 Nov 2018 17:37:17 +0000 (02:37 +0900)] 
Fixed release guide for MAVEN_OPT

3 years agoUpdated Netty version to cope with NoSuchMethodError PooledByteBufAllocator.metric...
Makoto Yui [Wed, 7 Nov 2018 10:24:24 +0000 (19:24 +0900)] 
Updated Netty version to cope with NoSuchMethodError PooledByteBufAllocator.metric() for Spark v2.3

3 years agoFixed a bug introduced in the previous commit
Makoto Yui [Wed, 7 Nov 2018 09:42:09 +0000 (18:42 +0900)] 
Fixed a bug introduced in the previous commit

3 years agoRemoved unknown host
Makoto Yui [Wed, 7 Nov 2018 09:36:05 +0000 (18:36 +0900)] 
Removed unknown host

3 years agoFixed scala test for subarray UDF misusage
Makoto Yui [Wed, 7 Nov 2018 07:36:49 +0000 (16:36 +0900)] 
Fixed scala test for subarray UDF misusage

3 years agoFixed GeneralRegressorUDTFTest to cope with behavioral change where dloss is zero
Makoto Yui [Wed, 7 Nov 2018 06:41:24 +0000 (15:41 +0900)] 
Fixed GeneralRegressorUDTFTest to cope with behavioral change where dloss is zero

3 years agoFixed a bug in ArrayFlattenUDFTest
Makoto Yui [Tue, 6 Nov 2018 14:12:47 +0000 (23:12 +0900)] 
Fixed a bug in ArrayFlattenUDFTest

3 years agoFixed a possible Json deserialize bug caused by illegal Text use
Makoto Yui [Tue, 6 Nov 2018 10:42:28 +0000 (19:42 +0900)] 
Fixed a possible Json deserialize bug caused by illegal Text use

3 years agoFixed failing test
Makoto Yui [Tue, 6 Nov 2018 10:41:15 +0000 (19:41 +0900)] 
Fixed failing test

3 years agoUpdated release guide for SSL related workaround
Makoto Yui [Mon, 5 Nov 2018 08:51:13 +0000 (17:51 +0900)] 
Updated release guide for SSL related workaround

3 years agoAdded missing license header
Makoto Yui [Sat, 3 Nov 2018 07:54:04 +0000 (16:54 +0900)] 
Added missing license header

3 years agoAdded Koji to the Mentor list
Makoto Yui [Sat, 3 Nov 2018 07:46:08 +0000 (16:46 +0900)] 
Added Koji to the Mentor list

3 years agoFixed term vector space tutorial
Makoto Yui [Sat, 3 Nov 2018 07:38:47 +0000 (16:38 +0900)] 
Fixed term vector space tutorial

3 years agoFixed bm25() UDF for help message
Makoto Yui [Sat, 3 Nov 2018 07:38:13 +0000 (16:38 +0900)] 
Fixed bm25() UDF for help message

3 years ago[HIVEMALL-196] Support BM25 scoring
Jackson Huang [Fri, 2 Nov 2018 10:35:13 +0000 (19:35 +0900)] 
[HIVEMALL-196] Support BM25 scoring

## What changes were proposed in this pull request?

Adds the Okapi BM25 scoring function as a UDF

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/projects/HIVEMALL/issues/HIVEMALL-196

## How was this patch tested?

1. Unit testing
2. Manual testing on Hive

## How to use this feature?
The new `okapi_bm25` function takes 5 mandatory arguments and 2 optional hyperparameters, which are passed together in a single option string (e.g. `'-k1 1.5 -b 0.75'`):

1. raw frequency count of a term in a given document
2. length of the given document
3. average length of a document in the corpus
4. number of documents in the corpus
5. number of documents containing the term, i.e., the document frequency
6. (*optional*) k1 - a hyperparameter controlling term-frequency saturation
7. (*optional*) b - a hyperparameter controlling document-length normalization
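
How the seven inputs combine can be sketched with one common formulation of Okapi BM25. This is not taken from the UDF's source, so the exact IDF variant and the default values `k1=1.2` and `b=0.75` used below are assumptions; the function name `bm25_score` is hypothetical.

```python
import math


def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one (term, document) pair with a common Okapi BM25 formulation."""
    # Inverse document frequency; the "1 +" inside the log keeps it non-negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized when b > 0.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: larger k1 lets repeated terms count for more.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
```

With `b=0` the document length has no effect, and with `tf=0` the score is zero, which matches the roles of the two hyperparameters described in the list above.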

### Step 1: Count frequency of terms
```sql
create or replace view frequency
as
select
  docid,
  word,
  count(*) as freq
from
  test_corpus_exploded
group by
  docid,
  word
;
```

### Step 2: Calculate document lengths
```sql
create or replace view doc_len
as
select
  docid, count(1) as cnt
from
  test_corpus_exploded
group by
  docid
;
```

### Step 3: Calculate document frequency
```sql
create or replace view document_frequency
as
select
  word,
  count(distinct docid) as docs
from
  test_corpus_exploded
group by
  word
;
```

### Step 4: Set number of documents
```sql
set hivevar:n_docs=3;
```

### Step 5: Use `okapi_bm25`
```sql
create or replace view bm25
as
with tmp as (
  select avg(cnt) as avgdl from doc_len
)
select
  f.docid,
  f.word,
  okapi_bm25(
    cast(f.freq as int),
    dl.cnt,
    cast(tmp.avgdl as double),
    ${n_docs},
    df.docs,
    '-k1 1.5 -b 0.75'
  ) as score
from
  frequency f
  join document_frequency df on (f.word = df.word)
  join doc_len dl on (f.docid = dl.docid)
  cross join tmp
order by
  score desc
;
```
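
To make the inputs prepared in Steps 1 to 4 concrete, they can be mirrored in plain Python on a toy corpus. This is only a sketch: the rows below are hypothetical stand-ins for `test_corpus_exploded`, and the variable names simply follow the view names above.

```python
from collections import Counter

# Hypothetical toy corpus: one (docid, word) row per token, like test_corpus_exploded.
rows = [
    (1, "hive"), (1, "mall"), (1, "hive"),
    (2, "hive"), (2, "spark"),
    (3, "mall"),
]

# Step 1: term frequency per (docid, word).
frequency = Counter(rows)
# Step 2: document length = number of tokens per docid.
doc_len = Counter(docid for docid, _ in rows)
# Step 3: document frequency = number of distinct documents containing each word.
document_frequency = Counter(word for _, word in set(rows))
# Step 4: number of documents (set to 3 via hivevar in the SQL above).
n_docs = len(doc_len)
# Average document length, the avgdl computed in the tmp CTE of Step 5.
avgdl = sum(doc_len.values()) / n_docs
```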

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Jackson Huang <huang.j@treasure-data.com>
Author: Makoto Yui <myui@apache.org>

Closes #163 from jaxony/feature/bm25.

3 years ago[HIVEMALL-222] Introduce Gradient Clipping to avoid exploding gradient to General...
Makoto Yui [Wed, 24 Oct 2018 08:20:56 +0000 (17:20 +0900)] 
[HIVEMALL-222] Introduce Gradient Clipping to avoid exploding gradient to General Classifier/Regressor

## What changes were proposed in this pull request?

Avoid [exploding gradients](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15%20Exploding%20and%20Vanishing%20Gradients.pdf) by gradient clipping (by value)
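
Clipping by value bounds each gradient component independently. A minimal sketch of the idea follows; the function name `clip_by_value` and the `threshold` parameter are hypothetical and do not reflect the option names used by the General Classifier/Regressor.

```python
def clip_by_value(gradient, threshold):
    """Clamp every gradient component into [-threshold, threshold]."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # Components already inside the range pass through unchanged.
    return [max(-threshold, min(threshold, g)) for g in gradient]
```

Unlike clipping by norm, this can change the direction of the gradient vector, because large components are truncated while small ones are left as they are.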

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-222

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #169 from myui/clipping.

3 years agoRemoved unnecessary comment
Makoto Yui [Thu, 11 Oct 2018 08:13:50 +0000 (17:13 +0900)] 
Removed unnecessary comment

4 years agoTiny optimization for PassThrough regularization
Makoto Yui [Fri, 21 Sep 2018 07:52:47 +0000 (16:52 +0900)] 
Tiny optimization for PassThrough regularization

4 years agoStatic method should be called static way
Makoto Yui [Tue, 18 Sep 2018 10:51:42 +0000 (19:51 +0900)] 
Static method should be called static way

4 years ago[HIVEMALL-219] Fixed LDA bug for single update
Makoto Yui [Tue, 18 Sep 2018 10:46:18 +0000 (19:46 +0900)] 
[HIVEMALL-219] Fixed LDA bug for single update

## What changes were proposed in this pull request?

Fixed LDA bug for single update and added unit tests

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-219

## How was this patch tested?

unit tests and manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #166 from myui/HIVEMALL-219-2.

4 years ago[HIVEMALL-219][BUGFIX] Fixed NPE in finalizeTraining()
Makoto Yui [Tue, 18 Sep 2018 09:51:33 +0000 (18:51 +0900)] 
[HIVEMALL-219][BUGFIX] Fixed NPE in finalizeTraining()

## What changes were proposed in this pull request?

Fixed NPE in finalizeTraining() when there are no training examples

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-219

## How was this patch tested?

to appear

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #165 from myui/HIVEMALL-219.

4 years agoUpdated mentor list
Makoto Yui [Tue, 18 Sep 2018 05:52:09 +0000 (14:52 +0900)] 
Updated mentor list

4 years ago[HIVEMALL-218] Fixed train_lda NPE where input row is null
Makoto Yui [Fri, 7 Sep 2018 10:19:35 +0000 (19:19 +0900)] 
[HIVEMALL-218] Fixed train_lda NPE where input row is null

## What changes were proposed in this pull request?

Fixed a NegativeArraySizeException in `train_lda` when the input is NULL

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-218

## How was this patch tested?

manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #164 from myui/HIVEMALL-218.