boomkim [Sun, 28 Nov 2021 04:08:19 +0000 (13:08 +0900)]
[HIVEMALL-317] Update documentation about Amazon EMR
## What changes were proposed in this pull request?
Update documentation about Amazon EMR.
A small change to make the bootstrap script work.
The previous script had a dead link.
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-317
## How was this patch tested?
Document update. Test not needed.
Author: boomkim <bhk3177@gmail.com>
Closes #246 from boomkim/emr_docs.
Makoto Yui [Fri, 2 Jul 2021 06:15:20 +0000 (15:15 +0900)]
[HIVEMALL-316] Improve error message for duplicate entries error in Tokenizer user dictionary
## What changes were proposed in this pull request?
Improve error message for duplicate entries error in Tokenizer user dictionary
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-316
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #245 from myui/HIVEMALL-316.
Makoto Yui [Fri, 14 May 2021 18:12:19 +0000 (03:12 +0900)]
[HIVEMALL-314] fixed Spark DDLs
## What changes were proposed in this pull request?
fixed Spark DDLs
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-314
## How was this patch tested?
manual tests
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #244 from myui/HIVEMALL-314-fix-spark-ddls.
Makoto Yui [Fri, 14 May 2021 03:35:11 +0000 (12:35 +0900)]
Fixed links
Makoto Yui [Fri, 14 May 2021 03:25:13 +0000 (12:25 +0900)]
[HIVEMALL-307][DOC] Update tokenize_ko examples
## What changes were proposed in this pull request?
Update tokenize_ko examples
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-307
Author: Makoto Yui <myui@apache.org>
Closes #243 from myui/update_tokenize_ko_example.
Makoto Yui [Thu, 6 May 2021 08:18:22 +0000 (17:18 +0900)]
[HIVEMALL-312] Changed default constructor accessor to public for Spark
## What changes were proposed in this pull request?
Changed the access modifier of the default constructor to public so that Spark can instantiate the UDAF. Previously, Spark failed with the following error:
```
Exception in thread "main" org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'hivemall.factorization.fm.FMPredictGenericUDAF': java.lang.IllegalAccessException: Class org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper can not access a member of class hivemall.factorization.fm.FMPredictGenericUDAF with modifiers "private"; line 6 pos 0
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
at java.lang.Class.newInstance(Class.java:436)
at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:220)
at org.apache.spark.sql.hive.HiveUDAFFunction.newEvaluator(hiveUDFs.scala:343)
at org.apache.spark.sql.hive.HiveUDAFFunction.org$apache$spark$sql$hive$HiveUDAFFunction$$finalHiveEvaluator$lzycompute(hiveUDFs.scala:366)
at org.apache.spark.sql.hive.HiveUDAFFunction.org$apache$spark$sql$hive$HiveUDAFFunction$$finalHiveEvaluator(hiveUDFs.scala:365)
at org.apache.spark.sql.hive.HiveUDAFFunction.dataType$lzycompute(hiveUDFs.scala:394)
at org.apache.spark.sql.hive.HiveUDAFFunction.dataType(hiveUDFs.scala:394)
at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1$$anonfun$apply$2.apply(HiveSessionCatalog.scala:85)
at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1$$anonfun$apply$2.apply(HiveSessionCatalog.scala:71)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1.apply(HiveSessionCatalog.scala:71)
at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1.apply(HiveSessionCatalog.scala:71)
at ...
```
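The failure mode in the trace above can be reproduced outside Spark. The following is a minimal sketch, not the Hivemall or Spark code: `PrivateCtorUdaf`, `PublicCtorUdaf`, and `ConstructorAccessDemo` are hypothetical names, and it uses `Constructor.newInstance()` rather than the `Class.newInstance()` call seen in the trace, since the latter is deprecated on newer JDKs. It assumes the framework instantiates the class reflectively without calling `setAccessible(true)`.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for a UDAF whose no-arg constructor is private.
class PrivateCtorUdaf {
    private PrivateCtorUdaf() {}
}

// Hypothetical stand-in for the same UDAF after the constructor is made public.
class PublicCtorUdaf {
    public PublicCtorUdaf() {}
}

class ConstructorAccessDemo {
    // Mimics a framework that instantiates a class reflectively
    // without calling setAccessible(true) first.
    static boolean canInstantiate(Class<?> cls) {
        try {
            Constructor<?> ctor = cls.getDeclaredConstructor();
            ctor.newInstance();
            return true;
        } catch (IllegalAccessException e) {
            // Raised when the caller may not access the constructor.
            return false;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(PrivateCtorUdaf.class)); // false
        System.out.println(canInstantiate(PublicCtorUdaf.class));  // true
    }
}
```

Making the constructor public is the least invasive remedy when the instantiating code belongs to a third-party framework and cannot be changed to request access.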
## What type of PR is it?
Bug Fix, Hot fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-312
## How was this patch tested?
unit tests
Author: Makoto Yui <myui@apache.org>
Closes #242 from myui/HIVEMALL-312.
Makoto Yui [Sun, 2 May 2021 06:55:46 +0000 (15:55 +0900)]
[HIVEMALL-311] Upgrade Kryo version from 2.21 to 2.24.0
## What changes were proposed in this pull request?
The xgboost4j and xgboost modules used Kryo 2.21, which has a bug in serializing generic collections. Update Kryo to 2.24.0 as a precaution.
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-311
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #241 from myui/kryo_update.
Makoto Yui [Wed, 28 Apr 2021 02:43:37 +0000 (11:43 +0900)]
[HIVEMALL-310] Remove old release artifacts and unlink them
## What changes were proposed in this pull request?
Update links for old release artifacts
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-310
Author: Makoto Yui <myui@apache.org>
Closes #240 from myui/HIVEMALL-310.
Makoto Yui [Fri, 23 Apr 2021 12:22:30 +0000 (21:22 +0900)]
[HIVEMALL-308] Relocate kryo packages in shaded jar
## What changes were proposed in this pull request?
Relocate Kryo packages in fat jar to avoid conflicts
## What type of PR is it?
Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-308
## How was this patch tested?
manual tests on EMR
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #239 from myui/relocate_kryo.
Makoto Yui [Fri, 23 Apr 2021 10:17:14 +0000 (19:17 +0900)]
[HIVEMALL-309] Enhance tokenize_ko to support stopwords and external user dict
## What changes were proposed in this pull request?
Enhance tokenize_ko to support stopwords and external user dict
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-309
## How was this patch tested?
unit tests, manual tests on EMR
## How to use this feature?
```sql
-- default stopwords (null), default stoptags (null), custom dict
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null, array('C++'));
> ["나","c++","언어","프로그래밍","언어","사랑"]
select tokenize_ko('나는 c++ 프로그래밍을 즐긴다.', '-mode discard', null, null, 'https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt');
> ["나","c++","프로그래밍","즐기"]
```
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #238 from myui/korean-enhancement.
Makoto Yui [Thu, 22 Apr 2021 14:53:10 +0000 (23:53 +0900)]
Implement Korean text tokenizer
## What changes were proposed in this pull request?
Implement Korean text tokenizer
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-307
## How was this patch tested?
unit tests and manual tests on EMR
## How to use this feature?
```sql
-- show version of lucene-analyzers-nori
select tokenize_ko();
> 8.8.2
select tokenize_ko("소설 무궁화꽃이 피었습니다.");
> ["소설","무궁","화","꽃","피"]
select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
> ["소설","무궁화","무궁","화","꽃","피"]
select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
> ["소설","무궁","화","꽃","이"]
select tokenize_ko("Hello, world.", null, "none", array(), true);
> ["h","e","l","l","o","w","o","r","l","d"]
select tokenize_ko("Hello, world.", null, "none", array(), false);
> ["hello","world"]
select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
```
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #237 from myui/korean_tokenizer.
Makoto Yui [Thu, 22 Apr 2021 12:24:33 +0000 (21:24 +0900)]
Fixed gitbook build
## What changes were proposed in this pull request?
Fixed gitbook build
## What type of PR is it?
Documentation
Author: Makoto Yui <myui@apache.org>
Closes #236 from myui/fix_gitbook.
Makoto Yui [Thu, 22 Apr 2021 03:39:21 +0000 (12:39 +0900)]
[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary
## What changes were proposed in this pull request?
Add tokenize_ja_neologd UDF that uses Neologd dictionary for Kuromoji tokenization.
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-305
## How was this patch tested?
unit tests and manual tests on EMR
## How to use this feature?
```sql
-- signature: tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)
select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #235 from myui/neologd.
Makoto Yui [Mon, 19 Apr 2021 06:39:03 +0000 (15:39 +0900)]
[HIVEMALL-304] Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)
## What changes were proposed in this pull request?
Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-304
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #234 from myui/lucene_version_up.
Makoto Yui [Thu, 15 Apr 2021 05:29:23 +0000 (14:29 +0900)]
[HIVEMALL-303] Changed compilation target to Java 8
## What changes were proposed in this pull request?
Change compilation target to Java 8 from Java 7.
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-303
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #233 from myui/HIVEMALL-303-java8.
Makoto Yui [Mon, 29 Mar 2021 07:42:58 +0000 (16:42 +0900)]
[HIVEMALL-301] Remove macros and replace them with UDF
## What changes were proposed in this pull request?
Remove macros and replace them with UDF
## What type of PR is it?
Improvement, Refactoring
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-301
## How was this patch tested?
manual tests
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #232 from myui/HIVEMALL-301-tfidf.
Makoto Yui [Thu, 27 Aug 2020 09:06:58 +0000 (18:06 +0900)]
Revised bagging doc entry
Makoto Yui [Fri, 21 Aug 2020 09:54:15 +0000 (18:54 +0900)]
Trivial doc fix
Makoto Yui [Fri, 21 Aug 2020 06:21:45 +0000 (15:21 +0900)]
Added user guide entry for bagging classifiers
Makoto Yui [Thu, 6 Aug 2020 07:05:37 +0000 (16:05 +0900)]
[HIVEMALL-297] Fixed null element handling in feature vector
## What changes were proposed in this pull request?
Fixed null element handling in feature vector
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-297
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #231 from myui/HIVEMALL-297.
Makoto Yui [Mon, 27 Jul 2020 06:36:48 +0000 (15:36 +0900)]
[HIVEMALL-296][BUGFIX] Fixed corner case NPE bug when count is zero
## What changes were proposed in this pull request?
Fixed corner case NPE bug when count is zero.
```
Caused by: java.lang.NullPointerException
at hivemall.GeneralLearnerBaseUDTF.forwardModel(GeneralLearnerBaseUDTF.java:763)
at hivemall.GeneralLearnerBaseUDTF.close(GeneralLearnerBaseUDTF.java:560)
at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
```
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-296
## How was this patch tested?
unit tests
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #230 from myui/HIVEMALL-296.
Makoto Yui [Thu, 4 Jun 2020 05:21:05 +0000 (14:21 +0900)]
[HIVEMALL-295][BUGFIX] transpose_and_dot throws UDFArgumentException for 0 rows input
## What changes were proposed in this pull request?
transpose_and_dot throws UDFArgumentException for 0 rows input.
```
WITH INPUT AS(
SELECT
ARRAY(1.0,2.0,3.0) AS X,
ARRAY(0,1) AS Y
)
SELECT
transpose_and_dot(Y,X) AS observed
FROM
INPUT
WHERE false
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.exec.UDFArgumentException
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1126)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:464)
... 15 more
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at hivemall.utils.lang.Preconditions.checkNotNull(Preconditions.java:43)
at hivemall.tools.matrix.TransposeAndDotUDAF$TransposeAndDotUDAFEvaluator.iterate(TransposeAndDotUDAF.java:172)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:192)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1117)
... 21 more
```
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-295
## How was this patch tested?
manual tests on EMR
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #229 from myui/HIVEMALL-295.
Makoto Yui [Fri, 29 May 2020 07:42:14 +0000 (16:42 +0900)]
[HIVEMALL-294] Fix XGBoost to report progress for each iteration
## What changes were proposed in this pull request?
Fix XGBoost to report progress for each iteration.
## What type of PR is it?
Improvement, Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-294
## How was this patch tested?
unit tests
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #228 from myui/HIVEMALL-294.
Makoto Yui [Tue, 3 Mar 2020 06:44:04 +0000 (15:44 +0900)]
[HIVEMALL-291] Fixed dedup behavior of to_ordered_list UDAF
## What changes were proposed in this pull request?
Fixed dedup behavior of to_ordered_list UDAF
## What type of PR is it?
Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-291
## How was this patch tested?
unit tests
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #227 from myui/HIVEMALL-291-2.
Makoto Yui [Thu, 23 Jan 2020 10:20:55 +0000 (19:20 +0900)]
[HIVEMALL-291] Support deduplication in to_ordered_list UDAF
## What changes were proposed in this pull request?
Add -dedup option to to_ordered_list.
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-291
## How was this patch tested?
manual tests on EMR
## How to use this feature?
```sql
WITH data as (
SELECT 5 as key, 'apple' as value
UNION ALL
SELECT 3 as key, 'banana' as value
UNION ALL
SELECT 4 as key, 'candy' as value
UNION ALL
SELECT 1 as key, 'donut' as value
UNION ALL
SELECT 2 as key, 'egg' as value
UNION ALL
SELECT 4 as key, 'candy' as value -- both key and value duplicates
)
select
to_ordered_list(value, key, '-k 4 -dedup -vk_map'),
to_ordered_list(value, key, '-k 4 -vk_map'),
to_ordered_list(value, key, '-k 4 -dedup'),
to_ordered_list(value, key, '-k 4')
from
data
```
> {"apple":5,"candy":4,"banana":3,"egg":2} {"apple":5,"candy":4,"banana":3} ["apple","candy","banana","egg"] ["apple","candy","candy","banana"]
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #226 from myui/HIVEMALL-291.
Makoto Yui [Tue, 21 Jan 2020 09:25:37 +0000 (18:25 +0900)]
Fixed docs
Makoto Yui [Thu, 26 Dec 2019 07:48:02 +0000 (16:48 +0900)]
[HIVEMALL-289] Add str_contains(string str, array<string> match, boolean or=true) UDF
## What changes were proposed in this pull request?
Add str_contains(string str, array<string> match, boolean or=true) UDF
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-289
## How was this patch tested?
manual tests on EMR
## How to use this feature?
```sql
select
str_contains('There are apple and orange', array('apple')),
str_contains('There are apple and orange', array('apple', 'banana'), true),
str_contains('There are apple and orange', array('apple', 'banana'), false);
> true, true, false
```
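The three results above are consistent with OR/AND matching semantics for the third argument. A minimal sketch under that reading follows; `StrContainsSketch` and `strContains` are hypothetical names, and the behavior for an empty keyword list is an assumption, since the examples do not cover it.

```java
import java.util.Arrays;
import java.util.List;

class StrContainsSketch {
    // or == true : true if ANY keyword occurs in str.
    // or == false: true only if ALL keywords occur in str.
    static boolean strContains(String str, List<String> keywords, boolean or) {
        for (String keyword : keywords) {
            boolean found = str.contains(keyword);
            if (or && found) {
                return true;
            }
            if (!or && !found) {
                return false;
            }
        }
        // Assumption: an empty keyword list is false for OR and true for AND.
        return !or;
    }

    public static void main(String[] args) {
        String text = "There are apple and orange";
        System.out.println(strContains(text, Arrays.asList("apple"), true));            // true
        System.out.println(strContains(text, Arrays.asList("apple", "banana"), true));  // true
        System.out.println(strContains(text, Arrays.asList("apple", "banana"), false)); // false
    }
}
```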
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #225 from myui/HIVEMALL-289.
Makoto Yui [Wed, 25 Dec 2019 09:30:06 +0000 (18:30 +0900)]
Fixed docs for UDF preparation
Makoto Yui [Fri, 20 Dec 2019 15:45:22 +0000 (00:45 +0900)]
Fixed a typo
Makoto Yui [Fri, 20 Dec 2019 13:05:12 +0000 (22:05 +0900)]
Added ChangeLog
Makoto Yui [Thu, 19 Dec 2019 12:31:55 +0000 (21:31 +0900)]
Replaced http with https and added verification procedure
Makoto Yui [Thu, 19 Dec 2019 10:18:04 +0000 (19:18 +0900)]
Fixed links in doc
Makoto Yui [Thu, 19 Dec 2019 08:36:51 +0000 (17:36 +0900)]
Updated download page
Makoto Yui [Thu, 19 Dec 2019 08:06:44 +0000 (17:06 +0900)]
Merge remote-tracking branch 'origin/v0.6.0'
Makoto Yui [Thu, 19 Dec 2019 05:18:25 +0000 (14:18 +0900)]
Updated copyrights holders
Makoto Yui [Thu, 12 Dec 2019 08:32:27 +0000 (17:32 +0900)]
[HIVEMALL-288] mf_predict throws SemanticException No matching method with (array<double>, array<double>, int)
## What changes were proposed in this pull request?
`mf_predict` throws SemanticException No matching method with (array<double>, array<double>, int)
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-288
## How was this patch tested?
manual tests on EMR
```sql
select
-- 3 arguments
mf_predict(array(cast(1.0 as float),cast(2.0 as float),cast(3.0 as float)), array(cast(1.0 as float),cast(2.0 as float),cast(3.0 as float)), 1),
mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 1),
mf_predict(array(cast(1.0 as DOUBLE),cast(2.0 as DOUBLE),cast(3.0 as DOUBLE)), array(cast(1.0 as DOUBLE),cast(2.0 as DOUBLE),cast(3.0 as DOUBLE)), 1),
-- 2 arguments
mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0)),
-- 4 arguments
mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 0, 0),
-- 5 arguments
mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 0, 0, 1);
```
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #224 from myui/HIVEMALL-288.
Makoto Yui [Tue, 3 Dec 2019 06:21:35 +0000 (15:21 +0900)]
Update date
Makoto Yui [Mon, 2 Dec 2019 10:25:54 +0000 (19:25 +0900)]
[DOC] update titanic random forest doc for decision_path
Makoto Yui [Thu, 28 Nov 2019 18:26:51 +0000 (03:26 +0900)]
Fixed release guide
Makoto Yui [Thu, 28 Nov 2019 16:43:53 +0000 (01:43 +0900)]
[maven-release-plugin] prepare for next development iteration
Makoto Yui [Thu, 28 Nov 2019 16:43:43 +0000 (01:43 +0900)]
[maven-release-plugin] prepare release v0.6.0-rc1
Makoto Yui [Thu, 28 Nov 2019 16:41:45 +0000 (01:41 +0900)]
Bumped version string to 0.6.0-incubating
Makoto Yui [Thu, 28 Nov 2019 07:46:02 +0000 (16:46 +0900)]
Minor refactoring and fixed function docs
Makoto Yui [Thu, 28 Nov 2019 07:11:17 +0000 (16:11 +0900)]
[HIVEMALL-159][DOC] Add documentation about One-hot encoding
## What changes were proposed in this pull request?
Add documentation about One-hot encoding
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-159
## How to use this feature?
See userguide
Author: Makoto Yui <myui@apache.org>
Closes #223 from myui/onehot_docs.
Makoto Yui [Wed, 27 Nov 2019 09:03:41 +0000 (18:03 +0900)]
[HIVEMALL-56][DOC] Add documentation about Similarity/Distance functions
## What changes were proposed in this pull request?
Add documentation about Similarity/Distance functions
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-56
Author: Makoto Yui <myui@apache.org>
Closes #222 from myui/HIVEMALL-56.
Makoto Yui [Wed, 27 Nov 2019 07:42:34 +0000 (16:42 +0900)]
[HIVEMALL-158][DOC] Refine deprecated userguide contents
## What changes were proposed in this pull request?
Refine deprecated userguide contents
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-158
Author: Makoto Yui <myui@apache.org>
Closes #221 from myui/HIVEMALL-158.
Makoto Yui [Wed, 27 Nov 2019 07:11:56 +0000 (16:11 +0900)]
[HIVEMALL-285] Add -inspect_opts option to show hyperparameters
## What changes were proposed in this pull request?
Add `-inspect_opts` option to show hyperparameters
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-285
## How was this patch tested?
manual tests on EMR
## How to use this feature?
```sql
select train_regressor(array(), 0, '-inspect_opts -optimizer adam -reg elasticnet');
FAILED: UDFArgumentException Inspected Optimizer options ...
{disable_cvtest=false, regularization=ElasticNet, loss_function=SquaredLoss, eps=1.0E-8, decay=0.0, iterations=10, eta0=0.1, l1_ratio=0.5, lambda=1.0E-4, eta=Invscaling, optimizer=adam, beta1=0.9, beta2=0.999, alpha=1.0, cv_rate=0.005, power_t=0.1}
```
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #220 from myui/HIVEMALL-285.
Makoto Yui [Tue, 26 Nov 2019 06:43:09 +0000 (15:43 +0900)]
Revised exception type
Makoto Yui [Tue, 26 Nov 2019 06:39:30 +0000 (15:39 +0900)]
Minor refactoring
Makoto Yui [Tue, 26 Nov 2019 04:54:43 +0000 (13:54 +0900)]
[HIVEMALL-283] Bump up netty version to 4.1.42.Final
## What changes were proposed in this pull request?
Bump up netty version to 4.1.42.Final
This closes #206 and closes #207
## What type of PR is it?
Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-283
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #219 from myui/HIVEMALL-283.
Makoto Yui [Mon, 25 Nov 2019 18:58:42 +0000 (03:58 +0900)]
[HIVEMALL-226] Move hivemall.fm and hivemall.mf packages under hivemall.factorization
## What changes were proposed in this pull request?
Move hivemall.fm and hivemall.mf packages under hivemall.factorization
## What type of PR is it?
Refactoring
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-226
## How was this patch tested?
unit tests and manual tests on EMR
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #218 from myui/HIVEMALL-266.
Makoto Yui [Mon, 25 Nov 2019 17:05:56 +0000 (02:05 +0900)]
Update javadoc and applied formatter
Makoto Yui [Mon, 25 Nov 2019 16:53:29 +0000 (01:53 +0900)]
[HIVEMALL-165] Fixed to accept any primitive
## What changes were proposed in this pull request?
Fix a bug where the `array_remove` UDF throws an exception when the first argument is null.
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-165
## How was this patch tested?
manual tests on EMR
## How to use this feature?
```sql
WITH data4 as (
select false as n, array(2.0, 3.0, 4.0) as nums
union all
select true as n, array(2.0, 3.0, 4.0) as nums
)
select
array_remove(if(n = true, null, nums), 2.0) as c1,
array_remove(if(n = true, null, nums), array(3.0,2.0)) as c2,
array_remove(if(n = false, null, nums), 2.0) as c3
from
data4;
> c1 c2 c3
> [3,4] [4] NULL
> NULL NULL [3,4]
select array_remove(array(2.0,2.1,3.0,4.0,2.0),2), array_remove(array(2.0,3.0,4.0),array(3,2.0));
> [2.1,3,4] [4]
SELECT array_remove(array(1,null,3),null);
> [1,3]
SELECT array_remove(array(1,null,3,null,5),null);
> [1,3,5]
SELECT array_remove(array(1,null,3),array(null));
> [1,3]
SELECT array_remove(array('aaa','bbb'),'bbb');
> ["aaa"]
SELECT array_remove(array('aaa','bbb','ccc','bbb'), array('bbb','ccc'));
> ["aaa"]
select array_remove(array(null),null);
> []
select array_remove(array(null,'bbb'),'aaa');
> [null,"bbb"]
```
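The null-handling rules visible in the examples above (a null input array yields NULL, a null target removes null elements, and a list target removes every listed element) can be sketched as follows. This is an illustration with hypothetical names (`ArrayRemoveSketch`, `remove`, `removeAll`), not the UDF's code; it compares elements with plain `equals`, so the cross-type numeric matching shown above (for example `2` against `2.0`) is deliberately out of scope.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class ArrayRemoveSketch {
    // Removes every element that appears in targets; a null input stays null.
    static <T> List<T> removeAll(List<T> input, Collection<? extends T> targets) {
        if (input == null) {
            return null;
        }
        List<T> out = new ArrayList<>();
        for (T element : input) {
            // targets must tolerate contains(null); Arrays.asList and
            // Collections.singletonList do, List.of does not.
            if (!targets.contains(element)) {
                out.add(element);
            }
        }
        return out;
    }

    // Single-target variant; the target itself may be null.
    static <T> List<T> remove(List<T> input, T target) {
        return removeAll(input, Collections.singletonList(target));
    }

    public static void main(String[] args) {
        System.out.println(remove(Arrays.asList(1, null, 3), null));      // [1, 3]
        System.out.println(remove(Arrays.asList(null, "bbb"), "aaa"));    // [null, bbb]
        System.out.println(removeAll(Arrays.asList("aaa", "bbb", "ccc", "bbb"),
                Arrays.asList("bbb", "ccc")));                            // [aaa]
    }
}
```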
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #217 from myui/HIVEMALL-165.
Makoto Yui [Mon, 25 Nov 2019 10:03:15 +0000 (19:03 +0900)]
[HIVEMALL-121] Add -libsvm formatting option to feature_hashing UDF
## What changes were proposed in this pull request?
Add `-libsvm` formatting option for `feature_hashing`
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-121
## How was this patch tested?
unit tests, manual tests on EMR
## How to use this feature?
```sql
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-libsvm');
> ["4063537:1.0","4063537:1","8459207:2.0"]
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10 -libsvm');
> ["1:2.0","7:1.0","7:1"]
```
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #216 from myui/HIVEMALL-121.
Makoto Yui [Mon, 25 Nov 2019 08:50:35 +0000 (17:50 +0900)]
[HIVEMALL-249] Fix fmeasure UDAF to support any integers
## What changes were proposed in this pull request?
Fix fmeasure UDAF to support any integers
## What type of PR is it?
Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-249
## How to use this feature?
```sql
create table data2 as
select 1.1 as truth, 0 as predicted
union all
select 0.0 as truth, 1 as predicted
union all
select 0.0 as truth, 0 as predicted
union all
select 1.0 as truth, 1 as predicted
union all
select 0.0 as truth, 1 as predicted
union all
select 0.0 as truth, 0 as predicted
;
select fmeasure(truth, predicted, '-average binary') from data2;
```
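For readers unfamiliar with the metric, the binary F-measure with beta = 1 is 2·TP / (2·TP + FP + FN). The sketch below computes it for integer 0/1 labels; `BinaryFMeasureSketch` and `f1` are hypothetical names, the sample arrays are loosely modeled on the table above with the labels treated as 0/1 integers, and returning 0.0 when there are no positives at all is an assumption rather than documented UDAF behavior.

```java
class BinaryFMeasureSketch {
    // F1 = 2*TP / (2*TP + FP + FN), with label 1 taken as the positive class.
    static double f1(int[] truth, int[] predicted) {
        if (truth.length != predicted.length) {
            throw new IllegalArgumentException("truth and predicted must have the same length");
        }
        int tp = 0;
        int fp = 0;
        int fn = 0;
        for (int i = 0; i < truth.length; i++) {
            boolean actualPositive = truth[i] == 1;
            boolean predictedPositive = predicted[i] == 1;
            if (actualPositive && predictedPositive) {
                tp++;
            } else if (predictedPositive) {
                fp++;
            } else if (actualPositive) {
                fn++;
            }
        }
        int denominator = 2 * tp + fp + fn;
        // Assumption: report 0.0 when no positives exist in either array.
        return denominator == 0 ? 0.0 : (2.0 * tp) / denominator;
    }

    public static void main(String[] args) {
        int[] truth = {1, 0, 0, 1, 0, 0};
        int[] predicted = {0, 1, 0, 1, 1, 0};
        // TP = 1, FP = 2, FN = 1, so F1 = 2 / 5.
        System.out.println(f1(truth, predicted)); // 0.4
    }
}
```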
## How was this patch tested?
manual tests on EMR
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #215 from myui/HIVEMALL-249.
Makoto Yui [Fri, 22 Nov 2019 15:56:36 +0000 (00:56 +0900)]
[HIVEMALL-276] Stable support for XGBoost v0.90
## What changes were proposed in this pull request?
- Fix xgboost module to create DMatrix from CSRMatrix
- Support xgboost v0.90 hyperparameters
- Replace xgboost4j with [xgboost-predictor](https://github.com/komiya-atsushi/xgboost-predictor-java) for prediction
- Add documentation about XGBoost
## What type of PR is it?
Refactoring, Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-276
https://issues.apache.org/jira/browse/HIVEMALL-275
https://issues.apache.org/jira/browse/HIVEMALL-279
https://issues.apache.org/jira/browse/HIVEMALL-272
https://issues.apache.org/jira/browse/HIVEMALL-27
## How to use this feature?
As described in the [user guide](http://hivemall.apache.org/userguide/index.html).
## How was this patch tested?
unit tests and manual tests on EMR
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #213 from myui/HIVEMALL-275-2.
Makoto Yui [Fri, 22 Nov 2019 14:17:11 +0000 (23:17 +0900)]
[HIVEMALL-281] Support max_by, min_by, majority_vote UDAFs
## What changes were proposed in this pull request?
Support max_by, min_by, majority_vote UDAFs
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-281
## How was this patch tested?
manual tests on EMR
## How to use this feature?
```sql
create table data1 as (
select 'jake' as name, 18 as age
union all
select 'tom' as name, 64 as age
union all
select 'lisa' as name, 32 as age
);
select
max_by(name, age) as max_name,
min_by(name, age) as min_name
from
data1;
> tom, jake
create table data2 as
select
explode(array('1', '2', '2', '2', '5', '4', '1', '2')) as k;
select
majority_vote(k) as k
from
data2;
> 2
```
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #214 from myui/HIVEMALL-281.
Makoto Yui [Mon, 11 Nov 2019 05:38:54 +0000 (14:38 +0900)]
[HOTFIX] bumped matrix4j version to 0.9.2
Makoto Yui [Fri, 1 Nov 2019 09:27:53 +0000 (18:27 +0900)]
[HIVEMALL-278] Bumped matrix4j version to v0.9.1
## What changes were proposed in this pull request?
Bumped matrix4j version to v0.9.1 because matrix4j v0.9.0 had a bug when constructing a CSRMatrix from columns given in unsorted order.
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-278
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #212 from myui/HIVEMALL-278.
Makoto Yui [Thu, 31 Oct 2019 10:58:20 +0000 (19:58 +0900)]
add missing junit dependency
Makoto Yui [Thu, 31 Oct 2019 10:17:54 +0000 (19:17 +0900)]
Added SparseDMatrixBuilder
Makoto Yui [Thu, 31 Oct 2019 10:17:31 +0000 (19:17 +0900)]
Renamed XGBoostUDTF as XGBoostBaseUDTF
Aki Ariga [Thu, 31 Oct 2019 07:44:44 +0000 (16:44 +0900)]
[HIVEMALL-274] Fix wrong column name of train_regressor() in tutorial
## What changes were proposed in this pull request?
Fix the documentation bug reported in HIVEMALL-274
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/projects/HIVEMALL/issues/HIVEMALL-274
## How was this patch tested?
N/A
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Aki Ariga <ariga@treasure-data.com>
Closes #210 from chezou/HIVEMALL-274.
Makoto Yui [Wed, 30 Oct 2019 08:59:49 +0000 (17:59 +0900)]
Added document about xgboost_version() UDF
Makoto Yui [Wed, 30 Oct 2019 07:41:21 +0000 (16:41 +0900)]
[HIVEMALL-273] Support xgboost v0.90
## What changes were proposed in this pull request?
Support xgboost v0.90
## What type of PR is it?
Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-273
## How was this patch tested?
unit tests and manual tests on EMR
## How to use this feature?
https://gist.github.com/myui/aa6e142a95ca8f995cc8e49146dbe2eb
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #209 from myui/HIVEMALL-273.
Makoto Yui [Tue, 29 Oct 2019 06:37:43 +0000 (15:37 +0900)]
[HIVEMALL-260] Remove dependencies to Scala library in xgboost classifier
## What changes were proposed in this pull request?
Remove dependencies to Scala library in xgboost classifier
## What type of PR is it?
Bug Fix, Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-260
## How was this patch tested?
manual tests on EMR
## How to use this feature?
to appear
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #205 from myui/HIVEMALL-260.
Makoto Yui [Wed, 23 Oct 2019 09:44:41 +0000 (18:44 +0900)]
Remove rand_gid/rand_gid2 macro
## What changes were proposed in this pull request?
Remove rand_gid/rand_gid2 macro
## What type of PR is it?
Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-270
Author: Makoto Yui <myui@apache.org>
Closes #204 from myui/HIVEMALL-270.
Makoto Yui [Wed, 23 Oct 2019 09:01:51 +0000 (18:01 +0900)]
[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF
## What changes were proposed in this pull request?
Introduce argmin/argmax/argsort UDF
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-261
https://issues.apache.org/jira/browse/HIVEMALL-262
## How was this patch tested?
unit tests, manual tests on EMR
## How to use this feature?
```sql
SELECT argmax(array(5,2,0,1));
> 0
SELECT array_slice(array(5,2,0,1), argmax(array(5,2,0,1)));
> 5
SELECT argmin(array(5,2,0,1));
> 2
SELECT argsort(array(5,2,0,1));
> 2, 3, 1, 0
SELECT array_slice(array(5,2,0,1), argsort(array(5,2,0,1)));
> 0, 1, 2, 5
SELECT argsort(argsort(array(5,2,0,1))), argrank(array(5,2,0,1));
> 3, 2, 0, 1
SELECT arange(5), arange(1, 5), arange(1, 5, 1), arange(0, 5, 1);
> [0,1,2,3,4] [1,2,3,4] [1,2,3,4] [0,1,2,3,4]
SELECT arange(1, 6, 2);
> 1, 3, 5
SELECT arange(-1, -6, 2);
> -1, -3, -5
SELECT argsort(array(5, 2, 0, 1)), argrank(array(5, 2, 0, 1)), argsort(argsort(array(5, 2, 0, 1)));
> [2,3,1,0] [3,2,0,1] [3,2,0,1]
```
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #197 from myui/argmax.
Makoto Yui [Mon, 21 Oct 2019 07:22:05 +0000 (16:22 +0900)]
[HIVEMALL-244] Support Java9, Java11(LTS)
## What changes were proposed in this pull request?
Support Java9, Java11(LTS)
## What type of PR is it?
Improvement | Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-244
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #203 from myui/HIVEMALL-244.
Makoto Yui [Fri, 18 Oct 2019 08:42:16 +0000 (17:42 +0900)]
[HIVEMALL-269] Modified to use matrix4j for matrix module
## What changes were proposed in this pull request?
Use matrix4j for matrix module
## What type of PR is it?
Hot Fix | Refactoring
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-269
## How was this patch tested?
unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #202 from myui/HIVEMALL-269.
Makoto Yui [Tue, 8 Oct 2019 07:15:24 +0000 (16:15 +0900)]
Fixed annotations
Makoto Yui [Mon, 7 Oct 2019 07:16:19 +0000 (16:16 +0900)]
Moved matrix/random package to utils/random
Makoto Yui [Mon, 7 Oct 2019 05:44:39 +0000 (14:44 +0900)]
Merged ArrayUtilsTest
Makoto Yui [Fri, 4 Oct 2019 05:28:49 +0000 (14:28 +0900)]
[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)
## What changes were proposed in this pull request?
Drop Spark DataFrame support (SparkSQL remains supported).
## What type of PR is it?
Hot Fix, Refactoring
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-267
## How was this patch tested?
unit tests, manual tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #201 from myui/HIVEMALL-267.
Makoto Yui [Thu, 3 Oct 2019 08:34:10 +0000 (17:34 +0900)]
[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines
## What changes were proposed in this pull request?
Fix the default vInit, eta initialization bug in FactorizationMachines
## What type of PR is it?
Bug Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-268
## How was this patch tested?
unit tests, manual tests on EMR
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #200 from myui/HIVEMALL-268.
Makoto Yui [Fri, 27 Sep 2019 18:39:01 +0000 (03:39 +0900)]
[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees
## What changes were proposed in this pull request?
Introduce `decision_path` UDF providing tracing of decision tree prediction paths
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-171
## How was this patch tested?
unit tests, manual tests on EMR
## How to use this feature?
to be described in the user guide
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #199 from myui/HIVEMALL-171.
Makoto Yui [Fri, 13 Sep 2019 09:23:00 +0000 (18:23 +0900)]
[HIVEMALL-245] Refactor RandomForest for Sparse Data handling
## What changes were proposed in this pull request?
Refactor RandomForest for Sparse Data handling
## What type of PR is it?
Refactoring
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-245
https://issues.apache.org/jira/browse/HIVEMALL-171
## How was this patch tested?
unit tests, manual tests on EMR
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #198 from myui/HIVEMALL-245.
Makoto Yui [Fri, 26 Jul 2019 07:33:22 +0000 (16:33 +0900)]
Fixed a documentation bug
Makoto Yui [Thu, 18 Jul 2019 07:51:33 +0000 (16:51 +0900)]
Add test of sparse input for randomforest classifier
Makoto Yui [Sat, 13 Jul 2019 14:45:52 +0000 (23:45 +0900)]
Fixed a minor typo in doc
Makoto Yui [Wed, 10 Jul 2019 07:17:20 +0000 (16:17 +0900)]
Added sanity checks for training data in RandomForest
Makoto Yui [Wed, 10 Jul 2019 05:58:39 +0000 (14:58 +0900)]
Refactor Matrix module for NNZ and zero value handling
## What changes were proposed in this pull request?
Refactor Matrix module for NNZ and zero value handling.
## What type of PR is it?
Hot Fix, Refactoring
## What is the Jira issue?
no JIRA issue
## How was this patch tested?
Unit tests
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #196 from myui/refactor_randomforest.
Makoto Yui [Fri, 28 Jun 2019 16:57:48 +0000 (01:57 +0900)]
Fixed ToC
Makoto Yui [Fri, 28 Jun 2019 16:55:39 +0000 (01:55 +0900)]
Added usage for feature_binning UDF
Makoto Yui [Fri, 28 Jun 2019 16:30:53 +0000 (01:30 +0900)]
Fixed a doc
Makoto Yui [Fri, 28 Jun 2019 06:43:05 +0000 (15:43 +0900)]
Fixed feature binning documentation
Makoto Yui [Thu, 27 Jun 2019 18:02:38 +0000 (03:02 +0900)]
[HIVEMALL-259][DOC] Refactor feature_binning UDF
## What changes were proposed in this pull request?
Refactor feature_binning UDF and update the function usage
## What type of PR is it?
Documentation, Refactoring
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-259
## How was this patch tested?
unit tests, manual tests on EMR
## How to use this feature?
```sql
WITH extracted as (
select
extract_feature(feature) as index,
extract_weight(feature) as value
from
input l
LATERAL VIEW explode(features) r as feature
),
mapping as (
select
index,
build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
from
extracted
group by
index
),
bins as (
select
to_map(index, quantiles) as quantiles
from
mapping
)
select
l.features as original,
feature_binning(l.features, r.quantiles) as features
from
input l
cross join bins r
```
see https://gist.github.com/myui/f943fa3ce1a7e1ac3f2dd9a7f9fa703b
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #195 from myui/HIVEMALL-259.
Makoto Yui [Tue, 25 Jun 2019 12:52:12 +0000 (21:52 +0900)]
Fixed imports
Solodye [Tue, 25 Jun 2019 10:31:02 +0000 (19:31 +0900)]
[HIVEMALL-253-2] map_roulette UDF
revise #192
Author: Makoto Yui <myui@apache.org>
Closes #193 from myui/HIVEMALL-253-2.
Makoto Yui [Thu, 20 Jun 2019 10:35:42 +0000 (19:35 +0900)]
[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format
## What changes were proposed in this pull request?
Add UDF to convert feature/label in Libsvm format
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-258
## How was this patch tested?
unit tests and manual tests
## How to use this feature?
```sql
Usage:
select to_libsvm_format(array('apple:3.4','orange:2.1'))
> 6284535:3.4 8104713:2.1
select to_libsvm_format(array('apple:3.4','orange:2.1'), '-features 10')
> 3:2.1 7:3.4
select to_libsvm_format(array('7:3.4','3:2.1'), 5.0)
> 5.0 3:2.1 7:3.4
```
## Checklist
(Please remove this section if not needed; check `x` for YES, blank for NO)
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #194 from myui/libsvm.
Makoto Yui [Thu, 20 Jun 2019 07:09:16 +0000 (16:09 +0900)]
Fixed a bug in document
Makoto Yui [Wed, 19 Jun 2019 10:12:03 +0000 (19:12 +0900)]
Fixed the usage of min-max scaling and zscore
Makoto Yui [Wed, 12 Jun 2019 08:27:24 +0000 (17:27 +0900)]
Increased write buffer from 1MB to 2MB
Makoto Yui [Fri, 19 Apr 2019 07:16:32 +0000 (16:16 +0900)]
Update doc
Makoto Yui [Fri, 19 Apr 2019 07:04:01 +0000 (16:04 +0900)]
[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja
## What changes were proposed in this pull request?
Add option to return PartOfSpeech information for `tokenize_ja` UDF.
## What type of PR is it?
Feature, Improvement
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-251
## How was this patch tested?
unit tests and manual tests on EMR
## How to use this feature?
```sql
WITH tmp as (
select
tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
r.tokens,
r.pos,
r.tokens[0] as token0,
r.pos[0] as pos0
from
tmp;
```
| tokens | pos | token0 | pos0 |
|:-:|:-:|:-:|:-:|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #191 from myui/HIVEMALL-251.
Makoto Yui [Sat, 13 Apr 2019 21:24:42 +0000 (06:24 +0900)]
[HIVEMALL-246] Add feature name validation in feature UDF
## What changes were proposed in this pull request?
This PR adds feature name validation to the `feature` UDF.
`feature(name, value)` should validate that `name` does not include ":". Fail-fast behavior is preferable.
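A minimal sketch of the intended behavior, assuming `feature(name, value)` builds a `name:value` feature string. The feature names below are hypothetical, and this PR description does not specify the exact error that is raised.

```sql
-- Accepted: the name contains no ':' separator.
select feature('height', 172.5);

-- Expected to fail fast after this change: the name contains ':',
-- which would make the resulting 'name:value' string ambiguous.
select feature('height:cm', 172.5);
```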
## What type of PR is it?
Hot Fix
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-246
## How was this patch tested?
unit tests
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #190 from myui/HIVEMALL-246.
Makoto Yui [Sat, 13 Apr 2019 20:37:14 +0000 (05:37 +0900)]
[HIVEMALL-237-1] Add usage in ML function reference page
## What changes were proposed in this pull request?
Add usage in ML function reference page
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-237
## How was this patch tested?
via CI
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
Author: Makoto Yui <myui@apache.org>
Author: Makoto YUI <yuin405@gmail.com>
Closes #183 from myui/HIVEMALL-237.
Makoto Yui [Sat, 13 Apr 2019 20:09:38 +0000 (05:09 +0900)]
[HIVEMALL-248] UDF for Kuromoji stoptags
## What changes were proposed in this pull request?
In `tokenize_ja`, users need to provide stoptags, and tokens whose part-of-speech tags match them are removed from the token stream. In other words, stoptags are an "exclusive" rule.
## What type of PR is it?
Feature
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-248
## How was this patch tested?
unit tests, functional test on EMR
## How to use this feature?
```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
```
> ["分かち書き","テスト"]
`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a useful UDF for getting the [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) with the given part-of-speech tags excluded, as seen below:
```sql
select stoptags_exclude(array("名詞-固有名詞"));
```
> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞","助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹","名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹","名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]
## Checklist
- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?
Author: Makoto Yui <myui@apache.org>
Closes #189 from myui/HIVEMALL-248.
Makoto Yui [Fri, 12 Apr 2019 07:02:17 +0000 (16:02 +0900)]
[HIVEMALL-247][DOC] Recommend hive.optimize.cte.materialize.threshold=2 in Hive tuning tips
## What changes were proposed in this pull request?
Recommend `hive.optimize.cte.materialize.threshold=2` in Hive tuning tips
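As a hedged illustration of the tip: the setting asks Hive to materialize a CTE into a temporary table once it is referenced often enough, instead of re-evaluating it at each reference. The table `train` and its columns are hypothetical, and the exact comparison against the threshold may differ between Hive versions.

```sql
-- Assumption: `train` (rowid, features) is a hypothetical table used only for illustration.
set hive.optimize.cte.materialize.threshold=2;

WITH exploded as (
  select
    rowid,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from
    train
    LATERAL VIEW explode(features) t as fv
),
stats as (
  -- first reference to `exploded`
  select
    feature,
    min(value) as min_value,
    max(value) as max_value
  from
    exploded
  group by
    feature
)
select
  l.rowid,
  l.feature,
  l.value,
  r.min_value,
  r.max_value
from
  exploded l -- second reference to `exploded`
  join stats r on (l.feature = r.feature);
```

Without materialization, a CTE such as `exploded` may be evaluated once per reference, which is costly when it wraps an expensive `LATERAL VIEW explode`.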
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-247
Author: Makoto Yui <myui@apache.org>
Closes #188 from myui/HIVEMALL-247.
Makoto Yui [Fri, 12 Apr 2019 06:38:53 +0000 (15:38 +0900)]
[HIVEMALL-250][DOC] Add tutorial for binarize_label
## What changes were proposed in this pull request?
Add tutorial for `binarize_label` UDTF
## What type of PR is it?
Documentation
## What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-250
## How to use this feature?
as described in tutorial
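For quick reference, a minimal sketch of the kind of query the tutorial covers. It assumes `binarize_label` takes a positive count, a negative count, and the remaining columns, and emits one row per counted example with a 0/1 label appended. The inline input row is made up for illustration.

```sql
-- Assumption: argument order is (positive count, negative count, columns to carry over).
WITH input as (
  select array('a:1.0','b:2.0') as features, 2 as positive, 1 as negative
)
select
  binarize_label(positive, negative, features) as (features, label)
from
  input;
```

Under that assumption, the row above would expand into two records with label 1 and one record with label 0.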
Author: Makoto Yui <myui@apache.org>
Closes #187 from myui/HIVEMALL-250.