[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary
author Makoto Yui <myui@apache.org>
Thu, 22 Apr 2021 03:39:21 +0000 (12:39 +0900)
committer Makoto Yui <myui@apache.org>
Thu, 22 Apr 2021 03:39:21 +0000 (12:39 +0900)
commit b56c477a20ef6d7be143cddc49d9f9f85e144b63
tree 5fd86ae8e4045cfffa92d01f4c68fc68573d2774
parent dc461c2c7d1f7702659acab60c0b1334990a7b17

## What changes were proposed in this pull request?

Add a `tokenize_ja_neologd` UDF that uses the Neologd dictionary for Kuromoji tokenization.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-305

## How was this patch tested?

Unit tests and manual tests on EMR.

## How to use this feature?

```sql
tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags, optional const array<string> userDict)

select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```
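
The optional arguments in the signature above are passed positionally. A minimal sketch of registering the function in a Hive session and passing a mode and a stop-word list follows. The jar path is a hypothetical placeholder, the class name is inferred from `KuromojiNEologdUDF.java` in this commit, and the stop word is chosen only for illustration; the DDL scripts under `resources/ddl/` remain the authoritative source for the function definition.

```sql
-- Hypothetical jar path: replace with the NLP module jar from your own build.
add jar /path/to/hivemall-nlp-with-dependencies.jar;

-- Class name inferred from nlp/src/main/java/hivemall/nlp/tokenizer/KuromojiNEologdUDF.java.
create temporary function tokenize_ja_neologd
  as 'hivemall.nlp.tokenizer.KuromojiNEologdUDF';

-- Arguments in order: input text, mode, stopWords.
-- The stop word "彼女" is an illustrative assumption, not a recommended setting.
select tokenize_ja_neologd(
  "彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。",
  "normal",
  array("彼女")
);
```

The trailing `stopTags` and `userDict` arguments are omitted here, so whatever defaults the UDF defines for them apply.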

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #235 from myui/neologd.
16 files changed:
.rat-excludes
bin/update_ddls.sh
dist/pom.xml
docs/gitbook/misc/funcs.md
docs/gitbook/misc/tokenizer.md
nlp/pom.xml
nlp/src/main/java/hivemall/nlp/tokenizer/KuromojiNEologdUDF.java [new file with mode: 0644]
nlp/src/main/java/hivemall/nlp/tokenizer/KuromojiUDF.java
nlp/src/main/resources/hivemall/nlp/tokenizer/tokenizer.properties [new file with mode: 0644]
nlp/src/test/java/hivemall/nlp/tokenizer/KuromojiNEologdUDFTest.java [new file with mode: 0644]
nlp/src/test/java/hivemall/nlp/tokenizer/KuromojiUDFTest.java
resources/ddl/define-additional.hive [deleted file]
resources/ddl/define-all-as-permanent.hive
resources/ddl/define-all.hive
resources/ddl/define-all.spark
tools/hivemall-docs/src/main/java/hivemall/docs/FuncsListGeneratorMojo.java