Implement Korean text tokenizer
author: Makoto Yui <myui@apache.org>
Thu, 22 Apr 2021 14:53:10 +0000 (23:53 +0900)
committer: Makoto Yui <myui@apache.org>
Thu, 22 Apr 2021 14:53:10 +0000 (23:53 +0900)
commit: 0ee33d2f59a0c5e0a47abbecfd56fb864b831462
tree: 14bbf04ea15fb7f465fa381923e0db2e4c5be683
parent: aa0591c83d09940298e60c6dea6ab997f6a3a75d

## What changes were proposed in this pull request?

Implement a Korean text tokenizer UDF, `tokenize_ko`, built on `lucene-analyzers-nori`.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-307

## How was this patch tested?

Unit tests (`TokenizeKoUDFTest`) and manual tests on EMR.

## How to use this feature?

```sql
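-- with no arguments, show the version of lucene-analyzers-nori
select tokenize_ko();
> 8.8.2

-- tokenize with the default settings
select tokenize_ko("소설 무궁화꽃이 피었습니다.");
> ["소설","무궁","화","꽃","피"]

-- 3rd argument "mixed": keep the compound ("무궁화") along with its parts
select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
> ["소설","무궁화","무궁","화","꽃","피"]

-- 4th argument: part-of-speech tags to filter out (here E and VV)
select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
> ["소설","무궁","화","꽃","이"]

-- 5th argument true: unknown words are emitted as unigrams
select tokenize_ko("Hello, world.", null, "none", array(), true);
> ["h","e","l","l","o","w","o","r","l","d"]

-- 5th argument false: unknown words are kept whole
select tokenize_ko("Hello, world.", null, "none", array(), false);
> ["hello","world"]

-- empty stop-tag list and no user dictionary: "C++" is tokenized as "c"
select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

-- 2nd argument: user dictionary; "C++" is now kept as a single token
select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]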
```
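
As a rough sketch of how the function might be applied to a table rather than to string literals, the query below explodes the token array into rows and counts term frequencies. The table `docs` and its columns `id` and `body` are hypothetical and not part of this patch; only the single-argument form of `tokenize_ko` is taken from the examples above.

```sql
-- Assumption: a table docs(id INT, body STRING) holding Korean text exists.
-- explode() turns the array returned by tokenize_ko into one row per token.
select
  word,
  count(1) as freq
from
  docs
  lateral view explode(tokenize_ko(body)) t as word
group by
  word
order by
  freq desc
limit 10;
```

The same pattern should work with the longer argument forms, for example passing a user dictionary or a custom stop-tag list, as long as those arguments are constants.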

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #237 from myui/korean_tokenizer.
15 files changed:
dist/pom.xml
docs/Dockerfile
docs/gitbook/misc/funcs.md
docs/gitbook/misc/tokenizer.md
nlp/pom.xml
nlp/src/main/java/hivemall/nlp/tokenizer/KuromojiNEologdUDF.java
nlp/src/main/java/hivemall/nlp/tokenizer/KuromojiUDF.java
nlp/src/main/java/hivemall/nlp/tokenizer/SmartcnUDF.java
nlp/src/main/java/hivemall/nlp/tokenizer/TokenizeKoUDF.java [new file with mode: 0644]
nlp/src/main/resources/hivemall/nlp/tokenizer/tokenizer.properties
nlp/src/test/java/hivemall/nlp/tokenizer/SmartcnUDFTest.java
nlp/src/test/java/hivemall/nlp/tokenizer/TokenizeKoUDFTest.java [new file with mode: 0644]
resources/ddl/define-all-as-permanent.hive
resources/ddl/define-all.hive
resources/ddl/define-all.spark