lipzhu [Mon, 24 Jan 2022 05:21:04 +0000 (10:51 +0530)]
[GRIFFIN-362] Add postgresql and oracle driver into dependencies
**What changes were proposed in this pull request?**
1. Add the Oracle and PostgreSQL JDBC drivers to the dependencies of the measure module, due to a user report:
https://issues.apache.org/jira/browse/GRIFFIN-362
2. Update the PostgreSQL JDBC driver to the latest version in the service module.
**Does this PR introduce any user-facing change?**
No.
**How was this patch tested?**
Unit Tests
Closes #597 from lipzhu/GRIFFIN-362.
Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: chitralverma <chitralverma@gmail.com>
Lipeng Zhu [Mon, 24 Jan 2022 05:06:22 +0000 (10:36 +0530)]
[GRIFFIN-369] Bug fix for avro format in Spark 2.3.x environment
**What changes were proposed in this pull request?**
The built-in Avro format was introduced in Spark 2.4.0 (https://issues.apache.org/jira/browse/SPARK-24768).
For Griffin, we still need to map the Avro format to com.databricks.spark.avro in Spark 2.3.x environments.
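A minimal sketch of the idea, assuming a SparkSession is available; the helper name and the naive version check are illustrative, not the actual Griffin code:
```
import org.apache.spark.sql.{DataFrame, SparkSession}

object AvroFormatCompat {
  // Pick the Avro data source by Spark version: the built-in "avro" source only
  // exists from Spark 2.4.0 onwards, so older environments fall back to the
  // com.databricks.spark.avro package (either source must be on the classpath).
  def readAvro(spark: SparkSession, path: String): DataFrame = {
    val avroFormat =
      if (spark.version >= "2.4") "avro" else "com.databricks.spark.avro"
    spark.read.format(avroFormat).load(path)
  }
}
```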
**Does this PR introduce any user-facing change?**
No.
**How was this patch tested?**
Unit Tests
Closes #598 from lipzhu/GRIFFIN-369.
Lead-authored-by: Lipeng Zhu <lipzhu@ebay.com>
Co-authored-by: Chitral Verma <chitralverma@gmail.com>
Co-authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: chitralverma <chitralverma@gmail.com>
Lipeng Zhu [Thu, 20 Jan 2022 06:52:53 +0000 (14:52 +0800)]
[GRIFFIN-367] For task GRIFFIN-367, update local deploy document. (#596)
* For task GRIFFIN-367, update local deploy document.
* Update doc.
Chitral Verma [Mon, 4 Oct 2021 15:12:14 +0000 (20:42 +0530)]
[GRIFFIN-365] Measure Enhancements and Stability fixes (#593)
* [GRIFFIN-365] Update pom.xml with scapegoat and other changes
* [GRIFFIN-365] Remove ban on elasticsearch-spark dependency
* [GRIFFIN-365] Measure enhancements
* [GRIFFIN-365] Fix test cases
* [GRIFFIN-365] Updates to documentation and fix for breaking tests
* [GRIFFIN-365] Revert elasticsearch changes
* Update schema_conformance.md
* Update sparksql.md
William Guo [Wed, 7 Jul 2021 11:29:49 +0000 (19:29 +0800)]
Merge pull request #590 from chitralverma/improve-mergepr-script
[GRIFFIN-360] Improvements to merge_pr.py
William Guo [Wed, 7 Jul 2021 11:29:20 +0000 (19:29 +0800)]
Merge pull request #583 from chitralverma/check-stale-pr-and-issues
[GRIFFIN-347] Setup automated workflows for greetings and stale checks
William Guo [Mon, 5 Jul 2021 12:57:43 +0000 (20:57 +0800)]
Merge pull request #591 from chitralverma/fix-measures
[GRIFFIN-358] Rewrite the Rule/ Measure implementations
chitralverma [Wed, 30 Jun 2021 06:46:34 +0000 (12:16 +0530)]
[GRIFFIN-358] Fix import
chitralverma [Fri, 11 Jun 2021 23:08:22 +0000 (04:38 +0530)]
[GRIFFIN-358] Added documentation for SchemaConformanceMeasure
chitralverma [Fri, 11 Jun 2021 22:04:44 +0000 (03:34 +0530)]
[GRIFFIN-358] Changed Metric output format and fixed test cases
chitralverma [Fri, 11 Jun 2021 19:05:43 +0000 (00:35 +0530)]
[GRIFFIN-358] Added SchemaConformance measure
chitralverma [Fri, 11 Jun 2021 15:25:39 +0000 (20:55 +0530)]
[GRIFFIN-358] Error handling and code formatting changes
chitralverma [Fri, 11 Jun 2021 05:55:42 +0000 (11:25 +0530)]
[GRIFFIN-358] Added sampling option to ProfilingMeasure
chitralverma [Fri, 4 Jun 2021 05:19:43 +0000 (10:49 +0530)]
[GRIFFIN-358] Fixed breaking test case
chitralverma [Fri, 4 Jun 2021 03:26:54 +0000 (08:56 +0530)]
[GRIFFIN-358] Added code documentation for all new measures.
chitralverma [Wed, 2 Jun 2021 22:09:03 +0000 (03:39 +0530)]
[GRIFFIN-358] Added parallelization to MeasureExecutor
chitralverma [Sat, 29 May 2021 18:37:30 +0000 (00:07 +0530)]
[GRIFFIN-358] Changes structure of Measure
chitralverma [Sat, 29 May 2021 18:22:49 +0000 (23:52 +0530)]
[GRIFFIN-358] Added test cases for Data pre proc
chitralverma [Fri, 28 May 2021 20:54:40 +0000 (02:24 +0530)]
[GRIFFIN-358] Updated Configurations for pre proc and batch all measures
chitralverma [Fri, 28 May 2021 20:46:26 +0000 (02:16 +0530)]
[GRIFFIN-358] Allow users to run old "evaluate.rule" configs as well
chitralverma [Fri, 28 May 2021 19:57:28 +0000 (01:27 +0530)]
[GRIFFIN-358] Added accuracy measure configuration guide.
chitralverma [Fri, 28 May 2021 02:05:21 +0000 (07:35 +0530)]
[GRIFFIN-358] Changed 'target' to 'ref' to clear terminology
chitralverma [Sun, 2 May 2021 21:33:52 +0000 (03:03 +0530)]
[GRIFFIN-358] Added profiling measure configuration guide.
chitralverma [Sun, 2 May 2021 17:39:43 +0000 (23:09 +0530)]
[GRIFFIN-358] Added measure configuration guide for duplication and sparkSql measures.
chitralverma [Sun, 2 May 2021 17:04:07 +0000 (22:34 +0530)]
[GRIFFIN-358] Update Duplication Measure to exclude null values
chitralverma [Sun, 2 May 2021 13:03:55 +0000 (18:33 +0530)]
[GRIFFIN-358] Added general documentation for new dimensions/ measures and completeness measure configuration guide.
chitralverma [Wed, 21 Apr 2021 17:00:45 +0000 (22:30 +0530)]
[GRIFFIN-347] Removed automation for issues as that's handled by Jira
chitralverma [Wed, 21 Apr 2021 14:56:48 +0000 (20:26 +0530)]
[GRIFFIN-347] Updated with master
chitralverma [Wed, 21 Apr 2021 14:41:27 +0000 (20:11 +0530)]
[GRIFFIN-360] Improvements to merge_pr.py
chitralverma [Wed, 7 Apr 2021 10:16:59 +0000 (15:46 +0530)]
[GRIFFIN-358] Fixed breaking test cases
chitralverma [Mon, 5 Apr 2021 17:06:35 +0000 (22:36 +0530)]
[GRIFFIN-358] Fixed formatting
chitralverma [Mon, 5 Apr 2021 11:37:00 +0000 (17:07 +0530)]
[GRIFFIN-358] Added ProfilingMeasureTest
chitralverma [Mon, 5 Apr 2021 05:26:41 +0000 (10:56 +0530)]
[GRIFFIN-358] Added DuplicationMeasureTest
chitralverma [Sun, 4 Apr 2021 17:37:48 +0000 (23:07 +0530)]
[GRIFFIN-358] Added SparkSqlMeasureTest
chitralverma [Sun, 4 Apr 2021 17:04:07 +0000 (22:34 +0530)]
[GRIFFIN-358] Added AccuracyMeasureTest
chitralverma [Sun, 4 Apr 2021 11:25:18 +0000 (16:55 +0530)]
[GRIFFIN-358] Added CompletenessMeasureTest
chitralverma [Sat, 3 Apr 2021 16:59:24 +0000 (22:29 +0530)]
[GRIFFIN-358] Merge Measure constants
chitralverma [Sat, 3 Apr 2021 15:50:19 +0000 (21:20 +0530)]
[GRIFFIN-358] New Accuracy Measure
chitralverma [Mon, 29 Mar 2021 18:37:34 +0000 (00:07 +0530)]
[GRIFFIN-358] Changes to Metric Flush process
chitralverma [Mon, 29 Mar 2021 17:38:25 +0000 (23:08 +0530)]
[GRIFFIN-358] New Duplication (Distinctness, Uniqueness) Measure
chitralverma [Sun, 28 Mar 2021 17:47:25 +0000 (23:17 +0530)]
[GRIFFIN-358] New SparkSQL Measure
chitralverma [Sun, 28 Mar 2021 15:49:58 +0000 (21:19 +0530)]
[GRIFFIN-358] New Profiling Measure
chitralverma [Wed, 24 Mar 2021 13:47:35 +0000 (19:17 +0530)]
[GRIFFIN-358] Rewrite new measure hierarchy and new completeness measure
chitralverma [Sun, 21 Mar 2021 05:06:08 +0000 (10:36 +0530)]
[GRIFFIN-358] Rewrite dataset preprocessing as SQL Queries
chitralverma [Tue, 9 Mar 2021 09:03:02 +0000 (14:33 +0530)]
[GRIFFIN-345] Support cross-version compilation for Scala and Spark dependencies
**What changes were proposed in this pull request?**
_This PR affects only the measure module._
In newer environments, especially cloud environments, the Griffin measure module may face compatibility issues due to the old Scala and Spark versions. To remedy this, the following topics are covered in this ticket:
- Cross-compilation across Scala major versions (2.11, 2.12)
- Update of the Spark version (2.4+)
- Maven profiles to build against different Scala and Spark versions
- Changes to the build strategy
This process is also used in Apache Spark to build for different versions of Scala and Hadoop.
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Via maven build process.
Closes #589 from chitralverma/cross-version-build.
Authored-by: chitralverma <chitralverma@gmail.com>
Signed-off-by: chitralverma <chitralverma@gmail.com>
yuxiaoyu [Mon, 7 Dec 2020 03:59:14 +0000 (11:59 +0800)]
Support http remote conf
In our production practice, many Griffin jobs run on YARN in cluster mode. We upload different conf files to an HTTP file server, and we also provide services that generate specific configurations for different HTTP URLs.
This PR adds support for setting HTTP URLs as the conf when submitting Griffin jobs. There is no effect on the JSON or file conf modes, and it has worked well in our production environment for a long time.
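A minimal sketch of the idea (not Griffin's exact reader; the function name is illustrative): treat the conf argument as an HTTP(S) URL when it looks like one, otherwise fall back to a local file.
```
import scala.io.Source

object ConfReader {
  // Read the raw conf content either from an HTTP(S) URL or from a local file path.
  def readConf(pathOrUrl: String): String = {
    val source =
      if (pathOrUrl.startsWith("http://") || pathOrUrl.startsWith("https://"))
        Source.fromURL(pathOrUrl)
      else
        Source.fromFile(pathOrUrl)
    try source.mkString
    finally source.close()
  }
}
```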
Author: yuxiaoyu <yuxiaoyu@bytedance.com>
Closes #587 from XiaoyuBD/support_http_url_conf.
Eugene [Wed, 2 Dec 2020 10:14:01 +0000 (03:14 -0700)]
Fix doc format glitches
Author: Eugene <liujin@apache.org>
Closes #588 from toyboxman/doc-pr.
Yuepeng [Mon, 9 Nov 2020 06:10:43 +0000 (23:10 -0700)]
[maven-release-plugin] prepare for next development iteration
Yuepeng [Mon, 9 Nov 2020 06:09:19 +0000 (23:09 -0700)]
[maven-release-plugin] prepare release griffin-0.6.0
William Guo [Sun, 8 Nov 2020 08:26:32 +0000 (13:56 +0530)]
Change connectors to connector for datasource
Closes #586 from guoyuepeng/change_connectors_to_connector_for_datasource.
Lead-authored-by: William Guo <guoyp@apache.org>
Co-authored-by: deyiyao <deyiyao@ebay.com>
Co-authored-by: ahutsunshine <ahutsunshine@gmail.com>
Signed-off-by: Chitral Verma <chitralverma@gmail.com>
William Guo [Mon, 26 Oct 2020 09:35:16 +0000 (17:35 +0800)]
update angular cli version for release issue
Author: William Guo <guoyp@apache.org>
Closes #585 from guoyuepeng/update_angular_cli_verion_for_release.
William Guo [Mon, 26 Oct 2020 03:14:23 +0000 (11:14 +0800)]
compliance fix
Author: William Guo <guoyp@apache.org>
Closes #584 from guoyuepeng/compaliancy_fix_before_release_0.7.0.
chitralverma [Mon, 21 Sep 2020 05:05:53 +0000 (10:35 +0530)]
For task GRIFFIN-347, Add workflows
ambition119 [Tue, 15 Sep 2020 01:31:34 +0000 (09:31 +0800)]
[GRIFFIN-SERVICE] use service tar.gz deploy and griffin.sh start
If we use `java -jar` to start the service, it is inconvenient to modify the configuration file, since doing so requires recompiling the jar. It is also inconvenient to stop the service.
This PR provides a `tar.gz` installation method and a shell startup script, and it has been deployed and running in our production environment.
1. ls -al service-0.6.0-SNAPSHOT
   bin
   config
   lib
2. cd service-0.6.0-SNAPSHOT/
3. ./bin/griffin.sh start
4. jps
   17860 GriffinWebApplication
5. ./bin/griffin.sh stop
   stopping 17860 of service ...
Once the service is started, we can access http://${ip}:8080

Author: ambition119 <1269223860@qq.com>
Closes #582 from ambition119/service.tar.gz.
wankunde [Mon, 17 Aug 2020 07:41:10 +0000 (13:11 +0530)]
[GRIFFIN-339] Import griffin tool for debug and run user jobs
With the Griffin tool, users can run DQ jobs from the command line.
This is helpful for users to debug and run their own DQ jobs.
Closes #581 from wankunde/measure_tools.
Authored-by: wankunde <wankunde@163.com>
Signed-off-by: chitralverma <chitralverma@gmail.com>
chitralverma [Mon, 10 Aug 2020 02:49:42 +0000 (10:49 +0800)]
[GRIFFIN-305] Standardize sink hierarchy
**What changes were proposed in this pull request?**
Currently, the implementation of `Sinks` in Griffin poses the below issues. This PR aims at fixing these issues.
- `Sinks` are based on the recursive MultiSink class, which is a sink itself but whose underlying implementation is that of a `Seq`; this causes ambiguity and isn't very useful. It has been removed.
- Some unused code like `SinkContext` has been removed.
- Data was converted from the performant DataFrame to an RDD while persisting, in both streaming and batch pipelines. A new method `sinkBatchRecords` has been added to allow operations directly on the DataFrame for batch pipelines (see the sketch after this list). Streaming still uses the old implementation, which will eventually be replaced with structured streaming.
- Refactored the methods of `Sink`: `start`/`finish` were renamed to `open`/`close`, and `jobName` was incorrectly being passed as `metricName`.
- Presently, only one instance of a sink of a given type can be defined in the env config. This does not allow configuring multiple sinks of the same type, such as HDFS or JDBC. A sink `name` has been added to the env config, and the same name is used in the job config to select which sink to use.
- Updated all sinks as per the changes above, with some additional changes to ConsoleSink.
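A rough sketch of the reshaped hierarchy; the member names and signatures below are indicative only and may not match the Griffin trait exactly:
```
import org.apache.spark.sql.DataFrame

// No recursive MultiSink: each sink stands on its own, is identified by a name
// from the env config, and batch records are sunk directly as a DataFrame
// instead of being converted to an RDD first.
trait Sink {
  val name: String                                  // referenced from the job config
  def open(applicationId: String): Unit             // was `start`
  def close(): Unit                                 // was `finish`
  def sinkMetrics(metrics: Map[String, Any]): Unit
  def sinkBatchRecords(dataset: DataFrame, key: Option[String] = None): Unit
}
```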
**Does this PR introduce any user-facing change?**
Yes. As mentioned above, the sink config has changed in env and job configs.
**How was this patch tested?**
Griffin test suite and additional unit test cases
Author: chitralverma <chitralverma@gmail.com>
Closes #575 from chitralverma/standardize-sink-hierarchy.
Eugene [Sun, 21 Jun 2020 12:13:49 +0000 (05:13 -0700)]
Fix Unit Test Issue In Measure Test Case
[GRIFFIN-329] Measure unit test cases fail on the condition of no docker image
The unit test cases try to download an ES docker image and then run the corresponding cases. If the download fails, some cases abort with exceptions. In this revision, a new flag is introduced: unless the docker image is known to always be available, those cases are excluded.
Author: Eugene <liujin@apache.org>
Closes #580 from toyboxman/Fix.
chitralverma [Thu, 4 Jun 2020 11:01:34 +0000 (19:01 +0800)]
[GRIFFIN-326] New Data Connector for Elasticsearch
**What changes were proposed in this pull request?**
This ticket proposes the following changes,
- Deprecate the current implementation in favour of the direct implementation in the official [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/sql-20) library.
- This library is built on the DataSource API available in Spark 2.2.x+ and thus brings support for filter pushdown, column pruning, unified read/write and additional optimizations.
- Many configuration options are available for ES connectivity, [check here](https://github.com/elastic/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/cfg/ConfigurationOptions.java).
- Any filters can be applied as expressions directly on the data frame and are pushed down automatically to the source (see the sketch below).
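A minimal sketch of reading through the elasticsearch-hadoop Spark SQL data source (it assumes the elasticsearch-spark dependency is on the classpath; the host, port, index and column names are placeholders):
```
import org.apache.spark.sql.SparkSession

object EsReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("es-read-sketch").getOrCreate()
    import spark.implicits._

    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.port", "9200")
      .load("index-xxx/metric")

    // Filters and projections on the DataFrame are pushed down to Elasticsearch.
    df.filter($"col_a" > 10).select("col_a", "col_b").show()
    spark.stop()
  }
}
```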
**Does this PR introduce any user-facing change?**
Yes. As mentioned above, the old connector has been deprecated and config structure for Elasticsearch data connector has changed now.
**How was this patch tested?**
Griffin test suite and additional unit test cases
Author: chitralverma <chitralverma@gmail.com>
Closes #569 from chitralverma/new-elastic-search-connector.
DongfangLu [Wed, 13 May 2020 10:50:54 +0000 (18:50 +0800)]
Upgrade UI packages for jquery
Upgrade jquery
Author: DongfangLu <ayludongfang@163.com>
Closes #571 from ludongfang/ui_package_upgrade.
Yu [Wed, 11 Mar 2020 15:15:19 +0000 (23:15 +0800)]
[GRIFFIN-316] Fix job exception handling
**What changes were proposed in this pull request?**
Currently we use a Try instance to represent the result of a DQ job, whether it succeeded or failed. But because we only wrap the Boolean result in a Try at the outermost level, underlying failures cannot be caught, and the job always returns Success even when an exception occurs.
This modifies all the underlying execute/doExecute methods of a DQ job to handle exceptions with Try instances, so that failures are passed properly to users when things go wrong.
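A minimal sketch of the pattern, using illustrative trait and method names rather than the exact Griffin classes:
```
import scala.util.Try

trait DQStepSketch {
  // Each step reports its own success or failure instead of a bare Boolean.
  def doExecute(): Try[Boolean]

  // Wrapping at every level keeps the real exception, so the job result is a
  // Failure when anything inside goes wrong rather than a misleading Success.
  def execute(): Try[Boolean] = Try(doExecute()).flatten
}
```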
**Does this PR introduce any user-facing change?**
No.
**How was this patch tested?**
Griffin test suite.
Author: Yu <yu.liu003@gmail.com>
Closes #562 from PnPie/exception_catch.
yuxiaoyu [Tue, 11 Feb 2020 12:15:53 +0000 (20:15 +0800)]
optimize get metric maps in 'MetricWriteStep'
**Why/What changes?**
In 'MetricWriteStep.getMetricMaps()' the dataframe was transformed to a JSON RDD, then collected, and then transformed to a Seq[Map].
This is not elegant and is hard to understand. A more optimized way is to collect it first and then transform it to a Seq[Map] directly.
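A minimal sketch of the optimized approach (not the exact Griffin method): collect the rows once and convert each Row to a Map directly, instead of going through toJSON, RDD collect and JSON parsing.
```
import org.apache.spark.sql.DataFrame

object MetricMapsSketch {
  // Collect once, then map each Row to a column-name -> value Map.
  def getMetricMaps(df: DataFrame): Seq[Map[String, Any]] = {
    val fields = df.schema.fieldNames
    df.collect().map(row => row.getValuesMap[Any](fields)).toSeq
  }
}
```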
We have tested this with our DQ cases and it works well.
Author: yuxiaoyu <yuxiaoyu@bytedance.com>
Closes #566 from XiaoyuBD/optimizeMetricWriteGetMaps.
chitralverma [Mon, 10 Feb 2020 09:09:29 +0000 (17:09 +0800)]
[GRIFFIN-323] Refactor configuration Data Source Connector
**What changes were proposed in this pull request?**
This ticket proposes the following changes,
- remove 'version' from 'DataConnectorParam', as it is not used anywhere in the codebase.
- change 'connectors' from an array to a single JSON object; since a data source named X may only be of one type (hive, file, etc.), the connector field should not be an array (see the sketch after this list).
- rename 'connectors' to 'connector'
- update the existing config files and documentation for reference
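A hypothetical before/after sketch of the config model change; the case class definitions are illustrative and may not match Griffin's actual parameter classes.
```
case class DataConnectorParam(conType: String,
                              dataFrameName: String,
                              config: Map[String, Any])

// Before: a data source held an array of connectors.
// case class DataSourceParam(name: String, connectors: List[DataConnectorParam])

// After: a data source holds exactly one connector, and 'version' is gone.
case class DataSourceParam(name: String,
                           baseline: Boolean,
                           connector: DataConnectorParam)
```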
**Does this PR introduce any user-facing change?**
Yes. As mentioned above, the config structure has changed now.
**How was this patch tested?**
Griffin test suite.
Author: chitralverma <chitralverma@gmail.com>
Closes #568 from chitralverma/refactor-data-connector-config.
yuxiaoyu [Sat, 8 Feb 2020 08:15:03 +0000 (16:15 +0800)]
[GRIFFIN-322] Add SQL mode for ES connector
As described in [GRIFFIN-322](https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-322?filter=allopenissues), we want to add a SQL mode for the ES connector.
**The SQL mode is more efficient and user-friendly.**
Current mode config:
{ "class": "org.apache.griffin.measure.datasource.connector.batch.ElasticSearchGriffinDataConnector",
"index": "index-xxx",
"type": "metric",
"host": "xxxxxxxxxx",
"port": "xxxx",
"fields": ["col_a", "col_b", "col_c"],
"size": 100}
SQL mode config:
{ "class": "org.apache.griffin.measure.datasource.connector.batch.ElasticSearchGriffinDataConnector",
"sql.mode": true,
"host": "xxxxxxxxxx",
"port": "xxxx",
"sql": "select col_a, col_b, col_c from index-xx limit 100"}
Compared with the current mode, the SQL mode can also support column types other than numeric types.
Author: yuxiaoyu <yuxiaoyu@bytedance.com>
Closes #567 from XiaoyuBD/enrichEsConnectorAddSqlMode.
chitralverma [Thu, 16 Jan 2020 07:20:21 +0000 (15:20 +0800)]
[GRIFFIN-317] Define guidelines for Griffin Project Improvement Proposals (GPIP)
**What changes were proposed in this pull request?**
Taking inspiration from Apache Spark, this ticket aims to define guidelines for Griffin Project Improvement Proposals (GPIP).
The purpose of a GPIP is to inform and involve the user community in major improvements to the Apache Griffin codebase throughout the development process, to increase the likelihood that user needs are met.
A GPIP aims to discuss the design and implementation of major features and changes in a collaborative manner. Such proposals should not be small or incremental changes, as those can be resolved through the normal Jira process.
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Not Applicable
Author: chitralverma <chitralverma@gmail.com>
Closes #563 from chitralverma/griffin-pip-template.
neveljkovic [Mon, 6 Jan 2020 09:17:36 +0000 (17:17 +0800)]
[GRIFFIN-318] Replace all YYYY with yyyy in all user guides and examples
https://issues.apache.org/jira/browse/GRIFFIN-318
Replace YYYY with yyyy and DD with dd
Author: neveljkovic <nveljkovic@plume.com>
Closes #565 from neveljkovic/GRIFFIN-318.
chitralverma [Mon, 6 Jan 2020 09:12:03 +0000 (17:12 +0800)]
[GRIFFIN-319] Deprecate old Data Connectors
**What changes were proposed in this pull request?**
This ticket aims to inform users of the deprecated data source connectors.
Deprecated connectors:
- MySqlDataConnector in favour of JDBCBasedDataConnector
- AvroBatchDataConnector in favour of FileBasedDataConnector
- TextDirBatchDataConnector in favour of FileBasedDataConnector
The documentation is also updated to reference the new connectors.
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Not Applicable
Author: chitralverma <chitralverma@gmail.com>
Closes #564 from chitralverma/deprecate-old-data-connectors.
tusharpatil20 [Tue, 31 Dec 2019 02:28:00 +0000 (10:28 +0800)]
[GRIFFIN-315] Adding JDBC based data connector
**What changes were proposed in this pull request?**
A JDBC-based data connector to read data from different JDBC-based data sources.
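A minimal sketch of the kind of Spark JDBC read such a connector wraps; the URL, table name and credentials are placeholders.
```
import org.apache.spark.sql.SparkSession

object JdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

    // The matching JDBC driver jar (e.g. PostgreSQL) must be on the classpath.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/griffin_db")
      .option("dbtable", "public.users")
      .option("user", "griffin")
      .option("password", "secret")
      .load()

    df.show()
    spark.stop()
  }
}
```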
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Griffin test suite.
Author: tusharpatil20 <tushargp20@gmail.com>
Closes #561 from tusharpatil20/JDBCBased-source-connector.
chitralverma [Wed, 25 Dec 2019 02:39:11 +0000 (10:39 +0800)]
[GRIFFIN-312] Code Style Standardization
**What changes were proposed in this pull request?**
This PR targets the following,
- fix the various warnings during build and in source code,
- perform code formatting as per a standard style,
- fix scalastyle integration
Link: https://github.com/apache/spark/blob/master/dev/.scalafmt.conf
Since ScalaStyle targets Scala source code only, it should be part of the measure module only. The current misconfiguration is also suppressing formatting errors.
Scalafmt is used for code formatting.
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Griffin test suite.
Author: chitralverma <chitralverma@gmail.com>
Closes #560 from chitralverma/code-style-standardization.
tusharpatil [Mon, 16 Dec 2019 13:22:20 +0000 (21:22 +0800)]
Enum based configs
**What changes were proposed in this pull request?**
All the predefined types (`DQTypes, DSLTypes, FlattenType, OutputType, ProcessType, SinkType and WriteMode`) are compared with the config using a regex-based approach, which adds unnecessary overhead in terms of execution time and maintainability.
This PR uses a predefined enum-based approach instead of the regex-based one to provide the same functionality.
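A minimal sketch of the enum-based idea (not the exact Griffin enums): resolve a config string through a lookup instead of matching it against regexes.
```
object ProcessType extends Enumeration {
  type ProcessType = Value
  val Batch, Streaming = Value

  // Case-insensitive lookup; returns None for an unknown config value.
  def withNameOpt(s: String): Option[ProcessType] =
    values.find(_.toString.equalsIgnoreCase(s.trim))
}

// ProcessType.withNameOpt("batch")   => Some(Batch)
// ProcessType.withNameOpt("unknown") => None
```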
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Griffin test suite.
Author: tusharpatil <tus>
Author: tusharpatil20 <35911427+tusharpatil20@users.noreply.github.com>
Closes #558 from tusharpatil20/enum-based-configs.
wankunde [Fri, 13 Dec 2019 13:25:14 +0000 (21:25 +0800)]
[GRIFFIN-310] Unified scala code style and enable scala code style checking by default
Griffin has more and more contributors, so we need a unified code style and to enable code style checking by default.
Author: wankunde <wankunde@163.com>
Closes #559 from wankunde/scalastyle.
wankunde [Sat, 30 Nov 2019 05:08:12 +0000 (13:08 +0800)]
[GRIFFIN-301] Update custom data connectors to have the same parameters as built-in data connectors
Currently, custom data connectors take different parameters from built-in data connectors, which confuses users.
For example : https://issues.apache.org/jira/browse/GRIFFIN-300
Author: wankunde <wankunde@163.com>
Closes #556 from wankunde/custom_data_connector.
chitralverma [Sat, 30 Nov 2019 05:01:30 +0000 (13:01 +0800)]
[GRIFFIN-304] Eliminate older contexts
**What changes were proposed in this pull request?**
As SparkSession is a direct replacement for SparkContext, SQLContext and HiveContext, there is no need to pass or instantiate them. If any of the older contexts are needed, they can be derived from the SparkSession.
This issue aims to eliminate dependency on older Contexts in favour of SparkSession.
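A minimal sketch showing that everything the older contexts provided is reachable from a single SparkSession:
```
import org.apache.spark.sql.SparkSession

object SessionOnlySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("griffin-job")
      .enableHiveSupport()              // covers what HiveContext used to provide
      .getOrCreate()

    val sc = spark.sparkContext         // replaces SparkContext
    sc.setLogLevel("WARN")

    spark.sql("SELECT 1 AS ok").show()  // replaces SQLContext/HiveContext usage
    spark.stop()
  }
}
```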
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Griffin test suite.
Author: chitralverma <chitralverma@gmail.com>
Closes #557 from chitralverma/eliminate-older-contexts.
wankunde [Mon, 25 Nov 2019 11:32:00 +0000 (19:32 +0800)]
Bug fix for reflecting a custom sink object
Bug fix for reflecting a custom sink object.
Author: wankunde <wankunde@163.com>
Closes #551 from wankunde/custom_sink.
chitralverma [Thu, 21 Nov 2019 01:26:44 +0000 (09:26 +0800)]
[GRIFFIN-297] Allow support for additional file based data sources
**What changes were proposed in this pull request?**
The PR extends the current support beyond just Avro and Text to various file based data sources (Parquet, ORC, etc.).
- Allows users to specify additional file based data sources like Parquet, CSV, TSV, ORC, etc.
- Allows data to be read directly from stand-alone files as well as directories, in both local and distributed file systems.
- Allows users to specify the schema directly through options (useful for CSV/TSV types).
A sample config looks like,
```
{
  "name": "source",
  "baseline": true,
  "connectors": [
    {
      "type": "file",
      "version": "1.7",
      "config": {
        "format": "parquet",
        "options": {
          "k1": "v1",
          "k2": "v2"
        },
        "paths": [
          "/home/chitral/path/to/source/",
          "/home/chitral/path/to/test.parquet"
        ]
      }
    }
  ]
}
```
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Griffin test suite. Some additional unit tests have also been added.
Author: chitralverma <chitralverma@gmail.com>
Closes #555 from chitralverma/allow_file_based_batch_connectors.
wankunde [Sat, 16 Nov 2019 03:11:34 +0000 (11:11 +0800)]
[GRIFFIN-298] add CompletenessExpr2DQSteps test case
Add some test cases for the CompletenessExpr2DQSteps transform, along with some small code optimizations.
Author: wankunde <wankunde@163.com>
Closes #550 from wankunde/CompletenessExpr2DQSteps.
wankunde [Thu, 14 Nov 2019 01:07:33 +0000 (09:07 +0800)]
[GRIFFIN-299] Add oracle jdk8 support in travis build phase
As Ubuntu Xenial has become the default Travis CI build environment, the build may fail.
The workaround is to either add `dist: trusty` to your .travis.yml file or use `openjdk8`.
Author: wankunde <wankunde@163.com>
Closes #552 from wankunde/travis.
wankunde [Fri, 1 Nov 2019 13:56:28 +0000 (21:56 +0800)]
[GRIFFIN-295] Limit the memory used by test case
The container memory size in Travis is 3G, but our test cases always use more than 3G of memory, so `Cannot allocate memory` errors are thrown.
```
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000fe980000, 23592960, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 23592960 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/travis/build/apache/griffin/measure/hs_err_pid11948.log
# [ timer expired, abort... ]
```
There are two kinds of programs in our tests: the Maven main program and the tests run by maven-surefire-plugin and scalatest-maven-plugin.
If the memory is unlimited, test cases will occupy as much memory as possible, especially Spark jobs.
Spark jobs will not free the memory until a full GC occurs, even if we have stopped the Spark context, so we need to limit the memory used by test cases.
We can limit the Maven memory by setting `export MAVEN_OPTS=" -Xmx1024m -XX:ReservedCodeCacheSize=128m"`, and we can limit the memory used by Spark job tests by configuring the maven-surefire-plugin and scalatest-maven-plugin.
For example:
Before limiting the memory, the Maven program occupies 1.5G and the Spark job occupies 1.8G.
<img width="1153" alt="1" src="https://user-images.githubusercontent.com/3626747/67956554-40108e00-fc2f-11e9-83de-d0840fb42cb7.png">
<img width="1150" alt="2" src="https://user-images.githubusercontent.com/3626747/67956567-46066f00-fc2f-11e9-8a73-6d141be28e70.png">
After limiting the memory, the Maven program occupies 1G and the Spark job occupies 1G.
<img width="1142" alt="3" src="https://user-images.githubusercontent.com/3626747/67956579-4999f600-fc2f-11e9-9cd4-9032966ca923.png">
<img width="1139" alt="4" src="https://user-images.githubusercontent.com/3626747/67956586-4dc61380-fc2f-11e9-800b-1d26d637a479.png">
Author: wankunde <wankunde@163.com>
Closes #546 from wankunde/testcase_memory_limit.
Zhao Li [Fri, 1 Nov 2019 13:36:02 +0000 (21:36 +0800)]
[GRIFFIN-294] bugfix for completeness enumeration wrong sql
If 'hive_none' is the only value in the enumeration values list, the generated SQL is wrong. This updates the code to fix the bug.
Author: Zhao Li <mrlzbebop@gmail.com>
Author: Zhao <LittleZhao@users.noreply.github.com>
Closes #545 from LittleZhao/griffin-294.
ahutsunshine [Thu, 24 Oct 2019 11:30:35 +0000 (19:30 +0800)]
fix test failures
1. fix hive connect failure
2. improve test run time
Author: ahutsunshine <ahutsunshine@gmail.com>
Closes #543 from ahutsunshine/master.
William Guo [Tue, 22 Oct 2019 13:30:00 +0000 (21:30 +0800)]
remove -Xms500m -Xmx1g -XX:MaxPermSize=256m temporarily
Author: William Guo <guoyp@apache.org>
Closes #544 from guoyuepeng/fix_maven_memory_opt.
neveljkovic [Mon, 14 Oct 2019 12:14:56 +0000 (20:14 +0800)]
[GRIFFIN-293][SERVICE] livy.need.queue=true
This fixes the issue described in https://issues.apache.org/jira/browse/GRIFFIN-293.
The solution is deployed on our servers and works as expected.
Author: neveljkovic <nveljkovic@plume.com>
Closes #541 from neveljkovic/griffin-293.
‘Zhao [Thu, 10 Oct 2019 15:13:27 +0000 (23:13 +0800)]
[GRIFFIN-289] New feature for griffin COMPLETENESS dq type
As described in GRIFFIN-289, this adds two new ways to check 'incompleteness' records: regular expression and enumeration.
An 'error.confs' list is added to the DQ JSON file; each JSON object in the 'error.confs' list is the configuration for one column.
If 'error.confs' is not provided, the old 'incompleteness' process is used, which keeps compatibility with existing JSON files.
Unit tests are added for the new JSON format.
Author: ‘Zhao <mrlzbebop@gmail.com>
Author: Zhao Li <mrlzbebop@gmail.com>
Closes #538 from LittleZhao/griffin-289.
jasonliaoxiaoge [Thu, 26 Sep 2019 14:53:27 +0000 (22:53 +0800)]
add placeholder for cron expression
Add a placeholder for the cron expression, because Java Quartz syntax is slightly different from Linux crontab.
Author: jasonliaoxiaoge <181276056@qq.com>
Closes #503 from jasonliaoxiaoge/master.
wankunde [Tue, 17 Sep 2019 23:29:05 +0000 (07:29 +0800)]
[GRIFFIN-290] Fix bug for submitting job to livy
When Griffin submits multiple DQ jobs to Livy, the HTTP parameter `name` is always 'griffin',
so Livy rejects them as duplicate sessions.
job request :
```
[owner: null, request: [proxyUser: None, file: hdfs://nameservice-standby/user/kun.wan/measure-0.6.0-SNAPSHOT.jar,
args: {
"spark" :
Unknown macro: { "log.level" }
,
"sinks" : [
Unknown macro: { .... }
],
"griffin.checkpoint" : [ ]
},{
"measure.type" : "griffin",
"id" : 5202,
"name" : "spu_null_check",
"owner" : "test",
"description" : "check null value for store and category",
"deleted" : false,
"timestamp" :
1568195100000,
"dq.type" : "PROFILING",
"sinks" : [ "ELASTICSEARCH", "HDFS" ],
"process.type" : "BATCH",
"rule.description" :
,
"data.sources" : [
Unknown macro: { .... }
],
"evaluate.rule" :
,
"measure.type" : "griffin"
},raw,raw, driverMemory: 1g, executorMemory: 6g, executorCores: 2, numExecutors: 6, queue: root.users.kun_dot_wan, name: griffin]]
```
livy Response :
```
400 Bad Request
[Date:"Thu, 12 Sep 2019 10:00:00 GMT", Content-Type:"application/json;charset=utf-8", Content-Length:"47", Server:"Jetty(9.3.24.v20180605)"]
{"msg":"Duplicate session name: Some(griffin)"}
```
Author: wankunde <wankunde@163.com>
Closes #534 from wankunde/livy_bug.
wankunde [Sat, 14 Sep 2019 07:45:39 +0000 (15:45 +0800)]
[GRIFFIN-291] Relocate HttpClient code in measure jar using shade plugin
Different projects use different versions of httpclient, which can easily conflict with other components.
We can shade the httpclient classes into the measure jar so that the conflicts disappear.
Author: wankunde <wankunde@163.com>
Closes #535 from wankunde/httpclient.
wankunde [Fri, 13 Sep 2019 14:17:03 +0000 (22:17 +0800)]
[GRIFFIN-288] optimize hdfs sink
When we sink records to HDFS, an OOM may occur if the result is huge; a sketch of one way to avoid building the whole payload in memory follows the stack trace.
```
19/09/06 18:52:39 INFO LineBufferedStream: 19/09/06 18:52:39 ERROR sink.HdfsSink: Java heap space
19/09/06 18:52:39 INFO LineBufferedStream: java.lang.OutOfMemoryError: Java heap space
19/09/06 18:52:39 INFO LineBufferedStream: at java.util.Arrays.copyOf(Arrays.java:3332)
19/09/06 18:52:39 INFO LineBufferedStream: at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
19/09/06 18:52:39 INFO LineBufferedStream: at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
19/09/06 18:52:39 INFO LineBufferedStream: at java.lang.StringBuilder.append(StringBuilder.java:136)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.AbstractTraversable.addString(Traversable.scala:104)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:323)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:325)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.sink.HdfsSink.org$apache$griffin$measure$sink$HdfsSink$$sinkRecords2Hdfs(HdfsSink.scala:191)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.sink.HdfsSink.sinkRecords(HdfsSink.scala:133)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.sink.MultiSinks$$anonfun$sinkRecords$1.apply(MultiSinks.scala:63)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.sink.MultiSinks$$anonfun$sinkRecords$1.apply(MultiSinks.scala:61)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.collection.immutable.List.foreach(List.scala:392)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.sink.MultiSinks.sinkRecords(MultiSinks.scala:61)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.write.RecordWriteStep.execute(RecordWriteStep.scala:49)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.transform.SparkSqlTransformStep.doExecute(SparkSqlTransformStep.scala:40)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.transform.TransformStep$class.execute(TransformStep.scala:72)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.transform.SparkSqlTransformStep.execute(SparkSqlTransformStep.scala:27)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.transform.TransformStep$$anonfun$2$$anonfun$apply$1.apply$mcV$sp(TransformStep.scala:51)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.transform.TransformStep$$anonfun$2$$anonfun$apply$1.apply(TransformStep.scala:50)
19/09/06 18:52:39 INFO LineBufferedStream: at org.apache.griffin.measure.step.transform.TransformStep$$anonfun$2$$anonfun$apply$1.apply(TransformStep.scala:50)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
19/09/06 18:52:39 INFO LineBufferedStream: at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
19/09/06 18:52:39 INFO LineBufferedStream: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
19/09/06 18:52:39 INFO LineBufferedStream: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
19/09/06 18:52:39 INFO LineBufferedStream: at java.lang.Thread.run(Thread.java:748)
```
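A minimal sketch of one way to avoid the problem (not necessarily how this PR fixes it): stream the records to HDFS line by line instead of concatenating them all with mkString first.
```
import java.io.PrintWriter

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteSketch {
  // Write records through an output stream so the whole payload never has to
  // sit in memory as a single String.
  def writeRecords(records: Iterator[String], pathStr: String): Unit = {
    val fs = FileSystem.get(new Configuration())
    val writer = new PrintWriter(fs.create(new Path(pathStr), true))
    try records.foreach(r => writer.println(r))
    finally writer.close()
  }
}
```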
Author: wankunde <wankunde@163.com>
Closes #533 from wankunde/hdfssink.
wankunde [Mon, 9 Sep 2019 00:28:19 +0000 (08:28 +0800)]
[GRIFFIN-286] Remove spark-testing-base dependency jar
We currently use the spark-testing-base jar to test Spark jobs in the measure module, but this jar may conflict with the Spark version (e.g. CDH Spark builds, Spark AE) or with the Scala version (only a few Scala versions are available for a given Spark version).
So I suggest removing the dependency on this package.
Author: wankunde <wankunde@163.com>
Closes #531 from wankunde/remoteSparkTestBase.
Simon George [Wed, 4 Sep 2019 13:38:54 +0000 (21:38 +0800)]
Made UI tests run without errors
These changes allow the UI tests to execute without errors when run using "npm test"
Author: Simon George <simongeorge@rentalcars.com>
Closes #529 from simegeorge/fix-ui-tests.
Johnnie [Mon, 2 Sep 2019 23:40:33 +0000 (07:40 +0800)]
[GRIFFIN-279] Upgrade Spring boot to 2.1.7.RELEASE
As Spring Boot 1.x reached end of life on Aug 1st 2019, it would be great to migrate to 2.1.x.
Below is the announcement
https://spring.io/blog/2018/07/30/spring-boot-1-x-eol-aug-1st-2019
The migration guide is:
https://github.com/spring-projects/spring-boot/wiki/Spring-Boot-2.0-Migration-Guide
Author: Johnnie <joohnnie.z@gmail.com>
Closes #528 from joohnnie/GRIFFIN-279.
Lionel Liu [Thu, 29 Aug 2019 14:13:46 +0000 (22:13 +0800)]
Merge pull request #530 from aleksgor/GRIFFIN-AGORSHKOV
Add Mysql, Cassandra and Elasticsearch connectors.
Lionel Liu [Thu, 29 Aug 2019 14:13:24 +0000 (22:13 +0800)]
Merge pull request #527 from joohnnie/GRIFFIN-280
GRIFFIN-280 update travis config to start griffin docker container
Gorshkov Aleksey [Tue, 27 Aug 2019 13:43:20 +0000 (16:43 +0300)]
Remove redundant dependency.
aleksgor [Tue, 27 Aug 2019 12:23:57 +0000 (15:23 +0300)]
Merge branch 'master' into GRIFFIN-AGORSHKOV
Gorshkov Aleksey [Tue, 27 Aug 2019 12:20:03 +0000 (15:20 +0300)]
Add Mysql, Cassandra and Elasticsearch connectors.
wankunde [Mon, 26 Aug 2019 23:42:13 +0000 (07:42 +0800)]
[GRIFFIN-283] Move sink steps into TransformStep
Treat sink steps as part of a transform step, so we can keep the focus on the transform step code.
The sink steps and some other transform steps can also be executed concurrently.
Author: wankunde <wankunde@163.com>
Closes #526 from wankunde/sink2.
Johnnie [Mon, 26 Aug 2019 23:41:41 +0000 (16:41 -0700)]
remove unused plugin
Johnnie [Mon, 26 Aug 2019 23:09:20 +0000 (16:09 -0700)]
GRIFFIN-280 change travis config for log level
Johnnie [Mon, 26 Aug 2019 22:40:05 +0000 (15:40 -0700)]
GRIFFIN-280 change travis config for log level
Johnnie [Mon, 26 Aug 2019 22:17:46 +0000 (15:17 -0700)]
GRIFFIN-280 change travis config for log level
Johnnie [Mon, 26 Aug 2019 21:55:39 +0000 (14:55 -0700)]
GRIFFIN-280 change travis config for log level