helix.git
2 years ago[maven-release-plugin] prepare release helix-0.9.0.1 helix-0.9.0.1
Jiajun Wang [Thu, 1 Aug 2019 23:21:27 +0000 (16:21 -0700)] 
[maven-release-plugin] prepare release helix-0.9.0.1

2 years agoFix the race condition while Helix refresh cluster status cache. (#363)
jiajunwang [Tue, 30 Jul 2019 21:41:32 +0000 (14:41 -0700)] 
Fix the race condition while Helix refresh cluster status cache. (#363)

* Fix the race condition while Helix refresh cluster status cache.

This change fixes issue #331.
The design relies on a single read to avoid locking during change notification. However, a later update introduced an additional read. Because notification is lock-free, the two reads may return different results, which leaves the cache in an inconsistent state. As a result, the expected rebalance might not happen.
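
The single-read idea described above can be sketched as follows. This is only an illustration of the general pattern, not the actual Helix fix: the class `SingleReadRefreshDemo` and its methods are hypothetical names. The shared flag is read (and cleared) exactly once, and every later decision uses the local snapshot rather than re-reading state that a lock-free notification may have changed in between.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration; names are not taken from the Helix code base.
final class SingleReadRefreshDemo {
  // Set by lock-free change notifications, cleared by the refresh path.
  private final AtomicBoolean changed = new AtomicBoolean(false);
  private int refreshCount = 0;

  // Called from notification callbacks; takes no lock.
  void notifyChange() {
    changed.set(true);
  }

  // Reads (and clears) the flag exactly once, then bases every later
  // decision on that local snapshot instead of re-reading shared state.
  boolean refreshIfNeeded() {
    boolean needsRefresh = changed.getAndSet(false);
    if (needsRefresh) {
      refreshCount++;
    }
    return needsRefresh;
  }

  int refreshCount() {
    return refreshCount;
  }
}
```

A second, separate read of `changed` inside `refreshIfNeeded()` could observe a different value than the first, which is the kind of inconsistency the commit message describes.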

3 years agoUpdate all markdown files to 0.9.0
Hunter Lee [Wed, 12 Jun 2019 17:27:45 +0000 (10:27 -0700)] 
Update all markdown files to 0.9.0

3 years agoUpdate website with 0.9.0 with maven plugin version upgrades
Hunter Lee [Tue, 11 Jun 2019 19:01:49 +0000 (12:01 -0700)] 
Update website with 0.9.0 with maven plugin version upgrades

3 years agoAdd Release Notes and Docs for 0.9.0 release
Hunter Lee [Tue, 11 Jun 2019 00:37:00 +0000 (17:37 -0700)] 
Add Release Notes and Docs for 0.9.0 release

3 years agoUpgrade ivy version for 0.9.0 release
Hunter Lee [Mon, 10 Jun 2019 23:57:51 +0000 (16:57 -0700)] 
Upgrade ivy version for 0.9.0 release

3 years agoUpgrade Apache rat version and add exclusion paths
Hunter Lee [Mon, 10 Jun 2019 22:29:58 +0000 (15:29 -0700)] 
Upgrade Apache rat version and add exclusion paths

Part of the Helix release process requires the rat plugin to perform checks. However, there are scripts and website files that do not require this kind of check. This diff adds exclusion paths to the pom.xml. Note that several Java files still do not pass the checks because they lack the Apache license. This needs to be fixed in the future.

3 years ago[maven-release-plugin] prepare for next development iteration
Hunter Lee [Mon, 3 Jun 2019 00:56:17 +0000 (17:56 -0700)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release helix-0.9.0 helix-0.9.0
Hunter Lee [Mon, 3 Jun 2019 00:56:06 +0000 (17:56 -0700)] 
[maven-release-plugin] prepare release helix-0.9.0

3 years agoEnable helix-front in pom.xml
Hunter Lee [Sun, 2 Jun 2019 23:55:24 +0000 (16:55 -0700)] 
Enable helix-front in pom.xml

3 years agoDisable JavaDoc check
Hunter Lee [Sun, 2 Jun 2019 23:50:46 +0000 (16:50 -0700)] 
Disable JavaDoc check

3 years agoRevert "[maven-release-plugin] prepare release 0.9.0"
Hunter Lee [Sun, 2 Jun 2019 23:48:42 +0000 (16:48 -0700)] 
Revert "[maven-release-plugin] prepare release 0.9.0"

This reverts commit 9b446bd673f269323f33c1459ddc6e72a3840cc5.

3 years agoRevert "[maven-release-plugin] prepare for next development iteration"
Hunter Lee [Sun, 2 Jun 2019 23:48:39 +0000 (16:48 -0700)] 
Revert "[maven-release-plugin] prepare for next development iteration"

This reverts commit f5a83b57397e03d06bf1e39f15326e227b07722b.

3 years ago[maven-release-plugin] prepare for next development iteration
Hunter Lee [Sun, 2 Jun 2019 23:00:11 +0000 (16:00 -0700)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release 0.9.0 0.9.0
Hunter Lee [Sun, 2 Jun 2019 22:42:35 +0000 (15:42 -0700)] 
[maven-release-plugin] prepare release 0.9.0

3 years agoRemove cluster view related code
Hunter Lee [Sat, 1 Jun 2019 00:22:58 +0000 (17:22 -0700)] 
Remove cluster view related code

3 years agoUpgrade ZK to 3.4.13
Hunter Lee [Sat, 1 Jun 2019 00:08:03 +0000 (17:08 -0700)] 
Upgrade ZK to 3.4.13

3 years agoBug fix: reuse the stable logic to verify the difference between idealStates and...
Yi Wang [Fri, 3 May 2019 23:03:37 +0000 (16:03 -0700)] 
Bug fix: reuse the stable logic to verify the difference between idealStates and externalViews

RB=1654700
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAvoid lock the cache object when require a FullRefresh.
Jiajun Wang [Tue, 30 Apr 2019 22:18:21 +0000 (15:18 -0700)] 
Avoid lock the cache object when require a FullRefresh.

The old synchronization logic prevented requesting a full refresh while a refresh was in progress, which could lead to slow callback handling. This change removes the original synchronization. The current cache update logic can handle gradually refreshed data, so there is no need to lock the full refresh request.

RB=1652941
BUG=HELIX-1851,gcn-29329
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoIntegrate customRestClient health check with instance service main logic
Yi Wang [Fri, 3 May 2019 00:45:08 +0000 (17:45 -0700)] 
Integrate customRestClient health check with instance service main logic

RB=1645567
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoimplementation of CustomRestClient (post request and get health checks)
Yi Wang [Sat, 20 Apr 2019 00:27:19 +0000 (17:27 -0700)] 
implementation of CustomRestClient (post request and get health checks)

RB=1638858
G=helix-reviewers
R=cjerian
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix the log logic in HelixManager.isLeader().
Jiajun Wang [Thu, 25 Apr 2019 21:03:21 +0000 (14:03 -0700)] 
Fix the log logic in HelixManager.isLeader().

The output is not correct when isLeader() is false.

RB=1645218
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSupport partition level health mapping fetch from ZK
Junkai Xue [Thu, 11 Apr 2019 21:38:45 +0000 (14:38 -0700)] 
Support partition level health mapping fetch from ZK

Partition-level health status is fetched differently from per-instance queries. Helix first tries to read the data from ZK under the HEALTH_REPORT folder. If the data has expired (checked against the EXPIRE entry), Helix calls the participant's API directly to get the latest data.

Otherwise, we assume the customized check has failed.

RB=1628988
BUG=HELIX-1785
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix the public API non-backward compatible change
Yi Wang [Tue, 23 Apr 2019 18:57:36 +0000 (11:57 -0700)] 
Fix the public API non-backward compatible change

RB=1641513
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix String formatting issue
Hunter Lee [Tue, 16 Apr 2019 20:21:33 +0000 (13:21 -0700)] 
TASK: Fix String formatting issue

For integers, you must use %d, not %f.
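
A minimal sketch of the failure mode, using only `String.format` from the JDK. The class and method names are made up for illustration and are not taken from the Helix code base. Passing an `int` to `%f` throws `IllegalFormatConversionException` at run time, while `%d` formats it as expected.

```java
import java.util.IllegalFormatConversionException;

// Hypothetical demo class; not part of Helix.
final class FormatSpecifierDemo {

  // Correct: %d is the conversion for integral types.
  static String withDecimalSpecifier(int count) {
    return String.format("Scheduled %d tasks", count);
  }

  // Returns true if formatting an int with %f throws, which it does because
  // %f only accepts floating-point arguments (Float, Double, BigDecimal).
  static boolean floatSpecifierRejectsInt(int count) {
    try {
      String.format("Scheduled %f tasks", count);
      return false;
    } catch (IllegalFormatConversionException e) {
      return true;
    }
  }
}
```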

RB=1633161
BUG=HELIX-1794
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoMore unit tests for InstanceValidationUtil
Yi Wang [Wed, 3 Apr 2019 21:36:24 +0000 (14:36 -0700)] 
More unit tests for InstanceValidationUtil

RB=1617333
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd util for checking per instance level health and partition level health
Junkai Xue [Thu, 11 Apr 2019 00:18:30 +0000 (17:18 -0700)] 
Add util for checking per instance level health and partition level health

The customized health check includes a user-customized per-instance check, which is isolated from other instances.

In addition to the per-instance check, the partition-level check should have a complete scope across the instances that hold sibling partitions. The partition check guarantees that, after shutting down the instance being checked, there are still healthy replicas to hold the top state.

RB=1627813
BUG=HELIX-1776
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix TestRecurringJobQueue
Hunter Lee [Wed, 10 Apr 2019 23:58:34 +0000 (16:58 -0700)] 
Fix TestRecurringJobQueue

This diff fixes TestRecurringJobQueue's testDeletingRecurrentQueueWithHistory

RB=1627625
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix bug in delete()
Hunter Lee [Wed, 10 Apr 2019 23:56:11 +0000 (16:56 -0700)] 
TASK: Fix bug in delete()

The delete() call was doing a force delete on workflows created from a recurrent workflow. This would cause a race condition between the controller cache and the deletion. This diff fixes this.
Changelist:
1. Fix the logic in delete()

RB=1627615
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoIntermediateStateCalcStage style change
Hunter Lee [Fri, 29 Mar 2019 19:13:48 +0000 (12:13 -0700)] 
IntermediateStateCalcStage style change

This diff includes code style fixes and refactor using Java 8 features.

RB=1613452
BUG=HELIX-1742
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTask Framework code style change
Hunter Lee [Fri, 29 Mar 2019 19:08:07 +0000 (12:08 -0700)] 
Task Framework code style change

This diff includes style changes using Java 8 features.

RB=1613441
BUG=HELIX-1742
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix tests in Helix REST
Junkai Xue [Wed, 10 Apr 2019 20:47:19 +0000 (13:47 -0700)] 
Fix tests in Helix REST

RB=1627064
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Add deleteJob namespaced job name support
Hunter Lee [Tue, 9 Apr 2019 04:44:40 +0000 (21:44 -0700)] 
TASK: Add deleteJob namespaced job name support

Currently, deletion of jobs from JobQueues only supports denamespaced job names. This makes it impossible for users to list all jobs and delete them, because they sometimes cannot recover the denamespaced names.
Changelist:
1. Add support for namespaced job names for deletion

RB=1624395
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix bug in getExpiredJobs()
Hunter Lee [Tue, 9 Apr 2019 04:08:37 +0000 (21:08 -0700)] 
TASK: Fix bug in getExpiredJobs()

getExpiredJobs() had a bug where if the job has the same expiry time as workflow's default expiry, it would always override it with Workflow's expiry config. This is not correct.
Changelist:
1. Remove a block of code where it overrides expiry config with WorkflowConfig's default expiry

RB=1624376
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix faulty logic in BestPossibleExternalViewVerifier
Hunter Lee [Thu, 4 Apr 2019 00:12:56 +0000 (17:12 -0700)] 
Fix faulty logic in BestPossibleExternalViewVerifier

removeEntryWithIgnoredStates() was not really doing what it was supposed to do. This diff fixes this.
Also, a small delay was added to make TestDrop more stable.

RB=1619153
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTEST: Fix UserContentStore related tests in helix-rest
Hunter Lee [Wed, 3 Apr 2019 21:34:29 +0000 (14:34 -0700)] 
TEST: Fix UserContentStore related tests in helix-rest

The behavior changed such that if the client-side code does not find the UserContent ZNode, it creates one instead of throwing an NPE. This fixes the tests so that they adapt to the new behavior. This behavior should be reverted eventually because the UserContent ZNode should be created only by the Controller.

RB=1618685
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoCheck sibling nodes to guarantee MIN_ACTIVE_REPLICAS satisfied
Yi Wang [Sat, 30 Mar 2019 00:28:07 +0000 (17:28 -0700)] 
Check sibling nodes to guarantee MIN_ACTIVE_REPLICAS satisfied

RB=1614128
G=helix-reviewers
A=jxue,hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix test failures and fix logic check stable state
Junkai Xue [Mon, 1 Apr 2019 23:56:03 +0000 (16:56 -0700)] 
Fix test failures and fix logic check stable state

Fix test failures:
1. Add logic to skip task framework idealstates
2. fix logic for test failure.

RB=1615941
BUG=HELIX-1725
G=helix-reviewers
A=lxia

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix unit test by starting rest server only once.
Lei Xia [Mon, 1 Apr 2019 22:55:05 +0000 (15:55 -0700)] 
Fix unit test by starting rest server only once.

RB=1615610
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSwallow exceptions during health status checks for getting instance by id
Yi Wang [Mon, 1 Apr 2019 22:54:21 +0000 (15:54 -0700)] 
Swallow exceptions during health status checks for getting instance by id

RB=1615554
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRefactor InstanceAccessor to InstancesAccessor and PerInstanceAccessor
Junkai Xue [Fri, 29 Mar 2019 23:40:51 +0000 (16:40 -0700)] 
Refactor InstanceAccessor to InstancesAccessor and PerInstanceAccessor

RB=1614063
BUG=HELIX-1725
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoGlobal instance stoppable API
Junkai Xue [Fri, 29 Mar 2019 18:11:35 +0000 (11:11 -0700)] 
Global instance stoppable API

This API takes a list of instances as input and returns the stoppable instances. The checks performed here are:

1. The single stoppable check for each instance.
2. Whether shutting down the instances would cause the number of replicas to drop below the minimum active number.
For the first phase, we do not implement instance-based selection.

Here we added an integration test:
1. Test instances that are disabled, have a disabled partition, are not in the same zone, or are not alive.
2. Disable one stoppable instance; the check fails.
3. Re-enable the instance and remove the disabled partition; the check for that instance passes again.

In several places, sets and maps are made TreeSet and TreeMap to guarantee that the output is consistent. We do see different ordering between Java 7 and Java 8.
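
The ordering point can be illustrated with a small sketch. `DeterministicOrderDemo` and `sortedNames` are hypothetical names, not taken from the Helix REST code. A `HashSet` makes no guarantee about iteration order, while a `TreeSet` always iterates in the natural order of its elements.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo class; not part of Helix.
final class DeterministicOrderDemo {

  // Copies the names into a TreeSet so that iteration follows natural
  // (lexicographic) String order, whatever the source collection was.
  static List<String> sortedNames(Collection<String> instanceNames) {
    return new ArrayList<>(new TreeSet<>(instanceNames));
  }
}
```

The same reasoning applies to `TreeMap` for key order, which keeps serialized output stable across JVM versions.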

RB=1596424
BUG=HELIX-1680
G=helix-reviewers
A=lxia

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoReport instance started & health status when getting by id
Yi Wang [Wed, 27 Mar 2019 00:53:19 +0000 (17:53 -0700)] 
Report instance started & health status when getting by id

RB=1609426
BUG=helix-1732
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRename instance health check enum to be more explicit
Yi Wang [Thu, 28 Mar 2019 23:33:01 +0000 (16:33 -0700)] 
Rename instance health check enum to be more explicit

RB=1612544
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSingle stoppable API impl
Yi Wang [Wed, 20 Mar 2019 23:46:22 +0000 (16:46 -0700)] 
Single stoppable API impl

RB=1603158
G=helix-reviewers
A=jxue,hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoImplementation of ClusterService's getClusterTopology method
Yi Wang [Tue, 19 Mar 2019 21:16:53 +0000 (14:16 -0700)] 
Implementation of ClusterService's getClusterTopology method

RB=1601257
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoInterface design for zone mapping information
Yi Wang [Tue, 19 Mar 2019 21:16:16 +0000 (14:16 -0700)] 
Interface design for zone mapping information

RB=1578905
BUG=helix-1646
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix node swap test.
Jiajun Wang [Wed, 14 Nov 2018 00:59:18 +0000 (16:59 -0800)] 
Fix node swap test.

Add sleep to stabilize the test. Several cluster operations require a controller reaction before checking.

RB=1484466
G=helix-reviewers
A=hrzhang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd Util check instance is already in stable state
Junkai Xue [Fri, 15 Mar 2019 23:35:01 +0000 (16:35 -0700)] 
Add Util check instance is already in stable state

We have two choices for checking whether an instance is in a stable state:
1. Compare IdealState with ExternalView.
2. Compare IdealState with CurrentState.

We chose IS vs. EV because:
1. We have a simple cache in REST; reading current state would still cause multiple reads from different hosts. The EV, however, can be shared by each host.
2. EV is the decision maker for the router, which makes it the closest thing to a source of truth for the real production environment. So EV is the final choice.

RB=1598176
BUG=HELIX-1676
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoDummy check for customized API
Junkai Xue [Tue, 12 Mar 2019 19:40:41 +0000 (12:40 -0700)] 
Dummy check for customized API

This change builds the dummy check for the customized API. It contains the following changes:
1. RESTConfig can set up the customized URL.
2. Define the endpoints per participant and per partition.
3. Add dummy logic that returns true for all customized check statuses.

RB=1596427
BUG=HELIX-1678
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix helix-ui build failure due to wrong config reference
Yi Wang [Tue, 12 Mar 2019 19:47:04 +0000 (12:47 -0700)] 
Fix helix-ui build failure due to wrong config reference

RB=1592781
G=helix-reviewers
A=lxia,hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd adminGroup check for write operations
Yi Wang [Fri, 8 Mar 2019 23:28:35 +0000 (15:28 -0800)] 
Add adminGroup check for write operations

ACLOVERRIDE
RB=1590175
BUG=HELIX-1682
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoApply the JerseyTestUriRequestBuilder to the TestInstanceAccessor
ywang4 [Mon, 25 Feb 2019 23:04:50 +0000 (15:04 -0800)] 
Apply the JerseyTestUriRequestBuilder to the TestInstanceAccessor

RB=1575013
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoCreate util class to make it easier to make rest request
ywang4 [Fri, 22 Feb 2019 18:12:20 +0000 (10:12 -0800)] 
Create util class to make it easier to make rest request

RB=1573157
G=helix-reviewers
A=jxue,hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoget instance's pending messages with state model def parameter
ywang4 [Wed, 20 Feb 2019 22:23:08 +0000 (14:23 -0800)] 
get instance's pending messages with state model def parameter

Update the get() method in AbstractTestClass in order to take the correct QueryParam
BUGS=HELIX-1645

RB=1570393
BUG=HELIX-1645
G=helix-reviewers
A=hulee,jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoUtil methods for checking if instance healthy
Yi Wang [Tue, 12 Mar 2019 23:59:25 +0000 (16:59 -0700)] 
Util methods for checking if instance healthy

RB=1585486
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Make isJobQueue backward compatible
Hunter Lee [Wed, 3 Apr 2019 01:23:17 +0000 (18:23 -0700)] 
TASK: Make isJobQueue backward compatible

Making isJobQueue backward compatible by adding isTerminable() check.

RB=1617516
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix possible NPE in getWorkflowId()
Hunter Lee [Wed, 3 Apr 2019 01:09:03 +0000 (18:09 -0700)] 
TASK: Fix possible NPE in getWorkflowId()

Old workflows may not have WorkflowID field set. This makes getWorkflowId() backward-compatible by falling back on its ZNRecord id instead.
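
A minimal sketch of the fallback described above, assuming the record's fields are available as a plain map. `WorkflowIdFallbackDemo`, `workflowIdOrRecordId`, and the map-based signature are hypothetical and do not reflect the actual `WorkflowConfig` code; only the `WorkflowID` field name comes from the commit message.

```java
import java.util.Map;

// Hypothetical demo class; not part of Helix.
final class WorkflowIdFallbackDemo {

  // Returns the WorkflowID field when present; otherwise falls back to the
  // record's own id, so records written by older versions still resolve.
  static String workflowIdOrRecordId(Map<String, String> simpleFields, String recordId) {
    String workflowId = simpleFields.get("WorkflowID");
    return workflowId != null ? workflowId : recordId;
  }
}
```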

RB=1617517
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix cleanupQueue() API
Hunter Lee [Tue, 2 Apr 2019 21:17:23 +0000 (14:17 -0700)] 
TASK: Fix cleanupQueue() API

This API is meant for JobQueues only. However, it was checking only using isTerminable(), which is a deprecated flag for whether a workflow is a queue or not.

Changelist:
1. Add isJobQueue() check in cleanupQueue() in TaskDriver

RB=1616870
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoMigrate Helix to Java 8
Hunter Lee [Fri, 29 Mar 2019 18:56:13 +0000 (11:56 -0700)] 
Migrate Helix to Java 8

This diff migrates the project to JDK1.8. This diff does not change any functionalities/core logic. It contains a few style changes and redundant code changes.
Changelist:
1. Change to Java 8
2. Upgrade dependencies in pom.xml

RB=1613418
BUG=HELIX-1742
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoHELIX-816 use System.currentTimeMillis()
bd2019us [Fri, 12 Apr 2019 13:39:44 +0000 (08:39 -0500)] 
HELIX-816 use System.currentTimeMillis()

3 years ago[HELIX-815] fix bug to avoid potential crash
bd2019us [Sun, 31 Mar 2019 19:43:06 +0000 (14:43 -0500)] 
[HELIX-815] fix bug to avoid potential crash

3 years agoTASK2.0: Job scheduling core pipeline fixes
Hunter Lee [Thu, 28 Mar 2019 19:31:25 +0000 (12:31 -0700)] 
TASK2.0: Job scheduling core pipeline fixes

Task Framework 2.0 had stability issues and race conditions that weren't being handled correctly. Also, integration with RuntimeJobDag had some loopholes that needed to be fixed. This diff includes fixes and improvements that let it show real performance gains and cut down on redundant computation.
Changelist:
1. Race condition when a job is enqueued, only the new JobConfig is updated and not the DAG
    Add a two-way selective update which ensures consistency between JobConfigs and parent DAGs
2. Moved where getNextJob() is called in scheduleJobs() in WorkflowDispatcher
    This ensures that once a RuntimeJobDag is rebuilt, update for jobs happens in one pipeline run, which removes any extra delay or slowness
3. Race condition where the job you got from getNextJob is for some reason not schedulable
    This is due to deleting and enqueuing a job of the same name
    RuntimeJobDag has the old job name, which conflicts with the dependency in the new DAG
    This fixes the test: TestTaskRebalancerStopResume so that it does not enqueue a job of the same name
4. JobRebalancer was throwing an NPE when calling processJobStatusUpdateAndAssignment()
    This was sometimes making the Controller hang
    Added a null check for JobConfig (job could have been deleted/purged)
5. Fix bug with isWorkflowStopped
    TargetState comparison was done in the opposite way
    This fixes the test: TestRecurringJobQueue's testDeletingRecurrentQueueWithHistory()
    Sometimes contexts do not get deleted cleanly but this does not affect correctness
6. Add TestEnqueueJobs
7. Fix unstable TestGetLastScheduledTaskExecInfo
8. Other minor style fixes

3 years agoTASK2.0: Add performance metrics to JobMonitor
Hunter Lee [Thu, 28 Mar 2019 19:30:09 +0000 (12:30 -0700)] 
TASK2.0: Add performance metrics to JobMonitor

    We want to add more metrics to Task Framework so that the user could understand what's going on in case of a slowdown, or get a general sense of how fast the workload is moving.
    Changelist:
    1. Add SubmissionToProcessDelay
    2. Add SubmissionToScheduleDelay
    3. Add ControllerInducedDelay (for testing)
    4. Add JobLatencyGauge
    5. Change regular metrics to Dynamic metrics in JobMonitor
    6. Add an integration test: TestTaskPerformanceMetrics

3 years agoTASK: Fix bug in isWorkflowStopped
Hunter Lee [Thu, 28 Mar 2019 19:29:38 +0000 (12:29 -0700)] 
TASK: Fix bug in isWorkflowStopped

    A bug in isWorkflowStopped was causing the workflow context for the recurrent workflow template to show up as STOPPED. This diff fixes this so that it handles recurrent workflow templates correctly.

3 years agoHELIX: Bypass throttling for disabled partitions
Hunter Lee [Thu, 28 Mar 2019 19:29:16 +0000 (12:29 -0700)] 
HELIX: Bypass throttling for disabled partitions

    This diff allows all state transitions linked to disabled instances/partitions to bypass throttling constraints.
    Changelist:
    1. Modify logic in IntermediateStateCalcStage
    2. Add more integration tests

3 years agoHELIX: Recovery balance partitions with disabled top-state replicas
Hunter Lee [Thu, 28 Mar 2019 19:27:52 +0000 (12:27 -0700)] 
HELIX: Recovery balance partitions with disabled top-state replicas

    Previously, disabled partitions or disabled instances did not affect Helix's throttling logic. This was problematic because the ability to disable was designed in order to move partitions/replicas off a given instance as a measure to deal with unhealthy partitions/instances. This change allows disabled partitions to go into recovery balance; when the user has not set any throttling configs for recovery balance, these state transitions go through unthrottled, avoiding downtime.
    Changelist:
    1. Add a check for determining rebalance type for a given partition
    2. Add an integration test

3 years agoTASK: Fix bug where JobDispatcher does not create UserContentStore for new jobs
Hunter Lee [Thu, 28 Mar 2019 19:27:26 +0000 (12:27 -0700)] 
TASK: Fix bug where JobDispatcher does not create UserContentStore for new jobs

    It was observed that there are multiple logic paths where a new job could get scheduled: 1. scheduleJobs() 2. processJobStatusUpdateAndAssignment(). When a job is assigned by the latter, JobDispatcher would fail to create the UserContentStore for the job, causing all subsequent reads/writes to this UserContentStore to fail. This is a temporary fix; further refactoring of the code paths would be required to consolidate where new jobs get scheduled.
    Changelist:
    1. Add UserContentStore for jobs with null contexts

3 years agoHELIX: Fix all DataUpdaters so that it checks for null previous records
Hunter Lee [Thu, 28 Mar 2019 19:26:58 +0000 (12:26 -0700)] 
HELIX: Fix all DataUpdaters so that it checks for null previous records

Some implementations are missing the null check for the ZNRecord data prior to the update in the DataUpdater pattern. This is a fatal bug that could cause NullPointerExceptions. This diff goes through the codebase and addresses this.
Changelist:
1. Add null checks for all DataUpdater implementations
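
The null check can be sketched with a stand-in for the updater callback. `RecordUpdater`, `NullSafeUpdaterDemo`, and the use of a plain `Map` in place of a ZNRecord are assumptions made for illustration, not the real Helix or ZkClient types. The point is that the updater must handle the case where the node had no previous data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a read-modify-write callback: it receives the
// current data (possibly null) and returns the data to write back.
interface RecordUpdater<T> {
  T update(T currentData);
}

// Hypothetical demo class; not part of Helix.
final class NullSafeUpdaterDemo {

  // Builds an updater that sets one field. When there is no previous
  // record, it starts from an empty one instead of dereferencing null.
  static RecordUpdater<Map<String, String>> setField(String key, String value) {
    return currentData -> {
      Map<String, String> record = new HashMap<>();
      if (currentData != null) {
        record.putAll(currentData);
      }
      record.put(key, value);
      return record;
    };
  }
}
```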

3 years agoFix read previous assignment for workflows
Hunter Lee [Thu, 28 Mar 2019 19:26:21 +0000 (12:26 -0700)] 
Fix read previous assignment for workflows

With this change, TaskDataCache will:
1. Not pollute the log with read failures for previous assignment
2. Reduce the amount of reads from the previous assignment

3 years agoFix N -> N + 1 extra bootstrap
Hunter Lee [Thu, 28 Mar 2019 19:24:44 +0000 (12:24 -0700)] 
Fix N -> N + 1 extra bootstrap

When rebalancing, Helix does the following before dropping replicas that are not in the preference list:
    1. Send a replica on a node that just came online from OFFLINE to SLAVE
    2. Drop a replica on a node that just bootstrapped

This happened because the rebalancer was trying to make all current states exactly match the states in the best possible mapping. Once those conditions were met, Helix started dropping replicas. This fix makes Helix guarantee only that the replicas in the preference list match the states in the best possible mapping, so the rebalancer does not have to wait and bootstrap extra replicas that are not in the preference list.

3 years agoHELIX: Fix typo in ResourceControllerDataProvider's API
Hunter Lee [Thu, 28 Mar 2019 19:18:29 +0000 (12:18 -0700)] 
HELIX: Fix typo in ResourceControllerDataProvider's API

3 years agoAdd Hunter Lee as Committer
Hunter Lee [Thu, 28 Mar 2019 19:08:58 +0000 (12:08 -0700)] 
Add Hunter Lee as Committer

3 years agoDisable helix-front build
Junkai Xue [Thu, 7 Mar 2019 22:16:43 +0000 (14:16 -0800)] 
Disable helix-front build

3 years agoRevise the markdown and add more context to how-to guide
narendly [Tue, 5 Mar 2019 23:17:43 +0000 (15:17 -0800)] 
Revise the markdown and add more context to how-to guide

3 years agoUpdate menu bar
Junkai Xue [Wed, 6 Mar 2019 23:05:39 +0000 (15:05 -0800)] 
Update menu bar

3 years agoAdd markdown for auto-exit of maintenance mode
narendly [Mon, 4 Mar 2019 22:37:58 +0000 (14:37 -0800)] 
Add markdown for auto-exit of maintenance mode

3 years ago[maven-release-plugin] prepare for next development iteration
Junkai Xue [Thu, 28 Feb 2019 23:40:32 +0000 (15:40 -0800)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release helix-0.8.4 helix-0.8.4
Junkai Xue [Thu, 28 Feb 2019 23:40:22 +0000 (15:40 -0800)] 
[maven-release-plugin] prepare release helix-0.8.4

3 years ago[HELIX-814] HELIX: Add back ClusterDataCache for backward-compatibility 314/head
narendly [Thu, 28 Feb 2019 23:24:17 +0000 (15:24 -0800)] 
[HELIX-814] HELIX: Add back ClusterDataCache for backward-compatibility

It was discovered that removing ClusterDataCache and changing public interfaces (RebalanceStrategy, Rebalancer) caused backward incompatibility. This diff aims to solve this issue by creating a backward-compatible ClusterDataCache (deprecated).

3 years agoRevert "[maven-release-plugin] prepare release helix-0.8.4"
Junkai Xue [Thu, 28 Feb 2019 23:28:35 +0000 (15:28 -0800)] 
Revert "[maven-release-plugin] prepare release helix-0.8.4"

This reverts commit 3e7794a74414ca45fd0d526cf53960241cc52898.

3 years agoRevert "[maven-release-plugin] prepare for next development iteration"
Junkai Xue [Thu, 28 Feb 2019 23:28:24 +0000 (15:28 -0800)] 
Revert "[maven-release-plugin] prepare for next development iteration"

This reverts commit 3abe3a4c29939ea027af19f21b3e4855910fa1ff.

3 years ago[maven-release-plugin] prepare for next development iteration
Junkai Xue [Wed, 27 Feb 2019 19:11:15 +0000 (11:11 -0800)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release helix-0.8.4
Junkai Xue [Wed, 27 Feb 2019 19:11:04 +0000 (11:11 -0800)] 
[maven-release-plugin] prepare release helix-0.8.4

3 years agoEnable helix-front build
Junkai Xue [Wed, 27 Feb 2019 18:57:41 +0000 (10:57 -0800)] 
Enable helix-front build

3 years agoRevert "[maven-release-plugin] prepare release helix-0.8.4"
Junkai Xue [Wed, 27 Feb 2019 18:57:19 +0000 (10:57 -0800)] 
Revert "[maven-release-plugin] prepare release helix-0.8.4"

This reverts commit b7395468b424df30a0b46da684862bb248d65c2c.

3 years agoRevert "[maven-release-plugin] prepare for next development iteration"
Junkai Xue [Wed, 27 Feb 2019 18:57:07 +0000 (10:57 -0800)] 
Revert "[maven-release-plugin] prepare for next development iteration"

This reverts commit e41f8a68ceb9134881c6e6d53c6a5a7efdbdeedc.

3 years ago[maven-release-plugin] prepare for next development iteration
Junkai Xue [Wed, 27 Feb 2019 05:17:52 +0000 (21:17 -0800)] 
[maven-release-plugin] prepare for next development iteration

3 years ago[maven-release-plugin] prepare release helix-0.8.4
Junkai Xue [Wed, 27 Feb 2019 05:17:40 +0000 (21:17 -0800)] 
[maven-release-plugin] prepare release helix-0.8.4

3 years agoBump ivy version
Junkai Xue [Wed, 27 Feb 2019 05:10:40 +0000 (21:10 -0800)] 
Bump ivy version

3 years agoRelease note and docs for 0.8.4
Junkai Xue [Wed, 27 Feb 2019 04:48:33 +0000 (20:48 -0800)] 
Release note and docs for 0.8.4

3 years ago[HELIX-812] HELIX: Fix maintenance history bug 312/head
narendly [Wed, 27 Feb 2019 01:30:47 +0000 (17:30 -0800)] 
[HELIX-812] HELIX: Fix maintenance history bug

There was a bug in maintenance history where when the cluster exits maintenance mode automatically, it would record the exit action twice in history. This is because each pipeline is designed to run MaintenanceRecoveryStage twice.
    Changelist:
    1. Add a flag so that if maintenanceSignal has been changed, just return from MaintenanceRecoveryStage

3 years ago[HELIX-811] HELIX: Only log relayMsg if it doesn't exist
narendly [Wed, 27 Feb 2019 01:29:40 +0000 (17:29 -0800)] 
[HELIX-811] HELIX: Only log relayMsg if it doesn't exist

This log was flooding our log files. We need to change it so that relay messages only get logged once.

3 years ago[HELIX-810] HELIX: Fix NPE in InstanceMessagesCache
narendly [Wed, 27 Feb 2019 01:28:38 +0000 (17:28 -0800)] 
[HELIX-810] HELIX: Fix NPE in InstanceMessagesCache

It was observed that InstanceMessagesCache was throwing an NPE when it tries to setRelayTime(). This is likely because some relay messages have target instances that are no longer live (thus not in liveInstanceMap). InstanceMessagesCache must handle this gracefully by skipping the operation. We do not delete these msgs right away because the instance may come back alive. Otherwise, after some time has passed, the msg will get expired by the Controller and be removed.
    Changelist:
    1. Add a try-catch block
    2. Improve logging

3 years ago[HELIX-809] TEST: Fix unstable TestClusterInMaintenanceModeWhenReachingMaxPartition
narendly [Wed, 27 Feb 2019 01:26:00 +0000 (17:26 -0800)] 
[HELIX-809] TEST: Fix unstable TestClusterInMaintenanceModeWhenReachingMaxPartition

The pause for this was too short so the test was occasionally failing. This RB fixes this.

3 years ago[HELIX-808] TASK: Fix double-booking of tasks with task CurrentStates
narendly [Wed, 27 Feb 2019 01:24:56 +0000 (17:24 -0800)] 
[HELIX-808] TASK: Fix double-booking of tasks with task CurrentStates

It was observed that TestNoDoubleAssign was failing intermittently. Upon debugging with more detailed logs, there was a race condition between newly starting tasks and dropping tasks. To prevent this, dropping state transitions will be prioritized and prevInstanceToTaskAssignment will be built from CurrentStates. This is needed to make sure the right number of tasks are assigned every task pipeline and dropping transitions happen right away.

    Changelist:
    1. Change the logic for generating prevInstToTaskAssignment so that it's based on CurrentState
    2. Add a special check for not updating task partition state upon Participant connection loss
    3. TestNoDoubleAssign passes consistently
    4. Fix TestNoDoubleAssign so that there won't be any thread leak

3 years ago[HELIX-807] REST: Add get maintenance signal endpoint
narendly [Wed, 27 Feb 2019 01:20:46 +0000 (17:20 -0800)] 
[HELIX-807] REST: Add get maintenance signal endpoint

Changelist:
    1. Add get maintenance signal endpoint
    2. Add a test

3 years ago[HELIX-806] HELIX: Modify endpoints for instrumenting maintenance with custom fields
narendly [Wed, 27 Feb 2019 01:19:14 +0000 (17:19 -0800)] 
[HELIX-806] HELIX: Modify endpoints for instrumenting maintenance with custom fields

We want to use content for users to input their KV mappings as a JSON string.
    Changelist:
    1. Modify enable/disableMaintenanceMode endpoint logic
    2. Modify tests