Jiajun Wang [Thu, 1 Aug 2019 23:21:27 +0000 (16:21 -0700)]
[maven-release-plugin] prepare release helix-0.9.0.1
jiajunwang [Tue, 30 Jul 2019 21:41:32 +0000 (14:41 -0700)]
Fix the race condition while Helix refreshes the cluster status cache. (#363)
* Fix the race condition while Helix refreshes the cluster status cache.
This change fixes issue #331.
The design relies on a single read so that change notifications need no locking. However, a later update introduced an additional read. Because notifications are lock-free, the two reads may return different results, which leaves the cache in an inconsistent state. As a result, the expected rebalance might not happen.
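The single-read rule described above can be sketched as follows. This is a minimal illustration, not the actual Helix cache code: the class `RefreshFlagSketch` and its members are hypothetical names, and an `AtomicBoolean` stands in for whatever lock-free state the notification callbacks update.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical cache showing the "read the change flag exactly once" rule. */
final class RefreshFlagSketch {
  // Set by notification callbacks without taking any lock.
  private final AtomicBoolean changed = new AtomicBoolean(false);
  // Counts how many reloads were performed (for illustration only).
  int reloads = 0;

  /** Called from a change notification; never blocks. */
  void notifyChange() {
    changed.set(true);
  }

  /** Reads the flag once and reuses the local copy for every later decision. */
  boolean refresh() {
    boolean needsReload = changed.getAndSet(false); // the single read
    if (needsReload) {
      reloads++; // reload the cached data here
    }
    // Reading `changed` a second time here could disagree with the first
    // read if a notification arrived in between, so only the local copy is used.
    return needsReload;
  }
}
```

A notification that arrives after the single read is simply picked up by the next `refresh()` call.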
Hunter Lee [Wed, 12 Jun 2019 17:27:45 +0000 (10:27 -0700)]
Update all markdown files to 0.9.0
Hunter Lee [Tue, 11 Jun 2019 19:01:49 +0000 (12:01 -0700)]
Update website with 0.9.0 with maven plugin version upgrades
Hunter Lee [Tue, 11 Jun 2019 00:37:00 +0000 (17:37 -0700)]
Add Release Notes and Docs for 0.9.0 release
Hunter Lee [Mon, 10 Jun 2019 23:57:51 +0000 (16:57 -0700)]
Upgrade ivy version for 0.9.0 release
Hunter Lee [Mon, 10 Jun 2019 22:29:58 +0000 (15:29 -0700)]
Upgrade Apache rat version and add exclusion paths
Part of the Helix release process requires the rat plugin to perform checks. However, there are scripts and website files that do not require these checks. This diff adds exclusion paths to the pom.xml. Note that several Java files still do not pass the checks because they lack the Apache license. This still needs to be fixed in the future.
Hunter Lee [Mon, 3 Jun 2019 00:56:17 +0000 (17:56 -0700)]
[maven-release-plugin] prepare for next development iteration
Hunter Lee [Mon, 3 Jun 2019 00:56:06 +0000 (17:56 -0700)]
[maven-release-plugin] prepare release helix-0.9.0
Hunter Lee [Sun, 2 Jun 2019 23:55:24 +0000 (16:55 -0700)]
Enable helix-front in pom.xml
Hunter Lee [Sun, 2 Jun 2019 23:50:46 +0000 (16:50 -0700)]
Disable JavaDoc check
Hunter Lee [Sun, 2 Jun 2019 23:48:42 +0000 (16:48 -0700)]
Revert "[maven-release-plugin] prepare release 0.9.0"
This reverts commit 9b446bd673f269323f33c1459ddc6e72a3840cc5.
Hunter Lee [Sun, 2 Jun 2019 23:48:39 +0000 (16:48 -0700)]
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit f5a83b57397e03d06bf1e39f15326e227b07722b.
Hunter Lee [Sun, 2 Jun 2019 23:00:11 +0000 (16:00 -0700)]
[maven-release-plugin] prepare for next development iteration
Hunter Lee [Sun, 2 Jun 2019 22:42:35 +0000 (15:42 -0700)]
[maven-release-plugin] prepare release 0.9.0
Hunter Lee [Sat, 1 Jun 2019 00:22:58 +0000 (17:22 -0700)]
Remove cluster view related code
Hunter Lee [Sat, 1 Jun 2019 00:08:03 +0000 (17:08 -0700)]
Upgrade ZK to 3.4.13
Yi Wang [Fri, 3 May 2019 23:03:37 +0000 (16:03 -0700)]
Bug fix: reuse the stable logic to verify the difference between idealStates and externalViews
RB=1654700
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Tue, 30 Apr 2019 22:18:21 +0000 (15:18 -0700)]
Avoid locking the cache object when requiring a FullRefresh.
The old synchronization logic prevented a full refresh from being requested while a refresh was in progress, which could slow down callback handling. This change removes the original synchronization. The current cache update logic can handle gradually refreshed data, so there is no need to lock the full refresh request.
RB=1652941
BUG=HELIX-1851,gcn-29329
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Fri, 3 May 2019 00:45:08 +0000 (17:45 -0700)]
Integrate customRestClient health check with instance service main logic
RB=1645567
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Sat, 20 Apr 2019 00:27:19 +0000 (17:27 -0700)]
implementation of CustomRestClient (post request and get health checks)
RB=1638858
G=helix-reviewers
R=cjerian
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Thu, 25 Apr 2019 21:03:21 +0000 (14:03 -0700)]
Fix the log logic in HelixManager.isLeader().
The output is not correct when isLeader() is false.
RB=1645218
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 11 Apr 2019 21:38:45 +0000 (14:38 -0700)]
Support partition level health mapping fetch from ZK
Partition-level health status is fetched differently from per-instance querying. Helix first tries to get the data from ZK under the HEALTH_REPORT folder. If the data is expired (checked against the EXPIRE entry), Helix calls the participant's API directly to get the latest data.
Otherwise, we assume the customized check failed.
RB=1628988
BUG=HELIX-1785
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 23 Apr 2019 18:57:36 +0000 (11:57 -0700)]
Fix the public API non-backward compatible change
RB=1641513
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 16 Apr 2019 20:21:33 +0000 (13:21 -0700)]
TASK: Fix String formatting issue
For integers, you must use %d, not %f.
RB=1633161
BUG=HELIX-1794
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
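The formatting rule in the commit above ("use %d, not %f" for integers) can be illustrated with plain `String.format`. The class and method names below are hypothetical and are not taken from the Helix code base.

```java
import java.util.IllegalFormatConversionException;

/** Hypothetical helper contrasting %d and %f for integer arguments. */
final class FormatSketch {
  /** Correct: %d matches an integer argument. */
  static String describe(int taskCount) {
    return String.format("Scheduled %d tasks", taskCount);
  }

  /** Returns true if formatting an int with %f is rejected at runtime. */
  static boolean failsWithFloatSpecifier(int taskCount) {
    try {
      String.format("Scheduled %f tasks", taskCount);
      return false;
    } catch (IllegalFormatConversionException e) {
      // %f only accepts floating-point types, so an Integer is rejected.
      return true;
    }
  }
}
```

The mismatch is not caught by the compiler, which is why it tends to surface only when the log or message line is actually executed.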
Yi Wang [Wed, 3 Apr 2019 21:36:24 +0000 (14:36 -0700)]
More unit tests for InstanceValidationUtil
RB=1617333
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 11 Apr 2019 00:18:30 +0000 (17:18 -0700)]
Add util for checking per instance level health and partition level health
The customized health check includes a user-customized per-instance check, which is isolated from other instances.
In addition to the per-instance check, the partition-level check needs a complete scope across the instances that hold sibling partitions. The partition check guarantees that shutting down the instance being checked leaves healthy replicas to hold the top state.
RB=1627813
BUG=HELIX-1776
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 10 Apr 2019 23:58:34 +0000 (16:58 -0700)]
Fix TestRecurringJobQueue
This diff fixes TestRecurringJobQueue's testDeletingRecurrentQueueWithHistory
RB=1627625
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 10 Apr 2019 23:56:11 +0000 (16:56 -0700)]
TASK: Fix bug in delete()
The delete() call was doing a force delete on workflows created from a recurrent workflow. This would cause a race condition between the controller cache and the deletion. This diff fixes this.
Changelist:
1. Fix the logic in delete()
RB=1627615
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Fri, 29 Mar 2019 19:13:48 +0000 (12:13 -0700)]
IntermediateStateCalcStage style change
This diff includes code style fixes and refactor using Java 8 features.
RB=1613452
BUG=HELIX-1742
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Fri, 29 Mar 2019 19:08:07 +0000 (12:08 -0700)]
Task Framework code style change
This diff includes style changes using Java 8 features.
RB=1613441
BUG=HELIX-1742
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Wed, 10 Apr 2019 20:47:19 +0000 (13:47 -0700)]
Fix tests in Helix REST
RB=1627064
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 9 Apr 2019 04:44:40 +0000 (21:44 -0700)]
TASK: Add deleteJob namespaced job name support
Currently, deletion of jobs from JobQueues only supports denamespaced job names. This makes it impossible for users to list all jobs and delete them because they sometimes cannot recover the denamespaced names.
Changelist:
1. Add support for namespaced job names for deletion
RB=1624395
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
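One way to accept both name forms on deletion is to normalize the input before looking the job up. The sketch below assumes namespaced job names take the form `<queue>_<job>`. That convention, the class `JobNameSketch`, and the method `toNamespaced` are assumptions made for illustration, not the code from this commit.

```java
/** Hypothetical normalizer that accepts namespaced or denamespaced job names. */
final class JobNameSketch {
  /**
   * Returns the namespaced form of a job name.
   * Assumes (for illustration) that namespaced names look like "<queue>_<job>".
   */
  static String toNamespaced(String queue, String jobName) {
    String prefix = queue + "_";
    // Already namespaced: return unchanged. Otherwise prepend the queue name.
    return jobName.startsWith(prefix) ? jobName : prefix + jobName;
  }
}
```

A denamespaced job whose own name happens to start with the queue prefix is ambiguous under this scheme, so a real implementation would need to check which form actually exists.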
Hunter Lee [Tue, 9 Apr 2019 04:08:37 +0000 (21:08 -0700)]
TASK: Fix bug in getExpiredJobs()
getExpiredJobs() had a bug where if the job has the same expiry time as workflow's default expiry, it would always override it with Workflow's expiry config. This is not correct.
Changelist:
1. Remove a block of code where it overrides expiry config with WorkflowConfig's default expiry
RB=1624376
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Thu, 4 Apr 2019 00:12:56 +0000 (17:12 -0700)]
Fix faulty logic in BestPossibleExternalViewVerifier
removeEntryWithIgnoredStates() was not really doing what it was supposed to do. This diff fixes this.
Also, a small delay was added to make TestDrop more stable.
RB=1619153
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 3 Apr 2019 21:34:29 +0000 (14:34 -0700)]
TEST: Fix UserContentStore related tests in helix-rest
The behavior changed such that if the client-side code does not find the UserContent ZNode, it creates one instead of throwing an NPE. This fixes the tests so that they adapt to the new behavior. This behavior should be reverted eventually because the UserContent ZNode should be created only by the Controller.
RB=1618685
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Sat, 30 Mar 2019 00:28:07 +0000 (17:28 -0700)]
Check sibling nodes to guarantee MIN_ACTIVE_REPLICAS satisfied
RB=1614128
G=helix-reviewers
A=jxue,hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Mon, 1 Apr 2019 23:56:03 +0000 (16:56 -0700)]
Fix test failures and fix logic check stable state
Fix test failures:
1. Add logic to skip task framework idealstates
2. fix logic for test failure.
RB=1615941
BUG=HELIX-1725
G=helix-reviewers
A=lxia
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Lei Xia [Mon, 1 Apr 2019 22:55:05 +0000 (15:55 -0700)]
Fix unit test by starting REST server only once.
RB=1615610
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Mon, 1 Apr 2019 22:54:21 +0000 (15:54 -0700)]
Swallow exceptions during health status checks for getting instance by id
RB=1615554
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Fri, 29 Mar 2019 23:40:51 +0000 (16:40 -0700)]
Refactor InstanceAccessor to InstancesAccessor and PerInstanceAccessor
RB=1614063
BUG=HELIX-1725
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Fri, 29 Mar 2019 18:11:35 +0000 (11:11 -0700)]
Global instance stoppable API
This API takes a list of instances as input and returns the stoppable instances. The checks performed here are:
1. single stoppable check for each instance.
2. whether shutting down the instances would cause replicas to drop below the min active number.
For the first phase, we do not implement instance based selection.
Here we added an integration test:
1. test instances disabled, has disabled partition, not same zone, not alive.
2. disable one stoppable instance, check failed
3. re-enable the instance and remove the disabled partition, check for that instance passed again.
Several places use TreeSet and TreeMap for sets and maps because we would like to guarantee that the output is consistent. We do see different ordering between Java 7 and Java 8.
RB=1596424
BUG=HELIX-1680
G=helix-reviewers
A=lxia
Signed-off-by: Hunter Lee <hulee@linkedin.com>
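The TreeSet/TreeMap remark in the commit above can be illustrated with a small sketch. The class `SortedOutputSketch`, its `sorted` method, and the zone and instance names are hypothetical. The point is only that sorted collections iterate in key order regardless of how the hash-based input happens to be laid out.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper that copies a result into sorted collections. */
final class SortedOutputSketch {
  /** Copies zone -> instances into a TreeMap of TreeSets for stable iteration order. */
  static Map<String, Set<String>> sorted(Map<String, ? extends Collection<String>> input) {
    Map<String, Set<String>> out = new TreeMap<>();
    for (Map.Entry<String, ? extends Collection<String>> entry : input.entrySet()) {
      // TreeSet orders the instance names; TreeMap orders the zone names.
      out.put(entry.getKey(), new TreeSet<>(entry.getValue()));
    }
    return out;
  }
}
```

HashMap and HashSet make no ordering guarantee, so serialized output and test assertions built on them can differ between JVM versions even when the content is identical.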
Yi Wang [Wed, 27 Mar 2019 00:53:19 +0000 (17:53 -0700)]
Report instance started & health status when getting by id
RB=1609426
BUG=helix-1732
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Thu, 28 Mar 2019 23:33:01 +0000 (16:33 -0700)]
Rename instance health check enum to be more explicit
RB=1612544
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Wed, 20 Mar 2019 23:46:22 +0000 (16:46 -0700)]
Single stoppable API impl
RB=1603158
G=helix-reviewers
A=jxue,hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 19 Mar 2019 21:16:53 +0000 (14:16 -0700)]
Implementation of ClusterService's getClusterTopology method
RB=1601257
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 19 Mar 2019 21:16:16 +0000 (14:16 -0700)]
Interface design for zone mapping information
RB=1578905
BUG=helix-1646
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Wed, 14 Nov 2018 00:59:18 +0000 (16:59 -0800)]
Fix node swap test.
Add sleep to stabilize the test. Several cluster operations require controller reaction before checking.
RB=1484466
G=helix-reviewers
A=hrzhang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Fri, 15 Mar 2019 23:35:01 +0000 (16:35 -0700)]
Add util to check whether an instance is already in stable state
We have two choices for checking whether an instance is in a stable state:
1. Compare IdealState with ExternalView
2. Compare IdealState with CurrentState.
We chose IS vs EV because:
1. We have a simple cache in REST; reading current state would still cause multiple reads from different hosts. But EV can be shared by each host.
2. EV is the decision maker for the router, which makes it the source of truth for the real production environment. So EV is the final choice.
RB=1598176
BUG=HELIX-1676
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
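The IdealState vs ExternalView comparison chosen above can be sketched with plain maps. This is an assumption-laden illustration rather than the actual utility: `StableStateSketch` and `isStable` are hypothetical names, and both inputs are modeled as partition -> (instance -> state) maps for a single resource.

```java
import java.util.Map;

/** Hypothetical stable-state check comparing ideal mapping with external view. */
final class StableStateSketch {
  /**
   * Returns true when every partition in the ideal mapping has exactly the
   * same instance -> state assignment in the external view.
   */
  static boolean isStable(Map<String, Map<String, String>> idealMapping,
                          Map<String, Map<String, String>> externalView) {
    for (Map.Entry<String, Map<String, String>> partition : idealMapping.entrySet()) {
      Map<String, String> actual = externalView.get(partition.getKey());
      // A missing partition (null) or any differing assignment means not stable.
      if (!partition.getValue().equals(actual)) {
        return false;
      }
    }
    return true;
  }
}
```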
Junkai Xue [Tue, 12 Mar 2019 19:40:41 +0000 (12:40 -0700)]
Dummy check for customized API
This change builds the dummy check for the customized API. It contains the following changes:
1. RESTConfig can set up the customized URL
2. Define the endpoints per participant and per partition.
3. Add dummy logic that returns true for all customized check statuses.
RB=1596427
BUG=HELIX-1678
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 12 Mar 2019 19:47:04 +0000 (12:47 -0700)]
Fix helix-ui build failure due to wrong config reference
RB=1592781
G=helix-reviewers
A=lxia,hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Fri, 8 Mar 2019 23:28:35 +0000 (15:28 -0800)]
Add adminGroup check for write operations
ACLOVERRIDE
RB=1590175
BUG=HELIX-1682
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
ywang4 [Mon, 25 Feb 2019 23:04:50 +0000 (15:04 -0800)]
Apply the JerseyTestUriRequestBuilder to the TestInstanceAccessor
RB=1575013
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
ywang4 [Fri, 22 Feb 2019 18:12:20 +0000 (10:12 -0800)]
Create util class to make it easier to make rest request
RB=1573157
G=helix-reviewers
A=jxue,hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
ywang4 [Wed, 20 Feb 2019 22:23:08 +0000 (14:23 -0800)]
get instance's pending messages with state model def parameter
Update the get() method in AbstractTestClass in order to take the correct QueryParam
BUGS=HELIX-1645
RB=1570393
BUG=HELIX-1645
G=helix-reviewers
A=hulee,jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 12 Mar 2019 23:59:25 +0000 (16:59 -0700)]
Util methods for checking if instance healthy
RB=1585486
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 3 Apr 2019 01:23:17 +0000 (18:23 -0700)]
TASK: Make isJobQueue backward compatible
Making isJobQueue backward compatible by adding isTerminable() check.
RB=1617516
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 3 Apr 2019 01:09:03 +0000 (18:09 -0700)]
TASK: Fix possible NPE in getWorkflowId()
Old workflows may not have WorkflowID field set. This makes getWorkflowId() backward-compatible by falling back on its ZNRecord id instead.
RB=1617517
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
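The backward-compatible fallback described in the commit above can be sketched as follows. The nested `Record` class is a hypothetical stand-in for a ZNRecord-like object; only the field name `WorkflowID` comes from the commit message.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of falling back to the record id when WorkflowID is unset. */
final class WorkflowIdSketch {
  /** Minimal stand-in for a ZNRecord: an id plus simple key/value fields. */
  static final class Record {
    final String id;
    final Map<String, String> simpleFields = new HashMap<>();

    Record(String id) {
      this.id = id;
    }
  }

  /** Returns the WorkflowID field if present, otherwise the record's own id. */
  static String getWorkflowId(Record record) {
    String workflowId = record.simpleFields.get("WorkflowID");
    // Old workflows may lack the field; the record id avoids returning null.
    return workflowId != null ? workflowId : record.id;
  }
}
```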
Hunter Lee [Tue, 2 Apr 2019 21:17:23 +0000 (14:17 -0700)]
TASK: Fix cleanupQueue() API
This API is meant for JobQueues only. However, it was checking only using isTerminable(), which is a deprecated flag for whether a workflow is a queue or not.
Changelist:
1. Add isJobQueue() check in cleanupQueue() in TaskDriver
RB=1616870
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Fri, 29 Mar 2019 18:56:13 +0000 (11:56 -0700)]
Migrate Helix to Java 8
This diff migrates the project to JDK1.8. This diff does not change any functionalities/core logic. It contains a few style changes and redundant code changes.
Changelist:
1. Change to Java 8
2. Upgrade dependencies in pom.xml
RB=1613418
BUG=HELIX-1742
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
bd2019us [Fri, 12 Apr 2019 13:39:44 +0000 (08:39 -0500)]
HELIX-816 use System.currentTimeMillis()
bd2019us [Sun, 31 Mar 2019 19:43:06 +0000 (14:43 -0500)]
[HELIX-815] fix bug to avoid potential crash
Hunter Lee [Thu, 28 Mar 2019 19:31:25 +0000 (12:31 -0700)]
TASK2.0: Job scheduling core pipeline fixes
Task Framework 2.0 had stability issues and race conditions that weren't being handled correctly. Also, integration with RuntimeJobDag had some loopholes that needed to be fixed. This diff includes such fixes and improvements, which make it show real performance gains and cut down on redundant computation.
Changelist:
1. Race condition when a job is enqueued, only the new JobConfig is updated and not the DAG
Add a two-way selective update which ensures consistency between JobConfigs and parent DAGs
2. Moved where getNextJob() is called in scheduleJobs() in WorkflowDispatcher
This ensures that once a RuntimeJobDag is rebuilt, update for jobs happens in one pipeline run, which removes any extra delay or slowness
3. Race condition where the job you got from getNextJob is for some reason not schedulable
This is due to deleting and enqueuing a job of the same name
RuntimeJobDag has the old job name, which conflicts with the dependency in the new DAG
This fixes the test: TestTaskRebalancerStopResume so that it does not enqueue a job of the same name
4. JobRebalancer was throwing an NPE when calling processJobStatusUpdateAndAssignment()
This was sometimes making the Controller hang
Added a null check for JobConfig (job could have been deleted/purged)
5. Fix bug with isWorkflowStopped
TargetState comparison was done in the opposite way
This fixes the test: TestRecurringJobQueue's testDeletingRecurrentQueueWithHistory()
Sometimes contexts do not get deleted cleanly but this does not affect correctness
6. Add TestEnqueueJobs
7. Fix unstable TestGetLastScheduledTaskExecInfo
8. Other minor style fixes
Hunter Lee [Thu, 28 Mar 2019 19:30:09 +0000 (12:30 -0700)]
TASK2.0: Add performance metrics to JobMonitor
We want to add more metrics to Task Framework so that the user could understand what's going on in case of a slowdown, or get a general sense of how fast the workload is moving.
Changelist:
1. Add SubmissionToProcessDelay
2. Add SubmissionToScheduleDelay
3. Add ControllerInducedDelay (for testing)
4. Add JobLatencyGauge
5. Change regular metrics to Dynamic metrics in JobMonitor
6. Add an integration test: TestTaskPerformanceMetrics
Hunter Lee [Thu, 28 Mar 2019 19:29:38 +0000 (12:29 -0700)]
TASK: Fix bug in isWorkflowStopped
A bug in isWorkflowStopped was causing the workflow context for the recurrent workflow template to show up as STOPPED. This diff fixes this so that it handles recurrent workflow templates correctly.
Hunter Lee [Thu, 28 Mar 2019 19:29:16 +0000 (12:29 -0700)]
HELIX: Bypass throttling for disabled partitions
This diff allows all state transitions linked to disabled instances/partitions to bypass throttling constraints.
Changelist:
1. Modify logic in IntermediateStateCalcStage
2. Add more integration tests
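The bypass described in the commit above can be sketched with a simple quota counter. This is not the logic from IntermediateStateCalcStage: `ThrottleSketch` and `tryCharge` are hypothetical names, and a single integer quota stands in for Helix's throttling constraints.

```java
/** Hypothetical throttle that lets transitions for disabled partitions bypass the quota. */
final class ThrottleSketch {
  private int remaining;

  ThrottleSketch(int quota) {
    this.remaining = quota;
  }

  /**
   * Returns true if the state transition may proceed.
   * Transitions tied to a disabled instance/partition skip the quota entirely.
   */
  boolean tryCharge(boolean partitionOrInstanceDisabled) {
    if (partitionOrInstanceDisabled) {
      return true; // bypass: consumes no quota
    }
    if (remaining <= 0) {
      return false; // throttled
    }
    remaining--;
    return true;
  }
}
```

With a quota of 1, a second regular transition is throttled while a transition for a disabled partition still goes through.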
Hunter Lee [Thu, 28 Mar 2019 19:27:52 +0000 (12:27 -0700)]
HELIX: Recovery balance partitions with disabled top-state replicas
Previously, disabling of partitions or disabled instances did not affect Helix's throttling logic. This was problematic because the ability to disable was designed in order to move partitions/replicas out of the given instance as a measure to deal with unhealthy partitions/instances. This diff allows disabled partitions to go into recovery balance, and when the user has not set any throttling configs for recovery balance, these state transitions will go through unthrottled, avoiding downtime.
Changelist:
1. Add a check for determining rebalance type for a given partition
2. Add an integration test
Hunter Lee [Thu, 28 Mar 2019 19:27:26 +0000 (12:27 -0700)]
TASK: Fix bug where JobDispatcher does not create UserContentStore for new jobs
It was observed that there are multiple logic paths where a new job could get scheduled: 1. scheduleJobs() 2. processJobStatusUpdateAndAssignment(). When a job is being assigned by the latter, JobDispatcher would fail to create the UserContentStore for the job, causing all subsequent read/writes to this UserContentStore to fail. This is a temporary fix and further refactoring of code paths would be required in order to consolidate where new jobs get scheduled.
Changelist:
1. Add UserContentStore for jobs with null contexts
Hunter Lee [Thu, 28 Mar 2019 19:26:58 +0000 (12:26 -0700)]
HELIX: Fix all DataUpdaters so that they check for null previous records
Some implementations are missing the null check for the ZNRecord data prior to the update in the DataUpdater pattern. This is a fatal bug that could cause NullPointerExceptions. This diff goes through the codebase and addresses this.
Changelist:
1. Add null checks for all DataUpdater implementations
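The null check described in the commit above can be sketched as follows. To stay self-contained, the example uses `java.util.function.UnaryOperator` over a plain map as a stand-in for a DataUpdater over ZNRecord. `NullSafeUpdaterSketch` and `setField` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical updater that tolerates a missing (null) previous record. */
final class NullSafeUpdaterSketch {
  /** Builds an updater that sets one field, creating the record if none exists yet. */
  static UnaryOperator<Map<String, String>> setField(String key, String value) {
    return current -> {
      // The previous data is null when the node does not exist yet, so start
      // from an empty record instead of dereferencing null.
      Map<String, String> next =
          (current == null) ? new HashMap<>() : new HashMap<>(current);
      next.put(key, value);
      return next;
    };
  }
}
```

Copying the current record before modifying it is a choice made for this sketch. An updater that mutates the record in place needs the same null guard.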
Hunter Lee [Thu, 28 Mar 2019 19:26:21 +0000 (12:26 -0700)]
Fix read previous assignment for workflows
With this change, TaskDataCache will:
1. Not pollute the log with read failures for previous assignment
2. Reduce the amount of reads from the previous assignment
Hunter Lee [Thu, 28 Mar 2019 19:24:44 +0000 (12:24 -0700)]
Fix N -> N + 1 extra bootstrap
When rebalancing, Helix does the following before dropping replicas that are not in the preference list:
1. Send a replica on a node that just came online from OFFLINE to SLAVE
2. Drop a replica on a node that just bootstrapped
This happened because the rebalancer was trying to make all current states exactly match the states in the best possible mapping. Once the conditions were met, Helix started dropping replicas. This fix makes Helix guarantee only that the replicas in the preference list match the states in the best possible mapping, so the rebalancer does not have to wait and bootstrap extra replicas that are not in the preference list.
Hunter Lee [Thu, 28 Mar 2019 19:18:29 +0000 (12:18 -0700)]
HELIX: Fix typo in ResourceControllerDataProvider's API
Hunter Lee [Thu, 28 Mar 2019 19:08:58 +0000 (12:08 -0700)]
Add Hunter Lee as Committer
Junkai Xue [Thu, 7 Mar 2019 22:16:43 +0000 (14:16 -0800)]
Disable helix-front build
narendly [Tue, 5 Mar 2019 23:17:43 +0000 (15:17 -0800)]
Revise the markdown and add more context to how-to guide
Junkai Xue [Wed, 6 Mar 2019 23:05:39 +0000 (15:05 -0800)]
Update menu bar
narendly [Tue, 5 Mar 2019 23:17:43 +0000 (15:17 -0800)]
Revise the markdown and add more context to how-to guide
narendly [Mon, 4 Mar 2019 22:37:58 +0000 (14:37 -0800)]
Add markdown for auto-exit of maintenance mode
Junkai Xue [Thu, 28 Feb 2019 23:40:32 +0000 (15:40 -0800)]
[maven-release-plugin] prepare for next development iteration
Junkai Xue [Thu, 28 Feb 2019 23:40:22 +0000 (15:40 -0800)]
[maven-release-plugin] prepare release helix-0.8.4
narendly [Thu, 28 Feb 2019 23:24:17 +0000 (15:24 -0800)]
[HELIX-814] HELIX: Add back ClusterDataCache for backward-compatibility
It was discovered that removing ClusterDataCache and changing public interfaces (RebalanceStrategy, Rebalancer) caused backward-incompatibility. This diff aims to solve this issue by creating a backward-compatible ClusterDataCache (deprecated).
Junkai Xue [Thu, 28 Feb 2019 23:28:35 +0000 (15:28 -0800)]
Revert "[maven-release-plugin] prepare release helix-0.8.4"
This reverts commit 3e7794a74414ca45fd0d526cf53960241cc52898.
Junkai Xue [Thu, 28 Feb 2019 23:28:24 +0000 (15:28 -0800)]
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit 3abe3a4c29939ea027af19f21b3e4855910fa1ff.
Junkai Xue [Wed, 27 Feb 2019 19:11:15 +0000 (11:11 -0800)]
[maven-release-plugin] prepare for next development iteration
Junkai Xue [Wed, 27 Feb 2019 19:11:04 +0000 (11:11 -0800)]
[maven-release-plugin] prepare release helix-0.8.4
Junkai Xue [Wed, 27 Feb 2019 18:57:41 +0000 (10:57 -0800)]
Enable helix-front build
Junkai Xue [Wed, 27 Feb 2019 18:57:19 +0000 (10:57 -0800)]
Revert "[maven-release-plugin] prepare release helix-0.8.4"
This reverts commit b7395468b424df30a0b46da684862bb248d65c2c.
Junkai Xue [Wed, 27 Feb 2019 18:57:07 +0000 (10:57 -0800)]
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit e41f8a68ceb9134881c6e6d53c6a5a7efdbdeedc.
Junkai Xue [Wed, 27 Feb 2019 05:17:52 +0000 (21:17 -0800)]
[maven-release-plugin] prepare for next development iteration
Junkai Xue [Wed, 27 Feb 2019 05:17:40 +0000 (21:17 -0800)]
[maven-release-plugin] prepare release helix-0.8.4
Junkai Xue [Wed, 27 Feb 2019 05:10:40 +0000 (21:10 -0800)]
Bump ivy version
Junkai Xue [Wed, 27 Feb 2019 04:48:33 +0000 (20:48 -0800)]
Release note and docs for 0.8.4
narendly [Wed, 27 Feb 2019 01:30:47 +0000 (17:30 -0800)]
[HELIX-812] HELIX: Fix maintenance history bug
There was a bug in maintenance history where when the cluster exits maintenance mode automatically, it would record the exit action twice in history. This is because each pipeline is designed to run MaintenanceRecoveryStage twice.
Changelist:
1. Add a flag so that if maintenanceSignal has been changed, just return from MaintenanceRecoveryStage
narendly [Wed, 27 Feb 2019 01:29:40 +0000 (17:29 -0800)]
[HELIX-811] HELIX: Only log relayMsg if it doesn't exist
This log was flooding our log files. We need to change it so that relay messages only get logged once.
narendly [Wed, 27 Feb 2019 01:28:38 +0000 (17:28 -0800)]
[HELIX-810] HELIX: Fix NPE in InstanceMessagesCache
It was observed that InstanceMessagesCache was throwing an NPE when it tries to setRelayTime(). This is likely because some relay messages have target instances that are no longer live (thus not in liveInstanceMap). InstanceMessagesCache must handle this gracefully by skipping the operation. We do not delete these msgs right away because the instance may come back alive. Otherwise, after some time has passed, the msg will get expired by the Controller and be removed.
Changelist:
1. Add a try-catch block
2. Improve logging
narendly [Wed, 27 Feb 2019 01:26:00 +0000 (17:26 -0800)]
[HELIX-809] TEST: Fix unstable TestClusterInMaintenanceModeWhenReachingMaxPartition
The pause for this was too short so the test was occasionally failing. This RB fixes this.
narendly [Wed, 27 Feb 2019 01:24:56 +0000 (17:24 -0800)]
[HELIX-808] TASK: Fix double-booking of tasks with task CurrentStates
It was observed that TestNoDoubleAssign was failing intermittently. Upon debugging with more detailed logs, there was a race condition between newly starting tasks and dropping tasks. To prevent this, dropping state transitions will be prioritized and prevInstanceToTaskAssignment will be built from CurrentStates. This is needed to make sure the right number of tasks are assigned every task pipeline and dropping transitions happen right away.
Changelist:
1. Change the logic for generating prevInstToTaskAssignment so that it's based on CurrentState
2. Add a special check for not updating task partition state upon Participant connection loss
3. TestNoDoubleAssign passes consistently
4. Fix TestNoDoubleAssign so that there won't be any thread leak
narendly [Wed, 27 Feb 2019 01:20:46 +0000 (17:20 -0800)]
[HELIX-807] REST: Add get maintenance signal endpoint
Changelist:
1. Add get maintenance signal endpoint
2. Add a test
narendly [Wed, 27 Feb 2019 01:19:14 +0000 (17:19 -0800)]
[HELIX-806] HELIX: Modify endpoints for instrumenting maintenance with custom fields
We want to use content for users to input their KV mappings as a JSON string.
Changelist:
1. Modify enable/disableMaintenanceMode endpoint logic
2. Modify tests