Jiajun Wang [Tue, 13 Aug 2019 18:30:35 +0000 (11:30 -0700)]
[maven-release-plugin] prepare release helix-0.9.1
Jiajun Wang [Mon, 12 Aug 2019 20:57:21 +0000 (13:57 -0700)]
Reenable helix-front module for official release.
Jiajun Wang [Mon, 12 Aug 2019 20:55:40 +0000 (13:55 -0700)]
Revert "[maven-release-plugin] prepare release helix-0.9.1"
This reverts commit c7e8e6366f6e5360d416e2fd1867252ebdcd7242.
Jiajun Wang [Mon, 12 Aug 2019 20:55:35 +0000 (13:55 -0700)]
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit f2746c823193991a0dd6152827b7344d66226368.
Jiajun Wang [Mon, 12 Aug 2019 20:36:09 +0000 (13:36 -0700)]
[maven-release-plugin] prepare for next development iteration
Jiajun Wang [Mon, 12 Aug 2019 20:35:57 +0000 (13:35 -0700)]
[maven-release-plugin] prepare release helix-0.9.1
Jiajun Wang [Mon, 12 Aug 2019 17:58:21 +0000 (10:58 -0700)]
Fix the CallbackHandler registration logic in DistributedLeaderElection (#395)
* Fix the CallbackHandler registration logic in DistributedLeaderElection that may leave a leader node with no callback registered.
Our current initialization logic assumes a strict sequence of leader acquire/relinquish events. However, because ZK events may be carried over from the previous ZK session, the controller node change event might be triggered in the following sequence:
1. CALLBACK (from the previous session): Create new leader node and add handlers.
2. FINALIZE (Handle the previous session expire): Clean up handlers.
3. INIT (For the new session establishment): Expect to add the handlers back again.
As a result, if the INIT event processing does not recover the handlers, the leader controller won't be able to manage anything. This fix ensures that every acquireLeadership call tries to initialize the leader controller's callback handlers.
Also, add test logic in TestHandleNewSession to verify the fix.
* Improve the leader history update logic so there is no duplicate entry recorded.
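The CALLBACK / FINALIZE / INIT sequence described above can be sketched as follows. This is a minimal illustration, not Helix's actual DistributedLeaderElection code: the class name, the handler names, and the methods are all hypothetical. The point it tries to show is that handler initialization is attempted on every leadership acquisition, so a FINALIZE that lands between two acquisitions does not leave the leader without handlers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a leader controller; not a Helix class.
class LeaderControllerSketch {
    // Names of the registered callback handlers (illustrative only).
    private final List<String> handlers = new ArrayList<>();

    // Called for both CALLBACK and INIT events. It always tries to
    // initialize the handlers, even if this node was already the leader.
    void acquireLeadership() {
        initHandlersIfMissing();
    }

    // Called for FINALIZE: the previous session expired, so clean up.
    void cleanUpHandlers() {
        handlers.clear();
    }

    // Idempotent: registering twice does not create duplicate handlers.
    private void initHandlersIfMissing() {
        if (handlers.isEmpty()) {
            handlers.add("idealStateListener");   // assumed handler name
            handlers.add("liveInstanceListener"); // assumed handler name
        }
    }

    int handlerCount() {
        return handlers.size();
    }

    public static void main(String[] args) {
        LeaderControllerSketch controller = new LeaderControllerSketch();
        controller.acquireLeadership(); // 1. CALLBACK from the previous session
        controller.cleanUpHandlers();   // 2. FINALIZE for the expired session
        controller.acquireLeadership(); // 3. INIT for the new session
        System.out.println("handlers registered: " + controller.handlerCount());
    }
}
```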
Hunter Lee [Fri, 9 Aug 2019 23:57:05 +0000 (16:57 -0700)]
TASK: Drop all tasks whose requested states are DROPPED
Upon a Participant disconnect, the Participant carries over from the last session. This copies all previous task states to the current session and sets their requested states to DROPPED (for tasks in INIT and RUNNING states).
It came to our attention that sometimes these Participants experience connection issues while the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This change adds logic so that all tasks whose requested states are DROPPED are dropped immediately.
Changelist:
1. Make sure all tasks whose requested state is DROPPED get added to tasksToDrop
2. Add a unit test: TestDropTerminalTasksUponReset
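A minimal sketch of the selection rule in the changelist above, assuming a hypothetical TaskInfo record rather than Helix's real task context classes: a task is picked for dropping based only on its requested state, regardless of whether its current state is RUNNING, TASK_ERROR, or COMPLETED.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DropRequestedTasksSketch {
    // Hypothetical task record: the current state and the requested state.
    static final class TaskInfo {
        final String currentState;
        final String requestedState; // may be null when nothing is requested

        TaskInfo(String currentState, String requestedState) {
            this.currentState = currentState;
            this.requestedState = requestedState;
        }
    }

    // Collect every task whose requested state is DROPPED; the current
    // state is deliberately not consulted.
    static List<String> tasksToDrop(Map<String, TaskInfo> tasks) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, TaskInfo> entry : tasks.entrySet()) {
            if ("DROPPED".equals(entry.getValue().requestedState)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, TaskInfo> tasks = new LinkedHashMap<>();
        tasks.put("task_0", new TaskInfo("RUNNING", "DROPPED"));
        tasks.put("task_1", new TaskInfo("TASK_ERROR", "DROPPED"));
        tasks.put("task_2", new TaskInfo("COMPLETED", null));
        System.out.println(tasksToDrop(tasks));
    }
}
```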
Junkai Xue [Mon, 5 Aug 2019 23:33:50 +0000 (16:33 -0700)]
Improve ZK read with batch call
Currently, HealthReport is read with a single call per participant. Improve it with a batch call to ZK to reduce the number of calls.
Junkai Xue [Tue, 6 Aug 2019 03:53:13 +0000 (20:53 -0700)]
Add reviews@helix.apache.org to mailing list
Junkai Xue [Mon, 5 Aug 2019 23:25:03 +0000 (16:25 -0700)]
Stabilize the REST tests
Stabilize the REST tests with the following changes:
1. Remove the temporary cluster, which impacts the ClusterAccessor test.
2. Add start/end messages to all tests for debugging purposes.
3. Disable the unstable monitoring test for default MBeans. Sometimes we can query them and sometimes not. It is not a critical test path. Let's make it stable later.
Hunter Lee [Tue, 6 Aug 2019 18:32:16 +0000 (11:32 -0700)]
Read ClusterConfig from ZK selectively
Previously, ClusterConfig would be read from ZK on every pipeline run. This PR makes it a selective read and also adds it to the set of all changed types, so that the cluster change detector can more easily tell whether ClusterConfig changed without having to store two copies of ClusterConfig objects.
kaisun2000 [Tue, 6 Aug 2019 18:58:16 +0000 (11:58 -0700)]
Fix RoutingTableProvider statePropagationLatency metric reporting bug (#365)
Issue:
The CurrentStateCache snapshot update would miss all the existing partitions that have a state change.
The RoutingTableProvider callback runs on the main event thread, and its time is not accounted for in the log.
Description:
Fix the bug by updating the snapshot with the correct reloadkeys.
Enhance the log to account for user callback code separately.
Tests:
mvn test passed.
Yi Wang [Tue, 23 Jul 2019 23:27:59 +0000 (16:27 -0700)]
Dynamically change the processor thread name when consuming event
Hunter Lee [Mon, 5 Aug 2019 17:16:27 +0000 (10:16 -0700)]
Remove DEFAULT_VIEW_CLUSTER_REFRESH_PERIOD from ClusterConfig
This is a constant that is no longer used.
Hunter Lee [Mon, 5 Aug 2019 16:19:51 +0000 (09:19 -0700)]
Remove .reviewboardrc from the open source repository
Ali Reza Zamani Zadeh Najari [Thu, 1 Aug 2019 17:54:28 +0000 (10:54 -0700)]
Remove unnecessary touch logic that triggers the pipeline
In the places where the ZooKeeper ResourceConfig is updated,
touch logic is no longer necessary to run the pipeline again.
A ResourceConfig update automatically triggers the pipeline.
This commit fixes issue #370.
jiajunwang [Tue, 30 Jul 2019 21:41:32 +0000 (14:41 -0700)]
Fix the race condition while Helix refreshes the cluster status cache. (#363)
* Fix the race condition while Helix refreshes the cluster status cache.
This change fixes issue #331.
The design ensures only one read to avoid locking during change notification. However, a later update introduced an additional read. As a result, two reads may return different results because notification is lock free. This leaves the cache in an inconsistent state. The impact is that the expected rebalance might not happen.
Ali Reza Zamani Zadeh Najari [Tue, 23 Jul 2019 22:15:47 +0000 (15:15 -0700)]
Remove TODO NPE log for computeResourceBestPossibleState
The logs related to the NPE in computeResourceBestPossibleState are not needed anymore.
This commit fixes issue #351.
Ali Najari [Wed, 17 Jul 2019 21:21:40 +0000 (14:21 -0700)]
Read Failure while reading non-existent znode
In this commit, when a NoNodeException is encountered while reading data from a znode that does not exist, the NoNodeException is caught and readfailurecounter is not incremented.
Instead, the related information (read Counter, read Latency, etc.) will be recorded.
This commit fixes issue #345.
Kai Sun [Wed, 17 Jul 2019 01:36:42 +0000 (18:36 -0700)]
Implementation of stateModelDef modification in REST 2.0
Current implementation of Rest 2.0 does not support stateModelDef modification. Here, we will implement
delete -- remove the stateModelDef with the input id.
put -- create new statemodeldef if no existing one with same input id
set -- replace the content of node with input id
We also add the following test cases:
Test delete model one; expect success
Test delete model one again; expect success
Create the deleted model one; expect success
Create the deleted model one again; expect failure as the same model id exists
Set the model one with modified content; expect success
Read the model one; expect the content would be same as modified content
Set the model one back to its original content to restore the original state; expect success
Ali Reza Zamani Zadeh Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)]
Change IllegalStateException to HelixException for CRUSH based rebalance strategy algorithm
In this commit, the IllegalStateException is caught and a HelixException is thrown to the upper layer instead, so the error log shows a more meaningful exception.
A test has been changed accordingly.
This commit fixes issue #322.
Junkai Xue [Mon, 22 Jul 2019 22:25:01 +0000 (15:25 -0700)]
Fix test fail for TestRebalanceScheduler
Junkai Xue [Tue, 16 Jul 2019 01:32:57 +0000 (18:32 -0700)]
Fix invoke rebalance by "touching" IdealState/ResourceConfig
The current HelixDataAccessor updateProperty uses ZNRecordUpdater. Its merge logic simply adds all elements when merging a ZNRecord, which could cause a lot of duplication in listFields.
This impacts invokeRebalanceForResourceConfig. The fix implements a customized updater.
In this commit:
1. Fix invoke rebalance with customized updater.
2. Add comments for ZNRecord merge.
3. Add checks in TaskUtil to only trigger Workflow Config "touch" when purging jobs.
4. Add a test for RebalanceScheduler.
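The duplication problem described in this commit can be illustrated with plain lists. This is a sketch under assumptions: the two helper methods below are hypothetical and only mimic an append-style merge versus a replace-style updater; they are not the ZNRecord or ZNRecordUpdater API, and the actual customized updater in Helix may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListFieldMergeSketch {
    // Append-style merge: every element of the update is added to the
    // existing list, so "touching" a record with its own content
    // duplicates every entry.
    static List<String> appendMerge(List<String> current, List<String> update) {
        List<String> merged = new ArrayList<>(current);
        merged.addAll(update);
        return merged;
    }

    // Replace-style updater (assumed behavior for illustration): the
    // updated list simply replaces the existing one.
    static List<String> replaceMerge(List<String> current, List<String> update) {
        return new ArrayList<>(update);
    }

    public static void main(String[] args) {
        List<String> preferenceList = Arrays.asList("host_1", "host_2");
        // Touching the record by writing back the same content:
        System.out.println(appendMerge(preferenceList, preferenceList));
        System.out.println(replaceMerge(preferenceList, preferenceList));
    }
}
```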
Ali Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)]
IllegalStateException for CRUSH based rebalance strategy algorithm.
This commit fixes the error log and exception that are shown when there are not enough eligible instances to use.
Junkai Xue [Tue, 9 Jul 2019 23:38:10 +0000 (16:38 -0700)]
Exclude ANY_INSTANCE for customized sibling checks
The current Helix HealthCheck API checks the ANY_INSTANCE resources, which is not necessary. Since ANY_INSTANCE resources only have a single partition with 1 replica, there is no need to check sibling health status.
This commit fixes issue #328
Junkai Xue [Tue, 9 Jul 2019 23:47:58 +0000 (16:47 -0700)]
Disable helix-front build
Hunter Lee [Wed, 12 Jun 2019 17:27:45 +0000 (10:27 -0700)]
Update all markdown files to 0.9.0
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 11 Jun 2019 19:01:49 +0000 (12:01 -0700)]
Update website with 0.9.0 with maven plugin version upgrades
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 11 Jun 2019 00:37:00 +0000 (17:37 -0700)]
Add Release Notes and Docs for 0.9.0 release
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 10 Jun 2019 23:57:51 +0000 (16:57 -0700)]
Upgrade ivy version for 0.9.0 release
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 3 Jun 2019 00:56:17 +0000 (17:56 -0700)]
prepare for next development iteration
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 25 Jun 2019 22:08:22 +0000 (15:08 -0700)]
[maven-release-plugin] prepare release helix-0.9.0
Hunter Lee [Sun, 2 Jun 2019 23:55:24 +0000 (16:55 -0700)]
Enable helix-front in pom.xml
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 25 Jun 2019 07:08:44 +0000 (00:08 -0700)]
Merge differences with another branch
There are multiple branches against which Helix devs have been doing development work. We wish to consolidate them into one by reconciling all differences. This diff makes such changes. This diff does not contain any changes in logic or functionality.
Junkai Xue [Fri, 21 Jun 2019 18:58:13 +0000 (11:58 -0700)]
Fix looping with keySet and modifying keySet same time
Looping over a keySet while modifying the same keySet could cause a ConcurrentModificationException. Fix that by adding the entries to an extra new Set and removing them after the loop is done.
RB=1711266
G=helix-reviewers
A=jjwang,hulee,ksun
Signed-off-by: Hunter Lee <hulee@linkedin.com>
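The fix pattern from the commit above (collect keys in a separate Set, remove them after the loop) can be sketched like this. The map contents and the removal condition are made up for illustration; only the deferred-removal pattern itself comes from the commit message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DeferredRemovalSketch {
    // Removes all entries with an empty value. Calling map.remove(key)
    // inside the for-each loop could throw ConcurrentModificationException,
    // so the keys are collected first and removed after the loop.
    static void removeEmptyValues(Map<String, String> map) {
        Set<String> keysToRemove = new HashSet<>();
        for (String key : map.keySet()) {
            if (map.get(key).isEmpty()) {
                keysToRemove.add(key);
            }
        }
        map.keySet().removeAll(keysToRemove);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("a", "");
        map.put("b", "value");
        map.put("c", "");
        removeEmptyValues(map);
        System.out.println(map.keySet());
    }
}
```

An explicit Iterator with iterator.remove(), or Collection.removeIf on Java 8, would be alternative ways to get the same effect.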
Jiajun Wang [Mon, 17 Jun 2019 21:37:02 +0000 (14:37 -0700)]
Always try reading from EphemeralOwner state first while reading the session ID from a live instance node.
This is to avoid an inconsistent session ID between the node content and the ephemeral owner state.
Note that, to ensure backward compatibility and to support some test cases, the newly introduced method will still read from the node content if the ephemeral owner state is empty (-1 or 0).
RB=1704942
BUG=HELIX-1969
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
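The fallback rule described above can be sketched as a small helper. This is an illustration only: the method name and parameters are hypothetical, and rendering the ephemeral owner as a hex string is an assumption about how the session ID is represented.

```java
class SessionIdSketch {
    // Prefer the ephemeral owner recorded in the znode stat. Fall back to
    // the session ID stored in the node content only when the ephemeral
    // owner is empty (-1 or 0).
    static String resolveSessionId(long ephemeralOwner, String sessionIdInContent) {
        if (ephemeralOwner != 0L && ephemeralOwner != -1L) {
            // Assumption: the session ID is the hex form of the owner.
            return Long.toHexString(ephemeralOwner);
        }
        return sessionIdInContent;
    }

    public static void main(String[] args) {
        System.out.println(resolveSessionId(0x16abcL, "stale-id"));
        System.out.println(resolveSessionId(0L, "id-from-content"));
    }
}
```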
Junkai Xue [Tue, 18 Jun 2019 22:38:24 +0000 (15:38 -0700)]
Fix compute IdealState mapping tool
There is a bug in the IdealState mapping tool. It does not filter out instances that are live but disabled. Add this logic and add extra tests for it.
RB=1706106
BUG=HELIX-1974
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 13 Jun 2019 01:39:50 +0000 (18:39 -0700)]
Enable default Jersey server metric reporting
For monitoring Helix REST, we can support both REST server monitoring and customized logic monitoring.
In this RB, we enable the Jersey server monitoring metrics and add tests for them.
RB=1701238
BUG=HELIX-1963
G=helix-reviewers
A=ywang4
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Lei Xia [Wed, 5 Jun 2019 00:15:17 +0000 (17:15 -0700)]
Remove relay message from controller's message cache immediately if the partition on the relay host turned to ERROR state while transitioning off the top state.
RB=1689771
BUG=HELIX-1900
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 10 Jun 2019 22:29:58 +0000 (15:29 -0700)]
Upgrade Apache rat version and add exclusion paths
A part of the Helix release process requires the rat plugin to perform checks. However, there are scripts and website files that do not require these checks. This diff adds exclusion paths to the pom.xml. Note that there are several Java files that still do not pass the checks because they lack the Apache license. This still needs to be fixed in the future.
RB=1695987
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Wed, 24 Apr 2019 23:50:30 +0000 (16:50 -0700)]
Catch exception and log error when helix-admin-webapp fails to read data from certain path
RB=1644066
BUG=https://jira01.corp.linkedin.com:8443/browse/EXC-114388
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Mon, 3 Jun 2019 18:15:54 +0000 (11:15 -0700)]
Fix http request hanging issue to the SN API
RB=1684758
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 30 May 2019 23:47:16 +0000 (16:47 -0700)]
Change output behavior for non-existent instances
Currently, a non-existent instance is not shown in the output, so it is hard for the user to tell whether the instance does not exist or does not belong to the same zone.
Add logic to check whether instances exist in the instance list.
RB=1684700
BUG=HELIX-1911
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 30 May 2019 01:19:43 +0000 (18:19 -0700)]
Fix check for disabled partitions
In the map field of disabled partitions, even when all partitions are enabled, some resource keys could be left over. We cannot just check whether there are any resource entries. With this fix, Helix loops over all the resource entries of the disabled map to see whether any partition list is not empty.
In addition, fix failed tests in REST.
RB=1683071
BUG=HELIX-1910
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
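The check described in the commit above can be sketched as follows, assuming the disabled-partition map is a plain map from resource name to a list of partition names. The method name and the data are hypothetical; the point is that a non-empty map is not enough, and each entry's list has to be inspected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DisabledPartitionCheckSketch {
    // Returns true only if at least one resource still has a non-empty
    // list of disabled partitions. Leftover keys with empty lists do not
    // count as disabled partitions.
    static boolean hasDisabledPartitions(Map<String, List<String>> disabledMap) {
        for (List<String> partitions : disabledMap.values()) {
            if (partitions != null && !partitions.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> disabledMap = new HashMap<>();
        disabledMap.put("MyDB", new ArrayList<>()); // leftover key, nothing disabled
        System.out.println(hasDisabledPartitions(disabledMap));
        disabledMap.put("OtherDB", Arrays.asList("OtherDB_0"));
        System.out.println(hasDisabledPartitions(disabledMap));
    }
}
```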
Lei Xia [Fri, 17 May 2019 17:36:22 +0000 (10:36 -0700)]
Remove workaround in sending S->M message when there is a same pending relay message.
RB=1670732
BUG=HELIX-1871
G=helix-reviewers
A=jjwang,jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 23 May 2019 19:36:59 +0000 (12:36 -0700)]
Change state transition monitor to per cluster per state transition
Existing state transition metrics are recorded per cluster, per resource, and per state transition. For a large cluster containing many resources, this may produce a tremendous number of metrics on the participant side.
This RB changes the metrics to per cluster per state transition, which should be sufficient for monitoring purposes.
RB=1677106
BUG=HELIX-1890
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Sat, 1 Jun 2019 00:08:03 +0000 (17:08 -0700)]
Upgrade ZK to 3.4.13
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Wed, 22 May 2019 02:01:16 +0000 (19:01 -0700)]
Adding Zk data change callback propagation latency metric.
Note that the latency metric only covers data change callback for now.
To add child change callbacks, we need to find a way to avoid the additional ZK access that is required to read the children nodes' stats. Added a TODO in the corresponding code block.
RB=1674550
G=helix-reviewers
A=jxue,ksun
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Thu, 23 May 2019 07:52:18 +0000 (00:52 -0700)]
Introduce ZkPathStatRecord to record watch reinstall in ZkClient.
This is to avoid duplicate watch re-install in the current implementation.
In addition, we will also leverage this Event to report ZK data propagation latency. The related change has been split into another RB.
Hunter Lee [Sun, 2 Jun 2019 23:54:03 +0000 (16:54 -0700)]
Disable JavaDoc check in pom.xml
Java started enforcing JavaDoc lint checks going from Java 7 to Java 8. Since the codebase has a lot of JavaDoc that does not pass the lint check, we are going to disable this check temporarily.
Hunter Lee [Tue, 28 May 2019 18:25:05 +0000 (11:25 -0700)]
Fix TestZNRecordSizeLimit
This diff fixes TestZNRecordSizeLimit so that it considers the default behavior to be auto-compression of ZNodes enabled.
Hunter Lee [Sat, 25 May 2019 01:18:54 +0000 (18:18 -0700)]
Remove vestiges of cluster view aggregator
While merging commits, some code got merged into the master branch accidentally, and this commit removes such code so that the project builds.
Lei Xia [Fri, 17 May 2019 16:17:33 +0000 (09:17 -0700)]
Two minor improvements. 1) Avoid persisting null entries into CurrentStateOutput, 2) add additional info to the CallbackProcess thread name to differentiate threads.
RB=1670214
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Wed, 15 May 2019 15:28:16 +0000 (08:28 -0700)]
Refactor StateTransitionStatMonitor to extend DynamicMbean
To support per state transition latency, the first step is to change the StateTransitionStatMonitor to DynamicMbean.
RB=1671496
BUG=HELIX-1890
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Fri, 22 Feb 2019 23:36:46 +0000 (15:36 -0800)]
Add message latency record to StateTransitionStatMonitor.
This record provides an additional breakdown to help understand the state transition delay.
RB=1573606
BUG=HELIX-1625
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Sat, 18 May 2019 00:43:40 +0000 (17:43 -0700)]
Fix unstable test for TestZKUtil
Since tests run in parallel, a race condition messed up the data in ZK. Fix it by using different IDs.
RB=1671516
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Kai Sun [Mon, 20 May 2019 16:43:36 +0000 (09:43 -0700)]
Title: Helix-1842: test active a cluster to super cluster with default to FULL_AUTO
Description:
This is a follow-up to the previous diff at RB 1666833. In the previous diff, we did not really trigger the logic such that the participants (controllers) in the supercluster monitor the added cluster. We fixed it in this diff.
Also, we enhanced the test cases to tear down orphaned threads.
The following is the original description of this task:
In the current v2 REST API, adding a cluster to the supercluster sets the new IdealState to SEMI_AUTO. In this change, we make the defaults the following:
REBALANCE_MODE = FULL_AUTO,
replicas = 3,
REBALANCER = DelayedAutoRebalancer,
REBALANCE_STRATEGY = CrushEdRebalanceStrategy.
Also, we set the number of controller replicas to 3, instead of using all of the controllers as currently implemented.
RB=1671448
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 20 May 2019 01:10:31 +0000 (18:10 -0700)]
Add support for HTTPS in CustomRestClient
This diff configures SSLContext (Helix REST server's) into its HTTP client
RB=1671108
G=helix-reviewers
R=cjerian,zpolicze
A=ywang4
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Fri, 17 May 2019 22:48:07 +0000 (15:48 -0700)]
Skip the sibling checks for resource without minActiveReplica checks
RB=1670752
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Fri, 17 May 2019 00:40:34 +0000 (17:40 -0700)]
TEST: Further fix Helix test suite
This diff does the following:
1. Replace Thread.sleep statements with TestHelper.verify (polling with conditions)
2. Increase GC pause between tests to 4 seconds
3. Improve ZKHelixClusterVerifier's verifyByPolling method by adding invokeRebalance() method
RB=1669831
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
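Item 1 of the list above replaces fixed sleeps with condition polling. The helper below is a generic sketch of that idea and is not the TestHelper.verify implementation; the method name, parameters, and timing values are assumptions chosen for illustration.

```java
import java.util.function.BooleanSupplier;

class PollingVerifySketch {
    // Polls the condition until it holds or the timeout elapses.
    // Returns true as soon as the condition is satisfied, so a test waits
    // only as long as needed instead of sleeping for a fixed duration.
    static boolean pollUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Hypothetical condition: becomes true about 50 ms after start.
        boolean ok = pollUntil(() -> System.currentTimeMillis() - start >= 50, 2000, 10);
        System.out.println("condition met: " + ok);
    }
}
```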
Jiajun Wang [Thu, 16 May 2019 22:46:29 +0000 (15:46 -0700)]
Refine missing top state log method.
The parameter naming was confusing. The log message was not clear. This RB fixes both issues.
RB=1669555
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Kai Sun [Tue, 14 May 2019 20:55:33 +0000 (13:55 -0700)]
Title: Helix-1842: add a resource/cluster to super cluster with default FULL_AUTO
In the current v2 REST API, adding a cluster to the supercluster sets the new IdealState to SEMI_AUTO. In this change, we make the defaults the following:
REBALANCE_MODE = FULL_AUTO,
replicas = 3,
REBALANCER = DelayedAutoRebalancer,
REBALANCE_STRATEGY = CrushEdRebalanceStrategy.
Also, we set the number of controller replicas to 3, instead of using all of the controllers as currently implemented.
address the first batch of reviews from hunter and jiajun.
address further comments
RB=1666833
BUG=Helix-1842
G=helix-reviewers
R=jxue,jjwang,ywang4,lxia,hulee,eblumena
A=hulee,jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Sat, 4 May 2019 00:52:17 +0000 (17:52 -0700)]
TEST: Groom and refactor Helix integration tests
It was observed that there was a lot of technical debt (improper and buggy cleanup) in Helix's unit and integration tests. There were also mock controller and participant threads that were never shut down properly. This was preventing mvn test suite from completing over a remote machine (TMC), and even on local environments, mvn test was not passing. This diff refactors tests and makes sure that ZK is cleaned up after tests.
Changelist:
1. Inspect and correct mock threads (controller, participant, spectator, etc)
2. Ensure there are no leftover garbage clusters from tests
3. Java 8 syntax
4. Style fixes in old tests using Helix open source style file (helix-style.xml)
RB=1654905
G=helix-reviewers
A=jxue,eblumena
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Thu, 9 May 2019 22:59:59 +0000 (15:59 -0700)]
Fix critical Task Framework throttle bug
Task throttling feature had a logical bug where it wouldn't count any of the pending task assignments, which was breaking task throttling. This diff fixes it.
RB=1661127
BUG=HELIX-1875
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
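The throttle bug above comes down to which assignments are counted against the limit. The arithmetic can be sketched as follows, assuming a simple per-instance cap; the method and parameter names are hypothetical and do not mirror the Task Framework's actual classes.

```java
class TaskThrottleSketch {
    // Remaining capacity must subtract both the tasks already running and
    // the assignments that are still pending. Ignoring the pending ones
    // (the bug described above) would over-assign tasks.
    static int availableSlots(int maxConcurrentTasks, int runningTasks, int pendingAssignments) {
        return Math.max(0, maxConcurrentTasks - runningTasks - pendingAssignments);
    }

    public static void main(String[] args) {
        // Cap of 10, 4 running, 5 pending: only 1 more task may be assigned.
        System.out.println(availableSlots(10, 4, 5));
    }
}
```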
Junkai Xue [Thu, 9 May 2019 22:25:40 +0000 (15:25 -0700)]
Add tests for cancellation message with p2p
Add a test case to ensure that a cancellation message will not cancel a p2p relay message while it is in the pending state.
RB=1661028
G=helix-reviewers
A=lxia
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Fri, 3 May 2019 23:03:37 +0000 (16:03 -0700)]
Bug fix: reuse the stable logic to verify the difference between idealStates and externalViews
RB=1654700
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Tue, 30 Apr 2019 22:18:21 +0000 (15:18 -0700)]
Avoid locking the cache object when requiring a FullRefresh.
The old synchronization control logic prevents requiring a full refresh if a refresh is in progress. This may lead to slow callback handling. In this change, we remove the original synchronization control. The current cache update logic is able to handle gradually refreshed data. There is no need to lock the full refresh request.
RB=1652941
BUG=HELIX-1851,gcn-29329
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Fri, 3 May 2019 00:45:08 +0000 (17:45 -0700)]
Integrate customRestClient health check with instance service main logic
RB=1645567
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Sat, 20 Apr 2019 00:27:19 +0000 (17:27 -0700)]
implementation of CustomRestClient (post request and get health checks)
RB=1638858
G=helix-reviewers
R=cjerian
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Thu, 25 Apr 2019 21:03:21 +0000 (14:03 -0700)]
Fix the log logic in HelixManager.isLeader().
The output is not correct when isLeader() is false.
RB=1645218
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 11 Apr 2019 21:38:45 +0000 (14:38 -0700)]
Support partition level health mapping fetch from ZK
Fetching partition level health status is different from per instance querying. Helix will try to get data from ZK under the HEALTH_REPORT folder first. If the data is expired (checked with the EXPIRE entry), Helix will directly call the participant's API to get the latest data.
Otherwise, we shall assume the customized check has failed.
RB=1628988
BUG=HELIX-1785
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 23 Apr 2019 18:57:36 +0000 (11:57 -0700)]
Fix the public API non-backward compatible change
RB=1641513
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 16 Apr 2019 20:21:33 +0000 (13:21 -0700)]
TASK: Fix String formatting issue
For integers, you must use %d, not %f.
RB=1633161
BUG=HELIX-1794
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
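The %d versus %f rule from the commit above can be demonstrated with java.lang.String.format. The message text and method names are made up for illustration; the behavior shown (an IllegalFormatConversionException when %f receives an integer) is standard java.util.Formatter behavior.

```java
import java.util.IllegalFormatConversionException;

class FormatSpecifierSketch {
    // Correct: %d is the conversion for integral types.
    static String formatCount(int count) {
        return String.format("Tasks: %d", count);
    }

    // Returns true if formatting an int with %f is rejected at runtime.
    static boolean floatSpecifierRejectsInt(int count) {
        try {
            String.format("Tasks: %f", count);
            return false;
        } catch (IllegalFormatConversionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(formatCount(3));
        System.out.println("%f rejects int: " + floatSpecifierRejectsInt(3));
    }
}
```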
Yi Wang [Wed, 3 Apr 2019 21:36:24 +0000 (14:36 -0700)]
More unit tests for InstanceValidationUtil
RB=1617333
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 11 Apr 2019 00:18:30 +0000 (17:18 -0700)]
Add util for checking per instance level health and partition level health
Customized health checks include a user-customized per-instance check, which is isolated from other instances.
In addition to the per-instance check, the partition-level check should have a complete scope across the instances that hold sibling partitions. This partition check guarantees that, after shutting down the instance being checked, there are still healthy replicas to hold the top state.
RB=1627813
BUG=HELIX-1776
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 10 Apr 2019 23:58:34 +0000 (16:58 -0700)]
Fix TestRecurringJobQueue
This diff fixes TestRecurringJobQueue's testDeletingRecurrentQueueWithHistory
RB=1627625
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 10 Apr 2019 23:56:11 +0000 (16:56 -0700)]
TASK: Fix bug in delete()
The delete() call was doing a force delete on workflows created from a recurrent workflow. This would cause a race condition between the controller cache and the deletion. This diff fixes this.
Changelist:
1. Fix the logic in delete()
RB=1627615
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Fri, 29 Mar 2019 19:13:48 +0000 (12:13 -0700)]
IntermediateStateCalcStage style change
This diff includes code style fixes and refactor using Java 8 features.
RB=1613452
BUG=HELIX-1742
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Fri, 29 Mar 2019 19:08:07 +0000 (12:08 -0700)]
Task Framework code style change
This diff includes style changes using Java 8 features.
RB=1613441
BUG=HELIX-1742
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Wed, 10 Apr 2019 20:47:19 +0000 (13:47 -0700)]
Fix tests in Helix REST
RB=1627064
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 9 Apr 2019 04:44:40 +0000 (21:44 -0700)]
TASK: Add deleteJob namespaced job name support
Currently, deletion of jobs from JobQueues only supports denamespaced job names. This makes it impossible for users to list all jobs and delete them because they sometimes cannot recover the denamespaced names.
Changelist:
1. Add support for namespaced job names for deletion
RB=1624395
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 9 Apr 2019 04:08:37 +0000 (21:08 -0700)]
TASK: Fix bug in getExpiredJobs()
getExpiredJobs() had a bug where if the job has the same expiry time as workflow's default expiry, it would always override it with Workflow's expiry config. This is not correct.
Changelist:
1. Remove a block of code where it overrides expiry config with WorkflowConfig's default expiry
RB=1624376
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Thu, 4 Apr 2019 00:12:56 +0000 (17:12 -0700)]
Fix faulty logic in BestPossibleExternalViewVerifier
removeEntryWithIgnoredStates() was not really doing what it was supposed to do. This diff fixes this.
Also, a small delay is added to make TestDrop more stable.
RB=1619153
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Wed, 3 Apr 2019 21:34:29 +0000 (14:34 -0700)]
TEST: Fix UserContentStore related tests in helix-rest
The behavior changed such that if the client-side code does not find the UserContent ZNode, it creates one instead of throwing an NPE. This fixes the tests so that it adapts to the new behavior. This behavior should be reverted eventually because UserContent ZNode should be created only by the Controller.
RB=1618685
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Sat, 30 Mar 2019 00:28:07 +0000 (17:28 -0700)]
Check sibling nodes to guarantee MIN_ACTIVE_REPLICAS satisfied
RB=1614128
G=helix-reviewers
A=jxue,hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Mon, 1 Apr 2019 23:56:03 +0000 (16:56 -0700)]
Fix test failures and fix the stable state check logic
Fix test failures:
1. Add logic to skip task framework idealstates
2. fix logic for test failure.
RB=1615941
BUG=HELIX-1725
G=helix-reviewers
A=lxia
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Lei Xia [Mon, 1 Apr 2019 22:55:05 +0000 (15:55 -0700)]
Fix unit test by starting the REST server only once.
RB=1615610
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Mon, 1 Apr 2019 22:54:21 +0000 (15:54 -0700)]
Swallow exceptions during health status checks for getting instance by id
RB=1615554
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Fri, 29 Mar 2019 23:40:51 +0000 (16:40 -0700)]
Refactor InstanceAccessor to InstancesAccessor and PerInstanceAccessor
RB=1614063
BUG=HELIX-1725
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Fri, 29 Mar 2019 18:11:35 +0000 (11:11 -0700)]
Global instance stoppable API
This API takes a list of instances as input and returns the stoppable instances. The checks performed here are:
1. The single stoppable check for each instance.
2. Whether shutting down the instances would cause the replica count to drop below the min active number.
For the first phase, we do not implement instance based selection.
Here we added an integration test:
1. Test instances that are disabled, have a disabled partition, are not in the same zone, or are not alive.
2. Disable one stoppable instance; the check fails.
3. Re-enable the instance and remove the disabled partition; the check for that instance passes again.
Several places use TreeSet and TreeMap for sets and maps because we would like to guarantee that the output is consistent. We do see different ordering between Java 7 and Java 8.
RB=1596424
BUG=HELIX-1680
G=helix-reviewers
A=lxia
Signed-off-by: Hunter Lee <hulee@linkedin.com>
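The reason for switching to TreeSet and TreeMap in the commit above can be sketched as follows. The zone and instance names are made up; the example only shows that sorted collections give the same iteration order regardless of hash ordering, which is what keeps the output stable across Java versions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SortedOutputSketch {
    // Copies a zone -> instances mapping into sorted collections so that
    // iteration (and therefore any serialized output) is deterministic.
    static Map<String, Set<String>> sortedByZone(Map<String, List<String>> instancesByZone) {
        Map<String, Set<String>> sorted = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : instancesByZone.entrySet()) {
            sorted.put(entry.getKey(), new TreeSet<>(entry.getValue()));
        }
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new HashMap<>();
        input.put("zone2", Arrays.asList("instance_b", "instance_a"));
        input.put("zone1", Arrays.asList("instance_d", "instance_c"));
        System.out.println(sortedByZone(input));
    }
}
```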
Yi Wang [Wed, 27 Mar 2019 00:53:19 +0000 (17:53 -0700)]
Report instance started & health status when getting by id
RB=1609426
BUG=helix-1732
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Thu, 28 Mar 2019 23:33:01 +0000 (16:33 -0700)]
Rename instance health check enum to be more explicit
RB=1612544
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Wed, 20 Mar 2019 23:46:22 +0000 (16:46 -0700)]
Single stoppable API impl
RB=1603158
G=helix-reviewers
A=jxue,hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 19 Mar 2019 21:16:53 +0000 (14:16 -0700)]
Implementation of ClusterService's getClusterTopology method
RB=1601257
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 19 Mar 2019 21:16:16 +0000 (14:16 -0700)]
Interface design for zone mapping information
RB=1578905
BUG=helix-1646
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Tue, 19 Mar 2019 21:16:16 +0000 (14:16 -0700)]
Interface design for zone mapping information
RB=1578905
BUG=helix-1646
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Wed, 14 Nov 2018 00:59:18 +0000 (16:59 -0800)]
Fix node swap test.
Add sleep to stabilize the test. Several cluster operations require a controller reaction before checking.
RB=1484466
G=helix-reviewers
A=hrzhang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Fri, 15 Mar 2019 23:35:01 +0000 (16:35 -0700)]
Add Util to check whether an instance is already in a stable state
We have two choices for checking whether an instance is in a stable state:
1. Compare IdealState with ExternalView
2. Compare IdealState with CurrentState.
We finally chose IS vs EV because:
1. We have a simple cache in REST; reading current state will still cause multiple reads from different hosts. But the EV can be shared by each host.
2. EV is the decision maker for the router, which is effectively the source of truth for the real production environment. So EV is the final choice.
RB=1598176
BUG=HELIX-1676
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Tue, 12 Mar 2019 19:40:41 +0000 (12:40 -0700)]
Dummy check for customized API
This change builds a dummy check for the customized API. It contains the following changes:
1. RESTConfig can set up the customized URL.
2. Define the endpoints per participant and per partition.
3. Add dummy logic that returns true for all the check statuses of customized checks.
RB=1596427
BUG=HELIX-1678
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>