helix.git
2 years ago[maven-release-plugin] prepare release helix-0.9.1 helix-0.9.1
Jiajun Wang [Tue, 13 Aug 2019 18:30:35 +0000 (11:30 -0700)] 
[maven-release-plugin] prepare release helix-0.9.1

2 years agoReenable helix-front module for official release.
Jiajun Wang [Mon, 12 Aug 2019 20:57:21 +0000 (13:57 -0700)] 
Reenable helix-front module for official release.

2 years agoRevert "[maven-release-plugin] prepare release helix-0.9.1"
Jiajun Wang [Mon, 12 Aug 2019 20:55:40 +0000 (13:55 -0700)] 
Revert "[maven-release-plugin] prepare release helix-0.9.1"

This reverts commit c7e8e6366f6e5360d416e2fd1867252ebdcd7242.

2 years agoRevert "[maven-release-plugin] prepare for next development iteration"
Jiajun Wang [Mon, 12 Aug 2019 20:55:35 +0000 (13:55 -0700)] 
Revert "[maven-release-plugin] prepare for next development iteration"

This reverts commit f2746c823193991a0dd6152827b7344d66226368.

2 years ago[maven-release-plugin] prepare for next development iteration
Jiajun Wang [Mon, 12 Aug 2019 20:36:09 +0000 (13:36 -0700)] 
[maven-release-plugin] prepare for next development iteration

2 years ago[maven-release-plugin] prepare release helix-0.9.1
Jiajun Wang [Mon, 12 Aug 2019 20:35:57 +0000 (13:35 -0700)] 
[maven-release-plugin] prepare release helix-0.9.1

2 years agoFix the CallbackHandler registration logic in DistributedLeaderElection (#395)
Jiajun Wang [Mon, 12 Aug 2019 17:58:21 +0000 (10:58 -0700)] 
Fix the CallbackHandler registration logic in DistributedLeaderElection (#395)

* Fix the CallbackHandler registration logic in DistributedLeaderElection that may leave a leader node with no callback registered.

Our current initialization logic assumes a strict sequence of leader acquire/relinquish events. However, because ZK events may be carried over from the previous ZK session, the controller node change event might be triggered in the following sequence:
1. CALLBACK (from the previous session): Create new leader node and add handlers.
2. FINALIZE (Handle the previous session expire): Clean up handlers.
3. INIT (For the new session establishment): Expect to add the handlers back again.
As a result, if the INIT event processing does not recover the handlers, the leader controller won't be able to manage anything. This fix ensures that every acquireLeadership call tries to initialize the leader controller's callback handlers.

Also, add additional test logic in TestHandleNewSession to verify the fix.

* Improve the leader history update logic so there is no duplicate entry recorded.
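
The CALLBACK, FINALIZE, INIT sequence described in this commit message can be illustrated with a minimal sketch. The class, its methods, and the handler names are hypothetical and are not the actual DistributedLeaderElection code; the sketch only shows why initializing handlers on every acquireLeadership call recovers from that ordering.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a controller's callback handler bookkeeping.
final class LeaderHandlersSketch {
  private final Set<String> handlers = new HashSet<>();
  private boolean leader = false;

  // Every acquire attempt also (re)initializes the handlers. Set.add is
  // idempotent, so repeated calls do not register duplicates.
  void acquireLeadership() {
    leader = true; // assumption: the leader node is owned by this controller
    handlers.add("handlerA"); // placeholder handler names
    handlers.add("handlerB");
  }

  void onCallback() { acquireLeadership(); } // event carried over from the old session
  void onFinalize() { handlers.clear(); }    // old session expired: clean up handlers
  void onInit() { acquireLeadership(); }     // new session: handlers must come back

  boolean canManageCluster() { return leader && !handlers.isEmpty(); }

  public static void main(String[] args) {
    LeaderHandlersSketch controller = new LeaderHandlersSketch();
    controller.onCallback();
    controller.onFinalize();
    controller.onInit();
    System.out.println("can manage cluster: " + controller.canManageCluster());
  }
}
```

If onInit skipped the handler initialization because the controller already considered itself leader, canManageCluster would stay false after FINALIZE.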

2 years agoTASK: Drop all tasks whose requested states are DROPPED
Hunter Lee [Fri, 9 Aug 2019 23:57:05 +0000 (16:57 -0700)] 
TASK: Drop all tasks whose requested states are DROPPED

Upon a Participant disconnect, the Participant would carry over from the last session. This would copy all previous task states to the current session and set their requested states as DROPPED (for INIT and RUNNING states).

It came to our attention that sometimes these Participants experience connection issues while the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This change adds logic that immediately drops all tasks whose requested states are DROPPED.
Changelist:
1. Make sure all tasks whose requested state is DROPPED get added to tasksToDrop
2. Add a unit test: TestDropTerminalTasksUponReset
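
A minimal sketch of the selection rule in the changelist above, assuming a simplified task model. The `Task` type, the `State` enum, and `tasksToDrop` as a method are illustrative and do not mirror the Helix Task Framework classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DropRequestedTasksSketch {
  enum State { INIT, RUNNING, TASK_ERROR, COMPLETED, DROPPED }

  // Hypothetical task record: current state plus the requested state (may be null).
  static final class Task {
    final State current;
    final State requested;
    Task(State current, State requested) {
      this.current = current;
      this.requested = requested;
    }
  }

  // A task is dropped whenever its requested state is DROPPED, regardless of
  // its current state (including TASK_ERROR and COMPLETED).
  static Set<String> tasksToDrop(Map<String, Task> tasks) {
    Set<String> toDrop = new TreeSet<>();
    for (Map.Entry<String, Task> entry : tasks.entrySet()) {
      if (entry.getValue().requested == State.DROPPED) {
        toDrop.add(entry.getKey());
      }
    }
    return toDrop;
  }

  public static void main(String[] args) {
    Map<String, Task> tasks = new HashMap<>();
    tasks.put("t1", new Task(State.COMPLETED, State.DROPPED));
    tasks.put("t2", new Task(State.RUNNING, null));
    System.out.println(tasksToDrop(tasks));
  }
}
```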

2 years agoImprove ZK read with batch call
Junkai Xue [Mon, 5 Aug 2019 23:33:50 +0000 (16:33 -0700)] 
Improve ZK read with batch call

Currently, HealthReport is read with a single call per participant. Improve it by using a batch call to ZK to reduce the number of calls.
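
To make the saving concrete, here is a small sketch that counts round trips for per-participant reads versus one batched read. `CountingStore` is a hypothetical in-memory stand-in for a ZooKeeper-backed store, not a Helix or ZooKeeper API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BatchReadSketch {
  // Hypothetical store that counts how many requests it receives.
  static final class CountingStore {
    private final Map<String, String> data = new HashMap<>();
    int calls = 0;

    void put(String path, String value) { data.put(path, value); }

    String get(String path) {
      calls++; // one request per path
      return data.get(path);
    }

    List<String> getAll(List<String> paths) {
      calls++; // one request for the whole batch
      List<String> values = new ArrayList<>();
      for (String path : paths) {
        values.add(data.get(path));
      }
      return values;
    }
  }

  static List<String> readOneByOne(CountingStore store, List<String> paths) {
    List<String> values = new ArrayList<>();
    for (String path : paths) {
      values.add(store.get(path));
    }
    return values;
  }

  static List<String> readBatched(CountingStore store, List<String> paths) {
    return store.getAll(paths);
  }

  public static void main(String[] args) {
    CountingStore store = new CountingStore();
    List<String> paths = Arrays.asList("/p1/health", "/p2/health", "/p3/health");
    for (String path : paths) {
      store.put(path, "OK");
    }
    readOneByOne(store, paths);
    System.out.println("calls after one-by-one reads: " + store.calls);
    store.calls = 0;
    readBatched(store, paths);
    System.out.println("calls after batched read: " + store.calls);
  }
}
```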

2 years agoAdd reviews@helix.apache.org to mailing list
Junkai Xue [Tue, 6 Aug 2019 03:53:13 +0000 (20:53 -0700)] 
Add reviews@helix.apache.org to mailing list

2 years agoStabilize the REST tests
Junkai Xue [Mon, 5 Aug 2019 23:25:03 +0000 (16:25 -0700)] 
Stabilize the REST tests

Stabilize the REST tests with the following changes:
1. Remove the temporary cluster, which impacts the ClusterAccessor test.
2. Add start/end messages to all tests for debugging purposes.
3. Disable the unstable monitoring test for default MBeans. Sometimes we can query it and sometimes not. It is not a critical test path. Let's make it stable later.

2 years agoRead ClusterConfig from ZK selectively
Hunter Lee [Tue, 6 Aug 2019 18:32:16 +0000 (11:32 -0700)] 
Read ClusterConfig from ZK selectively

Previously, ClusterConfig would be read from ZK on every pipeline run. This PR makes it a selective read and also adds it to the set of all changed types so that the cluster change detector can more easily tell whether ClusterConfig changed without having to store two copies of ClusterConfig objects.

2 years agoFix RoutingTableProvider statePropagationLatency metric reporting bug (#365)
kaisun2000 [Tue, 6 Aug 2019 18:58:16 +0000 (11:58 -0700)] 
Fix RoutingTableProvider statePropagationLatency metric reporting bug (#365)

Issue:

When CurrentStateCache updates its snapshot, it misses all existing partitions that have a state change.

The RoutingTableProvider callback runs on the main event thread, and its time is not accounted for in the log.

Description:
Fix the bug by updating the snapshot with the correct reloadkeys.

Enhance the log to account for user callback code separately.

Tests:
mvn test passed.

2 years agoDynamically change the processor thread name when consuming event
Yi Wang [Tue, 23 Jul 2019 23:27:59 +0000 (16:27 -0700)] 
Dynamically change the processor thread name when consuming event

2 years agoRemove DEFAULT_VIEW_CLUSTER_REFRESH_PERIOD from ClusterConfig
Hunter Lee [Mon, 5 Aug 2019 17:16:27 +0000 (10:16 -0700)] 
Remove DEFAULT_VIEW_CLUSTER_REFRESH_PERIOD from ClusterConfig

This is a constant that is no longer used.

2 years agoRemove .reviewboardrc from the open source repository
Hunter Lee [Mon, 5 Aug 2019 16:19:51 +0000 (09:19 -0700)] 
Remove .reviewboardrc from the open source repository

2 years agoRemove unnecessary touch logic that triggers the pipeline
Ali Reza Zamani Zadeh Najari [Thu, 1 Aug 2019 17:54:28 +0000 (10:54 -0700)] 
Remove unnecessary touch logic that triggers the pipeline

In the places where the ZooKeeper ResourceConfig is updated,
it is no longer necessary to apply the touch logic to run the pipeline again.
A ResourceConfig update automatically triggers the pipeline.

This commit fixes issue #370.

2 years agoFix the race condition while Helix refresh cluster status cache. (#363)
jiajunwang [Tue, 30 Jul 2019 21:41:32 +0000 (14:41 -0700)] 
Fix the race condition while Helix refresh cluster status cache. (#363)

* Fix the race condition while Helix refresh cluster status cache.

This change fixes issue #331.
The design ensures a single read to avoid locking during change notification. However, a later update introduced an additional read. As a result, the two reads may return different results because notification is lock-free. This leaves the cache in an inconsistent state. The impact is that the expected rebalance might not happen.

2 years agoRemove TODO NPE log for computeResourceBestPossibleState
Ali Reza Zamani Zadeh Najari [Tue, 23 Jul 2019 22:15:47 +0000 (15:15 -0700)] 
Remove TODO NPE log for computeResourceBestPossibleState

The logs related to the NPE in computeResourceBestPossibleState are not needed anymore.

This commit fixes issue #351.

2 years agoRead Failure while reading non-existent znode
Ali Najari [Wed, 17 Jul 2019 21:21:40 +0000 (14:21 -0700)] 
Read Failure while reading non-existent znode

In this commit, when a NoNodeException is encountered while reading data from a znode that does not exist, the NoNodeException will be caught and readfailurecounter will not be incremented.
Instead, the related information (read Counter, read Latency, etc.) will be recorded.
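
The counting rule described above could look roughly like the following sketch. The class, the nested `NoNodeException`, and the counter fields are hypothetical stand-ins rather than the real ZkClient monitoring code, and rethrowing the exception after recording is an assumption of this sketch.

```java
final class ReadMonitorSketch {
  // Hypothetical stand-in for the "znode does not exist" error.
  static final class NoNodeException extends RuntimeException {}

  // Hypothetical read function, so the sketch needs no ZooKeeper dependency.
  interface Reader {
    String read(String path);
  }

  int readCounter = 0;
  int readFailureCounter = 0;
  long totalLatencyNanos = 0;

  private void recordRead(long startNanos) {
    readCounter++;
    totalLatencyNanos += System.nanoTime() - startNanos;
  }

  String monitoredRead(Reader reader, String path) {
    long start = System.nanoTime();
    try {
      String value = reader.read(path);
      recordRead(start);
      return value;
    } catch (NoNodeException e) {
      // A missing node is recorded as a normal read, not as a failure.
      recordRead(start);
      throw e; // assumption: the caller still sees the exception
    } catch (RuntimeException e) {
      readFailureCounter++; // any other error counts as a read failure
      throw e;
    }
  }

  public static void main(String[] args) {
    ReadMonitorSketch monitor = new ReadMonitorSketch();
    try {
      monitor.monitoredRead(p -> { throw new NoNodeException(); }, "/missing");
    } catch (NoNodeException expected) {
      // expected for a non-existent path
    }
    System.out.println("reads=" + monitor.readCounter + " failures=" + monitor.readFailureCounter);
  }
}
```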

This commit fixes issue #345.

2 years agoImplementation of stateModelDef modification in REST 2.0
Kai Sun [Wed, 17 Jul 2019 01:36:42 +0000 (18:36 -0700)] 
Implementation of stateModelDef modification in REST 2.0

The current implementation of REST 2.0 does not support stateModelDef modification. Here, we implement:

delete -- remove the stateModelDef with the input id.

put -- create new statemodeldef if no existing one with same input id

set -- replace the content of node with input id

We also add the following test cases:

Test delete model one; expect success
Test delete model one again; expect success
Create the deleted model one; expect success
Create the deleted model one again; expect failure as the same model id exists
Set the model one with modified content; expect success
Read the model one; expect the content would be same as modified content
Set the model one back to its original content to restore the original state; expect success
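
The delete/put/set semantics and the test sequence above can be mirrored with an in-memory sketch. `StateModelDefStoreSketch` is hypothetical and is not the REST resource itself; in particular, letting `set` create a missing entry is an assumption, since the commit message only says it replaces the content.

```java
import java.util.HashMap;
import java.util.Map;

final class StateModelDefStoreSketch {
  private final Map<String, String> defs = new HashMap<>();

  // delete: remove the definition; deleting an absent id also succeeds.
  boolean delete(String id) {
    defs.remove(id);
    return true;
  }

  // put: create only when no definition with the same id exists.
  boolean put(String id, String content) {
    return defs.putIfAbsent(id, content) == null;
  }

  // set: replace the content stored under the id.
  boolean set(String id, String content) {
    defs.put(id, content);
    return true;
  }

  String get(String id) {
    return defs.get(id);
  }

  public static void main(String[] args) {
    StateModelDefStoreSketch store = new StateModelDefStoreSketch();
    System.out.println("first put: " + store.put("one", "original"));
    System.out.println("second put: " + store.put("one", "original"));
    store.set("one", "modified");
    System.out.println("content: " + store.get("one"));
  }
}
```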

2 years agoChange IllegalStateException to Helix Exception for CRUSH based rebalance strategy...
Ali Reza Zamani Zadeh Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)] 
Change IllegalStateException to Helix Exception for CRUSH based rebalance strategy algorithm

In this commit, the IllegalStateException is caught and a HelixException is thrown to the upper layer instead. The error log now shows a more meaningful exception.
A test has been changed accordingly.

This commit fixes issue #322.

2 years agoFix test fail for TestRebalanceScheduler
Junkai Xue [Mon, 22 Jul 2019 22:25:01 +0000 (15:25 -0700)] 
Fix test fail for TestRebalanceScheduler

2 years agoFix invoke rebalance by "touching" IdealState/ResourceConfig
Junkai Xue [Tue, 16 Jul 2019 01:32:57 +0000 (18:32 -0700)] 
Fix invoke rebalance by "touching" IdealState/ResourceConfig

The current HelixDataAccessor updateProperty uses ZNRecordUpdater. Its merge logic simply adds all elements when merging a ZNRecord, which can cause a lot of duplication in listFields.
This impacts invokeRebalanceForResourceConfig. The fix implements a customized updater.

In this commit:
1. Fix invoke rebalance with customized updater.
2. Add comments for ZNRecord merge.
3. Add checks in TaskUtil to only trigger Workflow Config "touch" when purge job.
4. Add a test for RebalanceScheduler.
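
The duplication problem can be reproduced with plain maps of lists. The sketch below is illustrative only: `appendMerge` imitates an append-style merge and `replaceUpdate` imitates a customized updater that overwrites list fields. Neither method is the actual ZNRecord or ZNRecordUpdater code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ListFieldUpdateSketch {
  private static Map<String, List<String>> deepCopy(Map<String, List<String>> source) {
    Map<String, List<String>> copy = new HashMap<>();
    for (Map.Entry<String, List<String>> entry : source.entrySet()) {
      copy.put(entry.getKey(), new ArrayList<>(entry.getValue()));
    }
    return copy;
  }

  // Append-style merge: every element of the update is added to the existing list,
  // so "touching" a record with its own content duplicates the list entries.
  static Map<String, List<String>> appendMerge(
      Map<String, List<String>> current, Map<String, List<String>> update) {
    Map<String, List<String>> merged = deepCopy(current);
    for (Map.Entry<String, List<String>> entry : update.entrySet()) {
      merged.computeIfAbsent(entry.getKey(), key -> new ArrayList<>()).addAll(entry.getValue());
    }
    return merged;
  }

  // Customized updater: lists from the update replace the existing lists.
  static Map<String, List<String>> replaceUpdate(
      Map<String, List<String>> current, Map<String, List<String>> update) {
    Map<String, List<String>> result = deepCopy(current);
    for (Map.Entry<String, List<String>> entry : update.entrySet()) {
      result.put(entry.getKey(), new ArrayList<>(entry.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, List<String>> record = new HashMap<>();
    record.put("partition_0", Arrays.asList("host1", "host2"));
    System.out.println("append merge:   " + appendMerge(record, record));
    System.out.println("replace update: " + replaceUpdate(record, record));
  }
}
```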

2 years agoIllegalStateException for CRUSH based rebalance strategy algorithm.
Ali Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)] 
IllegalStateException for CRUSH based rebalance strategy algorithm.

This commit fixes the error log and exception that are shown when there are not enough eligible instances to use.

2 years agoExclude ANY_INSTANCE for customized sibling checks
Junkai Xue [Tue, 9 Jul 2019 23:38:10 +0000 (16:38 -0700)] 
Exclude ANY_INSTANCE for customized sibling checks

Current Helix HealthCheck API checks the ANY_INSTANCE resources, which is not necessary. Since ANY_INSTANCE resources only have single partition with 1 replica, there is no need to check sibling health status.

This commit fixes issue #328

2 years agoDisable helix-front build
Junkai Xue [Tue, 9 Jul 2019 23:47:58 +0000 (16:47 -0700)] 
Disable helix-front build

3 years agoUpdate all markdown files to 0.9.0
Hunter Lee [Wed, 12 Jun 2019 17:27:45 +0000 (10:27 -0700)] 
Update all markdown files to 0.9.0

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoUpdate website with 0.9.0 with maven plugin version upgrades
Hunter Lee [Tue, 11 Jun 2019 19:01:49 +0000 (12:01 -0700)] 
Update website with 0.9.0 with maven plugin version upgrades

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd Release Notes and Docs for 0.9.0 release
Hunter Lee [Tue, 11 Jun 2019 00:37:00 +0000 (17:37 -0700)] 
Add Release Notes and Docs for 0.9.0 release

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoUpgrade ivy version for 0.9.0 release
Hunter Lee [Mon, 10 Jun 2019 23:57:51 +0000 (16:57 -0700)] 
Upgrade ivy version for 0.9.0 release

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoprepare for next development iteration
Hunter Lee [Mon, 3 Jun 2019 00:56:17 +0000 (17:56 -0700)] 
prepare for next development iteration

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago[maven-release-plugin] prepare release helix-0.9.0
Hunter Lee [Tue, 25 Jun 2019 22:08:22 +0000 (15:08 -0700)] 
[maven-release-plugin] prepare release helix-0.9.0

3 years agoEnable helix-front in pom.xml
Hunter Lee [Sun, 2 Jun 2019 23:55:24 +0000 (16:55 -0700)] 
Enable helix-front in pom.xml

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoMerge differences with another branch
Hunter Lee [Tue, 25 Jun 2019 07:08:44 +0000 (00:08 -0700)] 
Merge differences with another branch

There are multiple branches against which Helix devs have been doing development work. We wish to consolidate them into one by reconciling all differences. This diff makes such changes. This diff does not contain any changes in logic or functionality.

3 years agoFix looping with keySet and modifying keySet same time
Junkai Xue [Fri, 21 Jun 2019 18:58:13 +0000 (11:58 -0700)] 
Fix looping with keySet and modifying keySet same time

Looping over a keySet while modifying the same keySet can cause a ConcurrentModificationException. Fix this by adding the entries to a separate new Set and removing them after the loop is done.
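
A minimal sketch of the collect-then-remove pattern described above. The map contents and the removal condition (negative values) are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class KeySetRemovalSketch {
  // Collect the keys to remove while iterating, then remove them after the
  // loop, so the map is never structurally modified during iteration.
  static void removeNegativeEntries(Map<String, Integer> map) {
    Set<String> toRemove = new HashSet<>();
    for (String key : map.keySet()) {
      if (map.get(key) < 0) {
        toRemove.add(key);
      }
    }
    map.keySet().removeAll(toRemove);
  }

  public static void main(String[] args) {
    Map<String, Integer> map = new HashMap<>();
    map.put("a", 1);
    map.put("b", -1);
    map.put("c", -2);
    removeNegativeEntries(map);
    System.out.println(map.keySet());
  }
}
```

Calling `map.remove(key)` inside the for-each loop instead is what risks the ConcurrentModificationException.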

RB=1711266
G=helix-reviewers
A=jjwang,hulee,ksun

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAlways try reading from EphemeralOwner state first while reading the session ID from...
Jiajun Wang [Mon, 17 Jun 2019 21:37:02 +0000 (14:37 -0700)] 
Always try reading from EphemeralOwner state first while reading the session ID from a live instance node.

This is to avoid an inconsistent session ID between the node content and the ephemeral owner state.
Note that in order to ensure backward compatibility and keep some test cases working, the newly introduced method will still read from the node content if the ephemeral owner state is empty (-1 or 0).
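
The fallback rule can be sketched as a small pure function. The method name and the hex rendering of the owner value are assumptions made for this example, not the signature or format of the method introduced by the commit.

```java
final class SessionIdSketch {
  // Prefer the ephemeral owner recorded in the node's stat. Fall back to the
  // session ID stored in the node content when the owner is empty (-1 or 0).
  static String resolveSessionId(long ephemeralOwner, String sessionIdFromContent) {
    if (ephemeralOwner != -1L && ephemeralOwner != 0L) {
      return Long.toHexString(ephemeralOwner); // assumption: IDs are rendered as hex
    }
    return sessionIdFromContent;
  }

  public static void main(String[] args) {
    System.out.println(resolveSessionId(0x16abcL, "staleId"));
    System.out.println(resolveSessionId(0L, "idFromContent"));
  }
}
```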

RB=1704942
BUG=HELIX-1969
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix compute IdealState mapping tool
Junkai Xue [Tue, 18 Jun 2019 22:38:24 +0000 (15:38 -0700)] 
Fix compute IdealState mapping tool

There is a bug in the IdealState mapping tool: it does not filter out instances that are live but disabled. Add this logic and extra tests for it.
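
The filtering step amounts to a set difference. The sketch below uses hypothetical names and plain sets of instance names rather than the mapping tool's real inputs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class EligibleInstancesSketch {
  // Instances eligible for assignment: live instances minus the disabled ones.
  static Set<String> eligibleInstances(Set<String> liveInstances, Set<String> disabledInstances) {
    Set<String> eligible = new TreeSet<>(liveInstances);
    eligible.removeAll(disabledInstances);
    return eligible;
  }

  public static void main(String[] args) {
    Set<String> live = new HashSet<>(Arrays.asList("host1", "host2", "host3"));
    Set<String> disabled = new HashSet<>(Arrays.asList("host2"));
    System.out.println(eligibleInstances(live, disabled));
  }
}
```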

RB=1706106
BUG=HELIX-1974
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoEnable default Jersey server metric reporting
Junkai Xue [Thu, 13 Jun 2019 01:39:50 +0000 (18:39 -0700)] 
Enable default Jersey server metric reporting

For monitoring Helix REST, we can support both REST server monitoring and customized logic monitoring.
In this rb, we enable the Jersey server monitoring metrics and add tests for them.

RB=1701238
BUG=HELIX-1963
G=helix-reviewers
A=ywang4

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRemove relay message from controller's message cache immediately if the partition...
Lei Xia [Wed, 5 Jun 2019 00:15:17 +0000 (17:15 -0700)] 
Remove relay message from controller's message cache immediately if the partition on relay host turned to ERROR state while transits off from top-state.

RB=1689771
BUG=HELIX-1900
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoUpgrade Apache rat version and add exclusion paths
Hunter Lee [Mon, 10 Jun 2019 22:29:58 +0000 (15:29 -0700)] 
Upgrade Apache rat version and add exclusion paths

A part of the Helix release process requires the rat plugin to perform checks. However, there are scripts and website files that do not require this kind of check. This diff adds exclusion paths to the pom.xml. Note that there are several Java files that still do not pass the checks because they do not have the Apache license. This still needs to be fixed in the future.

RB=1695987
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoCatch exception and log error when helix-admin-webapp fails to read data from certain...
Yi Wang [Wed, 24 Apr 2019 23:50:30 +0000 (16:50 -0700)] 
Catch exception and log error when helix-admin-webapp fails to read data from certain path

RB=1644066
BUG=https://jira01.corp.linkedin.com:8443/browse/EXC-114388
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix http request hanging issue to the SN API
Yi Wang [Mon, 3 Jun 2019 18:15:54 +0000 (11:15 -0700)] 
Fix http request hanging issue to the SN API

RB=1684758
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoChange output behavior for non-exist instances
Junkai Xue [Thu, 30 May 2019 23:47:16 +0000 (16:47 -0700)] 
Change output behavior for non-exist instances

Currently, a non-existing instance is not shown in the output, so it is hard for the user to tell whether the instance does not exist or does not belong to the same zone.

Add logic to check whether the instances exist in the instance list.

RB=1684700
BUG=HELIX-1911
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix check for disabled partitions
Junkai Xue [Thu, 30 May 2019 01:19:43 +0000 (18:19 -0700)] 
Fix check for disabled partitions

In the map field of disabled partitions, some resource keys may be left over even when all partitions are enabled, so we cannot just check whether there are any resource entries. With this fix, Helix loops over all the resource entries of the disabled map to see whether any partition list is not empty.
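
A sketch of the corrected check, assuming the disabled-partition map is shaped as resource name to list of partition names. The method name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DisabledPartitionCheckSketch {
  // True only if some resource still lists at least one disabled partition.
  // A leftover resource key with an empty list does not count.
  static boolean hasDisabledPartitions(Map<String, List<String>> disabledPartitionsByResource) {
    if (disabledPartitionsByResource == null) {
      return false;
    }
    for (List<String> partitions : disabledPartitionsByResource.values()) {
      if (partitions != null && !partitions.isEmpty()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Map<String, List<String>> disabled = new HashMap<>();
    disabled.put("resourceA", Collections.<String>emptyList()); // leftover key
    System.out.println(hasDisabledPartitions(disabled));
    disabled.put("resourceB", Arrays.asList("resourceB_0"));
    System.out.println(hasDisabledPartitions(disabled));
  }
}
```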

In addition, fix failed tests in REST.

RB=1683071
BUG=HELIX-1910
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRemove workaround in sending S->M message when there is a same pending relay message.
Lei Xia [Fri, 17 May 2019 17:36:22 +0000 (10:36 -0700)] 
Remove workaround in sending S->M message when there is a same pending relay message.

RB=1670732
BUG=HELIX-1871
G=helix-reviewers
A=jjwang,jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoChange state transition monitor to per cluster per state transition
Junkai Xue [Thu, 23 May 2019 19:36:59 +0000 (12:36 -0700)] 
Change state transition monitor to per cluster per state transition

Existing state transition metrics are recorded per cluster, per resource, and per state transition. A large cluster containing lots of resources may produce a tremendous number of metrics on the participant side.

This RB changes the metrics to per cluster per state transition, which should be sufficient for monitoring purposes.

RB=1677106
BUG=HELIX-1890
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoUpgrade ZK to 3.4.13
Hunter Lee [Sat, 1 Jun 2019 00:08:03 +0000 (17:08 -0700)] 
Upgrade ZK to 3.4.13

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdding Zk data change callback propagation latency metric.
Jiajun Wang [Wed, 22 May 2019 02:01:16 +0000 (19:01 -0700)] 
Adding Zk data change callback propagation latency metric.

Note that the latency metric only covers data change callback for now.
To add the child change callback, we need to find a way to avoid the additional ZK access that is required to read the children nodes' stats. Added a TODO in the corresponding code block.

RB=1674550
G=helix-reviewers
A=jxue,ksun

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoIntroduce ZkPathStatRecord to record watch reinstall in ZkClient.
Jiajun Wang [Thu, 23 May 2019 07:52:18 +0000 (00:52 -0700)] 
Introduce ZkPathStatRecord to record watch reinstall in ZkClient.

This is to avoid duplicate watch re-install in the current implementation.
In addition, we will also leverage this Event to report ZK data propagation latency. The related change has been split into another rb.

3 years agoDisable JavaDoc check in pom.xml
Hunter Lee [Sun, 2 Jun 2019 23:54:03 +0000 (16:54 -0700)] 
Disable JavaDoc check in pom.xml

Java started enforcing JavaDoc lint checks going from Java 7 to Java 8. Since the codebase has a lot of JavaDoc that does not pass the lint check, we are going to disable this check temporarily.

3 years agoFix TestZNRecordSizeLimit
Hunter Lee [Tue, 28 May 2019 18:25:05 +0000 (11:25 -0700)] 
Fix TestZNRecordSizeLimit

This diff fixes TestZNRecordSizeLimit so that it considers the default behavior to be auto-compression of ZNodes enabled.

3 years agoRemove vestiges of cluster view aggregator
Hunter Lee [Sat, 25 May 2019 01:18:54 +0000 (18:18 -0700)] 
Remove vestiges of cluster view aggregator

While merging commits, some code got merged into the master branch accidentally, and this commit removes such code so that the project builds.

3 years agoTwo minor improvements. 1) Avoid persisting null entry into CurrentStateOutput, 2...
Lei Xia [Fri, 17 May 2019 16:17:33 +0000 (09:17 -0700)] 
Two minor improvements. 1) Avoid persisting null entry into CurrentStateOutput, 2) add additional info to the CallbackProcess thread name to differentiate threads.

RB=1670214
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRefactor StateTransitionStatMonitor extends DynamicMbean
Junkai Xue [Wed, 15 May 2019 15:28:16 +0000 (08:28 -0700)] 
Refactor StateTransitionStatMonitor extends DynamicMbean

To support per state transition latency, the first step is to change the StateTransitionStatMonitor to DynamicMbean.

RB=1671496
BUG=HELIX-1890
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd message latency record to StateTransitionStatMonitor.
Jiajun Wang [Fri, 22 Feb 2019 23:36:46 +0000 (15:36 -0800)] 
Add message latency record to StateTransitionStatMonitor.

This record provides with additional breakdown to understand the state transition delay.

RB=1573606
BUG=HELIX-1625
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix unstable test for TestZKUtil
Junkai Xue [Sat, 18 May 2019 00:43:40 +0000 (17:43 -0700)] 
Fix unstable test for TestZKUtil

Since tests run in parallel, a race condition caused the data in ZK to get mixed up. Fix it by using different ids.

RB=1671516
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTitle: Helix-1842: test active a cluster to super cluster with default to FULL_AUTO
Kai Sun [Mon, 20 May 2019 16:43:36 +0000 (09:43 -0700)] 
Title: Helix-1842: test active a cluster to super cluster with default to FULL_AUTO

Description:
This is a follow-up to the previous diff at rb 1666833. In the previous diff, we did not really trigger the logic such that the participants (controllers) in the supercluster will monitor the added cluster. We fixed it in this diff.

Also, we enhanced the test cases to tear down orphaned threads.

The following is the original description about this task:

In the current v2 REST API, adding a cluster to the supercluster sets the new IdealState to SEMI_AUTO. In this change, we make the defaults as follows:

REBALANCE_MODE = FULL_AUTO,
replicas = 3,
REBALANCER = DelayedAutoRebalancer,
REBALANCE_STRATEGY = CrushEdRebalanceStrategy.

Also, we set the number of controller replicas to 3, instead of using all of the controllers as currently implemented.

RB=1671448
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd support for HTTPS in CustomRestClient
Hunter Lee [Mon, 20 May 2019 01:10:31 +0000 (18:10 -0700)] 
Add support for HTTPS in CustomRestClient

This diff configures SSLContext (Helix REST server's) into its HTTP client

RB=1671108
G=helix-reviewers
R=cjerian,zpolicze
A=ywang4

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSkip the sibling checks for resource without minActiveReplica checks
Yi Wang [Fri, 17 May 2019 22:48:07 +0000 (15:48 -0700)] 
Skip the sibling checks for resource without minActiveReplica checks

RB=1670752
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTEST: Further fix Helix test suite
Hunter Lee [Fri, 17 May 2019 00:40:34 +0000 (17:40 -0700)] 
TEST: Further fix Helix test suite

This diff does the following:
1. Replace Thread.sleep statements with TestHelper.verify (polling with conditions)
2. Increases GC pause between tests to 4 seconds
3. Improve ZKHelixClusterVerifier's verifyByPolling method by adding invokeRebalance() method
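
Item 1 of the list above replaces fixed sleeps with condition polling. The helper below is a generic sketch of that idea; its name, signature, and parameters are assumptions for this example and it is not the actual TestHelper.verify implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class PollingVerifySketch {
  // Polls the condition until it holds or the timeout elapses. Returns as soon
  // as the condition is true instead of always waiting a fixed amount of time.
  static boolean verify(BooleanSupplier condition, long timeoutMs, long intervalMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.currentTimeMillis() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        return false;
      }
    }
  }

  public static void main(String[] args) {
    AtomicInteger checks = new AtomicInteger();
    // The condition becomes true on the third check.
    boolean ok = verify(() -> checks.incrementAndGet() >= 3, 1000, 10);
    System.out.println("verified: " + ok);
  }
}
```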

RB=1669831
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRefine missing top state log method.
Jiajun Wang [Thu, 16 May 2019 22:46:29 +0000 (15:46 -0700)] 
Refine missing top state log method.

The parameter naming was confusing. The log message was not clear. This RB fixes both issues.

RB=1669555
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTitle: Helix-1842: add a resource/cluster to super cluster with default FULL_AUTO
Kai Sun [Tue, 14 May 2019 20:55:33 +0000 (13:55 -0700)] 
Title: Helix-1842: add a resource/cluster to super cluster with default FULL_AUTO

In the current v2 REST API, adding a cluster to the supercluster sets the new IdealState to SEMI_AUTO. In this change, we make the defaults as follows:

REBALANCE_MODE = FULL_AUTO,
replicas = 3,
REBALANCER = DelayedAutoRebalancer,
REBALANCE_STRATEGY = CrushEdRebalanceStrategy.

Also, we set the number of controller replicas to 3, instead of using all of the controllers as currently implemented.

address the first batch of reviews from hunter and jiajun.
address further comments

RB=1666833
BUG=Helix-1842
G=helix-reviewers
R=jxue,jjwang,ywang4,lxia,hulee,eblumena
A=hulee,jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTEST: Groom and refactor Helix integration tests
Hunter Lee [Sat, 4 May 2019 00:52:17 +0000 (17:52 -0700)] 
TEST: Groom and refactor Helix integration tests

It was observed that there was a lot of technical debt (improper and buggy cleanup) in Helix's unit and integration tests. There were also mock controller and participant threads that were never shut down properly. This was preventing mvn test suite from completing over a remote machine (TMC), and even on local environments, mvn test was not passing. This diff refactors tests and makes sure that ZK is cleaned up after tests.

Changelist:
1. Inspect and correct mock threads (controller, participant, spectator, etc)
2. Ensure there are no leftover garbage clusters from tests
3. Java 8 syntax
4. Style fixes in old tests using Helix open source style file (helix-style.xml)

RB=1654905
G=helix-reviewers
A=jxue,eblumena

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix critical Task Framework throttle bug
Hunter Lee [Thu, 9 May 2019 22:59:59 +0000 (15:59 -0700)] 
Fix critical Task Framework throttle bug

Task throttling feature had a logical bug where it wouldn't count any of the pending task assignments, which was breaking task throttling. This diff fixes it.
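
As a rough illustration of the accounting involved, the sketch below counts pending assignments against the limit together with running tasks. The method and parameter names are hypothetical and the real throttling logic is more involved.

```java
final class TaskThrottleSketch {
  // Remaining capacity must subtract pending assignments as well as running
  // tasks; ignoring the pending ones lets new assignments exceed the limit.
  static int remainingCapacity(int maxConcurrentTasks, int runningTasks, int pendingAssignments) {
    int inUse = runningTasks + pendingAssignments;
    return Math.max(0, maxConcurrentTasks - inUse);
  }

  public static void main(String[] args) {
    System.out.println(remainingCapacity(5, 2, 2));
  }
}
```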

RB=1661127
BUG=HELIX-1875
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd tests for cancellation message with p2p
Junkai Xue [Thu, 9 May 2019 22:25:40 +0000 (15:25 -0700)] 
Add tests for cancellation message with p2p

Add a test case to ensure that a cancellation message does not cancel a p2p relay message while it is in the pending state.

RB=1661028
G=helix-reviewers
A=lxia

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoBug fix: reuse the stable logic to verify the difference between idealStates and...
Yi Wang [Fri, 3 May 2019 23:03:37 +0000 (16:03 -0700)] 
Bug fix: reuse the stable logic to verify the difference between idealStates and externalViews

RB=1654700
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAvoid lock the cache object when require a FullRefresh.
Jiajun Wang [Tue, 30 Apr 2019 22:18:21 +0000 (15:18 -0700)] 
Avoid lock the cache object when require a FullRefresh.

The old synchronization logic prevents requesting a full refresh while a refresh is in progress. This may lead to slow callback handling. In this change, we remove the original synchronization. The current cache update logic is able to handle gradually refreshed data, so there is no need to lock the full refresh request.

RB=1652941
BUG=HELIX-1851,gcn-29329
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoIntegrate customRestClient health check with instance service main logic
Yi Wang [Fri, 3 May 2019 00:45:08 +0000 (17:45 -0700)] 
Integrate customRestClient health check with instance service main logic

RB=1645567
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoimplementation of CustomRestClient (post request and get health checks)
Yi Wang [Sat, 20 Apr 2019 00:27:19 +0000 (17:27 -0700)] 
implementation of CustomRestClient (post request and get health checks)

RB=1638858
G=helix-reviewers
R=cjerian
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix the log logic in HelixManager.isLeader().
Jiajun Wang [Thu, 25 Apr 2019 21:03:21 +0000 (14:03 -0700)] 
Fix the log logic in HelixManager.isLeader().

The output is not correct when isLeader() is false.

RB=1645218
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSupport partition level health mapping fetch from ZK
Junkai Xue [Thu, 11 Apr 2019 21:38:45 +0000 (14:38 -0700)] 
Support partition level health mapping fetch from ZK

Partition level health status is queried differently from per instance health status. Helix will try to get data from ZK under the HEALTH_REPORT folder first. If the data is expired (checked with the EXPIRE entry), Helix will directly call the participant's API to get the latest data.

Otherwise, we shall assume the customized check has failed.

RB=1628988
BUG=HELIX-1785
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix the public API non-backward compatible change
Yi Wang [Tue, 23 Apr 2019 18:57:36 +0000 (11:57 -0700)] 
Fix the public API non-backward compatible change

RB=1641513
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix String formatting issue
Hunter Lee [Tue, 16 Apr 2019 20:21:33 +0000 (13:21 -0700)] 
TASK: Fix String formatting issue

For integers, you must use %d, not %f.
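
A short sketch of the difference: `%d` formats an integer, while `%f` applied to an integer argument makes `String.format` throw `IllegalFormatConversionException`. The class and method names are illustrative.

```java
import java.util.IllegalFormatConversionException;

final class FormatSpecifierSketch {
  // Correct: %d is the conversion for integral types.
  static String formatCount(int count) {
    return String.format("count=%d", count);
  }

  // Returns true when %f rejects an integer argument at runtime.
  static boolean floatSpecifierRejectsInt(int count) {
    try {
      String.format("count=%f", count);
      return false;
    } catch (IllegalFormatConversionException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(formatCount(3));
    System.out.println("%f rejects int: " + floatSpecifierRejectsInt(3));
  }
}
```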

RB=1633161
BUG=HELIX-1794
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoMore unit tests for InstanceValidationUtil
Yi Wang [Wed, 3 Apr 2019 21:36:24 +0000 (14:36 -0700)] 
More unit tests for InstanceValidationUtil

RB=1617333
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd util for checking per instance level health and partition level health
Junkai Xue [Thu, 11 Apr 2019 00:18:30 +0000 (17:18 -0700)] 
Add util for checking per instance level health and partition level health

The customized health check includes a user-customized per instance check, which is isolated from other instances.

In addition to the per instance level check, the partition level check should have a complete scope across the instances that hold sibling partitions. The partition check guarantees that when the instance being checked is shut down, there are healthy replicas to hold the top state.

RB=1627813
BUG=HELIX-1776
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix TestRecurringJobQueue
Hunter Lee [Wed, 10 Apr 2019 23:58:34 +0000 (16:58 -0700)] 
Fix TestRecurringJobQueue

This diff fixes TestRecurringJobQueue's testDeletingRecurrentQueueWithHistory

RB=1627625
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix bug in delete()
Hunter Lee [Wed, 10 Apr 2019 23:56:11 +0000 (16:56 -0700)] 
TASK: Fix bug in delete()

The delete() call was doing a force delete on workflows created from a recurrent workflow. This would cause a race condition between the controller cache and the deletion. This diff fixes this.
Changelist:
1. Fix the logic in delete()

RB=1627615
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoIntermediateStateCalcStage style change
Hunter Lee [Fri, 29 Mar 2019 19:13:48 +0000 (12:13 -0700)] 
IntermediateStateCalcStage style change

This diff includes code style fixes and refactor using Java 8 features.

RB=1613452
BUG=HELIX-1742
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTask Framework code style change
Hunter Lee [Fri, 29 Mar 2019 19:08:07 +0000 (12:08 -0700)] 
Task Framework code style change

This diff includes style changes using Java 8 features.

RB=1613441
BUG=HELIX-1742
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix tests in Helix REST
Junkai Xue [Wed, 10 Apr 2019 20:47:19 +0000 (13:47 -0700)] 
Fix tests in Helix REST

RB=1627064
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Add deleteJob namespaced job name support
Hunter Lee [Tue, 9 Apr 2019 04:44:40 +0000 (21:44 -0700)] 
TASK: Add deleteJob namespaced job name support

Currently, deleting jobs from JobQueues only supports denamespaced job names. This makes it impossible for users to list all jobs and delete them, because they sometimes cannot recover the denamespaced names.
Changelist:
1. Add support for namespaced job names for deletion

RB=1624395
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTASK: Fix bug in getExpiredJobs()
Hunter Lee [Tue, 9 Apr 2019 04:08:37 +0000 (21:08 -0700)] 
TASK: Fix bug in getExpiredJobs()

getExpiredJobs() had a bug where if the job has the same expiry time as workflow's default expiry, it would always override it with Workflow's expiry config. This is not correct.
Changelist:
1. Remove a block of code where it overrides expiry config with WorkflowConfig's default expiry

RB=1624376
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix faulty logic in BestPossibleExternalViewVerifier
Hunter Lee [Thu, 4 Apr 2019 00:12:56 +0000 (17:12 -0700)] 
Fix faulty logic in BestPossibleExternalViewVerifier

removeEntryWithIgnoredStates() was not really doing what it was supposed to do. This diff fixes this.
Also, a small delay was added to make TestDrop more stable.

RB=1619153
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoTEST: Fix UserContentStore related tests in helix-rest
Hunter Lee [Wed, 3 Apr 2019 21:34:29 +0000 (14:34 -0700)] 
TEST: Fix UserContentStore related tests in helix-rest

The behavior changed such that if the client-side code does not find the UserContent ZNode, it creates one instead of throwing an NPE. This diff fixes the tests so that they adapt to the new behavior. This behavior should be reverted eventually because the UserContent ZNode should be created only by the Controller.

RB=1618685
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoCheck sibling nodes to guarantee MIN_ACTIVE_REPLICAS satisfied
Yi Wang [Sat, 30 Mar 2019 00:28:07 +0000 (17:28 -0700)] 
Check sibling nodes to guarantee MIN_ACTIVE_REPLICAS satisfied

RB=1614128
G=helix-reviewers
A=jxue,hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix test failures and fix logic check stable state
Junkai Xue [Mon, 1 Apr 2019 23:56:03 +0000 (16:56 -0700)] 
Fix test failures and fix logic check stable state

Fix test failures:
1. Add logic to skip task framework idealstates
2. fix logic for test failure.

RB=1615941
BUG=HELIX-1725
G=helix-reviewers
A=lxia

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix unit test by starting REST server only once.
Lei Xia [Mon, 1 Apr 2019 22:55:05 +0000 (15:55 -0700)] 
Fix unit test by starting REST server only once.

RB=1615610
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSwallow exceptions during health status checks for getting instance by id
Yi Wang [Mon, 1 Apr 2019 22:54:21 +0000 (15:54 -0700)] 
Swallow exceptions during health status checks for getting instance by id

RB=1615554
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRefactor InstanceAccessor to InstancesAccessor and PerInstanceAccessor
Junkai Xue [Fri, 29 Mar 2019 23:40:51 +0000 (16:40 -0700)] 
Refactor InstanceAccessor to InstancesAccessor and PerInstanceAccessor

RB=1614063
BUG=HELIX-1725
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoGlobal instance stoppable API
Junkai Xue [Fri, 29 Mar 2019 18:11:35 +0000 (11:11 -0700)] 
Global instance stoppable API

This API takes a list of instances as input and returns the stoppable ones. The checks performed here are:

1. Single-instance stoppable check for each instance.
2. Whether shutting down the instances would cause the number of replicas to drop below the minimum active number.
For the first phase, we do not implement instance-based selection.

Here we added an integration test:
1. Test instances that are disabled, have a disabled partition, are not in the same zone, or are not alive.
2. Disable one stoppable instance; the check fails.
3. Re-enable the instance and remove the disabled partition; the check for that instance passes again.

Several places use TreeSet and TreeMap for sets and maps because we would like to guarantee that the output is consistent. We do see different ordering between Java 7 and Java 8.
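
As a rough illustration of the ordering point, the sketch below groups instances by zone using `TreeMap` and `TreeSet`, so the result iterates in sorted order regardless of insertion order or JVM version. The class name, method name, and sample host and zone names are hypothetical and not taken from the Helix code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical demo class; not part of Helix.
public class SortedOutputDemo {
    // Groups instance names by zone. Each pair is {instanceName, zoneName}.
    // TreeMap and TreeSet iterate in natural (sorted) key order, so the
    // output does not depend on hash iteration order.
    static Map<String, Set<String>> groupByZone(List<String[]> instanceZonePairs) {
        Map<String, Set<String>> byZone = new TreeMap<>();
        for (String[] pair : instanceZonePairs) {
            byZone.computeIfAbsent(pair[1], zone -> new TreeSet<>()).add(pair[0]);
        }
        return byZone;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"host3", "zone1"});
        pairs.add(new String[] {"host2", "zone2"});
        pairs.add(new String[] {"host1", "zone1"});
        // prints {zone1=[host1, host3], zone2=[host2]}
        System.out.println(groupByZone(pairs));
    }
}
```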

RB=1596424
BUG=HELIX-1680
G=helix-reviewers
A=lxia

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoReport instance started & health status when getting by id
Yi Wang [Wed, 27 Mar 2019 00:53:19 +0000 (17:53 -0700)] 
Report instance started & health status when getting by id

RB=1609426
BUG=helix-1732
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoRename instance health check enum to be more explicit
Yi Wang [Thu, 28 Mar 2019 23:33:01 +0000 (16:33 -0700)] 
Rename instance health check enum to be more explicit

RB=1612544
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoSingle stoppable API impl
Yi Wang [Wed, 20 Mar 2019 23:46:22 +0000 (16:46 -0700)] 
Single stoppable API impl

RB=1603158
G=helix-reviewers
A=jxue,hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoImplementation of ClusterService's getClusterTopology method
Yi Wang [Tue, 19 Mar 2019 21:16:53 +0000 (14:16 -0700)] 
Implementation of ClusterService's getClusterTopology method

RB=1601257
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoInterface design for zone mapping information
Yi Wang [Tue, 19 Mar 2019 21:16:16 +0000 (14:16 -0700)] 
Interface design for zone mapping information

RB=1578905
BUG=helix-1646
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoFix node swap test.
Jiajun Wang [Wed, 14 Nov 2018 00:59:18 +0000 (16:59 -0800)] 
Fix node swap test.

Add sleep to stabilize the test. Several cluster operations require a controller reaction before checking.

RB=1484466
G=helix-reviewers
A=hrzhang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoAdd Util check instance is already in stable state
Junkai Xue [Fri, 15 Mar 2019 23:35:01 +0000 (16:35 -0700)] 
Add Util check instance is already in stable state

We have two choices for checking whether an instance is in a stable state:
1. Compare IdealState with ExternalView.
2. Compare IdealState with CurrentState.

We chose IS vs. EV because:
1. We only have a simple cache in REST, so reading current state would still cause multiple reads of current state from different hosts, whereas EV can be shared across hosts.
2. EV is what the router bases its decisions on, which makes it the source of truth in a real production environment. So EV is the final choice.
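
A minimal sketch of the IS vs. EV comparison, under stated assumptions: both views are modeled as plain nested maps (partition name to instance name to state) rather than Helix's own IdealState and ExternalView classes, and the class and method names are hypothetical. It is not the utility added by this commit.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the maps stand in for IdealState and ExternalView.
public class StableStateSketch {
    // Returns true if, for every partition the ideal state assigns to the
    // instance, the external view reports the same state for that instance.
    static boolean isInstanceStable(String instance,
            Map<String, Map<String, String>> idealState,
            Map<String, Map<String, String>> externalView) {
        for (Map.Entry<String, Map<String, String>> entry : idealState.entrySet()) {
            String expected = entry.getValue().get(instance);
            if (expected == null) {
                continue; // partition is not assigned to this instance
            }
            Map<String, String> actualStates =
                externalView.getOrDefault(entry.getKey(), Collections.emptyMap());
            if (!expected.equals(actualStates.get(instance))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> idealState = new HashMap<>();
        idealState.put("db_0", Collections.singletonMap("host1", "MASTER"));
        Map<String, Map<String, String>> externalView = new HashMap<>();
        externalView.put("db_0", Collections.singletonMap("host1", "SLAVE"));
        System.out.println(isInstanceStable("host1", idealState, externalView));
    }
}
```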

RB=1598176
BUG=HELIX-1676
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years agoDummy check for customized API
Junkai Xue [Tue, 12 Mar 2019 19:40:41 +0000 (12:40 -0700)] 
Dummy check for customized API

This change builds the dummy check for the customized API. It contains the following changes:
1. RESTConfig can set up the customized URL.
2. Define the endpoints for per-participant and per-partition checks.
3. Add dummy logic that returns true for all the check statuses of customized checks.

RB=1596427
BUG=HELIX-1678
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>