helix.git
2 years ago Update website with 0.9.4 helix-0.9.4-release
Hunter Lee [Thu, 23 Jan 2020 20:13:12 +0000 (12:13 -0800)] 
Update website with 0.9.4

2 years ago [maven-release-plugin] prepare for next development iteration
Hunter Lee [Wed, 22 Jan 2020 02:54:37 +0000 (18:54 -0800)] 
[maven-release-plugin] prepare for next development iteration

2 years ago [maven-release-plugin] prepare release helix-0.9.4 helix-0.9.4
Hunter Lee [Wed, 22 Jan 2020 02:54:26 +0000 (18:54 -0800)] 
[maven-release-plugin] prepare release helix-0.9.4

2 years ago Enable helix-front for release
Hunter Lee [Wed, 22 Jan 2020 02:31:51 +0000 (18:31 -0800)] 
Enable helix-front for release

2 years ago Revert "Deep copy for mapFields and listFields in ZNRecord's copy constructor. (...
Huizhi Lu [Mon, 4 Nov 2019 22:12:12 +0000 (14:12 -0800)] 
Revert "Deep copy for mapFields and listFields in ZNRecord's copy constructor. (#552)"

This reverts commit 2a335cf73ac65b53fd2b06c6b1ee8c70553d30b1.

2 years ago Deep copy for mapFields and listFields in ZNRecord's copy constructor. (#552)
Huizhi L [Thu, 31 Oct 2019 18:03:27 +0000 (11:03 -0700)] 
Deep copy for mapFields and listFields in ZNRecord's copy constructor. (#552)

Deep copy for mapFields and listFields in ZNRecord's copy constructor.
Change list:
1. deep copy for mapFields and listFields in ZNRecord's copy constructor.
2. add unit test for the deep copy constructor.
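
To illustrate the change list above, here is a minimal sketch of a deep-copying copy constructor. RecordSketch is a hypothetical stand-in and not Helix's ZNRecord; only the field names mapFields and listFields come from the commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a ZNRecord-like class; not the Helix implementation.
class RecordSketch {
  final Map<String, Map<String, String>> mapFields = new TreeMap<>();
  final Map<String, List<String>> listFields = new TreeMap<>();

  RecordSketch() {}

  // Copy constructor: each nested container is copied, so the new record
  // shares no mutable map or list with the source record.
  RecordSketch(RecordSketch other) {
    for (Map.Entry<String, Map<String, String>> e : other.mapFields.entrySet()) {
      mapFields.put(e.getKey(), new TreeMap<>(e.getValue()));
    }
    for (Map.Entry<String, List<String>> e : other.listFields.entrySet()) {
      listFields.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
  }
}
```

With a shallow copy, adding an element to a list in the copy would also change the source record; with the sketch above the two stay independent.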

2 years ago Fix null response for instance stoppable check when connection refused. (#504)
Huizhi L [Wed, 23 Oct 2019 06:29:09 +0000 (23:29 -0700)] 
Fix null response for instance stoppable check when connection refused. (#504)

Issue: The instance stoppable check endpoint /clusters//instances//stoppable returns null when the connection between the Helix REST server and the storage node is refused.
This diff fixes this by returning a StoppableCheck object when the connection is refused.

2 years ago Add back the original DataPropagationLatencyGuage (with a typo) and mark it deprecate...
Huizhi L [Wed, 23 Oct 2019 02:54:05 +0000 (19:54 -0700)] 
Add back the original DataPropagationLatencyGuage (with a typo) and mark it deprecated (#517)

If we remove the misspelled name, DataPropagationLatencyGuage, current metrics graphs may no longer see the old metric, and historical DataPropagationLatencyGuage data might be lost. To preserve backward compatibility, add back DataPropagationLatencyGuage and mark it as deprecated.

2 years ago Add a null check for StateModel in Participant reset logic (#523)
Hunter Lee [Tue, 22 Oct 2019 18:29:29 +0000 (11:29 -0700)] 
Add a null check for StateModel in Participant reset logic (#523)

It was discovered that sometimes during shutdown/disconnect, this reset() gets called, and due to the partition having been dropped right around the same time, we get an NPE on the state model. Added a null check.

2 years ago Fix name typo for DataPropagationLatencyGuage. (#513)
Huizhi L [Fri, 18 Oct 2019 00:15:59 +0000 (17:15 -0700)] 
Fix name typo for DataPropagationLatencyGuage. (#513)

The name DataPropagationLatencyGuage contains a typo, which is confusing for Helix clients and products. Fix the typo to make the name clearer.

Change list:

Rename DataPropagationLatencyGuage to DataPropagationLatencyGauge.
Replace hard-coded names DataPropagationLatencyGuage with enum name DataPropagationLatencyGuage in unit tests.

2 years ago Add import order for java and javax. (#499)
Huizhi L [Thu, 10 Oct 2019 21:29:34 +0000 (14:29 -0700)] 
Add import order for java and javax. (#499)

* Add import order for java and javax.
* Remove the empty line between java and javax.

2 years ago Add unit test for setting application name.
Huizhi Lu [Mon, 30 Sep 2019 17:15:09 +0000 (10:15 -0700)] 
Add unit test for setting application name.

2 years ago #493 Set jersey servlet application name with namespace name.
Huizhi Lu [Sat, 28 Sep 2019 03:53:13 +0000 (20:53 -0700)] 
#493 Set jersey servlet application name with namespace name.

2 years ago Make the Java Doc for API more clear
Junkai Xue [Mon, 23 Sep 2019 23:57:15 +0000 (16:57 -0700)] 
Make the Java Doc for API more clear

Some users were confused about the inputs based on the Javadoc. Make it clearer for users.

2 years ago Add Intellij code style XML file for Helix code style. (#481)
pkuwm [Tue, 24 Sep 2019 21:46:28 +0000 (14:46 -0700)] 
Add Intellij code style XML file for Helix code style. (#481)

Add Intellij code style XML file so we can import it into Intellij to configure java code style for Helix.

2 years ago Change the way Helix triggers rebalance (#472)
Ali Reza Zamani Zadeh Najari [Tue, 24 Sep 2019 17:28:24 +0000 (10:28 -0700)] 
Change the way Helix triggers rebalance (#472)

A method is added which generates OnDemandRebalance event.
This event causes the controller to run the rebalance for both of its pipelines.
"Touch" logic (as in directly reading and writing to ZNodes) has been removed and replaced by this new method.

2 years ago Filter instances of weight = 0 for any partition assignment (#369)
Yi Wang [Wed, 18 Sep 2019 21:25:46 +0000 (14:25 -0700)] 
Filter instances of weight = 0 for any partition assignment (#369)

1. Fix the available space calculation issue in card dealing algorithm
2. Remove the instances of weight 0 from AbstractEvenDistributionRebalanceStrategy.java#computePartitionAssignment's input parameters
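
The filtering step in item 2 can be sketched as follows. This is only an illustration under assumed names: WeightFilterSketch and instancesWithCapacity are hypothetical and do not appear in Helix, and the real change lives in the rebalance strategy's input handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the names are illustrative and not taken from Helix.
class WeightFilterSketch {
  // Returns the instances whose weight is strictly positive,
  // in the iteration order of the input map.
  static List<String> instancesWithCapacity(Map<String, Integer> instanceWeights) {
    List<String> eligible = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : instanceWeights.entrySet()) {
      if (entry.getValue() > 0) {
        eligible.add(entry.getKey());
      }
    }
    return eligible;
  }
}
```

Filtering before the assignment computation means a weight-0 instance can never be handed a partition, whatever the algorithm downstream does.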

2 years ago [helix-rest] Delete unused default namespace (api "/namespaces/default") (#449)
pkuwm [Tue, 17 Sep 2019 21:33:37 +0000 (14:33 -0700)] 
[helix-rest] Delete unused default namespace (api "/namespaces/default") (#449)

We have a namespace API: /admin/v2/namespaces/{namespace}/. However, the /namespaces/default path is not in use, so we need to delete it. On the code level,
if there is no default namespace, we won't create a DEFAULT_SERVLET.
On the app level, we can configure the app not to add a namespace named "default".
With this change, the endpoint /admin/v2/namespaces/ will be disabled if no namespace
sets IS_DEFAULT to true.

2 years ago Fix CustomRebalancer's assignment computation (#477)
Hunter Lee [Mon, 16 Sep 2019 21:40:20 +0000 (14:40 -0700)] 
Fix CustomRebalancer's assignment computation (#477)

It was observed that sometimes CustomRebalancer would leave out an instance entirely if an instance is disabled or the partition on the instance was still bootstrapping (current state is null). This would cause a cluster not to converge. This diff fixes this by 1) still including an assignment from IdealState even though the current state is null (maybe due to a pending state transition) 2) putting disabled partitions in InitialState.
Changelist:
1. Fix the issue
2. Add a test: TestCustomRebalancer

2 years ago Fix missed callbacks in CurrentStates based RoutingTableProvider. (#458)
pkuwm [Sat, 14 Sep 2019 00:08:17 +0000 (17:08 -0700)] 
Fix missed callbacks in CurrentStates based RoutingTableProvider. (#458)

1. Update BasicClusterDataCache to do refresh with selective update. Only when a change happens, we do the cache refresh only for that change type.
2. Improve RoutingTableProvider.queueEvent() and RoutingTableProvider.handleEvent(). Return instanceConfigs snapshot to callback immediately, instead of waiting for currentStates completion.

2 years ago Fix helix-front build failure by downgrading types/lodash version. (#470)
pkuwm [Fri, 13 Sep 2019 00:51:50 +0000 (17:51 -0700)] 
Fix helix-front build failure by downgrading types/lodash version. (#470)

Fix helix-front build failure by downgrading types/lodash version.

2 years ago Add field for MIN_ACTIVE_REPLICA_NOT_SET
Junkai Xue [Tue, 10 Sep 2019 22:42:42 +0000 (15:42 -0700)] 
Add field for MIN_ACTIVE_REPLICA_NOT_SET

2 years ago Make State Transition Throttling respect MIN_ACTIVE_REPLICA
Junkai Xue [Tue, 10 Sep 2019 00:14:57 +0000 (17:14 -0700)] 
Make State Transition Throttling respect MIN_ACTIVE_REPLICA

There are two phases for improving Helix state transition throttling:
1. Respect MIN_ACTIVE_REPLICA
2. Throttle per replica state transitions.

This commit contains the logic of respecting MIN_ACTIVE_REPLICA in IntermediateCalStage and state transition throttling.

2 years ago Fix the issue where JobContext is not updated properly (#435)
Ali Reza Zamani Zadeh Najari [Tue, 10 Sep 2019 18:47:34 +0000 (11:47 -0700)] 
Fix the issue where JobContext is not updated properly (#435)

1. A method has been added that extracts the "prevInstanceToTaskAssignments" information from the context.
2. If the current state is confirmed to be null, the context is set using the target state information.
3. An integration test has been added.

2 years ago Fix the order of workflow context update
Ali Reza Zamani Zadeh Najari [Wed, 4 Sep 2019 19:39:45 +0000 (12:39 -0700)] 
Fix the order of workflow context update

* Fix the order of workflow context update

In this commit:
The order that workflow dispatcher updates the workflow status has been changed.
If execution delay is set and job is inflight, the context will get updated.
An integration test has been added.

* minor fixes

2 years ago TASK: Fix forceDelete for jobs in JobQueue
Hunter Lee [Wed, 4 Sep 2019 19:26:10 +0000 (12:26 -0700)] 
TASK: Fix forceDelete for jobs in JobQueue

We observed that force delete does not work while the job is running: it fails, reporting that the job is currently running. Force delete should go through regardless of the current job status.
Changelist:
1. Change the semantics in deleteJobFromQueue
2. Add an integration test: TestDeleteJobFromJobQueue

2 years ago remove all unused imports
leesf [Thu, 8 Aug 2019 07:46:02 +0000 (15:46 +0800)] 
remove all unused imports

2 years ago Add integration test for workflow ForceDelete
Ali Reza Zamani Zadeh Najari [Fri, 30 Aug 2019 16:56:39 +0000 (09:56 -0700)] 
Add integration test for workflow ForceDelete

This commit adds integration tests that check the functionality of ForceDelete.
It also adds a comment to ForceDelete that discourages users from using it.
Several workflow states have been considered and checked for ForceDelete.

2 years ago TASK: Fix incorrect counting of numAttempts for tasks (#432)
Hunter Lee [Mon, 26 Aug 2019 20:48:21 +0000 (13:48 -0700)] 
TASK: Fix incorrect counting of numAttempts for tasks (#432)

TASK: Fix incorrect counting of numAttempts for tasks

It was discovered that sometimes the tasks' NUM_ATTEMPTS field in JobContext was getting incremented even without the tasks being retried. This was because the numAttempts field was getting incremented in other (incorrect) places than at scheduling time. The logic for incrementing the number of attempts has been moved to the schedule logic in this diff.
Changelist:
1. Modify tests so that they test for numAttempts more tightly
2. Fix the incrementation logic
3. Add a new integration test: TestTaskNumAttempts
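
The idea of incrementing the attempt counter only at scheduling time can be sketched like this. TaskAttemptSketch is a hypothetical model, not the Helix JobContext API; the state names are used only as plain strings.

```java
// Hypothetical model of a task's attempt counter; not the Helix JobContext API.
class TaskAttemptSketch {
  private int numAttempts = 0;
  private String state = "INIT";

  // The only place where the attempt counter changes: when the task is scheduled.
  void schedule() {
    numAttempts++;
    state = "RUNNING";
  }

  // State updates (for example, a status report) leave the counter alone.
  void updateState(String newState) {
    state = newState;
  }

  int getNumAttempts() {
    return numAttempts;
  }

  String getState() {
    return state;
  }
}
```

Keeping the increment in a single method makes the counter equal to the number of times the task was actually scheduled, which is what a tighter test can assert.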

2 years ago Make the reservoir sliding window length used in Helix monitor metrics configurable...
chenboat [Thu, 22 Aug 2019 04:43:01 +0000 (21:43 -0700)] 
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382

2 years ago Make the reservoir sliding window length used in Helix monitor metrics configurable...
chenboat [Wed, 21 Aug 2019 04:21:03 +0000 (21:21 -0700)] 
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382

2 years ago Make the reservoir sliding window length used in Helix monitor metrics configurable...
chenboat [Tue, 20 Aug 2019 06:11:43 +0000 (23:11 -0700)] 
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382

2 years ago Make the reservoir sliding window length used in Helix monitor metrics configurable...
chenboat [Tue, 20 Aug 2019 04:59:59 +0000 (21:59 -0700)] 
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382

2 years ago Fix a typo. #382
chenboat [Sun, 18 Aug 2019 02:01:40 +0000 (19:01 -0700)] 
Fix a typo. #382

2 years ago Fix a typo. #382
chenboat [Wed, 14 Aug 2019 06:22:33 +0000 (23:22 -0700)] 
Fix a typo. #382

2 years ago Add a unit test case.. #382
chenboat [Wed, 14 Aug 2019 06:20:59 +0000 (23:20 -0700)] 
Add a unit test case.. #382

2 years ago Use the system property value as the sliding window length. #382
chenboat [Tue, 13 Aug 2019 05:55:22 +0000 (22:55 -0700)] 
Use the system property value as the sliding window length. #382

2 years ago Use the system property value as the sliding window length. #382
chenboat [Tue, 13 Aug 2019 05:47:54 +0000 (22:47 -0700)] 
Use the system property value as the sliding window length. #382

2 years ago Use the system property value as the sliding window length. #382
chenboat [Fri, 9 Aug 2019 07:06:11 +0000 (00:06 -0700)] 
Use the system property value as the sliding window length. #382
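
A minimal sketch of reading a window length from a system property with a fallback is shown below. The property name and the default value are assumptions made up for this example; the commit messages do not state the ones Helix uses.

```java
// Hypothetical sketch; the property name and default are made up for illustration.
class SlidingWindowConfigSketch {
  static final String WINDOW_LENGTH_PROPERTY = "example.metrics.slidingWindowMs";
  static final long DEFAULT_WINDOW_LENGTH_MS = 3_600_000L; // assumed default: one hour

  // Reads the window length from the system property. Falls back to the default
  // when the property is missing, not a number, or not positive.
  static long slidingWindowLengthMs() {
    long value = Long.getLong(WINDOW_LENGTH_PROPERTY, DEFAULT_WINDOW_LENGTH_MS);
    return value > 0 ? value : DEFAULT_WINDOW_LENGTH_MS;
  }
}
```

With this sketch, starting the JVM with -Dexample.metrics.slidingWindowMs=60000 would shorten the window to one minute without a code change.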

2 years ago Fix the execution delay for the jobs
Ali Reza Zamani Zadeh Najari [Wed, 14 Aug 2019 15:57:49 +0000 (08:57 -0700)] 
Fix the execution delay for the jobs

In the Task Framework part of Helix, the execution delay for jobs is not respected.
In this commit, when a job is extracted from the inflighJobs queue, the timeline is checked before scheduling.

2 years ago Move partition health check method into dataAccessor layer
Yi Wang [Thu, 15 Aug 2019 18:53:26 +0000 (11:53 -0700)] 
Move partition health check method into dataAccessor layer

2 years ago Update menu bar.
Jiajun Wang [Mon, 19 Aug 2019 21:00:31 +0000 (14:00 -0700)] 
Update menu bar.

2 years ago Fix typo for process name
Junkai Xue [Mon, 19 Aug 2019 18:15:05 +0000 (11:15 -0700)] 
Fix typo for process name

2 years ago Revert "Add ChangeDetector interface and ResourceChangeDetector implementation (...
Hunter Lee [Thu, 15 Aug 2019 21:46:31 +0000 (14:46 -0700)] 
Revert "Add ChangeDetector interface and ResourceChangeDetector implementation (#388)"

This reverts commit e0c1c66dd6ed9a01955927ea1828fabcf59eeaad.

2 years ago Add ChangeDetector interface and ResourceChangeDetector implementation (#388)
Hunter Lee [Thu, 15 Aug 2019 21:33:02 +0000 (14:33 -0700)] 
Add ChangeDetector interface and ResourceChangeDetector implementation (#388)

Add ChangeDetector interface and ResourceChangeDetector implementation

In order to efficiently react to changes happening to the cluster in the new WAGED rebalancer, a new component called ChangeDetector was added.

Changelist:
1. Add ChangeDetector interface
2. Implement ResourceChangeDetector
3. Add ResourceChangeCache, a wrapper for critical cluster metadata
4. Add an integration test, TestResourceChangeDetector

2 years ago Fix issue when client only sets ANY at cluster level throttle config
Yi Wang [Fri, 9 Aug 2019 00:34:47 +0000 (17:34 -0700)] 
Fix issue when client only sets ANY at cluster level throttle config

fixes #332
Added unit test for StateTransitionThrottleController
Added integration test to verify the case when only the cluster-level ANY throttle is set to 1

2 years ago Fix ZNode does not exist in HealthCheck
Junkai Xue [Wed, 14 Aug 2019 18:23:11 +0000 (11:23 -0700)] 
Fix ZNode does not exist in HealthCheck

If the ZNode of PartitionHealth does not exist, REST will return failed checks due to an NPE. The fix is to add the instance to the set of instances to be refreshed entirely, so that REST can run its checks against the refreshed API result.

2 years ago Revert "Reenable helix-front module for official release." (#406)
Jiajun Wang [Wed, 14 Aug 2019 20:58:22 +0000 (13:58 -0700)] 
Revert "Reenable helix-front module for official release." (#406)

This reverts commit 3c3db0bf797cbc1e0c1aec59395c8632ed6455db.

2 years ago Release note for 0.9.1.
jiajunwang [Tue, 13 Aug 2019 23:10:56 +0000 (16:10 -0700)] 
Release note for 0.9.1.

2 years ago Bump up the snapshot version.
jiajunwang [Tue, 13 Aug 2019 23:54:12 +0000 (16:54 -0700)] 
Bump up the snapshot version.

Also fix the missing helix-agent snapshot update logic in bump-up.command.

2 years ago Merge with the latest optimization on batch get zookeeper properties
Yi Wang [Thu, 8 Aug 2019 19:11:19 +0000 (12:11 -0700)] 
Merge ... the latest optimization on batch get zookeeper properties

2 years ago Add InstanceServiceImpl#batchGetInstancesStoppableChecks to solve performance issue...
Yi Wang [Tue, 6 Aug 2019 01:33:38 +0000 (18:33 -0700)] 
Add InstanceServiceImpl#batchGetInstancesStoppableChecks to solve performance issue #366

2 years ago [maven-release-plugin] prepare for next development iteration
Jiajun Wang [Tue, 13 Aug 2019 18:30:45 +0000 (11:30 -0700)] 
[maven-release-plugin] prepare for next development iteration

2 years ago [maven-release-plugin] prepare release helix-0.9.1 helix-0.9.1
Jiajun Wang [Tue, 13 Aug 2019 18:30:35 +0000 (11:30 -0700)] 
[maven-release-plugin] prepare release helix-0.9.1

2 years ago Reenable helix-front module for official release.
Jiajun Wang [Mon, 12 Aug 2019 20:57:21 +0000 (13:57 -0700)] 
Reenable helix-front module for official release.

2 years ago Revert "[maven-release-plugin] prepare release helix-0.9.1"
Jiajun Wang [Mon, 12 Aug 2019 20:55:40 +0000 (13:55 -0700)] 
Revert "[maven-release-plugin] prepare release helix-0.9.1"

This reverts commit c7e8e6366f6e5360d416e2fd1867252ebdcd7242.

2 years ago Revert "[maven-release-plugin] prepare for next development iteration"
Jiajun Wang [Mon, 12 Aug 2019 20:55:35 +0000 (13:55 -0700)] 
Revert "[maven-release-plugin] prepare for next development iteration"

This reverts commit f2746c823193991a0dd6152827b7344d66226368.

2 years ago [maven-release-plugin] prepare for next development iteration
Jiajun Wang [Mon, 12 Aug 2019 20:36:09 +0000 (13:36 -0700)] 
[maven-release-plugin] prepare for next development iteration

2 years ago [maven-release-plugin] prepare release helix-0.9.1
Jiajun Wang [Mon, 12 Aug 2019 20:35:57 +0000 (13:35 -0700)] 
[maven-release-plugin] prepare release helix-0.9.1

2 years ago Fix the CallbackHandler registration logic in DistributedLeaderElection (#395)
Jiajun Wang [Mon, 12 Aug 2019 17:58:21 +0000 (10:58 -0700)] 
Fix the CallbackHandler registration logic in DistributedLeaderElection (#395)

* Fix the CallbackHandler registration logic in DistributedLeaderElection that may leave a leader node with no callback registered.

Our current initialization logic assumes a strict leader acquire/relinquish events sequence. However, due to the possible carried over ZK events from the previous ZK session, the controller node change event might be triggered in the following sequence:
1. CALLBACK (from the previous session): Create new leader node and add handlers.
2. FINALIZE (Handle the previous session expire): Clean up handlers.
3. INIT (For the new session establishment): Expect to add the handlers back again.
As a result, if the INIT event processing does not recover the handlers, the leader controller won't be able to manage anything. This fix ensures that every acquireLeadership call tries to initialize the leader controller's callback handlers.

Also, add the additional test logic in TestHandleNewSession to verify the fix.

* Improve the leader history update logic so there is no duplicate entry recorded.
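
The CALLBACK, FINALIZE, INIT sequence described above can be modeled with a small sketch. LeaderElectionSketch is a hypothetical model, not the DistributedLeaderElection code: the event names come from the commit message, while the boolean flags stand in for the leader node and the registered handlers.

```java
// Hypothetical model of the event handling described above;
// not the Helix DistributedLeaderElection code.
class LeaderElectionSketch {
  private boolean leader = false;
  private boolean handlersRegistered = false;

  void onEvent(String type) {
    switch (type) {
      case "CALLBACK":
      case "INIT":
        acquireLeadership();
        break;
      case "FINALIZE":
        // Session-expiry cleanup removes the handlers. In this model the leader
        // node created by the carried-over CALLBACK is still in place.
        handlersRegistered = false;
        break;
      default:
        throw new IllegalArgumentException("Unknown event type: " + type);
    }
  }

  private void acquireLeadership() {
    if (!leader) {
      leader = true; // stands in for creating the leader node
    }
    // Always make sure the handlers exist, even if this node was already the leader.
    if (!handlersRegistered) {
      handlersRegistered = true;
    }
  }

  boolean isLeader() {
    return leader;
  }

  boolean hasHandlers() {
    return handlersRegistered;
  }
}
```

If the handler initialization were skipped whenever the node is already the leader, the INIT event in this sequence would leave a leader without handlers, which is the failure the commit describes.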

2 years ago TASK: Drop all tasks whose requested states are DROPPED
Hunter Lee [Fri, 9 Aug 2019 23:57:05 +0000 (16:57 -0700)] 
TASK: Drop all tasks whose requested states are DROPPED

Upon a Participant disconnect, the Participant would carry over from the last session. This would copy all previous task states to the current session and set their requested states as DROPPED (for INIT and RUNNING states).

It came to our attention that sometimes these Participants experience connection issues and the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This issue proposes to add the logic that would get all tasks whose requested states are DROPPED to be dropped immediately.
Changelist:
1. Make sure all tasks whose requested state is DROPPED get added to tasksToDrop
2. Add a unit test: TestDropTerminalTasksUponReset

2 years ago Improve ZK read with batch call
Junkai Xue [Mon, 5 Aug 2019 23:33:50 +0000 (16:33 -0700)] 
Improve ZK read with batch call

Currently, the HealthReport read is a single call for each participant. Improve it with a batch call to ZK to reduce the number of calls.

2 years ago Add reviews@helix.apache.org to mailing list
Junkai Xue [Tue, 6 Aug 2019 03:53:13 +0000 (20:53 -0700)] 
Add reviews@helix.apache.org to mailing list

2 years ago Stabilize the REST tests
Junkai Xue [Mon, 5 Aug 2019 23:25:03 +0000 (16:25 -0700)] 
Stabilize the REST tests

Stabilize the REST tests with the following changes:
1. Remove the temporary cluster, which impacts the ClusterAccessor test.
2. Add start/end messages to all tests for debugging purposes.
3. Disable the unstable monitoring test for default MBeans. Sometimes we can query them and sometimes not. It is not a critical test path. Let's make it stable later.

2 years ago Read ClusterConfig from ZK selectively
Hunter Lee [Tue, 6 Aug 2019 18:32:16 +0000 (11:32 -0700)] 
Read ClusterConfig from ZK selectively

Previously, ClusterConfig would be read from ZK on every pipeline run. This PR makes it a selective read and also adds ClusterConfig to the set of all changed types, so that the cluster change detector can more easily tell whether ClusterConfig changed without having to store two copies of ClusterConfig objects.

2 years ago Fix RoutingTableProvider statePropagationLatency metric reporting bug (#365)
kaisun2000 [Tue, 6 Aug 2019 18:58:16 +0000 (11:58 -0700)] 
Fix RoutingTableProvider statePropagationLatency metric reporting bug (#365)

Issue:

When updating the snapshot, CurrentStateCache would miss all the existing partitions that had a state change.

The RoutingTableProvider callback runs on the main event thread, and its time is not accounted for in the log.

Description:
Fix the bug by updating the snapshot with the correct reloadkeys.

Enhance the log to account for user callback code separately.

Tests:
mvn test passed.

2 years ago Dynamically change the processor thread name when consuming event
Yi Wang [Tue, 23 Jul 2019 23:27:59 +0000 (16:27 -0700)] 
Dynamically change the processor thread name when consuming event

2 years ago Remove DEFAULT_VIEW_CLUSTER_REFRESH_PERIOD from ClusterConfig
Hunter Lee [Mon, 5 Aug 2019 17:16:27 +0000 (10:16 -0700)] 
Remove DEFAULT_VIEW_CLUSTER_REFRESH_PERIOD from ClusterConfig

This is a constant that is no longer used.

2 years ago Remove .reviewboardrc from the open source repository
Hunter Lee [Mon, 5 Aug 2019 16:19:51 +0000 (09:19 -0700)] 
Remove .reviewboardrc from the open source repository

2 years ago Remove unnecessary touch logic that triggers pipeline
Ali Reza Zamani Zadeh Najari [Thu, 1 Aug 2019 17:54:28 +0000 (10:54 -0700)] 
Remove unnecessary touch logic that triggers pipeline

In the places where the ZooKeeper ResourceConfig is updated,
it is no longer necessary to use touch logic to run the pipeline again.
A ResourceConfig update automatically triggers the pipeline.

This commit fixes issue #370.

2 years ago Fix the race condition while Helix refreshes cluster status cache. (#363)
jiajunwang [Tue, 30 Jul 2019 21:41:32 +0000 (14:41 -0700)] 
Fix the race condition while Helix refreshes cluster status cache. (#363)

* Fix the race condition while Helix refreshes cluster status cache.

This change fixes issue #331.
The design ensures that there is only one read, to avoid locking during the change notification. However, a later update introduced an additional read. As a result, the two reads may return different results because the notification is lock free. This leaves the cache in an inconsistent state. The impact is that the expected rebalance might not happen.

2 years ago Remove TODO NPE log for computeResourceBestPossibleState
Ali Reza Zamani Zadeh Najari [Tue, 23 Jul 2019 22:15:47 +0000 (15:15 -0700)] 
Remove TODO NPE log for computeResourceBestPossibleState

The logs related to the NPE in computeResourceBestPossibleState are no longer needed.

This commit fixes issue #351.

2 years ago Read Failure while reading non-existent znode
Ali Najari [Wed, 17 Jul 2019 21:21:40 +0000 (14:21 -0700)] 
Read Failure while reading non-existent znode

In this commit, when a NoNodeException is encountered while reading data from a znode that does not exist, the NoNodeException is caught and readfailurecounter is not incremented.
Instead, the related information (read counter, read latency, etc.) is recorded.

This commit fixes issue #345.
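
The counting rule described in this commit can be sketched as follows. Everything here is hypothetical: ZnodeReadMonitorSketch, its fields, and the nested NoNodeException are stand-ins defined for the example, and a plain map plays the role of ZooKeeper.

```java
import java.util.Map;

// Hypothetical monitor; the class, field, and exception names are illustrative only.
class ZnodeReadMonitorSketch {
  static class NoNodeException extends RuntimeException {}

  int readCount = 0;
  int readFailureCount = 0;
  long totalReadLatencyNanos = 0;

  // Reads a path from an in-memory stand-in for ZooKeeper.
  // Returns null if the node does not exist.
  String read(Map<String, String> store, String path) {
    long start = System.nanoTime();
    try {
      if (path == null) {
        throw new IllegalArgumentException("path is null");
      }
      if (!store.containsKey(path)) {
        throw new NoNodeException();
      }
      return store.get(path);
    } catch (NoNodeException e) {
      // A missing node is an expected outcome, so it is not counted as a failure.
      return null;
    } catch (RuntimeException e) {
      readFailureCount++;
      throw e;
    } finally {
      // The read counter and latency are recorded for every attempt.
      readCount++;
      totalReadLatencyNanos += System.nanoTime() - start;
    }
  }
}
```

The catch order matters in this sketch: the specific NoNodeException handler comes first, so only other runtime failures reach the failure counter.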

2 years ago Implementation of stateModelDef modification in REST 2.0
Kai Sun [Wed, 17 Jul 2019 01:36:42 +0000 (18:36 -0700)] 
Implementation of stateModelDef modification in REST 2.0

Current implementation of Rest 2.0 does not support stateModelDef modification. Here, we will implement

delete -- remove the stateModelDef with the input id.

put -- create new statemodeldef if no existing one with same input id

set -- replace the content of node with input id

We also add the following test cases:

Test delete model one; expect success
Test delete model one again; expect success
Create the deleted model one; expect success
Create the deleted model one again; expect failure as the same model id exists
Set the model one with modified content; expect success
Read the model one; expect the content would be same as modified content
Set the model one to original content restore original state; expect success
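
The delete, put, and set semantics and the listed test cases can be mirrored with an in-memory sketch. StateModelDefStoreSketch is hypothetical and not the Helix REST code; in particular, letting set create a missing entry is an assumption, since the commit message only says that set replaces the content.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the three operations; not the Helix REST implementation.
class StateModelDefStoreSketch {
  private final Map<String, String> defs = new HashMap<>();

  // delete: remove the definition with the given id.
  // Deleting an id that is already gone still succeeds.
  boolean delete(String id) {
    defs.remove(id);
    return true;
  }

  // put: create a new definition only if none exists with the same id.
  boolean put(String id, String content) {
    return defs.putIfAbsent(id, content) == null;
  }

  // set: replace the content stored under the given id
  // (assumption: creates the entry if it is missing).
  boolean set(String id, String content) {
    defs.put(id, content);
    return true;
  }

  String get(String id) {
    return defs.get(id);
  }
}
```

Walking through the test cases above against this model: both deletes succeed, the first create succeeds, the second create fails because the id exists, and a read after set returns the modified content.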

2 years ago Change IllegalStateException to Helix Exception for CRUSH based rebalance strategy...
Ali Reza Zamani Zadeh Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)] 
Change IllegalStateException to Helix Exception for CRUSH based rebalance strategy algorithm

In this commit the IllegalStateException has been caught and HelixException has been thrown for the upper layer instead. The error log shows more meaningful exception.
A test has been changed accordingly.

This commit fixes issue #322.

2 years ago Fix test fail for TestRebalanceScheduler
Junkai Xue [Mon, 22 Jul 2019 22:25:01 +0000 (15:25 -0700)] 
Fix test fail for TestRebalanceScheduler

2 years ago Fix invoke rebalance by "touching" IdealState/ResourceConfig
Junkai Xue [Tue, 16 Jul 2019 01:32:57 +0000 (18:32 -0700)] 
Fix invoke rebalance by "touching" IdealState/ResourceConfig

The current HelixDataAccessor updateProperty uses ZNRecordUpdater. Its merge logic simply adds all elements when merging a ZNRecord, which can cause a lot of duplication in listFields.
This impacts invokeRebalanceForResourceConfig. The fix is to implement a customized updater.

In this commit:
1. Fix invoke rebalance with customized updater.
2. Add comments for ZNRecord merge.
3. Add checks in TaskUtil to only trigger Workflow Config "touch" when purge job.
4. Add a test for RebalanceScheduler.
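
The duplication problem can be illustrated with two update strategies for a single list field. This is a sketch under assumptions: ListFieldUpdateSketch and its methods are hypothetical, and the replace-style update shown here is only one possible behavior for a customized updater, not necessarily the one the commit implements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of two update strategies; not the ZNRecordUpdater code.
class ListFieldUpdateSketch {
  // Append-style merge: every element of the update is added to the current list.
  static List<String> appendMerge(List<String> current, List<String> update) {
    List<String> merged = new ArrayList<>(current);
    merged.addAll(update);
    return merged;
  }

  // Replace-style update: the new value overwrites the current one.
  static List<String> replace(List<String> current, List<String> update) {
    return new ArrayList<>(update);
  }
}
```

A "touch" writes a record back onto itself. With the append-style merge that doubles every list, while the replace-style update leaves the content unchanged.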

2 years ago IllegalStateException for CRUSH based rebalance strategy algorithm.
Ali Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)] 
IllegalStateException for CRUSH based rebalance strategy algorithm.

This commit fixes the error log and exception that are shown when there are not enough eligible instances to use.

2 years ago Exclude ANY_INSTANCE for customized sibling checks
Junkai Xue [Tue, 9 Jul 2019 23:38:10 +0000 (16:38 -0700)] 
Exclude ANY_INSTANCE for customized sibling checks

Current Helix HealthCheck API checks the ANY_INSTANCE resources, which is not necessary. Since ANY_INSTANCE resources only have single partition with 1 replica, there is no need to check sibling health status.

This commit fixes issue #328

2 years ago Disable helix-front build
Junkai Xue [Tue, 9 Jul 2019 23:47:58 +0000 (16:47 -0700)] 
Disable helix-front build

3 years ago Update all markdown files to 0.9.0
Hunter Lee [Wed, 12 Jun 2019 17:27:45 +0000 (10:27 -0700)] 
Update all markdown files to 0.9.0

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Update website with 0.9.0 with maven plugin version upgrades
Hunter Lee [Tue, 11 Jun 2019 19:01:49 +0000 (12:01 -0700)] 
Update website with 0.9.0 with maven plugin version upgrades

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Add Release Notes and Docs for 0.9.0 release
Hunter Lee [Tue, 11 Jun 2019 00:37:00 +0000 (17:37 -0700)] 
Add Release Notes and Docs for 0.9.0 release

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Upgrade ivy version for 0.9.0 release
Hunter Lee [Mon, 10 Jun 2019 23:57:51 +0000 (16:57 -0700)] 
Upgrade ivy version for 0.9.0 release

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago prepare for next development iteration
Hunter Lee [Mon, 3 Jun 2019 00:56:17 +0000 (17:56 -0700)] 
prepare for next development iteration

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago [maven-release-plugin] prepare release helix-0.9.0
Hunter Lee [Tue, 25 Jun 2019 22:08:22 +0000 (15:08 -0700)] 
[maven-release-plugin] prepare release helix-0.9.0

3 years ago Enable helix-front in pom.xml
Hunter Lee [Sun, 2 Jun 2019 23:55:24 +0000 (16:55 -0700)] 
Enable helix-front in pom.xml

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Merge differences with another branch
Hunter Lee [Tue, 25 Jun 2019 07:08:44 +0000 (00:08 -0700)] 
Merge differences with another branch

There are multiple branches against which Helix devs have been doing development work. We wish to consolidate them into one by reconciling all differences. This diff makes such changes. This diff does not contain any changes in logic or functionality.

3 years ago Fix looping with keySet and modifying keySet same time
Junkai Xue [Fri, 21 Jun 2019 18:58:13 +0000 (11:58 -0700)] 
Fix looping with keySet and modifying keySet same time

Looping over a keySet while modifying a keySet entry at the same time can cause a ConcurrentModificationException. Fix this by adding the entries to a separate new Set and removing them after the loop is done.
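
The collect-then-remove pattern named in this commit can be sketched as below. SafeRemovalSketch and the "remove entries with empty lists" condition are hypothetical choices for the example; the commit does not say which entries Helix removes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical example of removing map entries without modifying the map mid-iteration.
class SafeRemovalSketch {
  static void removeEmptyEntries(Map<String, List<String>> map) {
    // First pass: only read from the map and remember which keys to drop.
    Set<String> keysToRemove = new HashSet<>();
    for (String key : map.keySet()) {
      if (map.get(key).isEmpty()) {
        keysToRemove.add(key);
      }
    }
    // Second step: the loop is over, so removing through the key set is safe.
    map.keySet().removeAll(keysToRemove);
  }
}
```

Calling map.remove(key) inside the for-each loop over a HashMap's keySet is the pattern that can raise ConcurrentModificationException on the next iteration step.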

RB=1711266
G=helix-reviewers
A=jjwang,hulee,ksun

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Always try reading from EphemeralOwner state first while reading the session ID from...
Jiajun Wang [Mon, 17 Jun 2019 21:37:02 +0000 (14:37 -0700)] 
Always try reading from EphemeralOwner state first while reading the session ID from a live instance node.

This is to avoid an inconsistent session ID between the node content and the ephemeral owner state.
Note that in order to ensure backward compatibility and keep some test cases working, the newly introduced method will still read from the node content if the ephemeral owner state is empty (-1 or 0).
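
The read order described here can be sketched as a small helper. SessionIdSketch and its parameter names are hypothetical, and formatting the owner as a hexadecimal string is an assumption of this example, not something the commit message states.

```java
// Hypothetical helper; the names and the hex formatting are assumptions for illustration.
class SessionIdSketch {
  // Prefers the ephemeral owner from the node's state. Falls back to the session ID
  // stored in the node content when the owner is empty (-1 or 0).
  static String effectiveSessionId(long ephemeralOwner, String sessionIdFromContent) {
    if (ephemeralOwner != 0L && ephemeralOwner != -1L) {
      return Long.toHexString(ephemeralOwner);
    }
    return sessionIdFromContent;
  }
}
```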

RB=1704942
BUG=HELIX-1969
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Fix compute IdealState mapping tool
Junkai Xue [Tue, 18 Jun 2019 22:38:24 +0000 (15:38 -0700)] 
Fix compute IdealState mapping tool

There is a bug in the IdealState mapping tool: it does not filter out instances that are live but disabled. Add this logic and extra tests for it.

RB=1706106
BUG=HELIX-1974
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Enable default Jersey server metric reporting
Junkai Xue [Thu, 13 Jun 2019 01:39:50 +0000 (18:39 -0700)] 
Enable default Jersey server metric reporting

For monitoring Helix REST, we can support both REST server monitoring and customized logic monitoring.
In this RB, we enable the Jersey server monitoring metrics and add tests for them.

RB=1701238
BUG=HELIX-1963
G=helix-reviewers
A=ywang4

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Remove relay message from controller's message cache immediately if the partition...
Lei Xia [Wed, 5 Jun 2019 00:15:17 +0000 (17:15 -0700)] 
Remove relay message from controller's message cache immediately if the partition on relay host turned to ERROR state while transitioning off the top state.

RB=1689771
BUG=HELIX-1900
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Upgrade Apache rat version and add exclusion paths
Hunter Lee [Mon, 10 Jun 2019 22:29:58 +0000 (15:29 -0700)] 
Upgrade Apache rat version and add exclusion paths

Part of the Helix release process requires the rat plugin to perform checks. However, there are scripts and website files that do not require this kind of check. This diff adds exclusion paths to the pom.xml. Note that there are several Java files that still do not pass the checks because they lack the Apache license. This still needs to be fixed in the future.

RB=1695987
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Catch exception and log error when helix-admin-webapp fails to read data from certain...
Yi Wang [Wed, 24 Apr 2019 23:50:30 +0000 (16:50 -0700)] 
Catch exception and log error when helix-admin-webapp fails to read data from certain path

RB=1644066
BUG=https://jira01.corp.linkedin.com:8443/browse/EXC-114388
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Fix http request hanging issue to the SN API
Yi Wang [Mon, 3 Jun 2019 18:15:54 +0000 (11:15 -0700)] 
Fix http request hanging issue to the SN API

RB=1684758
G=helix-reviewers
A=jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Change output behavior for non-exist instances
Junkai Xue [Thu, 30 May 2019 23:47:16 +0000 (16:47 -0700)] 
Change output behavior for non-exist instances

Currently, a non-existent instance is not shown in the output, so it is hard for the user to tell whether the instance does not exist or does not belong to the same zone.

Add logic to check whether each instance exists in the instance list.

RB=1684700
BUG=HELIX-1911
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Fix check for disabled partitions
Junkai Xue [Thu, 30 May 2019 01:19:43 +0000 (18:19 -0700)] 
Fix check for disabled partitions

In the map field of disabled partitions, even if all partitions are enabled, some resource keys could be left over. We cannot just check whether there are any resource entries. With this fix, Helix loops over all the resource entries of the disabled map to see whether any partition list is non-empty.

In addition, fix failed tests in REST.
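
The check described above can be sketched as a loop over the per-resource partition lists. DisabledPartitionCheckSketch and the shape of its input map are assumptions for this example and are not taken from the Helix code.

```java
import java.util.List;
import java.util.Map;

// Hypothetical check; the input is assumed to map a resource name to its disabled partitions.
class DisabledPartitionCheckSketch {
  // A leftover resource key with an empty list does not count as a disabled partition.
  static boolean hasDisabledPartitions(Map<String, List<String>> disabledPartitionsByResource) {
    for (List<String> partitions : disabledPartitionsByResource.values()) {
      if (partitions != null && !partitions.isEmpty()) {
        return true;
      }
    }
    return false;
  }
}
```

Checking only whether the map has entries would report a leftover key with an empty list as disabled, which is the false positive this commit removes.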

RB=1683071
BUG=HELIX-1910
G=helix-reviewers
A=hulee

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Remove workaround in sending S->M message when there is a same pending relay message.
Lei Xia [Fri, 17 May 2019 17:36:22 +0000 (10:36 -0700)] 
Remove workaround in sending S->M message when there is a same pending relay message.

RB=1670732
BUG=HELIX-1871
G=helix-reviewers
A=jjwang,jxue

Signed-off-by: Hunter Lee <hulee@linkedin.com>
3 years ago Change state transition monitor to per cluster per state transition
Junkai Xue [Thu, 23 May 2019 19:36:59 +0000 (12:36 -0700)] 
Change state transition monitor to per cluster per state transition

The existing state transition metrics are recorded per cluster, per resource, and per state transition. A large cluster containing many resources may produce a tremendous number of metrics on the participant side.

This RB changes the metrics to per cluster per state transition, which should be sufficient for monitoring purposes.

RB=1677106
BUG=HELIX-1890
G=helix-reviewers
A=jjwang

Signed-off-by: Hunter Lee <hulee@linkedin.com>