Hunter Lee [Tue, 21 Jan 2020 18:42:18 +0000 (10:42 -0800)]
[maven-release-plugin] prepare release helix-0.9.300
Hunter Lee [Tue, 21 Jan 2020 18:20:43 +0000 (10:20 -0800)]
Enable helix-front for release
Huizhi Lu [Mon, 4 Nov 2019 22:12:12 +0000 (14:12 -0800)]
Revert "Deep copy for mapFields and listFields in ZNRecord's copy constructor. (#552)"
This reverts commit 2a335cf73ac65b53fd2b06c6b1ee8c70553d30b1.
Huizhi L [Thu, 31 Oct 2019 18:03:27 +0000 (11:03 -0700)]
Deep copy for mapFields and listFields in ZNRecord's copy constructor. (#552)
Deep copy for mapFields and listFields in ZNRecord's copy constructor.
Change list:
1. deep copy for mapFields and listFields in ZNRecord's copy constructor.
2. add unit test for the deep copy constructor.
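The difference between a shallow and a deep copy of nested fields can be sketched as follows. This is a minimal illustration only: `SimpleRecord` is a hypothetical stand-in with the same two field shapes, not Helix's actual ZNRecord implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record with nested collections (not Helix's ZNRecord).
final class SimpleRecord {
  final Map<String, Map<String, String>> mapFields = new HashMap<>();
  final Map<String, List<String>> listFields = new HashMap<>();

  SimpleRecord() {
  }

  // Deep-copying constructor: each inner map and list is copied, so the new
  // record shares no mutable collection with the source record.
  SimpleRecord(SimpleRecord other) {
    for (Map.Entry<String, Map<String, String>> e : other.mapFields.entrySet()) {
      mapFields.put(e.getKey(), new HashMap<>(e.getValue()));
    }
    for (Map.Entry<String, List<String>> e : other.listFields.entrySet()) {
      listFields.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
  }

  public static void main(String[] args) {
    SimpleRecord source = new SimpleRecord();
    source.listFields.put("partition_0", new ArrayList<>(java.util.Arrays.asList("node1", "node2")));
    SimpleRecord copy = new SimpleRecord(source);
    copy.listFields.get("partition_0").add("node3");
    // The source list is unaffected by changes made through the copy.
    System.out.println(source.listFields.get("partition_0").size());
  }
}
```

With a shallow copy (`new HashMap<>(other.listFields)`), both records would point at the same inner lists, and a change made through one would be visible through the other.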
Huizhi L [Wed, 23 Oct 2019 06:29:09 +0000 (23:29 -0700)]
Fix null response for instance stoppable check when connection refused. (#504)
Issue: The instance stoppable check endpoint /clusters//instances//stoppable returns null when the connection between the Helix REST server and the storage node is refused.
This diff fixes this by returning a StoppableCheck object when the connection is refused.
Huizhi L [Wed, 23 Oct 2019 02:54:05 +0000 (19:54 -0700)]
Add back the original DataPropagationLatencyGuage (with a typo) and mark it deprecated (#517)
If we remove the name with a typo, DataPropagationLatencyGuage, the current metrics graphs may not see the old metric and historical DataPropagationLatencyGuage data might get lost. To support backward compatibility, add back DataPropagationLatencyGuage and mark it as deprecated.
Hunter Lee [Tue, 22 Oct 2019 18:29:29 +0000 (11:29 -0700)]
Add a null check for StateModel in Participant reset logic (#523)
It was discovered that sometimes during shutdown/disconnect, this reset() gets called, and due to the partition having been dropped right around the same time, we get an NPE on the state model. Added a null check.
Huizhi L [Fri, 18 Oct 2019 00:15:59 +0000 (17:15 -0700)]
Fix name typo for DataPropagationLatencyGuage. (#513)
DataPropagationLatencyGuage has a typo which makes helix clients/products confusing to use. Fix this name typo to make it clearer.
Change list:
Rename DataPropagationLatencyGuage to DataPropagationLatencyGauge.
Replace hard coded names DataPropagationLatencyGuage with enum name DataPropagationLatencyGuage in unit tests.
Huizhi L [Thu, 10 Oct 2019 21:29:34 +0000 (14:29 -0700)]
Add import order for java and javax. (#499)
* Add import order for java and javax.
* Remove the empty line between java and javax.
Huizhi Lu [Mon, 30 Sep 2019 17:15:09 +0000 (10:15 -0700)]
Add unit test for setting application name.
Huizhi Lu [Sat, 28 Sep 2019 03:53:13 +0000 (20:53 -0700)]
#493 Set jersey servlet application name with namespace name.
Junkai Xue [Mon, 23 Sep 2019 23:57:15 +0000 (16:57 -0700)]
Make the Java Doc for API more clear
Some users were confused about the inputs based on the Javadoc. Make it clearer for users.
pkuwm [Tue, 24 Sep 2019 21:46:28 +0000 (14:46 -0700)]
Add Intellij code style XML file for Helix code style. (#481)
Add Intellij code style XML file so we can import it into Intellij to configure java code style for Helix.
Ali Reza Zamani Zadeh Najari [Tue, 24 Sep 2019 17:28:24 +0000 (10:28 -0700)]
Change the way Helix triggers rebalance (#472)
A method is added which generates OnDemandRebalance event.
This event causes the controller to run the rebalance pipeline for both of the pipelines.
"Touch" logic (as in directly reading and writing to ZNodes) has been removed and replaced by this new method.
Yi Wang [Wed, 18 Sep 2019 21:25:46 +0000 (14:25 -0700)]
Filter instances of weight = 0 for any partition assignment (#369)
1. Fix the available space calculation issue in card dealing algorithm
2. Remove the instances of weight 0 from AbstractEvenDistributionRebalanceStrategy.java#computePartitionAssignment's input parameters
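Filtering zero-weight instances out of the input before computing an assignment can be sketched as below. This is an assumption-laden illustration: `WeightFilter` and `eligibleInstances` are hypothetical names, and the real rebalance strategy works on Helix's own instance and topology types rather than a plain map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Helix.
final class WeightFilter {
  // Returns the names of instances whose weight is greater than zero,
  // in the iteration order of the input map.
  static List<String> eligibleInstances(Map<String, Integer> weightByInstance) {
    return weightByInstance.entrySet().stream()
        .filter(e -> e.getValue() > 0)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Map<String, Integer> weights = new LinkedHashMap<>();
    weights.put("node1", 1);
    weights.put("node2", 0);
    weights.put("node3", 2);
    System.out.println(eligibleInstances(weights));
  }
}
```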
pkuwm [Tue, 17 Sep 2019 21:33:37 +0000 (14:33 -0700)]
[helix-rest] Delete unused default namespace (api "/namespaces/default") (#449)
We have a namespace API: /admin/v2/namespaces/{namespace}/. However, the /namespaces/default path is not in use, so we need to delete it.
On the code level, if there is no default namespace, we won't create a DEFAULT_SERVLET.
On the app level, we can configure the app not to add a namespace named "default".
With this change, the endpoint /admin/v2/namespaces/ will be disabled if no namespace sets IS_DEFAULT to true.
Hunter Lee [Mon, 16 Sep 2019 21:40:20 +0000 (14:40 -0700)]
Fix CustomRebalancer's assignment computation (#477)
It was observed that sometimes CustomRebalancer would leave out an instance entirely if an instance is disabled or the partition on the instance was still bootstrapping (current state is null). This would cause a cluster not to converge. This diff fixes this by 1) still including an assignment from IdealState even though the current state is null (maybe due to a pending state transition) 2) putting disabled partitions in InitialState.
Changelist:
1. Fix the issue
2. Add a test: TestCustomRebalancer
pkuwm [Sat, 14 Sep 2019 00:08:17 +0000 (17:08 -0700)]
Fix missed callbacks in CurrentStates based RoutingTableProvider. (#458)
1. Update BasicClusterDataCache to do refresh with selective update. Only when a change happens, we do the cache refresh only for that change type.
2. Improve RoutingTableProvider.queueEvent() and RoutingTableProvider.handleEvent(). Return instanceConfigs snapshot to callback immediately, instead of waiting for currentStates completion.
pkuwm [Fri, 13 Sep 2019 00:51:50 +0000 (17:51 -0700)]
Fix helix-front build failure by downgrading types/lodash version. (#470)
Fix helix-front build failure by downgrading types/lodash version.
Junkai Xue [Tue, 10 Sep 2019 22:42:42 +0000 (15:42 -0700)]
Add field for MIN_ACTIVE_REPLICA_NOT_SET
Junkai Xue [Tue, 10 Sep 2019 00:14:57 +0000 (17:14 -0700)]
Make State Transition Throttling respect MIN_ACTIVE_REPLICA
There are two phases for improving Helix state transition throttling:
1. Respect MIN_ACTIVE_REPLICA
2. Throttle per replica state transitions.
This commit contains the logic of respecting MIN_ACTIVE_REPLICA in IntermediateCalStage and state transition throttling.
Ali Reza Zamani Zadeh Najari [Tue, 10 Sep 2019 18:47:34 +0000 (11:47 -0700)]
Fix the issue where JobContext is not updated properly (#435)
1- A method has been added which extracts the "prevInstanceToTaskAssignments" information from the context.
2- If it is confirmed that the currentstate is null:
An "if statement" is being utilized which sets the context using the target state information.
3- An integration test is added.
Ali Reza Zamani Zadeh Najari [Wed, 4 Sep 2019 19:39:45 +0000 (12:39 -0700)]
Fix the order of workflow context update
* Fix the order of workflow context update
In this commit:
The order that workflow dispatcher updates the workflow status has been changed.
If execution delay is set and job is inflight, the context will get updated.
An integration test has been added.
* minor fixes
Hunter Lee [Wed, 4 Sep 2019 19:26:10 +0000 (12:26 -0700)]
TASK: Fix forceDelete for jobs in JobQueue
We observed that the force delete functionality doesn't really work when the job is running, saying that the job is currently running. Force delete should go through regardless of the current job status.
Changelist:
1. Change the semantics in deleteJobFromQueue
2. Add an integration test: TestDeleteJobFromJobQueue
leesf [Thu, 8 Aug 2019 07:46:02 +0000 (15:46 +0800)]
remove all unused imports
Ali Reza Zamani Zadeh Najari [Fri, 30 Aug 2019 16:56:39 +0000 (09:56 -0700)]
Add integration test for workflow ForceDelete
This commit adds integration tests for ForceDelete.
Check the functionality of ForceDelete.
Add comment to ForceDelete that discourages users from using ForceDelete.
Several workflow states have been considered and checked for ForceDelete.
Hunter Lee [Mon, 26 Aug 2019 20:48:21 +0000 (13:48 -0700)]
TASK: Fix incorrect counting of numAttempts for tasks (#432)
TASK: Fix incorrect counting of numAttempts for tasks
It was discovered that sometimes the tasks' NUM_ATTEMPTS field in JobContext was getting incremented even without the tasks being retried. This was because the numAttempts field was getting incremented in other (incorrect) places than at scheduling time. The logic for incrementing the number of attempts has been moved to the schedule logic in this diff.
Changelist:
1. Modify tests so that they test for numAttempts more tightly
2. Fix the incrementation logic
3. Add a new integration test: TestTaskNumAttempts
chenboat [Thu, 22 Aug 2019 04:43:01 +0000 (21:43 -0700)]
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382
chenboat [Wed, 21 Aug 2019 04:21:03 +0000 (21:21 -0700)]
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382
chenboat [Tue, 20 Aug 2019 06:11:43 +0000 (23:11 -0700)]
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382
chenboat [Tue, 20 Aug 2019 04:59:59 +0000 (21:59 -0700)]
Make the reservoir sliding window length used in Helix monitor metrics configurable. #382
chenboat [Sun, 18 Aug 2019 02:01:40 +0000 (19:01 -0700)]
Fix a typo. #382
chenboat [Wed, 14 Aug 2019 06:22:33 +0000 (23:22 -0700)]
Fix a typo. #382
chenboat [Wed, 14 Aug 2019 06:20:59 +0000 (23:20 -0700)]
Add a unit test case. #382
chenboat [Tue, 13 Aug 2019 05:55:22 +0000 (22:55 -0700)]
Use the system property value as the sliding window length. #382
chenboat [Tue, 13 Aug 2019 05:47:54 +0000 (22:47 -0700)]
Use the system property value as the sliding window length. #382
chenboat [Fri, 9 Aug 2019 07:06:11 +0000 (00:06 -0700)]
Use the system property value as the sliding window length. #382
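Reading a sliding window length from a system property, with a fallback default, might look like the sketch below. The property name, the millisecond unit, and the default value are all assumptions made for illustration; the commits above do not state them.

```java
// Hypothetical configuration helper; not part of Helix.
final class SlidingWindowConfig {
  // Assumed property name and default, for illustration only.
  static final String WINDOW_PROPERTY = "example.monitor.slidingWindowMs";
  static final long DEFAULT_WINDOW_MS = 60_000L;

  // Returns the configured window length, or the default when the property
  // is missing, not a number, or not positive.
  static long windowLengthMs() {
    String raw = System.getProperty(WINDOW_PROPERTY);
    if (raw == null) {
      return DEFAULT_WINDOW_MS;
    }
    try {
      long value = Long.parseLong(raw.trim());
      return value > 0 ? value : DEFAULT_WINDOW_MS;
    } catch (NumberFormatException e) {
      return DEFAULT_WINDOW_MS;
    }
  }

  public static void main(String[] args) {
    System.out.println(windowLengthMs());
  }
}
```

Such a property would typically be passed on the JVM command line with a `-D` flag, for example `-Dexample.monitor.slidingWindowMs=30000` under the assumed name.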
Ali Reza Zamani Zadeh Najari [Wed, 14 Aug 2019 15:57:49 +0000 (08:57 -0700)]
Fix the execution delay for the jobs
In the Task Framework part of Helix, the execution delay for jobs is not respected.
In this commit, when a job is extracted from the inflightJobs queue, its timeline is checked before scheduling.
Yi Wang [Thu, 15 Aug 2019 18:53:26 +0000 (11:53 -0700)]
Move partition health check method into dataAccessor layer
Jiajun Wang [Mon, 19 Aug 2019 21:00:31 +0000 (14:00 -0700)]
Update menu bar.
Junkai Xue [Mon, 19 Aug 2019 18:15:05 +0000 (11:15 -0700)]
Fix typo for process name
Hunter Lee [Thu, 15 Aug 2019 21:46:31 +0000 (14:46 -0700)]
Revert "Add ChangeDetector interface and ResourceChangeDetector implementation (#388)"
This reverts commit e0c1c66dd6ed9a01955927ea1828fabcf59eeaad.
Hunter Lee [Thu, 15 Aug 2019 21:33:02 +0000 (14:33 -0700)]
Add ChangeDetector interface and ResourceChangeDetector implementation (#388)
Add ChangeDetector interface and ResourceChangeDetector implementation
In order to efficiently react to changes happening to the cluster in the new WAGED rebalancer, a new component called ChangeDetector was added.
Changelist:
1. Add ChangeDetector interface
2. Implement ResourceChangeDetector
3. Add ResourceChangeCache, a wrapper for critical cluster metadata
4. Add an integration test, TestResourceChangeDetector
Yi Wang [Fri, 9 Aug 2019 00:34:47 +0000 (17:34 -0700)]
Fix issue when client only sets ANY at cluster level throttle config
fixes #332
Added unit test for StateTransitionThrottleController
Added integration test for verifying case when only cluster level ANY throttle set to 1
Junkai Xue [Wed, 14 Aug 2019 18:23:11 +0000 (11:23 -0700)]
Fix ZNode does not exist in HealthCheck
If the ZNode of PartitionHealth does not exist, REST will return failed checks due to NPE. The fix will be adding the instance to be refreshed entirely. Then REST can check based on API refreshed result.
Jiajun Wang [Wed, 14 Aug 2019 20:58:22 +0000 (13:58 -0700)]
Revert "Reenable helix-front module for official release." (#406)
This reverts commit 3c3db0bf797cbc1e0c1aec59395c8632ed6455db.
jiajunwang [Tue, 13 Aug 2019 23:10:56 +0000 (16:10 -0700)]
Release note for 0.9.1.
jiajunwang [Tue, 13 Aug 2019 23:54:12 +0000 (16:54 -0700)]
Bump up the snapshot version.
Also fix the missing helix-agent snapshot update logic in bump-up.command.
Yi Wang [Thu, 8 Aug 2019 19:11:19 +0000 (12:11 -0700)]
Merge ... the latest optimization on batch get zookeeper properties
Yi Wang [Tue, 6 Aug 2019 01:33:38 +0000 (18:33 -0700)]
Add InstanceServiceImpl#batchGetInstancesStoppableChecks to solve performance issue #366
Jiajun Wang [Tue, 13 Aug 2019 18:30:45 +0000 (11:30 -0700)]
[maven-release-plugin] prepare for next development iteration
Jiajun Wang [Tue, 13 Aug 2019 18:30:35 +0000 (11:30 -0700)]
[maven-release-plugin] prepare release helix-0.9.1
Jiajun Wang [Mon, 12 Aug 2019 20:57:21 +0000 (13:57 -0700)]
Reenable helix-front module for official release.
Jiajun Wang [Mon, 12 Aug 2019 20:55:40 +0000 (13:55 -0700)]
Revert "[maven-release-plugin] prepare release helix-0.9.1"
This reverts commit c7e8e6366f6e5360d416e2fd1867252ebdcd7242.
Jiajun Wang [Mon, 12 Aug 2019 20:55:35 +0000 (13:55 -0700)]
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit f2746c823193991a0dd6152827b7344d66226368.
Jiajun Wang [Mon, 12 Aug 2019 20:36:09 +0000 (13:36 -0700)]
[maven-release-plugin] prepare for next development iteration
Jiajun Wang [Mon, 12 Aug 2019 20:35:57 +0000 (13:35 -0700)]
[maven-release-plugin] prepare release helix-0.9.1
Jiajun Wang [Mon, 12 Aug 2019 17:58:21 +0000 (10:58 -0700)]
Fix the CallbackHandler registration logic in DistributedLeaderElection (#395)
* Fix the CallbackHandler registration logic in DistributedLeaderElection that may leave a leader node with no callback registered.
Our current initialization logic assumes a strict leader acquire/relinquish events sequence. However, due to the possible carried over ZK events from the previous ZK session, the controller node change event might be triggered in the following sequence:
1. CALLBACK (from the previous session): Create new leader node and add handlers.
2. FINALIZE (Handle the previous session expire): Clean up handlers.
3. INIT (For the new session establishment): Expect to add the handlers back again.
As a result, if the INIT event processing does not recover the handlers, the leader controller won't be able to manage anything. This fix ensures that every acquireLeadership call tries to initialize the leader controller's callback handlers.
Also, add the additional test logic in TestHandleNewSession to verify the fix.
* Improve the leader history update logic so there is no duplicate entry recorded.
Hunter Lee [Fri, 9 Aug 2019 23:57:05 +0000 (16:57 -0700)]
TASK: Drop all tasks whose requested states are DROPPED
Upon a Participant disconnect, the Participant would carry over from the last session. This would copy all previous task states to the current session and set their requested states as DROPPED (for INIT and RUNNING states).
It came to our attention that sometimes these Participants experience connection issues and the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This issue proposes to add the logic that would get all tasks whose requested states are DROPPED to be dropped immediately.
Changelist:
1. Make sure all tasks whose requested state is DROPPED get added to tasksToDrop
2. Add a unit test: TestDropTerminalTasksUponReset
Junkai Xue [Mon, 5 Aug 2019 23:33:50 +0000 (16:33 -0700)]
Improve ZK read with batch call
Currently, HealthReport is read with a single call for each participant. Improve it with a batch call to ZK to reduce the number of calls.
Junkai Xue [Tue, 6 Aug 2019 03:53:13 +0000 (20:53 -0700)]
Add reviews@helix.apache.org to mailing list
Junkai Xue [Mon, 5 Aug 2019 23:25:03 +0000 (16:25 -0700)]
Stabilize the REST tests
Stabilize the REST tests with the following changes:
1. Remove the temporary cluster which impacts the ClusterAccessor test
2. Add start/end messages to all tests for debugging purposes.
3. Disable the unstable monitoring test for default MBeans. Sometimes we can query them and sometimes not. It is not a critical test path. Let's make it stable later.
Hunter Lee [Tue, 6 Aug 2019 18:32:16 +0000 (11:32 -0700)]
Read ClusterConfig from ZK selectively
Previously, ClusterConfig would be read from ZK on every pipeline run. This PR makes it a selective read and also adds it to the set of all changed types so that the cluster change detector can more easily tell whether ClusterConfig changed without having to store two copies of ClusterConfig objects.
kaisun2000 [Tue, 6 Aug 2019 18:58:16 +0000 (11:58 -0700)]
Fix RoutingTableProvider statePropagationLatency metric reporting bug (#365)
Issue:
When updating its snapshot, CurrentStateCache would miss all existing partitions that had a state change.
The RoutingTableProvider callback runs on the main event thread, and its time is not accounted for in the log.
Description:
Fix the bug by updating the snapshot with the correct reloadkeys.
Enhance the log to account for user callback code separately.
Tests:
mvn test passed.
Yi Wang [Tue, 23 Jul 2019 23:27:59 +0000 (16:27 -0700)]
Dynamically change the processor thread name when consuming event
Hunter Lee [Mon, 5 Aug 2019 17:16:27 +0000 (10:16 -0700)]
Remove DEFAULT_VIEW_CLUSTER_REFRESH_PERIOD from ClusterConfig
This is a constant that is no longer used.
Hunter Lee [Mon, 5 Aug 2019 16:19:51 +0000 (09:19 -0700)]
Remove .reviewboardrc from the open source repository
Ali Reza Zamani Zadeh Najari [Thu, 1 Aug 2019 17:54:28 +0000 (10:54 -0700)]
Remove unnecessary touch logic that triggers the pipeline
In the places where the ZooKeeper ResourceConfig is updated,
it is no longer necessary to use touch logic to run the pipeline again.
A ResourceConfig update automatically triggers the pipeline.
This commit fixes issue #370.
jiajunwang [Tue, 30 Jul 2019 21:41:32 +0000 (14:41 -0700)]
Fix the race condition while Helix refreshes the cluster status cache. (#363)
* Fix the race condition while Helix refreshes the cluster status cache.
This change fixes issue #331.
The design ensures only one read, to avoid locking during the change notification. However, a later update introduced an additional read. Because notification is lock-free, the two reads may return different results, which leaves the cache in an inconsistent state. As a result, the expected rebalance might not happen.
Ali Reza Zamani Zadeh Najari [Tue, 23 Jul 2019 22:15:47 +0000 (15:15 -0700)]
Remove TODO NPE log for computeResourceBestPossibleState
The logs related to the NPE in computeResourceBestPossibleState are not needed anymore.
This commit fixes issue #351.
Ali Najari [Wed, 17 Jul 2019 21:21:40 +0000 (14:21 -0700)]
Read Failure while reading non-existent znode
In this commit, when a NoNodeException is encountered while reading data from a znode that does not exist, the NoNodeException will be caught and readfailurecounter will not be incremented.
Instead, the related information (read Counter, read Latency, etc.) will be recorded.
This commit fixes issue #345.
Kai Sun [Wed, 17 Jul 2019 01:36:42 +0000 (18:36 -0700)]
Implementation of stateModelDef modification in REST 2.0
Current implementation of Rest 2.0 does not support stateModelDef modification. Here, we will implement
delete -- remove the stateModelDef with the input id.
put -- create new statemodeldef if no existing one with same input id
set -- replace the content of node with input id
We also add the following test cases:
Test delete model one; expect success
Test delete model one again; expect success
Create the deleted model one; expect success
Create the deleted model one again; expect failure as the same model id exists
Set the model one with modified content; expect success
Read the model one; expect the content would be same as modified content
Set the model one to original content restore original state; expect success
Ali Reza Zamani Zadeh Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)]
Change IllegalStateException to Helix Exception for CRUSH based rebalance strategy algorithm
In this commit, the IllegalStateException is caught and a HelixException is thrown to the upper layer instead. The error log now shows a more meaningful exception.
A test has been changed accordingly.
This commit fixes issue #322.
Junkai Xue [Mon, 22 Jul 2019 22:25:01 +0000 (15:25 -0700)]
Fix test fail for TestRebalanceScheduler
Junkai Xue [Tue, 16 Jul 2019 01:32:57 +0000 (18:32 -0700)]
Fix invoke rebalance by "touching" IdealState/ResourceConfig
The current HelixDataAccessor updateProperty uses ZNRecordUpdater. Its merge logic simply adds all elements when merging a ZNRecord, which can cause a lot of duplication in listFields.
This impacts invokeRebalanceForResourceConfig. The fix is to implement a customized updater.
In this commit:
1. Fix invoke rebalance with customized updater.
2. Add comments for ZNRecord merge.
3. Add checks in TaskUtil to only trigger the Workflow Config "touch" when purging jobs.
4. Add a test for RebalanceScheduler.
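The duplication problem described above can be illustrated with plain collections. This is a simplified sketch, not Helix code: `ListFieldMerge`, `appendMerge`, and `replaceMerge` are hypothetical names, and the maps stand in for a record's listFields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of two merge strategies for list-valued fields.
final class ListFieldMerge {
  // Append-style merge: every element of the update is added to the existing
  // list, so writing the same content back ("touching") duplicates entries.
  static Map<String, List<String>> appendMerge(
      Map<String, List<String>> current, Map<String, List<String>> update) {
    Map<String, List<String>> result = copyOf(current);
    for (Map.Entry<String, List<String>> e : update.entrySet()) {
      result.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
    }
    return result;
  }

  // Replace-style update: lists in the update overwrite the existing lists,
  // so writing the same content back leaves the record unchanged.
  static Map<String, List<String>> replaceMerge(
      Map<String, List<String>> current, Map<String, List<String>> update) {
    Map<String, List<String>> result = copyOf(current);
    for (Map.Entry<String, List<String>> e : update.entrySet()) {
      result.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    return result;
  }

  private static Map<String, List<String>> copyOf(Map<String, List<String>> source) {
    Map<String, List<String>> copy = new HashMap<>();
    for (Map.Entry<String, List<String>> e : source.entrySet()) {
      copy.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    return copy;
  }

  public static void main(String[] args) {
    Map<String, List<String>> record = new HashMap<>();
    record.put("partition_0", new ArrayList<>(Arrays.asList("node1", "node2")));
    System.out.println(appendMerge(record, record).get("partition_0"));
    System.out.println(replaceMerge(record, record).get("partition_0"));
  }
}
```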
Ali Najari [Mon, 15 Jul 2019 22:25:52 +0000 (15:25 -0700)]
IllegalStateException for CRUSH based rebalance strategy algorithm.
This commit fixes the error log and exception that are shown when there are not enough eligible instances to use.
Junkai Xue [Tue, 9 Jul 2019 23:38:10 +0000 (16:38 -0700)]
Exclude ANY_INSTANCE for customized sibling checks
The current Helix HealthCheck API checks the ANY_INSTANCE resources, which is not necessary. Since ANY_INSTANCE resources only have a single partition with 1 replica, there is no need to check sibling health status.
This commit fixes issue #328
Junkai Xue [Tue, 9 Jul 2019 23:47:58 +0000 (16:47 -0700)]
Disable helix-front build
Hunter Lee [Wed, 12 Jun 2019 17:27:45 +0000 (10:27 -0700)]
Update all markdown files to 0.9.0
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 11 Jun 2019 19:01:49 +0000 (12:01 -0700)]
Update website with 0.9.0 with maven plugin version upgrades
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 11 Jun 2019 00:37:00 +0000 (17:37 -0700)]
Add Release Notes and Docs for 0.9.0 release
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 10 Jun 2019 23:57:51 +0000 (16:57 -0700)]
Upgrade ivy version for 0.9.0 release
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 3 Jun 2019 00:56:17 +0000 (17:56 -0700)]
prepare for next development iteration
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 25 Jun 2019 22:08:22 +0000 (15:08 -0700)]
[maven-release-plugin] prepare release helix-0.9.0
Hunter Lee [Sun, 2 Jun 2019 23:55:24 +0000 (16:55 -0700)]
Enable helix-front in pom.xml
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Tue, 25 Jun 2019 07:08:44 +0000 (00:08 -0700)]
Merge differences with another branch
There are multiple branches against which Helix devs have been doing development work. We wish to consolidate them into one by reconciling all differences. This diff makes such changes. This diff does not contain any changes in logic or functionality.
Junkai Xue [Fri, 21 Jun 2019 18:58:13 +0000 (11:58 -0700)]
Fix looping with keySet and modifying keySet same time
Looping over a keySet while modifying its entries at the same time can cause a ConcurrentModificationException. Fix this by adding the entries to a separate new Set and removing them after the loop is done.
RB=1711266
G=helix-reviewers
A=jjwang,hulee,ksun
Signed-off-by: Hunter Lee <hulee@linkedin.com>
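The keySet fix described in the commit above follows a common Java pattern: collect the keys to delete in a separate set during iteration, then remove them once the loop is finished. A minimal sketch, with a hypothetical `SafeRemoval` class and an expiry map invented for illustration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical example; the map contents are invented for illustration.
final class SafeRemoval {
  // Removes every entry whose expiry time is at or before "now".
  // Keys are collected first and removed after the loop, so the map is
  // never structurally modified while its keySet is being iterated.
  static void removeExpired(Map<String, Long> expiryByKey, long now) {
    Set<String> toRemove = new HashSet<>();
    for (String key : expiryByKey.keySet()) {
      if (expiryByKey.get(key) <= now) {
        toRemove.add(key);
      }
    }
    expiryByKey.keySet().removeAll(toRemove);
  }

  public static void main(String[] args) {
    Map<String, Long> expiry = new HashMap<>();
    expiry.put("a", 1L);
    expiry.put("b", 10L);
    expiry.put("c", 3L);
    removeExpired(expiry, 5L);
    System.out.println(expiry.keySet());
  }
}
```

Calling `expiryByKey.remove(key)` inside the loop instead would risk the ConcurrentModificationException that the commit describes.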
Jiajun Wang [Mon, 17 Jun 2019 21:37:02 +0000 (14:37 -0700)]
Always try reading from EphemeralOwner state first while reading the session ID from a live instance node.
This is to avoid an inconsistent session ID between the node content and the ephemeral owner state.
Note that in order to ensure backward compatibility and keep some test cases working, the newly introduced method will still read from the node content if the ephemeral owner state is empty (-1 or 0).
RB=1704942
BUG=HELIX-1969
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
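The read order described in the commit above (ephemeral owner first, node content as a fallback when the owner is -1 or 0) can be sketched as below. `SessionIdReader` is a hypothetical helper, and rendering the owner as a hexadecimal string is an assumption made for illustration.

```java
// Hypothetical helper; not the actual Helix method.
final class SessionIdReader {
  // Prefers the session ID derived from the znode's ephemeral owner.
  // Falls back to the ID stored in the node content when the owner is
  // empty (0 or -1), as the commit note describes.
  static String sessionId(long ephemeralOwner, String sessionIdFromContent) {
    if (ephemeralOwner != 0L && ephemeralOwner != -1L) {
      // Assumption: the session ID is represented as a hex string.
      return Long.toHexString(ephemeralOwner);
    }
    return sessionIdFromContent;
  }

  public static void main(String[] args) {
    System.out.println(sessionId(0x1abcL, "fromContent"));
    System.out.println(sessionId(0L, "fromContent"));
  }
}
```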
Junkai Xue [Tue, 18 Jun 2019 22:38:24 +0000 (15:38 -0700)]
Fix compute IdealState mapping tool
There is a bug in the IdealState mapping tool: it does not filter out instances that are live but disabled. Add this logic and extra tests for it.
RB=1706106
BUG=HELIX-1974
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 13 Jun 2019 01:39:50 +0000 (18:39 -0700)]
Enable default Jersey server metric reporting
For monitoring Helix REST, we can support both REST server monitoring and customized logic monitoring.
In this RB, we enable the Jersey server monitoring metrics and add tests for them.
RB=1701238
BUG=HELIX-1963
G=helix-reviewers
A=ywang4
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Lei Xia [Wed, 5 Jun 2019 00:15:17 +0000 (17:15 -0700)]
Remove relay message from controller's message cache immediately if the partition on the relay host turns to ERROR state while transitioning off the top state.
RB=1689771
BUG=HELIX-1900
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Mon, 10 Jun 2019 22:29:58 +0000 (15:29 -0700)]
Upgrade Apache rat version and add exclusion paths
A part of the Helix release process requires the rat plugin to perform checks. However, there are scripts and website files that do not require this kind of check. This diff adds exclusion paths to the pom.xml. Note that there are several Java files that still do not pass the checks because they do not have the Apache license. This still needs to be fixed in the future.
RB=1695987
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Wed, 24 Apr 2019 23:50:30 +0000 (16:50 -0700)]
Catch exception and log error when helix-admin-webapp fails to read data from certain path
RB=1644066
BUG=https://jira01.corp.linkedin.com:8443/browse/EXC-114388
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Yi Wang [Mon, 3 Jun 2019 18:15:54 +0000 (11:15 -0700)]
Fix http request hanging issue to the SN API
RB=1684758
G=helix-reviewers
A=jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 30 May 2019 23:47:16 +0000 (16:47 -0700)]
Change output behavior for non-exist instances
Currently, non-existent instances are not shown in the output, so it is hard for users to tell whether an instance does not exist or does not belong to the same zone.
Add logic to check whether each instance exists in the instance list.
RB=1684700
BUG=HELIX-1911
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 30 May 2019 01:19:43 +0000 (18:19 -0700)]
Fix check for disabled partitions
For the map field of disabled partitions, some resource keys may be left over even when all partitions are enabled, so we cannot just check whether there are any resource entries. With this fix, Helix loops over all resource entries of the disabled map to see whether any partition list is non-empty.
In addition, fix failed tests in REST.
RB=1683071
BUG=HELIX-1910
G=helix-reviewers
A=hulee
Signed-off-by: Hunter Lee <hulee@linkedin.com>
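The disabled-partition check from the commit above comes down to looking inside each resource entry instead of testing whether the map itself is empty. A minimal sketch, assuming a plain map from resource name to disabled partition names (`DisabledPartitionCheck` is a hypothetical helper, not the Helix API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the map layout is an assumption for illustration.
final class DisabledPartitionCheck {
  // Returns true only if at least one resource has a non-empty list of
  // disabled partitions. Leftover resource keys with empty (or null) lists
  // do not count as disabled partitions.
  static boolean hasDisabledPartitions(Map<String, List<String>> disabledByResource) {
    for (List<String> partitions : disabledByResource.values()) {
      if (partitions != null && !partitions.isEmpty()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Map<String, List<String>> disabled = new HashMap<>();
    disabled.put("resourceA", new ArrayList<>());
    System.out.println(hasDisabledPartitions(disabled));
    disabled.put("resourceB", new ArrayList<>(Arrays.asList("resourceB_0")));
    System.out.println(hasDisabledPartitions(disabled));
  }
}
```

A check such as `!disabledByResource.isEmpty()` would report the leftover `resourceA` key as a disabled partition, which is the false positive the commit fixes.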
Lei Xia [Fri, 17 May 2019 17:36:22 +0000 (10:36 -0700)]
Remove workaround in sending S->M message when there is a same pending relay message.
RB=1670732
BUG=HELIX-1871
G=helix-reviewers
A=jjwang,jxue
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Junkai Xue [Thu, 23 May 2019 19:36:59 +0000 (12:36 -0700)]
Change state transition monitor to per cluster per state transition
The existing state transition metrics are recorded per cluster, per resource, and per state transition. A large cluster containing many resources may produce a tremendous number of metrics on the participant side.
This RB changes the metrics to per cluster per state transition, which should be sufficient for monitoring purposes.
RB=1677106
BUG=HELIX-1890
G=helix-reviewers
A=jjwang
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Hunter Lee [Sat, 1 Jun 2019 00:08:03 +0000 (17:08 -0700)]
Upgrade ZK to 3.4.13
Signed-off-by: Hunter Lee <hulee@linkedin.com>
Jiajun Wang [Wed, 22 May 2019 02:01:16 +0000 (19:01 -0700)]
Adding Zk data change callback propagation latency metric.
Note that the latency metric only covers data change callback for now.
To add the child change callback, we need to find a way to avoid the additional ZK access that is required to read the children nodes' stats. Added a TODO in the corresponding code block.
RB=1674550
G=helix-reviewers
A=jxue,ksun
Signed-off-by: Hunter Lee <hulee@linkedin.com>