Junkai Xue [Mon, 4 May 2020 23:12:53 +0000 (16:12 -0700)]
[maven-release-plugin] prepare release helix-1.0.0
Junkai Xue [Mon, 4 May 2020 21:28:43 +0000 (14:28 -0700)]
Revert "[maven-release-plugin] prepare release helix-1.0.0"
This reverts commit a71153cfe1e188b0a870a7b113ac3b112cb25a47.
Junkai Xue [Mon, 4 May 2020 20:02:33 +0000 (13:02 -0700)]
Revert "[maven-release-plugin] prepare for next development iteration"
This reverts commit 0fd189e099389ce8859870212f7539124e0b017f.
Junkai Xue [Mon, 4 May 2020 20:02:13 +0000 (13:02 -0700)]
Revert "Add async call retry to resolve the transient ZK connection issue. (#970)"
This reverts commit 96ebb27c23004a7a69dc4799b14586ff82d53c9e.
Hunter Lee [Mon, 4 May 2020 21:28:43 +0000 (14:28 -0700)]
Rearrange zookeeper imports in pom.xml (#995)
This commit makes sure pom.xml is up to date in preparation for the 1.0.X release.
Jiajun Wang [Mon, 4 May 2020 19:36:13 +0000 (12:36 -0700)]
Add async call retry to resolve the transient ZK connection issue. (#970)
If any exception occurs during an async call, the current design fails the operation and may eventually return a partial result.
This change makes the ZkClient retry the operation if the error is caused by a temporary ZK connection issue (CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED).
This gives the async call a better chance of completing the operation. Note that if the exception is due to business logic, the async call still fails and the correct return code is sent to the callback handler.
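The retry policy described in #970 can be illustrated with a minimal, synchronous Java sketch. It is not the ZkClient change itself: the real change works with asynchronous callbacks, and `ZkResult`, `AsyncRetrySketch` and `callWithRetry` are hypothetical names, with the enum only mirroring the ZooKeeper code names from the message. (This log also records a revert of #970.)

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical stand-in for the ZooKeeper result codes named in the commit message.
enum ZkResult { OK, CONNECTIONLOSS, SESSIONEXPIRED, SESSIONMOVED, NONODE, NODEEXISTS }

final class AsyncRetrySketch {
  // Connection-level failures that are treated as transient and therefore retried.
  private static final Set<ZkResult> TRANSIENT =
      EnumSet.of(ZkResult.CONNECTIONLOSS, ZkResult.SESSIONEXPIRED, ZkResult.SESSIONMOVED);

  private AsyncRetrySketch() {}

  /**
   * Invokes the operation and re-invokes it while it reports a transient code and
   * attempts remain. Any other code (success or a business-logic error such as NONODE)
   * is returned immediately, so the callback handler receives the real result.
   */
  static ZkResult callWithRetry(Supplier<ZkResult> operation, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    ZkResult rc = operation.get();
    for (int attempt = 1; attempt < maxAttempts && TRANSIENT.contains(rc); attempt++) {
      rc = operation.get();
    }
    return rc;
  }
}
```

Capping the number of attempts keeps a persistently broken connection from retrying forever; a production version would likely also back off between attempts.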
Junkai Xue [Mon, 4 May 2020 19:32:57 +0000 (12:32 -0700)]
[maven-release-plugin] prepare for next development iteration
Junkai Xue [Mon, 4 May 2020 19:23:27 +0000 (12:23 -0700)]
[maven-release-plugin] prepare release helix-1.0.0
Junkai Xue [Mon, 4 May 2020 19:13:16 +0000 (12:13 -0700)]
Enable helix-front for release
Meng Zhang [Mon, 4 May 2020 17:38:23 +0000 (10:38 -0700)]
fix version comparison issue in compatibility check stage (#992)
Ali Reza Zamani Zadeh Najari [Fri, 1 May 2020 00:48:20 +0000 (17:48 -0700)]
Stabilizing 4 flaky tests (#981)
Four tests have been stabilized in this commit.
These tests are:
1-TestJobFailure
2-TestRebalanceRunningTask
3-TestTaskRebalancerStopResume
4-TestTaskSchedulingTwoCurrentStates
TestJobFailure was unstable because we read the ExternalView of a resource, and if the controller had not populated the ExternalView yet, we hit a NullPointerException.
TestRebalanceRunningTask was unstable. In this PR, we make sure that the master has existed on two different nodes (i.e., mastership has switched to a new instance) before checking the assigned participants.
TestTaskRebalancerStopResume was unstable because of Thread.sleep usage. Instead of stopping the workflow after a fixed delay, we first make sure that the workflow and job are IN_PROGRESS and then stop the workflow.
TestTaskSchedulingTwoCurrentStates has been stabilized by making sure that the master has switched to a new instance after modifying the IdealState (IS). After that, we make sure that the task is assigned to the correct instance, that it does not switch to the new instance, and that cancel is not called incorrectly.
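The common thread in these fixes, waiting on observable state instead of sleeping for a fixed time, can be sketched as a small polling helper. `TestWaitSketch.waitFor` is a hypothetical name, not a Helix test utility; in these tests the condition would be something like "the resource's ExternalView is non-null" or "the workflow and job are IN_PROGRESS".

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class TestWaitSketch {
  private TestWaitSketch() {}

  /**
   * Re-evaluates the condition every pollInterval until it holds or the timeout elapses.
   * Returns true as soon as the condition holds, false on timeout or interruption.
   */
  static boolean waitFor(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }
}
```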
Junkai Xue [Wed, 29 Apr 2020 00:42:21 +0000 (17:42 -0700)]
Add 1.0.0 release folder and release notes
Huizhi Lu [Fri, 24 Apr 2020 23:46:58 +0000 (16:46 -0700)]
Fix failed tests in helix-rest (#966)
In the tests TestZkRoutingDataWriter and TestZkRoutingDataReader, the zkClient reads and writes ZNRecords; however, its serializer is a BasicZkSerializer rather than a ZNRecordSerializer. So when a ZNRecord is read or written, a ZkMarshallingError is thrown and the tests fail.
Hunter Lee [Fri, 24 Apr 2020 22:40:09 +0000 (15:40 -0700)]
Improve logging for isClusterSetup (#968)
If isClusterSetup fails, we don't get the error message. This PR improves that behavior by logging it as a warning.
Molly Gao [Thu, 23 Apr 2020 19:34:57 +0000 (12:34 -0700)]
Change DistributedLock interface APIs (#961)
Change DistributedLock interface API names to follow Java convention
Molly Gao [Wed, 19 Feb 2020 01:58:19 +0000 (17:58 -0800)]
Rename interface HelixLock to DistributedLock
Molly Gao [Tue, 18 Feb 2020 22:21:15 +0000 (14:21 -0800)]
Remove dependency of LockInfo on HelixProperty
Molly Gao [Sat, 15 Feb 2020 02:51:48 +0000 (18:51 -0800)]
Clean up code
Molly Gao [Fri, 14 Feb 2020 21:28:07 +0000 (13:28 -0800)]
Created LockScope interface
Molly Gao [Tue, 11 Feb 2020 02:31:26 +0000 (18:31 -0800)]
refactor LockInfo and some updates on the HelixLockScope
Molly Gao [Fri, 7 Feb 2020 22:53:03 +0000 (14:53 -0800)]
Added cluster level to HelixLockScope and convert lock path to uppercase
Molly Gao [Fri, 7 Feb 2020 04:40:50 +0000 (20:40 -0800)]
simplified acquireLock logic
Molly Gao [Thu, 6 Feb 2020 23:46:12 +0000 (15:46 -0800)]
Fixed lock path generation
Molly Gao [Thu, 6 Feb 2020 20:02:00 +0000 (12:02 -0800)]
Changed method doc for releaseLock in HelixLock interface
Molly Gao [Thu, 6 Feb 2020 02:07:00 +0000 (18:07 -0800)]
A few fixes on syntax
Molly Gao [Wed, 5 Feb 2020 01:38:13 +0000 (17:38 -0800)]
Added test to acquire lock simultaneously
Molly Gao [Tue, 4 Feb 2020 18:41:13 +0000 (10:41 -0800)]
Fixed logic of release and isOwner
Molly Gao [Mon, 3 Feb 2020 22:26:41 +0000 (14:26 -0800)]
Added unit tests for Helix nonblocking lock
Molly Gao [Wed, 29 Jan 2020 02:16:45 +0000 (18:16 -0800)]
created Helix nonblocking lock based on zk
Molly Gao [Thu, 30 Jan 2020 17:51:55 +0000 (09:51 -0800)]
Added details in comments
Molly Gao [Fri, 24 Jan 2020 23:33:04 +0000 (15:33 -0800)]
Added LockInfo interface
Molly Gao [Thu, 23 Jan 2020 19:30:59 +0000 (11:30 -0800)]
Created Helix distributed lock design (apache#702)
mgao0 [Thu, 30 Jan 2020 21:39:16 +0000 (13:39 -0800)]
Created Helix distributed lock interface (#703)
Added LockInfo interfaces.
zhangmeng916 [Wed, 15 Jan 2020 01:11:25 +0000 (17:11 -0800)]
Add Helix Distributed lock module (#673)
Add Helix Lock module.
Meng Zhang [Mon, 13 Apr 2020 20:53:18 +0000 (13:53 -0700)]
use new ZNRecord and update test
Molly Gao [Mon, 13 Apr 2020 17:56:48 +0000 (10:56 -0700)]
Move routing table provider initialization (#946)
This PR moves the initialization of the routing table provider into the before-class setup so that it is initialized before any updates. It also adds several checks to the validation method to cover some edge cases.
Ali Reza Zamani Zadeh Najari [Mon, 13 Apr 2020 17:53:44 +0000 (10:53 -0700)]
Add registration logic for CustomizedView listeners (#944)
In this commit, new logic is added to target the scenario where a user
disables and then enables specific types. In this case, since the controller
removes the CustomizedView path for that type, the router loses its
listener. This commit adds a root change listener that re-registers the
listeners.
Meng Zhang [Fri, 10 Apr 2020 07:16:47 +0000 (00:16 -0700)]
Add cache update/delete in customized view aggregation stage (#934)
Add a customized view cache update in the customized view aggregation stage. This ensures that the cache does not hold stale data.
meng [Tue, 7 Apr 2020 18:04:40 +0000 (11:04 -0700)]
fix customized state provider (#928)
Modify the customized state provider factory. The new factory can build a customized state provider with either Helix's own manager or a manager provided by the customer.
Molly Gao [Tue, 31 Mar 2020 17:49:52 +0000 (10:49 -0700)]
Add integration test to customized view aggregation (#912)
The integration test covers these components: updating customized states using the customized view provider, and using the routing table provider to get snapshots of the customized views aggregated in the controller.
Ali Reza Zamani Zadeh Najari [Mon, 30 Mar 2020 23:47:13 +0000 (16:47 -0700)]
Replace customized view cache with property cache (#869)
In this commit, the custom implementation of customized view cache
has been replaced with property cache implementation.
zhangmeng916 [Mon, 30 Mar 2020 21:26:57 +0000 (14:26 -0700)]
minor fix for customized view aggregation (#917)
Fix minor issues in customized view aggregation logic and add some more tests.
Co-authored-by: Meng Zhang <mnzhang@mnzhang-mn1.linkedin.biz>
zhangmeng916 [Wed, 25 Mar 2020 21:24:25 +0000 (14:24 -0700)]
Add new stages in Helix generic controller for customized view aggregation. (#851)
Add extra stages and pipelines in controller for customized state computation and customized view aggregation.
Add refresh logic in resource data provider for customized view related data refresh.
Add customized state event handling in CallbackHandler.
Add integration test for customized view aggregation.
Modify existing tests to verify new logic.
Co-authored-by: Meng Zhang <mnzhang@mnzhang-mn1.linkedin.biz>
Ali Reza Zamani Zadeh Najari [Fri, 20 Mar 2020 18:32:47 +0000 (11:32 -0700)]
Complete the Routing Table Provider for CustomizedView (#834)
In this commit, the routing table provider has been extended to support the customized view feature.
zhangmeng916 [Wed, 18 Mar 2020 06:08:20 +0000 (23:08 -0700)]
Add two stages for customized state view aggregation. (#888)
1. One stage is the computation stage for customized state. It takes the Zookeeper data of customized states and converts them to the formatted output used by the other stage.
2. The other stage is customized view aggregation stage. It will take the output from the customized state computation stage, and output the customized view to Zookeeper.
3. The two stages together compute the customized view from the customized states.
4. Unit tests are added to verify the correctness of the two stages.
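A minimal sketch of what the two stages compute together, using plain maps instead of Helix's data classes: the customized states of one type, reported per instance, are pivoted into a per-resource view keyed by partition and then by instance. The class and method names are hypothetical, and the state values used in the test (READY, BOOTSTRAP) are made up.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plain-map sketch of customized view aggregation for one state type. */
final class CustomizedViewSketch {
  private CustomizedViewSketch() {}

  /**
   * Input:  instance -> (resource -> (partition -> state)).
   * Output: resource -> (partition -> (instance -> state)).
   */
  static Map<String, Map<String, Map<String, String>>> aggregate(
      Map<String, Map<String, Map<String, String>>> statesByInstance) {
    Map<String, Map<String, Map<String, String>>> view = new TreeMap<>();
    for (Map.Entry<String, Map<String, Map<String, String>>> instanceEntry
        : statesByInstance.entrySet()) {
      String instance = instanceEntry.getKey();
      for (Map.Entry<String, Map<String, String>> resourceEntry
          : instanceEntry.getValue().entrySet()) {
        Map<String, Map<String, String>> partitions =
            view.computeIfAbsent(resourceEntry.getKey(), resource -> new TreeMap<>());
        for (Map.Entry<String, String> partitionEntry : resourceEntry.getValue().entrySet()) {
          // Record this instance's state for the partition.
          partitions
              .computeIfAbsent(partitionEntry.getKey(), partition -> new TreeMap<>())
              .put(instance, partitionEntry.getValue());
        }
      }
    }
    return view;
  }
}
```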
zhangmeng916 [Thu, 12 Mar 2020 17:07:23 +0000 (10:07 -0700)]
update cache functions for customized view aggregation (#887)
Update some cache functions for customized view aggregation
Molly Gao [Wed, 11 Mar 2020 17:45:58 +0000 (10:45 -0700)]
Use updater to update customized state for concurrency control (#859)
Currently, the method that updates customized state is synchronized for concurrency control. This commit changes the update implementation to leave concurrency control to ZooKeeper by using an updater to update the customized state. Since the delete method is already implemented with an updater, this prevents unexpected changes to the customized state data.
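The updater approach can be sketched as optimistic concurrency control: read the data and its version, apply an update function, and write back only if the version is unchanged, retrying otherwise. ZooKeeper offers this through version-conditioned `setData`; in this minimal sketch an `AtomicReference` stands in for the ZNode, and `VersionedNode`, `writeIfVersion` and `update` are hypothetical names, not Helix or ZooKeeper API.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a ZNode: its data plus a version counter. */
final class VersionedNode<T> {
  /** Immutable (data, version) pair, analogous to ZNode data plus its Stat version. */
  static final class Snapshot<T> {
    final T data;
    final int version;

    Snapshot(T data, int version) {
      this.data = data;
      this.version = version;
    }
  }

  private final AtomicReference<Snapshot<T>> current;

  VersionedNode(T initial) {
    current = new AtomicReference<>(new Snapshot<>(initial, 0));
  }

  Snapshot<T> read() {
    return current.get();
  }

  /** Conditional write: succeeds only if nobody has written since expectedVersion was read. */
  boolean writeIfVersion(T newData, int expectedVersion) {
    Snapshot<T> seen = current.get();
    if (seen.version != expectedVersion) {
      return false;
    }
    return current.compareAndSet(seen, new Snapshot<>(newData, expectedVersion + 1));
  }

  /** Read-modify-write loop: re-read and re-apply the updater until a conditional write wins. */
  T update(UnaryOperator<T> updater) {
    while (true) {
      Snapshot<T> snapshot = read();
      T updated = updater.apply(snapshot.data);
      if (writeIfVersion(updated, snapshot.version)) {
        return updated;
      }
    }
  }
}
```

No lock is held across the read and the write; a concurrent writer simply causes the loser to re-read and re-apply its updater.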
zhangmeng916 [Wed, 11 Mar 2020 05:45:05 +0000 (22:45 -0700)]
rename customized state aggregation config to customized state config (#885)
Rename customized state aggregation config to customized state config for future extensibility.
Co-authored-by: Meng Zhang <mnzhang@mnzhang-mn1.linkedin.biz>
Meng Zhang [Thu, 5 Mar 2020 18:24:17 +0000 (10:24 -0800)]
Add intermediate storage for customized state (#827)
1. Implement a participant state cache for generalizing functions in both current state cache and customized state cache.
2. Implement an intermediate data structure to store the result of customized state computation and prepare the data in a format that customized view computation can use later.
mgao0 [Tue, 3 Mar 2020 18:56:16 +0000 (10:56 -0800)]
Improve CustomizedStateProvider tests (#840)
Add tests to make CustomizedStateProvider tests comprehensive.
zhangmeng916 [Thu, 27 Feb 2020 00:34:27 +0000 (16:34 -0800)]
add listener and config for customized view aggregation (#815)
Add listeners for customized state and customized state aggregation config in Helix managers
Ali Reza Zamani Zadeh Najari [Thu, 27 Feb 2020 00:29:27 +0000 (16:29 -0800)]
Add basic functionalities for RoutingTableProvider for CustomizedView (#814)
This commit contains the basic functionalities for
CustomizedView RoutingTableProvider.
Here are the newly added functionalities:
1- CustomizedViewChangeListener
2- Addition of CustomizedView to helix PropertyType and PropertyKey
3- Implementation of CustomizedView cache.
4- Registering CallbackHandler for CustomizedView.
zhangmeng916 [Wed, 26 Feb 2020 18:05:55 +0000 (10:05 -0800)]
Implement Helix API for updating customized state (#729)
Implement Helix APIs in CustomizedStateProvider for customers to operate on their own customized states. The available operations are update, get, and delete. To use CustomizedStateProvider, Helix users should initialize its factory and pass the required parameters.
Ali Reza Zamani Zadeh Najari [Tue, 25 Feb 2020 22:05:10 +0000 (14:05 -0800)]
Add REST API to add, remove and update CustomizedStateAggregationConfig (#797)
In this commit the below REST APIs have been added.
1- addCustomizedStateAggregationConfig
2- removeCustomizedStateAggregationConfig
3- getCustomizedStateAggregationConfig
4- updateCustomizedStateAggregationConfig
Tests have been added to check the functionality of these REST APIs.
Also, some of the deprecated calls have been updated.
Ali Reza Zamani Zadeh Najari [Fri, 21 Feb 2020 19:40:28 +0000 (11:40 -0800)]
Add java API to add or remove CustomizedStateAggregationConfig (#792)
In this commit the below APIs have been added.
1- addCustomizedStateAggregationConfig
2- removeCustomizedStateAggregationConfig
3- addTypeToCustomizedStateAggregationConfig
4- removeTypeFromCustomizedStateAggregationConfig
Tests have been added to check the functionality of these APIs.
zhangmeng916 [Thu, 20 Feb 2020 19:53:44 +0000 (11:53 -0800)]
add CustomizedStateAggregation config (#776)
Add CustomizedViewAggregation as a cluster-level config. This config defines the types of customized states that the Helix controller aggregates to generate a customized view. Helix customers who would like an aggregated view generated for their own states need to add the type of the state to the list in this config.
Ali Reza Zamani Zadeh Najari [Wed, 5 Feb 2020 21:19:22 +0000 (13:19 -0800)]
Add the CustomizedView Helix property (#723)
This commit contains the bare minimum properties for CustomizedView Helix property.
Huizhi Lu [Thu, 23 Apr 2020 01:59:38 +0000 (18:59 -0700)]
Fix routing data refreshing in MSDS (#955)
testSetNamespaceRoutingData is flaky because the namespace is removed when routing data is refreshed. This is caused by a race condition between the read request issued after the update and the data change callback that refreshes routing data. In addition, the routing data cache should not be refreshed if writing routing data to ZK fails.
Neal Sun [Tue, 21 Apr 2020 19:49:56 +0000 (12:49 -0700)]
Fix Regression on Flaky Tests in TestResourceAccessor (#959)
The previous fix to TestResourceAccessor (stopping all mock controllers) might have affected other tests. A better approach, instead of stopping all controllers, is to pause the clusters instead.
Jiajun Wang [Tue, 14 Apr 2020 23:27:36 +0000 (16:27 -0700)]
Fix unexpected partition movements in the CrushEd strategy. (#941)
This is a workaround fix to ensure backward compatibility. An additional cache map is used to keep the partition list stable so as to remove the randomness in the algorithm input.
Note that the cleaner fix would be to sort the list inside the strategy class. However, that would also change all existing cluster assignments in production.
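The workaround can be sketched as a per-resource cache that reuses the previously seen partition order as long as the partition set is unchanged, so that iteration-order differences in the input no longer reach the strategy. This illustrates the idea under that assumption and is not Helix's code; `StablePartitionListCache` and `stableList` are hypothetical names. It deliberately does not sort, for the backward-compatibility reason given in the commit message.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical cache that pins the partition order fed to a placement strategy. */
final class StablePartitionListCache {
  private final Map<String, List<String>> listByResource = new HashMap<>();

  /**
   * If the resource's partition set is unchanged, returns the previously cached order,
   * whatever order the caller's collection happens to iterate in. Otherwise remembers
   * and returns the new order.
   */
  List<String> stableList(String resource, Collection<String> partitions) {
    List<String> cached = listByResource.get(resource);
    if (cached != null && new HashSet<>(cached).equals(new HashSet<>(partitions))) {
      return cached;
    }
    List<String> fresh = Collections.unmodifiableList(new ArrayList<>(partitions));
    listByResource.put(resource, fresh);
    return fresh;
  }
}
```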
Jiajun Wang [Tue, 14 Apr 2020 21:41:21 +0000 (14:41 -0700)]
Fix unstable test TestRebalancePipeline.testMsgTriggeredRebalance() (#953)
Remove all the hardcoded thread sleeps in the test case and replace them with verifiers.
kaisun2000 [Tue, 14 Apr 2020 21:18:48 +0000 (14:18 -0700)]
Fix ZkHelixPropertyStore loses Zookeeper notification issue (#924)
ZkHelixPropertyStore loses ZK notifications after the session expires.
The issue was caused by a bug in the shared ZkClient code path. More
specifically, the shared ZkClient did not call fireAllEvent when the ZK
session expired. As a result, ZkHelixPropertyStore did not reinstall
watches for the corresponding ZK paths and lost ZooKeeper notifications
when changes happened.
Co-authored-by: Kai Sun <ksun@ksun-mn1.linkedin.biz>
Huizhi Lu [Tue, 14 Apr 2020 00:56:41 +0000 (17:56 -0700)]
Simplify logging
Huizhi Lu [Mon, 13 Apr 2020 20:25:20 +0000 (13:25 -0700)]
Fix TestCrushAutoRebalanceNonRack failure of dropping instance
Huizhi Lu [Tue, 14 Apr 2020 00:10:21 +0000 (17:10 -0700)]
Add integration test for Helix Java APIs using different MSDS endpoints (#948)
To make sure Helix Java API is connecting to the correct MSDS endpoint, we add an integration test. This test verifies that each API only connects to the configured MSDS endpoint but not other endpoints.
Hunter Lee [Fri, 10 Apr 2020 22:40:03 +0000 (15:40 -0700)]
Change thread pool used for TTL-based GC in ZkBucketDataAccessor (#945)
ZkBucketDataAccessor supports TTL-based garbage collection of stale versions. We create a thread pool of size 1 for this, but the way the thread pool is currently created may not guarantee strictly sequential execution of GC timer tasks, which may cause unexpected GC behavior. This commit uses Executors.newSingleThreadScheduledExecutor() to create the GC thread pool with a sequential execution guarantee.
Note: another way to do this is to implement a bounded event queue-based GC, but Java already provides this queue-based model in the Executors class, and we want to avoid reinventing the wheel.
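The JDK documents that tasks submitted to an executor from `Executors.newSingleThreadScheduledExecutor()` are guaranteed to execute sequentially, with no more than one task active at any given time. A minimal sketch of TTL-based GC on such an executor follows; `TtlGcSketch` and the ZNode paths in the test are hypothetical, and "collecting" a version here only records its path.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical TTL-based GC scheduler; "collecting" just records the path here. */
final class TtlGcSketch {
  // Single worker thread: GC tasks run one at a time, never concurrently.
  private final ScheduledExecutorService gcExecutor = Executors.newSingleThreadScheduledExecutor();
  private final List<String> collected = new CopyOnWriteArrayList<>();

  /** Schedules a stale version for deletion once its TTL has elapsed. */
  void scheduleGc(String staleVersionPath, long ttlMs) {
    // A real implementation would delete the stale version's ZNodes here.
    gcExecutor.schedule(() -> { collected.add(staleVersionPath); }, ttlMs, TimeUnit.MILLISECONDS);
  }

  /** Lets already-scheduled GC tasks finish, then stops the worker thread. */
  boolean shutdownAndWait(long timeoutMs) {
    gcExecutor.shutdown(); // delayed tasks already queued still run under the default policy
    try {
      return gcExecutor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  List<String> collected() {
    return collected;
  }
}
```

Because `ScheduledThreadPoolExecutor` by default still runs already-delayed tasks after `shutdown()`, `shutdownAndWait` lets pending GC work drain before the thread stops.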
Neal Sun [Thu, 9 Apr 2020 00:34:35 +0000 (17:34 -0700)]
Fix flaky resource accessor tests (#935)
This PR fixes the flaky tests in TestResourceAccessor, namely testPartitionHealth and testResourceHealth. The tests have been failing because the external views created during the tests could sometimes be removed before the health check, causing the health checks to fail. To reproduce: adding a time delay before the health check calls always fails the test cases.
We believe that the reason behind external view removal is due to the controllers, and disabling the controllers on the test clusters has made the tests pass even with the time delay.
Neal Sun [Wed, 8 Apr 2020 01:03:50 +0000 (18:03 -0700)]
Fix MetadataStoreDirectory routing data cache refresh bug (#933)
This PR ensures that the routing data cache is cleared before updating in ZkMetadataStoreDirectory. This PR also fixes an edge case during TrieRoutingData creation when no ZK realm has any sharding key.
Hunter Lee [Mon, 30 Mar 2020 17:31:14 +0000 (10:31 -0700)]
Fix getClusters() in ZKHelixAdmin for multi-zk mode (#916)
This PR fixes the logic in getClusters() so that it works in a multi-zk environment. In multi-zk mode, the API queries MSDS for raw routing data and produces a list of all clusters in the namespace. The behavior in single-zk mode remains the same for backward compatibility.
Hunter Lee [Fri, 27 Mar 2020 00:54:58 +0000 (17:54 -0700)]
Make multiZkEnabled configurable in HelixRestNamespace (#915)
It was observed that we need more fine-grained control over the multiZkEnabled config because there could exist namespaces with differing modes. Because multiple namespaces may be co-deployed, we cannot simply make it a system config.
Hunter Lee [Thu, 26 Mar 2020 05:20:10 +0000 (22:20 -0700)]
Make Helix REST realm-aware (#908)
Helix REST needs to start using a realm-aware ZkClient on multi-zk mode. Also it needs to become a listener on routing data because we don't want to restart the HelixRestServer every time we update the routing data.
Changelist:
Make ServerContext listen on routing data paths if run on multi-zk mode
Make HelixRestServer use RealmAwareZkClient (FederatedZkClient) on multi-zk mode
Hunter Lee [Sat, 21 Mar 2020 04:28:38 +0000 (21:28 -0700)]
Use Java Generics and inheritance to reduce duplicate code in Helix API Builders (#899)
This PR removes duplicate logic and refactors the ZK helix API Builder logic into one single public abstract class so that other Builders can inherit from it. It makes use of Builder inheritance and Java Generics. This PR promotes code reuse and better craftsmanship.
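The Builder-inheritance idea can be sketched with a self-referential generic type parameter: shared setters return `B`, so a chain that starts with an inherited setter can still end in the concrete builder's own methods, and validation is written once in the base class. All names here (`AbstractClientBuilder`, `AdminSketch`, the fields and the default timeout) are hypothetical, not Helix's actual Builder classes.

```java
/** Shared builder logic; B is the concrete builder type so chained setters keep it. */
abstract class AbstractClientBuilder<B extends AbstractClientBuilder<B>> {
  protected String zkAddress;               // hypothetical common setting
  protected int sessionTimeoutMs = 30_000;  // hypothetical default

  /** Each subclass returns itself, typed as its own class. */
  protected abstract B self();

  public B setZkAddress(String zkAddress) {
    this.zkAddress = zkAddress;
    return self();
  }

  public B setSessionTimeoutMs(int sessionTimeoutMs) {
    this.sessionTimeoutMs = sessionTimeoutMs;
    return self();
  }

  /** Validation written once and inherited by every concrete builder. */
  protected void validate() {
    if (sessionTimeoutMs <= 0) {
      throw new IllegalArgumentException("sessionTimeoutMs must be positive");
    }
  }
}

/** Hypothetical API class built by a concrete subclass of the shared builder. */
final class AdminSketch {
  final String zkAddress;
  final int sessionTimeoutMs;
  final boolean verbose;

  private AdminSketch(Builder builder) {
    this.zkAddress = builder.zkAddress;
    this.sessionTimeoutMs = builder.sessionTimeoutMs;
    this.verbose = builder.verbose;
  }

  static final class Builder extends AbstractClientBuilder<Builder> {
    private boolean verbose;

    @Override
    protected Builder self() {
      return this;
    }

    Builder setVerbose(boolean verbose) {
      this.verbose = verbose;
      return this;
    }

    AdminSketch build() {
      validate();
      return new AdminSketch(this);
    }
  }
}
```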
Neal Sun [Wed, 18 Mar 2020 17:01:48 +0000 (10:01 -0700)]
Fix setRoutingData boolean handling; fix leader forwarding url construction (#902)
This PR makes SetRoutingData respect the return value of the underlying function; it also fixes request forwarding urls, allowing ports and endpoint prefixes to be added.
Hunter Lee [Tue, 17 Mar 2020 17:35:43 +0000 (10:35 -0700)]
Add integration tests for Helix Java APIs (#892)
This commit adds a comprehensive integration test for Helix Java APIs. All Helix Java APIs are tested using regular resource rebalancing and task framework.
Hunter Lee [Sat, 14 Mar 2020 00:33:11 +0000 (17:33 -0700)]
Make ZkUtil realm-aware (#896)
There were some places in the code that were missed in previous PRs. This makes ZkUtil realm-aware by replacing HelixZkClient with RealmAwareZkClient.
Hunter Lee [Sat, 14 Mar 2020 00:30:32 +0000 (17:30 -0700)]
Make ZkBucketDataAccessor realm-aware (#894)
Because Helix Controller now uses the WAGED rebalancer as the default rebalancer, it tries to create an instance of ZkBucketDataAccessor. This will fail unless ZkBucketDataAccessor is also realm-aware. We can simply use a FederatedZkClient here since BucketDataAccessor does not support ephemeral operations.
Hunter Lee [Sat, 14 Mar 2020 00:28:01 +0000 (17:28 -0700)]
Update listClusters() in ZkHelixAdmin (#895)
This PR updates getClusters() in ZkHelixAdmin for realm-aware mode. It is difficult to reason about getting all clusters, and in single-realm mode this method is broken anyway: if users create nested clusters, it ends up returning nothing. This update is backward-compatible for single-realm users.
Ideally, users should use Metadata Store Directory Service to get the list of all clusters.
Hunter Lee [Sat, 14 Mar 2020 00:19:34 +0000 (17:19 -0700)]
Reformat ZkBaseDataAccessor (#893)
Changelist:
1. Add generic type markers to Builder (<T>)
2. Fix a bug in validate function
3. Default to ZNRecordSerializer to preserve existing behavior
Huizhi Lu [Thu, 12 Mar 2020 16:44:38 +0000 (09:44 -0700)]
Make ZKHelixAdmin and ZKHelixManager Realm-aware (#846)
To make Helix Java APIs realm-aware, we need to make both ZKHelixAdmin and ZKHelixManager realm-aware. This commit adds a Builder to set client config and connection config for building realm-aware ZkClients underneath.
Huizhi Lu [Thu, 12 Mar 2020 06:38:52 +0000 (23:38 -0700)]
Make ZkBaseDataAccessor realm-aware (#855)
This commit makes ZkBaseDataAccessor realm-aware by building the corresponding realm-aware ZkClients in the constructor. A Builder is provided to set the realm-aware client config and connection config.
Hunter Lee [Thu, 12 Mar 2020 03:06:18 +0000 (20:06 -0700)]
Make ZkHelixClusterVerifier and its child classes realm-aware (#867)
Changelist:
Make sure constructors accept RealmAwareZkClient
Add Builders in each child class of ZkHelixClusterVerifier so that ZkClient configs are configurable and realm-aware ZkClient APIs are used
Hunter Lee [Thu, 12 Mar 2020 02:44:26 +0000 (19:44 -0700)]
Make ZkCacheBaseDataAccessor and ZkHelixPropertyStore realm-aware (#863)
This commit makes both ZkCacheBaseDataAccessor and ZkHelixPropertyStore realm-aware by choosing the appropriate realm-aware ZkClients in the constructor. Also, we add a Builder here to give users options to set Connection config and Client config.
Note that ZkHelixPropertyStore extends CacheBaseDataAccessor so there is no change needed.
Hunter Lee [Thu, 12 Mar 2020 02:35:42 +0000 (19:35 -0700)]
Make ClusterSetup realm-aware (#861)
We make ClusterSetup, a Helix Java API, realm-aware so that this could be used in a multi-ZK environment.
Changelist:
Add a Builder to enable users to set internal ZkClient parameters
Add the realm-aware behavior in existing constructors
Update ConfigAccessor to reflect the change in the logic
Hunter Lee [Fri, 6 Mar 2020 02:55:27 +0000 (18:55 -0800)]
Add rerunFailingTestsCount config to surefire-plugin (#865)
It was observed that if the build fails (i.e., there is a test failure), not all of the test goals are executed. There are currently two goals: default-test (single ZK) and multi-zk. If default-test has any test failures, multi-zk never gets executed, which is a problem because the test suite then never runs in a multi-zk setup.
This config change is a workaround: failing tests are retried up to 3 times. Tests that still fail are considered flaky and have to be fixed moving forward.
This PR also addresses minor comments and fixes a test.
Hunter Lee [Wed, 4 Mar 2020 20:24:34 +0000 (12:24 -0800)]
Instrument ConfigAccessor's constructors (#856)
This diff instruments ConfigAccessor's constructors to make it realm-aware. If ConfigAccessor is unable to start on multi-realm mode, then it falls back to starting on single-realm mode.
Hunter Lee [Wed, 4 Mar 2020 08:01:40 +0000 (00:01 -0800)]
Make RealmAwareZkClient implementations use HttpRoutingDataReader for routing data (#819)
We want all implementations of RealmAwareZkClient to do a one-time query to Metadata Store Directory Service for routing data and cache it in memory. In order to accomplish that, we have introduced HttpRoutingDataReader, which is a Singleton class that makes a REST call to read routing data and caches it in memory. This diff updates the initialization logic in RealmAwareZkClients accordingly.
Changelist:
1. Update all RealmAwareZkClient initialization logic
2. Fix tests
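The one-time query plus in-memory cache described for HttpRoutingDataReader, combined with the configurable MSDS endpoint added in #836, can be sketched as a singleton keyed by endpoint. The HTTP call is replaced by an injected fetcher; `RoutingDataCache`, `getRawRoutingData` and the endpoint strings in the test are hypothetical, not the real class's API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical singleton cache of raw routing data (realm -> sharding keys) per MSDS endpoint. */
final class RoutingDataCache {
  private static final RoutingDataCache INSTANCE = new RoutingDataCache();

  private final Map<String, Map<String, List<String>>> routingDataByEndpoint =
      new ConcurrentHashMap<>();

  private RoutingDataCache() {}

  static RoutingDataCache getInstance() {
    return INSTANCE;
  }

  /**
   * Returns cached routing data for the endpoint, calling the fetcher (an HTTP GET in
   * a real client) only the first time a given endpoint is requested.
   */
  Map<String, List<String>> getRawRoutingData(
      String msdsEndpoint, Function<String, Map<String, List<String>>> fetcher) {
    return routingDataByEndpoint.computeIfAbsent(msdsEndpoint, fetcher);
  }
}
```

`ConcurrentHashMap.computeIfAbsent` applies the fetcher at most once per key even under concurrent callers, which is what makes the query one-time per endpoint in this sketch.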
Neal Sun [Wed, 4 Mar 2020 01:37:15 +0000 (17:37 -0800)]
Implement setRoutingData for MetadataStoreDirectoryService (#844)
Implement setRoutingData endpoint. Modify TrieRoutingData construction in MetadataStoreDirectory. Fix race conditions among writing operations in MetadataStoreDirectory.
Hunter Lee [Tue, 3 Mar 2020 04:48:10 +0000 (20:48 -0800)]
Make ConfigAccessor and ZkUtil realm-aware (#838)
To make Helix Java APIs realm-aware, we first make ConfigAccessor and ZkUtil realm-aware by instrumenting these APIs with a Builder and RealmAwareZkClients.
The Builder pattern is chosen because it is a scalable option when there are a lot of configurable parameters. It makes it easy to validate the given parameters as well.
Huizhi Lu [Sat, 29 Feb 2020 17:45:23 +0000 (09:45 -0800)]
Add FederatedZkClient (#789)
As part of ZkClient API enhancement, we wish to add FederatedZkClient, which is a wrapper of the raw ZkClient, that provides realm-aware access to ZooKeeper.
FederatedZkClient will internally maintain multiple ZooKeeper sessions connecting to different ZooKeeper realms on an as-needed basis and route requests to the appropriate ZooKeeper based on the ZK path sharding key. Ephemeral node creation is not supported.
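Routing by ZK path sharding key can be sketched as a prefix lookup over path segments, plus a per-realm session that is opened on first use. This is a simplified stand-in for FederatedZkClient: it uses a hash-map prefix lookup in place of a trie, assumes that no sharding key is a prefix of another, and all names are hypothetical; the "sessions" are placeholder strings rather than ZooKeeper connections.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lookup from a ZK path to the realm that owns its sharding key. */
final class ShardingKeyRouter {
  private final Map<String, String> realmByShardingKey = new HashMap<>();

  /** Assumes valid routing data: every sharding key belongs to exactly one realm. */
  ShardingKeyRouter(Map<String, List<String>> shardingKeysByRealm) {
    shardingKeysByRealm.forEach(
        (realm, keys) -> keys.forEach(key -> realmByShardingKey.put(key, realm)));
  }

  /** Tries "/a", "/a/b", "/a/b/c", ... and returns the realm of the first prefix that is a key. */
  String realmFor(String path) {
    if (!path.startsWith("/")) {
      throw new IllegalArgumentException("Not an absolute ZK path: " + path);
    }
    int slash = 0;
    while (slash >= 0) {
      int next = path.indexOf('/', slash + 1);
      String prefix = next < 0 ? path : path.substring(0, next);
      String realm = realmByShardingKey.get(prefix);
      if (realm != null) {
        return realm;
      }
      slash = next;
    }
    throw new NoSuchElementException("No sharding key matches " + path);
  }
}

/** Hypothetical federated client: one lazily opened "session" per realm, chosen by path. */
final class FederatedRouterSketch {
  private final ShardingKeyRouter router;
  private final Map<String, String> sessionByRealm = new ConcurrentHashMap<>();

  FederatedRouterSketch(ShardingKeyRouter router) {
    this.router = router;
  }

  /** Returns the session for the path's realm, opening it on first use (placeholder string). */
  String sessionFor(String path) {
    return sessionByRealm.computeIfAbsent(router.realmFor(path), realm -> "session-to-" + realm);
  }

  int openSessions() {
    return sessionByRealm.size();
  }

  void createEphemeral(String path) {
    throw new UnsupportedOperationException("Ephemeral nodes are not supported by a federated client");
  }
}
```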
Hunter Lee [Sat, 29 Feb 2020 02:47:41 +0000 (18:47 -0800)]
Make MSDS endpoint configurable for HttpRoutingDataReader (#836)
We need to add a few more constructors that allow users to configure which MSDS to talk to. Applications may wish to create RealmAwareZkClients connecting to different regions or namespaces.
Neal Sun [Fri, 28 Feb 2020 00:41:39 +0000 (16:41 -0800)]
Improve MetadataStoreDirectoryAccessor endpoints and fix bugs in ZkRoutingDataReader/Writer
Improve the current status code design of MetadataStoreDirectoryAccessor endpoints by translating underlying exceptions to status codes more clearly. Fix 4 bugs in Reader/Writer.
kaisun2000 [Thu, 27 Feb 2020 22:55:41 +0000 (14:55 -0800)]
Add SharedZkClient/InnerSharedZkClient implementation (#796)
Refactor the original SharedZkClient into InnerSharedZkClient. Add a
SharedZkClient implementation that uses the composition pattern: it
checks the validity of the ZkPath and delegates the implementation to
InnerSharedZkClient. In sum, InnerSharedZkClient is a shared ZkClient
that is not realm-aware, while SharedZkClient is a truly realm-aware
ZkClient.
Hunter Lee [Thu, 27 Feb 2020 06:45:48 +0000 (22:45 -0800)]
Update bump-up.command and ivy imports (#824)
We update the bump-up.command script here so that it includes newly added modules.
The .ivy change was needed so that wrapper repositories could pick up the right open-source libraries.
Neal Sun [Thu, 27 Feb 2020 01:24:22 +0000 (17:24 -0800)]
Fix InvalidRoutingData error message in tests (#821)
This PR aims to fix the InvalidRoutingDataException error message that shows up in tests. The source of this error was due to two test cases in TestZkMetadataStoreDirectory, in which both of the test cases inserted "/a/b/c/d/e" directly into the same set of routing data. This renders the routing data invalid because one sharding key is pointing to two realms.
This was not caught at insertion because the data was inserted directly rather than through a dedicated API (the code in question is a test case). The tests didn't fail because they relied on the raw routing data for correctness. The construction of MetadataStoreRoutingData failed "correctly", as the error messages show; however, the construction failure occurred after the raw routing data had been fetched, so the problem went undetected.
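The violated invariant, one realm per sharding key, can be checked when raw routing data is first turned into a lookup structure, so the conflict surfaces at that point rather than going unnoticed. A minimal sketch: Helix reports this case with InvalidRoutingDataException, for which `IllegalArgumentException` stands in here to keep the sketch dependency-free, and `RoutingDataValidator` and `buildShardingKeyToRealm` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RoutingDataValidator {
  private RoutingDataValidator() {}

  /** Inverts realm -> sharding keys, rejecting a key claimed by two different realms. */
  static Map<String, String> buildShardingKeyToRealm(Map<String, List<String>> rawRoutingData) {
    Map<String, String> realmByKey = new HashMap<>();
    for (Map.Entry<String, List<String>> entry : rawRoutingData.entrySet()) {
      String realm = entry.getKey();
      for (String key : entry.getValue()) {
        String previous = realmByKey.putIfAbsent(key, realm);
        if (previous != null && !previous.equals(realm)) {
          throw new IllegalArgumentException(
              "Sharding key " + key + " maps to both " + previous + " and " + realm);
        }
      }
    }
    return realmByKey;
  }
}
```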
Neal Sun [Thu, 27 Feb 2020 01:21:44 +0000 (17:21 -0800)]
Implement request forwarding for ZkRoutingDataWriter (#788)
This PR added the request forwarding feature to ZkRoutingDataWriter. It also included a lot of other work, most notably: changing ZkMetadataStoreDirectory to a singleton in order to allow leader forwarding, adding integration tests for the request forwarding flow, modifying MetadataStoreDirectoryAccessor to respect underlying return values, and fixing numerous behavior bugs.
Neal Sun [Wed, 26 Feb 2020 22:07:39 +0000 (14:07 -0800)]
Add getShardingKeyInPath to MetadataStoreRoutingData (#817)
Add getShardingKeyInPath to MetadataStoreRoutingData
Hunter Lee [Tue, 25 Feb 2020 22:45:28 +0000 (14:45 -0800)]
Add HttpRoutingDataReader (#775)
HttpRoutingDataReader is a component used by new ZkClient APIs to make an HTTP read request to the metadata store directory service to retrieve routing data. ZkClient APIs will construct an internal MetadataStoreRoutingData instance based on the raw routing data retrieved from MSDS.
Note: this change contains modifications to MockMSDS because the actual endpoint names changed. The methods and http server contexts have been updated.
Huizhi Lu [Sun, 23 Feb 2020 23:19:37 +0000 (15:19 -0800)]
[helix-rest] Add endpoint to get namespace routing data (#799)
RealmAwareZkClient construction needs a REST endpoint to get routing data in a namespace. This endpoint will help with reducing REST calls down to one single call to read all raw routing data in ZK.
This commit adds Java API getNamespaceRoutingData() and an endpoint "GET /routing-data".
Huizhi Lu [Sun, 23 Feb 2020 02:12:53 +0000 (18:12 -0800)]
Add REST read endpoints to helix-rest for metadata store directory (#761)
We need a RESTful metadata store directory service to help scale out ZooKeeper.
This commit adds REST read endpoints to helix-rest to get sharding keys, realms and namespaces.
Hunter Lee [Fri, 21 Feb 2020 00:17:03 +0000 (16:17 -0800)]
Add DedicatedZkClient and update DedicatedZkClientFactory (#765)
As part of ZkClient API enhancement, we wish to add DedicatedZkClient, which is a wrapper of the raw ZkClient, that provides realm-aware access to ZooKeeper.
It is realm-aware in that it only performs requests whose ZK path sharding key belongs to the ZK realm it is connected to.
Also, we need to modify DedicatedZkClientFactory so that users could use this factory to generate instances of DedicatedZkClient.
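A realm check in front of a raw client, in the composition style described for SharedZkClient in #796, can be sketched as a wrapper that validates the path's sharding key before delegating. `RawClient` is a hypothetical one-method stand-in for ZkClient, and `RealmCheckedClient` is not Helix's DedicatedZkClient; the sketch only shows the check-then-delegate shape.

```java
import java.util.Set;

/** Hypothetical one-operation stand-in for the raw, realm-unaware ZkClient. */
interface RawClient {
  void createPersistent(String path);
}

/** Composition wrapper: validates that a path belongs to this client's realm, then delegates. */
final class RealmCheckedClient implements RawClient {
  private final Set<String> realmShardingKeys; // sharding keys owned by the connected realm
  private final RawClient delegate;

  RealmCheckedClient(Set<String> realmShardingKeys, RawClient delegate) {
    this.realmShardingKeys = Set.copyOf(realmShardingKeys);
    this.delegate = delegate;
  }

  @Override
  public void createPersistent(String path) {
    checkPathInRealm(path);
    delegate.createPersistent(path);
  }

  private void checkPathInRealm(String path) {
    for (String key : realmShardingKeys) {
      // Match whole path segments so "/cluster1" does not claim "/cluster10".
      if (path.equals(key) || path.startsWith(key + "/")) {
        return;
      }
    }
    throw new IllegalArgumentException("Path " + path + " is not in this client's realm");
  }
}
```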