Adam Kocoloski [Wed, 30 Oct 2013 15:05:41 +0000 (11:05 -0400)]
Fix two bugs in the global balancing phase
* Nodes with 0 shards were being ignored because of a wrong datatype.
* The limit was being ignored because max was used instead of min.
BugzID: 24680
Adam Kocoloski [Wed, 30 Oct 2013 14:00:08 +0000 (10:00 -0400)]
Allow skip straight to global phase
When rebalancing a DB-per-user cluster with small Q values it's typical
that
a) the local phase takes a loooong time, and
b) the local phase doesn't suggest any moves
While the local phase should still run at least once, we'll expose a
flag to skip straight to the global phase since we'll need to run the
plan generator many, many times and we can't afford to wait.
BugzID: 24680
Adam Kocoloski [Wed, 23 Oct 2013 14:08:51 +0000 (10:08 -0400)]
Refuse to place shards on decom:true nodes
BugzID: 24420
Adam Kocoloski [Wed, 23 Oct 2013 14:02:33 +0000 (10:02 -0400)]
Rely on decom:true attribute to filter decom nodes
BugzID: 24420
Adam Kocoloski [Tue, 22 Oct 2013 19:29:00 +0000 (15:29 -0400)]
Ensure that the owner of a doc is also a host
BugzID: 24395
Adam Kocoloski [Wed, 25 Sep 2013 18:25:44 +0000 (14:25 -0400)]
Rewrite rebalancing plan generator
This patch splits the functionality of the module out into three
classes of work:
* Fixing zoning and replica level violations
* Contracting a cluster
* Rebalancing shards across a cluster
The implementations of the first two features are pretty similar - find
the shards that need to be moved, then choose an optimal home for each
of them. By default the contraction code will remove shards from nodes
in the "decom" zone, and the rebalancing code will ignore that zone
entirely. An optimal home, sketched after this entry, is a node that
a) is in the correct zone, and
b) has the fewest # of shards for the DB among nodes in the zone, and
c) has the fewest total # of shards among nodes satisfying a) and b)
The implementation of rebalancing is a bit more complicated. The
rebalancing algorithm looks roughly like this
For DB in all_dbs:
    Ensure all nodes have at least (N*Q) div length(Nodes) shards
    Ensure no node has more than (N*Q) div length(Nodes) + 1 shards
For node in nodes:
    If node has more than TotalShards div length(Nodes) + 1 shards:
        Donate shard to another node
The net result is that each database is balanced across the cluster and
the cluster as a whole is globally balanced.
The current version of the module prints out shard move and copy
operations in a clou-friendly format via io:format. It also returns a
list of {Op, #shard{}, node()} tuples representing the operations.
The rebalancer will stop after generating 1000 operations by default.
The limit can be customized by using the 1-arity versions of expand,
contract and fix_zoning, but note that the performance of the rebalancer
degrades as the number of pending operations increases.
BugzID: 23690
BugzID: 20770
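A minimal sketch of the optimal-home selection described above, in Erlang; the module name, function name, and tuple shape are illustrative assumptions, not the shipped mem3 API:

    %% Candidates is assumed to be a list of
    %% {Node, Zone, DbShardCount, TotalShardCount} tuples.
    %% Criteria: a) correct zone, b) fewest shards of this DB,
    %% c) fewest shards overall as the tie-breaker.
    -module(optimal_home_sketch).
    -export([choose/2]).

    choose(TargetZone, Candidates) ->
        InZone = [C || {_Node, Zone, _DbCount, _Total} = C <- Candidates,
                       Zone =:= TargetZone],
        Sorted = lists:sort(fun({_, _, DbA, TotA}, {_, _, DbB, TotB}) ->
                                {DbA, TotA} =< {DbB, TotB}
                            end, InZone),
        case Sorted of
            [{Node, _, _, _} | _] -> {ok, Node};
            [] -> {error, no_candidate_in_zone}
        end.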
Paul J. Davis [Thu, 5 Sep 2013 19:08:48 +0000 (14:08 -0500)]
Fix latent single-shard range hack
We had an off-by-one error when faking #shard{} records for node-local
databases. This fixes the issue. The bug was noticeable when attempting
to pass these shards to `fabric_view:is_progress_possible/1`.
BugzId: 22809
Adam Kocoloski [Mon, 19 Aug 2013 13:43:08 +0000 (09:43 -0400)]
Use a consistent commenting syntax
Adam Kocoloski [Mon, 19 Aug 2013 13:40:38 +0000 (09:40 -0400)]
Address comments from PR
Adam Kocoloski [Fri, 16 Aug 2013 17:40:50 +0000 (13:40 -0400)]
Ensure all shards are moved off non-target nodes
BugzID: 20742
Robert Newson [Wed, 31 Jul 2013 10:14:30 +0000 (11:14 +0100)]
Stabilize mem3_util:owner/2
BugzID: 21413
Robert Newson [Wed, 31 Jul 2013 10:14:15 +0000 (11:14 +0100)]
Move rotate_list to mem3_util
Adam Kocoloski [Tue, 2 Jul 2013 18:30:09 +0000 (14:30 -0400)]
Support balancing across a subset of nodes
mem3_balance implicitly assumed the set of nodes over which the DB is
hosted is expanding. We need to make a couple of small changes in the
case of cluster contraction.
BugzID: 20742
Robert Newson [Thu, 27 Jun 2013 18:19:24 +0000 (19:19 +0100)]
Fix load_shards_from_disk/2
load_shards_from_disk/2 did not expect #ordered_shard records to be returned
from load_shards_from_disk/1. Since it uses a list comprehension, the
mistake is silently squashed, resulting in an empty list.
In production this manifests as the occasional failure, where 'n' is
calculated as 0, causing quorum reads to fail. The very next call
succeeds as it reads the cached versions and correctly downcasts.
BugzID: 20629
Robert Newson [Tue, 25 Jun 2013 11:37:33 +0000 (12:37 +0100)]
Preserve key and incorporate range into rotation key
Robert Newson [Tue, 25 Jun 2013 11:36:59 +0000 (12:36 +0100)]
we're not rotating by DbName any more
Robert Newson [Tue, 25 Jun 2013 11:20:59 +0000 (12:20 +0100)]
refactor choose_ushards
Adam Kocoloski [Fri, 21 Jun 2013 04:14:42 +0000 (00:14 -0400)]
Zero out shard caches on upgrade
The mix of #shard and #ordered_shard records breaks ushards. Different
nodes can start returning different results.
Robert Newson [Wed, 3 Apr 2013 19:13:35 +0000 (20:13 +0100)]
Add function to assist with rebalancing
This function takes either a database name or a list of shards and a
list of target nodes to balance the shards across. Every node with
fewer than its fair share of shards will steal shards from the node with
the most shards, as long as both nodes are in the same zone.
BugzID: 18638
Adam Kocoloski [Fri, 24 May 2013 19:03:54 +0000 (15:03 -0400)]
Use a private record for event listener state
Adam Kocoloski [Fri, 24 May 2013 19:00:32 +0000 (15:00 -0400)]
Fix trivial typo
Adam Kocoloski [Thu, 23 May 2013 14:32:58 +0000 (10:32 -0400)]
Balance replication ownership across nodes
The previous algorithm was biased towards low-numbered nodes, and in the
case of a 3 node cluster would declare db1 to be the owner of all
replications. We can do better just by leveraging the existing
ushards code.
There's a possibility to refactor this as a new ushards/2 function if
that's perceived as useful.
BugzID: 19870
Paul J. Davis [Tue, 23 Apr 2013 22:26:30 +0000 (17:26 -0500)]
Update to use the new couch_event application
Robert Newson [Tue, 23 Apr 2013 21:54:48 +0000 (22:54 +0100)]
Choose ushards according to persistent record
The order of nodes in the by_range section of "dbs" documents is now
promoted to the principal order for ushards. Ushards still accounts
for Liveness, selecting the first live replica, and still supports
Spread by rotating this list using the CRC32 of the database name
(since many databases will have the same layout).
If by_range and by_node are not symmetrical then by_node is used and
order is undefined to match existing behavior.
Paul J. Davis [Tue, 16 Apr 2013 22:12:38 +0000 (17:12 -0500)]
If two shards differ we need to sync
There's no security if two shards return different answers, but it gives
us enough of a signal to know that we need to trigger a full-on
synchronization.
BugzId: 18955
Russell Branca [Fri, 12 Apr 2013 20:48:14 +0000 (16:48 -0400)]
Moving shard maps _membership endpoint to _shards db handler
Russell Branca [Fri, 12 Apr 2013 19:06:58 +0000 (15:06 -0400)]
Add doc shard info endpoint
Russell Branca [Thu, 11 Apr 2013 18:18:12 +0000 (14:18 -0400)]
Fix _membership/$DBNAME api endpoint
This switches the JSON key to be a binary, as required by jiffy.
Also, remove the extraneous <<"parts">> path from the URL.
Show full shard range.
Paul J. Davis [Tue, 19 Mar 2013 03:57:46 +0000 (22:57 -0500)]
Update to use new multi rexi_server protocol
Russell Branca [Wed, 18 Jun 2014 22:04:35 +0000 (15:04 -0700)]
Handle the #doc_info case in changes_enumerator
This is to handle the special case where the user migrates a CouchDB
database to BigCouch and they have not yet compacted the
database. Once the database has been compacted, this #doc_info clause
should never be encountered.
Robert Newson [Tue, 3 Jun 2014 10:21:02 +0000 (11:21 +0100)]
Don't log when ensuring dbs exist
Robert Newson [Wed, 7 May 2014 13:48:25 +0000 (14:48 +0100)]
Add function to determine shard membership locally
mem3:belongs/2 allows you to determine if a given doc id belongs to a
given shard (whether a #shard{} record or just the filename of a
shard) without looking up the shard map or making any remote
calls.
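A hedged usage sketch of the call described above; the shard filename below is made up, and the exact argument and return shapes are assumptions based on this description:

    %% From a remsh on a cluster node.
    case mem3:belongs(<<"shards/00000000-3fffffff/mydb.1350000000">>,
                      <<"some-doc-id">>) of
        true  -> this_shard;      %% doc id hashes into this shard's range
        false -> another_shard    %% doc belongs to a different shard
    end.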
Robert Newson [Wed, 12 Feb 2014 23:23:47 +0000 (23:23 +0000)]
Change API to function per level
Robert Newson [Wed, 12 Feb 2014 20:11:56 +0000 (20:11 +0000)]
Switch to couch_log
Paul J. Davis [Tue, 11 Feb 2014 07:54:37 +0000 (01:54 -0600)]
Add license headers
Robert Newson [Mon, 23 Dec 2013 16:55:10 +0000 (16:55 +0000)]
Add ejson_body to all mem3 open_doc attempts that need it
Robert Newson [Thu, 19 Dec 2013 18:16:58 +0000 (18:16 +0000)]
Remove references to margaret
Robert Newson [Wed, 18 Dec 2013 14:04:59 +0000 (14:04 +0000)]
Build with rebar
Robert Newson [Thu, 13 Jun 2013 12:42:11 +0000 (13:42 +0100)]
Fix up copyright headers
Paul J. Davis [Tue, 5 Mar 2013 23:55:26 +0000 (17:55 -0600)]
New build system for mem3
Paul J. Davis [Wed, 20 Mar 2013 10:04:53 +0000 (05:04 -0500)]
Remove Cloudant build system remnants
Adam Kocoloski [Thu, 7 Mar 2013 22:03:38 +0000 (14:03 -0800)]
Merge pull request #43 from cloudant/guard-against-empty-list
Guard against empty list
Robert Newson [Thu, 7 Mar 2013 21:29:35 +0000 (15:29 -0600)]
Guard against empty list
Rotating an empty list gives a badarith error, so add a guard clause; the
result of rotating an empty list is well known.
BugzID: 17801
Adam Kocoloski [Thu, 7 Mar 2013 20:17:36 +0000 (12:17 -0800)]
Merge pull request #42 from cloudant/17801-spread-the-pain
BugzID: 17801
Robert Newson [Wed, 6 Mar 2013 14:29:13 +0000 (08:29 -0600)]
Spread ushards load to more nodes
In some cases, notably q=1 databases, the current ushards algorithm
will always choose the same replica (because of the lists:sort and
order-preserving orddict). This causes a severely skewed load profile
if you have lots of these cases.
This patch rotates each group of nodes using the crc32 of the database
name, spreading out the load pretty evenly.
The patch is a little obscure because ushards still has remnants of
previous work (breaking nodes into the local, same zone, different
zone, but then deliberately merging local and same zone back together
because that was a silly idea).
BugzID: 17801
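A minimal sketch of the rotation described above (including the empty-list guard from the "Guard against empty list" commit earlier in this log); rotate_list is named after the mem3_util function mentioned above, but the body is an assumption:

    -module(rotate_sketch).
    -export([rotate_list/2]).

    %% Rotate a list by the CRC32 of a key; an empty list rotates to
    %% itself (the guard avoids `rem 0`, which raises badarith).
    rotate_list(_Key, []) ->
        [];
    rotate_list(Key, List) ->
        {H, T} = lists:split(erlang:crc32(Key) rem length(List), List),
        T ++ H.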
Adam Kocoloski [Thu, 28 Feb 2013 21:05:47 +0000 (16:05 -0500)]
Ignore other config changes
Adam Kocoloski [Wed, 27 Feb 2013 19:32:11 +0000 (11:32 -0800)]
Merge pull request #40 from cloudant/13179-refactor-config-registration
Use config app instead of couch_config
Adam Kocoloski [Wed, 27 Feb 2013 19:31:23 +0000 (14:31 -0500)]
Updated the tests too
Adam Kocoloski [Wed, 27 Feb 2013 02:18:13 +0000 (21:18 -0500)]
Use config app instead of couch_config
BugzID: 13179
Adam Kocoloski [Thu, 21 Feb 2013 14:16:52 +0000 (06:16 -0800)]
Merge pull request #38 from cloudant/17185-reduce-log-spam
Replace cache miss log with metrics
BugzID: 17185
Adam Kocoloski [Thu, 21 Feb 2013 14:14:31 +0000 (06:14 -0800)]
Merge pull request #39 from cloudant/15754-mem3-sync-backlog
Add an API for mem3_sync queue lengths
BugzID: 15754
Paul J. Davis [Thu, 21 Feb 2013 07:08:21 +0000 (01:08 -0600)]
Add an API for mem3_sync queue lengths
Paul J. Davis [Sun, 10 Feb 2013 21:30:49 +0000 (15:30 -0600)]
Replace cache miss log with metrics
This also adds metrics for cache hits and evictions as well as misses.
Adam Kocoloski [Wed, 19 Dec 2012 16:07:35 +0000 (08:07 -0800)]
Merge pull request #36 from cloudant/13605-fix-shards-badmatch
Protect against cache_hits on non-existent entries
BugzID: 13605
Paul J. Davis [Tue, 11 Dec 2012 21:11:15 +0000 (15:11 -0600)]
Protect against cache_hits on non-existent entries
This can happen if we load a shard set from the cache and then eject the
shards before processing the cache_hit '$gen_cast' message. Instead of
trying to be fancy and reinserting the shards into the cache directly we
just rely on the fact that they'll be reinserted normally on the next
request.
Robert Newson [Thu, 6 Dec 2012 19:13:32 +0000 (11:13 -0800)]
Merge pull request #35 from cloudant/15924-dont-resurrect-on-delete
Don't resurrect shards on deletion
Robert Newson [Tue, 4 Dec 2012 19:42:05 +0000 (19:42 +0000)]
Don't resurrect shards on deletion
In bigcouch and bigcouchplus (at least) the deleted property is a
binary, <<"deleted">>, and not an atom, deleted, as it used to
be. This causes mem3_shards to recreate a shard immediately after it
is deleted, leading to profound silliness.
BugzID: 15924
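A hedged sketch of tolerating both representations of the flag when reading the shard-map doc properties; the property layout is an assumption, and proplists ships with OTP:

    -module(deleted_flag_sketch).
    -export([is_deleted/1]).

    %% Treat the doc as deleted if either form of the flag is set.
    is_deleted(Props) ->
        proplists:get_value(<<"deleted">>, Props,
            proplists:get_value(deleted, Props, false)) =:= true.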
Adam Kocoloski [Mon, 1 Oct 2012 20:10:01 +0000 (13:10 -0700)]
Merge pull request #33 from cloudant/11602-sync-security
BugzID: 11602
Adam Kocoloski [Mon, 1 Oct 2012 18:51:37 +0000 (11:51 -0700)]
Merge pull request #34 from cloudant/explicit_zone_placement
Explicit zone placement
BugzID: 14920
Robert Newson [Thu, 27 Sep 2012 22:47:49 +0000 (23:47 +0100)]
Placement is always specified as a string
Robert Newson [Wed, 26 Sep 2012 23:37:54 +0000 (00:37 +0100)]
Remove cruft
Robert Newson [Wed, 26 Sep 2012 23:38:09 +0000 (00:38 +0100)]
Explicit zone placement
Paul J. Davis [Tue, 25 Sep 2012 18:06:45 +0000 (13:06 -0500)]
Check security objects during internal replication
If we detect that two shards have different values for a security object
during internal replication we automatically trigger a security object
synchronization.
BugzId: 11602
Paul J. Davis [Tue, 25 Sep 2012 17:54:23 +0000 (12:54 -0500)]
Relax the mem3_sync_security fix constraint
Instead of requiring that we only have an empty security object and a
single non-empty version that outnumbers the empties, we relax the fixable
constraint to a simple majority of (N/2)+1 objects with a single value.
BugzId: 11602
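A minimal sketch of the relaxed "simple majority" test described above, assuming the N security objects have already been fetched; names are illustrative:

    -module(majority_sketch).
    -export([fixable/1]).

    %% SecObjs is the list of security objects read from the N copies.
    %% Fixable when one value occurs on at least (N div 2) + 1 copies.
    fixable(SecObjs) ->
        N = length(SecObjs),
        Counts = lists:foldl(fun(Obj, D) ->
                                 dict:update_counter(Obj, 1, D)
                             end, dict:new(), SecObjs),
        lists:any(fun({_Obj, C}) -> C >= (N div 2) + 1 end,
                  dict:to_list(Counts)).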
Paul J. Davis [Tue, 25 Sep 2012 22:05:27 +0000 (17:05 -0500)]
Wait for rexi_server before adding a node
The race condition between a nodeup event and rexi_server starting was
causing some superfluous errors. This just waits for rexi_server to boot
before notifying mem3_sync_nodes.
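A hedged sketch of the wait described above, using only standard calls; rexi_server is the registered name implied by the commit, and the polling interval is arbitrary:

    -module(wait_sketch).
    -export([wait_for_rexi/0]).

    %% Block until the rexi_server process is registered locally.
    wait_for_rexi() ->
        case whereis(rexi_server) of
            undefined ->
                timer:sleep(100),
                wait_for_rexi();
            Pid when is_pid(Pid) ->
                ok
        end.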
Adam Kocoloski [Fri, 7 Sep 2012 14:28:11 +0000 (07:28 -0700)]
Merge pull request #32 from cloudant/14654-mem3-sync-stuck-replications
Fix stuck internal replications after node down
BugzID: 14654
Paul J. Davis [Fri, 7 Sep 2012 06:57:03 +0000 (01:57 -0500)]
Fix stuck internal replications after node down
We weren't removing entries from the dict tracking what was in the job
queue. This looks like a bug introduced in the switch from tuples to the
#job{} record, which means it's probably been around for quite a while.
The fix is simply to use the correct dict key.
BugzId: 14654
Adam Kocoloski [Wed, 22 Aug 2012 13:29:18 +0000 (06:29 -0700)]
Merge pull request #31 from cloudant/14348-node-redirects
Configurable redirect of mem3 push jobs
BugzID: 14348
Robert Newson [Tue, 21 Aug 2012 16:31:08 +0000 (17:31 +0100)]
Configurable redirect of mem3 push jobs
In order to facilitate smoother node replacements, mem3 can be
configured to redirect push jobs intended for one node (the failed
one) to another (its replacement), e.g.:
    [mem3.redirects]
    dbcore@db1.foo.cloudant.com = dbcore@db2.foo.cloudant.com
BugzID: 14348
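A hedged sketch of resolving a redirect from the config section shown above; config:get/2 is the config app's string-based lookup, but the exact call site inside mem3 is an assumption:

    -module(redirect_sketch).
    -export([redirect/1]).

    %% Map a push-job target node to its replacement, if any.
    redirect(Node) ->
        case config:get("mem3.redirects", atom_to_list(Node)) of
            undefined -> Node;
            Target -> list_to_atom(Target)
        end.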
Adam Kocoloski [Wed, 6 Jun 2012 20:58:12 +0000 (16:58 -0400)]
Remove obsolete appup
Adam Kocoloski [Wed, 6 Jun 2012 19:51:58 +0000 (15:51 -0400)]
Merge 'origin/replicator', closes #16
Conflicts:
        src/mem3_rep_manager.erl
        src/mem3_sup.erl
        src/mem3_util.erl
Case 13884
Adam Kocoloski [Tue, 5 Jun 2012 20:19:16 +0000 (16:19 -0400)]
Add upgrade instructions
Adam Kocoloski [Tue, 5 Jun 2012 17:52:32 +0000 (13:52 -0400)]
Export live_shards/2
Adam Kocoloski [Mon, 4 Jun 2012 21:07:36 +0000 (17:07 -0400)]
Remove unused include_lib
Eventually we should insert the validation function automatically.
BugzID: 13780
Adam Kocoloski [Mon, 4 Jun 2012 21:06:46 +0000 (17:06 -0400)]
Avoid #changes_args.db_open_options for compatibility
We'll add it once we deploy the new version of the #changes_args record.
BugzID: 13780
Adam Kocoloski [Mon, 4 Jun 2012 15:11:36 +0000 (11:11 -0400)]
Look for 'deleted' and <<"deleted">> for compatibility
BugzID: 13780
Adam Kocoloski [Mon, 4 Jun 2012 14:58:40 +0000 (10:58 -0400)]
Fix error causing crash on add_node
Adam Kocoloski [Fri, 1 Jun 2012 16:03:55 +0000 (12:03 -0400)]
Merge branch '1.3.x'
Conflicts:
        src/mem3.erl
        src/mem3_cache.erl
        src/mem3_nodes.erl
        src/mem3_rep.erl
        src/mem3_sup.erl
        src/mem3_sync.erl
        src/mem3_sync_event.erl
BugzID: 13780
Adam Kocoloski [Fri, 25 May 2012 14:39:47 +0000 (07:39 -0700)]
Merge pull request #30 from cloudant/13606-node-info-ets
Publish node metadata in a protected ets table
BugzID: 13606
Adam Kocoloski [Thu, 24 May 2012 16:43:23 +0000 (12:43 -0400)]
Remove remaining references to #state.nodes
Adam Kocoloski [Tue, 22 May 2012 01:32:22 +0000 (21:32 -0400)]
Publish node metadata in a protected ets table
This allows for far cheaper access to the zone information. The
fallback to gen_server:calls is only for the initial hot upgrade and can
be removed afterwards along with the code_change.
BugzID: 13606
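A hedged sketch of the protected-table read path described above; the table name, stored shape, and fallback message are illustrative assumptions:

    %% Reads hit the protected ets table directly; fall back to the
    %% gen_server only when the entry is missing (e.g. during the hot
    %% upgrade window).
    get_zone(Node) ->
        case ets:lookup(mem3_nodes, Node) of
            [{Node, Props}] -> proplists:get_value(zone, Props);
            [] -> gen_server:call(mem3_nodes, {get_zone, Node})
        end.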
Paul J. Davis [Fri, 11 May 2012 23:31:18 +0000 (18:31 -0500)]
Fix edge condition when loading shards from disk
BugzId: 13386
Adam Kocoloski [Tue, 8 May 2012 17:04:10 +0000 (13:04 -0400)]
Remove custom appup
Paul J. Davis [Tue, 8 May 2012 01:12:51 +0000 (20:12 -0500)]
Don't include deleted dbs in mem3:fold_shards/2
The `delete` option was passed to couch_db:open_doc/2, which ended up
causing deleted databases to be included in mem3:fold_shards/2. This is
counterintuitive, so it is being removed.
Adam Kocoloski [Mon, 7 May 2012 17:36:20 +0000 (10:36 -0700)]
Merge pull request #29 from cloudant/13511-1.3.x-improve-internal-replicator
Improve internal replicator configuration
BugzID: 13511
Paul J. Davis [Sun, 29 Apr 2012 17:07:52 +0000 (12:07 -0500)]
Improve internal replicator configuration
This work is needed to support the internal replication requirements for
cluster elasticity. It adds three new options (see the usage sketch after
the list):
* batch_size - The number of revisions to replicate in a single batch
* batch_count - The number of batches to replicate. The special value
`all` means to replicate until finished.
* filter - A 1-arity function that takes a #full_doc_info{} record and
  returns `keep` or `discard`, determining whether that doc is included
  in the replication.
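A minimal usage sketch of the options listed above; the option names come from this commit message, while the mem3_rep entry point and its argument order are assumptions:

    -module(rep_opts_sketch).
    -export([sync_example/2]).

    %% Source and Target stand for the two shard copies being synced.
    sync_example(Source, Target) ->
        Filter = fun(_FullDocInfo) -> keep end,   % keep every doc
        Opts = [{batch_size, 100},                % revisions per batch
                {batch_count, all},               % run until finished
                {filter, Filter}],
        mem3_rep:go(Source, Target, Opts).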
Adam Kocoloski [Thu, 3 May 2012 20:13:55 +0000 (16:13 -0400)]
Put mem3_sync replication exit messages on one line
Adam Kocoloski [Thu, 3 May 2012 17:56:44 +0000 (10:56 -0700)]
Merge pull request #28 from cloudant/13529-reconfigure-ring-on-nodeup
Reconfigure ring on nodeup events
BugzID: 13529
Robert Newson [Thu, 3 May 2012 13:48:16 +0000 (14:48 +0100)]
Deduplicate lookup of special local databases
Adam Kocoloski [Thu, 3 May 2012 13:20:37 +0000 (09:20 -0400)]
Reconfigure ring replications on nodeup
BugzID: 13529
Adam Kocoloski [Wed, 2 May 2012 20:12:03 +0000 (16:12 -0400)]
Reply immediately if we already enqueued the job
Adam Kocoloski [Wed, 2 May 2012 20:09:03 +0000 (16:09 -0400)]
Really remove the Job from the Q
Adam Kocoloski [Wed, 2 May 2012 19:01:44 +0000 (15:01 -0400)]
Reorder supervision tree to start mem3_nodes earlier
mem3_sync requires mem3_nodes to be running during init.
Adam Kocoloski [Tue, 1 May 2012 18:27:05 +0000 (11:27 -0700)]
Merge pull request #20 from cloudant/13470-make-ushards-zone-aware_master
Reimplement mem3:ushards to honor all 5 properties
BugzID: 13470
Adam Kocoloski [Tue, 1 May 2012 14:52:53 +0000 (10:52 -0400)]
Add upgrade instructions for .4,.6 -> .7
Adam Kocoloski [Tue, 1 May 2012 14:42:54 +0000 (07:42 -0700)]
Merge pull request #27 from cloudant/fix-initial-sync
Add a manager for node synchronization
Paul J. Davis [Tue, 1 May 2012 03:23:45 +0000 (22:23 -0500)]
Add a manager for node synchronization
Adam Kocoloski [Tue, 1 May 2012 01:02:02 +0000 (21:02 -0400)]
Don't be dumb about response formats
Adam Kocoloski [Mon, 30 Apr 2012 21:22:33 +0000 (17:22 -0400)]
Guard code_change to prevent future surprises
Adam Kocoloski [Mon, 30 Apr 2012 21:22:05 +0000 (17:22 -0400)]
Remove obsolete appup