Paul J. Davis [Thu, 8 Jan 2015 19:08:03 +0000 (13:08 -0600)]
Make sure mem3_rep autocreates target shards
The change to our fancier history entries introduced a regression where
internal replication wouldn't automatically create the target shards.
This fixes the issue by adding a get_or_create_db/2 in mem3_rep and
switching the use of couch_db:open_int/2 to that function.
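A minimal sketch of the shape of such a helper, assuming the usual
couch_server API (the actual mem3_rep code may differ):

    get_or_create_db(DbName, Options) ->
        case couch_db:open_int(DbName, Options) of
            {not_found, no_db_file} ->
                %% Target shard is missing; create it instead of failing
                couch_server:create(DbName, Options);
            Else ->
                Else
        end.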
Robert Newson [Tue, 10 May 2016 20:02:53 +0000 (21:02 +0100)]
Pass ADMIN_CTX when opening dbs
COUCHDB-3016
Benjamin Anderson [Sun, 10 Apr 2016 06:21:58 +0000 (23:21 -0700)]
Add read_concurrency option to mem3_shards table
This table sees a great deal of activity from various subsystems -
turning on read_concurrency should be a win.
COUCHDB-2984
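For reference, the flag is set at table creation; a sketch (the exact
table options used by mem3_shards are assumed):

    %% read_concurrency optimizes the ets table for concurrent readers
    %% at the cost of slightly more expensive writes
    Tab = ets:new(mem3_shards, [bag, public, named_table,
                                {read_concurrency, true}]).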
Benjamin Anderson [Sun, 10 Apr 2016 06:08:39 +0000 (23:08 -0700)]
Use ets:select/2 to retrieve shards by name
The result of mem3_shards:for_db/1 on databases with high q values can
be very large, resulting in suboptimal performance for high-volume
callers.
mem3_sync_event_listener is only interested in a small subset of the
result of mem3_shards:for_db/1; moving this filter into an ets:select/2
call improves performance significantly.
COUCHDB-2984
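A hedged sketch of the technique (table name and record fields are
assumed, and the #shard record definition must be in scope):

    -include_lib("stdlib/include/ms_transform.hrl").

    for_shard_name(ShardName) ->
        %% The filter runs inside ets, so only matching rows are copied
        %% out, instead of materializing the full for_db/1 result first
        MS = ets:fun2ms(fun(#shard{name = N} = S)
                              when N =:= ShardName -> S end),
        ets:select(mem3_shards, MS).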
Benjamin Anderson [Sun, 10 Apr 2016 05:44:58 +0000 (22:44 -0700)]
Reduce frequency of mem3_sync:push/2 calls
In high-throughput scenarios on databases with large q values the
mem3_sync event listener becomes overloaded with messages due to the
poor performance of the shard selection logic.
It's not strictly necessary to sync on every update, but we do need to
be careful not to lose updates by keeping history too naively. This
patch adds a configurable delay and push frequency to reduce pressure on
the mem3_sync event listener.
COUCHDB-2984
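A sketch of the batching idea, with hypothetical record fields and
message names (the real state and config keys may differ):

    %% Only schedule a push every Frequency-th update, and do it after a
    %% Delay rather than immediately, to shed load on the event listener
    handle_update(_Name, #st{count = C, frequency = F} = St)
            when (C + 1) rem F =/= 0 ->
        St#st{count = C + 1};
    handle_update(Name, #st{delay = D} = St) ->
        erlang:send_after(D, self(), {push, Name}),
        St#st{count = 0}.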
Benjamin Anderson [Sun, 10 Apr 2016 03:55:58 +0000 (20:55 -0700)]
Refactor mem3_sync events to dedicated module
COUCHDB-2984
Tony Sun [Fri, 26 Feb 2016 17:04:42 +0000 (09:04 -0800)]
Revert "Remove maintenace modes from ushards"
This reverts commit
deed2f0eb15d634a643312e71e343c1e19e1b07e.
COUCHDB-2953
Tony Sun [Wed, 24 Feb 2016 20:38:25 +0000 (12:38 -0800)]
Remove maintenance modes from ushards
Maintenance mode nodes were being served for ushards and this led to
nodedown errors. We now only serve non-maintenance mode nodes.
COUCHDB-2953
Alexander Shorin [Tue, 1 Dec 2015 14:25:05 +0000 (17:25 +0300)]
Don't start couch_log app if we don't use it
Starting it makes no sense since we mock it, and it causes unwanted
trouble with the config startup dependency.
Alexander Shorin [Mon, 12 Oct 2015 17:31:00 +0000 (20:31 +0300)]
Return HTTP 405 for unsupported request method
Robert Newson [Tue, 6 Oct 2015 16:57:03 +0000 (17:57 +0100)]
Merge remote-tracking branch 'cloudant/fix-eunit-couch-log'
Nick Vatamaniuc [Tue, 6 Oct 2015 14:55:19 +0000 (10:55 -0400)]
Fix EUnit tests.
These tests need couch_log. In one case we mock it because no other apps
are started. In the other we use application:ensure_all_started(), an
R16B02+ feature.
ILYA Khlopotov [Tue, 29 Sep 2015 20:04:43 +0000 (13:04 -0700)]
Pass supervisor's children to couch_epi
ILYA Khlopotov [Tue, 29 Sep 2015 16:09:25 +0000 (09:09 -0700)]
Fix code formatting
ILYA Khlopotov [Mon, 28 Sep 2015 20:23:38 +0000 (13:23 -0700)]
Fix typo in behaviour name
ILYA Khlopotov [Mon, 28 Sep 2015 17:28:45 +0000 (10:28 -0700)]
Update to new couch_epi API
Robert Newson [Wed, 23 Sep 2015 18:24:12 +0000 (19:24 +0100)]
Fix crypto deprecations
COUCHDB-2825
ILYA Khlopotov [Wed, 15 Jul 2015 15:37:47 +0000 (08:37 -0700)]
Use dynamic http handlers
We use dynamic http handlers for:
- `_membership`
- `_shards`
Robert Kowalski [Mon, 16 Mar 2015 22:26:09 +0000 (23:26 +0100)]
reformat to 80 chars/line
Robert Kowalski [Sun, 15 Mar 2015 22:27:26 +0000 (23:27 +0100)]
change readme for the couchdb project
Robert Kowalski [Sun, 15 Mar 2015 22:22:44 +0000 (23:22 +0100)]
add license file
Alexander Shorin [Thu, 26 Feb 2015 20:27:20 +0000 (23:27 +0300)]
Merge remote-tracking branch 'kxepal/rename-system-databases'
This closes #6
Alexander Shorin [Thu, 26 Feb 2015 18:46:56 +0000 (21:46 +0300)]
Add underscore prefix for nodes and dbs database names
That's how we name system databases, and there should be no exceptions.
COUCHDB-2619
Alexander Shorin [Thu, 26 Feb 2015 18:18:30 +0000 (21:18 +0300)]
Rename "shard_db" option to "shards_db" and "node_db" one to "nodes_db"
COUCHDB-2628
Alexander Shorin [Wed, 4 Feb 2015 15:43:18 +0000 (18:43 +0300)]
Merge remote-tracking branch 'iilyak/2561-make-config-API-consistent'
This closes #5
COUCHDB-2561
ILYA Khlopotov [Fri, 30 Jan 2015 19:21:35 +0000 (11:21 -0800)]
Don't restart event handler on termination
COUCHDB-2561
ILYA Khlopotov [Thu, 29 Jan 2015 21:55:48 +0000 (13:55 -0800)]
Update config_listener behaviour
COUCHDB-2561
Alexander Shorin [Mon, 26 Jan 2015 04:29:37 +0000 (07:29 +0300)]
Use ADMIN_CTX macro from couch_db.hrl
Benjamin Bastian [Fri, 22 Aug 2014 11:39:43 +0000 (18:39 +0700)]
Update mem3 for new changes API
Robert Newson [Wed, 24 Sep 2014 20:12:36 +0000 (21:12 +0100)]
Delete mem3_rebalance for now, currently useless
Robert Newson [Mon, 15 Sep 2014 23:19:52 +0000 (00:19 +0100)]
Add and export n/2
Robert Newson [Fri, 29 Aug 2014 18:12:54 +0000 (19:12 +0100)]
fix stats paths
Robert Newson [Fri, 29 Aug 2014 16:23:07 +0000 (17:23 +0100)]
open dbs, nodes as sys dbs
Paul J. Davis [Thu, 21 Aug 2014 06:29:28 +0000 (01:29 -0500)]
Update to use couch_stats
Paul J. Davis [Sun, 17 Aug 2014 21:18:30 +0000 (16:18 -0500)]
Replace twig with couch_log
Paul J. Davis [Sun, 17 Aug 2014 18:54:51 +0000 (13:54 -0500)]
Update mem3_rebalance to work with couch_mrview
Robert Newson [Mon, 16 Jun 2014 12:04:36 +0000 (13:04 +0100)]
Remove mem3_util:owner
Russell Branca [Tue, 29 Apr 2014 23:36:06 +0000 (16:36 -0700)]
Get the shard suffix for a given database
This grabs the shards for the given database name, then pulls out the
first shard and extracts the suffix. mem3:shards is ets backed, so in
the general case this should be fast.
BugzId: 29571
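A minimal sketch of the extraction, assuming shard names of the form
<<"shards/00000000-1fffffff/dbname.1398437155">>:

    shard_suffix(DbName) ->
        [#shard{name = Name} | _] = mem3:shards(DbName),
        %% The suffix is the trailing timestamp component of the name
        lists:last(binary:split(Name, <<".">>, [global])).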
Russell Branca [Tue, 29 Apr 2014 22:56:59 +0000 (15:56 -0700)]
Allow mem3_shards:local to take a list or binary
BugzId: 29571
Adam Kocoloski [Tue, 4 Feb 2014 20:44:14 +0000 (15:44 -0500)]
Fast forward internal repl. between file copies
In the case where two files have the same UUID we can analyze epoch
information to determine the safe start sequence.
BugzID: 27753
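A heavily hedged sketch of the idea, not the actual algorithm: assume
epochs are {Node, StartSeq} entries, newest first, and that any entry
shared by both copies marks common history:

    common_epoch_seq(EpochsA, EpochsB) ->
        case [E || E <- EpochsA, lists:member(E, EpochsB)] of
            [{_Node, Seq} | _] -> Seq;  %% newest shared epoch
            [] -> 0                     %% no common history; start over
        end.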
Mike Wallace [Fri, 20 Dec 2013 14:15:43 +0000 (14:15 +0000)]
Avoid decom:true nodes when fixing zoning
This patch prevents mem3_rebalance:fix_zoning from suggesting moves
onto nodes that are flagged with "decom":true.
BugzID: 26362
Paul J. Davis [Mon, 9 Dec 2013 20:04:39 +0000 (14:04 -0600)]
Refactor mem3_rpc:add_checkpoint/2
This is based on Adam Kocoloski's original add_checkpoint/2 but uses a
body recursive function to avoid the final reverse/filter steps.
BugzId: 21973
Adam Kocoloski [Thu, 31 Oct 2013 16:23:49 +0000 (12:23 -0400)]
Write plan to /tmp/rebalance_plan.txt
This was a request by @mattwhite to help with automation. The
implementation here is fairly sloppy; we could leave this off and do a
better job next time.
Paul J. Davis [Fri, 6 Dec 2013 19:52:28 +0000 (13:52 -0600)]
Allow target_uuid prefixes in find_source_seq
Since sequence values only contain UUID prefixes, we need to account
for that when locating the replication checkpoints.
BugzId: 21973
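The matching is simple binary prefix comparison; a sketch:

    %% True if the stored (possibly truncated) uuid is a prefix of UUID
    is_prefix(Prefix, UUID) when byte_size(Prefix) =< byte_size(UUID) ->
        binary:part(UUID, 0, byte_size(Prefix)) =:= Prefix;
    is_prefix(_, _) ->
        false.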
Paul J. Davis [Fri, 6 Dec 2013 18:08:07 +0000 (12:08 -0600)]
Include replication history on checkpoint docs
This changes how and what we store on internal replication checkpoint
documents. The two major changes are that we are now identifying
checkpoint documents by the database UUIDs (instead of the node that
hosted them) and we're storing a history of checkpoint information to
allow us to be able to replace dead shards.
The history is a list of checkpoint entries stored with exponentially
decreasing granularity. This allows us to store ~30 checkpoints covering
ranges into the billions of update sequences, which means we won't need
to worry about truncations or other issues for the time being.
There's also a new mem3_rep:find_source_seq/4 helper function that will
find a local update_seq replacement provided information for a remote
shard copy. This logic is a bit subtle and should be reused rather than
reimplemented.
BugzId: 21973
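A minimal sketch of the thinning idea (not the actual mem3_rep code):
with entries newest first, keep one entry per doubling interval, so a
handful of entries can span billions of sequences:

    thin_history(Entries) ->
        thin_history(Entries, 1, []).

    thin_history([], _Span, Acc) ->
        lists:reverse(Acc);
    thin_history([Newest | _] = Entries, Span, Acc) ->
        Rest = case length(Entries) > Span of
            true  -> lists:nthtail(Span, Entries);
            false -> []
        end,
        thin_history(Rest, Span * 2, [Newest | Acc]).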
Paul J. Davis [Fri, 6 Dec 2013 17:59:45 +0000 (11:59 -0600)]
Inline open_doc_revs into open_docs
This function was trivial and never reused. It was more confusing to
have it as a separate function than to inline it where it's used.
Paul J. Davis [Fri, 6 Dec 2013 17:55:47 +0000 (11:55 -0600)]
Add a new mem3_rpc module for replication RPCs
This is intended to make the local/remote code execution contexts much
clearer.
Paul J. Davis [Fri, 6 Dec 2013 17:42:49 +0000 (11:42 -0600)]
Reorder functions into a logical progression
This just moves functions around in the mem3_rep module to give a better
logical progression. Purely stylistic but it should make things easier
to read and find.
Paul J. Davis [Fri, 6 Dec 2013 17:36:47 +0000 (11:36 -0600)]
Update whitespace and exports formatting
Robert Newson [Fri, 22 Nov 2013 16:50:15 +0000 (16:50 +0000)]
Remove old code_change, set module version to 1
Adam Kocoloski [Wed, 30 Oct 2013 21:43:40 +0000 (17:43 -0400)]
Allow for rebalancing "special" DBs
For example, _replicator or _users.
BugzID: 24612
Adam Kocoloski [Wed, 30 Oct 2013 21:40:20 +0000 (17:40 -0400)]
Suggest moves from all donor nodes in parallel
Previously the generator would suggest all moves from the first node
before moving on to the second one. In the case where the quantum of
jobs is much smaller than the number of moves per node, this results in
the other donors being neglected for long periods.
BugzID: 24612
Adam Kocoloski [Wed, 30 Oct 2013 18:17:03 +0000 (14:17 -0400)]
Allow targets to exceed floor, add another check
Sometimes we want to transfer a shard to a target even though it's
already at the floor. We add another check to make sure we're not
wasting effort -- the difference in shard counts between the source and
the target must be 2 or greater.
We also refactor the global shard count code to avoid future atom /
binary problems.
BugzID: 24466
Adam Kocoloski [Wed, 30 Oct 2013 17:10:42 +0000 (13:10 -0400)]
Refactor global candidate selection
The old approach was getting unwieldy; hopefully this makes the tests
more explicit.
BugzID: 24466
Adam Kocoloski [Wed, 30 Oct 2013 15:53:22 +0000 (11:53 -0400)]
Stop donating once the target level is achieved
Also switched to a record accumulator for clarity.
BugzID: 24466
Adam Kocoloski [Wed, 30 Oct 2013 15:05:41 +0000 (11:05 -0400)]
Fix two bugs in the global balancing phase
* Nodes with 0 shards were being ignored because of a wrong datatype.
* The limit was being ignored because max was used instead of min.
BugzID: 24680
Adam Kocoloski [Wed, 30 Oct 2013 14:00:08 +0000 (10:00 -0400)]
Allow skip straight to global phase
When rebalancing a DB-per-user cluster with small Q values it's typical
that
a) the local phase takes a long time, and
b) the local phase doesn't suggest any moves
While the local phase should still run at least once, we'll expose a
flag to skip straight to the global phase, since we'll need to run the
plan generator many times and we can't afford to wait.
BugzID: 24680
Adam Kocoloski [Wed, 23 Oct 2013 14:08:51 +0000 (10:08 -0400)]
Refuse to place shards on decom:true nodes
BugzID: 24420
Adam Kocoloski [Wed, 23 Oct 2013 14:02:33 +0000 (10:02 -0400)]
Rely on decom:true attribute to filter decom nodes
BugzID: 24420
Adam Kocoloski [Tue, 22 Oct 2013 19:29:00 +0000 (15:29 -0400)]
Ensure that the owner of a doc is also a host
BugzID: 24395
Adam Kocoloski [Wed, 25 Sep 2013 18:25:44 +0000 (14:25 -0400)]
Rewrite rebalancing plan generator
This patch splits the functionality of the module out into three
classes or work:
* Fixing zoning and replica level violations
* Contracting a cluster
* Rebalancing shards across a cluster
The implementations of the first two features are pretty similar - find
the shards that need to be moved, then choose an optimal home for each
of them. By default the contraction code will remove shards from nodes
in the "decom" zone, and the rebalancing code will ignore that zone
entirely. An optimal home (sketched after this list) is a node that
a) is in the correct zone, and
b) has the fewest # of shards for the DB among nodes in the zone, and
c) has the fewest total # of shards among nodes satisfying a) and b)
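A hedged sketch of that selection; the lookup funs are hypothetical
stand-ins for the real shard-count bookkeeping:

    best_target(Zone, Nodes, ZoneOf, DbShards, TotalShards) ->
        InZone = [N || N <- Nodes, ZoneOf(N) =:= Zone],
        %% Sort by {per-db count, total count}, smallest first
        Ranked = lists:keysort(1, [{{DbShards(N), TotalShards(N)}, N}
                                   || N <- InZone]),
        [{_, Best} | _] = Ranked,
        Best.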
The implementation of rebalancing is a bit more complicated. The
rebalancing algorithm looks roughly like this:

    For DB in all_dbs:
        Ensure all nodes have at least (N*Q) div length(Nodes) shards
        Ensure no node has more than (N*Q) div length(Nodes) + 1 shards
    For node in nodes:
        If node has more than TotalShards div length(Nodes) + 1 shards:
            Donate shard to another node
The net result is that each database is balanced across the cluster and
the cluster as a whole is globally balanced.
The current version of the module prints out shard move and copy
operations in a clou-friendly format via io:format. It also returns a
list of {Op, #shard{}, node()} tuples representing the operations.
The rebalancer will stop after generating 1000 operations by default.
The limit can be customized by using the 1-arity versions of expand,
contract and fix_zoning, but note that the performance of the rebalancer
degrades as the number of pending operations increases.
BugzID: 23690
BugzID: 20770
Paul J. Davis [Thu, 5 Sep 2013 19:08:48 +0000 (14:08 -0500)]
Fix latent single-shard range hack
We had an off-by-one error when faking #shard{} records for node-local
databases. This fixes the issue. The bug was noticeable when attempting
to pass these shards to `fabric_view:is_progress_possible/1`.
BugzId: 22809
Adam Kocoloski [Mon, 19 Aug 2013 13:43:08 +0000 (09:43 -0400)]
Use a consistent commenting syntax
Adam Kocoloski [Mon, 19 Aug 2013 13:40:38 +0000 (09:40 -0400)]
Address comments from PR
Adam Kocoloski [Fri, 16 Aug 2013 17:40:50 +0000 (13:40 -0400)]
Ensure all shards are moved off non-target nodes
BugzID: 20742
Robert Newson [Wed, 31 Jul 2013 10:14:30 +0000 (11:14 +0100)]
Stabilize mem3_util:owner/2
BugzID: 21413
Robert Newson [Wed, 31 Jul 2013 10:14:15 +0000 (11:14 +0100)]
Move rotate_list to mem3_util
Adam Kocoloski [Tue, 2 Jul 2013 18:30:09 +0000 (14:30 -0400)]
Support balancing across a subset of nodes
mem3_balance implicitly assumed the set of nodes over which the DB is
hosted is expanding. We need to make a couple of small changes in the
case of cluster contraction.
BugzID: 20742
Robert Newson [Thu, 27 Jun 2013 18:19:24 +0000 (19:19 +0100)]
Fix load_shards_from_disk/2
load_shards_from_disk/2 did not expect #ordered_shard records to be
returned from load_shards_from_disk/1. Since it uses a list
comprehension, the mistake is silently squashed, resulting in an empty
list.
In production this manifests as the occasional failure where 'n' is
calculated as 0, causing quorum reads to fail. The very next call
succeeds as it reads the cached versions and correctly downcasts.
BugzID: 20629
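The failure mode in miniature (illustrative shell session):

    %% A generator pattern that doesn't match silently skips the
    %% element rather than crashing:
    1> [N || {shard, N} <- [{ordered_shard, a}, {ordered_shard, b}]].
    []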
Robert Newson [Tue, 25 Jun 2013 11:37:33 +0000 (12:37 +0100)]
Preserve key and incorporate range into rotation key
Robert Newson [Tue, 25 Jun 2013 11:36:59 +0000 (12:36 +0100)]
we're not rotating by DbName any more
Robert Newson [Tue, 25 Jun 2013 11:20:59 +0000 (12:20 +0100)]
refactor choose_ushards
Adam Kocoloski [Fri, 21 Jun 2013 04:14:42 +0000 (00:14 -0400)]
Zero out shard caches on upgrade
The mix of #shard and #ordered_shard records breaks ushards. Different
nodes can start returning different results.
Robert Newson [Wed, 3 Apr 2013 19:13:35 +0000 (20:13 +0100)]
Add function to assist with rebalancing
This function takes either a database name or a list of shards, plus a
list of target nodes to balance the shards across. Every node with
fewer than a fair share of shards will steal shards from the node with
the most shards, as long as both shards are in the same zone.
BugzID: 18638
Adam Kocoloski [Fri, 24 May 2013 19:03:54 +0000 (15:03 -0400)]
Use a private record for event listener state
Adam Kocoloski [Fri, 24 May 2013 19:00:32 +0000 (15:00 -0400)]
Fix trivial typo
Adam Kocoloski [Thu, 23 May 2013 14:32:58 +0000 (10:32 -0400)]
Balance replication ownership across nodes
The previous algorithm was biased towards low-numbered nodes, and in the
case of a 3-node cluster would declare db1 to be the owner of all
replications. We can do better just by leveraging the existing
ushards code.
There's a possibility to refactor this as a new ushards/2 function if
that's perceived as useful.
BugzID: 19870
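A hedged sketch of the idea (not the exact code): let the first ushard
replica for the shard range holding the doc own the job:

    owner(DbName, DocId) ->
        [#shard{range = Range} | _] = mem3:shards(DbName, DocId),
        [#shard{node = Node} | _] =
            [S || #shard{range = R} = S <- mem3:ushards(DbName),
                  R =:= Range],
        %% Every node computes the same answer, so exactly one claims it
        Node =:= node().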
Paul J. Davis [Tue, 23 Apr 2013 22:26:30 +0000 (17:26 -0500)]
Update to use the new couch_event application
Robert Newson [Tue, 23 Apr 2013 21:54:48 +0000 (22:54 +0100)]
Choose ushards according to persistent record
The order of nodes in the by_range section of "dbs" documents is now
promoted to the principal order for ushards. Ushards still accounts
for liveness, selecting the first live replica, and still supports
spread by rotating this list using the CRC32 of the database name
(since many databases will have the same layout).
If by_range and by_node are not symmetrical then by_node is used and
the order is undefined, to match existing behavior.
Paul J. Davis [Tue, 16 Apr 2013 22:12:38 +0000 (17:12 -0500)]
If two shards differ we need to sync
There's no security if two shards return different answers, but it
gives us enough of a signal to know that we need to trigger a full
synchronization.
BugzId: 18955
Russell Branca [Fri, 12 Apr 2013 20:48:14 +0000 (16:48 -0400)]
Moving shard maps _membership endpoint to _shards db handler
Russell Branca [Fri, 12 Apr 2013 19:06:58 +0000 (15:06 -0400)]
Add doc shard info endpoint
Russell Branca [Thu, 11 Apr 2013 18:18:12 +0000 (14:18 -0400)]
Fix _membership/$DBNAME api endpoint
This switches the JSON key to be a binary, as required by jiffy. Also
remove the extraneous <<"parts">> path from the URL, and show the full
shard range.
Paul J. Davis [Tue, 19 Mar 2013 03:57:46 +0000 (22:57 -0500)]
Update to use new multi rexi_server protocol
Russell Branca [Wed, 18 Jun 2014 22:04:35 +0000 (15:04 -0700)]
Handle the #doc_info case in changes_enumerator
This is to handle the special case where the user migrates a CouchDB
database to BigCouch and they have not yet compacted the
database. Once the database has been compacted, this #doc_info clause
should never be encountered.
Robert Newson [Tue, 3 Jun 2014 10:21:02 +0000 (11:21 +0100)]
Don't log when ensuring dbs exist
Robert Newson [Wed, 7 May 2014 13:48:25 +0000 (14:48 +0100)]
Add function to determine shard membership locally
mem3:belongs/2 allows you to determine if a given doc id belongs to a
given shard (whether a #shard{} record or just the filename of a
shard) without looking up the shard map or making any remote
calls.
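Usage sketch (the shard name and local handler are illustrative):

    case mem3:belongs(<<"shards/00000000-1fffffff/mydb.1398437155.couch">>,
                      DocId) of
        true  -> handle_locally(DocId);  %% hypothetical local handler
        false -> ignore
    end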
Robert Newson [Wed, 12 Feb 2014 23:23:47 +0000 (23:23 +0000)]
Change API to function per level
Robert Newson [Wed, 12 Feb 2014 20:11:56 +0000 (20:11 +0000)]
Switch to couch_log
Paul J. Davis [Tue, 11 Feb 2014 07:54:37 +0000 (01:54 -0600)]
Add license headers
Robert Newson [Mon, 23 Dec 2013 16:55:10 +0000 (16:55 +0000)]
Add ejson_body to all mem3 open_doc attempts that need it
Robert Newson [Thu, 19 Dec 2013 18:16:58 +0000 (18:16 +0000)]
Remove references to margaret
Robert Newson [Wed, 18 Dec 2013 14:04:59 +0000 (14:04 +0000)]
Build with rebar
Robert Newson [Thu, 13 Jun 2013 12:42:11 +0000 (13:42 +0100)]
Fix up copyright headers
Paul J. Davis [Tue, 5 Mar 2013 23:55:26 +0000 (17:55 -0600)]
New build system for mem3
Paul J. Davis [Wed, 20 Mar 2013 10:04:53 +0000 (05:04 -0500)]
Remove Cloudant build system remnants
Adam Kocoloski [Thu, 7 Mar 2013 22:03:38 +0000 (14:03 -0800)]
Merge pull request #43 from cloudant/guard-against-empty-list
Guard against empty list
Robert Newson [Thu, 7 Mar 2013 21:29:35 +0000 (15:29 -0600)]
Guard against empty list
Rotating an empty list gives badarith, so add a guard clause, since the
result of rotating an empty list is well known.
BugzID: 17801
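A sketch of the guarded rotation ('rem length([])' is a division by
zero, hence the badarith):

    rotate_list(_Key, []) ->
        [];  %% rotating an empty list is the empty list
    rotate_list(Key, List) ->
        {H, T} = lists:split(erlang:crc32(Key) rem length(List), List),
        T ++ H.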
Adam Kocoloski [Thu, 7 Mar 2013 20:17:36 +0000 (12:17 -0800)]
Merge pull request #42 from cloudant/17801-spread-the-pain
BugzID: 17801
Robert Newson [Wed, 6 Mar 2013 14:29:13 +0000 (08:29 -0600)]
Spread ushards load to more nodes
In some cases, notably q=1 databases, the current ushards algorithm
will always choose the same replica (because of the lists:sort and
order-preserving orddict). This causes a severely skewed load profile
if you have lots of these cases.
This patch rotates each group of nodes using the crc32 of the database
name, spreading out the load pretty evenly.
The patch is a little obscure because ushards still has remnants of
previous work (breaking nodes into local, same-zone, and different-zone
groups, then deliberately merging local and same-zone back together
because that was a silly idea).
BugzID: 17801