Alexander Shorin [Sat, 31 Jan 2015 16:26:49 +0000 (19:26 +0300)]
Remove useless eunit include
Paul J. Davis [Thu, 21 Aug 2014 06:11:31 +0000 (01:11 -0500)]
Update to use couch_stats
Paul J. Davis [Tue, 19 Aug 2014 16:56:19 +0000 (11:56 -0500)]
Return ok for pattern matching in client code
Paul J. Davis [Sun, 17 Aug 2014 21:32:01 +0000 (16:32 -0500)]
Replace twig with couch_log
Paul J. Davis [Sun, 17 Aug 2014 20:25:53 +0000 (15:25 -0500)]
The couch_log functions are all two-arity
Paul J. Davis [Sun, 17 Aug 2014 20:25:20 +0000 (15:25 -0500)]
Remove vestigial call to margaret
We'll re-add the metric when couch_stats is merged.
Adam Kocoloski [Fri, 18 Jul 2014 18:58:23 +0000 (14:58 -0400)]
Allow for timeout message while waiting for sender
In 9c36c1b0 we fixed a counting bug in the buffer but introduced another
one. We started asserting that the timeout message can only arrive when
the sender is not nil, but in fact it can absolutely happen that the
buffer is asked to deliver a message while one is in flight. This causes
the rexi_buffer to crash. The fix is to simply restore the clause that
ignores the timeout message if the server is waiting for a sender to
return.
BugzID: 32669
Adam Kocoloski [Thu, 26 Jun 2014 23:49:30 +0000 (19:49 -0400)]
Fix error_limit = 0, and make it the default
Setting the error_limit to 0 would previously cause rexi_server to crash
on the next error because of an unhandled exception in a queue:drop/1.
This fixes that exception and makes a limit of 0 the default. Operators
can still raise the limit to capture errors in the future if we end up
finding that useful.
BugzID: 30821
Robert Newson [Tue, 10 Jun 2014 13:01:07 +0000 (14:01 +0100)]
Add max_count to state record
Adam Kocoloski [Tue, 3 Jun 2014 18:28:37 +0000 (14:28 -0400)]
Fix counting bug in buffer
Quoting @davisp:
There's a bug in rexi_buffer that can lead to the counter in its state
running negative. I noticed this on malort looking for memory usage. A
quick reading of the code suggests it's due to us getting a timeout
message with an empty queue. Theoretically this could happen if we've
exceeded the MAX_MEMORY threshold when sending a message or even with
just grabbing the buffered count when it's idle.
BugzID: 28049
Adam Kocoloski [Tue, 3 Jun 2014 18:25:59 +0000 (14:25 -0400)]
Configure buffer limit by message count
This allows an operator to decide how large the buffers should be. It
also provides an escape valve to clear the buffer entirely.
Paul J. Davis [Fri, 31 Jan 2014 08:55:57 +0000 (02:55 -0600)]
Slight style tweaks to rexi_buffer hibernation
Avoid using a possibly misleading variable name by not naming it.
Alternatively, this is not the variable name you are looking for.
BugzId: 27672
Paul J. Davis [Fri, 31 Jan 2014 04:03:09 +0000 (22:03 -0600)]
Hibernate rexi_buffer when becoming idle
The rexi_buffer gen_server can hold onto quite a bit of RAM during idle
operation. This just checks when we're going back to the idle state and
hibernates until the next message arrives. This ensures that we run
garbage collection before sitting idle.
BugzId: 27672
Paul J. Davis [Fri, 31 Jan 2014 03:31:11 +0000 (21:31 -0600)]
Removed upgrade statement in rexi:kill/2
Removing cruft we no longer need that was sending an extra useless
message.
BugzId: 27671
Paul J. Davis [Thu, 12 Dec 2013 17:36:05 +0000 (11:36 -0600)]
Revert "Tag all replies to the coordinator"
Not all rexi communication happens through `rexi_utils:recv/6`. We'll
need to audit all of our apps for any code that doesn't, and update it
to the new message format.
This reverts commit 2f967906730eedf5851752be06155fc49edd28a1.
Paul J. Davis [Tue, 10 Dec 2013 15:59:16 +0000 (09:59 -0600)]
Rename governor to buffer
Buffer describes the behavior quite a bit better than governor.
BugzId: 23717
BugzId: 23718
Adam Kocoloski [Fri, 22 Nov 2013 20:58:02 +0000 (15:58 -0500)]
Don't block the governor to send a message
This allows the governor to continue prioritizing incoming requests
instead of hanging for several seconds to try to connect to the remote
node.
BugzID: 23717
BugzID: 23718
Adam Kocoloski [Wed, 16 Oct 2013 15:51:05 +0000 (11:51 -0400)]
Try to start per-node servers immediately
I can't think of a good reason not to do this.
Adam Kocoloski [Wed, 16 Oct 2013 14:54:15 +0000 (10:54 -0400)]
Remove gov_manager from supervision
We'll remove the module itself in a subsequent release.
Adam Kocoloski [Wed, 16 Oct 2013 14:42:15 +0000 (10:42 -0400)]
Start a supervised rexi_governor per node
This generalizes rexi_server_mon to start per-node versions of a server
specified by a child module.
BugzID: 23717
BugzID: 23718
Adam Kocoloski [Wed, 16 Oct 2013 02:46:52 +0000 (22:46 -0400)]
Rejigger the governor implementation
Previously we'd spawn a number of processes to send messages to
'noconnect' / 'nosuspend' nodes. Now we're buffering the messages that
need to be sent directly in each governor and sending them one at a
time. This prevents the net_kernel from tipping over in the noconnect
case. We also decide whether to drop messages based on memory
consumption in the node instead of process limits (since we're not
spawning anymore).
BugzID: 23717
BugzID: 23718
Robert Newson [Fri, 22 Nov 2013 16:40:23 +0000 (16:40 +0000)]
Remove old code_change, set module version to 1
Paul J. Davis [Mon, 28 Oct 2013 21:03:07 +0000 (16:03 -0500)]
Implement new stream2 API
This embeds the stream_init/1 logic into the stream functions so that we
don't have to maintain the logic for initializing the stream for all
clients.
BugzId: 24635
Paul J. Davis [Fri, 4 Oct 2013 18:14:25 +0000 (13:14 -0500)]
Allow callbacks to update their list of workers
We need this to be able to allow coordinators to replace nodes that are
in maintenance mode.
BugzId: 22729
Paul J. Davis [Fri, 6 Sep 2013 20:04:00 +0000 (15:04 -0500)]
Exit with timeout instead of returning an atom
Every function that called `rexi:stream/1` had to check the return value
for timeout and would then call `erlang:exit/1` if that atom were
returned. Rather than force every function to make this check, this just
calls exit in `rexi:stream/1`. It's possible to catch this if a process
ever happens to require it.
BugzId: 22729
Paul J. Davis [Wed, 14 Aug 2013 14:06:05 +0000 (09:06 -0500)]
Implement new streaming APIs
This adds new functions that are used by coordinators and workers to
negotiate an RPC stream. A stream is simply any response that requires
multiple messages from the worker.
BugzId: 22729
Adam Kocoloski [Fri, 16 Aug 2013 18:40:31 +0000 (14:40 -0400)]
Use '$initial_call' instead of initial_call
BugzID: 18742
Adam Kocoloski [Thu, 6 Jun 2013 19:53:24 +0000 (15:53 -0400)]
Tag all replies to the coordinator
BugzID: 20204
Adam Kocoloski [Thu, 6 Jun 2013 19:44:05 +0000 (15:44 -0400)]
Handle rexi-tagged messages
BugzID: 20204
Adam Kocoloski [Tue, 16 Apr 2013 14:12:58 +0000 (10:12 -0400)]
Start per-node servers regardless of feature flag
Previously rexi_server_mon would fail to start per-node servers on a
node where the use of those servers in rexi:cast messages was disabled
by the "server_per_node" feature flag. This mostly defeats the purpose
of the feature flag, which is to ensure that we can start all the
per-node servers before we try to use them. Moreover, it cluttered up
the error logs because rexi_server_mon would crash after trying to
execute a `start_child` on rexi_server.
The fix is not DRY but it does the job well.
BugzID: 18970
Adam Kocoloski [Thu, 11 Apr 2013 12:10:20 +0000 (08:10 -0400)]
Use the unadulterated node name in server name
BugzID: 18215
Adam Kocoloski [Thu, 11 Apr 2013 12:02:41 +0000 (08:02 -0400)]
Add a temporary feature flag for server-per-node
Paul J. Davis [Fri, 22 Mar 2013 20:07:53 +0000 (15:07 -0500)]
[2/3] Start a rexi_server per remote node
Switch sending rexi messages to the rexi_server handling requests for
the local node. This replaces the use of the singleton rexi_server
process with a rexi_server instance for each node.
The only slight gotcha in the switch is that we need to send kill
messages to both the per-node and singleton instances during the
transition because we can't tell which process may have generated the
job. The extra kill message will be removed in the next commit.
Paul J. Davis [Tue, 19 Mar 2013 03:51:29 +0000 (22:51 -0500)]
[1/3] Start a rexi_server per remote node
This is part of a multi-release upgrade to switch to using a rexi_server
instance per remote node. The first commit introduces the new
rexi_server instances, the second switches to using the new instances
and the third will remove the singleton rexi_server. Each of these
commits should be in a separate release.
The new pattern introduced by this commit will start a 'rexi_server_%b'
process where the '%b' is replaced by the `erlang:phash2(Node)` for
which it will service requests. After the second commit is released, each
rexi generated message will be sent to the rexi_server instance on the
remote node handling requests for the local node. The
`rexi_utils:server_pid/1` function is used to generate the id of the
remote server.
Robert Newson [Wed, 12 Feb 2014 23:25:37 +0000 (23:25 +0000)]
Change API to function per level
Robert Newson [Wed, 12 Feb 2014 20:12:50 +0000 (20:12 +0000)]
Switch to couch_log
Robert Newson [Thu, 19 Dec 2013 18:16:58 +0000 (18:16 +0000)]
Remove references to margaret
Robert Newson [Wed, 18 Dec 2013 14:04:59 +0000 (14:04 +0000)]
Build with rebar
Robert Newson [Thu, 13 Jun 2013 12:42:11 +0000 (13:42 +0100)]
Fix up copyright headers
Paul J. Davis [Wed, 6 Mar 2013 00:03:31 +0000 (18:03 -0600)]
New build system for rexi
Paul J. Davis [Wed, 20 Mar 2013 10:04:53 +0000 (05:04 -0500)]
Remove Cloudant build system remnants
Adam Kocoloski [Wed, 13 Feb 2013 22:15:55 +0000 (17:15 -0500)]
Allow sending to anonymous remote PIDs
BugzID: 17296
Bob Dionne [Wed, 13 Feb 2013 19:48:59 +0000 (14:48 -0500)]
Change startup order in rexi_sup
Ensures rexi_governor comes up before rexi_server. Also removed the custom appup
BugzID: 17287
Adam Kocoloski [Wed, 13 Feb 2013 18:14:48 +0000 (13:14 -0500)]
Add custom appup to start governor
Adam Kocoloski [Mon, 11 Feb 2013 20:52:54 +0000 (12:52 -0800)]
Merge pull request #7 from cloudant/15608-too-much-spawning2
Install a governor to limit the amount of spawning we'll do to send messages to an unresponsive node.
BugzID: 15608
Adam Kocoloski [Mon, 11 Feb 2013 20:32:31 +0000 (15:32 -0500)]
Pull settings from new config app
BugzID: 15608
Bob Dionne [Sat, 1 Dec 2012 12:46:34 +0000 (07:46 -0500)]
Manage Excessive amount of spawned pids
rexi_utils:send uses the noconnect and nosuspend options in order to immediately
return to the controller and spawns new processes to actually send the messages, assuming
the remote node is only temporarily down. In the nosuspend case these pids
can hang around indefinitely and build up.
This patch sends the messages that need to be sent from spawned processes to a new
gen_server, rexi_manager, which manages a group of rexi_governors, one for each node.
The governor does the spawning and monitoring of the pids and keeps track of how many are sent.
If a node down message is received, the manager sets a timer which, when expired, tells the appropriate
governor to kill all the pids waiting on that node. A cap prevents spawning of processes above
a certain amount, after which messages are dropped on the floor.
BugzID: 15608
Bob Dionne [Sat, 1 Dec 2012 13:05:40 +0000 (08:05 -0500)]
Ignore some files
Bob Dionne [Sat, 1 Dec 2012 12:45:42 +0000 (07:45 -0500)]
Whitespace
Paul J. Davis [Mon, 28 Jan 2013 23:51:41 +0000 (15:51 -0800)]
Merge pull request #6 from cloudant/16883-smaller-send-buffer
Reduce stream message limit
Paul J. Davis [Mon, 28 Jan 2013 22:17:52 +0000 (16:17 -0600)]
Reduce stream message limit
It's possible that a worker is starved out of running during the initial
spin up of worker processes. When there's a high Q running, it's possible
that a few of the workers are prevented from sending messages back which
ends up causing significant latency as they wait for other workers to
fill up their 100 message buffer and get swapped out.
Once they start sending, though, the view streams fine. While we don't
understand why some workers behave like this, a quick fix is to just
reduce the size of the buffer so that we're not introducing as large of
a delay in the time to first byte.
Adam Kocoloski [Tue, 17 Jul 2012 17:37:08 +0000 (10:37 -0700)]
Merge pull request #4 from cloudant/13311-improve-view-back-pressure
Implement message stream interface
BugzID: 13311
BugzID: 14075
Paul J. Davis [Tue, 12 Jun 2012 23:29:01 +0000 (18:29 -0500)]
Implement message stream interface
This allows workers to stream messages to a coordinator process which is
then responsible for acking each message. When a worker reaches a
configurable number of outstanding messages it will wait for acks before
continuing to send messages.
Paul J. Davis [Tue, 17 Apr 2012 16:30:53 +0000 (11:30 -0500)]
Provide a rexi:cast that uses erlang:send/2
Sometimes we want to have workers avoid deferring their casts to a
temporary process. This can lead to message pileups on the distribution
control sockets.
BugzID: 13469
Adam Kocoloski [Fri, 17 Feb 2012 17:13:20 +0000 (09:13 -0800)]
Merge pull request #3 from cloudant/13298-remove-blocking-send
Adam Kocoloski [Fri, 17 Feb 2012 14:05:30 +0000 (09:05 -0500)]
Don't block rexi_server to send a message
When a worker dies rexi_server needs to notify the remote client. If
communication to the remote node is impaired rexi_server may block for
several seconds trying to send the rexi_EXIT message.
Fix is to use the same 'noconnect' and 'nosuspend' options in
rexi_server that we use to submit jobs from a client.
BugzID: 13298
Adam Kocoloski [Tue, 11 Oct 2011 17:29:35 +0000 (13:29 -0400)]
Use a tagged version of twig
Bob Dionne [Sun, 9 Oct 2011 14:40:28 +0000 (10:40 -0400)]
Remove unused variables
Adam Kocoloski [Sat, 8 Oct 2011 13:57:54 +0000 (06:57 -0700)]
Merge pull request #2 from cloudant/12713-fail-rexi_DOWN-faster
BugzID: 12713
Adam Kocoloski [Sat, 8 Oct 2011 13:43:28 +0000 (09:43 -0400)]
Remove an ancient debugging statement
Adam Kocoloski [Sat, 8 Oct 2011 00:39:13 +0000 (20:39 -0400)]
Keep the DOWN notifications DRY
Adam Kocoloski [Sat, 8 Oct 2011 00:34:05 +0000 (20:34 -0400)]
Fail fast if asked to monitor process on down node
BugzID: 12713
Adam Kocoloski [Mon, 27 Jun 2011 20:03:31 +0000 (16:03 -0400)]
Track client pid in ets table
We need the client pid in order to respond to clients when the worker
crashes.
BugzID: 12344
Adam Kocoloski [Mon, 27 Jun 2011 19:16:50 +0000 (15:16 -0400)]
Add an index keyed on client reference
This prevents table scans triggered by the kill handler from causing
the server to fall over under very high throughput.
I also took the opportunity to refactor a bit and use #job records
throughout the server instead of raw tuples.
BugzID: 12344
Adam Kocoloski [Wed, 30 Mar 2011 00:25:04 +0000 (20:25 -0400)]
Use 'nosuspend' for all messages to remote nodes
The motivation for this change is to avoid suspending rexi clients
when 'busy_dist_port' events occur.
Adam Kocoloski [Tue, 29 Mar 2011 21:01:23 +0000 (17:01 -0400)]
Do not allow sender to be suspended for any reason
Adam Kocoloski [Thu, 10 Mar 2011 17:26:27 +0000 (12:26 -0500)]
Always provide the nonce to the spawned process
Adam Kocoloski [Thu, 10 Mar 2011 15:42:16 +0000 (10:42 -0500)]
Small tweak to logging level
Adam Kocoloski [Wed, 9 Mar 2011 18:28:04 +0000 (13:28 -0500)]
Use twig for logging
Adam Kocoloski [Mon, 28 Feb 2011 02:30:12 +0000 (21:30 -0500)]
Remove old appup
Adam Kocoloski [Tue, 22 Feb 2011 21:10:54 +0000 (16:10 -0500)]
Adam Kocoloski [Mon, 31 Jan 2011 20:08:13 +0000 (15:08 -0500)]
Ignore unknown cast messages
This server needs to be bulletproof.
BugzID: 11762
Adam Kocoloski [Wed, 26 Jan 2011 19:51:40 +0000 (14:51 -0500)]
Update spec to include accumulator in timeout response
BugzID: 11432
Robert Dionne [Tue, 21 Dec 2010 19:08:43 +0000 (14:08 -0500)]
Include accumulator in timeout response, BugzID 11432
Adam Kocoloski [Mon, 13 Dec 2010 18:54:12 +0000 (13:54 -0500)]
Fix a minor problem in the README and add a couple of links
Adam Kocoloski [Fri, 10 Dec 2010 18:34:29 +0000 (13:34 -0500)]
Include the exception class in the #error{}
Adam Kocoloski [Thu, 9 Dec 2010 22:35:30 +0000 (17:35 -0500)]
Move #error{} definition to rexi.hrl
Adam Kocoloski [Fri, 3 Dec 2010 03:38:47 +0000 (22:38 -0500)]
Multi-faceted refactor of error logging
* Store errors as instances of internal #error{} record
* Move error buffer API calls to rexi.erl
* Don't badmatch if an old process exits abnormally
* Trivial rename of internal state fields
Adam Kocoloski [Fri, 3 Dec 2010 03:38:35 +0000 (22:38 -0500)]
code_change for new server state
Adam Kocoloski [Fri, 3 Dec 2010 01:42:20 +0000 (20:42 -0500)]
correct specification to allow undefined nonces
Adam Kocoloski [Fri, 3 Dec 2010 01:41:54 +0000 (20:41 -0500)]
reduce code duplication in doit handler
Adam Kocoloski [Fri, 3 Dec 2010 01:41:08 +0000 (20:41 -0500)]
keep init_p/2 around for a smoother upgrade
Robert Dionne [Thu, 18 Nov 2010 16:20:26 +0000 (11:20 -0500)]
keep a circular buffer of errors cached in memory
Adam Kocoloski [Fri, 3 Dec 2010 01:06:57 +0000 (20:06 -0500)]
bundle rebar
Adam Kocoloski [Thu, 2 Dec 2010 19:47:15 +0000 (14:47 -0500)]
set version to `git describe` during rebar compile
Robert Dionne [Thu, 4 Nov 2010 18:53:44 +0000 (14:53 -0400)]
relax return type of recv
Adam Kocoloski [Wed, 27 Oct 2010 00:51:54 +0000 (20:51 -0400)]
log abnormal rexi worker deaths
Adam Kocoloski [Sat, 23 Oct 2010 16:35:12 +0000 (12:35 -0400)]
b25b5b was sloppy, erlang:send_after/3 does not wrap w/ ok
Adam Kocoloski [Fri, 22 Oct 2010 01:48:39 +0000 (21:48 -0400)]
use erlang:send_after/3 instead of timer version
http://www.erlang.org/doc/efficiency_guide/commoncaveats.html#id52228
Adam Kocoloski [Wed, 20 Oct 2010 19:18:28 +0000 (15:18 -0400)]
let rebar manage the module lists
Brad Anderson [Sat, 28 Aug 2010 01:44:22 +0000 (21:44 -0400)]
add README.md back in for apps
Brad Anderson [Wed, 18 Aug 2010 02:04:15 +0000 (22:04 -0400)]
split some rexi utilities out from fabric
Adam Kocoloski [Fri, 27 Aug 2010 19:43:44 +0000 (15:43 -0400)]
Apache 2 license, Cloudant copyright when appropriate
Adam Kocoloski [Thu, 15 Jul 2010 01:18:57 +0000 (21:18 -0400)]
bah, missed a rename
Adam Kocoloski [Thu, 15 Jul 2010 01:13:48 +0000 (21:13 -0400)]
kill was looking up the wrong ref(), so it never found anything
Adam Kocoloski [Wed, 14 Jul 2010 20:32:29 +0000 (16:32 -0400)]
5 minute default timeout for sync_reply
Adam Kocoloski [Sat, 10 Jul 2010 19:50:17 +0000 (15:50 -0400)]
thank you dialyzer
Adam Kocoloski [Thu, 17 Jun 2010 14:18:29 +0000 (10:18 -0400)]
demonitor before killing the worker
Adam Kocoloski [Thu, 17 Jun 2010 14:17:40 +0000 (10:17 -0400)]
switch to ets for managing the workers
Adam Kocoloski [Fri, 11 Jun 2010 18:57:41 +0000 (14:57 -0400)]
update rexi.app to 1.2 to match dbcore 1.1.x numbering