23 months agoUpdating .auroraversion to release version 0.17.0. rel/0.17.0
Stephan Erb [Mon, 6 Feb 2017 17:54:24 +0000 (18:54 +0100)] 
Updating .auroraversion to release version 0.17.0.

23 months agoUpdating .auroraversion to 0.17.0-rc0. rel/0.17.0-rc0
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Updating .auroraversion to 0.17.0-rc0.

23 months agoIncrementing snapshot version to 0.18.0-SNAPSHOT.
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Incrementing snapshot version to 0.18.0-SNAPSHOT.

23 months agoUpdating CHANGELOG for 0.17.0 release.
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Updating CHANGELOG for 0.17.0 release.

23 months agoPrepare release notes for 0.17.0
Stephan Erb [Wed, 1 Feb 2017 08:03:22 +0000 (09:03 +0100)] 
Prepare release notes for 0.17.0

Reviewed at

23 months agoSuppress role deprecation warning as replacement is not yet ready.
David McLaughlin [Wed, 1 Feb 2017 07:38:54 +0000 (08:38 +0100)] 
Suppress role deprecation warning as replacement is not yet ready.

The role field was prematurely deprecated in the Mesos project.

Reviewed at

23 months agoFixed starting cron jobs when using default_docker_parameters
Steve Niemitz [Tue, 31 Jan 2017 17:18:42 +0000 (18:18 +0100)] 
Fixed starting cron jobs when using default_docker_parameters

The code was previously attempting to re-sanitize the configuration read from
storage rather than just using it as is.  This causes issues if after
sanitization the job no longer passes sanitization (which is the case here w/

We've been running this in our branch forever.

Bugs closed: AURORA-1684

Reviewed at

23 months agoFix flapping TestRunnerKillProcessGroup test.
Stephan Erb [Mon, 30 Jan 2017 20:40:14 +0000 (21:40 +0100)] 
Fix flapping TestRunnerKillProcessGroup test.

The test was working when run in isolation, but failed when executing the
entire Thermos test suite.

Bugs closed: AURORA-1809

Reviewed at

23 months agoFix pendingTasks endpoint in case of multiple TaskGroups per job.
Stephan Erb [Mon, 30 Jan 2017 20:37:59 +0000 (21:37 +0100)] 
Fix pendingTasks endpoint in case of multiple TaskGroups per job.

The central idea of this patch is to change the return value of `getPendingReasons`
from a map keyed by `JobKey` to a map keyed by `TaskGroupKey`. This prevents the
`IllegalArgumentException` during the map construction.
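
The key collision can be illustrated with a small Python sketch. It assumes a map builder that rejects duplicate keys; `build_strict_map`, the tuple-shaped keys and the reason strings are hypothetical stand-ins, not Aurora code.

```python
def build_strict_map(pairs):
    """Build a dict, raising ValueError when a key appears twice."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError("duplicate key: %r" % (key,))
        result[key] = value
    return result


# Two task groups that belong to the same job (hypothetical data).
job = ("www-data", "prod", "hello")
groups = [
    ((job, "config-a"), "Insufficient: CPU"),
    ((job, "config-b"), "Insufficient: RAM"),
]

# Keyed by job only: the second group collides with the first.
try:
    build_strict_map((group_key[0], reason) for group_key, reason in groups)
    collided = False
except ValueError:
    collided = True

# Keyed by the full task group key: both entries survive.
by_group = build_strict_map(groups)
```

Keying by the finer-grained identifier keeps one entry per task group, so the builder never sees the same key twice.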

Bugs closed: AURORA-1879

Reviewed at

23 months agoMove deprecated resource validations so they happen after the thrift backfill.
Nicolás Donatucci [Mon, 30 Jan 2017 19:18:59 +0000 (11:18 -0800)] 
Move deprecated resource validations so they happen after the thrift backfill.

As the validations for NumCpus, RamMb and DiskMb happened before the thrift
backfill, those values needed to be set, even though they are deprecated. In the
thrift backfill, if the Resources field is set, then NumCpus, RamMb and DiskMb
are set accordingly.

So by moving those validations, it is now possible to only set the Resources
field instead of having to set the deprecated fields. As the validations are
moved and not removed, the check for the resource values being greater than 0
still happens. Furthermore, if the Resources field is set but there is no
Resource for Ram in the set, the thrift backfill will throw an

Some tests were slightly modified because of this, mostly by adding an
unsetResources() operation. This is because as the validations now happen after
the thrift backfill, during the thrift backfill the values in the deprecated
fields are replaced by those in the Resources field (if it is set). There are
also some new tests.

Related Issue: AURORA-1707

Testing Done:

Reviewed at

23 months agoExpose Thrift server request workload stats
Mehrdad Nurolahzade [Mon, 30 Jan 2017 13:00:15 +0000 (14:00 +0100)] 
Expose Thrift server request workload stats

This patch introduces a number of stats that measure the workload generated by
Thrift server requests.

Current Thrift server stats expose the number and timing of requests received
by the server. However, they fail to reflect the size of the requests. This is
limiting us in having an accurate view of the workload handled by the scheduler.
For example, every call to `restartShards()` is recorded as one event despite
the fact that a request might only restart one shard while another request might
seek to restart 1K shards.
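
A rough Python sketch of the difference between counting requests and counting workload. The stat names and the `record_restart_shards` helper are assumptions made for illustration, not the names this patch introduces.

```python
from collections import Counter

stats = Counter()


def record_restart_shards(shard_ids):
    """Record one request event plus its workload (number of shards touched)."""
    stats["restart_shards_events"] += 1
    stats["restart_shards_workload"] += len(shard_ids)


record_restart_shards([0])          # restarts a single shard
record_restart_shards(range(1000))  # restarts 1K shards
```

Both calls count as one event each, but the workload stat separates the cheap request from the expensive one.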

Bugs closed: AURORA-1826

Reviewed at

23 months agoPreemption performance improvement and new metrics release notes entry
Mehrdad Nurolahzade [Sat, 28 Jan 2017 09:48:58 +0000 (10:48 +0100)] 
Preemption performance improvement and new metrics release notes entry

Reviewed at

23 months agoCapture health check output.
Dmitriy Shirchenko [Wed, 25 Jan 2017 21:21:37 +0000 (13:21 -0800)] 
Capture health check output.

Users could really benefit from seeing the output of a failed shell health
check, so this change plumbs that output through.

Testing Done:
added unit tests
e2e tests
screenshot attached.

Bugs closed: AURORA-1881

Reviewed at

23 months agoExpose finer grained offer veto stats
Mehrdad Nurolahzade [Wed, 25 Jan 2017 19:26:56 +0000 (13:26 -0600)] 
Expose finer grained offer veto stats

Bugs closed: AURORA-1835

Reviewed at

23 months agoConsider reserving for multiple tasks per preemption round
Mehrdad Nurolahzade [Tue, 24 Jan 2017 18:20:37 +0000 (19:20 +0100)] 
Consider reserving for multiple tasks per preemption round

To be fair, PendingTaskProcessor interleaves tasks from different groups.
However, this fairness comes at the price of increasing reservation time.
Even if reservations are being made for the same task group, the processor
would still restart iterating through slaves for each task instance. This
results in reevaluating all slaves already rejected in a previous search
before it finds a new viable candidate.

This patch improves `PendingTaskProcessor` performance by reducing slave
search/evaluation time, at the cost of reduced fairness.
`PendingTaskProcessor` now does reservation for a configurable maximum of
_N_ candidates per task group in each iteration over the list of slaves.
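
The single-pass idea can be sketched in Python as follows. It assumes all tasks of a group share one configuration, so a slave rejected for one task is rejected for all of them; `reserve_for_group`, `fits` and `max_reservations` are hypothetical names, not the scheduler's API.

```python
def reserve_for_group(pending_tasks, slaves, fits, max_reservations):
    """Reserve slots for up to max_reservations tasks in one pass over slaves.

    Returns a dict mapping task -> slave. Each slave is evaluated at most
    once, instead of restarting the iteration for every task instance.
    """
    reservations = {}
    candidates = list(pending_tasks)[:max_reservations]
    for slave in slaves:
        if not candidates:
            break
        if fits(candidates[0], slave):
            reservations[candidates.pop(0)] = slave
    return reservations


slaves = ["s1", "s2", "s3", "s4"]
only_even = lambda task, slave: slave in ("s2", "s4")
reserved = reserve_for_group(["t0", "t1", "t2"], slaves, only_even, 2)
# reserved == {"t0": "s2", "t1": "s4"}
```

With a cap of 2, the third task waits for a later round, which is the fairness trade-off the commit message describes.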

Bugs closed: AURORA-1867

Reviewed at

23 months agoEvaluate multiple preemption proposals per round
Mehrdad Nurolahzade [Tue, 24 Jan 2017 16:07:09 +0000 (17:07 +0100)] 
Evaluate multiple preemption proposals per round

`TaskScheduler` makes an attempt to preempt already identified candidates
through `Preemptor` when it fails to schedule one or more tasks. However,
`Preemptor` currently evaluates only one proposal per invocation. A proposal
may get vetoed at this point by scheduling filters. If a proposal fails
validation the task group might get penalized by `TaskGroups` to give
`PendingTaskProcessor` some time to find new preemption candidates; despite
the fact that another proposal may already exist in `slotCache`. This penalty
might result in expiration of existing proposals in `slotCache`, hence slowing
down the overall preemption process.

This patch modifies `Preemptor` so that it evaluates all existing preemption
proposals before giving up.
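
In Python terms the change amounts to something like the sketch below. `attempt_preemption` and `is_valid` are hypothetical names; the real `Preemptor` works on slot cache entries and scheduling filters.

```python
def attempt_preemption(proposals, is_valid):
    """Return the first proposal that passes validation, or None.

    Every cached proposal is tried before giving up, instead of
    stopping after the first veto.
    """
    for proposal in proposals:
        if is_valid(proposal):
            return proposal
    return None


cached = ["slot-a", "slot-b", "slot-c"]
chosen = attempt_preemption(cached, lambda p: p == "slot-c")
# chosen == "slot-c", although the first two proposals were vetoed
```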

Bugs closed: AURORA-1868

Reviewed at

23 months agoMake leader elections resilient to ZK disconnections.
Zameer Manji [Mon, 23 Jan 2017 22:38:56 +0000 (14:38 -0800)] 
Make leader elections resilient to ZK disconnections.

As documented in AURORA-1840 the Curator `LeaderLatch` recipe abdicates
leadership if the ZK connection is lost or if there is a timeout. This is not
compatible with the commons based implementation which would only abdicate
leadership if the ZK session timeout occurred.

This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
custom listener that only loses leadership if a connection loss occurs.

Bugs closed: AURORA-1669

Reviewed at

23 months agoAURORA-1876 Expose stats on scheduler rate limiter
Mehrdad Nurolahzade [Mon, 23 Jan 2017 20:58:19 +0000 (14:58 -0600)] 
AURORA-1876 Expose stats on scheduler rate limiter

This patch exposes stats on `rateLimiter.acquire()` blocking events in `TaskGroups`, providing
visibility into whether the scheduling rate is above or below `MAX_SCHEDULE_ATTEMPTS_PER_SEC`.

Bugs closed: AURORA-1876

Reviewed at

23 months agoAURORA-1828 Expose stats on the number of offers evaluated before a task is assigned
Mehrdad Nurolahzade [Mon, 23 Jan 2017 20:56:17 +0000 (14:56 -0600)] 
AURORA-1828 Expose stats on the number of offers evaluated before a task is assigned

Bugs closed: AURORA-1828

Reviewed at

23 months agoFix command escaping when using the Mesos containerizer.
Stephan Erb [Mon, 23 Jan 2017 07:38:52 +0000 (08:38 +0100)] 
Fix command escaping when using the Mesos containerizer.

The important bit is the change to call the Mesos containerizer with
`shell=False`. Getting rid of manual json encoding and eliminating shlex
 might have helped as well, but was more motivated by clarity rather than

Bugs closed: AURORA-1782
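
Why `shell=False` sidesteps escaping can be shown with a generic Python sketch. It uses the Python interpreter as a stand-in child process and is not the actual containerizer invocation.

```python
import subprocess
import sys

# An argument full of shell metacharacters.
tricky = 'echo "$HOME"; ls *'

# With shell=False the argument list is handed to the child process as is,
# so no quoting or escaping layer is involved.
output = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", tricky],
    shell=False,
)
received = output.decode("utf-8").rstrip("\r\n")
```

The child sees the argument byte for byte, which is what a manually escaped shell string tries (and often fails) to guarantee.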

Reviewed at

23 months agoMake announced scheduler endpoint name configurable.
Stephan Erb [Wed, 18 Jan 2017 09:25:54 +0000 (10:25 +0100)] 
Make announced scheduler endpoint name configurable.

We decided to co-deploy an HTTPS enabled reverse proxy in front of each of our
Aurora schedulers. The proxy instances bind to `public_ip:8081` and the
schedulers to `localhost:8081`. By announcing the scheduler endpoint as `https`
we can ensure the default Aurora [client connects via HTTPS](


    [zk: 5] get /aurora/scheduler/member_0000000011

When running with `-serverset_endpoint_name=https`:

    [zk: 0] get /aurora/scheduler/member_0000000019

Bugs closed: AURORA-343

Reviewed at

23 months agoEnsure Aurora thrift supports js and html.
John Sirois [Tue, 17 Jan 2017 22:34:39 +0000 (15:34 -0700)] 
Ensure Aurora thrift supports js and html.

We use these for the Aurora UI and the API docs.

Bugs closed: AURORA-1875

Reviewed at

2 years agoImprove `thriftw` robustness.
John Sirois [Tue, 17 Jan 2017 21:20:15 +0000 (14:20 -0700)] 
Improve `thriftw` robustness.

Now the selected thrift is checked both for the proper version and
support of the gen langs Aurora requires. In addition, all thrifts on
the `PATH` are checked and an existing locally built thrift is always verified
to protect against Aurora thrift requirement changes (if we ever add a gen lang

Bugs closed: AURORA-1875

Reviewed at

2 years agoLog process sampling failures with debug severity
Stephan Erb [Tue, 17 Jan 2017 20:53:18 +0000 (21:53 +0100)] 
Log process sampling failures with debug severity

The observer's logs consist of lots of warnings about being unable to find PIDs.
This is expected when running with the PID isolator, or when checkpoints are out
of date (e.g. after processes were killed by the OOM).

    W0116 14:42:54.694221 3253] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=27727)
    W0116 14:42:54.717905 3253] Error during process sampling [pid=10960]: psutil.NoSuchProcess process no longer exists (pid=10960)
    W0116 14:42:54.718089 3253] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=10960)
    W0116 14:42:54.718245 3253] Error during process sampling [pid=10026]: psutil.NoSuchProcess process no longer exists (pid=10026)
    W0116 14:42:54.718334 3253] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=10026)

This change adopts the proposal of David Robinson to decrease the severity level
to debug.

Bugs closed: AURORA-1541

Reviewed at

2 years agoExposed stats on number of offers rescinded and number of slaves lost.
Pradyumna Kaushik [Fri, 13 Jan 2017 21:09:17 +0000 (13:09 -0800)] 
Exposed stats on number of offers rescinded and number of slaves lost.

Testing Done:
curl -w '\n' | grep offers_rescinded
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
offers_rescinded 0

curl -w '\n' | grep slaves_lost
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30970    0 30970    0     0  4323k      0 --:--:-- --:--:-- --:--:-- 5040k
slaves_lost 0


Reviewed at

2 years agoExpose stats on SlotSizeCounter runs.
Mehrdad Nurolahzade [Fri, 13 Jan 2017 19:43:56 +0000 (11:43 -0800)] 
Expose stats on SlotSizeCounter runs.

Bugs closed: AURORA-1874

Reviewed at

2 years agoExpose stats on statically banned offers
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:18:43 +0000 (23:18 +0100)] 
Expose stats on statically banned offers

Bugs closed: AURORA-1859

Reviewed at

2 years agoEliminate sequential scan in MemTaskStore.getJobKeys()
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:17:34 +0000 (23:17 +0100)] 
Eliminate sequential scan in MemTaskStore.getJobKeys()

If the scheduler is configured to run with the `MemTaskStore`, every hit on the scheduler
landing page (`/scheduler`) causes a call to `MemTaskStore.getJobKeys()` through

The implementation of `MemTaskStore.getJobKeys()` is currently very inefficient
as it requires a sequential scan of the task store, mapping each task to its
job key. In Twitter clusters this method currently takes half a
second per call (`mem_storage_get_job_keys`).

This patch eliminates the sequential scan and mapping to job key by simply
returning an immutable copy of the key set of the existing secondary index `job`.
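
A minimal Python sketch of the two approaches. `MemTaskStoreSketch` and its fields are hypothetical and only mirror the idea of a secondary index keyed by job.

```python
class MemTaskStoreSketch:
    def __init__(self):
        self._tasks = {}      # task_id -> job_key
        self._job_index = {}  # job_key -> set of task_ids (secondary index)

    def save(self, task_id, job_key):
        self._tasks[task_id] = job_key
        self._job_index.setdefault(job_key, set()).add(task_id)

    def get_job_keys_scan(self):
        # Old approach: visit every task and map it to its job key.
        return frozenset(self._tasks.values())

    def get_job_keys(self):
        # New approach: immutable copy of the index's key set.
        return frozenset(self._job_index)


store = MemTaskStoreSketch()
store.save("task-1", "prod/hello")
store.save("task-2", "prod/hello")
store.save("task-3", "prod/world")
```

The indexed variant touches one entry per job rather than one per task, so its cost no longer grows with the number of tasks.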

Bugs closed: AURORA-1847

Reviewed at

2 years agoExpose stats on deleted job updates in JobUpdateHistoryPruner
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:15:03 +0000 (23:15 +0100)] 
Expose stats on deleted job updates in JobUpdateHistoryPruner

Bugs closed: AURORA-1856

Reviewed at

2 years agoReduce logging by ChainedStatusChecker and StatusManager when they're on the happy...
Joshua Cohen [Wed, 11 Jan 2017 22:19:49 +0000 (16:19 -0600)] 
Reduce logging by ChainedStatusChecker and StatusManager when they're on the happy path.

Bugs closed: AURORA-1878

Reviewed at

2 years agoClean up instances of loggers using a logger name from another class.
Bing-Qian Luan [Wed, 11 Jan 2017 16:00:32 +0000 (10:00 -0600)] 
Clean up instances of loggers using a logger name from another class.

Bugs closed: AURORA-1873

Reviewed at

2 years agoExpose stats on ZooKeeper connection state
Jing Chen [Tue, 10 Jan 2017 22:35:21 +0000 (23:35 +0100)] 
Expose stats on ZooKeeper connection state

* zk_connection_state_STATE shows 1 if STATE is the current connection state, otherwise 0.
* zk_connection_state_STATE_counter represents the number of occurrences of STATE since scheduler state
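
The two stat families can be sketched in Python like this. The list of states is assumed to match Curator's `ConnectionState` values, and `ConnectionStateStats` is a hypothetical helper, not the scheduler's class.

```python
from collections import Counter

# Assumed set of connection states.
STATES = ("CONNECTED", "SUSPENDED", "RECONNECTED", "LOST", "READ_ONLY")


class ConnectionStateStats:
    def __init__(self):
        self.current = None
        self.counters = Counter()

    def on_state_change(self, state):
        self.current = state
        self.counters[state] += 1

    def snapshot(self):
        stats = {}
        for state in STATES:
            # Gauge: 1 for the current state, 0 for all others.
            stats["zk_connection_state_%s" % state] = int(state == self.current)
            # Counter: how often this state has been entered so far.
            stats["zk_connection_state_%s_counter" % state] = self.counters[state]
        return stats


zk_stats = ConnectionStateStats()
for state in ("CONNECTED", "SUSPENDED", "RECONNECTED", "SUSPENDED"):
    zk_stats.on_state_change(state)
snapshot = zk_stats.snapshot()
```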

Bugs closed: AURORA-1838

Reviewed at

2 years agoEnsure destination exists when mounting files into a filesystem image.
Joshua Cohen [Tue, 10 Jan 2017 22:11:54 +0000 (16:11 -0600)] 
Ensure destination exists when mounting files into a filesystem image.

When testing filesystem isolation internally, we ran into an issue where mounting a regular file
into the task filesystem failed with exit code 32 since the mount destination did not exist. To
account for this, we'll touch an empty file in the taskfs.
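
A minimal Python sketch of the "touch the destination first" step. The `ensure_file_destination` helper and its arguments are hypothetical, not the executor's actual function.

```python
import os


def ensure_file_destination(taskfs_root, destination):
    """Create an empty file (and parent dirs) at destination inside taskfs_root."""
    target = os.path.join(taskfs_root, destination.lstrip("/"))
    parent = os.path.dirname(target)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    # Equivalent of `touch`: create the file if missing, keep existing content.
    with open(target, "a"):
        pass
    return target
```

Once the placeholder exists, mounting a regular file onto it has a valid destination.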

Reviewed at

2 years agoReduce storage write lock contention by adopting Double-Checked Locking pattern in
Mehrdad Nurolahzade [Wed, 4 Jan 2017 21:50:46 +0000 (15:50 -0600)] 
Reduce storage write lock contention by adopting Double-Checked Locking pattern in

`TimedOutTaskHandler` acquires the storage write lock for every task, every time one transitions to a
transient state. After a default time-out period of 5 minutes, it verifies whether the task has
transitioned out of the transient state.

The verification step takes place while holding the storage write lock. In over 99% of cases the
logic short-circuits and returns from `StateManagerImpl.updateTaskAndExternalState()` once it learns
that the task has transitioned out of the transient state.
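
The Double-Checked Locking pattern named in the commit title can be sketched in Python as below. `TimeoutHandlerSketch`, the state names and the `LOST` target state are assumptions for illustration only.

```python
import threading


class TimeoutHandlerSketch:
    def __init__(self, task_states):
        self._states = task_states  # task_id -> state
        self._write_lock = threading.Lock()
        self.lock_acquisitions = 0

    def on_timeout(self, task_id, transient_state):
        # First check, without the lock: most tasks have already moved on.
        if self._states.get(task_id) != transient_state:
            return False
        with self._write_lock:
            self.lock_acquisitions += 1
            # Second check, under the lock: the state may have changed
            # between the first check and acquiring the lock.
            if self._states.get(task_id) != transient_state:
                return False
            self._states[task_id] = "LOST"
            return True


states = {"t1": "RUNNING", "t2": "ASSIGNED"}
handler = TimeoutHandlerSketch(states)
moved_on = handler.on_timeout("t1", "ASSIGNED")   # no lock taken
timed_out = handler.on_timeout("t2", "ASSIGNED")  # lock taken once
```

Only tasks that still look stuck pay for the write lock; the common case returns after a cheap read.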

This patch reduces storage write lock contention by adopting Double-Checked Locking pattern in

Bugs closed: AURORA-1820

Reviewed at

2 years agoExpose stats on undelivered event bus events
Mehrdad Nurolahzade [Tue, 27 Dec 2016 22:32:26 +0000 (23:32 +0100)] 
Expose stats on undelivered event bus events

Bugs closed: AURORA-1834

Reviewed at

2 years agoExpose stats on JobUpdateAction transitions
Mehrdad Nurolahzade [Tue, 27 Dec 2016 13:19:40 +0000 (14:19 +0100)] 
Expose stats on JobUpdateAction transitions

Introduced new stats that expose `JobUpdateAction` transitions.

Refactored away from `CachedCounters` for existing metric; it was dynamically
generating new String objects (through concatenation) per stats collection event.

Fixed for a mistake in a previous changeset (;
removed unnecessary checked `Exception` on `CacheLoader.load()`.

Bugs closed: AURORA-1851

Reviewed at

2 years agoExpose timing stats on PendingTaskProcessor runs
Mehrdad Nurolahzade [Tue, 27 Dec 2016 11:49:58 +0000 (12:49 +0100)] 
Expose timing stats on PendingTaskProcessor runs

Bugs closed: AURORA-1857

Reviewed at

2 years agoUpdate to Mesos 1.1.0.
Stephan Erb [Tue, 27 Dec 2016 11:36:44 +0000 (12:36 +0100)] 
Update to Mesos 1.1.0.

Included changes:

* Handle new task states introduced in the latest Mesos release.
* Prevent NullPointer exception when inspecting an empty/invalid executor config in a test.
  Probably this is due to a change in the Mesos protobufs.
* Fix bug preventing the teardown of Vagrant boxes started by the egg build.
* Increase resources for the Mesos egg builds. The build for all distributions now takes 2h in total.

Full Mesos changelog:;a=blob_plain;f=CHANGELOG;hb=1.1.0

Bugs closed: AURORA-1813

Reviewed at

2 years agoExpose ResponseCode stats on Thrift server calls
Mehrdad Nurolahzade [Fri, 23 Dec 2016 10:08:29 +0000 (02:08 -0800)] 
Expose ResponseCode stats on Thrift server calls

Bugs closed: AURORA-1848

Reviewed at

2 years agoExpose stats on deleted tasks in TaskHistoryPruner
Mehrdad Nurolahzade [Fri, 23 Dec 2016 10:07:08 +0000 (02:07 -0800)] 
Expose stats on deleted tasks in TaskHistoryPruner

Bugs closed: AURORA-1855

Reviewed at

2 years agoAURORA-1842 Expose stats on garbage collected rows in RowGarbageCollector
Mehrdad Nurolahzade [Thu, 22 Dec 2016 10:55:51 +0000 (02:55 -0800)] 
AURORA-1842 Expose stats on garbage collected rows in RowGarbageCollector

Bugs closed: AURORA-1842

Reviewed at

2 years agoRemove ignored snapshot stats. Add high-level timings on storage start-up lifecycle.
David McLaughlin [Mon, 19 Dec 2016 18:43:10 +0000 (10:43 -0800)] 
Remove ignored snapshot stats. Add high-level timings on storage start-up lifecycle.

Reviewed at

2 years agoAvoid double writing job updates to the Scheduler Snapshot
David McLaughlin [Thu, 15 Dec 2016 21:16:23 +0000 (13:16 -0800)] 
Avoid double writing job updates to the Scheduler Snapshot

Motivation: Thanks to the mybatis query metrics we added, we found that double writing Snapshot fields for H2 stores adds considerable overhead to our snapshot creation time.

Snapshots are also written as backups, and many operators choose to process backups offline for analytics, rather than query the live scheduler (due to not being able to scale reads horizontally). So this allows operators to enable/disable the hydrated fields as needed.

Bugs closed: AURORA-1861

Reviewed at

2 years agoAdd finer grained timings to the Snapshot process. I also added some log output,...
David McLaughlin [Thu, 15 Dec 2016 18:08:54 +0000 (10:08 -0800)] 
Add finer grained timings to the Snapshot process. I also added some log output, as I found those existing numbers handy when investigating our long snapshot times.

Related ticket:

Reviewed at

2 years agoFix thrift bootstrap to use python2.7.
Joshua Cohen [Mon, 12 Dec 2016 17:33:39 +0000 (11:33 -0600)] 
Fix thrift bootstrap to use python2.7.

Reviewed at

2 years agoFixup to work under modern bash.
John Sirois [Fri, 9 Dec 2016 04:17:13 +0000 (22:17 -0600)] 
Fixup to work under modern bash.

Previously pushd/popd were used and these emit data to stdout muddying
pants output and breaking setup of the thrift serve dir structure.

 build-support/thrift/ | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed at

2 years agoClean up the python prepare_binary script:
Joshua Cohen [Fri, 9 Dec 2016 02:31:31 +0000 (20:31 -0600)] 
Clean up the python prepare_binary script:

- Specify python2.7 when extracting options from pants.ini during thrift bootstrap.
- We must run pants from the repo root.

Reviewed at

2 years agoRevert BUILD changes in 0c177058.
John Sirois [Thu, 8 Dec 2016 04:55:23 +0000 (23:55 -0500)] 
Revert BUILD changes in 0c177058.

The changes caused `python_tests` targets to lose their sources, which in
turn caused tests not to run.

Bugs closed: AURORA-1853

Reviewed at

2 years agoExtend warm-up time by `max_consecutive_failures` attempts.
Santhosh Kumar Shanmugham [Thu, 8 Dec 2016 02:40:59 +0000 (20:40 -0600)] 
Extend warm-up time by `max_consecutive_failures` attempts.

It is possible to configure health checks such that a task can
continually fail health checks with intermittent successes and still
pass an update. Essentially, a task fails health checks during
`initial_interval_secs` and for an additional `max_consecutive_failures`
attempts, and then performs a successful health check to become healthy.

To stay backward compatible with such configurations, include
`max_consecutive_failures` when computing `max_attempts_to_running`.

Bugs closed: AURORA-1841

Reviewed at

2 years agoGet pants using the same thrift binary as gradle.
John Sirois [Thu, 8 Dec 2016 02:11:21 +0000 (19:11 -0700)] 
Get pants using the same thrift binary as gradle.

This required an upgrade to the latest pants dev release to correct
an issue with the binary packager we use to generate sdists.

This is for sanity's sake, and, once the TODO in
`build-support/thrift/` is addressed, it will also
allow pants-patch-free addition of new platforms (thinking ARM).

Reviewed at

2 years agoFix invalid logging that was causing pmd to NPE.
Joshua Cohen [Tue, 6 Dec 2016 18:59:23 +0000 (12:59 -0600)] 
Fix invalid logging that was causing pmd to NPE.

Reviewed at

2 years agoImprove scheduling throughput via logging changes.
Zameer Manji [Fri, 2 Dec 2016 22:13:08 +0000 (14:13 -0800)] 
Improve scheduling throughput via logging changes.

This patch makes two logging performance changes.

First, it reduces the cost of logging by replacing the costly class and line
patterns with the cheaper logger pattern. We lose line numbers and inner class
information for much cheaper logging.

Before:

I1201 15:08:40.560 [AsyncProcessor-0, StateMachine$Builder:389] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE

After:

I1201 15:06:47.181 [AsyncProcessor-0, StateMachine] SchedulerLifecycle state machine transition LEADER_AWAITING_REGISTRATION -> ACTIVE

Second, it reduces the verbosity of the `TaskStateMachine` logging where it logs
the work command from the transition. I don't think there is any operator value
in logging this (unlike the task state transitions) so I have lowered it to

Performance Before:

Benchmark  (numPendingTasks)  (numTasksToDelete)   Mode  Cnt  Score   Error  Units
                         N/A                1000  thrpt   10  2.510 ± 0.557  ops/s
                         N/A               10000  thrpt   10  0.272 ± 0.030  ops/s
                         N/A               50000  thrpt   10  0.053 ± 0.011  ops/s
                        1000                 N/A  thrpt   10  2.446 ± 0.698  ops/s
                       10000                 N/A  thrpt   10  0.246 ± 0.018  ops/s
                       50000                 N/A  thrpt   10  0.041 ± 0.006  ops/s

Performance After:

Benchmark  (numPendingTasks)  (numTasksToDelete)   Mode  Cnt   Score   Error  Units
                         N/A                1000  thrpt   10  14.520 ± 5.696  ops/s
                         N/A               10000  thrpt   10   1.290 ± 0.361  ops/s
                         N/A               50000  thrpt   10   0.254 ± 0.097  ops/s
                        1000                 N/A  thrpt    5   7.303 ± 5.662  ops/s
                       10000                 N/A  thrpt    5   0.726 ± 0.624  ops/s
                       50000                 N/A  thrpt    5   0.124 ± 0.058  ops/s

There is a performance improvement in the smaller case and no noticeable
degradation in the larger cases. I also verified on a small cluster that the
improvements exist for the larger cases. I am unable to reduce the error bars
locally on the `InsertPendingTasksBenchmark`. I suspect the benchmark needs to be
tweaked to be more consistent.

Bugs closed: AURORA-1831

Reviewed at

2 years agoChanges to intercept and time mybatis invocations
Reza Motamedi [Fri, 2 Dec 2016 04:01:35 +0000 (22:01 -0600)] 
Changes to intercept and time mybatis invocations

MyBatis allows us to intercept calls within the execution of a mapped statement. This allows us to
time various mapped statements and ultimately gain more insight on the performance of the database

This patch introduces a MyBatis interceptor on `updates` and `query` mapped statements. I used
the following convention to name the newly collected stats:
mybatis.<<the id of the mapped statement>>

After interception the process is very similar to the one in @Timed-interceptor. SlidingStats can be
used to export interval averages, total milliseconds and the event counts.
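
The naming convention and the timing step can be sketched in Python as follows. `timed_statement` is a hypothetical wrapper, not the MyBatis plugin API; only the `_events` and `_nanos_total` suffixes are taken from the example stats in this message.

```python
import time
from collections import Counter

stats = Counter()


def timed_statement(statement_id, invoke):
    """Run invoke() and record the event count and elapsed nanoseconds."""
    start = time.perf_counter()
    try:
        return invoke()
    finally:
        elapsed_nanos = int((time.perf_counter() - start) * 1e9)
        stats["mybatis.%s_events" % statement_id] += 1
        stats["mybatis.%s_nanos_total" % statement_id] += elapsed_nanos


result = timed_statement("create_tables", lambda: "ok")
```

Recording in a `finally` block means failed statements are timed too, which is one possible design choice rather than a statement about the patch.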

__example stats (from ./vars.json)__
mybatis.create_tables_events 1
mybatis.create_tables_events_per_sec 0.0
mybatis.create_tables_nanos_per_event 0.0
mybatis.create_tables_nanos_total 592633784
mybatis.create_tables_nanos_total_per_sec 0.0 3 0.0 0.0 2858362 0.0 333 0.0 0.0 85745680 0.0

Reviewed at

2 years agoRevert removal of twitter/commons/zk based leadership code
David McLaughlin [Thu, 1 Dec 2016 17:01:33 +0000 (09:01 -0800)] 
Revert removal of twitter/commons/zk based leadership code

See discussion here:

Reviewed at

2 years agoSlaUtil::percentile percentiles interpolation
Reza Motamedi [Wed, 30 Nov 2016 22:45:06 +0000 (16:45 -0600)] 
SlaUtil::percentile percentiles interpolation

Modification to SlaUtil::percentile to compute percentiles by interpolation

The calculation of mttX (median-time-to-X) depends on the computation of percentile values. The
current implementation does not behave well with small sample sizes. For instance, for the
sample set {50, 150}, the 50th percentile is reported as 50, although 100 seems a more appropriate
return value.

This patch uses Guava's `Quantiles` for interpolation.
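
For illustration, linear interpolation between neighbouring samples can be written in a few lines of Python. This is a sketch of the general technique, not Guava's implementation or the `SlaUtil` code.

```python
def percentile(samples, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Fractional index into the sorted samples.
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


median = percentile([50, 150], 50)
# median == 100.0
```

For the sample set {50, 150} the fractional index is 0.5, so the result lies halfway between the two samples.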

Reviewed at

2 years agoUpdate PMD and checkstyle.
Stephan Erb [Wed, 30 Nov 2016 08:44:17 +0000 (09:44 +0100)] 
Update PMD and checkstyle.

This updates PMD to 5.5.2 and checkstyle to 7.3. The included code
change was reported by the latest PMD version, i.e. exceptions
will be picked up automatically. There is no need for a placeholder.



Reviewed at

2 years agoUse built-in string formatting capabilities of Preconditions.checkArgument.
Stephan Erb [Wed, 30 Nov 2016 08:43:01 +0000 (09:43 +0100)] 
Use built-in string formatting capabilities of Preconditions.checkArgument.

We can gain a little bit of performance if we only run the string formatting when necessary.
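
The effect of deferring the formatting can be shown with a Python analogue. `check_argument` and `CountingRepr` are hypothetical helpers that mimic the template-plus-arguments style; they are not Guava's API.

```python
class CountingRepr:
    """Counts how often it is rendered to a string."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "expensive"


def check_argument(condition, template, *args):
    """Raise ValueError with a formatted message only if condition is false."""
    if not condition:
        raise ValueError(template % args)


eager = CountingRepr()
lazy = CountingRepr()

# Eager: the message is built even though the check passes.
check_argument(True, "Bad value: %s" % eager)

# Deferred: the template is only rendered on failure.
check_argument(True, "Bad value: %s", lazy)
```

On the passing path the deferred variant never renders its argument, which is where the saving comes from.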

Reviewed at

2 years agoAdded the 'reason' to the /pendingTasks endpoint
Pradyumna Kaushik [Mon, 28 Nov 2016 21:05:56 +0000 (15:05 -0600)] 
Added the 'reason' to the /pendingTasks endpoint

Bugs closed: AURORA-1762

Reviewed at

2 years agoTear down the observer in case of unhandled errors
Stephan Erb [Thu, 24 Nov 2016 15:37:51 +0000 (16:37 +0100)] 
Tear down the observer in case of unhandled errors

I was not able to manually trigger the root cause of AURORA-1801 by altering the Mesos filesystem layout. I have therefore adopted the general teardown idea.

Example output (using a hardcoded throw):

Bottle v0.11.6 server starting up (using CherryPyServer())...
Listening on
Hit Ctrl-C to quit.

E1106 23:03:36.722500 8699] Unhandled error in thread Thread-1 [TID=8705]. Tearing down.
Traceback (most recent call last):
  File "apache/thermos/common/", line 37, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/observer/", line 135, in run
  File "apache/thermos/observer/", line 74, in refresh
  File "apache/thermos/observer/", line 58, in _refresh_detectors
    new_paths = set(self._path_detector.get_paths())
  File "apache/aurora/executor/common/", line 35, in get_paths
    return list(set(path for path in iterate() if os.path.exists(path)))
  File "apache/aurora/executor/common/", line 35, in <genexpr>
    return list(set(path for path in iterate() if os.path.exists(path)))
  File "apache/aurora/executor/common/", line 34, in iterate
    raise RuntimeError("Fail on purpose...")
RuntimeError: Fail on purpose...
I1106 23:03:42.513900 8728] detecting assets...
I1106 23:03:42.541809 8728]   detected asset: observer.js
I1106 23:03:42.542799 8728]   detected asset: bootstrap.css
I1106 23:03:42.543728 8728]   detected asset: jquery.pailer.js
I1106 23:03:42.544576 8728]   detected asset: jquery.js
I1106 23:03:42.548482 8728]   detected asset: favicon.ico
Bottle v0.11.6 server starting up (using CherryPyServer())...
Listening on
Hit Ctrl-C to quit.

Bugs closed: AURORA-1801

Reviewed at

2 years agoAdd benchmarks for `StateManagerImpl`.
Zameer Manji [Wed, 23 Nov 2016 22:14:03 +0000 (14:14 -0800)] 
Add benchmarks for `StateManagerImpl`.

`StateManagerImpl` is in the middle of every task state transition in the
scheduler. Performance improvements here could yield scheduling throughput
improvements across the board. This adds benchmarks for the two bulk APIs,
inserting pending tasks and deleting tasks. Sample output:

Benchmark  (numPendingTasks)  (numTasksToDelete)   Mode  Cnt  Score   Error  Units
                         N/A                1000  thrpt   10  2.510 ± 0.557  ops/s
                         N/A               10000  thrpt   10  0.272 ± 0.030  ops/s
                         N/A               50000  thrpt   10  0.053 ± 0.011  ops/s
                        1000                 N/A  thrpt   10  2.446 ± 0.698  ops/s
                       10000                 N/A  thrpt   10  0.246 ± 0.018  ops/s
                       50000                 N/A  thrpt   10  0.041 ± 0.006  ops/s

Reviewed at

2 years agoFilter out calls to fromResource for resources that Aurora does not support yet to...
Renan DelValle [Wed, 23 Nov 2016 12:08:51 +0000 (13:08 +0100)] 
Filter out calls to fromResource for resources that Aurora does not support yet to avoid crashing

Added filters whenever fromResource is called for a Protos.Resource in order to avoid Aurora crashing.
Previously only bagFromMesosResources was using the SUPPORTED_RESOURCE filter.
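
The filtering idea can be sketched in Python as follows. The supported set, the dict-shaped resources and the function names are hypothetical stand-ins for the Java code.

```python
# Assumed set of resource names the converter understands.
SUPPORTED_RESOURCES = {"cpus", "mem", "disk", "ports"}


def from_resource(resource):
    """Convert a resource, failing loudly on unknown resource names."""
    name = resource["name"]
    if name not in SUPPORTED_RESOURCES:
        raise ValueError("unsupported resource: %s" % name)
    return (name, resource["value"])


def bag_from_resources(resources):
    """Filter out unsupported resources before converting the rest."""
    supported = (r for r in resources if r["name"] in SUPPORTED_RESOURCES)
    return dict(from_resource(r) for r in supported)


offer = [
    {"name": "cpus", "value": 4.0},
    {"name": "gpus", "value": 1.0},
    {"name": "mem", "value": 1024.0},
]
bag = bag_from_resources(offer)
# bag == {"cpus": 4.0, "mem": 1024.0}
```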

Reviewed at

2 years agoFix performance regression in AttributeAggregate performance.
Stephan Erb [Tue, 22 Nov 2016 18:33:26 +0000 (19:33 +0100)] 
Fix performance regression in AttributeAggregate performance.

This commit ensures AttributeAggregate will only be computed if needed by
limit constraints. This is the case in 0.16 but broken on master since the
introduction of scheduling attempts with multiple tasks.
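
"Only computed if needed" can be sketched in Python with a memoizing wrapper. `LazyAggregate`, `schedule` and the dict-shaped task are hypothetical; the scheduler's own types differ.

```python
class LazyAggregate:
    """Computes the aggregate on first use and caches the result."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._done = False
        self.computations = 0

    def get(self):
        if not self._done:
            self._cached = self._compute()
            self._done = True
            self.computations += 1
        return self._cached


def schedule(task, aggregate):
    limit = task.get("limit")
    # Tasks without a limit constraint never touch the aggregate.
    if limit is None:
        return "scheduled"
    active = aggregate.get().get(limit["value"], 0)
    return "scheduled" if active < limit["max"] else "vetoed"


aggregate = LazyAggregate(lambda: {"rack-1": 1})
first = schedule({"limit": None}, aggregate)
computed_after_first = aggregate.computations
second = schedule({"limit": {"value": "rack-1", "max": 1}}, aggregate)
```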

In order to better model the latter, this patch also updates the
benchmarks to schedule multiple tasks per scheduleTask call.

Without the fix:
SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  404.446 ± 31.252  ops/s
SchedulingBenchmarks.FillClusterBenchmark.runBenchmark  thrpt   10  7.233 ± 3.058  ops/s

With the fix:
SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  432.245 ± 16.963  ops/s
SchedulingBenchmarks.FillClusterBenchmark.runBenchmark  thrpt   10  87.560 ± 14.600  ops/s

Bugs closed: AURORA-1802

Reviewed at

2 years agoAdopt built-in string formatting in Preconditions.checkState and our logger.
Stephan Erb [Mon, 21 Nov 2016 09:45:56 +0000 (10:45 +0100)] 
Adopt built-in string formatting in Preconditions.checkState and our logger.

Inspired by this replaces many usages of
`String.format` with the built-in formatting in `Preconditions.checkState` and
our logger. This has the advantage that the formatting is only done when
necessary. A couple of other usages are replaced with `String.join` or simple
string concatenation which tends to be faster than the more powerful
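
The deferred-formatting idea can be illustrated with a minimal sketch. It is
written in Python with the standard `logging` module as an analogy, not with
the Java `Preconditions.checkState` or logger calls this commit touches. The
`Expensive` class and its call counter are hypothetical and exist only to make
the formatting cost observable.

```python
import logging


class Expensive:
    """Hypothetical value whose string conversion should be avoided when unused."""

    calls = 0

    def __str__(self):
        Expensive.calls += 1
        return "expensive"


logger = logging.getLogger("lazy_format_demo")
logger.setLevel(logging.WARNING)  # DEBUG records are discarded

value = Expensive()

# Eager: the message is built before the logger decides to drop the record.
logger.debug("value=%s" % value)
eager_calls = Expensive.calls

# Deferred: the arguments are only formatted if the record is actually emitted.
logger.debug("value=%s", value)
deferred_calls = Expensive.calls - eager_calls
```

With the logger set to WARNING, only the eager call pays for the string
conversion.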

Reviewed at

2 years agoSpeed up preemption by eliminating costly string formatting.
Stephan Erb [Sun, 20 Nov 2016 17:40:53 +0000 (18:40 +0100)] 
Speed up preemption by eliminating costly string formatting.

SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark                  1  thrpt   10  16.011 ± 0.323  ops/s
SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark                 10  thrpt   10  15.141 ± 3.774  ops/s
SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark                100  thrpt   10  16.178 ± 3.710  ops/s
SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark               1000  thrpt   10  16.358 ± 0.540  ops/s

SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark                  1  thrpt   10  41.832 ± 1.412  ops/s
SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark                 10  thrpt   10  46.049 ± 3.435  ops/s
SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark                100  thrpt   10  44.199 ± 1.272  ops/s
SchedulingBenchmarks.PreemptorSlotSearchBenchmark.runBenchmark               1000  thrpt   10  38.722 ± 0.773  ops/s

Reviewed at

2 years agoAdd benchmark for progressively filling a cluster.
Stephan Erb [Thu, 17 Nov 2016 23:10:14 +0000 (00:10 +0100)] 
Add benchmark for progressively filling a cluster.

The existing benchmarks were only exercising the code paths for
mismatching offers but not for successfully launched tasks. This new
benchmark fills this gap.

In addition, this commit changes the default from DBTaskStore to
MemTaskStore for all scheduling benchmarks. Without the switch all
scheduling actions will be dominated by task store operations. This
can (and has in the past) prevent the discovery of performance

MemTaskStore (now the default):
SchedulingBenchmarks.FillClusterBenchmark.runBenchmark  thrpt   10 4.912 ± 1.790  ops/s

DBTaskStore (former default):
SchedulingBenchmarks.FillClusterBenchmark.runBenchmark  thrpt   10 0.418 ± 0.076  ops/s

Related bug: AURORA-1802

Reviewed at

2 years agoChange job updates to rely on `health-checks` rather than on `watch_secs`.
Santhosh Kumar Shanmugham [Thu, 17 Nov 2016 21:59:35 +0000 (13:59 -0800)] 
Change job updates to rely on `health-checks` rather than on `watch_secs`.

Make RUNNING a first class state to indicate that the task is running
and is healthy. It is achieved by introducing a new configuration
parameter `min_consecutive_successes`, which will dictate when to move
a task into RUNNING state.

With this change, it is possible to set the `watch_secs` to 0, so that
updates are purely based on the task's health, rather than relying on
watching the task in RUNNING state for a pre-determined timeout.

Testing Done:
sh ./src/test/sh/org/apache/aurora/e2e/

Bugs closed: AURORA-1225

Reviewed at

2 years agoMake scheduling benchmarks more realistic.
Stephan Erb [Thu, 17 Nov 2016 19:43:49 +0000 (20:43 +0100)] 
Make scheduling benchmarks more realistic.

This patch aims to strike a better balance between available offers, already
launched tasks, and tasks to be scheduled. Specifically, it ensures that we
only measure scheduling of tasks for which instances of the same job are
already running (e.g. as needed by limit constraints).

The following setup is now the default: Given N hosts

* there are 0.25 * N tasks of job A scheduled
* there are 0.25 * N tasks of job B scheduled
* there are 0.50 * N free offers available
* we try to schedule a task instance of job A

Hopefully this provides us with benchmark results closer to actual
production performance.
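
As a rough illustration of the proportions listed above, the setup for a given
number of hosts can be sketched as follows. The function name, the dictionary
keys, and the use of integer division are assumptions made for this example
and are not taken from the benchmark code.

```python
def default_benchmark_setup(num_hosts):
    """Return the task and offer counts described above for N hosts."""
    return {
        "job_a_tasks": num_hosts // 4,   # 0.25 * N tasks of job A scheduled
        "job_b_tasks": num_hosts // 4,   # 0.25 * N tasks of job B scheduled
        "free_offers": num_hosts // 2,   # 0.50 * N free offers available
        "tasks_to_schedule": 1,          # one task instance of job A
    }


setup = default_benchmark_setup(1000)
```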


SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  1646.245 ± 273.340  ops/s
SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  thrpt   10  12526.174 ± 814.677  ops/s


SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  241.808 ± 51.952  ops/s
SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  thrpt   10  3447.655 ± 565.998  ops/s

Related bug: AURORA-1802

Reviewed at

2 years agoFilter out resources that Aurora does not yet support to avoid crashing.
Renan DelValle [Thu, 17 Nov 2016 18:36:34 +0000 (12:36 -0600)] 
Filter out resources that Aurora does not yet support to avoid crashing.

Placed a filter when creating resource bags from Mesos resources to avoid trying to convert
unsupported resources, which leads to the scheduler crashing.

Reviewed at

2 years agoUpgrade guava to 20.0
Zameer Manji [Wed, 16 Nov 2016 16:25:26 +0000 (08:25 -0800)] 
Upgrade guava to 20.0

Release Notes:

It's the usual mix of new features and deprecations. The additions of
`Quantiles` and `Stats` could give us some quick improvements in our stats

Bugs closed: AURORA-1821

Reviewed at

2 years agoUpdate Curator to 2.11.1
Stephan Erb [Wed, 16 Nov 2016 08:19:51 +0000 (09:19 +0100)] 
Update Curator to 2.11.1

The latest version includes a couple of bug fixes but nothing major worth calling out.



Reviewed at

2 years agoaurora job inspect should have a --write-json option
Jing Chen [Tue, 15 Nov 2016 16:44:22 +0000 (10:44 -0600)] 
aurora job inspect should have a --write-json option

Bugs closed: AURORA-1504

Reviewed at

2 years agoUpdate vagrant to a working basebox.
Joshua Cohen [Wed, 9 Nov 2016 20:59:22 +0000 (14:59 -0600)] 
Update vagrant to a working basebox.

Reviewed at

2 years agoResolve docker tags to concrete identifiers for DockerContainerizer
Santhosh Kumar Shanmugham [Tue, 8 Nov 2016 22:27:05 +0000 (23:27 +0100)] 
Resolve docker tags to concrete identifiers for DockerContainerizer

Docker tags are mutable and can point to different images
at different points in time. This makes a job launched with a Docker
image mutable across restarts of the job, which breaks Aurora's
guarantee of job immutability (except via job updates).

This change introduces a binding helper that resolves docker name:tag
to a concrete registry/name@digest identifier. Identifying
docker images via a content-addressable digest is available via the
Docker Registry v2, which is a prerequisite for this feature.

Bugs closed: AURORA-1014

Reviewed at

2 years agoFix regression in 5410c22.
Zameer Manji [Sat, 5 Nov 2016 01:11:46 +0000 (18:11 -0700)] 
Fix regression in 5410c22.

The hard dependency on `prctl` broke thermos unit tests both on Apache Jenkins
and OS X. This adopts serb's suggestion and
wraps the `prctl(2)` call in a try/except block.

This also exposed some flakiness in
`TestRunnerKillProcessGroup.test_pg_is_killed`. Marked the test as flaky and
filed AURORA-1809.

Testing Done:
./pants test.pytest --junit-xml-dir="$PWD/dist/test-results" src/{main,test}/python:: -- -v

Reviewed at

2 years agoSend SIGTERM to daemonized processes on shutdown.
Zameer Manji [Fri, 4 Nov 2016 20:41:25 +0000 (13:41 -0700)] 
Send SIGTERM to daemonized processes on shutdown.


Processes can daemonize and escape the supervision of a coordinator. Using the
Docker Containerizer or the Mesos Containerizer with pid isolation means that
the processes will become reparented to the sh process that launches the
executor. For example:

root@aurora:/# ps xf
   48 ?        Ss     0:00 /bin/bash
   86 ?        R+     0:00  _ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 --
   29 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   32 ?        S      0:00      |   _ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   81 ?        S      0:00      |       _ sleep 10
   31 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   33 ?        S      0:00          _ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   82 ?        S      0:00              _ sleep 10
   47 ?        S      0:00 python ./


Ensure processes that escape the supervision of the coordinator reparent to the
runner who can send signals to them on task tear down. We do this by using the
`PR_SET_CHILD_SUBREAPER` flag of `prctl(2)`.
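
A minimal sketch of such a call from Python, assuming `ctypes` is used to reach
the C library (the actual Thermos change may use a different mechanism). The
function name `become_subreaper` is hypothetical. The constant value 36 is the
one defined for `PR_SET_CHILD_SUBREAPER` in `<linux/prctl.h>`. The call is
guarded so that platforms without `prctl` simply report failure.

```python
import ctypes
import ctypes.util

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>, Linux 3.4 and later


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process.

    Returns True on success and False if prctl is unavailable or fails.
    """
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        # prctl(2) returns 0 on success and -1 on error.
        return libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == 0
    except (OSError, AttributeError, TypeError):
        # No usable libc or no prctl symbol, e.g. on non-Linux platforms.
        return False


is_subreaper = become_subreaper()
```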

After this change the process tree looks like:
root@aurora:/# ps xf
   66 ?        Ss     0:00 /bin/bash
   70 ?        R+     0:00  _ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 --
   33 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   40 ?        S      0:00      |   _ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   63 ?        S      0:00      |       _ sleep 10
   36 ?        Ss     0:00      _ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   37 ?        S      0:00      |   _ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   62 ?        S      0:00      |       _ sleep 10
   55 ?        S      0:00      _ python ./


Now the runner is aware of the reparented processes and can tear them down cleanly
with a `SIGTERM`.

Testing Done:

Bugs closed: AURORA-1808

Reviewed at

2 years agoLog TaskInfo and Assigned Task on task startup.
Zameer Manji [Fri, 4 Nov 2016 18:45:45 +0000 (11:45 -0700)] 
Log TaskInfo and Assigned Task on task startup.

The executor logs `ExecutorInfo`, `FrameworkInfo`, `SlaveInfo` on startup. This
adds logging of `TaskInfo` and the Assigned Task object when it is received.

Testing Done:
Launched a task in vagrant and checked the logs. Example output:
I1103 09:55:40.991879 24713] Executor [None]: TaskInfo: name: "www-data/prod/hello"
task_id {
  value: "www-data-prod-hello-0-f33684f5-58a7-4dbe-af8c-a4fe08a862b6"
slave_id {
  value: "d8988ce6-c900-49a1-897d-bc141f390394-S0"
resources {
  name: "disk"
  type: SCALAR
  scalar {
    value: 128.0
  role: "*"
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 0.5
  role: "aurora-role"
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 0.5
  role: "*"
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 128.0
  role: "aurora-role"
executor {
  executor_id {
    value: "thermos-www-data-prod-hello-0-f33684f5-58a7-4dbe-af8c-a4fe08a862b6"
  resources {
    name: "cpus"
    type: SCALAR
    scalar {
      value: 0.25
    role: "*"
  resources {
    name: "mem"
    type: SCALAR
    scalar {
      value: 128.0
    role: "aurora-role"
  command {
    uris {
      value: "/home/vagrant/aurora/dist/thermos_executor.pex"
      executable: true
    value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-path=/usr/libexec/mesos/mesos-containerizer"
  framework_id {
    value: "d8988ce6-c900-49a1-897d-bc141f390394-0000"
  name: "AuroraExecutor"
  source: ""
  labels {
    labels {
      key: "source"
      value: ""
data: "\013\000\001\000\000\000:www-data-prod-hello-0-f33684f5-58a7-4dbe-af8c-a4fe08a862b6\013\000\002\000\000\000\'d8988ce6-c900-49a1-897d-bc141f390394-S0\013\000\003\000\000\000\014192.168.33.7\014\000\004\002\000\007\001\004\000\010?\360\000\000\000\000\000\000\n\000\t\000\000\000\000\000\000\000\200\n\000\n\000\000\000\000\000\000\000\200\010\000\013\000\000\000\000\010\000\r\000\000\000\001\014\000\021\013\000\002\000\000\000\007vagrant\000\002\000\022\000\016\000\024\014\000\000\000\000\016\000\025\013\000\000\000\000\r\000\026\013\013\000\000\000\000\014\000\031\013\000\001\000\000\000\016AuroraExecutor\013\000\002\000\000\004\021{\"environment\": \"prod\", \"health_check_config\": {\"initial_interval_secs\": 15.0, \"health_checker\": {\"http\": {\"expected_response_code\": 0, \"endpoint\": \"/health\", \"expected_response\": \"ok\"}}, \"interval_secs\": 10.0, \"timeout_secs\": 1.0, \"max_consecutive_failures\": 0}, \"name\": \"hello\", \"service\": true, \"max_task_failures\": 1, \"cron_collision_policy\": \"KILL_EXISTING\", \"enable_hooks\": false, \"cluster\": \"devcluster\", \"task\": {\"processes\": [{\"daemon\": false, \"name\": \"hello\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"\n    while true; do\n      echo hello world\n      sleep 10\n    done\n  \", \"final\": false}], \"name\": \"hello\", \"finalization_wait\": 30, \"max_failures\": 1, \"max_concurrency\": 0, \"resources\": {\"gpu\": 0, \"disk\": 134217728, \"ram\": 134217728, \"cpu\": 1.0}, \"constraints\": [{\"order\": [\"hello\"]}]}, \"production\": false, \"role\": \"www-data\", \"tier\": \"preemptible\", \"lifecycle\": {\"http\": {\"graceful_shutdown_endpoint\": \"/quitquitquit\", \"port\": \"health\", \"shutdown_endpoint\": \"/abortabortabort\"}}, \"priority\": 
0}\000\016\000\033\014\000\000\000\000\014\000\034\013\000\001\000\000\000\010www-data\013\000\002\000\000\000\004prod\013\000\003\000\000\000\005hello\000\014\000\035\014\000\001\017\000\002\014\000\000\000\000\000\000\013\000\036\000\000\000\013preemptible\016\000 \014\000\000\000\003\004\000\001?\360\000\000\000\000\000\000\000\n\000\003\000\000\000\000\000\000\000\200\000\n\000\002\000\000\000\000\000\000\000\200\000\016\000!\014\000\000\000\000\000\r\000\005\013\010\000\000\000\000\010\000\006\000\000\000\000\000"
labels {
  labels {
    key: "org.apache.aurora.tier"
    value: "preemptible"
discovery {
  visibility: CLUSTER
  name: ""
  environment: "prod"
  location: "devcluster"

I1103 09:55:40.991996 24713] Executor [None]: launchTask got task: www-data/prod/hello:www-data-prod-hello-0-f33684f5-58a7-4dbe-af8c-a4fe08a862b6
I1103 09:55:40.993160 24713] Executor [d8988ce6-c900-49a1-897d-bc141f390394-S0]: Assigned task: AssignedTask(task=TaskConfig(isService=True, contactEmail=None, taskLinks={}, tier='preemptible', mesosFetcherUris=set([]), executorConfig=ExecutorConfig(data='{"environment": "prod", "health_check_config": {"initial_interval_secs": 15.0, "health_checker": {"http": {"expected_response_code": 0, "endpoint": "/health", "expected_response": "ok"}}, "interval_secs": 10.0, "timeout_secs": 1.0, "max_consecutive_failures": 0}, "name": "hello", "service": true, "max_task_failures": 1, "cron_collision_policy": "KILL_EXISTING", "enable_hooks": false, "cluster": "devcluster", "task": {"processes": [{"daemon": false, "name": "hello", "ephemeral": false, "max_failures": 1, "min_duration": 5, "cmdline": "\n    while true; do\n      echo hello world\n      sleep 10\n    done\n  ", "final": false}], "name": "hello", "finalization_wait": 30, "max_failures": 1, "max_concurrency": 0, "resources": {"gpu": 0, "disk": 134217728, "ram": 134217728, "cpu": 1.0}, "constraints": [{"order": ["hello"]}]}, "production": false, "role": "www-data", "tier": "preemptible", "lifecycle": {"http": {"graceful_shutdown_endpoint": "/quitquitquit", "port": "health", "shutdown_endpoint": "/abortabortabort"}}, "priority": 0}', name='AuroraExecutor'), requestedPorts=set([]), maxTaskFailures=1, priority=0, ramMb=128, job=JobKey(environment='prod', role='www-data', name='hello'), production=False, diskMb=128, resources=set([Resource(ramMb=None, numGpus=None, namedPort=None, diskMb=None, numCpus=1.0), Resource(ramMb=128, numGpus=None, namedPort=None, diskMb=None, numCpus=None), Resource(ramMb=None, numGpus=None, namedPort=None, diskMb=128, numCpus=None)]), owner=Identity(user='vagrant'), container=Container(docker=None, mesos=MesosContainer(image=None, volumes=[])), metadata=set([]), numCpus=1.0, constraints=set([])), taskId='www-data-prod-hello-0-f33684f5-58a7-4dbe-af8c-a4fe08a862b6', instanceId=0, assignedPorts={}, 
slaveHost='', slaveId='d8988ce6-c900-49a1-897d-bc141f390394-S0')

Bugs closed: AURORA-1792

Reviewed at

2 years agoDocument how to create a custom CLI build
David McLaughlin [Fri, 4 Nov 2016 12:27:28 +0000 (13:27 +0100)] 
Document how to create a custom CLI build

Document how to create a custom pex in which you can put deployment-specific customizations.

Reviewed at

2 years agoPopulate curator latches with scheduler information
Jing Chen [Fri, 4 Nov 2016 12:12:20 +0000 (13:12 +0100)] 
Populate curator latches with scheduler information


  (CONNECTED) /> get aurora/scheduler/member_0000000173
  (CONNECTED) /> get aurora/scheduler/_c_fe00a931-df92-4041-97db-9ef27a56e264-latch-0000000172

Bugs closed: AURORA-1785

Reviewed at

2 years agoUpdate h2 database to 1.4.193
Stephan Erb [Wed, 2 Nov 2016 22:51:33 +0000 (23:51 +0100)] 
Update h2 database to 1.4.193

There does not seem to be anything major in the changelog that is worth updating for. However, staying up to date probably does no harm either.


Reviewed at

2 years agoEnable per task volume mounts via scheduler API
Zameer Manji [Mon, 31 Oct 2016 23:11:49 +0000 (16:11 -0700)] 
Enable per task volume mounts via scheduler API

This allows users to specify volume mounts for tasks using the unified
containerizer if the operator permits them. This is analogous to enabling docker
parameters per task and using the `--volume` parameter.

This does not include the needed DSL changes or an e2e test which will be in a
subsequent diff.

Bugs closed: AURORA-1107

Reviewed at

2 years agoAdded short example to documentation how to use .thermos_profile.
Rogier Dikkes [Mon, 31 Oct 2016 17:50:55 +0000 (12:50 -0500)] 
Added short example to documentation how to use .thermos_profile.

Reviewed at

2 years agoRe-introduce --executor_registration_timeout.
John Sirois [Mon, 24 Oct 2016 15:49:41 +0000 (11:49 -0400)] 
Re-introduce --executor_registration_timeout.

This stabilizes the slave against image copies that take a long time
which is tracked here:

Reviewed at

2 years agoUpgrade pants to the 1.3.0 dev series.
John Sirois [Sun, 23 Oct 2016 18:13:20 +0000 (14:13 -0400)] 
Upgrade pants to the 1.3.0 dev series.

This gets rid of an impossible to work around warning emitted by every
pants invocation and prepares for consuming an upcoming pants change
that will allow Aurora to use the thrift binary selected by
`build-support/thrift/thriftw` consistently everywhere.

Reviewed at

2 years agoAdding logic to copy network files when using the Mesos containerizer with a Docker...
Justin Pinkul [Wed, 19 Oct 2016 19:13:28 +0000 (14:13 -0500)] 
Adding logic to copy network files when using the Mesos containerizer with a Docker image.

The networking files /etc/resolv.conf, /etc/hosts and /etc/hostname are now copied into the taskfs
when using the Mesos containerizer with a Docker image.

Bugs closed: AURORA-1798

Reviewed at

2 years agoCheck identity before comparing fields in immutable Thrift structs.
Stephan Erb [Tue, 18 Oct 2016 20:18:32 +0000 (22:18 +0200)] 
Check identity before comparing fields in immutable Thrift structs.

I saw THRIFT-3868 and thought we could apply the same micro-optimization
as well. Details:

Example of a generated equals method (in ITaskConfig):

    public boolean equals(Object o) {
      // Identity check first: the same instance needs no field comparisons.
      if (this == o) {
        return true;
      }
      if (!(o instanceof ITaskConfig)) {
        return false;
      }
      ITaskConfig other = (ITaskConfig) o;
      return Objects.equals(job, other.job)
          && Objects.equals(owner, other.owner)
          && Objects.equals(isService, other.isService)
          && Objects.equals(numCpus, other.numCpus)
          && Objects.equals(ramMb, other.ramMb)
          && Objects.equals(diskMb, other.diskMb)
          && Objects.equals(priority, other.priority)
          && Objects.equals(maxTaskFailures, other.maxTaskFailures)
          && Objects.equals(production, other.production)
          && Objects.equals(tier, other.tier)
          && Objects.equals(resources, other.resources)
          && Objects.equals(constraints, other.constraints)
          && Objects.equals(requestedPorts, other.requestedPorts)
          && Objects.equals(mesosFetcherUris, other.mesosFetcherUris)
          && Objects.equals(taskLinks, other.taskLinks)
          && Objects.equals(contactEmail, other.contactEmail)
          && Objects.equals(executorConfig, other.executorConfig)
          && Objects.equals(metadata, other.metadata)
          && Objects.equals(container, other.container);
    }

Reviewed at

2 years agoHandle the case where content type header is null.
Zameer Manji [Tue, 18 Oct 2016 19:00:03 +0000 (12:00 -0700)] 
Handle the case where content type header is null.

Per the
`getContentType` can return `null`. This now handles that case gracefully.

Bugs closed: AURORA-1795

Reviewed at

2 years agoAdding an error message when the mesos_containerizer_path is not set correctly.
Justin Pinkul [Tue, 18 Oct 2016 18:52:22 +0000 (11:52 -0700)] 
Adding an error message when the mesos_containerizer_path is not set correctly.

Testing Done:
I verified the new error makes its way to the UI when mesos_containerizer_path
is set to a file that does not exist and also verified the executor completes
successfully when mesos_containerizer_path is set to the correct location.

Unit tests:
./pants test src/test/python/apache/thermos::
./pants test src/test/python/apache/aurora/executor::

Bugs closed: AURORA-1789

Reviewed at

2 years agoFix the -enable_revocable_ram flag
Stephan Erb [Tue, 18 Oct 2016 07:08:57 +0000 (09:08 +0200)] 
Fix the -enable_revocable_ram flag

The mentioned flag has been introduced in
Unfortunately, as detailed in the bug report, my testing was not thorough enough.

Problem description:

* The flag is used in the `ResourceType` enum constructor. This implies the flag value
  needs to be available during class loading.
* Values supplied via the scheduler command line are only set at runtime, right at
  the beginning of `main` [1].
* Luckily, there is a check in our arg parsing library that warns if a value is
  changed after it has already been read. In other words: We get an exception if
  we change the flag, because it has already been read during class loading.

This patch corrects this issue by treating the arguments as a supplier which can
be read lazily at runtime. The patch also extends the existing e2e test for
revocable resources to also consider RAM.
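
The failure mode and the supplier-based fix can be sketched roughly as follows.
This is a Python analogy, not the Java code of the patch: the `Flag` class, its
read-before-write check, and all variable names are hypothetical stand-ins for
the arg parsing library and the `ResourceType` enum.

```python
class Flag:
    """Hypothetical flag that rejects changes once its value has been read."""

    def __init__(self, default):
        self._value = default
        self._was_read = False

    def get(self):
        self._was_read = True
        return self._value

    def set(self, value):
        if self._was_read:
            raise RuntimeError("flag changed after it was already read")
        self._value = value


# Eager: the value is read at definition time, analogous to class loading.
eager_flag = Flag(False)
EAGER_REVOCABLE = eager_flag.get()
try:
    eager_flag.set(True)  # simulates command-line parsing at the start of main
    eager_failed = False
except RuntimeError:
    eager_failed = True

# Lazy: only a supplier is stored, so nothing is read until it is called.
lazy_flag = Flag(False)
revocable_supplier = lazy_flag.get
lazy_flag.set(True)  # command-line parsing succeeds
lazy_value = revocable_supplier()
```

In the eager variant the later `set` raises, while the supplier variant sees
the value supplied at runtime.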


Bugs closed: AURORA-1794

Reviewed at

2 years agoAdded Electronic Arts to the adopters list.
Jake Smullin [Fri, 14 Oct 2016 21:17:02 +0000 (23:17 +0200)] 
Added Electronic Arts to the adopters list.

Reviewed at

2 years agoIntroduce a --ip option to Thermos observer
Stephan Erb [Fri, 14 Oct 2016 21:00:03 +0000 (23:00 +0200)] 
Introduce a --ip option to Thermos observer

This enables operators to bind the observer to a specific interface, just like it is possible for the Aurora scheduler and Mesos.

Reviewed at

2 years agoBlank out executor config in startJobUpdate log messages.
Stephan Erb [Thu, 13 Oct 2016 07:51:59 +0000 (09:51 +0200)] 
Blank out executor config in startJobUpdate log messages.

We are already applying a blanking of `ExecutorConfig` objects for any API
call with a `JobConfiguration` argument. We now extend this filtering to also
consider `JobUpdateRequest` objects.

Example log message with blanked config.

I1012 08:32:59.841 [qtp713342922-40, LoggingInterceptor:78] startJobUpdate(JobUpdateRequest(taskConfig:TaskConfig(job:JobKey(role:www-data, environment:prod, name:hello), owner:Identity(user:vagrant), isService:true, numCpus:1.0, ramMb:128, diskMb:128, priority:0, maxTaskFailures:1, production:false, tier:preemptible, resources:[], constraints:[], requestedPorts:[], taskLinks:{}, executorConfig:ExecutorConfig(name:BLANKED, data:BLANKED), metadata:[], container:<Container mesos:MesosContainer()>), instanceCount:1, settings:JobUpdateSettings(updateGroupSize:1, maxPerInstanceFailures:0, maxFailedInstances:0, minWaitInInstanceRunningMs:45000, rollbackOnFailure:true, updateOnlyTheseInstances:null, waitForBatchCompletion:false), metadata:[Metadata(key:org.apache.aurora.client.update_id, value:773ae166-bd77-4b07-8368-62473aecd67e)]), null)

Reviewed at

2 years agoRevert "Add min_consecutive_health_checks in HealthCheckConfig"
David McLaughlin [Wed, 12 Oct 2016 22:54:20 +0000 (15:54 -0700)] 
Revert "Add min_consecutive_health_checks in HealthCheckConfig"

This reverts commit ed72b1bf662d1e29d2bb483b317c787630c26a9e.

Revert "Add support for receiving min_consecutive_successes in health checker"

This reverts commit e91130e49445c3933b6e27f5fde18c3a0e61b87a.

Revert "Modify executor state transition logic to rely on health checks (if enabled)."

This reverts commit ca683cb9e27bae76424a687bc6c3af5a73c501b9.

Bugs closed: AURORA-1793

Reviewed at

2 years agoUpgrade pystachio to 0.8.3
Santhosh Kumar Shanmugham [Wed, 12 Oct 2016 18:00:30 +0000 (11:00 -0700)] 
Upgrade pystachio to 0.8.3

Testing Done:

Reviewed at

2 years agoUpdate mybatis, h2, and jmh to their latest versions.
Stephan Erb [Tue, 11 Oct 2016 18:13:21 +0000 (20:13 +0200)] 
Update mybatis, h2, and jmh to their latest versions.

I have skimmed the changelogs and the following h2 entries stood out:

* "Garbage collection of unused chunks should now be faster."
* "Improve performance of cleaning up temp tables - patch from Eric Faulhaber."

Running our micro-benchmarks did not indicate any noteworthy performance difference.

Full changelogs:


Reviewed at

2 years agoUpgrade to pants 1.2.0rc0.
John Sirois [Mon, 10 Oct 2016 23:37:18 +0000 (17:37 -0600)] 
Upgrade to pants 1.2.0rc0.

This pulls in support for OSX Sierra.
Release notes are here:

Reviewed at

2 years agoUse the Thrift binary protocol in the Aurora client.
Stephan Erb [Thu, 6 Oct 2016 20:14:28 +0000 (22:14 +0200)] 
Use the Thrift binary protocol in the Aurora client.

Now that Aurora 0.16 has been released, we can assume all schedulers
are supporting the binary protocol.

Switching the protocol comes with a slight performance boost: Given
the example job `devcluster/www-data/prod/hello` with 500 instances
we can reduce the average request time by about 40%.

Using the TJSONProtocol:

    $ time aurora job status devcluster > /dev/null
    real    0m0.906s
    user    0m0.759s
    sys     0m0.078s

Using the new TBinaryProtocol:

    $ time aurora job status devcluster > /dev/null
    real    0m0.524s
    user    0m0.410s
    sys     0m0.079s
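
As a quick check of the "about 40%" figure, the relative reduction can be
computed from the two `real` timings quoted above. This is a sketch that treats
the two single samples as representative of the average; the variable names are
chosen for this example only.

```python
json_real_secs = 0.906    # TJSONProtocol, "real" time quoted above
binary_real_secs = 0.524  # TBinaryProtocol, "real" time quoted above

# Relative reduction in wall-clock time when switching protocols.
reduction = (json_real_secs - binary_real_secs) / json_real_secs
reduction_percent = round(reduction * 100, 1)
```

For these two samples the reduction works out to roughly 42%.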

Reviewed at

2 years agoManually configure the private network interface in Vagrant
Andrew Jorgensen [Thu, 6 Oct 2016 19:49:01 +0000 (13:49 -0600)] 
Manually configure the private network interface in Vagrant

I am not sure of the specifics of why this happens but on vagrant 1.8.6 the network interface does not come up correctly and the private_network is attached to the `eth0` nat interface rather than the host-only interface. I tried a number of different parameters but none of them were able to configure the network appropriately. This change manually configures the static ip so that it is connected to the correct adapter. Without this change I could not access the aurora web interface.

Reviewed at

2 years agoMove common/zookeeper to the main aurora project.
John Sirois [Thu, 6 Oct 2016 17:44:07 +0000 (11:44 -0600)] 
Move common/zookeeper to the main aurora project.

Remove unused code and restrict visibility where possible. Also fix up
various warnings.

Bugs closed: AURORA-1669

Reviewed at

2 years agoRemove untested classes that no longer exist.
John Sirois [Thu, 6 Oct 2016 15:31:51 +0000 (09:31 -0600)] 
Remove untested classes that no longer exist.

Reviewed at

2 years agoUpdate to Gradle 3.1.
Stephan Erb [Thu, 6 Oct 2016 07:20:57 +0000 (09:20 +0200)] 
Update to Gradle 3.1.

I have skimmed the release notes and this change seems to be the most
important one:

    "This release (2.14.1) fixes a critical defect to incremental builds that
     may prevent Gradle from executing tasks when inputs or outputs are out of
     date. This affects the correctness of builds using Gradle 2.14."

The release notes talk about massive performance gains, but I have not noticed any.

Since the Gradle daemon is now enabled by default, we can drop the properties file.

Release notes:


Reviewed at