4 years agoUpdating .auroraversion to 0.18.1-rc0. rel/0.18.1-rc0
Bill Farner [Sun, 29 Oct 2017 17:23:12 +0000 (10:23 -0700)] 
Updating .auroraversion to 0.18.1-rc0.

4 years agoIncrementing snapshot version to 0.18.1-SNAPSHOT.
Bill Farner [Sun, 29 Oct 2017 17:23:12 +0000 (10:23 -0700)] 
Incrementing snapshot version to 0.18.1-SNAPSHOT.

4 years agoUpdating CHANGELOG for 0.18.1 release.
Bill Farner [Sun, 29 Oct 2017 17:23:12 +0000 (10:23 -0700)] 
Updating CHANGELOG for 0.18.1 release.

4 years agoUpdate to shiro 1.2.5
Bill Farner [Mon, 23 Oct 2017 16:41:50 +0000 (09:41 -0700)] 
Update to shiro 1.2.5

Reviewed at

5 years agoUpdating .auroraversion to release version 0.18.0. rel/0.18.0
Santhosh Kumar [Thu, 15 Jun 2017 22:45:25 +0000 (15:45 -0700)] 
Updating .auroraversion to release version 0.18.0.

5 years agoUpdating .auroraversion to 0.18.0-rc0. rel/0.18.0-rc0
Santhosh Kumar [Sat, 10 Jun 2017 00:11:52 +0000 (17:11 -0700)] 
Updating .auroraversion to 0.18.0-rc0.

5 years agoIncrementing snapshot version to 0.19.0-SNAPSHOT.
Santhosh Kumar [Sat, 10 Jun 2017 00:11:52 +0000 (17:11 -0700)] 
Incrementing snapshot version to 0.19.0-SNAPSHOT.

5 years agoUpdating CHANGELOG for 0.18.0 release.
Santhosh Kumar [Sat, 10 Jun 2017 00:11:52 +0000 (17:11 -0700)] 
Updating CHANGELOG for 0.18.0 release.

5 years agoPrepare release notes for 0.18.0
Santhosh Kumar [Fri, 9 Jun 2017 23:44:06 +0000 (16:44 -0700)] 
Prepare release notes for 0.18.0

5 years agoAdding gpg key for santhk
Santhosh Kumar Shanmugham [Fri, 9 Jun 2017 22:09:47 +0000 (15:09 -0700)] 
Adding gpg key for santhk

Reviewed at

5 years agoProcess rescinds in the same thread pool as offers.
Zameer Manji [Tue, 6 Jun 2017 21:21:22 +0000 (14:21 -0700)] 
Process rescinds in the same thread pool as offers.

In a production environment I was able to observe the following:
I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]

Notice that the offer rescind was processed before the actual offer. This is
possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
processed in the executor (to prevent blocking) and the rescind is handled
directly. This means the offer processing thread (`SchedulerImpl-0`) is racing
against the callback thread (`Thread-77638`).

In normal operation, there will be seconds to minutes between a rescind and an
offer, but in some cases an offer can be rescinded very quickly in clusters that
use oversubscription modules.

To fix this, we move the rescind processing into the same executor as the offer
processing to ensure they are processed in the order they are received. Without
fixing this, the rescinded offer exists in the offer manager and can be used
later to launch a task. This task will immediately fail to launch because the
offer is invalid.
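
The ordering guarantee described above can be sketched with a single-threaded
executor. This is a minimal Python illustration, not the scheduler's Java code:
the class `OfferCallbacks`, its methods, and the `failed_removals` counter are
hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class OfferCallbacks:
    """Hypothetical stand-in for the callback handler; not Aurora's real API."""

    def __init__(self):
        # A single worker thread runs submitted tasks strictly in submission order.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.offers = {}
        self.failed_removals = 0  # rescinds that found no matching offer

    def on_offer(self, offer_id, offer):
        return self._executor.submit(self.offers.__setitem__, offer_id, offer)

    def on_rescind(self, offer_id):
        # Submitted to the same executor, so it cannot overtake an earlier offer.
        return self._executor.submit(self._remove, offer_id)

    def _remove(self, offer_id):
        if self.offers.pop(offer_id, None) is None:
            self.failed_removals += 1

    def shutdown(self):
        self._executor.shutdown(wait=True)


callbacks = OfferCallbacks()
callbacks.on_offer("O142359552", {"cpus": 1.0})
callbacks.on_rescind("O142359552")
callbacks.shutdown()
```

Because both callbacks go through the same queue, the rescind always sees the
offer it refers to, and the counter only grows when a rescind arrives for an
offer that was never stored.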

In this patch, I have also added a metric and logging to record when we fail to
remove an offer from the offer manager, and cleaned up the logging to allow
operators to see when an offer was received. With this logging, an operator can
grep for the offer id and see the entire lifecycle of the offer in the

Bugs closed: AURORA-1933

Reviewed at

5 years agoPrioritize adding instances over updating instances during an update
Jordan Ly [Fri, 2 Jun 2017 20:58:42 +0000 (13:58 -0700)] 
Prioritize adding instances over updating instances during an update

Bugs closed: AURORA-1928

Reviewed at

5 years agoAllow custom OfferManager ordering to be injected via Guice modules
David McLaughlin [Fri, 2 Jun 2017 20:43:43 +0000 (13:43 -0700)] 
Allow custom OfferManager ordering to be injected via Guice modules

Reviewed at

5 years agoImprove task history pruning by batch deleting tasks.
Kai Huang [Fri, 2 Jun 2017 20:32:23 +0000 (13:32 -0700)] 
Improve task history pruning by batch deleting tasks.

Bugs closed: AURORA-1929

Reviewed at

5 years agoEnables scalable, high-performance bin-packing approximation by sorting offers. Can...
David McLaughlin [Wed, 31 May 2017 23:30:49 +0000 (16:30 -0700)] 
Enables scalable, high-performance bin-packing approximation by sorting offers. Can be controlled via Scheduler flags.

Reviewed at

5 years agoBump logback to 1.2.3 and SLF4J to 1.7.25
Stephan Erb [Tue, 30 May 2017 20:40:31 +0000 (22:40 +0200)] 
Bump logback to 1.2.3 and SLF4J to 1.7.25

The changelog entries that stand out the most:

* The ReentrantLock in OutputStreamAppender is now "unfair". In previous
  versions of logback, a fair lock was used. Fair locks are much slower.
  Just as importantly, logback has no mandate to influence thread scheduling.
* In PatternLayoutBase the same StringBuilder is used over and over to
  reduce memory allocation.

I am unable to observe any improvements in our micro-benchmarks. In any
case, I still think it is worth staying up to date.

Full changelogs:


Reviewed at

5 years agoNormalize state endpoint to reduce API payload size.
David McLaughlin [Thu, 25 May 2017 17:37:59 +0000 (10:37 -0700)] 
Normalize state endpoint to reduce API payload size.

Reviewed at

5 years agoAdd the ability to customize scheduling logic.
David McLaughlin [Thu, 25 May 2017 05:51:54 +0000 (22:51 -0700)] 
Add the ability to customize scheduling logic.

Uses Guice module injection to enable replacing the first-fit scheduling algorithm and associated first-fit preemption logic.

See design/proposal document here:

Bugs closed: AURORA-1920

Reviewed at

5 years agoAdd cluster state debug endpoint to Scheduler HTTP servlet.
David McLaughlin [Wed, 24 May 2017 22:07:33 +0000 (15:07 -0700)] 
Add cluster state debug endpoint to Scheduler HTTP servlet.

Reviewed at

5 years agoFix SchedulingBenchmarks broken in Mesos Maintenance and Update Affinity patches.
David McLaughlin [Tue, 23 May 2017 21:49:16 +0000 (14:49 -0700)] 
Fix SchedulingBenchmarks broken in Mesos Maintenance and Update Affinity patches.

Reviewed at

5 years agoAdded 'aurora task scp' command for copying/retrieving files to the sandbox of a...
Jordan Ly [Thu, 18 May 2017 21:49:23 +0000 (14:49 -0700)] 
Added 'aurora task scp' command for copying/retrieving files to the sandbox of a task instance.

This command essentially mimics scp but expands task instances into their respective user@host:path form.
For 'aurora task scp' the sandbox is the relative root. However, you can still use absolute paths
(ex. /tmp or /var/log). Tilde expansion is not supported (ex. paths like ~/some/dir will not work)
as they will try to access home directories: in this case, the command will return an error.

Example usage:
From host to task sandbox folder: `aurora task scp ~/test.txt cluster/role/env/job/instance:`
From task sandbox folder to host: `aurora task scp cluster/role/env/job/instance:test.txt .`
From task tmp folder to host: `aurora task scp cluster/role/env/job/instance:/tmp/test.txt .`
From one task to another task: `aurora task scp cluster/role/env/job/instance:test.txt cluster/role/env/job/instance:some/dir/`
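
The path rules above can be sketched as follows. This is a minimal
illustration, not the client's actual code: `expand_task_path`,
`to_scp_target`, the `sandbox_root` argument, and the use of `ValueError` are
all assumptions made for the example.

```python
import posixpath


def expand_task_path(path, sandbox_root):
    """Resolve the path given after 'instance:' against the sandbox root.

    Hypothetical helper: the name and the error type are assumptions.
    """
    if path.startswith("~"):
        # Tilde paths would point at a home directory, so they are rejected.
        raise ValueError("tilde expansion is not supported: %s" % path)
    if posixpath.isabs(path):
        return path  # absolute paths such as /tmp are used unchanged
    return posixpath.join(sandbox_root, path)  # the sandbox is the relative root


def to_scp_target(user, host, path, sandbox_root):
    """Build the user@host:path string that plain scp understands."""
    return "%s@%s:%s" % (user, host, expand_task_path(path, sandbox_root))
```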

Testing Done:
`./pants test src/test/python/apache/aurora/client/cli:cli`
23:07:10 00:03       [run]
                     ============== test session starts ===============
                     platform darwin -- Python 2.7.10 -- py-1.4.33 -- pytest-2.6.4
                     plugins: cov, timeout
                     collected 179 items

                     src/test/python/apache/aurora/client/cli/ ...
                     src/test/python/apache/aurora/client/cli/ ........
                     src/test/python/apache/aurora/client/cli/ .
                     src/test/python/apache/aurora/client/cli/ .....
                     src/test/python/apache/aurora/client/cli/ .
                     src/test/python/apache/aurora/client/cli/ ..
                     src/test/python/apache/aurora/client/cli/ .....
                     src/test/python/apache/aurora/client/cli/ .....
                     src/test/python/apache/aurora/client/cli/ .......................................
                     src/test/python/apache/aurora/client/cli/ ..........
                     src/test/python/apache/aurora/client/cli/ .............
                     src/test/python/apache/aurora/client/cli/ ....
                     src/test/python/apache/aurora/client/cli/ ..
                     src/test/python/apache/aurora/client/cli/ ..........
                     src/test/python/apache/aurora/client/cli/ ..
                     src/test/python/apache/aurora/client/cli/ ......
                     src/test/python/apache/aurora/client/cli/ ...............
                     src/test/python/apache/aurora/client/cli/ ..............
                     src/test/python/apache/aurora/client/cli/ ......................
                     src/test/python/apache/aurora/client/cli/ ....
                     src/test/python/apache/aurora/client/cli/ ..
                     src/test/python/apache/aurora/client/cli/ ......

                     ========== 179 passed in 24.88 seconds ===========

23:07:37 00:30   [complete]

I've also compiled it within the local cluster with Vagrant and used the command to transfer a text file between the scheduler machine and job I created.

Bugs closed: AURORA-1925

Reviewed at

5 years agoFix update affinity cache name
Reza Motamedi [Sun, 14 May 2017 19:13:51 +0000 (21:13 +0200)] 
Fix update affinity cache name

In a previous review I introduced metrics
for BiCache explicit removals and expirations. There I changed the contract so
that instead of passing the _cache size metric name_, callers pass just the
__cache name__. That has already been applied to all usages of BiCache. The
update affinity patch
had a merge conflict that did not pick this change, which leads to metric names
such as `update_affinity_cache_size_cache_expiration_removals`,
`update_affinity_cache_size_cache_removals`, `update_affinity_cache_size_cache_size`.
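
How the doubled suffix arises can be shown with a small sketch. The helper
below is hypothetical; only the three suffixes are taken from the metric names
quoted above, and the cache name `update_affinity` is an assumption about what
the corrected call passes.

```python
def bicache_metric_names(cache_name):
    """Derive the three metric names from a cache name (hypothetical helper)."""
    return [
        cache_name + "_cache_expiration_removals",
        cache_name + "_cache_removals",
        cache_name + "_cache_size",
    ]


# Passing the old size-metric name reproduces the doubled suffix.
broken = bicache_metric_names("update_affinity_cache_size")
# Passing just the cache name yields the intended names.
fixed = bicache_metric_names("update_affinity")
```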

Reviewed at

5 years agoAdding metrics for removals from BiCache
Reza Motamedi [Mon, 8 May 2017 23:37:09 +0000 (16:37 -0700)] 
Adding metrics for removals from BiCache

Reviewed at

5 years agoAdd best-effort update affinity into the Scheduler.
David McLaughlin [Sat, 6 May 2017 02:11:57 +0000 (19:11 -0700)] 
Add best-effort update affinity into the Scheduler.

Reviewed at

5 years agoAURORA-1915 Add automatic browser tab open feature for aurora update start
Takuya Kuwahara [Wed, 3 May 2017 04:54:27 +0000 (21:54 -0700)] 
AURORA-1915 Add automatic browser tab open feature for aurora update start

Aurora client automatically opens a browser tab following `aurora job create` and `aurora cron schedule` commands. This patch provides similar functionality for `aurora update start`.

Reviewed at

5 years agoAURORA-1922 Expose stats on the number of jobs stored in MemCronJobStore
Mehrdad Nurolahzade [Wed, 3 May 2017 04:25:03 +0000 (21:25 -0700)] 
AURORA-1922 Expose stats on the number of jobs stored in MemCronJobStore

This patch exposes stats on the size of `jobs` map in `MemCronJobStore`.

Reviewed at

5 years agoMake sure we track scheduling penalty when no tasks are scheduled.
David McLaughlin [Tue, 2 May 2017 17:24:59 +0000 (10:24 -0700)] 
Make sure we track scheduling penalty when no tasks are scheduled.

Reviewed at

5 years agoAURORA-1923 Aurora client should not automatically retry non-idempotent operations
Mehrdad Nurolahzade [Tue, 2 May 2017 16:35:36 +0000 (09:35 -0700)] 
AURORA-1923 Aurora client should not automatically retry non-idempotent operations

Aurora client has a built in mechanism to automatically retry thrift API operations if the connection with scheduler times out, experiences transport exception, or encounters a transient exception on the scheduler side.

Retrying thrift calls due to scheduler connection timeout and transient exceptions (see AURORA-187) is safe. However, as Aurora has no concept of idempotency, its client can retry non-idempotent operations upon encountering transport exceptions which can lead to nondeterministic situations.

For example, if client requests go through a proxy to reach scheduler, client might consider a non-idempotent request failed and automatically retry it while the original request has been received and processed by the scheduler.

This patch changes Aurora client invocation semantics from "at least once" to "at most once" for non-idempotent operations.
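
The change in semantics can be sketched as a retry wrapper. This is an
illustration under assumptions, not the client's implementation: the exception
classes, the `idempotent` flag, and `max_attempts` are hypothetical.

```python
class TransportError(Exception):
    """Stand-in for a thrift transport exception."""


class TransientError(Exception):
    """Stand-in for a transient scheduler-side error."""


def call_with_retry(operation, idempotent, max_attempts=3):
    """Call operation(), retrying only when a retry cannot duplicate work."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            # Transient scheduler errors are treated as safe to retry.
            if attempt == max_attempts:
                raise
        except TransportError:
            # The request may already have reached the scheduler, so only
            # idempotent operations are sent again ("at most once" otherwise).
            if not idempotent or attempt == max_attempts:
                raise
```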

Reviewed at

5 years agoFix for unnecessary object serializations
Mehrdad Nurolahzade [Fri, 28 Apr 2017 22:19:36 +0000 (15:19 -0700)] 
Fix for unnecessary object serializations

This patch provides a fix for some unnecessary object serializations that happen on high frequency execution paths and contribute to scheduler's high object creation rate.

Reviewed at

5 years agoExtend operator documentation
Stephan Erb [Tue, 25 Apr 2017 21:26:43 +0000 (23:26 +0200)] 
Extend operator documentation

Included changes:

* new cluster upgrade instructions
* docs for several best practices collected on the mailing list
* extracted and extended troubleshooting guide for new cluster operators
* several minor formatting fixes

Reviewed at

5 years agoBump initial_task_kill_retry_interval to 15s.
Stephan Erb [Tue, 25 Apr 2017 21:18:30 +0000 (23:18 +0200)] 
Bump initial_task_kill_retry_interval to 15s.

It is not very common that kills are dropped by Mesos and have to be retried
by Aurora. It therefore makes sense to slightly increase the retry timeout
so that we don't retry needlessly when Thermos is still busy executing
the lifecycle methods.

By default, Thermos uses the following kill escalation sequence:

  * /quitquitquit
  * wait 5s
  * /abortabortabort
  * wait 5s
  * wait up to 1 minute

Reviewed at

5 years agoImprove cleanup hints in release and release-candidate scripts
Stephan Erb [Fri, 21 Apr 2017 16:39:48 +0000 (18:39 +0200)] 
Improve cleanup hints in release and release-candidate scripts

Reviewed at

5 years agoUpdate to Mesos 1.2.0
Stephan Erb [Mon, 17 Apr 2017 20:31:17 +0000 (22:31 +0200)] 
Update to Mesos 1.2.0


Reviewed at

5 years agoFix schema to allow multiple task volumes per task.
Zameer Manji [Fri, 7 Apr 2017 12:18:33 +0000 (14:18 +0200)] 
Fix schema to allow multiple task volumes per task.

The original commit adding this feature added an artificial constraint to the
schema that prevented more than one task volume per task. This is because there
was a `UNIQUE` constraint between the volumes table and the task config table,
preventing a task config from being associated with more than one volume.

This patch removes that constraint. As a result some of the MyBatis mappers had
to change and a new migration was added.

Bugs closed: AURORA-1914

Reviewed at

5 years agoReliably subscribe to Mesos in the HTTP Driver.
Zameer Manji [Thu, 6 Apr 2017 08:40:41 +0000 (10:40 +0200)] 
Reliably subscribe to Mesos in the HTTP Driver.

As noted in AURORA-1911 the `V1Mesos` driver doesn't retry `SUBSCRIBE` calls if
they fail. This means that after a leader subscribes and disconnects, it is
possible for it to never resubscribe if the Mesos Master is unhealthy.

To fix this, I have moved the subscription into the dedicated
`SchedulerExecutor` and it continues to attempt to subscribe using truncated
binary backoff. It only stops if we are disconnected or if we successfully
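
A truncated binary backoff doubles the delay after each failed attempt and
caps it at a maximum. The sketch below assumes that reading; the function
names, the default delays, and the `should_stop` predicate are hypothetical,
and jitter is left out for brevity.

```python
def truncated_binary_backoff(initial_s, max_s):
    """Yield delays that double on every attempt and never exceed max_s."""
    delay = initial_s
    while True:
        yield min(delay, max_s)
        delay = min(delay * 2, max_s)


def subscribe_until_done(try_subscribe, should_stop, sleep, initial_s=1.0, max_s=30.0):
    """Retry try_subscribe() until it succeeds or should_stop() returns True."""
    for delay in truncated_binary_backoff(initial_s, max_s):
        if should_stop():
            return False
        if try_subscribe():
            return True
        sleep(delay)
```

Passing `sleep` as an argument keeps the loop testable without real waiting;
in production code it would typically be `time.sleep` or a scheduled task.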

Bugs closed: AURORA-1911

Reviewed at

5 years agoFix Thermos Health Check for MesosContainerizer with `--nosetuid-health-checks`
Charles Raimbert [Wed, 5 Apr 2017 09:25:03 +0000 (11:25 +0200)] 
Fix Thermos Health Check for MesosContainerizer with `--nosetuid-health-checks`

With MesosContainerizer, the health check is performed using a "mesos-containerizer
launch" process, but there is actually a code bug in the way of getting the user
under which to run the health check process:
health_check_user = (os.getusername() if self._nosetuid_health_checks
            else assigned_task.task.job.role)

If the scheduler is configured with `--nosetuid-health-checks` then "os.getusername()"
is executed, but the "os" python module does not present any "getusername()" function,
which leads the Thermos execution to abort as follows:
D0323 01:08:15.453372 16] Task started.
E0323 01:08:15.571124 16] Traceback (most recent call last):
File "apache/aurora/executor/", line 119, in _run
self._start_status_manager(driver, assigned_task)
File "apache/aurora/executor/", line 168, in _start_status_manager
status_checker = status_provider.from_assigned_task(assigned_task, self._sandbox)
File "apache/aurora/executor/common/", line 370, in from_assigned_task
health_check_user = (os.getusername() if self._nosetuid_health_checks
AttributeError: 'module' object has no attribute 'getusername'

Following the existing unit testing pattern from, a test case
was added to cover the `--nosetuid-health-checks` case for MesosContainerizer.
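
For illustration, one way to pick the user with the standard library is
`getpass.getuser()`. Whether the actual patch used that call is an assumption;
the function name and its arguments below are hypothetical.

```python
import getpass
import os

# The os module has no getusername(), which is what the traceback reports.
HAS_GETUSERNAME = hasattr(os, "getusername")


def health_check_user(nosetuid_health_checks, role):
    """Return the user that the health check process should run as."""
    # getpass.getuser() is an assumed replacement for the missing call.
    return getpass.getuser() if nosetuid_health_checks else role
```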

Bugs closed: AURORA-1909

Reviewed at

5 years agoRemove use of deprecated fields in tests
Nicolás Donatucci [Mon, 3 Apr 2017 22:00:14 +0000 (00:00 +0200)] 
Remove use of deprecated fields in tests

Removed the usage of numCpus, ramMb and diskMb from tests and replaced them with
the Resource set when necessary. Also modified the thrift backfill so that it won't
backfill those resource fields anymore.

Related Issue: AURORA-1707

Reviewed at

5 years agoEnsure enum tables are complete after a snapshot restore.
Zameer Manji [Thu, 30 Mar 2017 18:38:36 +0000 (11:38 -0700)] 
Ensure enum tables are complete after a snapshot restore.

In our in memory database, we model enums as two column tables. The two columns
would be `id` which corresponds to the integer value in the thrift enum and
`name` which is the all caps string name of the enum. For example to model the
`JobUpdateStatus` enum we have a table called `job_update_statuses`. In there
the `ROLLING_FORWARD` enum is modeled as a row `(0, "ROLLING_FORWARD")`. Other
tables reference the enum table via the id.

When we prepare storage on startup the `DbStorage` starts up. It does two things:
1. Load in the schema.
2. Populate the enum tables.

This ensures that when we insert values into the database, the enum references
will be valid.

However, before we restore from a Snapshot with the `dbScript` field, we blow
all of that data away and restore what was in the snapshot:
try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) {
  LOG.info("Dropping all tables");
  try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS"))

This means that if we add a new enum value, and then restore from a snapshot,
that enum value will not exist in the table any more. We could address this by
saying that every enum value addition requires a migration. Instead, I propose
that we do not lose the work done by `DbStorage` and re-hydrate the enum
tables after the restore.

To do this I extracted the logic into a new class `EnumBackfill`. Restoring from
a snapshot calls this after the migrations are done. The underlying SQL was
changed from `INSERT` to `MERGE` to make this work.
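
Why an upsert is needed can be sketched with an in-memory SQLite database.
This is only an analogy: SQLite's `INSERT OR REPLACE` stands in for the
`MERGE` statement mentioned above, `backfill_enum` is a hypothetical helper,
and the enum value with id 99 is invented for the example.

```python
import sqlite3


def backfill_enum(conn, table, values):
    """Upsert every enum value so rows missing after a restore reappear."""
    conn.executemany(
        "INSERT OR REPLACE INTO %s (id, name) VALUES (?, ?)" % table,
        sorted(values.items()),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_update_statuses (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# Simulate a restored snapshot that predates the new enum value.
conn.execute("INSERT INTO job_update_statuses VALUES (0, 'ROLLING_FORWARD')")
backfill_enum(
    conn,
    "job_update_statuses",
    {0: "ROLLING_FORWARD", 99: "HYPOTHETICAL_NEW_STATUS"},
)
rows = conn.execute("SELECT id, name FROM job_update_statuses ORDER BY id").fetchall()
```

A plain `INSERT` would fail here on the existing row with id 0, which is the
reason the backfill has to tolerate rows that are already present.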

Testing Done:
existing tests and e2e tests

I also added a new enum value to `JobUpdateStatus` and observed it was correctly
loaded in.

Bugs closed: AURORA-1912

Reviewed at

5 years agoSort the set objects inside TaskConfig during Job diff.
Santhosh Kumar Shanmugham [Thu, 30 Mar 2017 05:28:26 +0000 (22:28 -0700)] 
Sort the set objects inside TaskConfig during Job diff.

Sort the entries in `set` fields inside `TaskConfig` as strings before
shelling out to diff so that the output is consistent and meaningful.
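
The effect of sorting can be sketched in Python. The example uses `difflib`
from the standard library as a stand-in for the external diff program, and a
plain dict in place of `TaskConfig`; `render_config` and `diff_configs` are
hypothetical helpers.

```python
import difflib


def render_config(config):
    """Render a dict as lines, sorting set entries by their string form."""
    lines = []
    for key in sorted(config):
        value = config[key]
        if isinstance(value, (set, frozenset)):
            # Sets have no stable order, so sort before rendering.
            value = sorted(value, key=str)
        lines.append("%s: %s" % (key, value))
    return lines


def diff_configs(old, new):
    """Return unified diff lines between two rendered configs."""
    return list(
        difflib.unified_diff(render_config(old), render_config(new), lineterm="")
    )
```

With the sort in place, two configs whose sets hold the same entries render
identically, so the diff is empty regardless of set iteration order.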

Testing Done:

Bugs closed: AURORA-1913

Reviewed at

5 years agoReset `framework_registered` metric on disconnection.
Zameer Manji [Wed, 29 Mar 2017 20:35:21 +0000 (13:35 -0700)] 
Reset `framework_registered` metric on disconnection.

Previously the `framework_registered` metric only transitioned from 0 to 1 on
the first registration. On disconnection and registration loss, the metric was
not updated to reflect the loss of registration.

To make this metric more useful, I have moved this metric from the
`SchedulerLifecycle`, where it was tied to the boolean controlling the
LEADER_AWAITING_REGISTRATION -> ACTIVE transition, to `MesosCallbackHandler`. In
`MesosCallbackHandler` it can easily be updated to reflect the current state
of registration.

Bugs closed: AURORA-1910

Reviewed at

5 years agoSupport Mesos Maintenance
Zameer Manji [Thu, 23 Mar 2017 21:17:40 +0000 (14:17 -0700)] 
Support Mesos Maintenance

This adds support for Mesos Maintenance per the design doc[1].

Per the design the scheduler gains another parameter,
`unavailability_threshold`. With this threshold the scheduler does the
following:

1. Accept all inverse offers from Mesos.
2. Drain when accepting an inverse offer if the unavailability starts within the
   threshold.
3. Veto any offers with unavailability starting within the threshold.
4. Penalize offers that have unavailability information.
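
The rules above can be sketched as two small decision functions. This is an
illustration in Python rather than the scheduler's `java.time` code; the
function names and the string results are hypothetical, and "within the
threshold" is assumed to mean that the unavailability starts no later than
`now + threshold`.

```python
from datetime import datetime, timedelta, timezone


def starts_within_threshold(unavailability_start, now, threshold):
    """True if unavailability is known and begins no later than now + threshold."""
    return unavailability_start is not None and unavailability_start - now <= threshold


def evaluate_offer(unavailability_start, now, threshold):
    """Return 'veto', 'penalize', or 'accept' for a regular offer."""
    if unavailability_start is None:
        return "accept"
    if starts_within_threshold(unavailability_start, now, threshold):
        return "veto"
    return "penalize"


def should_drain(unavailability_start, now, threshold):
    """Inverse offers are always accepted; draining depends on the threshold."""
    return starts_within_threshold(unavailability_start, now, threshold)


now = datetime(2017, 3, 23, tzinfo=timezone.utc)
threshold = timedelta(hours=1)
```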

For readability and safety the time based code uses the new `java.time` package
in Java 8, primarily relying on the `Instant` class.


Testing Done:
e2e tests

Bugs closed: AURORA-1904

Reviewed at

5 years agoMake Thermos observer resource collection intervals configurable
Stephan Erb [Tue, 21 Mar 2017 08:29:41 +0000 (09:29 +0100)] 
Make Thermos observer resource collection intervals configurable

We have noticed that on hosts with lots of active tasks (~100) the observer UI
is not usable. Thermos fully utilizes one core but does not render any requests.

Dumping `/threads` indicates the observer might be backlogged by the hundred
concurrent `TaskResourceMonitor` threads. Due to the Python GIL only one can
make progress at a time though.

This patch is now adding options to control the resource collection interval,
giving operators a possibility to reduce the CPU pressure.

Testing Done:
./pants test.pytest src/{test,main}/python:: -- -v

Bugs closed: AURORA-1907

Reviewed at

5 years agoUse Process.oneshot() in latest psutils for faster stats retrieval.
Stephan Erb [Sun, 19 Mar 2017 15:01:50 +0000 (16:01 +0100)] 
Use Process.oneshot() in latest psutils for faster stats retrieval.

Without Process.oneshot(), stats retrieval can lead to
multiple reads of the same `/proc` filesystem values. The oneshot
context manager enables caching to speed this up. It was added in
psutil 5.0.

Oneshot docs:

Bugs closed: AURORA-1907

Reviewed at

5 years agoPopulate `host` and `webURL` fields of FrameworkInfo.
Zameer Manji [Sat, 18 Mar 2017 00:06:15 +0000 (17:06 -0700)] 
Populate `host` and `webURL` fields of FrameworkInfo.

This patch extracts out `FrameworkInfo` construction from the `DriverSettings`
data class to a factory class. This factory class combines the base info
constructed via CLI arguments with the HTTP server's host and port information.
This allows us to populate the `host` and `weburl` fields of framework info,
which enhance the Mesos UI.

This is necessary for users of the `V1_DRIVER` as the new driver does not
automatically populate the `host` field. Further, by using our own host and port
information, we ensure the information in ZooKeeper, the information used for
HTTP redirects and the information in the Mesos UI are all in sync.

Note that in vagrant, the hostname and URL are `aurora.local` because we set the
`--hostname` argument of the scheduler. By default Java will set it to the FQDN
or IP address of the host.

Testing Done:
e2e tests.

Bugs closed: AURORA-1905

Reviewed at

5 years agoUse --launch_info when invoking MesosContainerizer.
Santhosh Kumar Shanmugham [Tue, 14 Mar 2017 02:20:33 +0000 (19:20 -0700)] 
Use --launch_info when invoking MesosContainerizer.

MesosContainerizer has updated the command line parameters in 1.2.0 and
consolidated the individual arguments into a single ContainerLaunchInfo
protobuf message. Update ThermosExecutor to use the new `--launch_info`
parameter to be compatible with MesosContainerizer. Also check the
containerizer binary interface to stay backward-compatible.

Bugs closed: AURORA-1882

Reviewed at

5 years agoChange Resource Validation in ConfigurationManager so that it validates the Resource...
Nicolás Donatucci [Mon, 13 Mar 2017 20:59:42 +0000 (21:59 +0100)] 
Change Resource Validation in ConfigurationManager so that it validates the Resource Set instead of deprecated fields

The Resource validation in ConfigurationManager is now done against the Resource set instead of the NumCpus, RamMb and DiskMb fields.

Related Issue: AURORA-1707

Reviewed at

5 years agoReduce log output in `VersionedSchedulerDriverService`.
Zameer Manji [Wed, 8 Mar 2017 21:06:02 +0000 (13:06 -0800)] 
Reduce log output in `VersionedSchedulerDriverService`.

The `acceptOffers` log message outputs the entire `Operation` object which for
the `LAUNCH` type includes the entire `TaskInfo` protobuf. This makes the log
output massive. This reduces the logging to just the type of the operation.

Reviewed at

5 years agoRemove SerializableClock interface.
Zameer Manji [Tue, 7 Mar 2017 04:21:02 +0000 (20:21 -0800)] 
Remove SerializableClock interface.

This removes the `SerializableClock` interface. We are not serializing `Clock`
classes anywhere, so this should be safe to remove.

Reviewed at

5 years agoEnable Mesos HTTP API.
Zameer Manji [Thu, 2 Mar 2017 23:07:11 +0000 (15:07 -0800)] 
Enable Mesos HTTP API.

This patch completes the design doc[1] and enables operators to choose between
two V1 Mesos API implementations. The first is `V0Mesos` which offers the V1 API
backed by the scheduler driver and the second is `V1Mesos` which offers the V1
API backed by a new HTTP API implementation.

There are three sets of changes in this patch.

First, the V1 Mesos code requires a Scheduler callback with a different API. To
maximize code reuse, event handling logic was extracted into a
`MesosCallbackHandler` class. `VersionedMesosSchedulerImpl` was created to
implement the new callback interface. Both callbacks now use the handler class
for logic.

Second, a new driver implementation using the new API was created. All of the
logic for the new driver is encapsulated in the
`VersionedSchedulerDriverService` class.

Third, some wiring changes were done to allow for Guice to do its work and
allow for operators to select between the different driver implementations.


Testing Done:
The e2e test has been run three times, each time with a different driver option.

Bugs closed: AURORA-1887, AURORA-1888

Reviewed at

5 years agoFix scheduler_framework_disconnects stat.
Ilya Pronin [Mon, 27 Feb 2017 19:04:54 +0000 (11:04 -0800)] 
Fix scheduler_framework_disconnects stat.

Refactoring in r/31550 has disabled incrementing scheduler_framework_disconnects
stats. This change brings it back.

Testing Done:
Added a check to `MesosSchedulerImplTest.testDisconnected()`. Manually verified
in Vagrant by starting/stopping mesos-master and querying `/vars` endpoint.

Bugs closed: AURORA-1860

Reviewed at

5 years agoCurrently snapshot times are exposed for the entire snapshot save/apply operation...
Mehrdad Nurolahzade [Sat, 25 Feb 2017 04:31:20 +0000 (20:31 -0800)] 
Currently snapshot times are exposed for the entire snapshot save/apply operation. This patch provides the means to collect finer grained metrics on individual fields in a snapshot.

Bugs closed: AURORA-1870

Reviewed at

5 years agoMove task conversion during reconciliation into the delayed closure.
David McLaughlin [Wed, 22 Feb 2017 16:41:01 +0000 (08:41 -0800)] 
Move task conversion during reconciliation into the delayed closure.

This is a small change to relieve GC pressure while explicit reconciliation runs. It moves the IScheduledTask -> TaskStatus conversion into the batch processing closure so that any object allocation and collection overhead is delayed until the batch is actually processed. It has a noticeable effect on GC for large amounts of RUNNING tasks.

Reviewed at

5 years agoAdd best effort pulse timestamp recovery.
Zameer Manji [Thu, 16 Feb 2017 20:08:34 +0000 (12:08 -0800)] 
Add best effort pulse timestamp recovery.

Currently the scheduler causes all coordinated ("pulsed") updates into
startup/recovery. This is because the last pulse timestamp is not durably stored
and the timestamp of the last pulse is set to 0L (aka no pulse yet).

In cases where the pulse timeout is larger and the failover is fast or frequent,
this causes many updates to unnecessarily transition into a pulse related state
until the next pulse.

It is possible to avoid these unnecessary transitions by traversing the job update
events and initializing the last pulse timestamp to the last event if the last
event was not a pulse event.

Bugs closed: AURORA-1890

Reviewed at

5 years agoAdd DSL and E2E changes for per task volume mounts.
Zameer Manji [Wed, 15 Feb 2017 01:09:37 +0000 (17:09 -0800)] 
Add DSL and E2E changes for per task volume mounts.

Enables the client DSL to set per task volume mounts. This also adds a E2E test
that tests per task volume mounting.

Testing Done:
sh ./src/test/sh/org/apache/aurora/e2e/

Bugs closed: AURORA-1107

Reviewed at

5 years agoExpose task pruning endpoint in aurora_admin. Useful for scale testing in order to...
David McLaughlin [Tue, 14 Feb 2017 21:33:01 +0000 (13:33 -0800)] 
Expose task pruning endpoint in aurora_admin. Useful for scale testing in order to 'clean up' after a test run, but also useful in production if you have a bad actor inflating the size of your task index.

Bugs closed: AURORA-1893

Reviewed at

5 years agoDisplaying update id after 'Killed for job update' message for the update that
Abhishek Jain [Mon, 13 Feb 2017 20:11:13 +0000 (12:11 -0800)] 
Displaying update id after 'Killed for job update' message for the update that
resulted in the task getting killed.

Testing Done:
aurora job create devcluster/www-data/devel/hello_world my_jobs/new_hello_world_job.aurora
aurora update start devcluster/www-data/devel/hello_world my_jobs/new_hello_world_job_update.aurora

Completed Task status information:
3 minutes ago - KILLED : Instructed to kill task.
02/09 19:52:53 LOCAL • PENDING
02/09 19:52:53 LOCAL • ASSIGNED
02/09 19:52:54 LOCAL • STARTING • Initializing sandbox.
02/09 19:52:55 LOCAL • RUNNING • No health-check defined, task is assumed healthy.
02/09 19:53:08 LOCAL • KILLING • Killed for job update : 900256bb-9cad-41d6-b330-d74a751239bf
02/09 19:53:10 LOCAL • KILLED • Instructed to kill task.

Build tests:

Bugs closed: AURORA-1806

Reviewed at

5 years agoAdd additional tests for the conversion of TaskStatus.
Zameer Manji [Wed, 8 Feb 2017 17:57:18 +0000 (09:57 -0800)] 
Add additional tests for the conversion of TaskStatus.

This adds additional testing for the `ProtosConversions` class, ensuring there
is the correct conversion between `SlaveID` and `AgentID`.

Reviewed at

5 years agoUpdate PMD to 5.5.3 with the for us relevant fixes:
Stephan Erb [Tue, 7 Feb 2017 22:12:45 +0000 (23:12 +0100)] 
Update PMD to 5.5.3 with the for us relevant fixes:

* [java] InvalidSlf4jMessageFormat: False positive with placeholder and exception
* [java] InvalidSlf4jMessageFormat: fails with NPE

Full changelog:

The increase of the heap size is not really related. However, given the hard to
trace out of memory errors we have seen in some Jenkins builds recently, it is
probably worth a shot.

Testing Done:
./gradlew -Pq build

Reviewed at

5 years agoMove Aurora to v1 Protobufs.
Zameer Manji [Mon, 6 Feb 2017 23:09:55 +0000 (15:09 -0800)] 
Move Aurora to v1 Protobufs.

This is the first step in moving Aurora to the V1 API from Mesos. This patch
moves most of the code to v1 Protobufs. This means all pieces of code that do
not interact with Mesos now handle only v1 Protobufs.

Classes that interact with Mesos directly are:

* `org.apache.aurora.scheduler.mesos.SchedulerDriverService`
* `org.apache.aurora.scheduler.mesos.MesosSchedulerImpl`
* `org.apache.aurora.scheduler.mesos.DriverFactoryImpl`

These classes handle unversioned Protobufs and use the `ProtosConversion` class
to convert them to v1 Protobufs that can be safely passed to the rest of the

Bugs closed: AURORA-1886

Reviewed at

5 years agoAdd message parameter to killTasks
Cody Gibb [Mon, 6 Feb 2017 18:43:01 +0000 (10:43 -0800)] 
Add message parameter to killTasks

RPCs such as pauseJobUpdate include a parameter for "a user-specified message
to include with the induced job update state change." This diff provides a
similar optional parameter for the killTasks RPC, which allows users to indicate
the reason why a task was killed, and later inspect that reason when consuming
task events.

Example usage from Aurora CLI:
`$ aurora job killall devcluster/www-data/prod/hello --message "Some message"`

In the task event, the supplied message (if provided) is appended to the
existing template "Killed by <user>", separated by a newline. For the above
example, this looks like: "Killed by aurora\nSome message".
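
The message format described above can be sketched as follows. This is an
illustration only, not the scheduler's code; the function name `kill_message`
is hypothetical.

```python
from typing import Optional


def kill_message(user: str, message: Optional[str] = None) -> str:
    """Build the task-event text for a kill (hypothetical helper)."""
    base = f"Killed by {user}"
    # The user-supplied message is optional; when present it is appended
    # to the existing template, separated by a newline.
    return f"{base}\n{message}" if message else base


print(repr(kill_message("aurora", "Some message")))
print(repr(kill_message("aurora")))
```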

Testing Done:
Added a unit test in the scheduler, and a test in the client.

Also manually tested using the Vagrant environment.

Bugs closed: AURORA-1846

Reviewed at

5 years agoIncrementing snapshot version to 0.18.0-SNAPSHOT.
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Incrementing snapshot version to 0.18.0-SNAPSHOT.

5 years agoUpdating CHANGELOG for 0.17.0 release.
Stephan Erb [Wed, 1 Feb 2017 08:35:14 +0000 (09:35 +0100)] 
Updating CHANGELOG for 0.17.0 release.

5 years agoPrepare release notes for 0.17.0
Stephan Erb [Wed, 1 Feb 2017 08:03:22 +0000 (09:03 +0100)] 
Prepare release notes for 0.17.0

Reviewed at

5 years agoSuppress role deprecation warning as replacement is not yet ready.
David McLaughlin [Wed, 1 Feb 2017 07:38:54 +0000 (08:38 +0100)] 
Suppress role deprecation warning as replacement is not yet ready.

The role field was prematurely deprecated in the Mesos project.

Reviewed at

5 years agoFixed starting cron jobs when using default_docker_parameters
Steve Niemitz [Tue, 31 Jan 2017 17:18:42 +0000 (18:18 +0100)] 
Fixed starting cron jobs when using default_docker_parameters

The code was previously attempting to re-sanitize the configuration read from
storage rather than just using it as is. This causes issues if, after
sanitization, the job no longer passes sanitization (which is the case here w/

We've been running this in our branch forever.

Bugs closed: AURORA-1684

Reviewed at

5 years agoFix flapping TestRunnerKillProcessGroup test.
Stephan Erb [Mon, 30 Jan 2017 20:40:14 +0000 (21:40 +0100)] 
Fix flapping TestRunnerKillProcessGroup test.

The test was working when run in isolation, but failed when executing the
entire Thermos test suite.

Bugs closed: AURORA-1809

Reviewed at

5 years agoFix pendingTasks endpoint in case of multiple TaskGroups per job.
Stephan Erb [Mon, 30 Jan 2017 20:37:59 +0000 (21:37 +0100)] 
Fix pendingTasks endpoint in case of multiple TaskGroups per job.

The central idea of this patch is to change the return value of `getPendingReasons`
from a map keyed by JobKey to a map keyed by `TaskGroupKey`. This prevents the
`IllegalArgumentException` during the map construction.
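
A minimal sketch of why the key change helps, assuming the exception came from
a map builder that rejects duplicate keys (as Guava's `ImmutableMap` builder
does). The `TaskGroupKey` tuple, `strict_map`, and the sample reasons below are
hypothetical stand-ins, not the scheduler's types.

```python
from collections import namedtuple

# Hypothetical stand-in: a task group is identified by job key plus task config.
TaskGroupKey = namedtuple("TaskGroupKey", ["job_key", "task_config"])


def strict_map(pairs):
    """Build a dict, rejecting duplicate keys like a strict map builder."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


# Two task groups that belong to the same job (e.g. during an update).
pending = [
    (TaskGroupKey("devcluster/www-data/devel/hello_world", "config-a"), "reason a"),
    (TaskGroupKey("devcluster/www-data/devel/hello_world", "config-b"), "reason b"),
]


def reasons_by_job_key(groups):
    # Old shape: both groups collapse onto the same job key and collide.
    return strict_map((group.job_key, reason) for group, reason in groups)


def reasons_by_group_key(groups):
    # New shape: each task group keeps its own entry.
    return strict_map(groups)


print(len(reasons_by_group_key(pending)))
```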

Bugs closed: AURORA-1879

Reviewed at

5 years agoMove deprecated resource validations so they happen after the thrift backfill.
Nicolás Donatucci [Mon, 30 Jan 2017 19:18:59 +0000 (11:18 -0800)] 
Move deprecated resource validations so they happen after the thrift backfill.

As the validations for NumCpus, RamMb and DiskMb happened before the thrift
backfill, those values needed to be set, even though they are deprecated. In the
thrift backfill, if the Resources field is set, then NumCpus, RamMb and DiskMb
are set accordingly.

So by moving those validations, it is now possible to only set the Resources
field instead of having to set the deprecated fields. As the validations are
moved and not removed, the check for the resource values being greater than 0
still happens. Furthermore, if the Resources field is set but there is no
Resource for Ram in the set, the thrift backfill will throw an

Some tests were slightly modified because of this, mostly by adding an
unsetResources() operation. This is because as the validations now happen after
the thrift backfill, during the thrift backfill the values in the deprecated
fields are replaced by those in the Resources field (if it is set). There are
also some new tests.
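
The ordering described above can be sketched roughly as follows. All names
here are hypothetical, the task is modelled as a plain dict, and `ValueError`
is only a placeholder for whatever the real backfill and validation raise.

```python
# Hypothetical mapping from deprecated fields to entries in the Resources set.
DEPRECATED_TO_RESOURCE = {"numCpus": "cpus", "ramMb": "ram_mb", "diskMb": "disk_mb"}


def backfill(task):
    """If Resources is set, derive the deprecated fields from it."""
    resources = task.get("resources")
    if resources:
        for old_field, resource_name in DEPRECATED_TO_RESOURCE.items():
            if resource_name not in resources:
                # Placeholder exception type (an assumption).
                raise ValueError(f"missing resource: {resource_name}")
            task[old_field] = resources[resource_name]
    return task


def validate(task):
    """The moved check: every deprecated value must be greater than 0."""
    for old_field in DEPRECATED_TO_RESOURCE:
        if task.get(old_field, 0) <= 0:
            raise ValueError(f"{old_field} must be greater than 0")
    return task


def sanitize(task):
    # Backfill first, then validate, so setting only Resources is enough.
    return validate(backfill(dict(task)))


only_resources = {"resources": {"cpus": 1.0, "ram_mb": 128, "disk_mb": 64}}
print(sanitize(only_resources)["ramMb"])
```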

Related Issue: AURORA-1707

Testing Done:

Reviewed at

5 years agoExpose Thrift server request workload stats
Mehrdad Nurolahzade [Mon, 30 Jan 2017 13:00:15 +0000 (14:00 +0100)] 
Expose Thrift server request workload stats

This patch introduces a number of stats that measure the workload generated by
Thrift server requests.

Current Thrift server stats expose the number and timing of requests received
by the server. However, they fail to reflect the size of the requests. This is
limiting us in having an accurate view of the workload handled by the scheduler.
For example, every call to `restartShards()` is recorded as one event despite
the fact that a request might only restart one shard while another request might
seek to restart 1K shards.
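
To illustrate the difference between counting requests and counting workload,
here is a small sketch. The class and attribute names are hypothetical and do
not correspond to the stat names added by the patch.

```python
from collections import Counter


class WorkloadStats:
    """Tracks both the number of calls and the number of items they touch."""

    def __init__(self):
        self.events = Counter()    # one increment per request
        self.workload = Counter()  # incremented by the size of each request

    def record(self, method, item_count):
        self.events[method] += 1
        self.workload[method] += item_count


stats = WorkloadStats()
stats.record("restartShards", 1)     # a request restarting one shard
stats.record("restartShards", 1000)  # a request restarting 1K shards
print(stats.events["restartShards"], stats.workload["restartShards"])
```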

Bugs closed: AURORA-1826

Reviewed at

5 years agoPreemption performance improvement and new metrics release notes entry
Mehrdad Nurolahzade [Sat, 28 Jan 2017 09:48:58 +0000 (10:48 +0100)] 
Preemption performance improvement and new metrics release notes entry

Reviewed at

5 years agoCapture health check output.
Dmitriy Shirchenko [Wed, 25 Jan 2017 21:21:37 +0000 (13:21 -0800)] 
Capture health check output.

Users could really benefit from seeing the output of a failed shell health
check, so this change plumbs the output through.

Testing Done:
added unit tests
e2e tests
screenshot attached.

Bugs closed: AURORA-1881

Reviewed at

5 years agoExpose finer grained offer veto stats
Mehrdad Nurolahzade [Wed, 25 Jan 2017 19:26:56 +0000 (13:26 -0600)] 
Expose finer grained offer veto stats

Bugs closed: AURORA-1835

Reviewed at

5 years agoConsider reserving for multiple tasks per preemption round
Mehrdad Nurolahzade [Tue, 24 Jan 2017 18:20:37 +0000 (19:20 +0100)] 
Consider reserving for multiple tasks per preemption round

To be fair, PendingTaskProcessor interleaves tasks from different groups.
However, this fairness comes at the price of increasing reservation time.
Even if reservations are being made for the same task group, the processor
would still restart iterating through slaves for each task instance. This
results in reevaluating all slaves already rejected in a previous search
before it finds a new viable candidate.

This patch improves `PendingTaskProcessor` performance by reducing slave
search/evaluation time, at the cost of reduced fairness.
`PendingTaskProcessor` now does reservation for a configurable maximum of
_N_ candidates per task group in each iteration over the list of slaves.
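
A rough sketch of the idea, under simplifying assumptions: slaves and task
groups are plain values, `fits` is a hypothetical predicate, and each task
group makes a single pass over the slave list, stopping after `max_per_group`
reservations. The real `PendingTaskProcessor` logic is more involved.

```python
def reserve(slaves, pending_by_group, fits, max_per_group):
    """Reserve up to max_per_group slaves per task group in one pass each."""
    reservations = {}
    taken = set()
    for group, pending_count in pending_by_group.items():
        limit = min(pending_count, max_per_group)
        found = []
        # A single iteration over the slaves serves several task instances,
        # instead of restarting the search for every instance.
        for slave in slaves:
            if len(found) == limit:
                break
            if slave not in taken and fits(group, slave):
                found.append(slave)
                taken.add(slave)
        reservations[group] = found
    return reservations


slaves = ["s1", "s2", "s3", "s4", "s5"]
result = reserve(slaves, {"a": 3, "b": 3}, lambda group, slave: True, 2)
print(result)
```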

Bugs closed: AURORA-1867

Reviewed at

5 years agoEvaluate multiple preemption proposals per round
Mehrdad Nurolahzade [Tue, 24 Jan 2017 16:07:09 +0000 (17:07 +0100)] 
Evaluate multiple preemption proposals per round

`TaskScheduler` makes an attempt to preempt already identified candidates
through `Preemptor` when it fails to schedule one or more tasks. However,
`Preemptor` currently evaluates only one proposal per invocation. A proposal
may get vetoed at this point by scheduling filters. If a proposal fails
validation the task group might get penalized by `TaskGroups` to give
`PendingTaskProcessor` some time to find new preemption candidates; despite
the fact that another proposal may already exist in `slotCache`. This penalty
might result in expiration of existing proposals in `slotCache`, hence slowing
down the overall preemption process.

This patch modifies `Preemptor` so that it evaluates all existing preemption
proposals before giving up.
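
In outline, the change amounts to looping over the cached proposals rather
than testing only the first one. The sketch below uses hypothetical names and
a plain list in place of `slotCache`.

```python
def attempt_preemption(proposals, passes_filters):
    """Return the first proposal that survives validation, or None."""
    for proposal in proposals:
        # A vetoed proposal no longer ends the attempt; move on to the next.
        if passes_filters(proposal):
            return proposal
    # Only give up (and let the task group be penalized) when nothing is left.
    return None


cached = ["slot-1", "slot-2", "slot-3"]
print(attempt_preemption(cached, lambda p: p == "slot-2"))
```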

Bugs closed: AURORA-1868

Reviewed at

5 years agoMake leader elections resilient to ZK disconnections.
Zameer Manji [Mon, 23 Jan 2017 22:38:56 +0000 (14:38 -0800)] 
Make leader elections resilient to ZK disconnections.

As documented in AURORA-1840, the Curator `LeaderLatch` recipe abdicates
leadership if the ZK connection is lost or if there is a timeout. This is not
compatible with the commons-based implementation, which would only abdicate
leadership if a ZK session timeout occurred.

This replaces the `LeaderLatch` recipe with the `LeaderSelector` recipe with a
custom listener that only loses leadership if a connection loss occurs.

Bugs closed: AURORA-1669

Reviewed at

5 years agoAURORA-1876 Expose stats on scheduler rate limiter
Mehrdad Nurolahzade [Mon, 23 Jan 2017 20:58:19 +0000 (14:58 -0600)] 
AURORA-1876 Expose stats on scheduler rate limiter

This patch exposes stats on `rateLimiter.acquire()` blocking events in `TaskGroups`,
providing visibility into whether the scheduling rate is above or below `MAX_SCHEDULE_ATTEMPTS_PER_SEC`.
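
One way to picture such stats is a thin wrapper around the limiter. This
sketch assumes an `acquire` callable that returns the seconds spent waiting
(Guava's `RateLimiter.acquire()` behaves this way); the class and attribute
names are hypothetical.

```python
class RateLimiterStats:
    """Counts how often, and for how long, acquire() had to block."""

    def __init__(self, acquire):
        self._acquire = acquire  # callable returning seconds spent waiting
        self.acquires = 0
        self.blocked = 0
        self.blocked_seconds = 0.0

    def acquire(self):
        waited = self._acquire()
        self.acquires += 1
        if waited > 0:
            # A non-zero wait means callers asked faster than the limit allows.
            self.blocked += 1
            self.blocked_seconds += waited
        return waited


# Simulated wait times stand in for a real rate limiter.
waits = iter([0.0, 0.25, 0.0, 0.5])
limiter = RateLimiterStats(lambda: next(waits))
for _ in range(4):
    limiter.acquire()
print(limiter.acquires, limiter.blocked, limiter.blocked_seconds)
```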

Bugs closed: AURORA-1876

Reviewed at

5 years agoAURORA-1828 Expose stats on the number of offers evaluated before a task is assigned
Mehrdad Nurolahzade [Mon, 23 Jan 2017 20:56:17 +0000 (14:56 -0600)] 
AURORA-1828 Expose stats on the number of offers evaluated before a task is assigned

Bugs closed: AURORA-1828

Reviewed at

5 years agoFix command escaping when using the Mesos containerizer.
Stephan Erb [Mon, 23 Jan 2017 07:38:52 +0000 (08:38 +0100)] 
Fix command escaping when using the Mesos containerizer.

The important bit is the change to call the Mesos containerizer with
`shell=False`. Getting rid of manual JSON encoding and eliminating shlex
might have helped as well, but was more motivated by clarity rather than
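
As a generic illustration of why `shell=False` sidesteps escaping problems
(this is not the executor's code), the sketch below passes an awkward string
to a child Python process as a single argument. The payload is made up.

```python
import subprocess
import sys

# A value full of characters a shell would interpret.
payload = "it's a \"quoted\" $HOME; echo injected"

# With shell=False the argument list goes to the child process as-is:
# no shell parses it, so no quoting or escaping is needed.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", payload],
    shell=False,
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.rstrip("\n")
print(echoed == payload)
```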

Bugs closed: AURORA-1782

Reviewed at

5 years agoMake announced scheduler endpoint name configurable.
Stephan Erb [Wed, 18 Jan 2017 09:25:54 +0000 (10:25 +0100)] 
Make announced scheduler endpoint name configurable.

We decided to co-deploy an HTTPS enabled reverse proxy in front of each of our
Aurora schedulers. The proxy instances bind to `public_ip:8081` and the
schedulers to `localhost:8081`. By announcing the scheduler endpoint as `https`
we can ensure the default Aurora [client connects via HTTPS](


    [zk: 5] get /aurora/scheduler/member_0000000011

When running with `-serverset_endpoint_name=https`:

    [zk: 0] get /aurora/scheduler/member_0000000019

Bugs closed: AURORA-343

Reviewed at

5 years agoEnsure Aurora thrift support js and html.
John Sirois [Tue, 17 Jan 2017 22:34:39 +0000 (15:34 -0700)] 
Ensure Aurora thrift support js and html.

We use these for the Aurora UI and the API docs.

Bugs closed: AURORA-1875

Reviewed at

5 years agoImprove `thriftw` robustness.
John Sirois [Tue, 17 Jan 2017 21:20:15 +0000 (14:20 -0700)] 
Improve `thriftw` robustness.

Now the selected thrift is checked both for the proper version and
support of the gen langs Aurora requires. In addition, all thrifts on
the `PATH` are checked, and an existing locally built thrift is always verified
to protect against Aurora thrift requirement changes (if we ever add a gen lang

Bugs closed: AURORA-1875

Reviewed at

5 years agoLog process sampling failures with debug severity
Stephan Erb [Tue, 17 Jan 2017 20:53:18 +0000 (21:53 +0100)] 
Log process sampling failures with debug severity

The observer's logs consist of lots of warnings about being unable to find PIDs.
This is expected when running with the PID isolator, or when checkpoints are out
of date (e.g. after processes were killed by the OOM killer).

    W0116 14:42:54.694221 3253] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=27727)
    W0116 14:42:54.717905 3253] Error during process sampling [pid=10960]: psutil.NoSuchProcess process no longer exists (pid=10960)
    W0116 14:42:54.718089 3253] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=10960)
    W0116 14:42:54.718245 3253] Error during process sampling [pid=10026]: psutil.NoSuchProcess process no longer exists (pid=10026)
    W0116 14:42:54.718334 3253] Error during process sampling: psutil.NoSuchProcess process no longer exists (pid=10026)

This change adopts the proposal of David Robinson to decrease the severity level
to debug.

Bugs closed: AURORA-1541

Reviewed at

5 years agoExposed stats on number of offers rescinded and number of slaves lost.
Pradyumna Kaushik [Fri, 13 Jan 2017 21:09:17 +0000 (13:09 -0800)] 
Exposed stats on number of offers rescinded and number of slaves lost.

Testing Done:
curl -w '\n' | grep offers_rescinded
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
offers_rescinded 0

curl -w '\n' | grep slaves_lost
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30970    0 30970    0     0  4323k      0 --:--:-- --:--:-- --:--:-- 5040k
slaves_lost 0


Reviewed at

5 years agoExpose stats on SlotSizeCounter runs.
Mehrdad Nurolahzade [Fri, 13 Jan 2017 19:43:56 +0000 (11:43 -0800)] 
Expose stats on SlotSizeCounter runs.

Bugs closed: AURORA-1874

Reviewed at

5 years agoExpose stats on statically banned offers
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:18:43 +0000 (23:18 +0100)] 
Expose stats on statically banned offers

Bugs closed: AURORA-1859

Reviewed at

5 years agoEliminate sequential scan in MemTaskStore.getJobKeys()
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:17:34 +0000 (23:17 +0100)] 
Eliminate sequential scan in MemTaskStore.getJobKeys()

If the scheduler is configured to run with the `MemTaskStore`, every hit on the
scheduler landing page (`/scheduler`) causes a call to `MemTaskStore.getJobKeys()` through

The implementation of `MemTaskStore.getJobKeys()` is currently very inefficient
as it requires a sequential scan of the task store and mapping to their
respective job keys. In Twitter clusters this method is currently taking half a
second per call (`mem_storage_get_job_keys`).

This patch eliminates the sequential scan and mapping to job key by simply
returning an immutable copy of the key set of the existing secondary index `job`.
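
The difference between the two approaches can be sketched as below. The class
is a hypothetical, heavily simplified model of a task store with one secondary
index; it does not mirror the real `MemTaskStore` API.

```python
class TaskStoreSketch:
    """Toy task store with a secondary index from job key to task ids."""

    def __init__(self):
        self._tasks = {}      # task id -> job key
        self._job_index = {}  # job key -> set of task ids

    def save(self, task_id, job_key):
        self._tasks[task_id] = job_key
        self._job_index.setdefault(job_key, set()).add(task_id)

    def job_keys_scan(self):
        # Old approach: visit every task and map it to its job key.
        return frozenset(self._tasks.values())

    def job_keys_indexed(self):
        # New approach: an immutable copy of the index's key set.
        return frozenset(self._job_index)


store = TaskStoreSketch()
for i in range(1000):
    store.save(f"task-{i}", f"role/env/job-{i % 3}")
print(sorted(store.job_keys_indexed()))
```

The indexed variant does work proportional to the number of jobs rather than
the number of tasks.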

Bugs closed: AURORA-1847

Reviewed at

5 years agoExpose stats on deleted job updates in JobUpdateHistoryPruner
Mehrdad Nurolahzade [Wed, 11 Jan 2017 22:15:03 +0000 (23:15 +0100)] 
Expose stats on deleted job updates in JobUpdateHistoryPruner

Bugs closed: AURORA-1856

Reviewed at

5 years agoReduce logging by ChainedStatusChecker and StatusManager when they're on the happy...
Joshua Cohen [Wed, 11 Jan 2017 22:19:49 +0000 (16:19 -0600)] 
Reduce logging by ChainedStatusChecker and StatusManager when they're on the happy path.

Bugs closed: AURORA-1878

Reviewed at

5 years agoClean up instances of loggers using a logger name from another class.
Bing-Qian Luan [Wed, 11 Jan 2017 16:00:32 +0000 (10:00 -0600)] 
Clean up instances of loggers using a logger name from another class.

Bugs closed: AURORA-1873

Reviewed at

5 years agoExpose stats on ZooKeeper connection state
Jing Chen [Tue, 10 Jan 2017 22:35:21 +0000 (23:35 +0100)] 
Expose stats on ZooKeeper connection state

* zk_connection_state_STATE shows 1 if STATE is current connection state, otherwise 0.
* zk_connection_state_STATE_counter represents the number of occurrences of the STATE since scheduler state
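
The two kinds of stats can be modelled like this. The state names are those
of Curator's `ConnectionState` enum; the class itself and the exact casing of
the generated stat names are assumptions for illustration.

```python
from collections import Counter

# Values of Curator's ConnectionState enum.
STATES = ("CONNECTED", "SUSPENDED", "RECONNECTED", "LOST", "READ_ONLY")


class ConnectionStateStats:
    """Gauge for the current state plus a counter per state."""

    def __init__(self):
        self.current = None
        self.counters = Counter()

    def state_changed(self, new_state):
        self.current = new_state
        self.counters[new_state] += 1

    def snapshot(self):
        stats = {}
        for state in STATES:
            # 1 if this is the current connection state, otherwise 0.
            stats[f"zk_connection_state_{state}"] = 1 if state == self.current else 0
            # How many times this state has been entered so far.
            stats[f"zk_connection_state_{state}_counter"] = self.counters[state]
        return stats


zk_stats = ConnectionStateStats()
for state in ("CONNECTED", "SUSPENDED", "RECONNECTED", "SUSPENDED"):
    zk_stats.state_changed(state)
snapshot = zk_stats.snapshot()
print(snapshot["zk_connection_state_SUSPENDED"],
      snapshot["zk_connection_state_SUSPENDED_counter"])
```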

Bugs closed: AURORA-1838

Reviewed at

5 years agoEnsure destination exists when mounting files into a filesystem image.
Joshua Cohen [Tue, 10 Jan 2017 22:11:54 +0000 (16:11 -0600)] 
Ensure destination exists when mounting files into a filesystem image.

When testing filesystem isolation internally, we ran into an issue where mounting a regular file
into the task filesystem failed with exit code 32 since the mount destination did not exist. To
account for this, we'll touch an empty file in the taskfs.
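
A minimal sketch of the "touch the destination first" step, assuming the task
filesystem root is an ordinary directory. The function name and example paths
are hypothetical, and the actual mount call is omitted.

```python
import os
import tempfile


def ensure_mount_destination(taskfs_root, destination):
    """Create an empty file at destination inside taskfs_root if missing."""
    target = os.path.join(taskfs_root, destination.lstrip("/"))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # Opening in append mode creates the file without truncating an existing one.
    with open(target, "a"):
        pass
    return target


with tempfile.TemporaryDirectory() as taskfs:
    created = ensure_mount_destination(taskfs, "/etc/app/config.json")
    exists = os.path.isfile(created)
    size = os.path.getsize(created)
print(exists, size)
```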

Reviewed at

5 years agoReduce storage write lock contention by adopting Double-Checked Locking pattern in
Mehrdad Nurolahzade [Wed, 4 Jan 2017 21:50:46 +0000 (15:50 -0600)] 
Reduce storage write lock contention by adopting Double-Checked Locking pattern in

`TimedOutTaskHandler` acquires the storage write lock for every task each time it transitions to a
transient state. It then verifies, after a default time-out period of 5 minutes, whether the task has
transitioned out of the transient state.

The verification step takes place while holding the storage write lock. In over 99% of cases the
logic short-circuits and returns from `StateManagerImpl.updateTaskAndExternalState()` once it learns
the task has transitioned out of the transient state.
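
The general shape of double-checked locking, applied to this situation, looks
roughly like the sketch below. It is not the scheduler's implementation: the
class, the dict-based task state, and the `LOST` target state are stand-ins
chosen for illustration.

```python
import threading


class TimeoutHandlerSketch:
    """Checks a task's state before and after taking the write lock."""

    def __init__(self, task_states):
        self._states = task_states  # task id -> current state
        self._write_lock = threading.Lock()
        self.lock_acquisitions = 0

    def on_timeout(self, task_id, transient_state):
        # First check, without the write lock: in the common case the task
        # has already left the transient state and there is nothing to do.
        if self._states.get(task_id) != transient_state:
            return False
        with self._write_lock:
            self.lock_acquisitions += 1
            # Second check, under the lock: the state may have changed
            # between the first check and acquiring the lock.
            if self._states.get(task_id) != transient_state:
                return False
            self._states[task_id] = "LOST"  # illustrative timeout action
            return True


states = {"t1": "RUNNING", "t2": "KILLING"}
handler = TimeoutHandlerSketch(states)
moved_on = handler.on_timeout("t1", "STARTING")  # already left STARTING
stuck = handler.on_timeout("t2", "KILLING")      # still stuck in KILLING
print(moved_on, stuck, handler.lock_acquisitions)
```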

This patch reduces storage write lock contention by adopting Double-Checked Locking pattern in

Bugs closed: AURORA-1820

Reviewed at

5 years agoExpose stats on undelivered event bus events
Mehrdad Nurolahzade [Tue, 27 Dec 2016 22:32:26 +0000 (23:32 +0100)] 
Expose stats on undelivered event bus events

Bugs closed: AURORA-1834

Reviewed at

5 years agoExpose stats on JobUpdateAction transitions
Mehrdad Nurolahzade [Tue, 27 Dec 2016 13:19:40 +0000 (14:19 +0100)] 
Expose stats on JobUpdateAction transitions

Introduced new stats that expose `JobUpdateAction` transitions.

Refactored away from `CachedCounters` for existing metric; it was dynamically
generating new String objects (through concatenation) per stats collection event.

Fixed a mistake in a previous changeset:
removed the unnecessary checked `Exception` on `CacheLoader.load()`.

Bugs closed: AURORA-1851

Reviewed at

5 years agoExpose timing stats on PendingTaskProcessor runs
Mehrdad Nurolahzade [Tue, 27 Dec 2016 11:49:58 +0000 (12:49 +0100)] 
Expose timing stats on PendingTaskProcessor runs

Bugs closed: AURORA-1857

Reviewed at

5 years agoUpdate to Mesos 1.1.0.
Stephan Erb [Tue, 27 Dec 2016 11:36:44 +0000 (12:36 +0100)] 
Update to Mesos 1.1.0.

Included changes:

* Handle new task states introduced in the latest Mesos release.
* Prevent a `NullPointerException` when inspecting an empty/invalid executor config in a test.
  Probably this is due to a change in the Mesos protobufs.
* Fix bug preventing the teardown of Vagrant boxes started by the egg build.
* Increase resources for the Mesos egg builds. The build for all distributions now takes 2h in total.

Full Mesos changelog:;a=blob_plain;f=CHANGELOG;hb=1.1.0

Bugs closed: AURORA-1813

Reviewed at

5 years agoExpose ResponseCode stats on Thrift server calls
Mehrdad Nurolahzade [Fri, 23 Dec 2016 10:08:29 +0000 (02:08 -0800)] 
Expose ResponseCode stats on Thrift server calls

Bugs closed: AURORA-1848

Reviewed at

5 years agoExpose stats on deleted tasks in TaskHistoryPruner
Mehrdad Nurolahzade [Fri, 23 Dec 2016 10:07:08 +0000 (02:07 -0800)] 
Expose stats on deleted tasks in TaskHistoryPruner

Bugs closed: AURORA-1855

Reviewed at

5 years agoAURORA-1842 Expose stats on garbage collected rows in RowGarbageCollector
Mehrdad Nurolahzade [Thu, 22 Dec 2016 10:55:51 +0000 (02:55 -0800)] 
AURORA-1842 Expose stats on garbage collected rows in RowGarbageCollector

Bugs closed: AURORA-1842

Reviewed at

5 years agoRemove ignored snapshot stats. Add high-level timings on storage start-up lifecycle.
David McLaughlin [Mon, 19 Dec 2016 18:43:10 +0000 (10:43 -0800)] 
Remove ignored snapshot stats. Add high-level timings on storage start-up lifecycle.

Reviewed at