aurora.git
2 months agoUpdating .auroraversion to release version 0.21.0. rel/0.21.0
Renan DelValle [Tue, 11 Sep 2018 03:57:15 +0000 (20:57 -0700)] 
Updating .auroraversion to release version 0.21.0.

2 months agoUpdating .auroraversion to 0.21.0-rc1. rel/0.21.0-rc1
Renan DelValle [Thu, 6 Sep 2018 20:56:09 +0000 (13:56 -0700)] 
Updating .auroraversion to 0.21.0-rc1.

2 months agoIncrementing snapshot version to 0.22.0-SNAPSHOT.
Renan DelValle [Thu, 6 Sep 2018 20:56:09 +0000 (13:56 -0700)] 
Incrementing snapshot version to 0.22.0-SNAPSHOT.

2 months agoUpdating CHANGELOG for 0.21.0 release.
Renan DelValle [Thu, 6 Sep 2018 20:56:09 +0000 (13:56 -0700)] 
Updating CHANGELOG for 0.21.0 release.

2 months agoChanging git repository location for release and release-candidate scripts.
Renan DelValle [Thu, 6 Sep 2018 20:54:36 +0000 (13:54 -0700)] 
Changing git repository location for release and release-candidate scripts.

2 months agoRevert "Updating CHANGELOG for 0.21.0 release."
Renan DelValle [Wed, 5 Sep 2018 18:40:06 +0000 (11:40 -0700)] 
Revert "Updating CHANGELOG for 0.21.0 release."

This reverts commit ed8a33b40778e76c7577437545572fe5bf28f611.

2 months agoRevert "Incrementing snapshot version to 0.22.0-SNAPSHOT."
Renan DelValle [Wed, 5 Sep 2018 18:39:37 +0000 (11:39 -0700)] 
Revert "Incrementing snapshot version to 0.22.0-SNAPSHOT."

This reverts commit d9495e52815837f6ff2740fb5c3db7ae9fe53972.

2 months agoFix sandbox permission errors with Mesos 1.6.0
Stephan Erb [Thu, 9 Aug 2018 10:34:06 +0000 (12:34 +0200)] 
Fix sandbox permission errors with Mesos 1.6.0

Mesos 1.6.0 creates sandboxes with permissions 750 rather than 755. This
breaks the assumption of Thermos that non-privileged processes can still
read the sandbox content.

This change is necessary even when container images are used via the
Mesos containerizer, as certain processes such as the stdout log
rotation continue to run outside of the image.
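
The difference between the two modes can be checked with a small sketch. It assumes a POSIX filesystem and uses a scratch directory in place of a real Mesos sandbox; `others_can_read` and `demo` are hypothetical helper names, not part of Thermos.

```python
import os
import stat
import tempfile


def others_can_read(path):
    """Return True if users outside the owning user and group may list and enter `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed


def demo():
    """Apply both permission sets to a scratch directory and record the outcome."""
    results = {}
    with tempfile.TemporaryDirectory() as sandbox:
        for mode in (0o755, 0o750):
            os.chmod(sandbox, mode)
            results[mode] = others_can_read(sandbox)
    return results
```

With 755 the "other" class keeps read and execute bits; with 750 it loses both, which is the case a non-privileged process outside the owning group runs into.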

2 months agoRefactor DirectorySandbox to take the Mesos sandbox path
Stephan Erb [Thu, 9 Aug 2018 10:33:22 +0000 (12:33 +0200)] 
Refactor DirectorySandbox to take the Mesos sandbox path

Rather than taking "/path/to/mesos_task_dir/sandbox" as a constructor
argument, the sandbox objects are now passed just "/path/to/mesos_task_dir"
as this simplifies a few operations that need the folder created by
Mesos.

The "sandbox" subfolder is still created with the same name, but this is
now only handled internally. There is no user-visible behaviour change.
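
A minimal sketch of the idea, assuming a simplified stand-in class (`SandboxSketch` is hypothetical and is not the real `DirectorySandbox`): the constructor receives the Mesos task directory and derives the "sandbox" subfolder itself.

```python
import os


class SandboxSketch:
    """Hypothetical stand-in that derives the sandbox path from the Mesos task directory."""

    SANDBOX_DIR_NAME = "sandbox"  # subfolder name taken from the commit message

    def __init__(self, mesos_dir):
        self._mesos_dir = mesos_dir

    @property
    def mesos_dir(self):
        """The folder created by Mesos for the task."""
        return self._mesos_dir

    @property
    def root(self):
        """The sandbox subfolder, now computed internally."""
        return os.path.join(self._mesos_dir, self.SANDBOX_DIR_NAME)
```

Callers that need the Mesos-created folder read `mesos_dir` directly instead of stripping the last path component from the sandbox path.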

2 months agoLog full stack traces when a Thermos process fails
Stephan Erb [Thu, 9 Aug 2018 10:09:41 +0000 (12:09 +0200)] 
Log full stack traces when a Thermos process fails

For example, rather than just a "Permission denied" one gets a full
stacktrace of which access or operation caused the issue.
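
The effect can be illustrated with Python's standard `logging` module. This is a sketch under that assumption, not the logging setup Thermos actually uses; `run_and_log` and `demo` are hypothetical names.

```python
import io
import logging


def run_and_log(action, log):
    """Run `action`; on failure, log the message together with the full traceback."""
    try:
        action()
        return True
    except Exception:
        # Logger.exception logs at ERROR level and appends the current traceback.
        log.exception("Process failed")
        return False


def demo():
    """Capture the log output of a failing action in memory."""
    stream = io.StringIO()
    log = logging.getLogger("thermos_sketch")
    log.setLevel(logging.ERROR)
    log.propagate = False
    handler = logging.StreamHandler(stream)
    log.addHandler(handler)

    def denied():
        raise PermissionError(13, "Permission denied")

    try:
        run_and_log(denied, log)
    finally:
        log.removeHandler(handler)
    return stream.getvalue()
```

The captured output contains the frames leading to the failing call, not only the final error message.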

2 months agoSwitch Thermos runner to simple disk log layout
Stephan Erb [Thu, 9 Aug 2018 10:07:09 +0000 (12:07 +0200)] 
Switch Thermos runner to simple disk log layout

This simplifies debugging of Thermos runner issues significantly, as
there is just one relevant log file to look at.

For further details, please see the prior discussion at
https://reviews.apache.org/r/58609/

2 months agoIncrementing snapshot version to 0.22.0-SNAPSHOT.
Renan DelValle [Tue, 4 Sep 2018 19:19:54 +0000 (12:19 -0700)] 
Incrementing snapshot version to 0.22.0-SNAPSHOT.

2 months agoUpdating CHANGELOG for 0.21.0 release.
Renan DelValle [Tue, 4 Sep 2018 19:19:53 +0000 (12:19 -0700)] 
Updating CHANGELOG for 0.21.0 release.

2 months agoRemoving MD5 check from verify release candidate script and release script (#36)
Renan DelValle [Tue, 4 Sep 2018 19:16:27 +0000 (12:16 -0700)] 
Removing MD5 check from verify release candidate script and release script (#36)

* Removing MD5 check from release verification script.

* Replacing MD5 with SHA-512 on release script.
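
For illustration, a SHA-512 check of a release artifact could look like the sketch below. It uses Python's `hashlib` rather than the shell tooling in the release scripts, and the file name and helper names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file against a published checksum, ignoring case and whitespace."""
    return sha512_of(path) == expected_hex.strip().lower()


def demo():
    """Check a scratch file against a correct and an incorrect checksum."""
    with tempfile.TemporaryDirectory() as scratch:
        artifact = os.path.join(scratch, "artifact.tar.gz")
        with open(artifact, "wb") as handle:
            handle.write(b"release contents")
        expected = hashlib.sha512(b"release contents").hexdigest()
        return matches(artifact, expected.upper() + "\n"), matches(artifact, "0" * 128)
```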

2 months agoRemoving MD5 hash from releases as required by new Apache policy.
Renan DelValle [Wed, 29 Aug 2018 23:39:07 +0000 (16:39 -0700)] 
Removing MD5 hash from releases as required by new Apache policy.

Changing location of checksum and type of checksum for release.

2 months agoFixup the pycharm setup script.
John Sirois [Tue, 21 Aug 2018 19:03:15 +0000 (13:03 -0600)] 
Fixup the pycharm setup script.

With modern PyCharm (2018+), virtualenv support for wheels and namespace
packages works, so kill the obsolete `--egg` in our pip installs. Also
leverage `./pants options` to grab the appropriate `pytest`
requirements.

2 months agoMerge pull request #33 from jordanly/jly/fix-npe-in-sla-aware-updates
David McLaughlin [Fri, 17 Aug 2018 05:49:52 +0000 (22:49 -0700)] 
Merge pull request #33 from jordanly/jly/fix-npe-in-sla-aware-updates

Fix possible NullPointerException in InstanceActionHandler

2 months agoBetter naming in e2e test 33/head
Jordan Ly [Thu, 16 Aug 2018 23:15:05 +0000 (16:15 -0700)] 
Better naming in e2e test

2 months agoFix possible NullPointerException in InstanceActionHandler, add e2e tests around...
Jordan Ly [Thu, 16 Aug 2018 22:30:02 +0000 (15:30 -0700)] 
Fix possible NullPointerException in InstanceActionHandler, add e2e tests around feature

3 months agoAdd Travis file for Aurora tests (#29)
Stephan Erb [Sat, 11 Aug 2018 19:18:26 +0000 (21:18 +0200)] 
Add Travis file for Aurora tests (#29)

3 months agoDeleting message discouraging PRs on github now that we are able to take them.
Renan DelValle [Wed, 1 Aug 2018 00:28:56 +0000 (17:28 -0700)] 
Deleting message discouraging PRs on github now that we are able to take them.

3 months agoUpdate URLS for git repositories.
Jordan Ly [Tue, 31 Jul 2018 23:04:56 +0000 (16:04 -0700)] 
Update URLS for git repositories.

Updating the references from the old git URL to the new gitbox one.

Reviewed at https://reviews.apache.org/r/68110/

3 months agoPrune updates that have no surviving job keys in the TaskStore.
David McLaughlin [Thu, 26 Jul 2018 23:50:12 +0000 (16:50 -0700)] 
Prune updates that have no surviving job keys in the TaskStore.

We are running into a situation where we have a lot of short-lived ad-hoc services launched and their updates are sticking around for 30 days, even though the tasks are garbage collected much sooner. This change picks up those updates and prunes them as soon as the tasks are gone.
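
The selection rule can be sketched as a set difference. This is an illustration only: the data shapes and the name `updates_to_prune` are assumptions, and the scheduler itself implements this in Java against its own stores.

```python
def updates_to_prune(update_job_keys, task_job_keys):
    """Return the IDs of updates whose job key has no remaining task in the task store.

    update_job_keys: mapping of update ID to job key (hypothetical shape).
    task_job_keys: iterable of job keys that still have at least one task.
    """
    surviving = set(task_job_keys)
    return sorted(
        update_id
        for update_id, job_key in update_job_keys.items()
        if job_key not in surviving
    )
```

For example, with updates `{"u1": "role/env/a", "u2": "role/env/b"}` and tasks remaining only for `role/env/a`, only `u2` is selected for pruning.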

Reviewed at https://reviews.apache.org/r/68071/

3 months agoAdd size metric for memory stores, add MemSchedulerStoreTest
Jordan Ly [Thu, 26 Jul 2018 23:01:16 +0000 (16:01 -0700)] 
Add size metric for memory stores, add MemSchedulerStoreTest

Currently, we only track the size metrics for:
- # of tasks via `task_store_index_(host|job)`
- # of crons via `mem_storage_cron_size`

I am hoping to add:
- # of attributes via `mem_storage_attributes_size`
- # of maintenance requests via `mem_storage_maintenance_size`
- # of job updates via `mem_storage_update_size`
- # of quotas via `mem_storage_quota_size`

This will help us track the growth of stores over time. Additionally, I added
a `MemSchedulerStoreTest` since one did not exist previously and nothing was
extending the abstract version of the test.

Reviewed at https://reviews.apache.org/r/68047/

3 months agoEnable SLA-aware updates
Jordan Ly [Thu, 19 Jul 2018 21:28:40 +0000 (14:28 -0700)] 
Enable SLA-aware updates

This patch enables SLA-aware updates.

Following https://reviews.apache.org/r/66716/, tasks may now specify custom SLA
policies that will be respected by the scheduler during maintenance. This patch
integrates into the same system to allow users to specify if they want their
updates to also respect SLA. Please see
https://docs.google.com/document/d/1lCoDyoX26qrGrptrgO7vJHqYR_L2CBRGFIywsAd8uQo/edit?usp=sharing
for a more detailed description.

This patch adds two optional Thrift fields, `slaAware` to `JobUpdateSettings`
and `message` to `JobInstanceUpdateEvent`. These should be forward and
backwards compatible.

Reviewed at https://reviews.apache.org/r/67696/

3 months agoUnhandled exception should not strand runner in STARTING state.
Santhosh Kumar Shanmugham [Wed, 18 Jul 2018 22:23:27 +0000 (15:23 -0700)] 
Unhandled exception should not strand runner in STARTING state.

If the ThermosTaskRunner encounters an Exception when trying to
fork the process, it bubbles this up to the Executor, which does
not handle exceptions other than TaskError. This leads to the
executor leaving the task in STARTING state and we end up with
tasks that get stranded in this state.

Fix it so that any unknown exception that is thrown when starting
a runner leads to task failure and the task gets marked as FAILED.
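
The control flow can be sketched as follows. This is a simplified illustration, not the executor's code: the `TaskError` class here is a local stand-in, and `start_runner` and its return values are hypothetical.

```python
class TaskError(Exception):
    """Local stand-in for the error type the executor already handled."""


def start_runner(start):
    """Call `start` and map every failure to a terminal state.

    Returns a (state, reason) tuple. Before the fix described above, only the
    TaskError branch existed, so other exceptions left the task in STARTING.
    """
    try:
        start()
    except TaskError as e:
        return ("FAILED", "Task initialization failed: %s" % e)
    except Exception as e:
        # Catch-all so that an unexpected error cannot strand the task.
        return ("FAILED", "Unexpected error while starting runner: %s" % e)
    return ("RUNNING", None)
```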

Testing Done:
./gradlew test
./pants test src/test/python/apache::

Reviewed at https://reviews.apache.org/r/67967/

3 months agoTaskQuery struct needs to be optional
Ezequiel Torres [Tue, 17 Jul 2018 08:40:04 +0000 (10:40 +0200)] 
TaskQuery struct needs to be optional

In languages like Go, types are not optional by default.
The current api.thrift does not allow creating queries with just a
few fields in Go, since all the fields are required.

Bugs closed: https://issues.apache.org/jira/browse/AURORA-1991

Reviewed at https://reviews.apache.org/r/67757/

4 months agoUpdated restore instructions to reflect using offline rehydration tool.
Renan DelValle [Fri, 29 Jun 2018 22:36:03 +0000 (15:36 -0700)] 
Updated restore instructions to reflect using offline rehydration tool.

Rewrote the instructions for recovering from backup based upon using Bill's tool to recover with all instances offline.

Reviewed at https://reviews.apache.org/r/67705/

4 months agoDisplay negation of constraint in TaskConfigSummary.
Santhosh Kumar Shanmugham [Tue, 26 Jun 2018 00:21:42 +0000 (17:21 -0700)] 
Display negation of constraint in TaskConfigSummary.

Testing Done:
./gradlew test

Reviewed at https://reviews.apache.org/r/67734/

4 months agoFix style of TaskConfigSummary.
Santhosh Kumar [Fri, 22 Jun 2018 19:54:08 +0000 (12:54 -0700)] 
Fix style of TaskConfigSummary.

4 months agoIntroduce a `countdown-ms` param in Coordinator request.
Santhosh Kumar Shanmugham [Thu, 21 Jun 2018 00:27:48 +0000 (17:27 -0700)] 
Introduce a `countdown-ms` param in Coordinator request.

With the introduction of `timeoutSecs` for HostMaintenanceRequest
and the `CoordinatorSlaPolicy`, it will be beneficial to expose the
time remaining until forced maintenance to the Coordinator. Send
the time remaining until force task maintenance as an extra query
param to the Coordinator.

Testing Done:
./gradlew test
./build-support/jenkins/build.sh

**Tested on Vagrant**

***Logs from Coordinator***
Request received for {'task': ['devcluster/vagrant/test/coordinator/0']}
{
  "forceMaintenanceCountdownMs": "604755646",
  "task": "devcluster/vagrant/test/coordinator/0",
  "taskConfig": {
    "assignedTask": {
      "assignedPorts": {},
      "instanceId": 0,
      "slaveHost": "192.168.33.7",
      "slaveId": "f0336813-864b-4c8f-914c-80f8cef3b61d-S0",
      "task": {
      ...<SNIPPED>
}
Responded: True

Reviewed at https://reviews.apache.org/r/67657/

4 months agoExport count-down to forceful Maintenance as a metric.
Santhosh Kumar Shanmugham [Tue, 19 Jun 2018 17:31:50 +0000 (10:31 -0700)] 
Export count-down to forceful Maintenance as a metric.

Since the scheduler enforces a maximum timeout on each
maintenance request and we now allow CoordinatorSlaPolicy
to block maintenance, we need to know which tasks are
running into the force maintenance timeout. Export maintenance
count down time as a metric broken down by task keys.
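
A sketch of how such per-task values could be derived. The metric prefix is taken from the sample output in this commit; the input shape, the function name, and the clamping at zero are assumptions made for illustration.

```python
def countdown_metrics(deadlines_ms, now_ms, prefix="maintenance_countdown_ms_"):
    """Map each task key to the milliseconds left before forced maintenance.

    deadlines_ms: mapping of task key to the absolute deadline in milliseconds
    (hypothetical shape). Values are clamped at 0 once the deadline has passed.
    """
    return {
        prefix + task_key: max(0, deadline - now_ms)
        for task_key, deadline in deadlines_ms.items()
    }
```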

Testing Done:
./gradlew test

**Tested in Vagrant**
sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
######################################################################## 100.0%
maintenance_countdown_ms_vagrant/test/coordinator/0 264523
maintenance_countdown_ms_vagrant/test/coordinator/1 24476
sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
######################################################################## 100.0%
maintenance_countdown_ms_vagrant/test/coordinator/0 264523
maintenance_countdown_ms_vagrant/test/coordinator/1 24476
sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
######################################################################## 100.0%
maintenance_countdown_ms_vagrant/test/coordinator/0 264523
maintenance_countdown_ms_vagrant/test/coordinator/1 0

Reviewed at https://reviews.apache.org/r/67639/

4 months agoExport number of tasks lost per dedicated role.
Santhosh Kumar Shanmugham [Mon, 18 Jun 2018 23:39:34 +0000 (16:39 -0700)] 
Export number of tasks lost per dedicated role.

When there are 100s of dedicated roles in a cluster,
the task_LOST_<job> metric is not enough. Introduce a
per-dedicated-role metric for easier diagnosis.

Testing Done:
./gradlew test

**Tested on Vagrant**
tasks_lost_dedicated____web.multi 0
tasks_lost_dedicated_vagrant 2

Reviewed at https://reviews.apache.org/r/67638/

4 months agoUpdate to Mesos 1.5
Stephan Erb [Sat, 16 Jun 2018 03:02:44 +0000 (20:02 -0700)] 
Update to Mesos 1.5

This is an upgrade to Mesos 1.5.0. While 1.5.1 is already released, there are no Debian
packages and jars available yet.

The mesos.interface Python package has a requirement on a newer protobuf version.
I applied the same update to Java for consistency.

* Mesos 1.5 changelog: https://mesos.apache.org/blog/mesos-1-5-0-released/
* Mesos update instructions: https://mesos.apache.org/documentation/latest/upgrades/#upgrading-from-1-4-x-to-1-5-x
* Protobuf changelog: https://github.com/google/protobuf/blob/3.5.x/CHANGES.txt

Reviewed at https://reviews.apache.org/r/67584/

4 months agoClose AsyncHttpClient on scheduler shutdown.
Santhosh Kumar Shanmugham [Fri, 15 Jun 2018 22:08:22 +0000 (15:08 -0700)] 
Close AsyncHttpClient on scheduler shutdown.

Convert SlaManager into an AbstractIdleService and explicitly
close the AsyncHttpClient on scheduler shutdown. Otherwise
we run the risk of having a stuck scheduler JVM that is unable
to shut down due to any of the remaining non-daemon http client
threads.

Testing Done:
./gradlew test

**Tested in vagrant:**
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.456 [BlockingDriverJoin, StateMachine] SchedulerLifecycle state machine transition DEAD -> DEAD
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.457 [BlockingDriverJoin, SchedulerLifecycle] Shutdown already invoked, ignoring extra call.
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.458 [TearDownShutdownRegistry STOPPING, StateMachine] storage state machine transition READY -> STOPPED
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.459 [TearDownShutdownRegistry STOPPING, Lifecycle] Shutting down application
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.459 [TearDownShutdownRegistry STOPPING, ShutdownRegistry$ShutdownRegistryImpl] Action controller has already completed, subsequent calls ignored.
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.461 [main, SchedulerMain] Stopping scheduler services.
**Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.470 [SlaManager$$EnhancerByGuice$$40d3047 STOPPING, SlaManager] Shutting down SlaManager async http client.**
Jun 15 20:48:53 aurora aurora-scheduler[8719]: I0615 20:48:53.475 [CronLifecycle STOPPING, CronLifecycle] Shutting down Quartz cron scheduler.
...
Jun 15 20:48:56 aurora aurora-scheduler[8719]: I0615 20:48:56.167 [main, SchedulerMain] Application run() exited.

Bugs closed: AURORA-1990

Reviewed at https://reviews.apache.org/r/67613/

4 months agoSpeedup regular Thermos observer checkpoint refresh
Stephan Erb [Thu, 14 Jun 2018 13:02:48 +0000 (13:02 +0000)] 
Speedup regular Thermos observer checkpoint refresh

Profiling indicates that a significant part of the refresh time is spent in `os.path.realpath`.
This was introduced in https://reviews.apache.org/r/35580/ to properly handle the `latest`
symlink in the Mesos folder layout.

This patch takes a slightly different approach to solve this problem based on `os.path.islink`.
The latter is faster as it just needs to look at a single folder rather than an entire path.
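
The `os.path.islink` approach can be sketched as below. This is not the observer's code: the directory layout, `run_dirs`, and `demo` are assumptions chosen to show a `latest` symlink being skipped without resolving full paths.

```python
import os
import tempfile


def run_dirs(parent):
    """List real run directories under `parent`, skipping symlinks such as `latest`."""
    names = []
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        if os.path.islink(path):
            continue  # an alias for a run that is already listed under its own name
        if os.path.isdir(path):
            names.append(name)
    return names


def demo():
    """Build two run folders plus a `latest` symlink and list them."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "run-1"))
        os.mkdir(os.path.join(root, "run-2"))
        os.symlink(os.path.join(root, "run-2"), os.path.join(root, "latest"))
        return run_dirs(root)
```

`os.path.islink` inspects only the final path component, whereas `os.path.realpath` has to examine every component of the path, which is where the saving comes from.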

Testing Done:
I have tested this build on a node with 55 running tasks and 2004 finished ones.

Before this patch:

    D0320 22:20:44.887248 25771 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.92s
    D0320 22:20:50.746316 25771 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.93s
    D0320 22:20:56.590157 25771 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.89s

With this patch:

    D0320 22:18:53.545236 16250 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.48s
    D0320 22:18:59.031919 16250 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.49s
    D0320 22:19:04.512358 16250 task_observer.py:142] TaskObserver: finished checkpoint refresh in 0.48s

Reviewed at https://reviews.apache.org/r/66139/

4 months agoRemove resource properties from ResourceAggregate
Jing Chen [Thu, 14 Jun 2018 11:35:50 +0000 (13:35 +0200)] 
Remove resource properties from ResourceAggregate

Bugs closed: AURORA-1975

Reviewed at https://reviews.apache.org/r/67077/

4 months agoUpdate Pants to 1.6.0 and Virtualenv to 16.0.0
Stephan Erb [Wed, 13 Jun 2018 19:50:27 +0000 (19:50 +0000)] 
Update Pants to 1.6.0 and Virtualenv to 16.0.0

Beyond a regular version bump, this fixes the build on older versions of MacOS.

Testing Done:
./build-support/jenkins/build.sh

Reviewed at https://reviews.apache.org/r/67326/

5 months agoRemove maintenance request after a host is drained.
Santhosh Kumar Shanmugham [Thu, 7 Jun 2018 00:16:11 +0000 (17:16 -0700)] 
Remove maintenance request after a host is drained.

Delete the `HostMaintenanceRequest` once the host has been
`DRAINED`.

Testing Done:
./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Reviewed at https://reviews.apache.org/r/67479/

5 months agoEnable `Tasks` to specify their own custom maintenance SLA.
Santhosh Kumar Shanmugham [Tue, 5 Jun 2018 23:15:52 +0000 (16:15 -0700)] 
Enable `Tasks` to specify their own custom maintenance SLA.

`Tasks` can specify custom SLA requirements as part of
their `TaskConfig`. One of the new features is the ability
to specify an external coordinator that can ACK/NACK
maintenance requests for tasks. This will be hugely
beneficial for onboarding services that cannot satisfactorily
specify SLA in terms of running instances.

Maintenance requests are driven from the Scheduler to
improve management of nodes in the cluster.

Testing Done:
./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Bugs closed: AURORA-1978

Reviewed at https://reviews.apache.org/r/66716/

5 months agoIntroduce structs to enable specifying custom SLA.
Santhosh Kumar Shanmugham [Mon, 21 May 2018 23:40:16 +0000 (16:40 -0700)] 
Introduce structs to enable specifying custom SLA.

Add `SlaPolicy` and `HostMaintenanceRequest` structs
to the thrift definition and introduce a new `HostMaintenanceStore`
for tracking maintenance requests. These changes will be used in
https://reviews.apache.org/r/66716 for implementing custom SLA
and scheduler driven maintenance.

This RB splits the storage related changes from https://reviews.apache.org/r/66716
for a better rollback story.

Tested rollback on Vagrant.

Testing Done:
./build-support/jenkins/build.sh

Bugs closed: AURORA-1977

Reviewed at https://reviews.apache.org/r/67141/

5 months agoFix flaky Webhook test by ensuring proper error condition
Jordan Ly [Mon, 21 May 2018 21:16:49 +0000 (14:16 -0700)] 
Fix flaky Webhook test by ensuring proper error condition

Attempt #3 at fixing the flaky Webhook test once and for all.

Previously, I was testing the error condition by hitting a bad url with a port
of -1. I believe this was erroneous (I am assuming the -1 overflowed into
a valid port). Additionally, there was a timing associated with the test which
could make it flaky as well.

I ensured that the test hit a bad host url and removed the timing for a more
deterministic test.

Reviewed at https://reviews.apache.org/r/67219/

6 months agoaurora update info command should print out update metadata
Jing Chen [Mon, 7 May 2018 18:49:31 +0000 (11:49 -0700)] 
aurora update info command should print out update metadata
* Metadata is represented as a list of key-value pairs

Bugs closed: AURORA-1906

Reviewed at https://reviews.apache.org/r/66980/

6 months agoChanging Vagrant requirements to latest version for launching our local dev box.
Renan DelValle [Fri, 4 May 2018 17:47:49 +0000 (10:47 -0700)] 
Changing Vagrant requirements to latest version for launching our local dev box.

Reviewed at https://reviews.apache.org/r/66922/

6 months agoBreakdown resource stats by role
David McLaughlin [Thu, 26 Apr 2018 23:51:23 +0000 (16:51 -0700)] 
Breakdown resource stats by role

Currently Aurora exports total quota and resource reservation over time. This can be very useful to see changes in trends of production and free tier capacity. One challenge (particularly in a self-serve capacity environment) is identifying and tracking where large deltas came from. This change exports both quota and resource usage per role to help with this.

Reviewed at https://reviews.apache.org/r/66806/

6 months agoUpgrade to psutil with optimized Process.children()
Stephan Erb [Sun, 22 Apr 2018 19:53:44 +0000 (21:53 +0200)] 
Upgrade to psutil with optimized Process.children()

The changelog claims: `Process.children() is 2x faster on UNIX and 2.4x faster
on Linux.`

This is needed for all stats retrieved via `ProcessTreeCollector`. An update
therefore seems worthwhile.

https://github.com/giampaolo/psutil/blob/master/HISTORY.rst

Reviewed at https://reviews.apache.org/r/66186/

6 months agoFix the json endpoints in thermos
Reza Motamedi [Fri, 20 Apr 2018 20:29:24 +0000 (13:29 -0700)] 
Fix the json endpoints in thermos

# Fixing the json endpoints in thermos

`TaskObserverJSONBindings` is a mixin that includes a few routes that serve info about tasks and processes in pure JSON format. The functions are overridden in the main bottle server, so the routes are not accessible. This patch fixes it by renaming those methods.

Check here:
https://github.com/apache/aurora/blob/master/src/main/python/apache/thermos/observer/http/http_observer.py#L72

Testing Done:
There was no unit test affected.

After the fix, the routes serve the expected content.
```
? curl http://192.168.33.7:1338/j/task_ids
{"type": "all", "tasks": [{"status": "sleeping", "ram": 3727360, "state_timestamp": 1523728477, "threads": 2, "user": 0.24, "disk": 10117120, "launch_timestamp": 1523728477, "vms": 22990848, "rss": 3727360, "name": "hello", "task_id": "www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098", "system": 0.34, "ports": {}, "state": "ACTIVE", "role": "www-data", "cpu": 0.0, "nice": 0}], "num": 20, "task_count": 1, "offset": 0}%

? curl http://192.168.33.7:1338/j/task/www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098
{"www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098": {"task": {"processes": [{"daemon": false, "name": "hello", "max_failures": 1, "ephemeral": false, "min_duration": 5, "cmdline": "\n    while true; do\n      echo hello world\n      sleep 10\n    done\n  ", "final": false}], "name": "hello", "finalization_wait": 30, "max_failures": 1, "max_concurrency": 0, "resources": {"gpu": 0, "disk": 134217728, "ram": 134217728, "cpu": 1.0}, "constraints": [{"order": ["hello"]}]}, "name": "hello", "task_id": "www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098", "processes": {"failed": [], "running": ["hello"], "killed": [], "success": [], "waiting": []}, "state_timestamp": 1523728477, "state": "ACTIVE", "resource_consumption": {"status": "sleeping", "disk": 10113024, "ram": 3719168, "system": 0.33, "vms": 22990848, "threads": 2, "user": 0.24, "rss": 3719168, "cpu": 0.0, "nice": 0}, "user": "www-data", "launch_timestamp": 1523728477, "ports": {}}}%

? curl http://192.168.33.7:1338/j/task\?task_id\=www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098
{"www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098": {"task": {"processes": [{"daemon": false, "name": "hello", "max_failures": 1, "ephemeral": false, "min_duration": 5, "cmdline": "\n    while true; do\n      echo hello world\n      sleep 10\n    done\n  ", "final": false}], "name": "hello", "finalization_wait": 30, "max_failures": 1, "max_concurrency": 0, "resources": {"gpu": 0, "disk": 134217728, "ram": 134217728, "cpu": 1.0}, "constraints": [{"order": ["hello"]}]}, "name": "hello", "task_id": "www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098", "processes": {"failed": [], "running": ["hello"], "killed": [], "success": [], "waiting": []}, "state_timestamp": 1523728477, "state": "ACTIVE", "resource_consumption": {"status": "sleeping", "disk": 10141696, "ram": 3731456, "system": 0.35, "vms": 22994944, "threads": 2, "user": 0.24, "rss": 3731456, "cpu": 0.0, "nice": 0}, "user": "www-data", "launch_timestamp": 1523728477, "ports": {}}}%

? curl http://192.168.33.7:1338/j/process/www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098/hello/0
{"state": "RUNNING", "process_name": "hello", "used": {"status": "sleeping", "ram": 3735552, "system": 0.34, "vms": 22990848, "threads": 2, "user": 0.24, "rss": 3735552, "cpu": 0.0, "nice": 0}, "start_time": 1523728477.867429, "process_run": 0}%

? curl http://192.168.33.7:1338/j/processes\?task_id\=www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098
{"www-data-prod-hello-0-00e58d09-a67f-4a46-94a0-15bcad26a098": {"hello": {"state": "RUNNING", "process_name": "hello", "used": {"status": "sleeping", "ram": 3735552, "system": 0.35, "vms": 22994944, "threads": 2, "user": 0.25, "rss": 3735552, "cpu": 0.0005000061512750167, "nice": 0}, "start_time": 1523728477.867429, "process_run": 0}}}%
```

Reviewed at https://reviews.apache.org/r/66623/

6 months agoAdd --pid-file flag to `aurora task ssh` to write the PID of the underlying SSH comma...
Sameer Brenn [Fri, 20 Apr 2018 20:26:08 +0000 (13:26 -0700)] 
Add --pid-file flag to `aurora task ssh` to write the PID of the underlying SSH command to a specified file.

My team has some scripts to start devel shards which create tunnels:

```
aurora task ssh -L 8002:http --ssh-options "-f -N" "$DC/$USER/devel/proxyapp/0"
aurora task ssh -L 9002:health --ssh-options "-f -N" "$DC/$USER/devel/proxyapp/0"
```

We use fixed local port numbers because that way we can run dependent services locally that look for locally-running copies of the
same service on a fixed port, but then those requests get tunnelled through to the devel shard.

When the devel shard is restarted, however, the tunnel is still running so the subsequent call to create a new tunnel fails because
it can't bind to the fixed port.

If we save the SSH process PID to a file, we can then kill the existing tunnel to the old instance before starting up the new tunnel to the
new instance.
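
A wrapper script could use the PID file along these lines. This is a sketch assuming a POSIX system; the helper names are hypothetical, and a short-lived Python child process stands in for the SSH tunnel.

```python
import os
import signal
import subprocess
import sys
import tempfile


def kill_from_pid_file(pid_file, sig=signal.SIGTERM):
    """Signal the process recorded in `pid_file`; return True if a signal was sent."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # no file, unreadable file, or no PID inside
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False  # the recorded process is already gone
    return True


def demo():
    """Start a stand-in for the tunnel, record its PID, then stop it via the file."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    with tempfile.TemporaryDirectory() as scratch:
        pid_file = os.path.join(scratch, "tunnel.pid")
        with open(pid_file, "w") as handle:
            handle.write(str(child.pid))
        sent = kill_from_pid_file(pid_file)
        missing = kill_from_pid_file(os.path.join(scratch, "absent.pid"))
    return sent, missing, child.wait(timeout=10)
```

A stale PID file may point at an unrelated process if the PID has been reused, so a real script would likely want to verify the command line of the target before signalling it.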

Testing Done:
```
$ ./pants test src/test/python/apache/aurora/client::
```

And when applying the same patch to our local repo at Twitter:

```
$ ./pants run twitter/src/main/python/twitter/aurora/client/cli_internal:aurora_internal -- task ssh -L 8005:http --ssh-options "-n -N" --pid-file /tmp/p "smf1/sbrenn/devel/proxyapp/0" &
$ ps -p `cat /tmp/p`
  PID TTY           TIME CMD
34729 ttys000    0:00.05 ssh -t -n -N -L 8005:smf1-aki-27-sr1.prod.twitter.com:31794 sbrenn@smf1-aki-27-sr1.prod.twitter.com cd /var/lib/mesos/slaves/*/frameworks/*/exec
```

Reviewed at https://reviews.apache.org/r/66697/

7 months agoAdd initial interval before searching for preemption slots
Jordan Ly [Thu, 12 Apr 2018 20:28:48 +0000 (13:28 -0700)] 
Add initial interval before searching for preemption slots

Between failovers, tasks that normally would not require preemption could be in
a PENDING state for an extended period of time and become eligible for
preemption. Thus, when the scheduler starts, offers may not have been
processed yet and the tasks can preempt other tasks needlessly.

Added an initial delay to preemption slot searching on scheduler startup so
PENDING tasks have a chance to be scheduled before preempting.
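
The gating logic amounts to a single time comparison, sketched here with a hypothetical `PreemptionGate` class and caller-supplied timestamps; the scheduler's actual implementation is in Java and is not shown in this commit message.

```python
class PreemptionGate:
    """Blocks preemption slot searches until an initial delay has elapsed."""

    def __init__(self, started_at_ms, initial_delay_ms):
        # Earliest time at which a slot search is allowed.
        self._ready_at_ms = started_at_ms + initial_delay_ms

    def may_search(self, now_ms):
        """Return True once the initial delay after startup has passed."""
        return now_ms >= self._ready_at_ms
```

Passing the clock value in as an argument keeps the check deterministic in tests.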

Reviewed at https://reviews.apache.org/r/66573/

7 months agoRemove flaky test/assertion in PendingTaskProcessorTest
Jordan Ly [Wed, 11 Apr 2018 23:46:59 +0000 (16:46 -0700)] 
Remove flaky test/assertion in PendingTaskProcessorTest

I realized I added a flaky assertion in `PendingTaskProcessorTest` in
https://reviews.apache.org/r/66536/

I got extremely unlucky and every time I ran the tests it passed until after
I merged :( The stat `preemptor_slot_search_[success|failed]_for_[name]` will
not appear unless the job slot search actually succeeds or fails (i.e. it
cannot be 0 since it is dynamically generated). We were getting lucky where the
test would search for JOB_A slots first and create the stat. However, when
JOB_B gets searched first, the JOB_A stat is never created because there are no
slaves to search through anymore.

I removed the assertion because there is a sufficient assertion directly above,
and the stat is tested in multiple other tests.

The assertion would result in a `NullPointerException`.
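
The failure mode can be reproduced with a toy registry in which counters only exist after their first increment. `LazyStats` and the job names are hypothetical; the stat name pattern follows the one quoted above.

```python
class LazyStats:
    """Toy stats registry whose counters are created on first increment."""

    def __init__(self):
        self._counters = {}

    def increment(self, name):
        self._counters[name] = self._counters.get(name, 0) + 1

    def get(self, name):
        # Returns None, not 0, for a stat that was never incremented.
        return self._counters.get(name)


def search_order_demo():
    """Only the job searched first records a stat, as in the flaky test."""
    stats = LazyStats()
    stats.increment("preemptor_slot_search_failed_for_job_b")
    return (
        stats.get("preemptor_slot_search_failed_for_job_b"),
        stats.get("preemptor_slot_search_failed_for_job_a"),
    )
```

Reading the second value as a number is the Python analogue of the `NullPointerException` in the Java test.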

Reviewed at https://reviews.apache.org/r/66570/

7 months agoAdd more preemption metrics (jobs preempted, preemptors) and logging statements
Jordan Ly [Wed, 11 Apr 2018 21:41:52 +0000 (14:41 -0700)] 
Add more preemption metrics (jobs preempted, preemptors) and logging statements

Added additional metrics:
```
1. preemptor_tasks_preempted_[JOB_NAME] - The number of times [JOB_NAME] has
   been preempted for another task.
2. preemptor_tasks_preemptor_[JOB_NAME] - The number of times [JOB_NAME] has
   preempted another task.
3. preemptor_slot_search_[success|failed]_for_[JOB_NAME] - The number of times
   [JOB_NAME] has or hasn't found a slot for preemption.
4. preemptor_slot_validation_[success|failed]_for_[JOB_NAME] - The number of
   times [JOB_NAME] succeeded to or failed to validate a slot before
   preemption.
```

Additionally, added some `LOG.info` statements for better visibility into
preemption/preemption slot finding.

Did a little bit of code refactoring as well.

Reviewed at https://reviews.apache.org/r/66536/

7 months agoChanging default reviewers as well as reflecting the new address of gorealis.
Renan DelValle [Sun, 8 Apr 2018 00:01:56 +0000 (17:01 -0700)] 
Changing default reviewers as well as reflecting the new address of gorealis.

Thank you Joshua and Zameer for previously taking up this role.

Reviewed at https://reviews.apache.org/r/66491/

7 months agoIncrementing snapshot version to 0.21.0-SNAPSHOT.
Renan DelValle [Wed, 28 Mar 2018 19:20:05 +0000 (12:20 -0700)] 
Incrementing snapshot version to 0.21.0-SNAPSHOT.

7 months agoUpdating CHANGELOG for 0.20.0 release.
Renan DelValle [Wed, 28 Mar 2018 19:20:05 +0000 (12:20 -0700)] 
Updating CHANGELOG for 0.20.0 release.

7 months agoPreparing RELEASE-NOTES.md for release.
Renan DelValle [Wed, 28 Mar 2018 19:15:39 +0000 (12:15 -0700)] 
Preparing RELEASE-NOTES.md for release.

7 months agoReverting changes done by release script due to failed
Renan DelValle [Tue, 27 Mar 2018 21:18:22 +0000 (14:18 -0700)] 
Reverting changes done by release script due to failed
vote for 0.20.0 RC0

Revert "Preparing RELEASE-NOTES.md for release."

This reverts commit a60367626e71786fa7a23a49510a437986a9a074.

Revert "Updating CHANGELOG for 0.20.0 release."

This reverts commit 5a26413f72b8a428f286bed286aa4612c4123884.

Revert "Incrementing snapshot version to 0.21.0-SNAPSHOT."

This reverts commit a12b84444e0ba4227e6a41b0f7e82045b2dcc016.

7 months agoEnd to end tests bugfix
Renan DelValle [Tue, 27 Mar 2018 21:12:16 +0000 (14:12 -0700)] 
End to end tests bugfix

* Fixing kerberos end to end test. Previous version had its signing key revoked, resulting in the test failing.

* Excluding kerberos unit file from being copied on provision as it's later copied and deleted by the end to end test.

* Bypass leader redirect changed from upstart to systemd. This test wasn't being run because the kerberos test was failing.

* Changing docker image to slim-stretch in docker aurora tests to address AURORA-1974.

* Added daemon-reload to aurorabuild whenever the daemons are restarted.

Bugs closed: AURORA-1974

Reviewed at https://reviews.apache.org/r/66269/

7 months agoIntroduce mesos disk collector
Reza Motamedi [Mon, 26 Mar 2018 20:47:13 +0000 (13:47 -0700)] 
Introduce mesos disk collector

When disk isolation is enabled in a Mesos agent it calculates the disk usage for each container.
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, essentially repeating the work already done by the agent. In practice, we see that disk monitoring is one of the most expensive resource monitoring tasks. For instance, when there are deeply nested directories, the CPU utilization of the observer process can easily reach 1.5 CPUs. It would be ideal if we delegate the disk monitoring task to the agent and do it only once. With this approach, when disk collection has improved in the agent (for instance by implementing XFS isolation), we can simply benefit from it without any code change. Some more information about the problem is provided in AURORA-1918.

This patch introduces `MesosDiskCollector`, which queries the agent's API endpoint to look up disk_used_bytes. Note that there is also resource monitoring in the thermos executor. For now, I left the disk collector there on the `du` implementation. That can be changed in a later patch.
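As a rough illustration of the lookup described above, the sketch below pulls `disk_used_bytes` for one executor out of an already-fetched agent response. The function name, the sample payload, and the assumed field layout (a list of containers, each with an `executor_id` and a `statistics` object) are illustrative assumptions, not the code in this patch.

```python
import json


def disk_used_bytes(containers_payload, executor_id):
    """Return disk_used_bytes for the given executor, or None if absent.

    Hypothetical helper: assumes the payload is a list of container dicts,
    each carrying "executor_id" and a "statistics" dict.
    """
    for container in containers_payload:
        if container.get("executor_id") == executor_id:
            return container.get("statistics", {}).get("disk_used_bytes")
    return None


# Made-up sample data standing in for a response from the agent.
SAMPLE = json.loads("""
[
  {"executor_id": "thermos-task-a", "statistics": {"disk_used_bytes": 1024}},
  {"executor_id": "thermos-task-b", "statistics": {}}
]
""")

print(disk_used_bytes(SAMPLE, "thermos-task-a"))
```

Returning `None` for a missing container or a missing statistic lets a caller decide whether to fall back to a `du`-style scan.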

I modified some vagrant config files including `aurora-executor.service` and `etc_mesos-slave/isolation` for testing. They can be left as is. I included them in this patch to show how this would work e2e.

Testing Done:
- I added unit tests.
- Tested in vagrant and it works as intended.
- I also built and deployed in our test environment. In order to measure improved performance I created jobs with nested folders and noticed a reduction in CPU utilization of the Observer process, by at least 60%. (1.5 CPU cores to 0.4 CPU cores)

Here is one specific test setup: On two hosts I created two tasks. Each task creates identical nested directory structures and files in them. The overall size is 30GB. test_host_1 runs the current version of observer and test_host_2 runs Observer with this patch and also has mesos_disk_collection enabled. The results are as follows:

```
rezam[7]TEST_HOST_1 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:17 UTC 2018
observer.observer_cpu 108.9
Thu Mar 22 04:36:27 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:38 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:48 UTC 2018
observer.observer_cpu 123.2
Thu Mar 22 04:36:58 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:08 UTC 2018
observer.observer_cpu 111.0
Thu Mar 22 04:37:18 UTC 2018
observer.observer_cpu 111.0

rezam[7]TEST_HOST_2 ~ $ while true; do echo `date`; curl localhost:1338/vars -s | grep cpu; sleep 10; done
Thu Mar 22 04:36:20 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:30 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:40 UTC 2018
observer.observer_cpu 1.3
Thu Mar 22 04:36:50 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:00 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:10 UTC 2018
observer.observer_cpu 1.2
Thu Mar 22 04:37:20 UTC 2018
observer.observer_cpu 1.8
```

Reviewed at https://reviews.apache.org/r/66103/

7 months agoFix 'PreemptorSlotSearchBenchmark', remove 'isProduction' references in benchmark
Jordan Ly [Fri, 23 Mar 2018 20:30:41 +0000 (13:30 -0700)] 
Fix 'PreemptorSlotSearchBenchmark', remove 'isProduction' references in benchmark

This benchmark was using the deprecated `production` flag when building
the tasks for the cluster. `PendingTaskProcessor` depends on `tier`
instead, so this benchmark ended up not testing the correct codepath.

Removed references to `production` and added `tier` instead.
Additionally, removed some unused options.

Reviewed at https://reviews.apache.org/r/66190/

7 months agoRemove unused LOST_LOCK_MESSAGE variable in JobUpdateControllerImpl
Jordan Ly [Fri, 23 Mar 2018 17:13:30 +0000 (10:13 -0700)] 
Remove unused LOST_LOCK_MESSAGE variable in JobUpdateControllerImpl

We no longer use locks for updates (context: https://reviews.apache.org/r/63130/). This was a legacy variable.

Reviewed at https://reviews.apache.org/r/66199/

7 months agoAdding support for using custom executors via the Aurora DSL
Renan DelValle [Wed, 21 Mar 2018 01:26:49 +0000 (18:26 -0700)] 
Adding support for using custom executors via the Aurora DSL

Bugs closed: AURORA-1981

Reviewed at https://reviews.apache.org/r/66154/

7 months agoSwitch Thermos to lazy log formatting
Stephan Erb [Mon, 19 Mar 2018 09:10:03 +0000 (09:10 +0000)] 
Switch Thermos to lazy log formatting

This is the first part of a small series of Thermos observer performance
improvements.

As a first iteration, this switches all logging to use the logger-embedded
formatting rather than formatting eagerly up front. This has the advantage that
we produce less garbage if debug logging is disabled.
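A minimal sketch of the difference, using the standard library `logging` module as a stand-in for the Thermos logger (the `Expensive` class and the logger name are made up for illustration):

```python
import logging


class Expensive:
    """Counts how often it is rendered to a string."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "expensive-value"


def demo_lazy_logging():
    log = logging.getLogger("lazy_logging_demo")
    log.setLevel(logging.INFO)  # debug records are discarded
    log.propagate = False
    log.addHandler(logging.NullHandler())

    eager, lazy = Expensive(), Expensive()
    # Eager: the message string is built before the logger can drop it.
    log.debug("state: %s" % eager)
    # Lazy: the logger formats the arguments only if the record is emitted.
    log.debug("state: %s", lazy)
    return eager.renders, lazy.renders


print(demo_lazy_logging())
```

With debug logging disabled, only the eager call pays for rendering its argument.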

Reviewed at https://reviews.apache.org/r/66136/

7 months agoPersist scheduler/observer logs to /var/log/aurora/[FILE].log
Jordan Ly [Mon, 19 Mar 2018 20:58:29 +0000 (13:58 -0700)] 
Persist scheduler/observer logs to /var/log/aurora/[FILE].log

`journalctl -u aurora-[executor|scheduler]` still works.

Reviewed at https://reviews.apache.org/r/65896/

7 months agoRemove unused module in RecoveryTool, move TaskTestUtil to test folder
Jordan Ly [Mon, 19 Mar 2018 17:07:30 +0000 (10:07 -0700)] 
Remove unused module in RecoveryTool, move TaskTestUtil to test folder

Removing an unused `TierModule` from `RecoveryTool`.

Additionally, resolved an old TODO and moved `TaskTestUtil` to the test folder.
It seems that the old version of JMH could not see test sources but
https://github.com/melix/jmh-gradle-plugin/issues/31 and the upgrade to 0.4.4
seems to fix that.

Reviewed at https://reviews.apache.org/r/65769/

7 months agoRefactor ClusterState to more appropriate package, move binding to StateModule
Jordan Ly [Sun, 18 Mar 2018 05:00:21 +0000 (22:00 -0700)] 
Refactor ClusterState to more appropriate package, move binding to StateModule

While browsing through the code, I noticed that if preemption is turned off, the
`/state` endpoint will not work since `ClusterState` is not bound.

I moved `ClusterState` and `ClusterStateImpl` to a more suitable package, and
bind `ClusterState` in `StateModule` no matter what.

Reviewed at https://reviews.apache.org/r/66074/

7 months agoUpgrade RBT to 0.7.11
Renan DelValle [Fri, 16 Mar 2018 17:40:48 +0000 (10:40 -0700)] 
Upgrade RBT to 0.7.11

Reviewed at https://reviews.apache.org/r/65873/

8 months agoIncrementing snapshot version to 0.21.0-SNAPSHOT.
Renan DelValle [Fri, 2 Mar 2018 00:35:48 +0000 (16:35 -0800)] 
Incrementing snapshot version to 0.21.0-SNAPSHOT.

8 months agoUpdating CHANGELOG for 0.20.0 release.
Renan DelValle [Fri, 2 Mar 2018 00:35:48 +0000 (16:35 -0800)] 
Updating CHANGELOG for 0.20.0 release.

8 months agoPreparing RELEASE-NOTES.md for release.
Renan DelValle [Fri, 2 Mar 2018 00:31:48 +0000 (16:31 -0800)] 
Preparing RELEASE-NOTES.md for release.

8 months agoUse StandardCharsets instead of Charset.forName in ApiModule
Jordan Ly [Wed, 28 Feb 2018 18:41:32 +0000 (10:41 -0800)] 
Use StandardCharsets instead of Charset.forName in ApiModule

Minor nit: use the `StandardCharsets` constant for UTF-8 as opposed to creating it ourselves.

Reviewed at https://reviews.apache.org/r/65761/

8 months agoRevert "Fix cron id collision bug by avoiding state in Quartz jobs"
Jordan Ly [Tue, 27 Feb 2018 18:18:46 +0000 (10:18 -0800)] 
Revert "Fix cron id collision bug by avoiding state in Quartz jobs"

This reverts commit e2ea191473397691605602c6e40c6aad8a56d81a.

A bug was found where jobs that were killed via the KILL_EXISTING flag would set `path` as `null` in
`JobDataMap` that would block concurrent runs, but that value would never be set to `key` after
the delayed run finished because it would run outside of the `Job` execution.

The issue in https://reviews.apache.org/r/65680/ will occur again, but it is rare and has been
around for a few years.

This bug was not caught in the unit test `testKillExisting` because `executeWithReplay` is mocked
and runs synchronously within the `Job` execution, allowing the `key` to be persisted.

Reviewed at https://reviews.apache.org/r/65810/

8 months agoAdd GPUs as resources in the Aurora CLI.
Franck Cuny [Wed, 21 Feb 2018 18:55:41 +0000 (10:55 -0800)] 
Add GPUs as resources in the Aurora CLI.

Reviewed at https://reviews.apache.org/r/65735/

8 months agoUpgrading Vagrant setup from Ubuntu Trusty to Ubuntu Xenial.
Renan DelValle [Wed, 21 Feb 2018 17:43:45 +0000 (09:43 -0800)] 
Upgrading Vagrant setup from Ubuntu Trusty to Ubuntu Xenial.

Deleted Thrift install from build script as it's no longer needed.

Changing docker images in e2e tests from Debian Jessie to Debian Stretch for ABI compatibility.

Bugs closed: AURORA-1964

Reviewed at https://reviews.apache.org/r/65565/

8 months agoFix cron id collision bug by avoiding state in Quartz jobs
Jordan Ly [Tue, 20 Feb 2018 05:33:21 +0000 (21:33 -0800)] 
Fix cron id collision bug by avoiding state in Quartz jobs

There is a pretty rare situation that can occur that will cause the scheduler to crash.

The steps are:

1. Schedule and start a cron (runs every minute, graceful shutdown period
   > 1 minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and
a new cron trying to start

The reason for this bug is that some state is persisted between cron
scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress"
token, while the `killFollowups` set indicates "completion" in order to ensure there are no
concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the
"completion" token in `killFollowups`. Later, a "work in progress" token may be added and
a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.

For the example above, the runs in step 2 will add the key to the set to show that all runs are
finished and another run can start. The 3rd run in step
5 will mistakenly see that the 2nd run has started and finished since the "completion" token was
preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run
causing an ID collision.

We should not preserve any state between cron scheduling/descheduling outside of the given Quartz
`JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as
`killFollowups`.

Reviewed at https://reviews.apache.org/r/65680/

8 months agoMake charset parsing in HTTP headers case insensitive
Jordan Ly [Mon, 19 Feb 2018 07:26:15 +0000 (23:26 -0800)] 
Make charset parsing in HTTP headers case insensitive

Users have reported that the UI does not load in Firefox. Investigating the issue shows that Chrome
and Safari will format charset into an uppercase `UTF-8` which is accepted by the servlet. However,
Mozilla will leave the charset as lowercase `utf-8` which causes a 415 response.

Charsets should be case-insensitive but the default Java `MediaType` class does not take this into
account when parsing/comparing. I propose switching to Guava's `MediaType` class which does smarter
comparisons.
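The comparison being asked for can be sketched as follows. This is an illustration of case-insensitive charset matching only, not the servlet code or Guava's implementation; the function names, the default expected type, and the choice to treat a missing charset as UTF-8 are assumptions.

```python
def parse_content_type(header):
    """Split a Content-Type header into (media type, parameters).

    Hypothetical helper: lowercases the media type and parameter names,
    leaves parameter values untouched.
    """
    media_type, _, rest = header.partition(";")
    params = {}
    for part in rest.split(";"):
        name, sep, value = part.partition("=")
        if sep:
            params[name.strip().lower()] = value.strip().strip('"')
    return media_type.strip().lower(), params


def accepts(header, expected_type="application/json"):
    """True if the header names expected_type with a UTF-8 charset."""
    media_type, params = parse_content_type(header)
    # Assumption: a missing charset parameter is treated as UTF-8.
    charset = params.get("charset", "utf-8")
    return media_type == expected_type and charset.lower() == "utf-8"


print(accepts("application/json; charset=UTF-8"))
print(accepts("application/json; charset=utf-8"))
```

Both spellings of the charset are accepted because the value is lowercased before the comparison.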

Reviewed at https://reviews.apache.org/r/65690/

8 months agoDisable pytest-fast mode as a workaround for failing health checker tests.
Stephan Erb [Thu, 15 Feb 2018 19:19:28 +0000 (11:19 -0800)] 
Disable pytest-fast mode as a workaround for failing health checker tests.

Bugs closed: AURORA-1972

Reviewed at https://reviews.apache.org/r/65598/

8 months agoDo not reschedule a PARTITIONED task if it was in KILLING state.
David McLaughlin [Wed, 14 Feb 2018 19:18:31 +0000 (11:18 -0800)] 
Do not reschedule a PARTITIONED task if it was in KILLING state.

Reviewed at https://reviews.apache.org/r/65648/

8 months agoAdd GPG key for jordanly@apache.org
Jordan Ly [Wed, 14 Feb 2018 18:50:36 +0000 (10:50 -0800)] 
Add GPG key for jordanly@apache.org

Reviewed at https://reviews.apache.org/r/65650/

8 months agoAdding support for Thrift JSON requests which defines UTF-8 as the charset for the...
Renan DelValle [Wed, 14 Feb 2018 16:28:52 +0000 (08:28 -0800)] 
Adding support for Thrift JSON requests which defines UTF-8 as the charset for the Content-Type in the Request Headers

This fixes the current UI breakage, as Thrift is incorrectly rejected by the scheduler servlet as an unsupported media type.

Test added to prevent regressions.

Reviewed at https://reviews.apache.org/r/65649/

9 months agoUpdate Javascript Thrift to 0.10
Stephan Erb [Fri, 9 Feb 2018 15:10:53 +0000 (15:10 +0000)] 
Update Javascript Thrift to 0.10

We missed this change when bumping the Thrift version used by Python and
Java. https://reviews.apache.org/r/64290/

List of changes: https://github.com/apache/thrift/commits/master/lib/js/src/thrift.js

Reviewed at https://reviews.apache.org/r/65433/

9 months agoUse overflow to prevent overlapping config summary tables
David McLaughlin [Tue, 6 Feb 2018 22:13:14 +0000 (14:13 -0800)] 
Use overflow to prevent overlapping config summary tables

Previous approach seemed almost random with the way it used line breaks for words. Caused some weirdness for edge cases. This approach just caps the size of table cells, since only metadata can grow unbounded.

Reviewed at https://reviews.apache.org/r/65537/

9 months agoAdd task page to Scheduler UI (without inbound links yet - this is for external refer...
David McLaughlin [Tue, 6 Feb 2018 01:26:11 +0000 (17:26 -0800)] 
Add task page to Scheduler UI (without inbound links yet - this is for external referencing).

Reviewed at https://reviews.apache.org/r/65494/

9 months agoShow cron job preview when no active tasks.
David McLaughlin [Mon, 5 Feb 2018 18:13:01 +0000 (10:13 -0800)] 
Show cron job preview when no active tasks.

Reviewed at https://reviews.apache.org/r/65501/

9 months agoAdding gpg key for renan
Renan DelValle [Sat, 3 Feb 2018 20:25:40 +0000 (12:25 -0800)] 
Adding gpg key for renan

Reviewed at https://reviews.apache.org/r/65488/

9 months agoFix UI table layout issue on Config Summaries
David McLaughlin [Fri, 2 Feb 2018 18:52:45 +0000 (10:52 -0800)] 
Fix UI table layout issue on Config Summaries

Reviewed at https://reviews.apache.org/r/65477/

9 months agoAdd PartitionPolicy to config summary when defined
David McLaughlin [Fri, 2 Feb 2018 18:05:04 +0000 (10:05 -0800)] 
Add PartitionPolicy to config summary when defined

Reviewed at https://reviews.apache.org/r/65476/

9 months agoEnsure primary_port warning respects announcer portmap
Stephan Erb [Wed, 31 Jan 2018 21:25:57 +0000 (21:25 +0000)] 
Ensure primary_port warning respects announcer portmap

This eliminates false-positive warnings in the client: It used to complain
about unbound primary ports if those were bound via the portmap.

Bugs closed: AURORA-1233

Reviewed at https://reviews.apache.org/r/65434/

9 months agoImprove performance of MemTaskStore queries
Bill Farner [Wed, 31 Jan 2018 22:59:30 +0000 (14:59 -0800)] 
Improve performance of MemTaskStore queries

Use `ArrayDeque` rather than `HashSet` for fetchTasks, and use imperative style
rather than functional.  I arrived at this result after running benchmarks with
some of the other usual suspects (`ArrayList`, `LinkedList`).

This patch also enables stack and heap profilers in jmh (more details
[here](http://hg.openjdk.java.net/codetools/jmh/file/25d8b2695bac/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_35_Profilers.java)),
providing insight into the heap impact of changes.  I started this change with a
heap profiler as the primary motivation, and ended up using it to guide this
improvement.

Reviewed at https://reviews.apache.org/r/65303/

9 months agoSupport PARTITIONED state in SLA calculations. Also added a test to protect against...
David McLaughlin [Tue, 30 Jan 2018 22:06:39 +0000 (14:06 -0800)] 
Support PARTITIONED state in SLA calculations. Also added a test to protect against this test failing in the future.

Reviewed at https://reviews.apache.org/r/65281/

9 months agoFix infinite loop in Task State Machine due to TASK_UNKNOWN handling
David McLaughlin [Tue, 30 Jan 2018 22:02:09 +0000 (14:02 -0800)] 
Fix infinite loop in Task State Machine due to TASK_UNKNOWN handling

This patch cleans up the logic. The two main changes:

1) Do not allow ASSIGNED -> PARTITIONED. This is not really related to this bug, but I found this logic error during debugging. ASSIGNED is a transient state and is subject to the transient task timeout in the Scheduler, so we should not attempt to move to PARTITIONED during that window.
2) Do not try to kill tasks we think are terminal when Mesos tells us they are unknown. Originally we did this because "manageTerminalTasks" is also used for restarting tasks, but in both cases it never makes sense to respond to "I don't know about that task" with a request to kill it.

Bugs closed: AURORA-1966

Reviewed at https://reviews.apache.org/r/65339/

9 months agoFix error handling logic for launch failures
David McLaughlin [Thu, 25 Jan 2018 18:50:45 +0000 (10:50 -0800)] 
Fix error handling logic for launch failures

Bugs closed: AURORA-1965

Reviewed at https://reviews.apache.org/r/65338/

9 months agoAllow for injection of custom OfferSets, removed OfferOrder and OfferSelector
Jordan Ly [Sat, 20 Jan 2018 10:27:54 +0000 (11:27 +0100)] 
Allow for injection of custom OfferSets, removed OfferOrder and OfferSelector

The goal of this patch is to provide a more reasonable abstraction for custom
scheduling.

Currently, we have `OfferSelector`, `OfferOrder`, and recently I proposed
`FilterableOfferCollection` (https://reviews.apache.org/r/65225/). These were
all created in order to provide more customization within the scheduling loop.
However, they suffer from being leaky and too disparate. This patch hopes to
combine all of those components into a single `OfferSet` which can be injected
and used by HostOffers. This interface allows for custom scheduling logic to
be co-located with custom data structures for `offers` as opposed to being
passed around as different components.

The following option is removed between the 0.19 release and this change:
```
-offer_order_modules
```

Reviewed at https://reviews.apache.org/r/65233/

9 months agoPrint command line parameters when the scheduler starts
Bill Farner [Fri, 19 Jan 2018 04:51:42 +0000 (20:51 -0800)] 
Print command line parameters when the scheduler starts

Realized I never added this in the command line parser change.
Note that this output differs from the original code in one important
way - it uses `toString()` on the parameter type rather than printing
the raw value from the command line.  Unfortunately jcommander does not
make that possible.  `shiro_ini_path` is one example of an arg that
would ideally print differently here.

Reviewed at https://reviews.apache.org/r/65234/

9 months agoGitHub Pull Request template to discourage folks from making PRs
Renan DelValle [Fri, 19 Jan 2018 03:29:40 +0000 (19:29 -0800)] 
GitHub Pull Request template to discourage folks from making PRs

Simple pull request template encouraging potential contributors
to submit via ReviewBoard instead of opening up a PR on GitHub.

Normally I wouldn't want to add something to the repo that is
platform specific, but given the fact that we have a lot of exposure
on GitHub and that there's no way to disallow PRs, it might be a
good idea to point potential contributors in the right direction
from the get go.

Reviewed at https://reviews.apache.org/r/65222/

9 months agoAdded ExclusionStrategy to Gson formatter in structdump
Juan Manuel Fresia [Wed, 17 Jan 2018 23:02:26 +0000 (15:02 -0800)] 
Added ExclusionStrategy to Gson formatter in structdump

This hides thrift metadata fields such as "__isset_bitfield"
and the "optionals" enum.
Added a FieldNamingStrategy to rename "value_" and "setName_"
fields to "value" and "key" on map formatting.

Reviewed at https://reviews.apache.org/r/65076/

10 months agoUpdate discovery info documentation, when using mesos-dns
Ruben D. Porras [Wed, 10 Jan 2018 13:39:27 +0000 (14:39 +0100)] 
Update discovery info documentation, when using mesos-dns

Reviewed at https://reviews.apache.org/r/65068/

10 months agoRefactor scheduling code to split matching and assigning phases
Bill Farner [Tue, 9 Jan 2018 22:50:51 +0000 (14:50 -0800)] 
Refactor scheduling code to split matching and assigning phases

This patch sets the stage for performing the bulk of scheduling work in a
separate call path, without holding the write lock.

Also included is a mechanical refactor pushing the `revocable` flag into
`ResourceRequest` (which was ~always needed as a sibling parameter).

Reviewed at https://reviews.apache.org/r/64954/

10 months agoCustom converter to allow the -thermos_executor_resources
Renan DelValle [Tue, 9 Jan 2018 21:27:28 +0000 (13:27 -0800)] 
Custom converter to allow the -thermos_executor_resources
flag to take an empty string and parse it to an empty list

Fixes the issue that caused the voting to fail for the 0.19.0 Aurora packages.
Fix cribbed from: https://github.com/cbeust/jcommander/pull/422
Implemented as a custom converter as suggested here:
https://reviews.apache.org/r/64824/
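The behavior wanted from the converter can be sketched in Python, where a plain `split` has a similar pitfall: it turns an empty string into a list holding one empty element. The function name is hypothetical and this is not the jcommander converter from the patch.

```python
def parse_resource_list(raw):
    """Split a comma-separated flag value; an empty value yields an empty list."""
    return [item.strip() for item in raw.split(",") if item.strip()]


print("".split(","))            # naive split keeps one empty element
print(parse_resource_list(""))  # converter-style parsing drops it
```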

Reviewed at https://reviews.apache.org/r/64934/

10 months agoAdd a test to detect incompatible storage changes
Bill Farner [Thu, 4 Jan 2018 16:02:55 +0000 (08:02 -0800)] 
Add a test to detect incompatible storage changes

This is intended as a safeguard against future compatibility regressions like
[AURORA-1959](https://issues.apache.org/jira/browse/AURORA-1959).

I approached this with a few goals:

  - golden files should be text-based and human-readable.  This allows for
    non-opaque code reviews, and simpler remedy when it's necessary to update
    the goldens (i.e. copy-pasteable)
  - guidance for schema evolution should be included directly in test failures
  - separate detection of 'what the scheduler _can_ read' and 'what the
    scheduler writes'
  - reasonably-complete schema coverage with minimal manual labor.  These tests
    auto-generate structs to mitigate maintenance burden of test code as
    schemas evolve.

This is not a replacement for vigilance with data compatibility, but it should
at least

1. mitigate unintentional breakages in compatibility, especially for new
   contributors
2. draw code reviewer attention to compatibility changes in a patch (signaled by
   changes to golden files)
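A minimal sketch of the golden-file idea, assuming JSON as the text format and hypothetical helper names (the actual tests in this patch work on the scheduler's storage structs):

```python
import json


def render_golden(struct):
    """Serialize deterministically so diffs stay small and reviewable."""
    return json.dumps(struct, indent=2, sort_keys=True) + "\n"


def check_against_golden(struct, golden_text):
    """Fail with guidance and a copy-pasteable replacement when output drifts."""
    actual = render_golden(struct)
    if actual != golden_text:
        raise AssertionError(
            "Serialized form differs from the golden file. If the schema "
            "change is intentional and compatible, update the golden to:\n"
            + actual
        )


# Made-up struct standing in for a stored object.
golden = render_golden({"id": "task-1", "status": "RUNNING"})
check_against_golden({"status": "RUNNING", "id": "task-1"}, golden)
print("golden matches")
```

Sorting keys keeps the rendering independent of field order, so only real schema or value changes show up in a review.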

Reviewed at https://reviews.apache.org/r/64519/

10 months agoAdd metadata field to Job object in DSL
Jing Chen [Sun, 17 Dec 2017 16:26:33 +0000 (08:26 -0800)] 
Add metadata field to Job object in DSL

Bugs closed: AURORA-1898

Reviewed at https://reviews.apache.org/r/64341/