airflow-ci-infra.git
5 months ago New version of runner
Jarek Potiuk [Sat, 19 Mar 2022 18:51:04 +0000 (19:51 +0100)] 
New version of runner

5 months ago Merge pull request #50 from apache/add-qemu
Jarek Potiuk [Wed, 9 Mar 2022 19:11:13 +0000 (20:11 +0100)] 
Merge pull request #50 from apache/add-qemu

Add qemu installation

5 months ago Add qemu installation add-qemu 50/head
Jarek Potiuk [Wed, 9 Mar 2022 16:31:59 +0000 (17:31 +0100)] 
Add qemu installation

5 months ago Merge pull request #49 from apache/add-multi-platform-buildx-support
Jarek Potiuk [Wed, 9 Mar 2022 15:39:48 +0000 (16:39 +0100)] 
Merge pull request #49 from apache/add-multi-platform-buildx-support

Add support for multi-platform buildx builds

5 months ago Add support for multi-platform buildx builds add-multi-platform-buildx-support 49/head
Jarek Potiuk [Wed, 9 Mar 2022 15:03:09 +0000 (16:03 +0100)] 
Add support for multi-platform buildx builds

5 months ago Runner v2.288.1-airflow1
Ash Berlin-Taylor [Tue, 1 Mar 2022 10:28:25 +0000 (10:28 +0000)] 
Runner v2.288.1-airflow1

6 months ago Install entropy daemon better suited to running on VMs
Ash Berlin-Taylor [Fri, 4 Feb 2022 14:44:15 +0000 (14:44 +0000)] 
Install entropy daemon better suited to running on VMs

> Haveged was created to remedy low-entropy conditions in the Linux
> random device that can occur under some workloads, especially on
> headless servers

6 months ago Upgrade to v2.287.1
Jarek Potiuk [Fri, 28 Jan 2022 10:33:35 +0000 (11:33 +0100)] 
Upgrade to v2.287.1

6 months ago Disable shipping logs to CloudWatch
Ash Berlin-Taylor [Fri, 21 Jan 2022 20:15:12 +0000 (20:15 +0000)] 
Disable shipping logs to CloudWatch

This was useful when we were debugging the "communication lost with
instance" errors, but we haven't seen those in months, and CloudWatch Logs
accounts for 10% of our monthly AWS spend(!), so we just don't need it
anymore.

I have included some previously un-pushed changes to the Vector config
to drop a few of the more common and less interesting lines. Useful for
posterity.

6 months ago Merge pull request #48 from apache/really-proper-permissions-for-buildx-plugin
Jarek Potiuk [Fri, 21 Jan 2022 15:25:05 +0000 (16:25 +0100)] 
Merge pull request #48 from apache/really-proper-permissions-for-buildx-plugin

Sets up proper ownership and permissions for buildx plugin

6 months ago Sets up proper ownership and permissions for buildx plugin 48/head
Jarek Potiuk [Fri, 21 Jan 2022 12:54:31 +0000 (13:54 +0100)] 
Sets up proper ownership and permissions for buildx plugin

6 months ago fix directory for the buildx plugin
Jarek Potiuk [Fri, 21 Jan 2022 08:31:02 +0000 (09:31 +0100)] 
fix directory for the buildx plugin

6 months ago Merge pull request #47 from apache/add-docker-buildx-plugin
Jarek Potiuk [Thu, 20 Jan 2022 19:36:08 +0000 (20:36 +0100)] 
Merge pull request #47 from apache/add-docker-buildx-plugin

Add Docker Buildx plugin

6 months ago Add Docker Buildx plugin add-docker-buildx-plugin 47/head
Jarek Potiuk [Thu, 20 Jan 2022 18:42:47 +0000 (19:42 +0100)] 
Add Docker Buildx plugin

7 months ago Updated versions 286.1
Jarek Potiuk [Tue, 18 Jan 2022 23:32:07 +0000 (00:32 +0100)] 
Updated versions 286.1

7 months ago Update runner to 2.286.0
Ash Berlin-Taylor [Tue, 4 Jan 2022 12:50:42 +0000 (12:50 +0000)] 
Update runner to 2.286.0

8 months ago Update actions runner to 2.285.1-airflow1
Ash Berlin-Taylor [Tue, 7 Dec 2021 10:53:08 +0000 (10:53 +0000)] 
Update actions runner to 2.285.1-airflow1

8 months ago Move to latest version
Jarek Potiuk [Tue, 30 Nov 2021 15:17:15 +0000 (16:17 +0100)] 
Move to latest version

8 months ago Merge pull request #46 from apache/restart_note
Jarek Potiuk [Wed, 24 Nov 2021 16:56:11 +0000 (17:56 +0100)] 
Merge pull request #46 from apache/restart_note

Add note to restart runners when updating committers

8 months ago Add note to restart runners when updating committers restart_note 46/head
Jed Cunningham [Tue, 23 Nov 2021 22:28:35 +0000 (15:28 -0700)] 
Add note to restart runners when updating committers

8 months ago Add script to list out committers (#45)
Jed Cunningham [Tue, 23 Nov 2021 21:15:55 +0000 (14:15 -0700)] 
Add script to list out committers (#45)

8 months ago Merge pull request #44 from apache/add-latest-git-version
Jarek Potiuk [Fri, 19 Nov 2021 10:06:58 +0000 (11:06 +0100)] 
Merge pull request #44 from apache/add-latest-git-version

Add latest git version

8 months ago Add latest git version add-latest-git-version 44/head
Jarek Potiuk [Fri, 19 Nov 2021 03:06:40 +0000 (04:06 +0100)] 
Add latest git version

Git should be updated to the latest version to support the
--ignore-matching-lines option

9 months ago Updating to runner v 2.284.0-airflow1 (#43)
Ash Berlin-Taylor [Tue, 2 Nov 2021 17:24:53 +0000 (17:24 +0000)] 
Updating to runner v 2.284.0-airflow1 (#43)

10 months ago Update to 2.283.3
Ash Berlin-Taylor [Tue, 5 Oct 2021 10:22:03 +0000 (11:22 +0100)] 
Update to 2.283.3

10 months ago Merge pull request #42 from apache/fix-docker-compose-incompatibilities
Jarek Potiuk [Tue, 5 Oct 2021 00:04:05 +0000 (02:04 +0200)] 
Merge pull request #42 from apache/fix-docker-compose-incompatibilities

Temporary fix docker-compose to 1.29.2

10 months ago Temporary fix docker-compose to 1.29.2 42/head
Jarek Potiuk [Mon, 4 Oct 2021 23:35:32 +0000 (01:35 +0200)] 
Temporary fix docker-compose to 1.29.2

Docker-compose 2 breaks kerberos integration and we need to
hard-code 1.29.2 temporarily until
https://github.com/docker/compose/issues/8742 is solved

10 months ago Docker-compose uses lowercase Linux now :(
Jarek Potiuk [Sun, 3 Oct 2021 17:12:05 +0000 (19:12 +0200)] 
Docker-compose uses lowercase Linux now :(

10 months ago Move to 283.2
Jarek Potiuk [Sun, 3 Oct 2021 16:53:27 +0000 (18:53 +0200)] 
Move to 283.2

10 months ago Merge pull request #41 from apache/update-version-to-282.0
Jarek Potiuk [Tue, 21 Sep 2021 09:21:45 +0000 (11:21 +0200)] 
Merge pull request #41 from apache/update-version-to-282.0

Update Version to latest

10 months ago Update Version to latest update-version-to-282.0 41/head
Jarek Potiuk [Tue, 21 Sep 2021 09:21:05 +0000 (11:21 +0200)] 
Update Version to latest

11 months ago Install `gh` CLI in runners (#40)
Ash Berlin-Taylor [Wed, 8 Sep 2021 12:57:23 +0000 (13:57 +0100)] 
Install `gh` CLI in runners (#40)

11 months ago New runner version
Ash Berlin-Taylor [Fri, 3 Sep 2021 17:41:40 +0000 (18:41 +0100)] 
New runner version

11 months ago Merge pull request #39 from apache/update-to-2.280.3
Jarek Potiuk [Tue, 24 Aug 2021 17:13:08 +0000 (19:13 +0200)] 
Merge pull request #39 from apache/update-to-2.280.3

Update runner to 2.280.3

11 months ago Update runner to 2.280.3 update-to-2.280.3 39/head
Jarek Potiuk [Fri, 20 Aug 2021 09:14:38 +0000 (11:14 +0200)] 
Update runner to 2.280.3

12 months ago Merge pull request #38 from apache/upgrade-to-latest-runner
Jarek Potiuk [Wed, 18 Aug 2021 16:40:19 +0000 (18:40 +0200)] 
Merge pull request #38 from apache/upgrade-to-latest-runner

Upgrade to latest released runner version

12 months ago Upgrade to latest released runner version upgrade-to-latest-runner 38/head
Jarek Potiuk [Wed, 11 Aug 2021 20:07:12 +0000 (22:07 +0200)] 
Upgrade to latest released runner version

12 months ago Merge pull request #37 from apache/disable-packer-role
Jarek Potiuk [Thu, 12 Aug 2021 23:17:59 +0000 (01:17 +0200)] 
Merge pull request #37 from apache/disable-packer-role

Remove unnecessary packer role setting

12 months ago Update github-runner-ami/packer/ubuntu2004.pkr.hcl 37/head
Jarek Potiuk [Thu, 12 Aug 2021 23:07:17 +0000 (01:07 +0200)] 
Update github-runner-ami/packer/ubuntu2004.pkr.hcl

Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>

12 months ago Remove unnecessary packer role setting
Jarek Potiuk [Wed, 11 Aug 2021 20:05:57 +0000 (22:05 +0200)] 
Remove unnecessary packer role setting

12 months ago Update runner version
Ash Berlin-Taylor [Wed, 28 Jul 2021 13:29:08 +0000 (14:29 +0100)] 
Update runner version

12 months ago Update to vector 0.15 syntax (#36)
Ash Berlin-Taylor [Wed, 28 Jul 2021 12:59:09 +0000 (13:59 +0100)] 
Update to vector 0.15 syntax (#36)

Vector 0.15 had a breaking change that stopped our config working, so
rather than only noticing at run time, fail at build time

12 months ago Increase logging from runner supervisor. (#34)
Ash Berlin-Taylor [Wed, 28 Jul 2021 12:33:52 +0000 (13:33 +0100)] 
Increase logging from runner supervisor. (#34)

This will be useful in tracking down cases where instances terminate
"too early".

12 months ago Allow more time for runner to stop before terminating (#35)
Ash Berlin-Taylor [Wed, 28 Jul 2021 12:33:41 +0000 (13:33 +0100)] 
Allow more time for runner to stop before terminating (#35)

With this timeout at 5 minutes, it means that any running jobs only had
5 minutes to complete before the runner would be killed and the instance
terminated.

Hopefully by increasing this we should reduce/entirely remove the
"communication lost with runner" errors in our self-hosted builds

12 months ago Clean up MSSQL data folders between runs (#33)
Ash Berlin-Taylor [Wed, 28 Jul 2021 10:20:12 +0000 (11:20 +0100)] 
Clean up MSSQL data folders between runs (#33)

14 months ago Fix syntax error in stop-runner-if-no-job.sh (#32)
Ash Berlin-Taylor [Tue, 8 Jun 2021 16:03:31 +0000 (17:03 +0100)] 
Fix syntax error in stop-runner-if-no-job.sh (#32)

This trailing `fi` on the end would likely cause the Runner.Listener to
be terminated too early.

15 months ago Periodically try to complete the lifecycle hook if we are Pending (#31)
Ash Berlin-Taylor [Tue, 18 May 2021 11:18:57 +0000 (12:18 +0100)] 
Periodically try to complete the lifecycle hook if we are Pending (#31)

Thanks to now using the AMI, the instance can come up a lot quicker than
previously, meaning it is possible the supervisor might start up when
the instance state is still "Pending", which results in the
complete_lifecycle_action call failing.

Rather than blocking the actions.runner service from starting (and not
running jobs) we continue, and check later if we are still pending, and
try to complete the lifecycle action again.
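
A minimal sketch of the retry idea described above, not the supervisor's actual code. The `LifecycleHook` class and the `complete_action` callable are hypothetical names; in practice the callable would wrap the complete_lifecycle_action call, and the broad `except` would be narrowed to the error that call raises.

```python
from typing import Callable


class LifecycleHook:
    """Remembers whether the lifecycle action has been completed yet."""

    def __init__(self, complete_action: Callable[[], None]) -> None:
        # complete_action is a hypothetical wrapper around the real API call.
        self._complete_action = complete_action
        self.completed = False

    def try_complete(self) -> bool:
        """Try once; never raise, so the runner service can start regardless."""
        if self.completed:
            return True
        try:
            self._complete_action()
        except Exception:
            # Assumption: the call fails while the instance is still Pending.
            # Report failure and let the caller try again on its next tick.
            return False
        self.completed = True
        return True
```

The supervisor's periodic loop could call `try_complete()` on every tick; once it returns True, later calls are no-ops.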

15 months ago Don't kill Runner.Listener when steps running (#30)
Ash Berlin-Taylor [Tue, 18 May 2021 09:48:09 +0000 (10:48 +0100)] 
Don't kill Runner.Listener when steps running (#30)

Because of the `--once` flag we already pass to the daemon, _if_ there
were any jobs running, then the main actions-runner daemon (the
Runner.Listener process) will shut itself down once it's _actually_
finished.

So we want to avoid the `pkill Runner.Listener` in the case when a step
was executing, as otherwise we will likely kill the daemon before it has
a chance to complete the job/report back to GitHub/run the next step.
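
The decision above can be sketched as a small predicate. This is an illustration only: `should_kill_listener` is a hypothetical helper, and detecting a running step by the presence of a `Runner.Worker` process is an assumption, not necessarily what the real script checks.

```python
from typing import Iterable

LISTENER = "Runner.Listener"
# Assumption: a Runner.Worker process exists while a job step is executing.
WORKER = "Runner.Worker"


def should_kill_listener(process_names: Iterable[str]) -> bool:
    """Return True only when the listener is idle and safe to pkill."""
    names = set(process_names)
    if LISTENER not in names:
        # Nothing to kill.
        return False
    # With --once the listener exits by itself after the job, so leave it
    # alone whenever a step is still running.
    return WORKER not in names
```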

15 months ago Merge pull request #29 from apache/switch-to-ami
Jarek Potiuk [Mon, 17 May 2021 09:05:11 +0000 (11:05 +0200)] 
Merge pull request #29 from apache/switch-to-ami

Switch to using the built AMI instead of cloud-init

15 months ago Switch to using the built AMI instead of cloud-init 29/head
Ash Berlin-Taylor [Mon, 17 May 2021 08:29:23 +0000 (09:29 +0100)] 
Switch to using the built AMI instead of cloud-init

We have packer configured to build an AMI that has everything
pre-installed, so all we have left to do in the cloud init is drop the
env var so that we know which region we are in (without having to query
it every time) and start the runner and log shipper services.

15 months ago Start vector log-shipping later, once env var is configured (#28)
Ash Berlin-Taylor [Mon, 17 May 2021 08:35:46 +0000 (09:35 +0100)] 
Start vector log-shipping later, once env var is configured (#28)

Since we have to restart it in the cloud-init anyway (once we know which
region we are in) I have disabled it in the packer build scripts so it
doesn't try to start up on boot too early

15 months ago Perform a docker login before starting the actions runner script (#27)
Ash Berlin-Taylor [Mon, 10 May 2021 10:47:42 +0000 (11:47 +0100)] 
Perform a docker login before starting the actions runner script (#27)

This was done in the cloud-init, but missed from the migration to packer
build scripts.

15 months ago Install node in the AMI (#22)
Ash Berlin-Taylor [Mon, 10 May 2021 08:40:45 +0000 (09:40 +0100)] 
Install node in the AMI (#22)

Despite not being in the cloud-init script, it was _somehow_ not causing
a problem, but it not being present in the AMI made production builds
fail

15 months ago Send logs to Cloudwatch in the same region, not always to Frankfurt (#25)
Ash Berlin-Taylor [Fri, 7 May 2021 17:57:04 +0000 (18:57 +0100)] 
Send logs to Cloudwatch in the same region, not always to Frankfurt (#25)

15 months ago Make AMI available in eu-central-1 and us-east-2 regions (#26)
Ash Berlin-Taylor [Fri, 7 May 2021 17:56:48 +0000 (18:56 +0100)] 
Make AMI available in eu-central-1 and us-east-2 regions (#26)

15 months ago Cleanup logs and "build state" from the AMI (#23)
Ash Berlin-Taylor [Fri, 7 May 2021 14:00:50 +0000 (15:00 +0100)] 
Cleanup logs and "build state" from the AMI (#23)

Not doing this doesn't cause any harm, but it is cleaner to not have
this state included in the AMI

15 months ago Use the cheaper ASG in Ohio (#24)
Ash Berlin-Taylor [Fri, 7 May 2021 14:00:29 +0000 (15:00 +0100)] 
Use the cheaper ASG in Ohio (#24)

15 months ago Merge pull request #21 from apache/fix-custom-metric-cron
Jarek Potiuk [Thu, 6 May 2021 09:34:16 +0000 (11:34 +0200)] 
Merge pull request #21 from apache/fix-custom-metric-cron

Fix the custom-Cloudwatch metric cron job in the AMI

15 months ago Fix the custom-Cloudwatch metric cron job in the AMI 21/head
Ash Berlin-Taylor [Thu, 6 May 2021 09:22:51 +0000 (10:22 +0100)] 
Fix the custom-Cloudwatch metric cron job in the AMI

15 months ago Update requirements (#18)
Jarek Potiuk [Tue, 4 May 2021 11:30:40 +0000 (13:30 +0200)] 
Update requirements (#18)

Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>

15 months ago Don't encrypt the AMI's root snapshot (#17)
Ash Berlin-Taylor [Fri, 23 Apr 2021 10:46:49 +0000 (11:46 +0100)] 
Don't encrypt the AMI's root snapshot (#17)

We are an open-source project, so we don't need to pay the cost or
complexity of having this, but mainly having an ASG launch this AMI
means we need to set up a more complex "Service-Linked" IAM role, which
is complexity we just don't need.

15 months ago Fix runner AMI so it (#16)
Ash Berlin-Taylor [Fri, 23 Apr 2021 09:25:04 +0000 (10:25 +0100)] 
Fix runner AMI so it (#16)

- Update to the latest runner version
- Install the vector.toml config file
- Install stop-runner-if-no-job into the correct path
- Don't enable actions.runner service at boot (do it slightly later in
  user data)

15 months ago Use Packer to build a pre-built AMI with everything we need (#15)
Mike Hewitt [Thu, 22 Apr 2021 10:41:02 +0000 (06:41 -0400)] 
Use Packer to build a pre-built AMI with everything we need (#15)

* initial packer and tf

* packer added files and scripts from Ash's repo

* add new folder structure and terraform

* updating packer files

* added dependencies file permission and apt source repos

* bootstrap and user data

* prepare packer provisioners and set up all files to be executed

* update tinder

* terraform to create packer roles, starting to fill in packer variables

* packer roles added aws backends, terraform reformed and added iam roles as well as autoscaling cloudwatch alarm and policy

* fixed iam role and removed policy attachments

* first run of packer_roles, terraform add gitignore for terraform

* update packer code from results of validate

* update runner max size of asg

* packer updated to run and terraform roles for packer updated

* Apply suggestions from code review

* Update for pre-commit checks

Add licenses, and remove trailing whitespace

* archive lambda before upload

* remove terraform for ci infra

* Make the packer build produce a working image.

Summary of changes:

- Files need to be copied to a "staging" folder and then moved in place
- Use the built-in upload ability of the shell provisioner
- Have shell provisioner run scripts with sudo, rather than using sudo
  10s of times in the scripts
- Don't set up tmpfs mounts in the AMI -- these have to happen at
  instance boot time, not AMI creation
- Preseed the install options for iptables-persistent so that it
  installs without asking questions or replacing the rules we already
  placed.
- Install the runner-supervisor script from local file, not S3.

Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>

16 months ago Do not pre-bake images in the instance (#13)
Jarek Potiuk [Fri, 2 Apr 2021 17:26:37 +0000 (19:26 +0200)] 
Do not pre-bake images in the instance (#13)

The images are cleaned with docker system prune --all anyway,
and pre-baking saves very little time (10-20 seconds) and no money
(pulls are free) compared with pulling the images as needed from the registry.

16 months ago Runners more resilient to docker login failure (#12)
Jarek Potiuk [Tue, 23 Mar 2021 11:35:15 +0000 (12:35 +0100)] 
Runners more resilient to docker login failure (#12)

Login to docker registry is now done in PreExec and in case it
fails, it also fails the whole service (leading to subsequent
service restart).

Also added `set -eu -o pipefail` to be better protected against
any silent failures.

17 months ago Update actions.runner to 2.277.1-airflow3 (#11)
Ash Berlin-Taylor [Fri, 19 Mar 2021 21:52:10 +0000 (21:52 +0000)] 
Update actions.runner to 2.277.1-airflow3 (#11)

This included extra logging and uses `github.actor`, rather than
`github.pull_request.author` for decisions (to match what we use in our
CI.yml file).

17 months ago Increase logging from actions.runner-supervisor service (#10)
Ash Berlin-Taylor [Fri, 19 Mar 2021 21:52:00 +0000 (21:52 +0000)] 
Increase logging from actions.runner-supervisor service (#10)

This puts the InstanceId in the logs, which makes it searchable in
CloudWatch Logs

17 months ago Strip ANSI escape codes from logs in CloudWatch (#9)
Ash Berlin-Taylor [Fri, 19 Mar 2021 21:51:52 +0000 (21:51 +0000)] 
Strip ANSI escape codes from logs in CloudWatch (#9)

Now that we are including step logs, we need to strip the colour escape
sequences.
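
The commit message does not say how the stripping is done. Purely as an illustration of the technique, a regular expression covering CSI sequences (the family that colour codes belong to) could look like this; `strip_ansi` is a hypothetical helper name.

```python
import re

# CSI sequence: ESC [ , parameter bytes, intermediate bytes, one final byte.
# Covers colour codes such as "\x1b[32m" and the reset "\x1b[0m".
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_ansi(line: str) -> str:
    """Remove CSI escape sequences, leaving the visible text untouched."""
    return ANSI_CSI.sub("", line)
```

Lines without escape sequences pass through unchanged, so the helper is safe to apply to every log line.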

17 months ago Upload job output logs to Cloudwatch too (#8)
Ash Berlin-Taylor [Mon, 15 Mar 2021 14:47:09 +0000 (14:47 +0000)] 
Upload job output logs to Cloudwatch too (#8)

We have some cases where logs aren't being uploaded to Github, which
makes debugging failures hard.

This is a problem with GitHub's hosted runners too, but for self-hosted
runners we can at least do something about it.

17 months ago Add an environment variable to let runners know where they are running (#7)
Ash Berlin-Taylor [Thu, 11 Mar 2021 12:18:07 +0000 (12:18 +0000)] 
Add an environment variable to let runners know where they are running (#7)

This makes it easier to set runs-on in our ci.yml workflow

17 months ago Adds gnu parallel - required to implement semaphores for parallel tests (#6)
Jarek Potiuk [Wed, 10 Mar 2021 10:09:57 +0000 (11:09 +0100)] 
Adds gnu parallel - required to implement semaphores for parallel tests (#6)

17 months ago Remove left-over docker containers before fixing permissions (#5)
Ash Berlin-Taylor [Mon, 1 Mar 2021 11:10:58 +0000 (11:10 +0000)] 
Remove left-over docker containers before fixing permissions (#5)

If the docker container is still running and creating files (as might be
the case for the prod image builds) then some files could be left
uncleaned, causing the next job to fail.

17 months ago User-data script to bootstrap self-hosted runner on ASG (#4)
Ash Berlin-Taylor [Mon, 1 Mar 2021 10:56:24 +0000 (10:56 +0000)] 
User-data script to bootstrap self-hosted runner on ASG (#4)

This runner-supervisor script has been manually uploaded to S3 (it was too big
to include in the userdata)

The cloud-init script was manually uploaded by running the command
below, and the ASG is already configured to pick the Latest version, so
new instances will start using the new script.

```
aws --profile airflow ec2 create-launch-template-version \
    --launch-template-name GithubRunner \
    --launch-template-data UserData="$(base64 -w0 cloud-init.yml)" \
    --source-version='$Latest'
```

18 months ago Lambda function to scale ASG based on Github webhooks (#2)
Ash Berlin-Taylor [Thu, 18 Feb 2021 09:55:49 +0000 (09:55 +0000)] 
Lambda function to scale ASG based on Github webhooks (#2)

19 months ago Merge pull request #1 from apache/register-runner-script
Ash Berlin-Taylor [Fri, 15 Jan 2021 14:29:26 +0000 (14:29 +0000)] 
Merge pull request #1 from apache/register-runner-script

Add script to help store self-hosted runner creds in AWS SSM

19 months ago fixup! Add script to help store self-hosted runner creds in AWS SSM 1/head
Ash Berlin-Taylor [Fri, 15 Jan 2021 12:47:28 +0000 (12:47 +0000)] 
fixup! Add script to help store self-hosted runner creds in AWS SSM

19 months ago Add script to help store self-hosted runner creds in AWS SSM
Ash Berlin-Taylor [Tue, 12 Jan 2021 11:54:51 +0000 (11:54 +0000)] 
Add script to help store self-hosted runner creds in AWS SSM

We can't create self-hosted runners "on-demand", so we need to
pre-create a "pool" of them for use by the auto-scaled nodes.

This script automates the process of converting the short-lived token
into long-lived credentials (by using the runner binaries in a temporary
directory) and then storing the resulting files in AWS's Parameter Store.

19 months ago Add readme
Ash Berlin-Taylor [Mon, 4 Jan 2021 16:10:59 +0000 (16:10 +0000)] 
Add readme