27 June 2021 update: Since we first wrote this blog post back in 2015, a lot has changed! While the advice below will still work, it’s no longer the recommended approach. There are more resilient and cost-efficient ways of running scheduled tasks than having a dedicated EC2 instance running 24/7. We recommend migrating your tasks to Lambda functions. You can trigger them in response to any kind of event, or you can use EventBridge (previously CloudWatch Events) to trigger them on a regular schedule.
If you use Elastic Beanstalk to run and manage your web apps, at some point you’ll want to set up some scheduled tasks, or cronjobs. Today’s blog post aims to take you through the best way of achieving this, whether you’re running in single-instance mode or load balancing.
Scheduled tasks are typically run overnight when load on your application is low, and are used for all sorts of things: everything from system maintenance (like database tidy-ups) to more compute-intensive jobs that may take several minutes to complete.
At Rotaready, we have many scheduled tasks that run at various times of the day and night. If you’re a recipient of our weekly rota digest SMS (the one that tells you when you’re working each week) then you’ve been contacted by one of our scheduled tasks!
Without further ado, I’m going to dive into the technical details and take you through a couple of ways to do this, depending on how your Elastic Beanstalk environment is configured.
Scheduling tasks in a Single-instance Environment
In this mode, a single EC2 instance is provisioned for you and there’s no load balancing, regardless of how busy your app gets. It’s cheap and ideal for a development environment or for production apps that get low traffic. It’s likely you’ll start out like this and look towards a load-balancing, autoscaling environment when your app gets popular and you need to scale.
Seeing as there will only ever be one running instance of your app in this environment, setting up scheduled tasks is quite straightforward. In fact, there are two ways to do it:
With .ebextensions
- If it doesn’t already exist, create a folder at the root of your application called .ebextensions
- Create a config file inside that folder (here we’ll call it cron.config)
- Add the following text to the file.
Note that this is a YAML configuration file. You must preserve the spaces at the beginning of each line or it won’t parse. Tabs aren’t allowed either.
[code]
container_commands:
  01_remove_crontab:
    command: "crontab -r || exit 0"
  02_add_crontab:
    command: "cat .ebextensions/crontab | crontab"
[/code]
Then create a second file in your .ebextensions folder, this time called crontab
Add the following text to the file:
[code]
00 00 * * * /usr/local/bin/node /home/example/script.js
[/code]
So what’s happening in cron.config? As the Elastic Beanstalk documentation describes, commands are processed in alphabetical order by name, so we’re prefixing our command names with 01 and 02 to ensure they execute in this order. The first command wipes any pre-existing crontab (the "|| exit 0" stops the deployment failing when there isn’t one), and the second installs our crontab file on the machine.
And what about the crontab file? In this example, I’m using Node.js to execute a JavaScript file every night at midnight. The five fields before the command are minute, hour, day of month, month and day of week, and a * means ‘every’. Customise to your liking!
Note! You must have a blank line at the end of your crontab file or it won’t run.
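If the crontab syntax is new to you, the following is a minimal sketch that splits a crontab line into its five schedule fields and the command. The function name describeCrontabLine is made up for this illustration, and it does no validation of the fields.

```javascript
// Hypothetical helper for illustration: labels the parts of one crontab line.
// It performs no validation of ranges, lists or step values.
function describeCrontabLine(line) {
    var parts = line.trim().split(/\s+/);
    return {
        minute: parts[0],
        hour: parts[1],
        dayOfMonth: parts[2],
        month: parts[3],
        dayOfWeek: parts[4],
        // Everything after the fifth field is the command to run
        command: parts.slice(5).join(' ')
    };
}

var entry = describeCrontabLine('00 00 * * * /usr/local/bin/node /home/example/script.js');
console.log(entry);
```

Here minute and hour are both 00 and the remaining three fields are *, which is why the script runs at midnight every day.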
With a package (language-dependent)
If you’re using Node.js, try the brilliant node-schedule package. It allows you to do cron-style scheduled tasks in code, and there’s no need to mess about with any config like the previous example.
[code language="js"]
var schedule = require('node-schedule');

// Runs every night at midnight
schedule.scheduleJob('0 0 * * *', function () {
    // Do stuff here
});
[/code]
Scheduling tasks in a Load-balancing, Autoscaling Environment
This is where things get interesting. If you converted your single-instance environment to one that auto scales, Elastic Beanstalk will automatically spin up additional instances of your app in response to higher demand. This is good stuff and exactly what we want, but what effect will this have on our scheduled tasks?
Let’s say you have a task scheduled to run at midnight every night. Imagine that it’s nearly midnight right now and there’s sufficient demand on your app for there to be 3 running instances. Your task will be run three times, once by each instance! This isn’t ideal.
So how do we fix it?
If you use the .ebextensions example for single-instance environments (detailed above), there’s an extra property that can be added to the config file called leader_only. It has been suggested that adding this property and setting it to true will ensure only the designated ‘leader instance’ (in your auto scaling group) will run the commands, and therefore your scheduled tasks will only be run once. Even the official docs allude to this.
It turns out, however, that this doesn’t work as expected. An instance is only nominated as the leader at deployment, and this is the shortcoming: the leader can later be terminated, leaving you with nobody to run your tasks. This sounds silly, but let me explain:
Imagine you had one running instance (this is designated as the leader). Suddenly, load on your app spikes, Elastic Beanstalk spins up a second instance (not the leader), sweet. Only the leader will run your tasks. But now imagine load drops. Elastic Beanstalk realises you don’t need two running instances and terminates one. There’s a chance it’ll choose the first running instance to terminate (in fact I think it favours this choice), and not the second. And if that happens, you’re left without a leader.
So how do we fix that?
This is turning into a bit of a headache. But fear not, I have a solution. I ignore the .ebextensions config-style approach entirely, and deal with things in code instead:
[code language="js"]
var logger = require('../log'),
    async = require('async'),
    http = require('http'),
    AWS = require('aws-sdk');

AWS.config.update({region: 'eu-west-1'}); // change to your region
var elasticbeanstalk = new AWS.ElasticBeanstalk();

function runTaskOnMaster(name, taskToRun, callback) {
    logger.info('Beginning task: ' + name);

    async.waterfall([
        // Step 1: ask the instance metadata service for this machine's instance ID
        function (callback) {
            var options = {
                host: '169.254.169.254',
                port: 80,
                path: '/latest/meta-data/instance-id',
                method: 'GET'
            };

            var request = http.request(options, function (response) {
                response.setEncoding('utf8');
                var str = '';

                response.on('data', function (chunk) {
                    str += chunk;
                });

                response.on('end', function () {
                    callback(null, str);
                });
            });

            // Give up if the metadata service doesn't answer within 5 seconds
            request.on('socket', function (socket) {
                socket.setTimeout(5000);
                socket.on('timeout', function () {
                    request.abort();
                });
            });

            request.on('error', function (e) {
                callback(e);
            });

            request.end();
        },
        // Step 2: check whether this instance is the first one in the environment
        function (currentInstanceId, callback) {
            var params = {
                // Note! You'll need to set this env variable in AWS to the name of your environment
                EnvironmentName: process.env.AWS_ENV_NAME
            };

            elasticbeanstalk.describeEnvironmentResources(params, function (err, data) {
                if (err) return callback(err);

                if (currentInstanceId != data.EnvironmentResources.Instances[0].Id) {
                    // return here, otherwise the callback would be invoked twice
                    return callback(null, false);
                }

                callback(null, true);
            });
        },
        // Step 3: only the master runs the task
        function (isMaster, callback) {
            if (!isMaster) {
                logger.warn('Not running task as not master EB instance.');
                callback();
            } else {
                logger.info('Identified as master EB instance. Running task.');
                taskToRun(callback);
            }
        }
    ], function (err) {
        if (err) {
            logger.error('Error occurred during task.', err);
        } else {
            logger.info('Successfully finished task: ' + name);
        }

        callback();
    });
}
[/code]
This example is in Node.js but you could rewrite it in almost any language, as it uses the AWS SDK (which is available in most of the popular languages/platforms). I call the runTaskOnMaster() function on all my instances using node-schedule, but you could easily call it from crontab instead.
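To make the wiring a little more concrete, here is a rough sketch of how a task might be registered so that every instance fires on schedule but only the master does the work. The helper name scheduleLeaderTask and its parameters are my own invention for illustration. It assumes the scheduler object exposes scheduleJob(cronExpression, fn), as node-schedule does, and that runOnLeader has the same signature as runTaskOnMaster() above.

```javascript
// Hypothetical helper: registers a cron-style job that delegates to a
// leader-aware runner. Both dependencies are passed in as arguments:
//   scheduler   - anything exposing scheduleJob(cronExpression, fn), e.g. node-schedule
//   runOnLeader - a function with the signature (name, taskToRun, callback)
function scheduleLeaderTask(scheduler, runOnLeader, cronExpression, name, task) {
    return scheduler.scheduleJob(cronExpression, function () {
        runOnLeader(name, task, function () {
            // Nothing further to do here; the runner is expected to do its own logging
        });
    });
}

// Example usage (sendWeeklyDigest is a made-up task taking a callback):
// scheduleLeaderTask(require('node-schedule'), runTaskOnMaster,
//     '0 0 * * *', 'weekly-digest', sendWeeklyDigest);
```

Passing the scheduler and the runner in as arguments keeps the helper easy to try out with stand-ins before it goes anywhere near a real environment.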
So how does it work?
There’s a handy little Instance Metadata web service that runs on all EC2 instances. It’s available via an IP address (169.254.169.254) or a hostname (instance-data). I use it to get the Instance ID of the machine.
I then use the AWS SDK to ‘describe environment resources’ for a given Elastic Beanstalk environment (in my example I pass the environment name in as an environment variable, but you could hard code it). Amongst other things, this returns a list of all the running instances in the environment. We know there will always be at least one instance in this list, so I simply check if the instance that’s running this code has the same ID as the first instance in the list. If it does, I deem that to be the master/leader, and we run the scheduled tasks. Boom!
Note! You’ll need to grant the IAM role (that your instances run under) some extra permissions for this to work. The actions to add to a new/existing policy are:
- elasticbeanstalk:DescribeEnvironmentResources
- autoscaling:DescribeAutoScalingGroups
- autoscaling:DescribeAutoScalingInstances
- cloudformation:ListStackResources (…possibly optional!)
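As a starting point, the actions above could be expressed as a policy document along these lines. This is only a sketch: the Resource of "*" is my assumption for brevity, and you may well want to scope it down for your own account.

```javascript
// Sketch of an IAM policy document granting the actions listed above.
// Assumption: Resource is "*" for brevity; tighten it where you can.
var policy = {
    Version: '2012-10-17',
    Statement: [{
        Effect: 'Allow',
        Action: [
            'elasticbeanstalk:DescribeEnvironmentResources',
            'autoscaling:DescribeAutoScalingGroups',
            'autoscaling:DescribeAutoScalingInstances',
            'cloudformation:ListStackResources' // possibly optional
        ],
        Resource: '*'
    }]
};

// Print it as JSON, ready to paste into a policy editor
console.log(JSON.stringify(policy, null, 2));
```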
In summary
I’ve found this approach works great; only one instance ever runs my scheduled tasks, regardless of how many times my app has been scaled up and down and regardless of which instances were terminated.
As an improvement, we could sort the Instances list returned from the SDK. While I’ve always found it to have consistent ordering, this would be a sensible thing to do.
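A minimal sketch of that improvement might look like the following. The helper name isLeaderInstance is hypothetical, and the instance IDs are made up. It expects a list in the shape of data.EnvironmentResources.Instances and picks the leader from a sorted copy of the IDs, so the outcome no longer depends on the order the API happens to return.

```javascript
// Hypothetical helper: decides leadership from a sorted list of instance IDs.
// `instances` has the shape of data.EnvironmentResources.Instances: [{Id: 'i-...'}, ...]
function isLeaderInstance(currentInstanceId, instances) {
    if (!instances || instances.length === 0) {
        return false;
    }

    // map() returns a new array, so sorting it leaves the input untouched
    var ids = instances.map(function (instance) {
        return instance.Id;
    });
    ids.sort(); // default sort compares strings lexicographically

    return ids[0] === currentInstanceId;
}

var sample = [{Id: 'i-0bbb'}, {Id: 'i-0aaa'}, {Id: 'i-0ccc'}]; // made-up IDs
console.log(isLeaderInstance('i-0aaa', sample)); // true
console.log(isLeaderInstance('i-0bbb', sample)); // false
```

One trade-off to be aware of: with this rule the leader is whichever instance has the lowest ID, so leadership can move when a new instance launches, not just when one is terminated.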
Do let me know your thoughts, whether you’ve found any faults in this method or if you’ve found an even easier way to do it!