Moving rundeck from one server to another

The Rundeck people just released a new version, 1.5. Upgrading to this version is not as simple as usual (yum update or apt-get upgrade) because they’ve changed the database schema, which is why they recommend following the backup/recovery procedure to upgrade.

I’d been trying to move our rundeck service to another server with more resources for a while, but never found the time. This upgrade was the perfect excuse, so I moved it, and this post explains the steps.

First we locate the five parts we want to move:
– Rundeck configuration
– Rundeck user keys
– Project definitions
– Job definitions
– Execution logs

The rundeck configuration is in /etc/rundeck/. To find the project definitions we look in /etc/rundeck/project.properties for the project.dir value (the default is /var/rundeck/projects/). The path to each project’s SSH key is in the etc/project.properties file of each project directory, in the project.ssh-keypath value. The job definitions are in the database, and the path to the execution logs is in /etc/rundeck/framework.properties, in the framework.logs.dir value (usually /var/lib/rundeck/logs).
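A quick way to double-check these values is to grep them straight from the config files (paths here are the defaults named above; adjust if yours differ):

grep project.dir /etc/rundeck/project.properties
grep framework.logs.dir /etc/rundeck/framework.properties
grep project.ssh-keypath /var/rundeck/projects/*/etc/project.properties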

Once we’ve located everything, we can build “the package” we will move from server to server. We start with the text files (rundeck config, project definitions and execution logs):


mkdir rundeck-backup
cp -a /etc/rundeck/ rundeck-backup/
cp -a /var/rundeck/projects/ rundeck-backup
cp -a /var/lib/rundeck/logs/ rundeck-backup

To copy the projects’ SSH keys we check each project directory’s project.properties for the key path and copy the key file it points to. The projects may or may not share a key, and the keys may or may not have the same filename. That’s why we save each key inside its project directory:


for project in rundeck-backup/projects/*; do cp "$(grep project.ssh-keypath $project/etc/project.properties | cut -d'=' -f2)" $project; done

To extract the job definitions, we call rd-jobs list for each project, which exports them as XML:


for project in rundeck-backup/projects/*; do rd-jobs list -f rundeck-backup/$(basename $project).xml -p $(basename $project); done

And it’s also worth keeping the known_hosts file of the rundeck user:

cp "$(getent passwd rundeck | cut -d':' -f6)/.ssh/known_hosts" rundeck-backup

We now have a package with a full backup of our installation. Let’s send this rundeck-backup directory to the new server (I know, it’s obvious, but there you go :P):

scp -r rundeck-backup user@newserver:.

Now we ssh into the new server. We assume rundeck is already installed there (if not, we covered that in an older post), so we just need to put the files where they belong. First the keys:


for project in rundeck-backup/projects/*; do filename=$(grep project.ssh-keypath $project/etc/project.properties | cut -d'=' -f2); cp $project/$(basename $filename) $filename; done

Then the rest of the files:

cp -a rundeck-backup/rundeck/ /etc/
cp -a rundeck-backup/projects/ /var/rundeck/
cp -a rundeck-backup/logs/ /var/lib/rundeck/
cp rundeck-backup/known_hosts "$(getent passwd rundeck | cut -d':' -f6)/.ssh/known_hosts"

Now we have the rundeck configuration and the project definitions, but the jobs are still missing. We should keep in mind that the old server is still running, and we don’t want our jobs executed twice at the same time. We also don’t want to disable the old server until we are sure the new one runs fine, because we don’t want to miss any execution. To achieve both, we will make the new server fake the executions without really running anything, changing the service.NodeExecutor.default.provider value in the file /var/rundeck/projects/$PROJECT/etc/project.properties from jsch-ssh to stub. In a single line:

sed -i -e 's/jsch-ssh/stub/g' /var/rundeck/projects/*/etc/project.properties

Now we are sure no job will be executed until we say so, so we can import the jobs without risk:

for project in rundeck-backup/projects/*; do rd-jobs load -f rundeck-backup/$(basename $project).xml -p $(basename $project); done

With the jobs loaded we have everything we need. Now we can log in to the web interface and check that everything is fine: users can access their projects, jobs are correctly configured, etc. When we are sure, we can move one project at a time (or all of them at once, as you wish) just by changing the same value (service.NodeExecutor.default.provider): on the old server we change “jsch-ssh” to “stub”, and the other way around on the new server, from “stub” to “jsch-ssh”. Playing with those values we can be confident that, if we find any problem with some project, we can move that project (or all of them, just to be sure) back to the old server while we solve it.
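For instance, to switch a single project over (using a hypothetical project named “myproject”), it would be something like:

# on the new server: start executing myproject for real
sed -i -e 's/stub/jsch-ssh/' /var/rundeck/projects/myproject/etc/project.properties
# on the old server: stop executing myproject
sed -i -e 's/jsch-ssh/stub/' /var/rundeck/projects/myproject/etc/project.properties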

And that’s it! Now we could change the DNS to keep the old rundeck URL, but that’s your choice.

Varnish basic configuration with HTTP cache and stale-while-revalidate

In high-demand environments, we can reach a point where the number of PHP (or CGI) queries we want to serve through apache httpd is higher than our servers can handle. The simplest fix is adding more servers, lowering the load on each one (the queries are spread across more servers). But the simplest way isn’t necessarily efficient. Instead of distributing the load, can our servers handle more queries?

Of course. We can speed up PHP (or general CGI) processing with FastCGI. We can also make our http server faster, swapping it for a lighter one, nginx for instance. Or we can approach the problem from another perspective, which is what we will discuss here: maintaining a cache where we store content instead of processing it each time, saving CPU time and speeding up responses. We will do that using varnish.

Maintaining a cache is a delicate matter because you have to watch out for a lot of things. You shouldn’t cache a page if cookies are involved, for instance, or if the http query is a POST. But all of this is app-related: developers should be able to say what is safe to cache and what is not, and sysadmins should carry those decisions to the servers. So we will assume we start from scratch, with nothing in the varnish cache, and begin with a particular URL which we know implies no risk. That is what we will do here: cache just one URL.

For our tests we will use a simple PHP file. It takes 10 seconds to return its result, and it sends a header expiring the content after 5 seconds. We will name it sleep.php:
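A minimal sketch that matches the timings and headers used below could be:

<?php
// sleep.php: takes 10 seconds to answer, and the response expires after 5 seconds
header('Cache-control: max-age=5, must-revalidate');
sleep(10);
echo "done\n";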

If we query it, we can check it really does take 10 seconds to return:

$ curl http://localhost/sleep.php -w %{time_total}
10,001

The first thing to do is install varnish with our package manager (apt-get install varnish, yum install varnish, whatever). After that we want varnish listening on port 80 instead of apache, so we move apache to 8080 for instance (the “Listen” directive), and then varnish to 80 (the VARNISH_LISTEN_PORT variable, usually in /etc/default/varnish or /etc/sysconfig/varnish, depending on your distro). We also need to tell varnish which servers it has behind it to forward the queries to (backend servers). For that we create the /etc/varnish/default.vcl file with the following contents:


backend default {
    .host = "127.0.0.1";
    .port = "8080";
}
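
For reference, the port swap amounts to something like this (file paths assume a Debian/Ubuntu layout; yours may differ):

# /etc/apache2/ports.conf
Listen 8080

# /etc/default/varnish
VARNISH_LISTEN_PORT=80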

With all this we restart apache and varnish, and we check they are running:


$ curl http://localhost/sleep.php -IXGET
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Fri, 30 Nov 2012 13:56:33 GMT
X-Varnish: 1538615861
Age: 0
Via: 1.1 varnish
Connection: keep-alive

$ curl http://localhost:8080/sleep.php -IXGET
HTTP/1.1 200 OK
Date: Fri, 30 Nov 2012 13:56:59 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Length: 0
Content-Type: text/html

We can see different headers in each query. When we query varnish there are “Via: 1.1 varnish” and “Age: 0”, among others apache doesn’t show. If we see this, we have our baseline.

The default behaviour is to cache everything:

$ curl http://localhost/sleep.php -w %{time_total}
10,002
$ curl http://localhost/sleep.php -w %{time_total}
0,001

But we don’t want to cache everything, just one particular URL, avoiding caching cookies and the like. So we will change sub vcl_recv to not cache anything, adding this to /etc/varnish/default.vcl:

sub vcl_recv {
    return(pass);
}

We check it:

$ curl http://localhost/sleep.php -w %{time_total}
10,002
$ curl http://localhost/sleep.php -w %{time_total}
10,001

Now we cache just sleep.php, changing vcl_recv in default.vcl to:

sub vcl_recv {
    if (req.url == "/sleep.php") {
        return(lookup);
    } else {
        return(pass);
    }
}

We can check it:

$ cp /var/www/sleep.php /var/www/sleep2.php
$ curl http://localhost/sleep.php -w %{time_total}
10,002
$ curl http://localhost/sleep.php -w %{time_total}
0,001
$ curl http://localhost/sleep2.php -w %{time_total}
10,002
$ curl http://localhost/sleep2.php -w %{time_total}
10,001

We also check that the “Age:” header increases, and when it reaches 5 (the max-age we set), the query takes 10 seconds again:

$ curl http://localhost/sleep.php -IXGET -w %{time_total}
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Mon, 03 Dec 2012 10:53:54 GMT
X-Varnish: 500945303
Age: 0
Via: 1.1 varnish
Connection: keep-alive

10,002
$ curl http://localhost/sleep.php -IXGET -w %{time_total}
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Mon, 03 Dec 2012 10:53:56 GMT
X-Varnish: 500945305 500945303
Age: 2
Via: 1.1 varnish
Connection: keep-alive

0,001
$ curl http://localhost/sleep.php -IXGET -w %{time_total}
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Mon, 03 Dec 2012 10:53:59 GMT
X-Varnish: 500945309 500945303
Age: 5
Via: 1.1 varnish
Connection: keep-alive

0,001
$ curl http://localhost/sleep.php -IXGET -w %{time_total}
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Mon, 03 Dec 2012 10:54:09 GMT
X-Varnish: 500945310
Age: 0
Via: 1.1 varnish
Connection: keep-alive

10,002

We can see that when the content expires, varnish asks for it again and it takes 10 seconds. But what happens during this time? Do the rest of the queries have to wait too? No, they don’t. There is a 10-second grace period, and during this period varnish keeps serving the old (stale) content. We can check this by running two curls at the same time: one of them stalls while the other keeps getting content fast, with the “Age” header above the 5 seconds we assigned:

$ while :;do curl http://localhost/sleep.php -IXGET;sleep 1;done
(...)
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Mon, 03 Dec 2012 11:16:29 GMT
X-Varnish: 500952300 500952287
Age: 8
Via: 1.1 varnish
Connection: keep-alive

We can also check it with siege, with two concurrent users: for a while we will see just one of the threads, while the other is stopped, waiting for the content:

$ siege -t 30s -c 2 -d 1 localhost/sleep.php

If we think 10 seconds is too low a value, we can change it with the beresp.grace variable, in sub vcl_fetch in the default.vcl file. We can set one minute, for instance:

sub vcl_fetch {
    set beresp.grace = 60s;
}

What if the backend server is down? Will varnish keep serving stale content? Not as we have it right now: varnish has no way of knowing whether a backend server is healthy, so it considers all servers healthy. So, if the server is down and the content expires, it returns a 503 error:

$ sudo /etc/init.d/apache2 stop
[sudo] password:
* Stopping web server apache2 apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1 for ServerName
... waiting [OK]
$ sudo /etc/init.d/apache2 status
Apache2 is NOT running.
$ while :;do curl http://localhost/sleep.php -IXGET;sleep 1;done
(...)
HTTP/1.1 200 OK
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: PHP/5.3.10-1ubuntu3.4
Cache-control: max-age=5, must-revalidate
Vary: Accept-Encoding
Content-Type: text/html
Transfer-Encoding: chunked
Date: Fri, 30 Nov 2012 14:19:15 GMT
X-Varnish: 1538616905 1538616860
Age: 5
Via: 1.1 varnish
Connection: keep-alive

HTTP/1.1 503 Service Unavailable
Server: Varnish
Content-Type: text/html; charset=utf-8
Retry-After: 5
Content-Length: 419
Accept-Ranges: bytes
Date: Fri, 30 Nov 2012 14:19:15 GMT
X-Varnish: 1538616906
Age: 0
Via: 1.1 varnish
Connection: close

To make the grace period apply in this situation, we just need to tell varnish how it should check whether apache is up or down (healthy), by setting the “probe” directive in the backend:

backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = {
        .url = "/";
        .timeout = 100ms;
        .interval = 1s;
        .window = 10;
        .threshold = 8;
    }
}

This way varnish keeps serving stale content when the backend is down, and it will keep serving it until the backend comes back up and varnish can fetch fresh content again.
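If we want to see how varnish rates the backend at any given moment, varnish 3 has a debug.health command in varnishadm that prints the probe status:

varnishadm debug.health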

Testing with siege and curl, we can see there is always one thread that gets “screwed”. The first time varnish finds an expired content, it asks the backend for it and waits for the answer. Meanwhile, the rest of the threads get the stale content, but that one thread is stuck. The same thing happens when the server is down. There is a lot of literature about trying to avoid this, and you can read plenty about it, but the bottom line is: there is no way to avoid it. It just happens. One thread must be sacrificed.

So far we have covered two scenarios where we keep serving stale content:
– There is no backend server available, so we serve stale content.
– There are backends available, and a thread has asked for new content. While this content comes from the backend, varnish keeps serving stale content to the rest of the threads.

What if we want these two scenarios to have different timeouts? For instance, we could need the stale content to stop being served after a certain time (it could be minutes); after this time, we stop and wait for the backend answer, forcing the content to be fresh. But at the same time we could need to serve stale content while the servers are down (so there’s no way to get fresh content), because normally that’s better than serving a 503 error page. This can be configured in sub vcl_recv in the default.vcl file, this way:

sub vcl_recv {
    if (req.backend.healthy) {
        set req.grace = 30s;
    } else {
        set req.grace = 1h;
    }
}

sub vcl_fetch {
    set beresp.grace = 1h;
}

So our complete default.vcl file will have the following content:

$ cat /etc/varnish/default.vcl
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = {
        .url = "/";
        .timeout = 100ms;
        .interval = 1s;
        .window = 10;
        .threshold = 8;
    }
}

sub vcl_recv {
    if (req.backend.healthy) {
        set req.grace = 30s;
    } else {
        set req.grace = 1h;
    }
    if (req.url == "/sleep.php") {
        return(lookup);
    } else {
        return(pass);
    }
}

sub vcl_fetch {
    set beresp.grace = 1h;
}