So why did thomascook.com go down?


A reasonable amount of speculation came into Travolution Towers yesterday after it emerged that the main online portal for thomascook.com was down.

We first heard of a problem mid-afternoon and the site was still out of action at around 5.30pm, when we spoke to the Thomas Cook press team.

It quickly emerged that those in the press office could not see what we and others could - a rather odd page from the domain firm Netnames.

[We sent them a screengrab]


[Image: netnames- thomascook.jpg]

So rather than the usual "web maintenance" banner, which normally appears on such occasions, many users would have thought the site didn't exist at all and Netnames was simply touting the fact that it had registered the domain name.

We were informed this morning that the site was only down for 20-30 minutes, which didn't exactly chime with what we and other people were seeing, despite being urged to refresh our browsers, etc.

The 20-30 minute figure seems a remarkably short space of time given that Google and Yahoo also happened to have crawled the Netnames holding page in the meantime and were displaying its content accordingly (as our grab from this morning indicates).

[Image: thomascooknetnamesyahoo.JPG]

Since then Thomas Cook has issued us the following statement:

"Following an IT related issue at our hosting area in Birstall we needed to switch our website onto a backup system. Due to the nature of the issue we were unable to follow a standard switch and reverted to forwarding all web traffic to our backup domain. This worked fine and the site was down for a short period of time, with bookings continuing to come in.

"The reversal of the domain forward (via NetNames) was then kicked off and we experienced some issues around the reverting of the IP address. As a result NetNames set up an Apache redirect at their end to redirect any TC.com traffic hitting their landing page. After a period of propagation everything was shifted back to normal."

At this stage (we have asked again), Thomas Cook will not tell us anything about the "IT related issue".

It seems, from the people we have spoken to, that this state of affairs is somewhat unusual.

But we are by no means experts in this particular field, so what do people think?


4 Comments

Thomas Cook's description of the problem seems reasonable. (I have technical experience with DNS (the Domain Name System) and backup systems for enterprise webservers.)

Though it is a reasonable explanation of the downtime, it looks like what they did was a last-ditch effort - they must have had multiple system failures, including the failure of their main backup or failover system.

The problem with relying on changing domain name settings to move to a backup server is that domain name data needs to be "propagated" (or copied) to each ISP. These settings take time to copy, so any change will slowly take effect across the internet. It's not an instant process, though if you're planning on doing this to redirect to a secondary server, you can reduce this time to 5 or 10 minutes. So you'd only have 5-10 minutes of downtime across the internet.

I took a quick look at thomascook.com and their settings are set to be "cached" (or stored) at ISPs for up to 6 hours before being rechecked, so even if they fixed the problem within 20 minutes on their end, there could have been incorrect settings for up to 6 hours throughout the internet.
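To put rough numbers on the caching point, here is a minimal sketch of the arithmetic. It assumes the simplest model, where an ISP's resolver keeps its stored copy until the cache time (TTL) runs out and only then rechecks. The function name and the constants are made up for illustration and are not taken from Thomas Cook's actual setup.

```python
def seconds_until_refresh(ttl: int, age_at_change: int) -> int:
    """Seconds after a DNS change that one resolver keeps serving its old copy.

    ttl           -- cache time on the record, in seconds
    age_at_change -- how long the resolver had already held the record
                     at the moment the change was made, in seconds
    """
    if not 0 <= age_at_change <= ttl:
        raise ValueError("age_at_change must be between 0 and ttl")
    return ttl - age_at_change


SIX_HOURS = 6 * 60 * 60    # the cache time observed on thomascook.com
FIVE_MINUTES = 5 * 60      # a lowered cache time, as suggested above

# Worst case: the resolver fetched the record just before the change.
print(seconds_until_refresh(SIX_HOURS, 0))      # 21600 seconds = 6 hours
print(seconds_until_refresh(FIVE_MINUTES, 0))   # 300 seconds = 5 minutes
```

Under this model a resolver that refreshed five hours before the change would pick up the new settings after one more hour, which is why different users can see different things for the whole cache period.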

If this "domain switch" was in their backup plan, they would have lowered that "cache" time in advance. That said, even though their backup system failed, someone was sharp enough to make the DNS change quickly, even if that resulted in a few hours of downtime for some users. It could have been worse if they didn't act that quickly.

It is possible that the press folk weren't seeing it because they are connected to the site via an internal link (LAN/WAN) and not via the public internet, or because they are accessing servers for internal use instead of the same ones that consumers use.

The method that they described about going to a backup domain seems quite strange to me and it suggests that they aren't running the type of load balancer that would let them shift traffic to different backend servers due to a planned or unplanned outage.

Without knowing any more detail, I tend to agree with Salim that they were using DNS to make the change - which is why a 'redirect' was required from NetNames back to the normal Thomas Cook domain. If they were using a load balancer per my previous paragraph, they'd have simply moved the traffic back to their primary servers once the outage was resolved.

With respect to Google updating its index with the NetNames page - that isn't that surprising at all. A popular site can often see hundreds of hits per day from various spiders - it takes only one from Google to hit the home page to have that effect. What is more problematic is that an official Thomas Cook maintenance page wasn't shown (which should return a 400 or 500 series HTTP response and cause the search engines to 'return later' instead of indexing the maintenance page).
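As a rough illustration of the maintenance page point, the sketch below uses Python's standard http.server module to answer every request with a 503 (Service Unavailable), one of the 500 series codes commonly used for this, plus a Retry-After header hinting when to come back. The page text, the 30 minute retry value and the port are assumptions for the example, not anything Thomas Cook runs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 1800  # assumed value: ask clients to retry in 30 minutes


def maintenance_response():
    """Build the status, headers and body for a temporary maintenance page."""
    body = b"<html><body><h1>Site temporarily unavailable</h1></body></html>"
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
        "Retry-After": str(RETRY_AFTER_SECONDS),
    }
    return 503, headers, body


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every GET with the maintenance page and a 503 status."""

    def do_GET(self):
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the maintenance server until interrupted (hypothetical address)."""
    HTTPServer((host, port), MaintenanceHandler).serve_forever()
```

Calling serve() starts the page locally. The point is the status code rather than the server: a page served with a normal 200 status gives a crawler no signal that the content is temporary.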

If Thomas Cook had intelligent load balancing software in place it is likely the company would have avoided the problems it encountered earlier this week.

It would appear that they have no automated failover system, and rely on manually manipulating DNS to achieve this. There are plenty of global load balancing solutions available, most will monitor and redirect clients when they detect problems. A few will also allow active-active sites and the directing of clients based on location, as well as load.
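The "monitor and redirect" idea can be sketched in a few lines. This is only an outline of the decision logic, assuming a priority-ordered list of sites and a simple HTTP health probe; the host names are hypothetical, and real global load balancing products do considerably more (location, load, active-active).

```python
import urllib.request
from typing import Callable, Optional, Sequence


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers connection failures, timeouts and HTTP error statuses.
        return False


def choose_backend(
    backends: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy backend in priority order, or None if all fail."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Hypothetical sites, primary first.
SITES = ["https://primary.example/health", "https://backup.example/health"]

# Simulated outage of the primary: traffic should go to the backup.
print(choose_backend(SITES, lambda url: "primary" not in url))
```

In live use the check passed in would be http_health_check, run on a schedule, so the switch to the backup site and back again happens without anyone editing DNS by hand.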

Software now offers the same level of load balancing capabilities as legacy hardware but at a fraction of the cost, making high availability across multiple locations within the reach of any organisation with an online business model.

With more people heading online to book holidays than ever before, travel firms should avoid missing out on sales by ensuring they have flexible solutions in place that keep their business going, even when IT systems are stretched or failing.

It looks like their problems aren't over yet. On Saturday I tried to make a payment which failed to complete with an error and now the 'Make a Payment' page is not available. Is anyone else having problems??

As to the observations about the outage, their arrangements certainly don't look appropriate. Systems redundancy and backup provisions should all be automated, including the network routing/DNS to the public Internet. I wouldn't like to be in their CIO's shoes right now.
