Web server move officially begins 7/8

When NCIS’s T-1s went away a few months ago, we ended up with severely reduced bandwidth – 512Kbps shared with three other customers, as compared to the 3Mbps we previously enjoyed (and practically had to ourselves).

After many attempts to get moved onto a different connection here in Mora proved fruitless, I finally gave up a few weeks ago and called another ISP. Since then I have set up a new server in their facility, where it will be on a fully redundant 100Mbps connection that is only two hops from the Internet backbone (even better than the T-1s!). This will mean much better throughput, no interruptions in service, and greatly improved response times.

Pretty much every issue we’ve had in the past months (downtime, dropped connections, slow connections, SMTP issues, etc.) can be attributed to the poor connection we’ve been on. We used to have such issues only on the rarest of occasions. Our new ISP assures me that we will never have such issues with them… and from what I’ve seen, I tend to believe it.

The first sites to be moved will be Four Corners and the remaining assets of Kidsnet. Once those are done, it’ll be the Kanabec Systems sites and services, then 332 Media (including their all-important scripts that generate weather and news updates for KCIZ-AM/FM/TV). And once that’s done, we’ll start in on the paying customers – one at a time – until everyone has been moved. This will probably take several days to complete, so don’t worry if you don’t notice any differences in your hosting right away.

When putting together the new server, we were able to make some changes we’d long wanted to implement, as well as swap out some of the software involved. You may notice that the Cobalt management interface has been replaced by Virtualmin, a more modern package that serves the same purpose.

Most of the new server’s facilities will work the same as the old one’s – but if we notice that the differences might impact your site or operations, we’ll be sure to call you and work through any potential changes ahead of time.

Also, be aware that we will be making some DNS changes as part of this move, so there may be random brief periods of downtime while changes propagate. Unfortunately there is nothing we can do to prevent this – but be assured that everything we do have control over will be moved in the most graceful manner possible.

Again, thanks to all our existing customers for their patience, and we look forward to seeing your site over at the new server!

Doing the “Hurry Up & Wait”

Well, several weeks have passed since the NCIS/Onvoy incident unfolded. The original plan was to get our equipment off the temporary connection within a couple of days, and move it to something more permanent. But (as you probably guessed by this point), that has yet to happen. NCIS is still working on obtaining additional bandwidth for us – but it’s taking a lot longer than we hoped.

The good news is that we’ve got our temporary connection as groomed and tuned as it can be. Packet loss is at a minimum, and we’ve gotten a handle on the congestion situation. But the bad news is that it’s still only a 768K pipe… half of what we used to be on.

So we wait. The word is that we’ll be back at full capacity in another week or so, but it’s hard to say. Until then, keep on keepin’ on, and thanks to our hosting customers on raqpaq for being good sports throughout this process.

Outgoing mail bouncing?

Many customers have reported that some (but not all) of their outgoing mail is being rejected by the recipient’s site. This post addresses that problem.

Some mailservers, especially those at larger organizations, have spam filters that use a variety of tactics to spot potentially forged mail. One of these tactics is to compare the sending mailserver’s name to its reverse DNS entry. If the two do not match, it is assumed that the sending mailserver is lying about its name (it may have been set up for the purpose of sending forgeries, or to conceal a spammer’s identity and thus shield them from any fallout). In any case, the filter will reject mail under such circumstances.
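
If you’re curious whether a given server would pass this test, the check is easy to reproduce yourself. Here’s a minimal sketch in Python (standard library only); the hostname is just a placeholder – substitute whatever mailserver you want to examine:

```python
import socket

def forward_reverse_match(hostname):
    """Perform the same comparison many spam filters do: look up the
    host's address, then check that the reverse (PTR) entry for that
    address points back to the same name."""
    try:
        addr = socket.gethostbyname(hostname)            # forward lookup
        reverse_name = socket.gethostbyaddr(addr)[0]     # reverse lookup
    except socket.error as exc:
        print("lookup failed: %s" % exc)
        return False
    print("%s -> %s -> %s" % (hostname, addr, reverse_name))
    return reverse_name.lower().rstrip(".") == hostname.lower().rstrip(".")

# "mail.example.com" is a placeholder, not a real recommendation.
print(forward_reverse_match("mail.example.com"))
```

If this prints False for a server you send mail through, filters that use this tactic will likely bounce your messages.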

Since moving to its new address on the temporary connection (see earlier posts for details on what happened), our mailserver’s reverse DNS entry has not been updated. Though we are in the process of getting it changed, it will likely be a few days before anything takes effect. Until that time our mailserver’s forward and reverse records will not match… and so filters using the test outlined above will reject any mail sent through it.

If you are having problems with mail not being delivered due to this, you have two options: either wait for the changes to take effect, or use another outgoing mailserver.

Your ISP may operate an outgoing mailserver. If so, using theirs would probably be your best option. Some common ISPs’ outgoing mailservers include:

Qwest: pop.mpls.qwest.net
NorthStar: mail.izoom.net
NCIS: mail.ncis.com
YouBetNet: mail.youbetnet.com
Genesis: mail.genesiswireless.us
DirecWay: smtp.hughes.net
WildBlue: smtp.gmail.com (port 465, SSL required)
Frontier: smtp.frontier.com
CenturyTel: smtpauth.centurytel.net (authentication required)

Just reconfigure your mail client (Thunderbird, Outlook Express, Apple Mail, or what-have-you) as appropriate, and you should be good to go.
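
If you’d like to confirm that an alternate outgoing server is actually reachable from your connection before reconfiguring your client, here’s a rough Python sketch (standard library only). The host and port are placeholders – use the entry from the list above, keeping in mind that some of those servers require SSL or authentication that this simple connectivity check doesn’t attempt:

```python
import smtplib

HOST = "mail.example.com"   # placeholder - substitute your ISP's outgoing server
PORT = 25                   # some servers use 465 (SSL) or 587 instead

try:
    server = smtplib.SMTP(HOST, PORT, timeout=15)
    code, _ = server.ehlo()              # greet the server and read its response
    print("Connected to %s:%d (EHLO response code %d)" % (HOST, PORT, code))
    server.quit()
except (smtplib.SMTPException, OSError) as exc:
    print("Could not reach %s:%d - %s" % (HOST, PORT, exc))
```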

Failing that, we have permission to provide our customers in need with access to another friendly local provider’s outgoing mailserver. If you need details, call us – we’d be happy to walk you through it.

And, of course, we’ll post here once the reverse DNS situation is resolved.

UPDATE: 2/3/10 10:00a – The reverse DNS settings for NCIS’s outgoing mailserver seem to have taken effect sooner than ours. As such, we’ve changed our configuration – now any mail sent through our outgoing server will be relayed through their outgoing mailserver before continuing on to its destination. This should cure the problem until the reverse DNS situation is over and done with. If you have any issues, let us know.

Kidsnet and the Onvoy situation

Kidsnet has not been immune to our recent connection woes (see previous posts). Our server magilla.mnkids.net is still hosted in NCIS’s datacenter, and as such was cut off from the Internet on Friday along with our other equipment.

As you may know, Magilla is the Web, mail, primary DNS, and LDAP server for Kidsnet. We have been in the process of moving the LDAP service to Scooby (a Family Pathways-owned machine fed by Sherbtel lines), but the outage cut our efforts short.

When the outage began, our secondary nameserver (puck.nether.net) began taking on the network’s DNS load. But with the LDAP database still on the now-disconnected Magilla, there was no chance for users to carry on as normal (unless they already had a session open – and once they logged out, they were out for good).

We ended up doing two things to try and resolve the situation: moving the LDAP database to Scooby by physically going to Magilla’s console and copying it to disk, and giving Magilla an address on the temporary connection (see previous posts). But it was all in vain.

Turns out that our terminal servers and secondary nameserver are configured in such a way as to create gridlock in this situation.

Whenever a user tries to log in, the LDAP client attempts to connect to “ldap.mnkids.net” (an alias for Magilla) and authenticate. So every login attempt sent out a DNS query – which would be answered by our secondary nameserver, and would return the Onvoy-routed (and now unreachable) IP address of Magilla.
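
To put the gridlock in concrete terms, here’s a rough sketch (standard-library Python) of what every login attempt effectively boiled down to: resolve ldap.mnkids.net, then try to reach the LDAP port at whatever address comes back. With the secondary nameserver still handing out the old record, the second step just times out:

```python
import socket

LDAP_HOST = "ldap.mnkids.net"   # alias for Magilla
LDAP_PORT = 389                 # standard LDAP port

# Step 1: the DNS query triggered by each login. The answer came from our
# secondary nameserver, which was still serving the old Onvoy-routed address.
addr = socket.gethostbyname(LDAP_HOST)
print("%s resolves to %s" % (LDAP_HOST, addr))

# Step 2: the LDAP client tries to connect to that address. If the record is
# stale, this is a dead IP and the attempt simply hangs until it times out.
try:
    conn = socket.create_connection((addr, LDAP_PORT), timeout=5)
    print("connected - the record points at a live host")
    conn.close()
except socket.error as exc:
    print("connection failed (%s) - which is exactly what users were seeing" % exc)
```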

“No big deal”, you say, “just change the A-record on your secondary nameserver and you’re good to go.” If only it were that easy!

You see, our secondary nameservice is not provided by a machine we control. It’s a free service. And it only accepts updates from… wait for it… the primary nameserver. Which is Magilla. Which is disconnected.

“Okay, then,” you ask, “why not just SSH into the terminal servers and make an entry in /etc/hosts for ldap.mnkids.net, or put Magilla’s new IP into resolv.conf, or ldap.conf?” Because the terminal servers don’t allow root to log in via SSH (security, of course!), and there are no other users in /etc/passwd that can log in at all.

So, we’ve got ourselves a pickle. There were only two solutions to the problem:

1) Visit each site, log in as root, change the settings. For free. While dozens of other (paying) customers are having issues. Not gonna happen.

2) Change the A-records on Magilla, call up the registrar of mnkids.net and ask them to change the IP they have listed for Magilla, wait 24 hours for the changes to propagate, then wait another 12 or so hours for our secondary nameserver to notice and refresh (we can’t force it into a zone transfer – the admin doesn’t seem to allow it).

Needless to say, we had to take Choice #2. Nobody here’s happy about it, and I’m sure the children aren’t exactly thrilled either, but it’s all we can reasonably do right now – especially considering that Kidsnet is unofficially Not My Problem as of 12/22/2009 (it’s only official once Magilla is retired… and I was soooo close!).

The registrar was contacted this morning. Now we wait.
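
For anyone following along at home, the way to tell when the wait is over is to compare SOA serial numbers: once the secondary reports a serial at least as new as the primary’s, it has refreshed. Here’s a rough sketch, assuming the third-party dnspython library is installed (and assuming both servers are reachable at whatever addresses the lookups below return):

```python
import socket
import dns.resolver   # third-party "dnspython" package

ZONE = "mnkids.net"
PRIMARY = "magilla.mnkids.net"
SECONDARY = "puck.nether.net"

def soa_serial(nameserver_host, zone):
    """Ask one specific nameserver for the zone's SOA record and return its serial."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [socket.gethostbyname(nameserver_host)]
    return resolver.resolve(zone, "SOA")[0].serial

primary_serial = soa_serial(PRIMARY, ZONE)
secondary_serial = soa_serial(SECONDARY, ZONE)
print("primary serial:   %s" % primary_serial)
print("secondary serial: %s" % secondary_serial)
print("in sync" if secondary_serial >= primary_serial
      else "secondary has not refreshed yet")
```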

UPDATE 1: 2/3/10 9:00a – the changes have propagated to most of the Internet’s nameservers. About the only one that doesn’t seem to have noticed is Sherbtel’s (208.38.65.35). Unfortunately, most Kidsnet equipment is configured to use Magilla (at its old IP) and Sherbtel for their nameservers… so things won’t be back to normal until the changes are reflected there. Needless to say, we won’t be using Sherbtel’s nameserver for lookups anymore after this – its shortcomings have been perhaps the biggest hurdle in this entire situation (even bigger than getting the temporary connection for Magilla!). Google’s 8.8.8.8 will likely be substituted – once I can log in to the many hosts involved, that is. Kidtime is 6 hours away… and the clock is ticking.
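
If you want to see whether a particular nameserver has caught up, you can ask it directly and compare its answer against Google’s. A quick sketch using the third-party dnspython library (the two addresses are the nameservers mentioned above):

```python
import dns.exception
import dns.resolver   # third-party "dnspython" package

NAME = "ldap.mnkids.net"
NAMESERVERS = {
    "Sherbtel": "208.38.65.35",
    "Google":   "8.8.8.8",
}

for label, ns in NAMESERVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ns]
    try:
        answer = resolver.resolve(NAME, "A")
        addrs = ", ".join(rr.address for rr in answer)
        print("%s (%s) says %s is at %s" % (label, ns, NAME, addrs))
    except dns.exception.DNSException as exc:
        print("%s (%s) lookup failed: %s" % (label, ns, exc))
```

When the two answers match, the cache has caught up.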

UPDATE 2: 2/3/10 6:00p – our new DNS settings have (partially) propagated to Sherbtel’s nameserver… several records are still cached incorrectly, but fortunately ldap.mnkids.net is not among them. Logged into each and every Kidsnet host and changed the primary nameserver to 8.8.8.8. Things now appear to be working correctly. The other Kidsnet-related domains (warehouse214.org, stacyteencenter.com, etc) are not yet working, but that’ll be a project for tomorrow.

Onvoy-NCIS saga ends

Well, the Onvoy saga is over, but the fun is just beginning: it seems that what began as an outage has turned into a parting of ways between NCIS and Onvoy.

Contract negotiations between the two companies were going on before the outage occurred, with NCIS already on the verge of walking away and finding another bandwidth provider. But being without service for three days was the last straw, and so as of today NCIS is on the hunt for a new ISP.

[ Side note: This doesn't come as a surprise - or a disappointment - to me. Onvoy has changed hands several times over the years, and the latest incarnation seemed to be the lousiest yet. 'Smatterafact, the only time I can ever recall seeing Gary (owner of NCIS) mad was after dealing with an Onvoy tech a couple of years ago... and that says a lot. -KT ]

What does this mean to us? Well, it means that our servers will be on the temporary connection (mentioned earlier) for at least the rest of the week while NCIS gets a new provider. It also means that our next move isn’t as definite as before – we had planned on getting back onto NCIS’s Onvoy-fed connections over this coming weekend, but now we’re not 100% sure just what (or where!) we’ll be moving to. This will at least mean another long weekend of address-swapping and DNS nightmares, and might even mean that our servers will be physically moved to another building, or perhaps another town (we’re working on several possible plans).

But, what does all this mean to you? Probably not much – just that we’ll be down at various points throughout the coming weekend, and that there may be some new numbers to enter into your routers and other equipment yet again next week.

At any rate, rest assured that we don’t intend to let this put an end to our hosting or other services. It’s just another bump in the long road we’ve travelled – and will continue to travel – in our quest to bring you the best services we possibly can. Hang in there!

Onvoy outage

At around noon on Friday, January 29, 2010, we experienced an Internet outage that cut off access to our primary Web and mail server.

It seems that the problem was two hops removed from us – our systems were fine, and our ISP (NCIS) was fine, but their upstream provider (Onvoy) was not. This caused a trickle-down effect that cut off all of NCIS’s T1-fed customers… such as us. It also resulted in all of NCIS’s servers (mail, web, DNS, etc.) being disconnected from the Internet.

In the 16 years that NCIS and Onvoy have been doing business, a prolonged outage such as this has only occurred once. In our 10 years of working with NCIS, it has never happened – until now.

We installed a temporary connection over the weekend, provided to us by the good folks at YouBetNet. This connection is nowhere near as nice as the one we normally use, but it will keep things going until Onvoy gets their act together.

Mail sent to our customers on Friday afternoon and over the weekend started arriving on Sunday morning, and will continue to arrive throughout the day today as other mailservers throughout the Internet discover our new routing and attempt to re-send the undelivered mail.

Any customers experiencing connectivity issues at their homes or offices should confirm that they are not using a DNS server whose address is in the 206.146.216.x range – if you are, you will need to change the address to something else. We recommend 8.8.8.8 (the Google public nameserver) as an alternative.
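
For the technically inclined on a Unix-like machine, here’s a quick standard-library Python sketch that checks the resolver configuration in /etc/resolv.conf (Windows users would look at the DNS entries in their TCP/IP settings instead):

```python
# Flag any configured nameserver in the affected 206.146.216.x range.
AFFECTED_PREFIX = "206.146.216."

with open("/etc/resolv.conf") as conf:
    nameservers = [line.split()[1] for line in conf
                   if line.strip().startswith("nameserver")]

for ns in nameservers:
    if ns.startswith(AFFECTED_PREFIX):
        print("Affected nameserver found: %s - change it (8.8.8.8 works well)" % ns)
    else:
        print("Nameserver %s is fine" % ns)
```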

If you are having any hosting-related issues as a result of this situation, we encourage you to call us – that way we can figure out what’s going on and get it fixed.

We thank all our customers for their patience, and look forward to serving you.