The indie publisher moving to Azure, part 3: pain

posted by Jeff | Thursday, July 17, 2014, 11:16 PM | comments: 0

In the first post I did on the subject, I talked about the migration of my sites to Azure. In the second post, I talked about the daily operation of those sites. Now I want to talk a bit about the pain I've been enduring.

I'll admit, I'm a total Azure fanboy. I had a lot of success building stuff with its vast toolbox, hosting apps within worker and web roles, or cloud services, as they came to be known. As I mentioned in the other posts, the pricing just recently came to a point where the financials made sense to move off of dedicated commodity hardware and to a place where I didn't have to administer stuff. That said, the experience thus far hasn't been particularly good, and there has been down time. Whether you think that's my fault or Microsoft's is up to you.

Once I flipped PointBuzz over to v4.5 of .NET, as I mentioned previously, I thought I was in the clear. At the very least, that site behaved awesomely. Then I had a few instances where the sites would all go down at the same time, eventually returning 503 errors after more than two minutes per request. Sometimes just recycling the sites would fix the problem, but other times it did not. If that wasn't weird enough, I could scale up to a medium instance, then back down to a small, and everything would be awesome again, for days.

I observed a lot of strange things:

  • The down time seemed to happen in off-peak times, so it definitely wasn't me getting too much traffic.
  • The memory graph would show usage doubling when the down time happened, going from 40% to 80% on the Azure preview portal, and it would stay at 80% even after scaling up to an instance with twice the memory.
  • The config pages in the standard management portal would not load.
  • The SCM diagnostic pages that are in preview would often not load at all, and when they did, they couldn't complete a memory dump, essentially making them useless.
  • Scaling up and then back down worked for a while.

So what's the first thing you would think in these cases? Because Azure websites are this abstract thing, your first thought is that the configuration is totally screwed up. The fact that the portal couldn't load config settings, but not for every site, reinforced that. Also, the preview portal, which I understand isn't "done," has more broken things than functional things on it, but in this case only the panel for CoasterBuzz was affected.

I contacted support, which only covers billing if you're not paying for a support plan. Whatever, they eventually get you to someone who looks at the technical problem. I got a guy from India who worked overnight and told me that I was hitting my traffic limit for my free tier web sites. Considering I was on the standard tier, this was not a good start.

I admit it... I emailed someone higher up, who referred me to the product team, who in turn sent me to a support case worker, I think in Redmond. He knew his stuff, but was frustrated by the fact that we couldn't repro the problem, and that the diagnostic tools were failing whenever there was a problem. I was fixated on the configuration angle. It didn't seem like a great leap to think that if reading config was failing, the configuration was hosed.

The conclusion, however, was that I was simply hitting a memory ceiling. You can imagine how absurd that seemed, considering I used to run these sites on a box with 2 gigs of RAM that was also running SQL Server! I know from load testing that under significantly higher traffic the two main sites rarely exceed 500 MB of RAM combined (I also ran each site in its own app pool on the dedicated box, so I routinely saw the sites running at around 200 MB each). Then the support engineer showed me the breakdown. The first problem is that the staging sites were taking up a bunch of memory. OK, that's annoying, but it's legit. The second problem is that the diagnostic sites were also consuming a fair amount of memory, nearly 200 MB each. So think about that... I run into a problem, I start hitting those sites, and now I'm doubling my memory usage. What's worse is that you can't turn them off or stop them, so once they start going, you're kind of stuck. When I added QuiltLoop, that was the end of it.
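
To put some rough numbers on it (back-of-the-envelope, not anything official): a standard small instance gives you somewhere in the neighborhood of 1.75 GB, which squares with the medium having twice the memory. I don't have a hard number for the staging sites, so call them 150 MB each as a pure guess. For three sites, the tally looks something like this:

    3 production sites        x ~200 MB  =  ~600 MB
    3 diagnostic (SCM) sites  x ~200 MB  =  ~600 MB
    3 staging sites           x ~150 MB  =  ~450 MB   (a guess)
    ------------------------------------------------
                                            ~1.65 GB on an instance with ~1.75 GB

That's right up against the ceiling, even though the apps I actually care about account for only about a third of it.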

I can take responsibility for my apps using a lot of memory. But there are a few points that I leave squarely in Microsoft's court:

  • There is no aggregate view of how much memory the VM is using. The preview portal kind of has that, if you poke around and find the percentage under a box for the "hosting plan" unit, but that portal is more broken than functional.
  • There is no way to see what processes are running on the VM, so there's no way to tell at a glance if something is being a memory hog relative to everything else. (You can sort of poke at it yourself; see the sketch after this list.)
  • The diagnostic site app, which you wouldn't even know about unless you were directed there by support people, or you happened to catch a blog post about it, is going to suck your memory dry. Microsoft has to either let you turn it off or not have it count toward your memory quota.
  • Reading configuration data about the site shouldn't fail if the running app fails. That's architecturally weird. It's like your furnace dying and causing the thermostat to no longer show the indoor temperature.
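
For what it's worth, the SCM site that hosts those diagnostic pages also exposes a REST endpoint that lists the processes running in your site's sandbox, so you can look at the breakdown yourself instead of waiting on the portal. What follows is just a sketch, not gospel: it assumes the /api/processes endpoint behaves the way I describe, the site name and deployment credentials are placeholders, and it has the same catch as above, in that hitting the SCM site spins it up and burns some of the very memory you're trying to account for.

    using System;
    using System.Net;
    using System.Text;

    class KuduProcessList
    {
        static void Main()
        {
            // Placeholder site name and deployment credentials - swap in your own.
            var user = "deployment-user";
            var password = "deployment-password";

            using (var client = new WebClient())
            {
                client.Headers[HttpRequestHeader.Authorization] =
                    "Basic " + Convert.ToBase64String(
                        Encoding.ASCII.GetBytes(user + ":" + password));

                // The SCM site lives alongside the real site at *.scm.azurewebsites.net.
                // /api/processes returns a JSON array of the processes in the sandbox,
                // which is about as close as you can get from the outside to answering
                // "what is eating my memory?"
                var json = client.DownloadString(
                    "https://mysite.scm.azurewebsites.net/api/processes");

                Console.WriteLine(json);
            }
        }
    }

I'm not going to pretend that's a substitute for a real aggregate view in the portal, but it beats guessing.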

I'll scale up the VM to the medium size if the load merits the change, but right now I don't really know. Even the auto-scaling feature is tied to CPU usage triggers, and it spins up more instances, not a bigger VM. My stuff isn't written (yet) to go multi-instance, and it wouldn't matter if it did because memory usage tends to be fairly constant regardless of traffic.

Again, I'm critical because I'm a fan, and I want this stuff to work. I really believe in the platform, and despite these problems I think it's awesome. I was very close, however, to going back to dedicated hardware (or at the very least, a full virtual machine). That would have been a step backward.

