CrapFlingingMonkey.com
Coding to the beat, yo

TAG | work

Dec/09

10

S3 At A Real-world Company

Let’s face it, most bigger companies nowadays are afraid of trying something new.  That happens with good reason: most new ideas tend to fall by the wayside, as trends normally do, and companies like to play it as safe as possible.  I see new ideas and frameworks popping up all over the Twittersphere every day, and I wouldn’t consider using any of them in a production environment.

Amazon Web Services Isn’t Just a Pie-In-The-Sky

The reason I bring this up is this: Amazon Web Services in the business (not startup) world is *still* considered a new, unproven technology.  And with all the marketing hype around clouds, infinitely scalable services, etc., I honestly don’t blame them.  It’s hard to believe a pie-in-the-sky promise.  That’s just the point: AWS is not pie in the sky, and people who think it is need to dig deeper and understand what it is and what it offers.  The fact is that Amazon Web Services has been around since 2002, and has uptime that is most likely better than your data center’s.  Amazon also knows about this perception and is trying to eliminate it by making the case that IT IS GOOD FOR YOUR COMPANY TO USE IT TOO.  They published this article, along with an updated cost calculator and an Excel spreadsheet to compare your datacenter with using AWS.

Backcountry.com and S3

Ok, so the real reason for this article.  At Backcountry.com, we try hard to stay as close as we can to the bleeding edge, but going into “the cloud” has always received serious backlash.  That is, until recently.  Earlier this month we took advantage of the cloud for the first time in a production environment: by using S3 for our “Jumbo” product images.

First, let me explain the reasons we decided to use S3.  Our webapp tier, consisting of a few boxes, hosts the Interchange e-commerce framework and also contains all our static content.  The trouble was that the 900×900 images consumed about 100 GB of disk space, but each box had less than 20 GB free.  That left us with two traditional options: put new hard disks in each webapp box, or use our NetApp to host the images from a single location.  Neither seemed ideal: new hard disks would be pricey and could take some time to install, and we were already short on NetApp space given the current budget.  I had done some side work using S3, and mentioned it.  Chris Alef championed the idea, and we agreed to do it.

Fast forward one week, and we were ready to go live.  We were able to convert and upload the 900×900 images to S3 over the weekend, and get the UI in place in no time flat.  We have Akamai’s edge cache in front of S3, and we have had zero problems since launch last month.  I asked our operations team what they thought the bill for the month would be, and they guessed $4,000.  The actual bill?  Under $50.  Granted, Akamai probably took most of the traffic, but that’s still mighty impressive.
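To get a feel for why an edge cache in front of S3 keeps the bill small, here is a rough back-of-the-envelope sketch in Python.  The function name, the rate parameters, and every number in the usage lines are placeholders I made up for illustration; they are not Amazon’s actual prices or our real traffic figures.  The only idea it tries to show is that the origin is billed for requests and transfer on cache misses only.

```python
def estimate_monthly_origin_cost(
    storage_gb,
    image_requests,
    avg_image_mb,
    cache_hit_ratio,
    storage_rate_per_gb,
    transfer_rate_per_gb,
    request_rate_per_10k,
):
    """Rough monthly cost of a storage origin sitting behind an edge cache.

    All rates are caller-supplied placeholders, not real AWS pricing.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")

    # Only cache misses reach the origin.
    origin_requests = image_requests * (1.0 - cache_hit_ratio)
    origin_transfer_gb = origin_requests * avg_image_mb / 1024.0

    storage_cost = storage_gb * storage_rate_per_gb
    transfer_cost = origin_transfer_gb * transfer_rate_per_gb
    request_cost = origin_requests / 10_000 * request_rate_per_10k
    return round(storage_cost + transfer_cost + request_cost, 2)


# Hypothetical rates and traffic, purely for comparison.
rates = dict(
    storage_rate_per_gb=0.10,
    transfer_rate_per_gb=0.10,
    request_rate_per_10k=0.01,
)
no_cache = estimate_monthly_origin_cost(100, 1_000_000, 0.5, 0.0, **rates)
with_cache = estimate_monthly_origin_cost(100, 1_000_000, 0.5, 0.95, **rates)
print(no_cache, with_cache)
```

With a high hit ratio the storage charge dominates, which is roughly the shape of what we saw: the images sit in S3, but most of the traffic never gets there.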

There’s so much more we can do with AWS, and I hope this is just the start.  I hope to be able to take advantage of other AWS services such as EC2 and SQS in the future, and I think S3 helped build confidence.  AWS is a service that can be relied on for both startups and established internet businesses alike.


Dec/09

9

YUI-Magnifier Released

A coworker of mine, Dustin McQuay, released the YUI Magnifier, a YUI implementation of other popular image zoom utilities.  We were actually surprised to see that nothing else like it already existed for YUI, so Dustin took on the challenge of building his own, in the hope that it might be included in other, larger YUI libraries.

It boasts the following features:

  • Display a magnified portion of an image, which is controlled by where the mouse is hovering over the image
  • Control over styling
  • Control over location of magnification lens
  • Magnified image can be wrapped by a larger element

Though the release wasn’t very public, it was still quite an accomplishment.  It happens to be one of the first open-source releases from Backcountry.com (preceded, to my knowledge, only by Bucardo, a Postgres replication application written for Backcountry.com by Endpoint).  It was originally designed to be used for our 900×900 images, but got cut after development had essentially finished, due to changed requirements.

It’s a pretty solid application, and hopefully the start of more open source coming out of Backcountry.com.


If you work for any website that receives a lot of traffic, you know how aggravating it can be when you get woken up in the middle of the night because the website is down or payments aren’t getting processed. People call this many things (Seg-1, P5, it doesn’t matter), but it all means “the shit has hit the fan”. Working at Backcountry.com, I’ve seen my fair share of these. When you have a group of 5 or more people trying to work on the same problem, chaos can ensue. People will duplicate each other’s work, step on each other’s toes, change something without letting others know, or withhold vital information for the sake of being the “rockstar” who fixes the problem. The ultimate result is that the company loses money, and the downtime is an embarrassment.

Emergency Response

The first practice I recommend, and the one thing I hope you keep from this article, is this: keep track of what you do. Track everything. Track changes you make. Track decisions you made. Track data you have gathered, no matter how irrelevant it seems. Recently at Backcountry, we were troubleshooting a problem that involved nearly all aspects of our architecture: high load on databases, high load on webapps, traffic that stayed the same, an increase in 500 errors, people losing sessions, inconsistent traffic through the load balancer… but we couldn’t pinpoint the problem. Searching through the logs, there were no errors, only timeouts. There were no query locks in the database. We kept a record of all that information and tried to make correlations. We eventually came to a solution by putting the pieces (that is, what we had tracked) together until they made sense. Then everything else fell into place. The end result was that, coincidentally, there were 2 major problems at once: Varnish was passing through a 500 error for what happened to be an RSS feed (i.e. high traffic), and the session databases were intermittently refusing connections (for various reasons). If we hadn’t recorded all the data, we couldn’t have made the connections.
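If a shared doc feels too loose, the same habit can be sketched as a tiny timestamped log.  This is only a minimal illustration in Python, not the tool we used: the class names, the three entry kinds, and the timestamps in the usage lines are all assumptions of mine.  The point is that every observation, change, and decision gets a time and an author, so you can pull up everything that happened around a given moment and look for correlations.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class LogEntry:
    timestamp: datetime
    author: str
    kind: str  # "observation", "change", or "decision" (hypothetical labels)
    note: str


@dataclass
class IncidentLog:
    entries: list = field(default_factory=list)

    def record(self, author, kind, note, timestamp=None):
        """Append one entry; defaults to the current UTC time."""
        entry = LogEntry(timestamp or datetime.now(timezone.utc), author, kind, note)
        self.entries.append(entry)
        return entry

    def near(self, moment, minutes=10):
        """Return entries within +/- `minutes` of `moment`, oldest first."""
        window = timedelta(minutes=minutes)
        hits = [e for e in self.entries if abs(e.timestamp - moment) <= window]
        return sorted(hits, key=lambda e: e.timestamp)


# Usage with made-up timestamps and authors.
t0 = datetime(2009, 12, 1, 2, 0, tzinfo=timezone.utc)
log = IncidentLog()
log.record("tech1", "observation", "500 errors increased", t0)
log.record("tech2", "observation", "session dbs refusing connections",
           t0 + timedelta(minutes=5))
log.record("lead", "decision", "restart webapps one at a time",
           t0 + timedelta(minutes=60))
for entry in log.near(t0):
    print(entry.timestamp.isoformat(), entry.author, entry.kind, entry.note)
```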

The second practice I would recommend is to elect a “Call Leader” (or Call Lead) when responding to an emergency. This coordinator has a few roles: communicate with business owners periodically, keep track of what tasks people are working on, and recommend, or in some cases dictate, what actions are going to be taken.  A side effect is that communication patterns within the team become explicit: techs looking into the problem know they need to communicate with the Call Lead, and the Call Lead needs to work with the techs.  This leaves the rest of the team free to concentrate on the problem at hand, and only on their specific silo.  An example conversation among the team might go like this:

Tech 1: “I’m seeing some weird stuff in our Apache error logs, something about an error with connecting to session dbs.  I’d like to take a look at it.”

Call Lead: “Ok, go ahead.”

That’s all you need to communicate effectively.  But you’d be surprised at how many companies and teams don’t do this.  Having a Call Lead helps ease this transition.

Let me move on to another subject: the stages of an emergency.  One thing I’ve seen a lot of teams do is circle around a problem, jumping from one observation to the next, without ever remediating anything.  I’ve attempted to lay out these stages so you know where you are in solving the problem and where you need to go next to get it fixed.  The five stages are: Reaction/Response, Collection/State What You Know, Discovery, Remediation, and Verification.
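One way to picture the five stages is as a simple ordered sequence where you always know what comes next.  The Python sketch below is just an illustration; the names are mine, and the rule that a failed verification sends you back to Collection (rather than to some other stage) is my own assumption, not a hard rule.

```python
from enum import Enum


class Stage(Enum):
    REACTION = "Reaction/Response"
    COLLECTION = "Collection/State What You Know"
    DISCOVERY = "Discovery"
    REMEDIATION = "Remediation"
    VERIFICATION = "Verification"


def next_stage(current, fix_verified=True):
    """Return the stage that follows `current`.

    Assumption: if verification fails, go back to Collection instead of
    circling. Returns None once the fix has been verified.
    """
    if current is Stage.VERIFICATION:
        return None if fix_verified else Stage.COLLECTION
    order = list(Stage)  # Enum members iterate in definition order
    return order[order.index(current) + 1]


# Walk the happy path from the first stage to the end.
stage = Stage.REACTION
while stage is not None:
    print(stage.value)
    stage = next_stage(stage)
```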

Reaction/Response

  • Once you hear about the problem, whether from Nagios or the guy sitting next to you, make sure the problem is documented in whatever way you document these things (Bugzilla, Jira, Google Doc, whatever).
  • Dial into a phone conference, or join a chatroom, or do whatever you need to communicate with the other team members.
  • Validate there is actually a problem.  You could waste expensive, valuable time by assuming the problem is larger than it actually is.
  • Communicate outwardly that you are taking care of the problem.
  • Get ahold of ANYONE who should be there.  Don’t be afraid to call the CEO of the company if you need to; the fact is that if the problem lasts more than X amount of time, the company will go under.
  • Elect a Call Leader (discussed above).

This should take a maximum of about 5 minutes.

Collection/State What You Know (SWYK)

  • Have everyone on the call state what they know, documenting thoroughly.  Make sure you state what YOU know.
  • Set a schedule or a plan of attack.

Discovery

  • Call leader makes assignments (dependent on what people say, of course)
  • Get a list of options/suggestions from techs working on the issue.
  • Weigh options, and avoid “Analysis Paralysis”

Should take 5-30 minutes, sometimes more, sometimes less.

Remediation

  • Call leader decides on an option, and you execute it.

Some notes about this one:

You can easily get yourself into a bind by making changes too rashly, and not thinking about the consequences.  I try to use the following principles:

  • Rollback should always be an option.  Too many people are afraid to remove new functionality because of pride or whatever reason.  Really, most often the best solution is to just roll back.
  • When changing live-site behavior immediately, try to do it in a rolling fashion.  Restart servers one at a time when in a clustered environment.  When code changes are necessary, roll them to one server if possible to verify changes fix the problem.
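The rolling approach described in the principles above can be sketched in a few lines of Python.  This is a minimal illustration under stated assumptions: `restart` and `is_healthy` are hypothetical callables you would supply from your own tooling, and the server names are made up.  The loop touches one server at a time and stops at the first one that does not come back healthy, so the rest of the cluster keeps serving traffic.

```python
def rolling_restart(servers, restart, is_healthy):
    """Restart servers one at a time, stopping at the first unhealthy one.

    `restart` and `is_healthy` are caller-supplied callables (hypothetical
    here); each takes a server name. Returns (restarted, failed), where
    `failed` is the server that did not come back healthy, or None.
    """
    restarted = []
    for server in servers:
        restart(server)
        if not is_healthy(server):
            # Stop here so the untouched servers keep serving traffic.
            return restarted, server
        restarted.append(server)
    return restarted, None


# Usage with stand-in callables; a real version would call your own tooling.
restart_log = []
unhealthy = {"web3"}
done, failed = rolling_restart(
    ["web1", "web2", "web3", "web4"],
    restart=restart_log.append,
    is_healthy=lambda name: name not in unhealthy,
)
print(done, failed)
```

The same shape works for code changes: roll to the first server, verify, and only then continue.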

Verification

This is another often overlooked step in the process.  This is when the business verifies that the problem is fixed, or that there is no longer any customer impact.

There’s much more, so much that I could write a book on the subject, but I hope this is enough information to be helpful. I may dive deeper into the different roles and practices in another post, so keep checking back.  As always, I would love to hear feedback on the subject.
