In the aftermath of Super Storm Sandy, this panel of CTOs from AppNexus, adMarketplace, Tapad, x+1 and Aerospike discussed issues and best practices in architecting and operating Real-time Big Data systems at ad:tech New York, 2012. The transcript of this event is broken out into four topics outlined below. This post is the second, with trailblazers reflecting on Super Storm Sandy’s impact on business and how they’ve managed to achieve 100% uptime:
- Impact of Real-time Big Data on the Business
- Super Storm Sandy and 100% Uptime
- CTO Secrets to Scaling Systems & the Scoop on Hadoop
- CTOs tips – Developing a Scalable system, Problem Solving at Scale
Mike Yudin: Okay. Well, thank you, Srini. You have to remind me about the most depressing week of my life.
Moderator/Srini Srinivasan: I’m sorry.
Mike Yudin: But it’s all good. We do 100% uptime. We lost one of our data centers in the flood, and it’s not just the data center itself that lost power; it’s the entire network infrastructure of the greater tri-state area, all the backbones. Your Verizons and Sprints of the world lost connectivity. And yet we stayed up and didn’t lose a bit of data. How did we do this?
We do this by having not only redundant equipment within the data center, but also globally load-balanced infrastructure across multiple locations. If one gets flooded, then traffic just gets shifted to the data center that survives. The trick here, of course, is to make sure that the surviving location has all the same data and all the same intelligence as the system that got destroyed.
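As a rough illustration of the failover idea described here, the sketch below routes a request to the closest data center that is still healthy. The `DataCenter` class, the `distance_ms` field and the site names are hypothetical, chosen only for the example; a real global load balancer would work at the DNS or network-routing level rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    healthy: bool = True
    distance_ms: float = 0.0  # hypothetical latency from the user to this site


def pick_data_center(centers):
    """Return the closest healthy data center, or raise if none survive."""
    alive = [dc for dc in centers if dc.healthy]
    if not alive:
        raise RuntimeError("no healthy data center available")
    return min(alive, key=lambda dc: dc.distance_ms)


centers = [
    DataCenter("new-york", distance_ms=5),
    DataCenter("chicago", distance_ms=25),
]
print(pick_data_center(centers).name)  # new-york
centers[0].healthy = False             # the New York site floods
print(pick_data_center(centers).name)  # chicago
```

The routing decision is the easy part. As the speaker notes, it only helps if the surviving site already holds the same data.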
There are several techniques for this, and one is cross-data center replication of data. This is one of the reasons why we chose Aerospike. They have this ability, so our data centers exchange data with each other through their XDR cross-data center replication process. That works quite well, and it’s fast. If a user is in Chicago, and they do a search for a new car in Chicago, and it hits the Chicago data center, in less than a second this information propagates to the New York data center.
If the same user then goes to another website and a disaster happens, and then his next request arrives at a different location, all the data is available. Of course you have to plan, and you have to have a disaster recovery plan in place, and then you have to have a plan in place for what you do after everything goes back to normal.
That’s what we’re dealing with today. And you also have to make sure you choose a nice and sunny location for your disaster recovery office. I spent this…
Pat: And then you get earthquakes…
Mike Yudin: …where there are no earthquakes. I could have gone to the south of France in the same amount of time it took me to get to Pittsburgh, Pennsylvania. So that’s my story.
[19:37] Srini/Moderator: Any of the other panelists want to add their thoughts to this?
[19:45] Mike Nolet: I think the one thing I’ll tell you we do is, first, redundancy and data replication. We were talking about this before the panel: you have to have multiple locations. But in advertising, multiple locations alone are not enough; you also have to understand, within each of the locations, how you’re connected to the Internet and how you’re connected to different partners, which goes to your point about network infrastructure. Many of the ISPs had major, major flooding inside their hubs.
In our facility, we were lucky enough not to lose power, so we stayed up. But we saw that half of our network providers lost power. We lost cross connects to Google, to Amazon…
Mike Yudin: Well, that’s what happened to us. We were up, but all of a sudden our traffic dropped 70% because no requests were coming through. So how good is that?
Mike Nolet: It’s true. 111 Eighth Avenue is one of the largest buildings in Manhattan, and it went down basically for eight hours. And this is actually where we meet up with Microsoft and Amazon, with Google and all these major companies. Our network team was actually out for seven days straight, playing Whack-a-mole, routing traffic, trying to make it work.
And then the one last thing that I’d say that we do that actually helps a lot with the data, we actually — well, we don’t use the Aerospike replication, we wrote our own replication layer, and what we do is we replicate incremental changes to all of our user data. One thing we track a lot is how many ads you’ve seen, right? And of course behavior. The problem is — let’s say I have your resume, right? And I have that in your key value store. As I change the resume both in LA and New York, how do you make sure that both copies of the resume get the exact same change?
One way is to send the whole resume across, but then you end up, actually, if you have conflicting changes, you can end up losing data in the middle. So what we always say is that we’re adding a line to the resume, at this location, and we might do that multiple times on multiple records on the same user. Then even if we lose connectivity, or something weird happens, whenever connectivity comes back we stream those messages back and forth to each other so that in the end those two copies end up being the same again.
I think that’s one technique that we’ve used very successfully to make sure that our user data in the end always ends up being 100% the same across all of our locations.
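The "add a line to the resume" approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not AppNexus's actual replication layer: the `Replica` class, the `(site, seq)` operation IDs and the `sync` helper are all hypothetical. Each site records small, uniquely identified change messages instead of overwriting the whole record, and when connectivity returns the sites exchange whatever messages the other side has not yet seen.

```python
class Replica:
    """One site's copy of the user data, stored as a log of small changes."""

    def __init__(self, site):
        self.site = site
        self.seq = 0
        self.ops = {}  # (site, seq) -> (user_id, line); IDs are globally unique

    def add_line(self, user_id, line):
        """Record an incremental change locally, e.g. 'saw ad 42'."""
        self.seq += 1
        self.ops[(self.site, self.seq)] = (user_id, line)

    def deltas_since(self, seen_ids):
        """Changes this site holds that the other site has not seen yet."""
        return {k: v for k, v in self.ops.items() if k not in seen_ids}

    def apply(self, deltas):
        """Merge remote changes; re-applying the same change is harmless."""
        for op_id, op in deltas.items():
            self.ops.setdefault(op_id, op)

    def record(self, user_id):
        """Rebuild the user's record in a deterministic (site, seq) order."""
        return [line for _, (uid, line) in sorted(self.ops.items())
                if uid == user_id]


def sync(a, b):
    """Stream missing changes in both directions once the link is back."""
    to_b = a.deltas_since(b.ops.keys())
    to_a = b.deltas_since(a.ops.keys())
    b.apply(to_b)
    a.apply(to_a)


la, ny = Replica("LA"), Replica("NY")
la.add_line("u1", "saw ad 42")          # change made in LA
ny.add_line("u1", "searched: new car")  # concurrent change made in New York
sync(la, ny)
print(la.record("u1") == ny.record("u1"))  # True
```

Because every change has its own ID and merging is a set union, neither site's update overwrites the other's, which is the data loss that shipping the whole record risks. Ordering by `(site, seq)` is a simplification; a real system would need to decide how concurrent changes should be ordered or combined.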
[21:55] Srini/Moderator: I’d like to add a follow-up, to Brian. For example, AppNexus has been our first customer, so our cross-data center support only showed up last year. So I’m asking Brian, is Aerospike going to solve these problems?
Brian: Yeah, but I wanted to share another customer’s story. Our first deployment of cross-data center replication was with a company many of you probably know, Exelate. Exelate has three different data pools, a US data pool, a global data pool, all of them replicating among four data centers. What happened to them was they lost one of their New York data centers as well. Everything backed up on the servers of the data centers feeding New York, which was fine; when the data center came back, it ended up re-replicating.
What they did say that was interesting was they actually had to call us, because when their New York office went down, they had IP-based security into their data center so they couldn’t get into their…we… they had abandoned their office. And at home, they couldn’t actually get into their data centers to do a graceful shutdown of the servers. So they had to call our support guys, and say, “Hey, Kavin, can you please take down these servers gracefully because we just got notice that there’s only 30 minutes of power left.”
We were happy to oblige and help them take down their servers gracefully. That’s the kind of thing we do. Only a few of our customers lost full data centers. As you say, connectivity was really the issue.
In terms of being able to support, at our layer, what we call ‘delta shipping’, which is shipping just the updates and is basically the technique Mike was talking about, we’ll be having that probably in the next six months or so. It’s an important technique, both for bandwidth reduction and for getting the correct data and not losing updates.
[21:57] Srini/Moderator: Okay. Mike, did you want to say something?
Dag: Cross-data center replication and redundancy are important, but of course you also need to have intra-data center redundancy, and this is where this product also does a really good job. If you lose a node, you have routing built into the clients and the servers, so they will route traffic to wherever the data is. If you add nodes, you don’t have to plan for it; you can just add nodes and they will automatically start shifting data from the other servers. Data centers do fail; fortunately they don’t fail that frequently. Servers, they fail pretty frequently, or they don’t even fail; someone just takes them down by mistake. I think that happens quite a bit as well. We’ve seen that happen a few times. Fortunately it hasn’t really affected us.
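One way to picture clients routing to "wherever the data is" while nodes come and go is rendezvous (highest-random-weight) hashing, sketched below. This is a generic technique chosen for illustration and is not presented as the product's actual partitioning scheme; the `owner` and `moved_keys` functions and the node names are hypothetical.

```python
import hashlib


def owner(key, nodes):
    """Rendezvous hashing: the node with the highest hash score owns the key."""
    def score(node):
        return hashlib.sha256(f"{node}:{key}".encode()).hexdigest()
    return max(nodes, key=score)


def moved_keys(keys, before, after):
    """Keys whose owner changes when the node list changes."""
    return [k for k in keys if owner(k, before) != owner(k, after)]


keys = [f"user{i}" for i in range(1000)]
cluster = ["node1", "node2", "node3"]

# Losing node3: only the keys node3 owned need a new home.
lost = moved_keys(keys, cluster, ["node1", "node2"])
print(all(owner(k, cluster) == "node3" for k in lost))  # True

# Adding node4: the only keys that move are the ones node4 takes over.
grown = cluster + ["node4"]
gained = moved_keys(keys, cluster, grown)
print(all(owner(k, grown) == "node4" for k in gained))  # True
```

Any client that knows the current node list can compute the owner on its own, so a lost or added server only changes where a fraction of the keys live rather than forcing a full reshuffle.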