Mistakes Were Made, Lessons Were Learned
Mark Nunnikhoven, Vice President, Cloud Research at Trend Micro
Mark Nunnikhoven [16:09]
So the team, having broken this apart, having shifted the video delivery through CloudFront, is of course asking themselves, well, why is the chat app so big and so heavy? Well, it turns out, it doesn’t need to be. There’s in fact an example project on the AWS blog that shows the exact solution the team started to lean towards. And that was a serverless-based solution with AWS Lambda, AWS AppSync, and an Amazon DynamoDB table in the back end, because it was a much better fit for an online real-time chat. So moving this to an entirely serverless design was a great evolution, because again, they’re lowering their operations and they’re aligning their on-demand use of resources. And then they’re exploring yet another evolution here: they’re actually looking at removing CloudFront, because it’s too much of a burden. They kind of got hooked on the whole serverless thing, which is a great thing. They got hooked on the whole, what else can we move up the stack to a better, lower-touch service for us. And they’re investigating Amazon Interactive Video Service (IVS), which is a relatively new service. Essentially, it hides the CloudFront and S3 bucket usage in the background. You create an endpoint in the service, you push your video to that endpoint, and then you use the service’s SDK to integrate it into your application, whether it’s a web browser or a mobile app or what have you. So the team is in the phase of exploring this now, to roll it out for their customers. And this goes to the core principle of automate all the things: if you’ve got things stable, if you’re delivering good value to your customer, is there something you can do to make your life easier from an operational perspective? Customers won’t really see a big difference with Amazon Interactive Video Service, but the team will, because it’s one less set of things they have to worry about. They’ll have a highly managed service where they can just push video through, and it scales out for them at a great rate and streamlines their back-end experience. So that ties in with automating it, because even setting up the distributions, worrying about the cache, worrying about the S3 bucket, that kind of stuff, if you don’t have to do it, don’t do it. And I mean, CloudFront was a much better solution than the instances, but this is an even better solution: going to a fully managed, fit-for-purpose service.
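For a sense of what that endpoint-creation flow looks like in practice, here is a minimal sketch using boto3. The channel name, region, and settings are illustrative assumptions, not details from the team's actual setup.

```python
# A minimal sketch, assuming boto3 is configured with credentials that can
# manage IVS. The channel name and region are hypothetical.
import boto3

ivs = boto3.client("ivs", region_name="us-west-2")

# Create a channel; IVS provisions the ingest endpoint and a playback URL,
# hiding the CloudFront/S3 plumbing the team used to manage themselves.
response = ivs.create_channel(
    name="live-session-channel",  # hypothetical channel name
    latencyMode="LOW",            # low-latency mode for live interaction
    type="STANDARD",              # standard transcoding channel
)

channel = response["channel"]
stream_key = response["streamKey"]

# Push video to the ingest endpoint from the encoder...
print("Ingest endpoint:", channel["ingestEndpoint"])
# ...and hand the playback URL to the IVS player SDK in the client app.
print("Playback URL:", channel["playbackUrl"])
print("Stream key:", stream_key["value"])
```

From there, the encoder pushes to the ingest endpoint, and the web or mobile client plays the stream back through the IVS player SDK, with no distribution or bucket for the team to manage.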
Mark Nunnikhoven [18:18]
So we look at overall where we are with the solution. Operations is through the roof, and cost optimization, everything we’ve done is better. If we look at where we started, we’ve made significant improvements. It’s great work by the team. They’ve gone through multiple rounds based on the facts they’re seeing, and each round is using those feedback loops. They’re evolving the architecture through lessons learned, a really great fit for them. And, you know, kudos to the team for making these improvements and moving forward. So let’s use another example here. This one was a team that I spoke to about their legacy system’s data storage. They had taken an application that was designed for an on-premises environment, they were serving it out of their own data centers, and they were migrating it into the cloud. So they forklifted it into the AWS cloud, and their experience starts from there. Again, we’ve removed some of the details just so that it’s not obvious who we’re talking about, but there are some great lessons in this one.
Mark Nunnikhoven [19:12]
And we’re going to look at this from the perspective of a number of devices that are sitting out in labs around the world. Each of these is sending live data back; in this example, we’re just showing weather data, because it’s something easy to relate to. There are a lot of different metrics coming in pretty much constantly. It’s a real-time system that sends them to a central server system in order to analyze the data and cross-reference it. So there’s a need to pull in real-time events, and then there’s a secondary need for analysis across all of those events. This is a data-heavy application, and the good news is all of the processing is done centrally. Now, you may be screaming in your head that this is going to be an IoT solution, and it would be, for part of it. But we’re going to focus on the data storage piece, because I think this is where a lot of people are sitting, especially in large enterprises: you have something that was built, that is working, that is solving your business needs, and in this particular case was making a lot of money. But when they were moving into the AWS cloud, they went along this journey that I think we’ve all been on, and so we want to highlight that. So if we look at the architecture when they forklifted what they had into the AWS cloud: the devices were talking to an Elastic Load Balancer, an ELB; that went to an Amazon EC2 instance pool in an Auto Scaling group that was running their custom app; and in the back end, it was running an Amazon RDS Oracle instance, or set of instances. And of course, they were monitoring it all with CloudWatch. This solution, and this is really, really important to call out, is a forklift of what was on premises, and it worked great. It was highly reliable, it fit the purpose, and they had enough income to justify the high Oracle licensing costs. They were an Oracle shop already. So, you know, like we said, no judgments, no judgments; they’re making a ton of bank off this.
Mark Nunnikhoven [20:59]
So you can probably figure out where we’re going with this one as well. The data collection and analysis all being in an instance pool may cause some challenges here, right? But the important thing to think of is that Amazon RDS Oracle instance, which is where our first example is going to pop up. So here, we’re looking at a chart again, normalized to make it easier for you. On the y-axis is the database size, the amount of data we’re storing in the database, in terabytes, and across the x-axis is our event volume in millions. So when we’re looking at around one to five million events, we don’t have that much data being stored. When we hit 10 million, we’re starting to creep toward maybe one terabyte. When we get to 100 million events, we’re at about eight or nine terabytes. And we quickly spike north of 65 to 70 terabytes when we’re hitting 1,000 million events. So the challenge, based on the facts we’re seeing, is that as these events spike up, our database costs are going to increase massively; they’re going to follow this chart, because storing data in a traditional RDBMS is expensive. But again, this application was designed on premises, and it was working fine. The challenge specifically is that this team was seeing about eight or nine terabytes of data coming in every single day, and that was only going up as they became more successful. So why is that a problem?
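To make the shape of that chart concrete, here is a small sketch in Python using the approximate figures from the talk; the exact values are illustrative, since the chart itself was normalized.

```python
# A rough model of the chart described above, using the approximate,
# normalized figures from the talk (event volume in millions vs. database
# size in terabytes). Illustrative numbers, not the team's real data.
observations = {
    5: 0.5,      # ~1-5 million events: well under a terabyte
    10: 1.0,     # ~10 million events: creeping toward one terabyte
    100: 8.5,    # ~100 million events: about eight or nine terabytes
    1000: 67.5,  # ~1,000 million events: north of 65-70 terabytes
}

daily_ingest_tb = 8.0  # roughly eight terabytes arriving per day

for events_millions, size_tb in observations.items():
    print(f"{events_millions:>5}M events -> ~{size_tb} TB stored")

# At that ingest rate, storage grows by a full node's worth regularly:
print(f"Days to fill a 64 TB storage node: {64 / daily_ingest_tb:.0f}")
```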
Mark Nunnikhoven [22:31]
Well, because it’s all being stored in the database, they were spending a lot of money. So if they’re seeing about eight terabytes a day, one of the maxed-out 64-terabyte storage nodes, even with the savings from a three-year reservation, was costing them $15,500 a month, after a $32,500 reservation over the three years. The thing to remember, again, is that they’re filling one of these every eight days. So every eight days, they’re spinning up another one of these nodes that’s costing them $15,500 a month. Again, they’re making enough money that it’s not huge, but it’s an area of concern. So right now, if we look at the solution status across the pillars of the framework: operationally, it’s fine; it’s a little burdensome, but that’s okay. Cost isn’t really optimized, but it’s not that bad, because they’re making money. Rock-solid reliability. Not great on performance, because of those Oracle instances in the back end. And security is okay.
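As a quick back-of-the-envelope check on why that is an area of concern, here is the arithmetic using only the figures mentioned above:

```python
# Figures from the talk: each maxed-out 64 TB RDS Oracle storage node cost
# $15,500/month after a $32,500 up-front three-year reservation, and a new
# node filled roughly every eight days.
monthly_per_node = 15_500        # USD per month, per 64 TB node
reservation_per_node = 32_500    # USD up front, three-year reservation
days_per_new_node = 8            # one node fills roughly every eight days

nodes_added_per_month = 30 / days_per_new_node  # ~3.75 new nodes a month

# Every new node adds its reservation cost plus a recurring monthly charge,
# so the bill doesn't just grow, it compounds as nodes accumulate.
first_month_new_spend = nodes_added_per_month * (
    reservation_per_node + monthly_per_node
)
print(f"New nodes per month: {nodes_added_per_month:.2f}")
print(f"Added spend in the first month alone: ${first_month_new_spend:,.0f}")
```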
Mark Nunnikhoven [23:30]
But of course, the team’s looking at that going, that’s a big bill. So they start there: they switch that Oracle instance over to Amazon Aurora on RDS. This requires a little bit of custom code change in their app, but not much; it doesn’t fundamentally change how they’re handling data. When they switch to an Amazon Aurora instance at a similar performance and storage tier, that drops down to $8,000 a month and a $5,000 reservation. So they actually end up saving 49% through this code change, which is an absolutely massive win for the team. And they were ecstatic about that, because it was minimal effort on their part and a huge savings on the back end. So this readjusts our pillars, where cost optimization is now rising steadily, because we’ve made one move to save 49% on our data storage costs. But of course, they’re not stopping there; there’s a lot more to do. They realize that storing all the data in a relational database isn’t really the way to go; let’s start taking advantage of Amazon S3. So instead of filling up the database every eight days, now we can turn around and go, okay, we can keep some stuff live in the database, but we can start pushing more and more events out to S3, and that’s going to cut our costs down significantly. Storing a similar amount in S3 was costing them around $1,000 a month. So it’s significantly cheaper and way better for scalability, and that was a great evolution in their architecture; they’re taking advantage of the cloud. This is a very common pattern: we see people forklift what they have, because it was working, and move it as is. Great step one, you should absolutely do that. But according to the Well-Architected Framework, we should continue to evolve, and as we’re evolving, we’re taking advantage of other powers that are available in the cloud. So here, we’ve cut down costs by switching to Aurora, and we’re cutting down costs even more by moving the data that’s not required to be available instantly out through their custom code, making it available in a different manner. In fact, they realize, wait a minute, we don’t even really need the relational database in the way it exists today. If we’re willing to rewrite our data architecture and our code, we can start to leverage DynamoDB. Now, is that going to be cheaper than Aurora? Not necessarily; it ended up being less expensive for them, but more importantly it was more performant, and they got a lot more flexibility. They were able to decrease the response time for requests in the back end, which was great. And this is all just changing their custom code, again evolving that architecture to see those results, whether on the back end for the team or the front end for their customers.
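Here is a minimal sketch of that tiering pattern in boto3. The table name, bucket name, key schema, and batching scheme are all assumptions for illustration, not the team's actual design.

```python
# A minimal sketch of the tiering pattern described above, with hypothetical
# resource names: recent "hot" events land in a DynamoDB table keyed by
# device and timestamp, while older events are batched out to S3, where
# storage is dramatically cheaper than in a relational database.
import json
import time

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

hot_table = dynamodb.Table("device-events")  # hypothetical table name
archive_bucket = "device-events-archive"     # hypothetical bucket name


def store_event(device_id: str, payload: dict) -> None:
    """Write a live event where it can be queried in real time."""
    hot_table.put_item(
        Item={
            "device_id": device_id,         # partition key
            "ts": int(time.time() * 1000),  # sort key, millisecond epoch
            "payload": json.dumps(payload),
        }
    )


def archive_events(day: str, events: list[dict]) -> None:
    """Push a day's worth of older events to S3 as a single object."""
    s3.put_object(
        Bucket=archive_bucket,
        Key=f"events/{day}.json",
        Body=json.dumps(events).encode("utf-8"),
    )
```

The design point is the split itself: DynamoDB answers the real-time queries, while S3 absorbs the bulk of the storage at object-storage prices.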