How to Measure Latency in the Cloud

When it comes to measuring application performance across our local enterprise network, we think we know what network latency is and how to calculate it. But when we move these apps into the cloud, there are a lot of subtleties that can affect latency in ways we don’t immediately realize. Let’s examine what latency means for deploying cloud applications and how you, as a developer, can keep better track of it. The goal is to ensure the best performance of your cloud-based applications.

For years, latency has bedeviled application developers who took for granted that packets could easily traverse a local network with minimal delays. It didn’t take long to realize the folly of that assumption: when it came time to deploy these applications across a wide-area network, many apps broke down because of networking delays of tens or hundreds of milliseconds. But those lessons, learned decades ago, have been forgotten. Today we have a new generation of developers and networking engineers who have to understand a new set of latency delays across the Internet.

Many of the current generation of developers have never experienced anything other than high-speed Internet access and assume it has always been that way. This tends to encourage sloppy coding decisions, creating unnecessary back-and-forth communications that introduce long latencies when their apps run. As we’ll see, now that everything is moving to the cloud, latency becomes even more important than before.

Trying to define cloud latency isn’t easy.

In the days before the ubiquitous Internet, understanding latency was relatively simple. You looked at the number of router hops between you and your application, and the delays that the packets took to get from source to destination. For the most part, your corporation owned all of the intervening routers and the network delays remained fairly consistent and predictable.

Those days seem so quaint now, like when we look at one of the original DOS-based IBM dual-floppy drive PCs. With today’s cloud applications, the latency calculations aren’t so easy.

First off, the endpoints aren’t fixed. The users of our apps can be anywhere in the world, sitting on anything from a high-speed fiber line in a densely served urban area to a satellite uplink in the middle of Africa, and everywhere in between. And the apps themselves can be located pretty much anywhere too: that is the beauty and freedom of the cloud. But this freedom comes at a price: the resulting latencies can be huge.

We also need to consider the location of the ultimate end users and the networks that connect them to the destination networks, as well as how the cloud infrastructure is configured: where the particular pieces of network, applications, servers, and storage fabrics are deployed, and how they are connected.

And it also depends on who the ultimate “owners” and “users” of our apps are. Latency can be important for the end-user experience of an enterprise’s apps. But if you are a service provider or a systems integrator, you will want to control the network and deliver the appropriate service levels to your customers, and that means also controlling the expected latencies as part of these agreements.

One solution: triage your apps.

While reducing latency is desirable, not every app will need the lowest latencies. Latency-sensitive applications such as financial services, video streaming, more complex Web/database services, backups, and 3-D engineering modeling are in this category. But apps such as email, analytics, and some kinds of document management aren’t as demanding.

Latency has had three traditional metrics.

In the past, latency has been measured in three different ways: roundtrip time (RTT), traceroutes, and endpoint computational speed. Each of these is important in understanding the true effect of latency, and only after understanding all three can you get the full picture.

RTT measures the time it takes one packet to transit the Internet from source to destination and back to the source, or the time it takes for an initial server connection. This is useful in interactive applications, and also in examining app-to-app situations, such as measuring the way a Web server and a database server interact and exchange data.
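
To make this concrete, here is a minimal sketch in Python that estimates RTT by timing TCP handshakes rather than ICMP pings; the hostname and port are placeholders, and taking the best of several samples helps filter out transient queuing delays:

    import socket
    import time

    def tcp_rtt(host, port=443, samples=5):
        """Estimate RTT by timing TCP handshakes to host:port."""
        times = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=5):
                pass  # the handshake itself is what we are timing
            times.append((time.perf_counter() - start) * 1000)  # ms
        return min(times)  # the minimum filters out transient queuing

    print("RTT: %.1f ms" % tcp_rtt("www.example.com"))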

Traceroute is the name of a popular command that examines the individual hops, or network routers, that a packet traverses to go from one place to another. Each hop can introduce its own latency. The path with the fewest and quickest hops may or may not correspond to what we would commonly think of as the geographically shortest link. For example, the lowest-latency and fastest path between a computer in Singapore and one in Sydney, Australia might go through San Francisco.
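
Rather than eyeballing raw traceroute output by hand, you can wrap the system command in a script and collect per-hop reports for each of your endpoints. A minimal sketch, assuming a Unix-like machine with traceroute installed (on Windows you would substitute tracert):

    import subprocess

    def hop_report(host, max_hops=20):
        """Run the system traceroute and return its per-hop report."""
        result = subprocess.run(
            ["traceroute", "-m", str(max_hops), host],
            capture_output=True, text=True, timeout=120)
        return result.stdout

    print(hop_report("www.example.com"))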

Finally, there is the speed of the computers at the core of the application: their configuration will determine how quickly they can process the data. While this seems simple, it can be difficult to calculate once we start using cloud-based compute servers.
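
One practical way to compare endpoint speed across cloud instance types is to time the same fixed, CPU-bound workload on each of them. A minimal sketch; the workload here is an arbitrary stand-in, not a standard benchmark:

    import time

    def cpu_benchmark(iterations=5):
        """Time a fixed CPU-bound workload; comparable across instances."""
        best = float("inf")
        for _ in range(iterations):
            start = time.perf_counter()
            sum(i * i for i in range(1_000_000))  # stand-in workload
            best = min(best, time.perf_counter() - start)
        return best

    print("Workload time: %.1f ms" % (cpu_benchmark() * 1000))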

First complicating factor: distributed computing.

As we said earlier, in the days when everything was contained inside an enterprise data center, it was easier to locate bottlenecks because the enterprise owned the entire infrastructure from source to destination. But with the rise of Big Data apps built using tools such as Hadoop and R (the major open source statistics language used for data analytics), the nature of applications is changing and becoming a lot more distributed. These apps can employ tens or even thousands of compute servers that may be located all over the world, each with a varying degree of latency on its Internet connection. And depending on when these apps are running, the latencies can be better or worse as other Internet traffic waxes and wanes, competing for the same infrastructure and bandwidth.
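
When an app depends on many far-flung nodes, it helps to probe them all from the same vantage point and compare the results. A minimal sketch that reuses the handshake-timing idea above; the hostnames are hypothetical stand-ins for your real nodes:

    from concurrent.futures import ThreadPoolExecutor
    import socket
    import time

    # Hypothetical nodes; substitute the hosts your app depends on.
    NODES = [("us-east.example.com", 443),
             ("eu-west.example.com", 443),
             ("ap-south.example.com", 443)]

    def probe(node):
        host, port = node
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=5):
                pass
            return host, (time.perf_counter() - start) * 1000
        except OSError:
            return host, None  # unreachable from this vantage point

    with ThreadPoolExecutor(max_workers=10) as pool:
        for host, ms in pool.map(probe, NODES):
            print(host, "unreachable" if ms is None else "%.1f ms" % ms)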

Virtualization adds another layer of complexity, too.

Today’s data center isn’t just a bunch of rack-mounted servers but a complex web of hypervisors running dozens of virtual machines. This adds yet another layer of complexity, since the virtualized network infrastructure can introduce its own series of packet delays before any data even leaves the rack itself!

Understand Quality of Service and what traffic is prioritized.

In the pre-cloud days, Service Level Agreements (SLAs) and Quality of Service policies were created to prioritize traffic and to make sure that latency-sensitive apps would have the network resources to run properly. These agreements were also put in place to ensure minimal downtime by penalizing the ISPs and other vendors who supplied the bandwidth and the computing resources.

But with the rise of cloud and virtualized services, it isn’t so cut and dried. For one thing, the older SLAs typically didn’t differentiate between an outage in a server, a network card, a piece of the storage infrastructure, or a security exploit. But these different pieces are part and parcel of the smooth and continuous operation of any cloud infrastructure.

An example of this is a back-office application that produces daily summary charts about a particular business process. If one of the many components of this app is down briefly, probably no one would notice or really care, as long as the reports are produced eventually. We’ve put together the chart below that summarizes our thoughts on how critical particular apps are and under what circumstances they should be prioritized for particular SLAs.

This means that your SLAs need to handle a variety of situations. You don’t want to enforce (nor pay for) the same service levels on your test/dev cloud that you would on a production cloud.

Reducing latency has several dimensions.

So now that we have a better understanding of some of the complicating factors, the next step is to examine how you can reduce latencies in particular segments of your computing infrastructure. A paper from Arista Networks identifies four broad areas of focus:
• Reduce latency of each network node
• Reduce number of network nodes needed to traverse from one stage to another
• Eliminate network congestion
• Reduce transport protocol latency

Of course, Arista sells some of the gear that can help you reduce network switch transit times or cut network congestion, but it is still worth examining these more mundane pieces of your cloud provider’s network infrastructure (if you can) to see where you can start to apply some of these savings.
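
On the transport-protocol front, one illustrative tweak (our example, not Arista’s) is disabling Nagle’s algorithm on chatty request/response connections, so that small packets go out immediately instead of being buffered; the tradeoff is more, smaller packets on the wire:

    import socket

    sock = socket.create_connection(("www.example.com", 443), timeout=5)
    # TCP_NODELAY turns off Nagle's algorithm: small writes are sent
    # immediately rather than waiting to be coalesced with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)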

Can content delivery networks (CDNs) help?

Not much. CDNs are designed mostly for delivering static content to a broad collection of distributed end users. One of the largest CDNs is Akamai, which is based on 95,000 servers installed in 1,900 ISPs around the world. But many cloud applications require a different kind of treatment, and in many cases won’t get much of a latency improvement from a CDN because they aren’t serving static pieces of content. Nevertheless, CDNs are expanding their capabilities and trying to help reduce latencies by caching more than just static HTML pages. Certainly, it is worth investigating whether a CDN partner can improve your particular situation.
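
One quick way to test whether a CDN would help your app is to compare the time-to-first-byte for the same asset served from your origin and from a cached edge copy. A minimal sketch; both URLs are hypothetical placeholders:

    import time
    import urllib.request

    def ttfb(url):
        """Measure time to first byte for a URL, in milliseconds."""
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read(1)  # headers parsed, first byte of the body read
        return (time.perf_counter() - start) * 1000

    for url in ("https://origin.example.com/app.js",
                "https://cdn.example.com/app.js"):
        print(url, "%.1f ms" % ttfb(url))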

Conclusion

As we can see, cloud latency isn’t just about doing traceroutes and reducing router hops. It has several dimensions and complicating factors. Hopefully, we have given you some food for thought and provided some direction so that you can explore some of the specific issues with measuring and reducing latencies for your own cloud apps, along with some ideas on how you can better architect your own apps and networks.

Webinar: Integrating Cloud Services Management Into Your IT Operations

Getting into the cloud is a lot easier than understanding how to make it a part of your overall IT operations. In this webinar, I look at ways that you can better govern your cloud deployments and apply the IT best practices that you use for your own servers. I will show you more than a dozen different services that can help you understand your cloud computing costs, figure out better ways to make your cloud infrastructure secure, and better manage your cloud deployments.

The webinar is held this Thursday at 12:30 pm ET.

You can register for the webinar here, download the white paper Dec Cloud Integration here, and view the slides that I will use for the event here.

QuorumLabs’ onQ: a new way to recover Windows servers (video review)

More businesses are depending on their computer systems staying up and running continuously. To protect them, they make piles of tape backups. However, these tapes are almost never touched or tested. Another choice for disaster recovery is to build a replicated remote data center. But this can get pricey.

Enter QuorumLabs and their onQ Recovery appliance. I spent some time last week working with them and produced this video screencast that explains its features. For about $20,000, you can set up a pair of these appliances and fully protect all of your Windows servers. It uses some cool virtualization technology to make copies of your running servers, so when one goes south you don’t have to run around trying to recover it quickly.

MSPtv Webinar: How the Private Cloud Can Be More Secure

Security concerns remain one of the biggest obstacles to cloud computing adoption, even as spending on cloud-based solutions accelerates. Users welcome the affordability and scalability of cloud offerings, but many remain fearful about the potential for network breaches and leaks. These fears typically focus on public cloud offerings, and as such, they open opportunities for IT service providers to extol the virtues of secure private cloud environments.

Today I will be doing a webinar for MSPtv on this subject. You can tune in here.

You can download my slides here.

Live text chat today on storage virtualization

I will be moderating a live text chat at 1pm ET today on behalf of ReadWriteWeb, with guests from NetApp and VMware, to talk about general storage virtualization topics.

Some of the topics we plan to discuss include:

  • How do you allocate storage appropriately when you don’t know what your needs are?
  • How important is thin provisioning for your storage solution, and where do these features need to be integrated?
  • Does your backup solution need to be virtualization-aware?

Feel free to join us by clicking here.

MSPtv: How the channel can win with cloud computing

Now more than ever, cloud computing has become the single most important factor in helping boost the reseller channel to new heights. No matter their specializations or backgrounds, all channel players can leverage cloud computing to become more profitable and competitive, and to widen their reach by acquiring new customers and business opportunities.

You can register and watch live this MSPtv event on Thursday June 23rd at 12:30 ET here.

SearchCloudComputing: Preparing for a hybrid cloud move

The notion of hybrid cloud computing is gaining traction. While the concept isn’t all that new, vendors are constantly adding to the ways IT managers can effectively migrate and manage these mixed environments. And new providers spring up frequently, which makes evaluating them all that much harder. Assuming you’re ready to hop into the cloud, what are the right steps to take with a hybrid offering?

You can read my article on Techtarget’s site here that goes into details about the steps you need to take.

MSPtv: Cloud ROI: Pie in the Sky?

As with every emerging technology, once the hype gives way to reality, attention turns to ROI. And that is exactly what is happening with cloud computing. Is ROI as hard to prove with the cloud as it has been with most legacy technologies, or do new financial models make ROI projections more straightforward? Either way, solution providers must become adept at it because their customers will want to understand cost benefits before embracing the cloud.

You can register and view this recorded webinar, where I talk about these issues with reseller Jason Smith of Binary IT solutions, here.

Datamation: Virtualization Software Trends: Hybrid Clouds Mature, Virtual Firewalls Lag

In my last update on virtualization for Datamation last winter, I looked at developments toward the end of 2010 concerning virtual desktops and improvements to virtual infrastructure. The past six months have seen increasing sophistication in both areas, with new products from the major virtualization vendors and some interesting twists, which I’ve noted in my story for them this week here. I review where Microsoft, Xen and VMware have been with recent new products and acquisitions and identify a few trends.