How Facebook Visualizes Cache Health

One of Facebook’s better-known engineering practices is its use of Memcache to keep performance up under a large influx of data. This week, Facebook engineer Sean Lynch explains how he and his team came up with a monitoring tool called Claspin that uses heat maps to present the status of their systems in an easy-to-interpret format. The post can be found here.

Actually, Memcache is just one of two caching systems in use at Facebook. The other is TAO, a caching graph database that does its own queries to MySQL. Claspin is used to monitor both.

Lynch started thinking about ways to visualize all the relevant metrics so that trends would be easy to see, given the thousands of charts involved and the racks upon racks of servers running to support Facebook’s applications and Web services.

The post describes the process Lynch went through to arrive at the tool now used by the social network’s operations team. The journey is instructive if you want to develop your own monitoring tools, and it shows how to incorporate what Lynch calls the “tribal knowledge” of his team into the choice of metrics used to identify a poorly performing host.

He settled on using two-dimensional heat maps that can pack a lot of visual information in a small space. “On a 30-inch screen we could easily fit 10,000 hosts at the same time, with 30 or more stats contributing to their color, updated in real time—usually in a matter of seconds or minutes,” he writes. Here is an example of the visualization, with each square representing a particular server, red being an issue and green meaning all systems are good.

As you can see in the screen capture, “mousing over a host draws an outline around its rack and pops up a tooltip with the hostname, rack number, and all the stats Claspin is looking at for that host, with the values colored based on Claspin’s thresholds for that stat,” he posted.
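To make the idea of threshold-based coloring concrete, here is a minimal Python sketch. It is not Claspin’s code. The stat names, the threshold values, the intermediate "yellow" level, and the rule that a host takes the color of its worst stat are all assumptions made for illustration; the post excerpted here only says that many stats contribute to a host’s color and that tooltip values are colored by per-stat thresholds.

```python
from dataclasses import dataclass

# Severity levels, ordered from healthy to unhealthy.
# The middle "yellow" level is an assumption; the article only mentions red and green.
OK, WARN, CRIT = 0, 1, 2
COLORS = {OK: "green", WARN: "yellow", CRIT: "red"}


@dataclass(frozen=True)
class Threshold:
    """Hypothetical per-stat thresholds; higher values are assumed to be worse."""
    warn: float
    crit: float

    def severity(self, value: float) -> int:
        if value >= self.crit:
            return CRIT
        if value >= self.warn:
            return WARN
        return OK


def stat_colors(stats: dict, thresholds: dict) -> dict:
    """Color each stat individually, as a tooltip might display it."""
    return {
        name: COLORS[thresholds[name].severity(value)]
        for name, value in stats.items()
        if name in thresholds  # stats without a threshold are ignored
    }


def host_color(stats: dict, thresholds: dict) -> str:
    """Color a host by its worst stat (an assumed rule, not a documented one)."""
    worst = max(
        (thresholds[name].severity(value)
         for name, value in stats.items() if name in thresholds),
        default=OK,  # a host with no tracked stats is treated as healthy
    )
    return COLORS[worst]


# Hypothetical stat names and limits, for illustration only.
thresholds = {
    "cpu_pct": Threshold(warn=70, crit=90),
    "timeouts_per_sec": Threshold(warn=5, crit=50),
}
host_stats = {"cpu_pct": 95, "timeouts_per_sec": 1}

print(host_color(host_stats, thresholds))
print(stat_colors(host_stats, thresholds))
```

Taking the worst stat means a single bad metric is enough to turn a square red, which suits a display meant to draw the eye to trouble. A weighted score would be another reasonable choice, at the cost of being harder to explain from the tooltip alone.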

Perhaps the best testimony to the power of the tool is how widely it is used. Its name comes from a protein that monitors for DNA damage in a cell, and the tool has proved useful in many other circumstances at Facebook. “It’s quite gratifying to walk around the campus and see Claspin up on the wall-mounted screens of teams I didn’t even know were using it,” he writes.
