In a previous post (Availability-Zone Visibility) we explained how to make sure that your application has the right Availability-Zone strategy and how to prevent downtime. We believe that budget is just another resource that goes through engineering trade-offs, just like performance, capacity and others (see: Show.Me.The.Money). In this post I would like to tie the two together and share how we applied a set of small changes to our environment that had a significant impact on both the performance and the cost of the solution. We will look at how simple configuration changes have:

  • Improved access time (reduced latency) to the database (Cassandra cluster).
  • Reduced timeouts as a result of zone outages.
  • Reduced the cost of AWS cross-zone traffic significantly.


Architecture – High Level

Three main services use our EC2 Cassandra deployment:

  • data-process-pipeline-worker: processes new payloads and inserts them into Cassandra
  • analytics-worker: data analysis
  • analytics-api: user api

Here is a diagram. Note that each scaling group contains many instances (each dashed line represents multiple lines, one per instance).


In practice, each client node will talk to one out of 3 different Cassandra nodes (one per AZ):


Auto-Scaling-Group – Details

The number of EC2 instances differs from service to service and can scale up or down.

Here is the detailed layout of the analytics-worker Auto-Scaling Group with 8 EC2 instances:

Each EC2 instance can access any of the Cassandra nodes (in reality, we narrow it down to three, one per Availability Zone).


Our Cassandra “replication factor” is 3. Zoning-wise, the EC2 instances and the Cassandra nodes are randomly distributed across 3 Availability Zones (us-east-1b, us-east-1c and us-east-1d).

AWS Zone Traffic – Cross vs. Internal

Using the ITculate engine, network traffic to Cassandra is broken into two buckets: Internal and Cross.

In the following example we can see that 80% of the traffic (400MB/s, with spikes of 900MB/s) is cross-zone, while 20% (100MB/s, with spikes of 300MB/s) stays within the same zone.

The OMG moment!

Cross-zone traffic is 4 times higher than internal-zone traffic:

  • It is costly: at 500MB/s it comes to $12,960 per month (each second costs 0.5 cent).
  • It adds significant latency (5-20 ms).
  • In case of a zone outage, the default accessed node might not be available. This may not cause an actual outage (as the communication falls back to the next Cassandra node on the list), but it will surely slow down the system.
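The cost figure in the list above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-envelope check: the post does not state a price, so the rate of $0.01 per GB, the decimal units (1 GB = 1000 MB) and the 30-day month are assumptions inferred from the quoted numbers, and the function names are made up for illustration.

```python
# Back-of-envelope check of the cross-zone cost quoted in the post.
# ASSUMPTIONS (not stated in the post): $0.01 per GB transferred,
# decimal units (1 GB = 1000 MB) and a 30-day month.

PRICE_PER_GB_USD = 0.01                 # assumed cross-zone transfer rate
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 2,592,000 seconds


def cost_per_second_usd(mb_per_second: float) -> float:
    """Cost of one second of traffic at the given throughput."""
    return mb_per_second / 1000 * PRICE_PER_GB_USD


def monthly_cost_usd(mb_per_second: float) -> float:
    """Cost of sustaining the given throughput for a 30-day month."""
    return cost_per_second_usd(mb_per_second) * SECONDS_PER_MONTH


if __name__ == "__main__":
    print(f"per second: {cost_per_second_usd(500) * 100:.2f} cents")
    print(f"per month:  ${monthly_cost_usd(500):,.0f}")
```

Under these assumptions, 500MB/s works out to 0.5 cent per second and $12,960 per month, matching the figures in the list. Feeding in only the 400MB/s cross-zone share from the traffic breakdown would give $10,368 under the same assumptions.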


The Hoohah moment!

Step 1. Cluster side, set up the Cassandra Snitch for Multi-AZ

A snitch determines which data centers and racks nodes belong to (more info). Ec2Snitch and Ec2MultiRegionSnitch are optimized for AWS: both treat the EC2 region as the data center and the Availability Zone as the rack. For a single-region cluster use Ec2Snitch, and for a cluster that spans multiple regions use Ec2MultiRegionSnitch.

In cassandra.yaml, we set the endpoint_snitch to Ec2Snitch.
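For reference, the setting amounts to a single line. The fragment below is a minimal sketch of that line only, not the original snippet from our configuration; the rest of cassandra.yaml is omitted.

```yaml
# cassandra.yaml (fragment): set on every node in the cluster.
# Ec2Snitch is meant for single-region deployments; it maps the
# EC2 region to the data center and the Availability Zone to the rack.
endpoint_snitch: Ec2Snitch
```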

Changing endpoint_snitch can be destructive; for more information see Switching snitches.

Step 2. Client side, set up the Replication Strategy

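The basic Cassandra strategy is “SimpleStrategy”; it ignores which region or AZ a node belongs to when placing replicas. The NetworkTopologyStrategy is topology-aware: it uses the data center and rack information reported by the snitch, so it works well across AZs and also across regions.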

We create a keyspace named “itculate” with the NetworkTopologyStrategy; there is only one region, “us-east”, with a replication factor of 3.
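A keyspace definition along these lines might look like the sketch below. It is a reconstruction from the description rather than the original statement: the keyspace name is spelled `itculate` on the assumption that it follows the company name, and the data center name `us-east` assumes Ec2Snitch is active on a cluster in us-east-1, where the snitch names the data center after the region.

```cql
-- Sketch only: assumes Ec2Snitch, so the data center is called "us-east".
-- Three replicas are requested in that single data center.
CREATE KEYSPACE itculate
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3
  };
```

With three Availability Zones acting as racks, NetworkTopologyStrategy tries to place replicas on distinct racks, so a replication factor of 3 should leave one replica in each zone, which is what lets a client be served from its own zone.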

Changing the replication strategy can be destructive; for more information see Updating the replication factor.


The result: Faster, Cheaper!

We saved much more than expected: the saving was $8,000 (~60%).
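As a quick sanity check on the percentage, the sketch below relates the saving to the monthly cost quoted earlier. It assumes that the $8,000 is a monthly figure measured against the $12,960 baseline, which the post implies but does not state; the names are illustrative only.

```python
# Sanity check of the "~60%" figure.
# ASSUMPTION: the $8,000 saving is per month and is measured against
# the $12,960 monthly cross-zone cost quoted earlier in the post.

BASELINE_USD = 12_960   # monthly cost before the change (from the post)
SAVING_USD = 8_000      # reported saving (from the post)


def saving_share(saving: float, baseline: float) -> float:
    """Fraction of the baseline cost that was saved."""
    return saving / baseline


share = saving_share(SAVING_USD, BASELINE_USD)
remaining_usd = BASELINE_USD - SAVING_USD

if __name__ == "__main__":
    print(f"saved: {share:.1%}, remaining: ${remaining_usd:,} per month")
```

Under that assumption the share comes to roughly 62%, consistent with the rounded ~60%, and leaves about $4,960 per month of cross-zone cost.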

This is the traffic before the change (left: cross-zone) and after the change (right: internal communication):

About ITculate

ITculate provides a monitoring solution for DevOps environments. ITculate’s solution captures not only raw and custom metrics but also the architecture of the customer’s environment. ITculate’s core technology tracks relationships between and within services. Understanding the relationships allows ITculate to provide context to the user. It also allows for better visualization and enables much faster troubleshooting. ITculate provides a more intuitive way of exploring data and dramatically improves the user experience of monitoring. Please check us out to learn more!
