Production Walkthrough of AWS ElastiCache (Redis)

Summary

Recently, an AWS service notice to apply security patches to AWS ElastiCache landed in the LambdaTest AWS portal. We had to double-check every possible gap before we could carry out the activity, and there were multiple learnings for the Engineering team. At LambdaTest, our endeavor has always been to follow best practices at scale for our customers, so that maintenance activities don't impact your build cycle(s).

We call it "Maximum Results with Minimal (or Zero) Downtime"

How to find which services are impacted?

With a high pace of growth and teams at scale, every document becomes obsolete. Plans will fail if you expect an SME to remember the full list of clients connected to your Redis node.

The Redis CLIENT LIST command made it simple to find out which EC2 machines were connected to our ElastiCache:

redis-cli -h ENDPOINT client list | awk '{print $2}' | sed s/addr=//g | sed s/:.*//g | sort | uniq

But what you can miss are Lambda functions or a reporting server that runs once a day. These can only be extracted from VPC flow logs by scanning a week's data.
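If your flow logs are published to CloudWatch Logs, a quick way to scan them is a Logs Insights query against the Redis port. Below is a minimal boto3 sketch; the log group name /vpc/flow-logs is a placeholder, and it assumes the default flow log format (for which Logs Insights auto-discovers the dstPort and srcAddr fields):

import time

import boto3

logs = boto3.client('logs')

# '/vpc/flow-logs' is a placeholder; use the log group your flow logs publish to
ONE_WEEK_SECONDS = 7 * 24 * 3600
query_id = logs.start_query(
    logGroupName='/vpc/flow-logs',
    startTime=int(time.time()) - ONE_WEEK_SECONDS,
    endTime=int(time.time()),
    queryString='filter dstPort = 6379 | stats count(*) by srcAddr',
)['queryId']

# Poll until the query completes, then print each distinct client IP
while True:
    result = logs.get_query_results(queryId=query_id)
    if result['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in result['results']:
    print({field['field']: field['value'] for field in row})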

How to find out details of servers connecting to AWS ElastiCache using private IPs?

Simply pass the IPs obtained from the command in the earlier step to the Python script below. The script will output a detailed list of the EC2 machines:

import boto3

"""
A tool for retrieving basic information from the running EC2 instances.
"""

# Connect to EC2
ec2 = boto3.resource('ec2')

# Get information for all running instances
running_instances = ec2.instances.filter(Filters=[{
    'Name': 'private-ip-address',
    'Values': ['xxxxxx', 'xxxxxx']}])

ec2info = {}
for instance in running_instances:
    # Extract the Name tag; instances without one are labeled 'N/A'
    name = 'N/A'
    for tag in instance.tags or []:
        if tag['Key'] == 'Name':
            name = tag['Value']
    # Add instance info to a dictionary
    ec2info[instance.id] = {
        'Name': name,
        'Type': instance.instance_type,
        'State': instance.state['Name'],
        'Private IP': instance.private_ip_address
        }

attributes = ['Name', 'Type', 'State', 'Private IP']
for instance_id, instance in ec2info.items():
    for key in attributes:
        print("{0}: {1}".format(key, instance[key]))
    print("------")

Once you have the EC2 details, it becomes easy to monitor logs, exceptions, and config changes (if required).

Architecture of Amazon ElastiCache (Redis) under high availability

Amazon ElastiCache offers support for Multiple Availability Zones (Multi-AZ) with the auto-failover feature. This enables you to set up a cluster with one or more replicas across zones. In the event of a failure on the primary node, Amazon ElastiCache for Redis automatically fails over to a replica to ensure high availability.

However, you might notice a brief write interruption (up to a few seconds) associated with DNS updates.

Enable Multi-AZ with automatic failover: In the case of any planned (or unplanned) maintenance, enabling Multi-AZ minimizes downtime by performing an automatic failover from the primary node to a replica.

Enabling Multi-AZ via the AWS CLI

aws elasticache modify-replication-group \
    --replication-group-id redisxxx \
    --automatic-failover-enabled \
    --multi-az-enabled \
    --apply-immediately

The role of the primary node will automatically fail over to one of the read replicas. There is no need to create and provision a new primary node, because ElastiCache handles it transparently. The failover and replica-promotion mechanism ensures that you can resume writing to the new primary node as soon as the promotion is complete.

ElastiCache also propagates the DNS (Domain Name System) name of the promoted replica. This ensures that no endpoint change is required in your application as long as it writes to the primary endpoint. If you are reading from individual endpoints, make sure that you swap the endpoint of the replica that was promoted to primary for the new replica's endpoint.
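To see the current primary and per-node read endpoints from code, here is a small boto3 sketch. It assumes the redisxxx replication group id used above, and that cluster mode is disabled (which is when ReadEndpoint and CurrentRole are returned):

import boto3

elasticache = boto3.client('elasticache')

group = elasticache.describe_replication_groups(
    ReplicationGroupId='redisxxx'
)['ReplicationGroups'][0]

for node_group in group['NodeGroups']:
    # The primary endpoint stays stable across failovers
    primary = node_group['PrimaryEndpoint']
    print('Primary:', primary['Address'], primary['Port'])
    # Individual node endpoints change roles after a promotion
    for member in node_group['NodeGroupMembers']:
        endpoint = member['ReadEndpoint']
        print(member.get('CurrentRole'), endpoint['Address'], endpoint['Port'])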

The second best practice for more scalability is to have two or more replica nodes that cater to the read load. This approach also helps with failover if a failure occurs at the primary node. Enable the auto-failover configuration to realize this scalability requirement.

After performing the above changes, both Multi-AZ and automatic failover should show as enabled on the console.
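You can also verify this from code; a minimal boto3 check, again assuming the redisxxx group id:

import boto3

elasticache = boto3.client('elasticache')

group = elasticache.describe_replication_groups(
    ReplicationGroupId='redisxxx'
)['ReplicationGroups'][0]

# Both fields should read 'enabled' once the change is applied
print('MultiAZ:', group['MultiAZ'])
print('AutomaticFailover:', group['AutomaticFailover'])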

One more problem statement that comes up prominently with auto-failover is right-sizing the nodes. This can be done by monitoring CloudWatch metrics such as CPU utilization, DatabaseMemoryUsagePercentage, and ReplicationLag.

Let's consider a scenario where load breaks your high availability architecture. With growing traffic, your primary node keeps choking day by day and eventually might run out of memory. When this occurs, it auto-fails over to a read replica, and this cycle continues until you upscale the nodes.

CloudWatch alarms can be configured on usage so that alerts fire early, for example at 70 percent (keep 30 percent headroom for operational work, monitoring, and replication or cluster workload).
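As an illustration, here is a boto3 sketch that creates such an alarm on memory usage; the alarm name, the cache cluster id redisxxx-001, and the SNS topic ARN are placeholders for your own values:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when memory usage stays above 70 percent for 15 minutes;
# the names and the SNS topic ARN below are placeholders
cloudwatch.put_metric_alarm(
    AlarmName='redisxxx-memory-above-70pct',
    Namespace='AWS/ElastiCache',
    MetricName='DatabaseMemoryUsagePercentage',
    Dimensions=[{'Name': 'CacheClusterId', 'Value': 'redisxxx-001'}],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=70.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:REGION:ACCOUNT_ID:ops-alerts'],
)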

Testing Automatic Failover

Here are the prerequisites before testing the failover of AWS ElastiCache:

  1. Multi-AZ is enabled
  2. At least one read replica is present

Two parameters are required to run the test:

  1. --replication-group-id (Mandatory): The replication group (on the console, cluster) that has to be tested.
  2. --node-group-id (Mandatory): Name of the node group on which you intend to test automatic failover. You can test a maximum of five node groups in a rolling 24-hour period.

aws elasticache test-failover \
   --replication-group-id redis00 \
   --node-group-id redis00-0003
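To watch the failover progress, you can poll the replication group's events; a small boto3 sketch, assuming the redis00 group from the command above:

import boto3

elasticache = boto3.client('elasticache')

# Pull the last 30 minutes of events for the group under test
events = elasticache.describe_events(
    SourceIdentifier='redis00',
    SourceType='replication-group',
    Duration=30,
)

for event in events['Events']:
    print(event['Date'], event['Message'])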

How to check whether your Redis ElastiCache is set up in cluster mode or as a replication group?

We will cover working with cluster mode in a separate blog, as it is beyond the scope of this article. However, if you are still intrigued to know more about cluster mode, refer to this informative blog.

You can quickly check whether Redis is cluster-enabled using the command below:

$ redis-cli -h ENDPOINT CLUSTER SLOTS
(error) ERR This instance has cluster support disabled

How to make your application handle failovers?

AWS ElastiCache is an extremely scalable Redis solution. Here are some of the things that have to be taken care of at the application layer:

  1. Configure the cluster endpoint: When AWS detects a failure in the primary or a replica, it updates the cluster endpoint. If your application is connected to an individual node's DNS instead, a failover is certain to happen eventually, and your application would then try to write to the read replica (which was earlier the primary node). This in turn will throw exceptions.

AWS ElastiCache also provides a reader endpoint for splitting the read load.
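For example, with the redis-py client, writes can be pinned to the primary endpoint and reads to the reader endpoint; PRIMARY_ENDPOINT and READER_ENDPOINT below are placeholders, just like ENDPOINT earlier:

import redis

# PRIMARY_ENDPOINT and READER_ENDPOINT are placeholders, like ENDPOINT above
writer = redis.Redis(host='PRIMARY_ENDPOINT', port=6379)
reader = redis.Redis(host='READER_ENDPOINT', port=6379)

writer.set('build:1234:status', 'passed')  # writes always hit the primary
print(reader.get('build:1234:status'))     # reads are spread across replicas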

  2. DNS Caching: The OS and your code do some level of DNS caching. The only thing you can do is keep the TTL (Time to Live) as low as a minute; in fact, ElastiCache keeps the DNS TTL as low as 15 seconds. We use Go, which doesn't do DNS caching by default, but a few OS-level commands do help in clearing the DNS cache.

Most Linux-based distros use the following commands to clear the DNS cache:

sudo systemd-resolve --flush-caches
sudo service dnsmasq restart
sudo service nscd restart

We have a large fleet of Mac Minis where our microservices run, but they don't directly connect to ElastiCache. Just in case you are hosted on a Mac, you can use the following command to clear the DNS cache:

sudo killall -HUP mDNSResponder

The command might vary slightly in case you are running an older version of Mac OS X.

  3. Reconnect: In case of failures, you will have to write code to reconnect on failure or fall back to the database. Most languages don't provide this out of the box, as every organization uses Redis in a different manner: some might be using pipelines and pub/sub, whereas others might be using the cluster. A minimal retry sketch follows below.
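Here is that sketch using redis-py; the backoff values and the write_to_database fallback are hypothetical and should be adapted to your own stack:

import time

import redis

client = redis.Redis(host='PRIMARY_ENDPOINT', port=6379, socket_timeout=2)

def write_to_database(key, value):
    # Hypothetical fallback writer; wire this to your datastore
    pass

def set_with_retry(key, value, attempts=5):
    # Retry writes across a failover window, backing off while
    # DNS flips to the promoted replica
    for attempt in range(attempts):
        try:
            return client.set(key, value)
        except (redis.ConnectionError, redis.TimeoutError):
            time.sleep(2 ** attempt)
    # Redis stayed down; fall back to the database
    write_to_database(key, value)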

In the case of Kubernetes, the scheduler will automatically respawn your pod when it crashes (or exits), provided your health endpoints test the Redis connection and fail on a connection failure.
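For illustration, here is a bare-bones health endpoint in Python (our services are in Go, but the idea is identical): it pings Redis on every probe and fails the probe when the connection is down, so Kubernetes restarts the pod:

from http.server import BaseHTTPRequestHandler, HTTPServer

import redis

client = redis.Redis(host='PRIMARY_ENDPOINT', port=6379, socket_timeout=1)

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            client.ping()  # probe succeeds only if Redis is reachable
            self.send_response(200)
        except (redis.ConnectionError, redis.TimeoutError):
            self.send_response(500)  # repeated failures trigger a pod restart
        self.end_headers()

HTTPServer(('', 8080), Health).serve_forever()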

Conclusion

In a production setup, we have Multi-AZ enabled and a minimum of one read replica for handling any failure of the primary node. This auto-heals our services and keeps your test suites running without any glitches (or interruptions). We also take daily snapshots to minimize the chances of any data loss. Our team at LambdaTest continuously monitors Redis health metrics to make sure the LambdaTest APIs provide you with the best response times.

With a high availability setup, we expect our downtime to be as low as one minute, even during critical maintenance windows.

Do let us know your thoughts about AWS ElastiCache in the comments section, as it is an integral part of the learning process at LambdaTest!