Disaster recovery, HA, RTO, RPO

From Ever changing code

Two important aspects of resiliency are high availability and disaster recovery.

High availability (HA)
is the ability of the application to continue running in a healthy state, without significant downtime. By "healthy state," we mean the application is responsive, and users can connect to the application and interact with it.
Disaster recovery (DR)
is the ability to recover from rare but major incidents: non-transient, wide-scale failures, such as service disruption that affects an entire region. Disaster recovery includes data backup and archiving, and may include manual intervention, such as restoring a database from backup.

One way to think about HA versus DR is that DR starts when the impact of a fault exceeds the ability of the HA design to handle it.

Business continuity (BC)
is the ability to perform essential business functions during and after adverse conditions, such as a natural disaster or a downed service.

Systems resiliency

Two important metrics to consider are the recovery time objective and recovery point objective.

Recovery time objective (RTO)
is the maximum acceptable time that an application can be unavailable after an incident. If your RTO is 90 minutes, you must be able to restore the application to a running state within 90 minutes from the start of a disaster. If you have a very low RTO, you might keep a second deployment continually running on standby, to protect against a regional outage.
Recovery point objective (RPO)
is the maximum duration of data loss that is acceptable during a disaster. For example, if you store data in a single database, with no replication to other databases, and perform hourly backups, you could lose up to an hour of data. That setup is therefore only acceptable if your RPO is one hour or longer.
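
The arithmetic behind the two metrics can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the function names (worst_case_data_loss, meets_rpo, meets_rto) and the sample timestamps are hypothetical, and it assumes periodic backups with no replication, as in the example above.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: periodic backups, no replication. A failure just before
    # the next backup loses everything written since the previous one,
    # so the worst-case loss equals the backup interval.
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # The backup schedule satisfies the RPO only if the worst-case
    # data loss does not exceed it.
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(disaster_start: datetime, service_restored: datetime,
              rto: timedelta) -> bool:
    # The RTO clock runs from the start of the disaster until the
    # application is back in a running state.
    return service_restored - disaster_start <= rto


if __name__ == "__main__":
    hourly = timedelta(hours=1)
    # Hourly backups cannot satisfy a 15-minute RPO.
    print(meets_rpo(hourly, rpo=timedelta(minutes=15)))   # False
    # Hourly backups do satisfy a 1-hour RPO.
    print(meets_rpo(hourly, rpo=timedelta(hours=1)))      # True

    start = datetime(2024, 1, 1, 12, 0)       # hypothetical incident time
    restored = start + timedelta(minutes=75)  # hypothetical recovery time
    # A 75-minute recovery fits within a 90-minute RTO.
    print(meets_rto(start, restored, rto=timedelta(minutes=90)))  # True
```

If a check like meets_rpo fails, the usual options are to back up more often or to add replication. If meets_rto fails, a standby deployment, as mentioned above, is one way to shorten recovery.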

Resources