Introduction
To start, why do we replicate data in the first place? We focus on replication as a technique for achieving scalability, and motivate why reasoning about consistency is so important.
Reasons for Replication
Data replication is a common technique in distributed systems. There are two primary reasons:
- Reliability:
  - If one replica is unavailable or crashes, use another.
  - It protects against corrupted data.
- Performance:
  - Scale with size of the distributed system (replicated web servers).
  - Scale in geographically distributed systems (web proxies).
Replication as Scaling Technique
Replication and caching are widely used for system scalability. Having multiple copies improves performance by reducing access latency, but at the cost of higher network overhead for maintaining consistency. Suppose a data item is replicated numerous times.
- Suppose a client reads the data item $N$ times from a local replica, while the item is updated at that replica $M$ times.
- If $N \ll M$, then many of the updates propagated to the replica are never read, so the network traffic spent on them is wasted.
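The read/update trade-off above can be made concrete with a small calculation. The sketch below is only an illustration under a simplifying assumption: each read observes exactly one version of the item, so $N$ reads can observe at most $N$ distinct updates. The function names `min_unread_updates` and `min_wasted_fraction` are hypothetical and do not come from any library.

```python
def min_unread_updates(n_reads: int, m_updates: int) -> int:
    """Lower bound on the updates pushed to a replica that no read observes.

    Assumption: every read returns exactly one version of the item, so
    N reads can observe at most N distinct updates.
    """
    if n_reads < 0 or m_updates < 0:
        raise ValueError("counts must be non-negative")
    return max(0, m_updates - n_reads)


def min_wasted_fraction(n_reads: int, m_updates: int) -> float:
    """Fraction of the update traffic that is guaranteed to go unread."""
    if m_updates == 0:
        return 0.0
    return min_unread_updates(n_reads, m_updates) / m_updates


if __name__ == "__main__":
    # N = 10 reads against M = 1000 updates: almost all updates are never read.
    print(min_unread_updates(10, 1000))
    print(min_wasted_fraction(10, 1000))
```

When $N \geq M$ the bound drops to zero, which matches the intuition that replication pays off for read-heavy data.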
Maintaining consistency is itself a challenge. To keep replicas consistent, we generally need to ensure that all conflicting operations are performed in the same order everywhere. In particular, we need to look out for:
- Read-write conflict: a read operation and a write operation act concurrently. Suppose $x = 5$ initially and a write $W(x = 8)$ is issued.
  - If the read operation occurs before the write operation completes, the read observes the old value of $x$ (i.e. $5$). In this case:
    - $R(x)$ returns $5$, and then $W(x = 8)$ updates $x$ to $8$.
    - Thus, the read does not reflect the latest write, resulting in a read value of $5$.
  - Conversely, if the write operation completes before the read operation starts, the read observes the new value of $x$ (i.e. $8$). In this case:
    - $W(x = 8)$ updates $x$ to $8$, and then $R(x)$ returns the updated value.
    - Thus, the read reflects the latest write, resulting in a read value of $8$.
- Write-write conflict: two write operations act concurrently.
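The two read-write outcomes described above can be reproduced with a short sketch. It models a single copy of $x$ and replays a read and a write in a given order; the function `run_schedule` and the operation labels `"R"` and `"W"` are hypothetical names chosen for this illustration.

```python
from typing import List, Optional


def run_schedule(schedule: List[str], initial: int = 5, new_value: int = 8) -> Optional[int]:
    """Replay reads ("R") and writes ("W") against one copy of x.

    Returns the value seen by the last read, or None if no read ran.
    """
    x = initial
    observed = None
    for op in schedule:
        if op == "R":
            observed = x      # the read sees whatever x holds right now
        elif op == "W":
            x = new_value     # the write installs the new value
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return observed


if __name__ == "__main__":
    print(run_schedule(["R", "W"]))  # read ordered before the write
    print(run_schedule(["W", "R"]))  # write ordered before the read
```

The same pair of operations yields a different read result depending only on the order in which they are applied, which is why conflicting operations need an agreed order.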
The solution is to loosen consistency requirements so that, ideally, global synchronization can be avoided.
- Read-write conflict: a read and a write operation occur concurrently. Suppose $x = 5$:
$$
R(x)5 \to W(x)8
$$
- If the read $R(x)$ happens before the write $W(x)$ completes, it sees the old value of $x$, which is $5$.
- If the write $W(x)$ completes before the read $R(x)$ starts, the read sees the updated value of $x$, which is $8$.
- Write-write conflict: two write operations occur concurrently. Suppose $x = f(y)$ and $y = f(x)$:
$$
W(x)5 \to W(x)8
$$
- If the second write operation $W(x)8$ is applied before the first $W(x)5$, the later-applied $W(x)5$ overwrites it, and the final value of $x$ is $5$.
- If the first write operation $W(x)5$ completes before the second, $x$ is first set to $5$; to maintain the constraint between $x$ and $y$, $y$ must then be updated, and $x$ is eventually set to $8$.
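A write-write conflict becomes visible once two replicas receive the same writes in different orders. The sketch below is a minimal illustration, not a real replication protocol: it ignores the constraint between $x$ and $y$, assumes a hypothetical sequencer has already tagged each write with a sequence number, and starts from a made-up initial value of $0$. The names `Write`, `apply_in_arrival_order`, and `apply_in_sequence_order` are invented for this example.

```python
from typing import List, Tuple

# (sequence number assigned by a hypothetical sequencer, value written to x)
Write = Tuple[int, int]


def apply_in_arrival_order(initial: int, arrivals: List[Write]) -> int:
    """Apply writes in the order they happened to arrive; the last one wins."""
    x = initial
    for _seq, value in arrivals:
        x = value
    return x


def apply_in_sequence_order(initial: int, arrivals: List[Write]) -> int:
    """Sort by sequence number first, so every replica applies the same order."""
    return apply_in_arrival_order(initial, sorted(arrivals))


if __name__ == "__main__":
    w5, w8 = (1, 5), (2, 8)
    arrivals_a = [w5, w8]   # replica A receives W(x)5, then W(x)8
    arrivals_b = [w8, w5]   # replica B receives them the other way round

    # Without an agreed order the replicas end up with different values of x.
    print(apply_in_arrival_order(0, arrivals_a), apply_in_arrival_order(0, arrivals_b))

    # With the same order everywhere, both replicas converge.
    print(apply_in_sequence_order(0, arrivals_a), apply_in_sequence_order(0, arrivals_b))
```

Agreeing on one sequence for all writes is a form of global synchronization, which is the cost that weaker consistency requirements try to avoid.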
Data-centric Consistency Models
Traditionally, consistency has been discussed in the context of read and write operations on shared data made available through a data store, that is, a distributed collection of storage accessible to clients.