- git AntidoteDB/antidote master
- default config (plus `sync_log true` when nemesis is process kill)
- Erlang 24
- topologies
  - intra dc -> 1 * dc1n5 (single data center with 5 nodes)
  - inter dc -> 5 * dc1n1 (5 data centers with a single node)
- Erlang client
  - `:antidote_crdt_set_aw`, `:antidote_crdt_counter_pn`
  - datatypes `static: :true`
  - transactions `:timeout 1_000`
Phase | Activity | Duration
---|---|---
workload | ~10 reads and 10 writes (random sequence) / sec / key | 60s
fault(s) | random | 0 <= random <= 30s
heal | resolve any faults | immediate
quiesce | low rate of reads, no writes | 10s
final reads | 1 read / node / key | immediate
Generator adds a sequential integer to a single set.
Generators try different counter strategies:
- random values
- grow-only
- swinging between p's and n's
Uses largish unique random values to help in checking the results. E.g. when calculating all of the possible eventual states to evaluate a read, larger values produce a sparser set of possible ranges than +/-1.
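The three counter strategies could be sketched roughly as follows. This is a minimal Python illustration, not the test's actual generator code; the function names, the value range, and the `run_length` parameter are assumptions made for the example.

```python
import itertools
import random

def unique_values(rng, lo=1_000, hi=1_000_000):
    """Yield largish random magnitudes without repeats (range is an assumption)."""
    seen = set()
    while True:
        v = rng.randint(lo, hi)
        if v not in seen:
            seen.add(v)
            yield v

def random_deltas(rng):
    """Random strategy: each delta is an increment or a decrement at random."""
    for v in unique_values(rng):
        yield v if rng.random() < 0.5 else -v

def grow_only_deltas(rng):
    """Grow-only strategy: increments only, so reads should never decrease."""
    yield from unique_values(rng)

def swinging_deltas(rng, run_length=5):
    """Swinging strategy: a run of increments, then a run of decrements, repeated."""
    sign = 1
    values = unique_values(rng)
    while True:
        for _ in range(run_length):
            yield sign * next(values)
        sign = -sign

def take(n, generator):
    """Helper for trying the generators out: first n deltas as a list."""
    return list(itertools.islice(generator, n))
```

For example, `take(10, swinging_deltas(random.Random(1), run_length=3))` gives three increments, three decrements, three increments, and one decrement, each with a distinct magnitude.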
- none
- partition
- packet
- kill
- pause
- file corruption
- targets: [:one, :all, :majority, :minority, :minority_third, :majorities_ring, :primaries]
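As a rough illustration of what the simpler targets mean in terms of node counts, here is a Python sketch. The sizing rules (majority as more than half, minority as one fewer than a majority, minority_third as strictly fewer than a third) are assumptions for the example, and `pick_targets` is a hypothetical helper rather than the test's code. `:majorities_ring` and `:primaries` depend on partition shape and database state, so they are left out.

```python
import random

def majority(n):
    """Smallest node count that is more than half of n."""
    return n // 2 + 1

def pick_targets(nodes, target, rng):
    """Return the nodes a fault is applied to for a given target keyword."""
    n = len(nodes)
    shuffled = rng.sample(nodes, n)  # random order, original list untouched
    if target == "one":
        return shuffled[:1]
    if target == "all":
        return list(nodes)
    if target == "majority":
        return shuffled[:majority(n)]
    if target == "minority":
        return shuffled[:majority(n) - 1]
    if target == "minority_third":
        # Largest count strictly below a third of the cluster (assumption).
        return shuffled[:(n - 1) // 3]
    raise ValueError(f"target not covered by this sketch: {target}")
```

With the 5 node topologies used here, these rules give 1 node for `one`, 3 for `majority`, 2 for `minority`, and 1 for `minority_third`.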
Clock and time faults are not being tested. (Need real VMs.)
Currently testing with fault durations up to `NetTickTime / 2` (e.g. 30s).
Would like to test at just less than `:nodedown`: `0 <= fault_duration <= NetTickTime - 1` (e.g. 59s).
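The two duration bounds can be expressed as a small sketch. This is an illustration in Python, not the test's code; `fault_duration` and the `bound` names are hypothetical, and the 60 second value assumes Erlang's default `net_ticktime`.

```python
import random

NET_TICK_TIME_S = 60  # assumes Erlang's default net_ticktime of 60 seconds

def fault_duration(rng, bound="half"):
    """Pick a random fault duration in whole seconds.

    bound="half":       0 <= duration <= NetTickTime / 2  (current testing)
    bound="just_under": 0 <= duration <= NetTickTime - 1  (just short of :nodedown)
    """
    if bound == "half":
        upper = NET_TICK_TIME_S // 2
    elif bound == "just_under":
        upper = NET_TICK_TIME_S - 1
    else:
        raise ValueError(f"unknown bound: {bound}")
    return rng.randint(0, upper)  # randint includes both endpoints
```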
Uses Jepsen's set-full and an enhanced (fuzz_dist) pn-counter model/checker for verification.
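The core ideas behind both checks can be sketched in a few lines. This is a simplified Python illustration under stated assumptions, not Jepsen's set-full checker or the fuzz_dist pn-counter checker: it only looks at final reads, and it treats writes with an unknown outcome (e.g. timeouts) as possibly applied. All function names are hypothetical.

```python
from itertools import combinations

def lost_elements(acked_adds, final_read):
    """Set check: acknowledged adds that are missing from a final read."""
    return set(acked_adds) - set(final_read)

def possible_final_values(ok_deltas, info_deltas):
    """Counter check: every final value consistent with the history.

    ok_deltas:   deltas whose write returned :ok (must all be applied)
    info_deltas: deltas with an unknown outcome (may or may not be applied)
    """
    base = sum(ok_deltas)
    values = set()
    for r in range(len(info_deltas) + 1):
        for subset in combinations(info_deltas, r):
            values.add(base + sum(subset))
    return values

def final_read_valid(read_value, ok_deltas, info_deltas):
    """A final read is valid only if it is one of the possible values."""
    return read_value in possible_final_values(ok_deltas, info_deltas)
```

This also shows why largish unique values help. With unknown deltas of `[1, 1, -1]` the possible values are the dense run `{-1, 0, 1, 2}`, so many wrong reads would still look plausible. With `[5003, 7121, -9340]` there are eight widely spaced possibilities, so a read that lost a write is much more likely to fall outside the set.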
Anomalies Observed/Reproducible | no faults | partition | packet | pause | kill (w/sync_log) | file corruption
---|---|---|---|---|---|---
Intra DC | none | none | none | none | yes | yes
Inter DC | yes | yes | yes | yes | yes | yes
Intra DC networking (e.g. 1 * dc1n5) is significantly more resilient than inter DC networking (e.g. 5 * dc1n1).
See initial GitHub issues:
- pn-counter can lose :ok'd increments in a no fault environment
- pn-counters are susceptible to partitioning
- Inter DC partitioning can disrupt replication
- no anomalies observed
- g-set: ~ 10+% write operations abort
- pn-counter: ~ 2-3/1000 write operations abort
g-set:
- all ops return `:ok`
- but not all writes are replicated
pn-counter:
- all ops return `:ok`
- but pn-counter can produce:
  - impossible read values
  - invalid final read values
- the most common failure is a value read from a node that appears to have lost a previous `:ok`'d write for the same node
- less frequently, the lost write appears to have come from a previously replicated op on another node
- no anomalies observed
- increased client write timeouts post partition, in healed state
- can crash `stable_meta_data_server`
g-set:
- all ops return `:ok`
- but most tests fail most of the time
- not all writes are replicated
pn-counter:
- all ops return `:ok`
- but most tests fail most of the time
- invalid final read values
- impossible read values
- non-monotonic reads for grow-only
- no anomalies observed
- increased client write timeouts post packet disruption, in healed state
g-set:
- all ops return `:ok`
- but a majority of tests fail most of the time
- not all writes are replicated
pn-counter:
- all ops return `:ok`
- but most tests fail most of the time
- invalid final read values
- impossible read values
- no anomalies observed
- increase in timeouts post pause, in healed state
g-set:
- not all `:ok` writes replicated
pn-counter:
- not all `:ok` writes replicated
sync_log true
Clients will occasionally return `:error` for long periods, at times for the remainder of the test, even after being healed.
g-set:
- partial replication of writes
- increase in timeouts post kill, in healed state
pn-counter:
- impossible read values
- invalid final reads
- non-monotonic reads for grow-only
- increased latency, aborted transactions
- node fails to restart
- most tests fail
The current Docker Compose environment is problematic, so a workaround has been developed.
Jepsen's LXC environment works best.
Also see the previously mentioned Antidote GitHub issues for steps to reproduce.