`fuzz-dist`'ing AntidoteDB

Configuration

git AntidoteDB/antidote master
- default config (plus sync_log true when nemesis is process kill)
- Erlang 24
topologies
- intra dc -> 1 * dc1n5 (single data center with 5 nodes)
- inter dc -> 5 * dc1n1 (5 data centers with a single node)
Erlang client
- :antidote_crdt_set_aw, :antidote_crdt_counter_pn datatypes
- static: :true transactions
- :timeout 1_000

Workload

Phase	Activity	Duration
workload	~ 10 reads and 10 writes (random sequence) / sec / key	60s
workload	fault(s) (random)	0 <= random <= 30s
heal	resolve any faults	immediate
quiesce	low rate of reads no writes	10s
final reads	1 read / node / key	immediate
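
As a rough illustration of the workload phase, here is a minimal Python sketch of one key's operation stream. It assumes the "~ 10 reads and 10 writes (random sequence) / sec / key" row means each second holds 10 reads and 10 writes in shuffled order. The function name and parameters are hypothetical, not fuzz-dist's actual generator.

```python
import random

def workload_ops(duration_s=60, reads_per_s=10, writes_per_s=10, seed=0):
    """Return (second, op) pairs for a single key (hypothetical helper)."""
    rng = random.Random(seed)
    ops = []
    for second in range(duration_s):
        # Assumption: a fixed mix per second, issued in random order.
        batch = ["read"] * reads_per_s + ["write"] * writes_per_s
        rng.shuffle(batch)
        ops.extend((second, kind) for kind in batch)
    return ops

ops = workload_ops()
```

With the defaults this yields 1200 operations for a 60s workload phase, split evenly between reads and writes.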

g-set

Generator adds a sequential integer to a single set.

pn-counter

Generators try different counter strategies:

random values
grow-only
swinging between p's and n's

Uses largish unique random values to help in checking the results. E.g. when calculating all of the possible eventual states against which a read is evaluated, larger values produce a sparser set of possible values than +/-1 increments do.
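
The sparseness argument can be sketched in Python. This assumes a read is valid if it equals the sum of some subset of the writes issued so far, which is a simplification and not necessarily how the fuzz_dist checker models it. The function names and the sample deltas are made up for illustration.

```python
def possible_values(deltas):
    """All sums reachable by applying any subset of the given deltas."""
    sums = {0}
    for d in deltas:
        sums |= {s + d for s in sums}
    return sums

def density(deltas):
    """Fraction of integers between the min and max possible value that are reachable."""
    values = possible_values(deltas)
    span = max(values) - min(values) + 1
    return len(values) / span

small = [1, 1, -1, -1]               # +/-1 style increments
large = [1013, 2027, -4051, 8101]    # largish unique values (illustrative)

small_density = density(small)
large_density = density(large)
```

With +/-1 deltas every integer in the reachable range is a possible value, so a bogus read inside that range cannot be detected. With the larger deltas only 16 values out of a span of more than 15,000 are possible, so most bogus reads fall outside the set.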

Faults

Types

none
partition
packet
kill
pause
file corruption
targets: [:one, :all, :majority, :minority, :minority_third, :majorities_ring, :primaries]
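
A minimal Python sketch of how some of the target specs might select nodes. The definitions are assumptions: majority is taken as floor(n/2) + 1 and minority as one fewer than that. `pick_targets` is a hypothetical helper, and :minority_third, :majorities_ring and :primaries are deliberately left out because their exact semantics are not described here.

```python
import random

def majority(n):
    """Smallest node count that is more than half of n."""
    return n // 2 + 1

def pick_targets(nodes, spec, rng=random):
    """Pick fault targets for a spec (hypothetical helper, assumed semantics)."""
    if spec == "one":
        return [rng.choice(nodes)]
    if spec == "all":
        return list(nodes)
    if spec == "majority":
        return rng.sample(nodes, majority(len(nodes)))
    if spec == "minority":
        # Assumption: the largest group that is still less than a majority.
        return rng.sample(nodes, majority(len(nodes)) - 1)
    raise ValueError(f"unsupported target spec: {spec}")

nodes = ["n1", "n2", "n3", "n4", "n5"]
```

For the 5 node topologies used here, these assumptions give 3 nodes for a majority and 2 for a minority.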

Clock and time faults are not being tested. (Need real VMs.)

Duration

Currently testing with NetTickTime / 2 (e.g. 30s)

Would like to test at just less than :nodedown:

0 <= fault_duration <= NetTickTime - 1 (e.g. 59s)
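
The two duration bounds work out as follows. This small Python sketch assumes NetTickTime is 60s, which matches the 30s and 59s examples above. The function names are hypothetical.

```python
import random

NET_TICKTIME_S = 60  # assumption: 60s, consistent with the examples above

def current_max_fault_s(net_ticktime_s=NET_TICKTIME_S):
    """Upper bound currently tested: NetTickTime / 2."""
    return net_ticktime_s // 2

def desired_max_fault_s(net_ticktime_s=NET_TICKTIME_S):
    """Upper bound we would like to test: NetTickTime - 1."""
    return net_ticktime_s - 1

def random_fault_duration(max_s, rng):
    """Pick a fault duration in whole seconds with 0 <= duration <= max_s."""
    return rng.randint(0, max_s)

duration = random_fault_duration(current_max_fault_s(), random.Random(0))
```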

Verification

Uses Jepsen's set-full and an enhanced (fuzz_dist) pn-counter model/checker for verification.
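
To give a feel for what the g-set check looks for, here is a much simplified Python sketch. It is not Jepsen's set-full checker: it only compares acknowledged adds against each node's final read. The function name and the sample data are hypothetical.

```python
def lost_elements(acked_adds, final_reads):
    """Map each node to the acknowledged elements missing from its final read.

    acked_adds:  iterable of elements whose add returned :ok
    final_reads: dict of node name -> iterable of elements in that node's final read
    Nodes with nothing missing are omitted from the result.
    """
    acked = set(acked_adds)
    lost = {}
    for node, seen in final_reads.items():
        missing = acked - set(seen)
        if missing:
            lost[node] = sorted(missing)
    return lost

# Illustrative data only: n2 never received element 2.
result = lost_elements(
    acked_adds=[0, 1, 2, 3],
    final_reads={"n1": [0, 1, 2, 3], "n2": [0, 1, 3]},
)
```

An empty result means every acknowledged add is present in every node's final read.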

Anomalies

Anomalies Observed/Reproducible
	no faults	partition	packet	pause	kill (w/sync_log)	file corruption
Intra DC	none	none	none	none	yes	yes
Inter DC	yes	yes	yes	yes	yes	yes

Intra DC networking (e.g. 1 * dc1n5) is significantly more resilient than inter DC networking (e.g. 5 * dc1n1).

Observations

See initial GitHub issues:

pn-counter can lose :ok'd increments in a no fault environment
pn-counters are susceptible to partitioning
Inter DC partitioning can disrupt replication

No Faults

Intra DC

no anomalies observed
g-set: ~ 10+% write operations abort
pn-counter: ~ 2-3/1000 write operations abort

Inter DC

g-set:

all op's return :ok
but not all writes replicated

pn-counter:

all op's return :ok
but pn-counter can produce:
- impossible read values
- invalid final read values
most common failure: a value read from a node appears to have lost a previous :ok'd write to that same node
less frequently, the lost write appears to have come from a previously replicated op on another node

Partitions

Intra DC

no anomalies observed
increased client write timeouts post partition, in healed state
can crash stable_meta_data_server

Inter DC

g-set:

all op's return :ok
but most tests fail most of the time
- not all writes are replicated

pn-counter:

all op's return :ok
but most tests fail most of the time
- invalid final read values
- impossible read values
- non-monotonic reads for grow-only
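
A non-monotonic read on a grow-only counter can be detected with a check along these lines. This is a minimal Python sketch that assumes monotonicity is judged per node, in the order reads complete. The actual fuzz_dist checker may define it differently, and the function name and sample data are hypothetical.

```python
def non_monotonic_reads(reads):
    """Find reads of a grow-only counter that went backwards on the same node.

    reads: list of (node, value) pairs in completion order
    Returns a list of (node, previous_value, current_value) violations.
    """
    last = {}
    violations = []
    for node, value in reads:
        if node in last and value < last[node]:
            violations.append((node, last[node], value))
        last[node] = value
    return violations

# Illustrative data only: n1 reads 25 and then 15.
violations = non_monotonic_reads(
    [("n1", 10), ("n2", 10), ("n1", 25), ("n1", 15), ("n2", 30)]
)
```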

Packets

Intra DC

no anomalies observed
increased client write timeouts post packet disruption, in healed state

Inter DC

g-set:

all op's return :ok
but a majority of tests fail most of the time
- not all writes are replicated

pn-counter:

all op's return :ok
but most tests fail most of the time
- invalid final read values
- impossible read values

Process Pause

Intra DC

no anomalies observed
increase in timeouts post pause, in healed state

Inter DC

g-set:

not all :ok writes replicated

pn-counter:

not all :ok writes replicated

Process Kill

sync_log true

Clients will occasionally return :error for long periods, at times for the remainder of the test, even after the fault has been healed.

g-set:

partial replication of writes
increase in timeouts post kill, in healed state

pn-counter:

impossible read values
invalid final reads
non-monotonic reads for grow-only

File Corruption

Intra DC

increased latency, aborted transactions
node fails to restart

Inter DC

most tests fail

Usage

Current docker compose is problematic, so a workaround has been developed.

Jepsen's LXC environment is the best.

Also see previously mentioned Antidote GitHub issues for steps to reproduce.
