fuzz-dist'ing AntidoteDB

Configuration

  • git AntidoteDB/antidote master
    • default config (plus sync_log true when nemesis is process kill)
    • Erlang 24
  • topologies
    • intra dc -> 1 * dc1n5 (single data center with 5 nodes)
    • inter dc -> 5 * dc1n1 (5 data centers with a single node)
  • Erlang client
    • :antidote_crdt_set_aw, :antidote_crdt_counter_pn datatypes
    • static: :true transactions
    • :timeout 1_000

Workload

| Phase | Activity | Duration |
|---|---|---|
| workload | ~10 reads and 10 writes (random sequence) / sec / key | 60s |
| fault(s) | (random) | 0 <= random <= 30s |
| heal | resolve any faults | immediate |
| quiesce | low rate of reads, no writes | 10s |
| final reads | 1 read / node / key | immediate |

g-set

Generator adds a sequential integer to a single set.
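As an illustration only, a sequential-add generator could look like the Python sketch below. The operation shape (`f`, `key`, `value`) and the key name `set-1` are hypothetical and are not fuzz_dist's actual format. Sequential values make a lost write easy to spot, because it shows up as a gap in the final read.

```python
import itertools

def gset_adds(key="set-1"):
    """Yield an endless stream of add operations with sequential integers.

    The dict layout and the default key name are assumptions for illustration.
    """
    for n in itertools.count():
        yield {"f": "add", "key": key, "value": n}

# Take the first three operations from the generator.
first_three = list(itertools.islice(gset_adds(), 3))
```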

pn-counter

Generators try different counter strategies:

  • random values
  • grow-only
  • swinging between p's and n's

Uses largish unique random values to help in checking the results. E.g., when calculating all of the possible eventual states to evaluate a read, larger values produce a sparser set of possible values than +/-1 deltas.
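The effect of value size on the checker can be illustrated with a small Python sketch. It assumes, as a simplification, that a read may reflect any subset of the deltas written so far; the function name and the sample deltas are made up for the example and are not taken from the fuzz_dist checker.

```python
def possible_values(deltas):
    """Return every counter value reachable by applying any subset of deltas."""
    sums = {0}
    for d in deltas:
        # Each delta is either visible to the read or not.
        sums |= {s + d for s in sums}
    return sums

small = [1, -1, 1, 1, -1]                 # +/-1 style deltas
large = [7919, -1543, 4201, 9973, -6133]  # largish unique deltas

dense = possible_values(small)    # contiguous: every value from -2 to 3
sparse = possible_values(large)   # 32 distinct values spread over a wide range
```

With the +/-1 deltas every integer from -2 to 3 is a possible read, so a wrong value inside that range cannot be detected. With the larger deltas only 32 values out of a span of nearly 30,000 are possible, so a read such as `1` is immediately flagged as impossible.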


Faults

Types

  • none
  • partition
  • packet
  • kill
  • pause
  • file corruption
  • targets: [:one, :all, :majority, :minority, :minority_third, :majorities_ring, :primaries]
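A rough Python sketch of how some of these targets might map to node selections follows. The sizing rules are assumptions modeled on common Jepsen conventions (majority is `n // 2 + 1`, minority is the remainder, minority_third is the largest count strictly below one third); `pick_targets` is a hypothetical helper, and `:majorities_ring` and `:primaries` are not covered.

```python
import random

def pick_targets(nodes, target, rng=random):
    """Pick the nodes a fault applies to (illustrative sizing rules only)."""
    nodes = list(nodes)
    n = len(nodes)
    majority = n // 2 + 1
    if target == "one":
        return rng.sample(nodes, 1)
    if target == "all":
        return nodes
    if target == "majority":
        return rng.sample(nodes, majority)
    if target == "minority":
        return rng.sample(nodes, n - majority)
    if target == "minority_third":
        # Largest node count strictly less than a third of the cluster.
        return rng.sample(nodes, (n - 1) // 3)
    raise ValueError(f"unsupported target: {target}")
```

For the five-node topologies used here, that gives 3 nodes for a majority, 2 for a minority, and 1 for a minority third.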

Clock and time faults are not being tested. (Need real VMs.)

Duration

Currently testing with NetTickTime / 2 (e.g. 30s)

Would like to test at just less than :nodedown:

0 <= fault_duration <= NetTickTime - 1 (e.g. 59s)
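The two duration bounds can be sketched in Python as below, assuming Erlang's default `net_ticktime` of 60 seconds. The function and mode names are hypothetical.

```python
import random

NET_TICK_TIME_S = 60  # Erlang's default net_ticktime, in seconds

def fault_duration(rng, mode="current"):
    """Pick a random fault duration in whole seconds.

    "current":  0 <= d <= NetTickTime / 2  (e.g. 30s)
    "proposed": 0 <= d <= NetTickTime - 1  (e.g. 59s)
    """
    if mode == "current":
        upper = NET_TICK_TIME_S // 2
    else:
        upper = NET_TICK_TIME_S - 1
    return rng.randint(0, upper)  # randint includes both endpoints
```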


Verification

Uses Jepsen's set-full and an enhanced (fuzz_dist) pn-counter model/checker for verification.
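To give a feel for what a set check looks for, here is a much simplified Python sketch. It is not Jepsen's set-full implementation; the function name, the `(status, value)` history shape, and the report keys are assumptions for illustration. It compares acknowledged adds against each node's final read.

```python
def check_final_reads(adds, final_reads):
    """Compare acknowledged adds against each node's final read.

    adds:        list of (status, value); status is "ok", "fail", or "info"
    final_reads: dict mapping node name -> set of values read from that node
    """
    acked = {value for status, value in adds if status == "ok"}
    attempted = {value for _, value in adds}
    report = {}
    for node, seen in final_reads.items():
        report[node] = {
            "lost": sorted(acked - seen),            # ok'd but missing
            "unexpected": sorted(seen - attempted),  # read but never written
        }
    valid = all(not r["lost"] and not r["unexpected"] for r in report.values())
    return valid, report

adds = [("ok", 0), ("ok", 1), ("info", 2), ("ok", 3)]
finals = {"n1": {0, 1, 2, 3}, "n2": {0, 3}}
valid, report = check_final_reads(adds, finals)
```

In this example node `n2` never received the acknowledged write `1`, so the run is invalid. The indeterminate write `2` is allowed to be either present or absent.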


Anomalies

Anomalies Observed/Reproducible
| | no faults | partition | packet | pause | kill (w/sync_log) | file corruption |
|---|---|---|---|---|---|---|
| Intra DC | none | none | none | none | yes | yes |
| Inter DC | yes | yes | yes | yes | yes | yes |

Intra DC networking (e.g. 1 * dc1n5) is significantly more resilient than inter DC networking (e.g. 5 * dc1n1).

Observations

See initial GitHub issues:


No Faults

Intra DC

  • no anomalies observed
  • g-set: ~ 10+% write operations abort
  • pn-counter: ~ 2-3/1000 write operations abort

Inter DC

g-set:

  • all op's return :ok
  • but not all writes replicated

pn-counter:

  • all op's return :ok
  • but pn-counter can produce:
    • impossible read values
    • invalid final read values
  • most common failure: a value read from a node appears to have lost a previous :ok'd write to that same node
  • less frequently, the lost write appears to have come from a previously replicated op on another node

Partitions

Intra DC

  • no anomalies observed
  • increased client write timeouts post partition, in healed state
  • can crash stable_meta_data_server

Inter DC

g-set:

  • all op's return :ok
  • but most tests fail most of the time
    • not all writes are replicated

pn-counter:

  • all op's return :ok
  • but most tests fail most of the time
    • invalid final read values
    • impossible read values
    • non-monotonic reads for grow-only
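A non-monotonic read on a grow-only counter is one whose value is lower than an earlier read. A minimal Python sketch of such a check is shown below; it assumes reads are compared per node in completion order, and the function name and `(node, value)` shape are hypothetical.

```python
def non_monotonic_reads(reads):
    """Find reads that went backwards on the same node.

    reads: list of (node, value) in completion order.
    Returns a list of (node, previous_value, value) violations.
    """
    last = {}
    violations = []
    for node, value in reads:
        prev = last.get(node)
        if prev is not None and value < prev:
            violations.append((node, prev, value))
        last[node] = value
    return violations

reads = [("n1", 10), ("n2", 5), ("n1", 25), ("n2", 5), ("n1", 18)]
violations = non_monotonic_reads(reads)
```

Here `n1` reads 25 and then 18, which a grow-only counter should never allow. Repeating the same value on `n2` is fine.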

Packets

Intra DC

  • no anomalies observed
  • increased client write timeouts post packet disruption, in healed state

Inter DC

g-set:

  • all op's return :ok
  • but a majority of tests fail most of the time
    • not all writes are replicated

pn-counter:

  • all op's return :ok
  • but most tests fail most of the time
    • invalid final read values
    • impossible read values

Process Pause

Intra DC

  • no anomalies observed
  • increase in timeouts post pause, in healed state

Inter DC

g-set:

  • not all :ok writes replicated

pn-counter:

  • not all :ok writes replicated

Process Kill

sync_log true

Clients occasionally return :error for long periods, sometimes for the remainder of the test, even after the fault has been healed.

g-set:

  • partial replication of writes
  • increase in timeouts post kill, in healed state

pn-counter:

  • impossible read values
  • invalid final reads
  • non-monotonic reads for grow-only

File Corruption

Intra DC

  • increased latency, aborted transactions
  • node fails to restart

Inter DC

  • most tests fail

Usage

The current Docker Compose setup is problematic, so a workaround has been developed.

Jepsen's LXC environment is the best.

Also see previously mentioned Antidote GitHub issues for steps to reproduce.