Redis Replication #NA2: timeout before receiving expected number of responses

warpftl · May 2, 2024, 5:41pm

I’m stuck on Redis Replication Stage 18.

After receiving the WAIT on the master, I send a REPLCONF GETACK to every replica. Then I start a thread for each replica, where I try to receive a response from it (within the timeout). The master's main thread waits until one of two conditions is met (the timeout expires or the minimum number of acks arrives) and finally responds with the current number of confirmations.

However, 95% of the time, I don't get the expected number of replica responses before the timeout expires. This happens even though I start the timer after sending the REPLCONFs to the replicas and starting the receiving threads, to account for the overhead of those steps.

Here are some logs:

[replication-18] Testing Replica : 4
[replication-18] Received ["SET", "baz", "789"]
[your_program] recvd from 47954: 99
[replication-18] Received ["REPLCONF", "GETACK", "*"]
[your_program] timeout from 47956 with 500ms
[your_program] timeout from 47944 with 500ms
[your_program] timeout from 47930 with 2000ms
[your_program] timeout from 47956 with 2000ms
[your_program] timeout from 47954 with 2000ms
[your_program] Exception in thread Thread-1 (respond):
[replication-18] Expected 3, got 1
[replication-18] Test failed (try setting 'debug: true' in your codecrafters.yml to see more details)
[your_program] Traceback (most recent call last):
[your_program]   File "/usr/local/lib/python3.12/threading.py", line 1073, in _bootstrap_inner

And here’s a snippet of my code:
Handling the WAITs

min_acks, timeout = int(params[1]), int(params[2])

confirmed = MutableInteger()
required_offset = writes.offset + (writes.num - 1) * ACK_BYTES_SIZE

for repl_conn in replicas:
    repl_conn.send(encode_resp(["REPLCONF", "GETACK", "*"]))

for repl_conn in replicas:
    t = threading.Thread(
        target=recv_repl_ack,
        args=(repl_conn, required_offset, confirmed, timeout),
    )
    t.start()

end = time.time() + (timeout / 1000)

while (time.time() < end) and (confirmed.val < min_acks):
    time.sleep(0.2)

conn.send(integer(confirmed.val))
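
For anyone checking the `required_offset` arithmetic above, here is a minimal sketch of how the byte counts come out, assuming commands are propagated as standard RESP arrays of bulk strings. `encode_command` is a hypothetical stand-in for `encode_resp`, and it is only an assumption that `ACK_BYTES_SIZE` is meant to be the encoded size of `REPLCONF GETACK *`.

```python
def encode_command(parts: list[str]) -> bytes:
    """Encode a command as a RESP array of bulk strings (hypothetical helper)."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        # Each bulk string is: $<length>\r\n<data>\r\n
        out.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(out)


# Sizes of the two commands that appear in the logs above.
getack_size = len(encode_command(["REPLCONF", "GETACK", "*"]))
set_size = len(encode_command(["SET", "baz", "789"]))
print(getack_size, set_size)  # prints: 37 31
```

Two 31-byte SETs and one 37-byte GETACK add up to 99, the same number as in the `recvd from 47954: 99` log line, though whether that is how the replica arrived at it is a guess.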

Receiving acks from replicas

def recv_repl_ack(
    conn: socket.socket,
    required_offset: int,
    confirmed: MutableInteger,
    timeout: int,
):
    try:
        conn.settimeout(timeout / 1000)
        response = conn.recv(DEFAULT_RECV_BUFFER)
    except TimeoutError:
        print(f"timeout from {conn.getpeername()[1]} with {timeout}ms")
    else:
        response = decode_resp(response)[0]
        offset = int(response[2])

        print(f"recvd from {conn.getpeername()[1]}: {offset}")

        if offset >= required_offset:
            confirmed.val += 1
    finally:
        conn.settimeout(None)
        return

Extras:

class MutableInteger:
    def __init__(self) -> None:
        self.val = 0
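
As a possible alternative to `MutableInteger` plus the `time.sleep(0.2)` polling loop, here is a minimal sketch of a counter guarded by a `threading.Condition`. `AckCounter` is a hypothetical name, not something from the code above. The idea is that the increment happens under a lock and the waiter wakes as soon as enough acks arrive, rather than on the next 200ms tick.

```python
import threading


class AckCounter:
    """Thread-safe ack counter that can be waited on (hypothetical helper)."""

    def __init__(self) -> None:
        self._count = 0
        self._cond = threading.Condition()

    def increment(self) -> None:
        # Called from a receiving thread when a replica's ack qualifies.
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait_for(self, min_acks: int, timeout_s: float) -> int:
        # Block until min_acks is reached or timeout_s elapses,
        # then return however many acks have been counted so far.
        with self._cond:
            self._cond.wait_for(lambda: self._count >= min_acks, timeout=timeout_s)
            return self._count
```

In the WAIT handler this would look roughly like `conn.send(integer(counter.wait_for(min_acks, timeout / 1000)))`, with each receiving thread calling `counter.increment()` instead of `confirmed.val += 1`.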

gnomeby · May 3, 2024, 12:34pm

It looks like the problem starts at the 17th step:

warpftl · May 3, 2024, 1:36pm

I am not sure if this is related. My stage 17 was passing consistently when I was on it and the recent change by CodeCrafters means the tests are run in reverse order, so it only runs tests for stage 17 when stage 18 passes.

My problem is that ~95% of the time, I don’t get enough responses from the replicas before they time out. Either

I get 0 responses on the first WAIT and it fails
Or it passes the first WAIT test case but I don’t get the n responses needed for the second test case.
Or all test cases (and all previous stages) pass (happened only twice so far).

rohitpaulk · May 4, 2024, 1:29pm

We’ll add better logs for this stage that convey when a replica receives a GETACK (and whether it intends to respond with an ACK or not).

warpftl · May 4, 2024, 2:19pm

That would be very helpful, thanks!

From my implementation, it seems that all replicas receive the GETACKs, but not all of those that intend to reply do so before their associated receiving thread times out.

I was wondering whether it is a guarantee that all the intended replicas will reply before the timeout specified by the WAIT?

rohitpaulk · May 6, 2024, 2:04pm

From my implementation, it seems that all replicas receive the GETACKs, but not all of those that intend to reply do so before their associated receiving thread times out.

I was wondering whether it is a guarantee that all the intended replicas will reply before the timeout specified by the WAIT?

Hmm, the replicas that intend to respond should respond immediately. The WAIT timeout is long enough (500ms?) that in practice there should be ample time for replicas to respond.

We do run these tests against an official Redis instance 100s of times on each release to make sure that there aren’t obvious bugs, but maybe there’s some other behaviour here that we aren’t accounting for.

When we add logs, I’ll try to get timestamps added too – that should make it clear when a replica receives GETACK, and when it responds

rohitpaulk · June 3, 2024, 6:51pm

These have now been added!

rohitpaulk · June 3, 2024, 6:52pm

Going to mark this as solved for now since we’ve improved the logs. If anyone’s still running into this, please share code + logs and we’ll take a look!

system · June 4, 2024, 3:27am

Note: I’ve updated the title of this post to include the stage ID (#NA2). You can learn about the stages rename here: Upcoming change: Stages overhaul.

nissenyeh · June 7, 2024, 7:24pm

I ran into the same problem.

The replicas don’t respond until after the WAIT timeout has expired, so the master replies with 0.

system · June 12, 2024, 7:24pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.
