Zig: when I added REPLCONF GETACK later in #XV6, #HD5 test fails with extra data

I’m stuck between these stages:

  • Single-replica propagation #ZN8
  • Multi Replica Command Propagation #HD5
  • ACKs with no commands #XV6
  • ACKs with Commands #YD3

Initially I’ve passed #HD5 and #ZN8 just fine.

As part of next stages, I’ve implemented the support for REPLCONF GETACK, ie. master will just send the GETACK command to replica just after the empty RDB. This, however, broke test for #HD5:

remote: [tester::#HD5] [handshake] replica-1: Received bytes: "$88\r\nREDIS0011\xfa\tredis-ver\x057.2.0\xfa\nredis-bits\xc0@\xfa\x05ctime\xc2m\b\xbce\xfa\bused-mem°\xc4\x10\x00\xfa\baof-base\xc0\x00\xff\xf0n;\xfe\xc0\xffZ\xa2"
remote: [tester::#HD5] Found extra data: "*3\r\n+REPLCONF\r\n+GETACK\r\n+*\r\n"
remote: [tester::#HD5] Test failed
remote: [tester::#HD5] Terminating program

I’ve tried adding sleep before master sends GETACK but then I get another problem: while the test is trying to establish the second replica, it won’t get the PONG from master:

remote: [tester::#HD5] [handshake] replica-2: Sent bytes: "*1\r\n$4\r\nPING\r\n"
remote: [tester::#HD5] Received: "" (no content received)
remote: [tester::#HD5]            ^ error
remote: [tester::#HD5] Error: Expected start of a new RESP2 value (either +, -, :, $ or *)
remote: [tester::#HD5] Test failed
remote: [tester::#HD5] Terminating program

Strangely I can’t reproduce this second issue locally. My master will just always respond to PINGs, even in the middle of the sleep (after all, it’s just the existing master-replica thread that is sleeping, right?)…

I’ve also tried to restart the repo since I’ve noticed that my first attempt was still using Zig 0.12 (locally I’ve been using 0.13 all along.). This is why in my current repo you might not see the further stages as completed if you look right now.

Here are my logs, without the sleep:

[...]
remote: [compile] Moved ./.codecrafters/run.sh → ./your_program.sh
remote: [compile] Compilation successful.
remote:
remote: Debug = true
remote:
remote: [tester::#HD5] Running tests for Stage #HD5 (Replication - Multi Replica Command Propagation)
remote: [tester::#HD5] $ ./your_program.sh --port 6379
remote: [your_program] info: PID: 680
remote: [your_program] info: app_options.address: 127.0.0.1
remote: [your_program] info: app_options.port: 6379
remote: [your_program] info: app_options.debug: true
remote: [your_program] info: app_options.verbose: true
remote: [your_program] info: app_options.thread_buff_size: 3145728
remote: [your_program] info: app_options.server_buff_size: 104857600
remote: [your_program] info: app_options.max_dump_bytes: 10485760
remote: [your_program] info: app_options.max_threads: null
remote: [your_program] info: app_options.dir: null
remote: [your_program] info: app_options.dbfilename: dump.rdb
remote: [your_program] info: app_options.redis_replicaof_host: null
remote: [your_program] info: app_options.redis_replicaof_port: 6379
remote: [tester::#HD5] [handshake] Creating replica: 1
remote: [tester::#HD5] [handshake] replica-1: $ redis-cli PING
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*1\r\n$4\r\nPING\r\n"
remote: [your_program] info: [id=0001 j=0 r=cmd t=681 a=127.0.0.1:60832] * command channel listening
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] * command channel listening
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] < Elem{[.{"PING"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] > Elem{"PONG"}
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+PONG\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "PONG"
remote: [tester::#HD5] [handshake] Received "PONG"
remote: [tester::#HD5] [handshake] replica-1: > REPLCONF listening-port 6380
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*3\r\n$8\r\nREPLCONF\r\n$14\r\nlistening-port\r\n$4\r\n6380\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] < Elem{[.{"REPLCONF"},.{"listening-port"},.{"6380"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] > Elem{"OK"}
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+OK\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "OK"
remote: [tester::#HD5] [handshake] Received "OK"
remote: [tester::#HD5] [handshake] replica-1: > REPLCONF capa psync2
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*3\r\n$8\r\nREPLCONF\r\n$4\r\ncapa\r\n$6\r\npsync2\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] < Elem{[.{"REPLCONF"},.{"capa"},.{"psync2"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] > Elem{"OK"}
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+OK\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "OK"
remote: [tester::#HD5] [handshake] Received "OK"
remote: [tester::#HD5] [handshake] replica-1: > PSYNC ? -1
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*3\r\n$5\r\nPSYNC\r\n$1\r\n?\r\n$2\r\n-1\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] < Elem{[.{"PSYNC"},.{"?"},.{"-1"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] > Elem{"FULLRESYNC 91e84..(37 more)"}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:60838] * re-using connection for replica propagation
remote: [your_program] info: [id=0002 j=2 r=m2r t=682 a=127.0.0.1:60838] >> [88]|REDIS0011\xfa\tredis-ver7.2.0\xfa\nredis-bits\xc0@\xfactime\xc2..|
remote: [your_program] info: [id=0002 j=2 r=m2r t=682 a=127.0.0.1:60838] * master2replica channel listening
remote: [your_program] info: [id=0002 j=2 r=m2r t=682 a=127.0.0.1:60838] > Elem{[.{"REPLCONF"},.{"GETACK"},.{"*"}]}
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+FULLRESYNC 91e8479d4c20a0205968aac2f4807e36f4d291af 0\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "FULLRESYNC 91e8479d4c20a0205968aac2f4807e36f4d291af 0"
remote: [tester::#HD5] [handshake] Received "FULLRESYNC 91e8479d4c20a0205968aac2f4807e36f4d291af 0"
remote: [tester::#HD5] [handshake] Reading RDB file...
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "$88\r\nREDIS0011\xfa\tredis-ver\x057.2.0\xfa\nredis-bits\xc0@\xfa\x05ctime\xc2m\b\xbce\xfa\bused-mem°\xc4\x10\x00\xfa\baof-base\xc0\x00\xff\xf0n;\xfe\xc0\xffZ\xa2"
remote: [tester::#HD5] Found extra data: "*3\r\n+REPLCONF\r\n+GETACK\r\n+*\r\n"
remote: [tester::#HD5] Test failed
remote: [tester::#HD5] Terminating program
remote: [your_program] info: [id=0001 j=0 r=cmd t=681 a=127.0.0.1:60832] * connection closed by peer
remote: [tester::#HD5] Program terminated successfully
[...]

…and with the sleep:

[...]
remote: [compile] Moved ./.codecrafters/run.sh → ./your_program.sh
remote: [compile] Compilation successful.
remote:
remote: Debug = true
remote:
remote: [tester::#HD5] Running tests for Stage #HD5 (Replication - Multi Replica Command Propagation)
remote: [tester::#HD5] $ ./your_program.sh --port 6379
remote: [your_program] info: PID: 680
remote: [your_program] info: app_options.address: 127.0.0.1
remote: [your_program] info: app_options.port: 6379
remote: [your_program] info: app_options.debug: true
remote: [your_program] info: app_options.verbose: true
remote: [your_program] info: app_options.thread_buff_size: 3145728
remote: [your_program] info: app_options.server_buff_size: 104857600
remote: [your_program] info: app_options.max_dump_bytes: 10485760
remote: [your_program] info: app_options.max_threads: null
remote: [your_program] info: app_options.dir: null
remote: [your_program] info: app_options.dbfilename: dump.rdb
remote: [your_program] info: app_options.redis_replicaof_host: null
remote: [your_program] info: app_options.redis_replicaof_port: 6379
remote: [tester::#HD5] [handshake] Creating replica: 1
remote: [tester::#HD5] [handshake] replica-1: $ redis-cli PING
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*1\r\n$4\r\nPING\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] * command channel listening
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] < Elem{[.{"PING"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] > Elem{"PONG"}
remote: [your_program] info: [id=0001 j=0 r=cmd t=681 a=127.0.0.1:41954] * command channel listening
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+PONG\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "PONG"
remote: [tester::#HD5] [handshake] Received "PONG"
remote: [tester::#HD5] [handshake] replica-1: > REPLCONF listening-port 6380
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*3\r\n$8\r\nREPLCONF\r\n$14\r\nlistening-port\r\n$4\r\n6380\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] < Elem{[.{"REPLCONF"},.{"listening-port"},.{"6380"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] > Elem{"OK"}
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+OK\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "OK"
remote: [tester::#HD5] [handshake] Received "OK"
remote: [tester::#HD5] [handshake] replica-1: > REPLCONF capa psync2
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*3\r\n$8\r\nREPLCONF\r\n$4\r\ncapa\r\n$6\r\npsync2\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] < Elem{[.{"REPLCONF"},.{"capa"},.{"psync2"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] > Elem{"OK"}
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+OK\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "OK"
remote: [tester::#HD5] [handshake] Received "OK"
remote: [tester::#HD5] [handshake] replica-1: > PSYNC ? -1
remote: [tester::#HD5] [handshake] replica-1: Sent bytes: "*3\r\n$5\r\nPSYNC\r\n$1\r\n?\r\n$2\r\n-1\r\n"
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] < Elem{[.{"PSYNC"},.{"?"},.{"-1"}]}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] > Elem{"FULLRESYNC 7cc7f..(37 more)"}
remote: [your_program] info: [id=0002 j=1 r=cmd t=682 a=127.0.0.1:41968] * re-using connection for replica propagation
remote: [your_program] info: [id=0002 j=2 r=m2r t=682 a=127.0.0.1:41968] >> [88]|REDIS0011\xfa\tredis-ver7.2.0\xfa\nredis-bits\xc0@\xfactime\xc2..|
remote: [your_program] info: [id=0002 j=2 r=m2r t=682 a=127.0.0.1:41968] * master2replica channel listening
remote: [your_program] info: [id=0002 j=2 r=m2r t=682 a=127.0.0.1:41968] * sleeping for 5000ms
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "+FULLRESYNC 7cc7fa9d18da3544d06b317b2593afcfce295195 0\r\n"
remote: [tester::#HD5] [handshake] replica-1: Received RESP simple string: "FULLRESYNC 7cc7fa9d18da3544d06b317b2593afcfce295195 0"
remote: [tester::#HD5] [handshake] Received "FULLRESYNC 7cc7fa9d18da3544d06b317b2593afcfce295195 0"
remote: [tester::#HD5] [handshake] Reading RDB file...
remote: [tester::#HD5] [handshake] replica-1: Received bytes: "$88\r\nREDIS0011\xfa\tredis-ver\x057.2.0\xfa\nredis-bits\xc0@\xfa\x05ctime\xc2m\b\xbce\xfa\bused-mem°\xc4\x10\x00\xfa\baof-base\xc0\x00\xff\xf0n;\xfe\xc0\xffZ\xa2"
remote: [tester::#HD5] [handshake] Received RDB file
remote: [tester::#HD5] Creating replica: 2
remote: [tester::#HD5] [handshake] replica-2: $ redis-cli PING
remote: [tester::#HD5] [handshake] replica-2: Sent bytes: "*1\r\n$4\r\nPING\r\n"
remote: [tester::#HD5] Received: "" (no content received)
remote: [tester::#HD5]            ^ error
remote: [tester::#HD5] Error: Expected start of a new RESP2 value (either +, -, :, $ or *)
remote: [tester::#HD5] Test failed
remote: [tester::#HD5] Terminating program
remote: [your_program] info: [id=0001 j=0 r=cmd t=681 a=127.0.0.1:41954] * connection closed by peer
remote: [tester::#HD5] Program terminated successfully
[...]

My codebase is already relatively complex (when implementing I’m trying to add extra challenges for me, outside strictly CC, such as stable memory management, some configurability, testability, bit more fleshed out logging and command handling.,…). So I’ll describe my implementation first.

The server operates a task pool; this is basically a thread pool where a task is thread with assigned fixed-buffer memory allocator (yeah, Tiger Style influence! :-D), logger and references to server as well as ability to pass connection to another task.

Initially a master server will just take commands, one task per thread (handle_cmd() below), but if a successful handshake is completed, the connection is saved (in a map of active replicas) and handed off to another task (handle_master2replica()), which is then used to send the RDB file, do propagations as well as GETACK’s.

A replica works similarly except that it will pro-actively start another task (handle_replica2master() which takes care of the handshake and proceeds to respond to propagations, and/or GETACK’s, responding selectively as per spec.

(Note that the replica is still churning on command connections. Ie. I assumed that a replica must still respond to normal commands even in the middle of handshake–it will deny REPLCONF, though, ie. we can’t have chain of replicas).

So with that, here are the relevant handlers:

/// Handshake with master at *conn* and start handling its commands
/// as if it was client.
pub fn handle_replica2master(
    sess: *Sess,
    server: *Server,
    tp: *task.TaskPool,
) !void {
    _ = tp;
    try handshake(sess, server.replication.replica, server.config.port);
    sess.logger.info("* replica2master channel listening", .{});
    while (true) {
        const cmd_elem = try sess.rcv_elem() orelse break;
        const cmd = try sess.parse_cmd_or_err(cmd_elem) orelse break;
        const cmd_result = cmd.run(sess, server);
        if (!cmd.should(.ack_master)) continue;
        try sess.send_elem(cmd_result.elem);
    }
    sess.destroy();
}

pub fn handle_master2replica(
    sess: *Sess,
    server: *Server,
    tp: *task.TaskPool,
) !void {
    _ = tp;
    try sess.send_blob(EMPTY_RDB);
    sess.logger.info("* master2replica channel listening", .{});
    {
        // get an ACK
        util.sleep_s(5000, sess.logger);
        const getack_elem = Elem{ .arr = &.{
            Elem{ .str = "REPLCONF" },
            Elem{ .str = "GETACK" },
            Elem{ .str = "*" },
        } };
        try sess.send_elem(getack_elem);
        const maybe_elem = sess.rcv_elem() catch null;
        if (maybe_elem == null) {
            sess.logger.warn("* dropping broken replica: {d:0>4}", .{sess.id});
            _ = server.replication.master.replicas.remove(sess.id);
            sess.destroy();
            return;
        }
        try _assert_arr_len(maybe_elem.?, 3);
        try _assert_eql_at(maybe_elem.?, 0, "REPLCONF");
        try _assert_eql_at(maybe_elem.?, 1, "ACK");
    }
}

/// Handle client's RESP commands at *conn*.
pub fn handle_cmd(
    sess: *Sess,
    server: *Server,
    tp: *task.TaskPool,
) !void {
    sess.logger.info("* command channel listening", .{});
    while (true) {
        const cmd_elem = try sess.rcv_elem() orelse break;
        const cmd = try sess.parse_cmd_or_err(cmd_elem) orelse break;
        if (cmd.should(.propagate) and server.replication == .master) server.propagate(cmd_elem.*);
        const cmd_result = cmd.run(sess, server);
        try sess.send_elem(cmd_result.elem);

        // keep stream as new replica queue?
        if (cmd.should(.start_replica)) {
            if (cmd_result.elem == .err) {
                sess.logger.warn("* replica propagation setup failed", .{});
                break;
            }
            sess.logger.info("* re-using connection for replica propagation", .{});
            try tp.queue(handle_master2replica, sess.conn, .m2r);
            return; // not going to destroy the conn
        }
    }
    sess.destroy();
}

I’ve noticed in another thread that the replica should do the handshake while blocking, so I’ll try to fix that later today.

Not that I can see how that bit would break the test… except maybe by causing some timing conditions…

I’ve noticed in another thread that the replica should do the handshake while blocking, so I’ll try to fix that later today.

I tried this and no luck: I still get the same error where the second replica won’t get thre PONG response from the server.

I’m out of ideas…

I’ll take a look shortly.

1 Like

Hi @kelduben-cc, if you haven’t reached the WAIT stages yet, the master doesn’t need to send GETACK at this point. Currently, it’s the replicas that need to support GETACK.

You can safely postpone implementing GETACK on the master side for now. Let me know if you need further clarification.

Ahh, I totally missed that detail.

Thank you so much!

1 Like

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.