Replication Stage (#XV6) Slave receive nothing on occasion

I’m stuck on Replication stage 14.

In stage 13, I have already support that the rdb file and commands may come from the TcpStream together and I can always pass the test. However, in stage 14 it seems that the slave sometimes just get nothing when the master send GetAck command. I add some logs, and everything seems fine (my read buf has length of 1024, and the stream is connecting to master).

Here are my logs in the situation that I passed the test:

[replication-14] Running tests for Replication > Stage #14: ACKs with no commands
[replication-14] Master is running on port 6379
[replication-14] $ ./spawn_redis_server.sh --port 6380 --replicaof localhost 6379
[replication-14] master: Received bytes: "*1\r\n$4\r\nPING\r\n"
[replication-14] master: Received RESP value: ["PING"]
[replication-14] Received ["PING"]
[your_program] slave connect with master: peer: 127.0.0.1:6379 local: 127.0.0.1:50914
[replication-14] master: Sent "PONG"
[replication-14] master: Sent bytes: "+PONG\r\n"
[replication-14] master: Received bytes: "*3\r\n$8\r\nREPLCONF\r\n$14\r\nlistening-port\r\n$4\r\n6380\r\n"
[replication-14] master: Received RESP value: ["REPLCONF", "listening-port", "6380"]
[replication-14] Received ["REPLCONF", "listening-port", "6380"]
[replication-14] master: Sent "OK"
[replication-14] master: Sent bytes: "+OK\r\n"
[replication-14] master: Received bytes: "*3\r\n$8\r\nREPLCONF\r\n$4\r\ncapa\r\n$6\r\npsync2\r\n"
[replication-14] master: Received RESP value: ["REPLCONF", "capa", "psync2"]
[replication-14] Received ["REPLCONF", "capa", "psync2"]
[replication-14] master: Sent "OK"
[replication-14] master: Sent bytes: "+OK\r\n"
[replication-14] master: Received bytes: "*3\r\n$5\r\nPSYNC\r\n$1\r\n?\r\n$2\r\n-1\r\n"
[replication-14] master: Received RESP value: ["PSYNC", "?", "-1"]
[replication-14] Received ["PSYNC", "?", "-1"]
[replication-14] master: Sent "FULLRESYNC 75cd7bc10c49047e0d163660f3b90625b1af31dc 0"
[replication-14] master: Sent bytes: "+FULLRESYNC 75cd7bc10c49047e0d163660f3b90625b1af31dc 0\r\n"
[replication-14] Sending RDB file...
[replication-14] master: Sent bytes: "$88\r\nREDIS0011\xfa\tredis-ver\x057.2.0\xfa\nredis-bits\xc0@\xfa\x05ctime\xc2m\b\xbce\xfa\bused-mem°\xc4\x10\x00\xfa\baof-base\xc0\x00\xff\xf0n;\xfe\xc0\xffZ\xa2"
[replication-14] Sent RDB file.
[replication-14] master: $ redis-cli REPLCONF GETACK *
[replication-14] master: Sent bytes: "*3\r\n$8\r\nREPLCONF\r\n$6\r\nGETACK\r\n$1\r\n*\r\n"
[your_program] connection to master: peer: 127.0.0.1:6379, local: 127.0.0.1:50914
[your_program] slave buf len: 1024, cursor: 0
[your_program] slave read 130 bytes from master
[your_program] get rdb file, buf remains: 37
[your_program] get command array, buf remains: 0
[your_program] get command in array: ["replconf", "getack", "*"]
[your_program] slave get command: Replconf(GetAck)
[your_program] offset is Some(0), length is 31
[your_program] slave need response: [Array([SimpleString("REPLCONF"), SimpleString("ACK"), SimpleString("0")])]
[your_program] connection to master: peer: 127.0.0.1:6379, local: 127.0.0.1:50914
[your_program] slave buf len: 1024, cursor: 0
[replication-14] master: Received bytes: "*3\r\n+REPLCONF\r\n+ACK\r\n+0\r\n"
[replication-14] master: Received RESP value: ["REPLCONF", "ACK", "0"]
[replication-14] Received ["REPLCONF", "ACK", "0"]
[replication-14] Test passed.
[replication-14] Terminating program
[your_program] slave read 0 bytes from master
[replication-14] Program terminated successfully

Here is the log in situation that I did not pass the test

remote: [replication-14] Running tests for Replication > Stage #14: ACKs with no commands
remote: [replication-14] Master is running on port 6379
remote: [replication-14] $ ./spawn_redis_server.sh --port 6380 --replicaof localhost 6379
remote: [your_program] slave connect with master: peer: 127.0.0.1:6379 local: 127.0.0.1:41210
remote: [replication-14] master: Received bytes: "*1\r\n$4\r\nPING\r\n"
remote: [replication-14] master: Received RESP value: ["PING"]
remote: [replication-14] Received ["PING"]
remote: [replication-14] master: Sent "PONG"
remote: [replication-14] master: Sent bytes: "+PONG\r\n"
remote: [replication-14] master: Received bytes: "*3\r\n$8\r\nREPLCONF\r\n$14\r\nlistening-port\r\n$4\r\n6380\r\n"
remote: [replication-14] master: Received RESP value: ["REPLCONF", "listening-port", "6380"]
remote: [replication-14] Received ["REPLCONF", "listening-port", "6380"]
remote: [replication-14] master: Sent "OK"
remote: [replication-14] master: Sent bytes: "+OK\r\n"
remote: [replication-14] master: Received bytes: "*3\r\n$8\r\nREPLCONF\r\n$4\r\ncapa\r\n$6\r\npsync2\r\n"
remote: [replication-14] master: Received RESP value: ["REPLCONF", "capa", "psync2"]
remote: [replication-14] Received ["REPLCONF", "capa", "psync2"]
remote: [replication-14] master: Sent "OK"
remote: [replication-14] master: Sent bytes: "+OK\r\n"
remote: [replication-14] master: Received bytes: "*3\r\n$5\r\nPSYNC\r\n$1\r\n?\r\n$2\r\n-1\r\n"
remote: [replication-14] master: Received RESP value: ["PSYNC", "?", "-1"]
remote: [replication-14] Received ["PSYNC", "?", "-1"]
remote: [replication-14] master: Sent "FULLRESYNC 75cd7bc10c49047e0d163660f3b90625b1af31dc 0"
remote: [replication-14] master: Sent bytes: "+FULLRESYNC 75cd7bc10c49047e0d163660f3b90625b1af31dc 0\r\n"
remote: [replication-14] Sending RDB file...
remote: [replication-14] master: Sent bytes: "$88\r\nREDIS0011\xfa\tredis-ver\x057.2.0\xfa\nredis-bits\xc0@\xfa\x05ctime\xc2m\b\xbce\xfa\bused-mem°\xc4\x10\x00\xfa\baof-base\xc0\x00\xff\xf0n;\xfe\xc0\xffZ\xa2"
remote: [replication-14] Sent RDB file.
remote: [replication-14] master: $ redis-cli REPLCONF GETACK *
remote: [replication-14] master: Sent bytes: "*3\r\n$8\r\nREPLCONF\r\n$6\r\nGETACK\r\n$1\r\n*\r\n"
remote: [your_program] connection to master: peer: 127.0.0.1:6379, local: 127.0.0.1:41210
remote: [your_program] slave buf len: 1024, cursor: 0
remote: [replication-14] Received: "" (no content received)
remote: [replication-14]            ^ error
remote: [replication-14] Error: Expected start of a new RESP value (either +, -, :, $ or *)
remote: [replication-14] Test failed
remote: [replication-14] Terminating program
remote: [your_program] slave read 0 bytes from master
remote: [replication-14] Program terminated successfully

And here’s a snippet of my code that the slave listens to master

i                s.spawn(move || {
                    let mut buf = BytesMut::zeroed(1024);
                    let mut cursor = 0;
                    let mut offset = None;
                    loop {
                        println!(
                            "connection to master: peer: {}, local: {}",
                            connect.peer_addr().unwrap(),
                            connect.local_addr().unwrap()
                        );
                        println!("slave buf len: {}, cursor: {}", buf.len(), cursor);
                        let n = connect
                            .read(&mut buf[cursor..])
                            .context("read command from master")
                            .unwrap();
                        println!("slave read {n} bytes from master");
                        if n == 0 {
                            break;
                        }
   
                        let _ = buf.split_off(n);
// new_cursor is used, when the buf do not include the whole command, so next read will start will new_cursor
                        let (protocols, new_cursor) = Decoder
                            .decode_protocol(&mut buf)
                            .context("decode input bytes to protocol")
                            .unwrap();
                        protocols
                            .into_iter()
                            .for_each(|DecodedProtocol { protocol, length }| {
                                match protocol {
                                    RedisProtocol::Rdb(_) => {
                                        // TODO: deal with rdb file
                                    }
                                    RedisProtocol::Array(a) => {
                                        let command = RedisCommand::parse_from(a)
                                            .context(
                                                "parse redis command from array of string protocol",
                                            )
                                            .unwrap();
                                        println!("slave get command: {command:?}");
                                        let response = ref_self
                                            .execute(&command, &map, None, &mut offset, length)
                                            .context("execute command from master")
                                            .unwrap();
                                        println!("slave need response: {response:?}");
                                        if matches!(
                                            command,
                                            RedisCommand::Replconf(ReplconfCommand::GetAck)
                                        ) {
                                            response.iter().for_each(|r| {
                                                connect
                                                    .write_all(&r.as_bytes())
                                                    .context("write GetAck response to master")
                                                    .unwrap();
                                            });
                                        }
                                    }
                                    _ => {
                                        eprintln!(
                                        "other redis protocols are not support as input command"
                                        )
                                    }
                                }
                            });
                        match new_cursor {
                            Some(c) => cursor = c,
                            None => {
                                buf.resize(1024, 0);
                                cursor = 0
                            }
                        }
                    }
                });

I know the code may be too complicated… I just wanna show where I print the log and basic logic. Looking forward to you help.

Sorry for the delay here, I’ll take a look at this today!

@jjakj1 took a look at this, the flakiness has to do with the fact that after the handshake the program reads a value from the master and discards it (It must be expecting FULLRESYNC + the RDB file).

I added logs to the bytes that were read:

In a failure case, here’s what the value turns out to be:

(See how the GETACK is present in that data that was discarded?)

This is a case that needs to be handled - a TCP “read” isn’t guaranteed to match a single RESP value sent by a Redis server. You might receive a partial value, or you might receive multiple values at once (you’ll need to parse the data to figure out which case it is)

Hope this helps!

Note: I’ve updated the title of this post to include the stage ID (#XV6). You can learn about the stages rename here: Upcoming change: Stages overhaul.