I’ve tried debugging my code by adding logs at the places I suspect are causing the erroneous behaviour, but I can’t work out why it is happening.
I am getting an EOF and an empty string before the GET command is sent to the replicas to fetch the replicated messages.
Even though I passed this stage, I still get periodic errors, which I think have to do with concurrency, specifically when sending the messages out to the replica. Stage 11 works as expected with my code, but stage 12 is where I lose a lot of clarity.
We’ve pushed a fix that makes the logs a bit better here - it should now say something like Received: "" (no content received) to make it clear that the error is that we didn’t receive a response.
And re: the bug, you’re right - it is related to the handshake step. We were able to identify one bug in this section:
The Read call here is assuming that it’ll only receive the +FULLRESYNC... response, but that’s incorrect.
After a replica sends a PSYNC, it’s supposed to receive (a) the +FULLRESYNC response to the PSYNC command, (b) the RDB file contents, and (c) any propagated commands. There’s no wait period between the sending of these values, so they might arrive all at once, or they might arrive one by one (in theory you could even get “partial” reads, but I don’t think that’ll apply to our tests).
The “flaky” pass you’re seeing is most likely because in that specific test run things were just slow enough for those Read calls to return the expected messages in order. In the failure case, the first Read call might’ve received all 3 messages and thus the next Read call would’ve blocked.
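One way to avoid depending on how the bytes happen to be split across Read calls is to wrap the connection in a single bufio.Reader and parse message by message. Below is a minimal sketch in Go, not your actual code: the helper names (readLine, readRDB, readCommand, parseHandshakeStream) are made up for illustration, and it assumes the RDB file is sent as $<length>\r\n<bytes> with no trailing \r\n, and that propagated commands are RESP arrays of bulk strings. A string stands in for the connection here; in real code you would pass bufio.NewReader(conn) and keep reusing that same reader.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readLine reads one CRLF-terminated line and strips the terminator.
func readLine(r *bufio.Reader) (string, error) {
	line, err := r.ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimRight(line, "\r\n"), nil
}

// readLength parses a header such as "$5" or "*3" with the given prefix.
func readLength(r *bufio.Reader, prefix string) (int, error) {
	header, err := readLine(r)
	if err != nil {
		return 0, err
	}
	if !strings.HasPrefix(header, prefix) {
		return 0, fmt.Errorf("expected %q header, got %q", prefix, header)
	}
	n, err := strconv.Atoi(header[len(prefix):])
	if err != nil || n < 0 {
		return 0, fmt.Errorf("bad length in header %q", header)
	}
	return n, nil
}

// readRDB reads "$<len>\r\n" followed by exactly len bytes.
// Assumption: unlike a bulk string, the payload has no trailing CRLF.
func readRDB(r *bufio.Reader) ([]byte, error) {
	n, err := readLength(r, "$")
	if err != nil {
		return nil, err
	}
	buf := make([]byte, n)
	_, err = io.ReadFull(r, buf)
	return buf, err
}

// readCommand reads one RESP array of bulk strings, e.g. SET foo bar.
func readCommand(r *bufio.Reader) ([]string, error) {
	count, err := readLength(r, "*")
	if err != nil {
		return nil, err
	}
	args := make([]string, 0, count)
	for i := 0; i < count; i++ {
		n, err := readLength(r, "$")
		if err != nil {
			return nil, err
		}
		buf := make([]byte, n+2) // payload plus its trailing CRLF
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		args = append(args, string(buf[:n]))
	}
	return args, nil
}

// parseHandshakeStream consumes everything a replica receives after PSYNC,
// regardless of how many Read calls the bytes would have arrived in.
func parseHandshakeStream(data string) (string, []byte, [][]string, error) {
	r := bufio.NewReader(strings.NewReader(data)) // real code: bufio.NewReader(conn)
	resync, err := readLine(r)
	if err != nil {
		return "", nil, nil, err
	}
	rdb, err := readRDB(r)
	if err != nil {
		return resync, nil, nil, err
	}
	var cmds [][]string
	for {
		cmd, err := readCommand(r)
		if err == io.EOF {
			return resync, rdb, cmds, nil // stream ended cleanly
		}
		if err != nil {
			return resync, rdb, cmds, err
		}
		cmds = append(cmds, cmd)
	}
}

func main() {
	// All three messages arrive in one chunk; "REDIS" is a fake 5-byte RDB payload.
	stream := "+FULLRESYNC abc 0\r\n$5\r\nREDIS*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n"
	resync, rdb, cmds, err := parseHandshakeStream(stream)
	fmt.Println(resync, len(rdb), cmds, err)
}
```

Because the reader buffers whatever has arrived and each helper consumes exactly one message, it makes no difference whether the master’s three messages show up in one Read or in several.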
Hope this helps!
And thanks so much for highlighting this - I’ll keep an eye out for similar reports and see if I can add a note to the stage instructions for “Empty RDB transfer” that makes this more obvious.