Replication (#YG4) GET commands received before SET?

I’m stuck on Stage 13.

I’ve tried creating a handshake with the master node. The test passes some of the time. What I can see is that before a SET command reaches the replica node, a GET command is called, and it returns nil as the response. I made sure to add a lock on the getter and setter of my cache. I cannot find a way to make the GET command run after the SET has succeeded.

Here are my logs:
This image shows my tests passing. Notice that this is because the replica got the SET command before the GET.

Here is the screenshot that shows GET called before SET could be called.

And here are links to the relevant files:

Any hint or solution to tackle this problem is highly appreciated.

Thank you :grinning:

I’m stuck on the same issue. I have the same problem with GET commands possibly being called before SET commands.

Stage 12 went through without any issues, and locally it seems to be working fine.

Here is my repo:

@NexFlare / @naqet there’s a 500ms sleep between each retry there, and we’re retrying 5 times, so that’s 2.5 seconds in total - the tester assumes that should be sufficient time to have received the SET command.

I wonder if there’s an issue with a lock not being released? Can you try adding logs around when (a) a lock is acquired (b) a process is going to block waiting for a lock and (c) a lock is released? This might help pinpoint the issue.
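One way to get those three log points is to wrap the mutex so that every wait, acquire, and release is logged. This is only a minimal sketch: `loggedMutex`, `cache`, and the `who` label are hypothetical names for illustration, not taken from anyone's repo in this thread.

```go
package main

import (
	"log"
	"sync"
)

// loggedMutex is a hypothetical wrapper around sync.Mutex that logs
// when a caller starts waiting, acquires the lock, and releases it.
type loggedMutex struct {
	mu   sync.Mutex
	name string
}

func (l *loggedMutex) Lock(who string) {
	log.Printf("[%s] %s is waiting for the lock", l.name, who)
	l.mu.Lock()
	log.Printf("[%s] %s acquired the lock", l.name, who)
}

func (l *loggedMutex) Unlock(who string) {
	l.mu.Unlock()
	log.Printf("[%s] %s released the lock", l.name, who)
}

// cache is an assumed stand-in for the key-value store behind GET and SET.
type cache struct {
	lock loggedMutex
	data map[string]string
}

func newCache() *cache {
	return &cache{lock: loggedMutex{name: "cache"}, data: make(map[string]string)}
}

func (c *cache) Set(who, key, value string) {
	c.lock.Lock(who)
	defer c.lock.Unlock(who)
	c.data[key] = value
}

func (c *cache) Get(who, key string) (string, bool) {
	c.lock.Lock(who)
	defer c.lock.Unlock(who)
	v, ok := c.data[key]
	return v, ok
}

func main() {
	c := newCache()
	c.Set("replication-loop", "foo", "123")
	v, ok := c.Get("client-conn", "foo")
	log.Printf("GET foo -> %q (found=%v)", v, ok)
}
```

If a "waiting" line shows up without a matching "acquired" line, some other caller is still holding the lock.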

Same issue here. I haven’t used channels or any intentional blocking so far, and just like @naqet, everything went well until now; locally I can’t reproduce the scenario in the test. After running the tests multiple times I noticed that it is definitely a race condition: in most cases the replica manages to respond to 2 of the 3 SET commands.
@rohitpaulk could I be using locks without knowing it, i.e. is net.Conn read/write based on a lock/channel behind the scenes?

I used sync.Map as database for simplicity, so there shouldn’t be any issues with locks as far as I know.

The issue could be related to the reply you get after PSYNC… sometimes it comes out like this:

+FULLRESYNC 75cd7bc10c49047e0d163660f3b90625b1af31dc 0\r\n$88\r\nREDIS0011\xfa\tredis-ver\x057.2.0\xfa\nredis-bits\xc0@\xfa\x05ctime\xc2m\b\xbce\xfa\bused-mem°\xc4\x10\x00\xfa\baof-base\xc0\x00\xff\xf0n;\xfe\xc0\xffZ\xa2*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\n123\r\n*3\r\n$3\r\nSET\r\n$3\r\nbar\r\n$3\r\n456\r\n*3\r\n$3\r\nSET\r\n$3\r\nbaz\r\n$3\r\n789\r\n

everything gets packed together so you need to figure out a way to parse queries like this…

You can put a print statement right after the PSYNC command is sent and check the reply you get. Use fmt.Printf("%#v", string(buf[:n])) to also print the newline characters.


Ah, yep this can be one cause here. The response to PSYNC, the RDB file and the propagated commands can be received all at once. This is because there’s no “wait” step (like waiting for a response) in between - so even though they’re sent “one by one”, we can’t know if your program received them and thus have to continue sending the others. So your program might receive all at once, or one by one.

We’ll incorporate this into the instructions!

@rohitpaulk I see that in stage 13 the tests don’t boot the master the same way as in the stage 12 tests. I added some extra logging and the logs just don’t show up, although they do appear in the stage 12 tests (see screenshots). Are we supposed to block only the replica commands (like GET foo) while the handshake is still in process? Really lost here; I’ve tried to solve this all week, initially without channels, now with channels for state, and I’m at the same point. What can I do to better understand what is wrong with my code? The test doesn’t seem to be using my code for the master instance, otherwise logs would be the answer…

This is test 12 output with extra logs at the start

Hey @remuspoienar,

I understand this can be hard to debug, we’re actively thinking of ways to make our logs + instructions better. Hopefully once we’ve figured this out we’ll be able to make the experience smoother for others :slightly_smiling_face:

The test doesn’t seem to be using my code for the master instance, otherwise logs would be the answer…

In stage 13, the tester acts as the master and only executes user’s code as a replica. The reason we do this is so that we can test the replica’s behaviour and ensure it is correct & matches the official Redis specification. If we didn’t do this, you might run into false positives, which would likely cause problems in later stages.

Are we supposed to block only the replica commands(like GET foo) while the handshake is still in process?

Hmm, not really.

When a Redis server is booted as the replica, it must do two things at once:

  1. Respond to Redis clients like usual on --port
  2. Initiate the handshake with a master, and once complete, continuously listen for propagated commands on the replication connection (the same one used for the handshake)

So there’s no blocking needed per se; however, the replica does need to complete the handshake before it can start receiving propagated commands. It can start responding to regular commands from clients right away, since that isn’t blocked on anything.
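The two roles can be sketched as two goroutines sharing one store. This is a simplified illustration under several assumptions: commands are plain text lines instead of RESP, the handshake is omitted, net.Pipe stands in for real TCP connections, and the `applied` channel exists only so the demo can wait until a propagated SET has been stored. All names here are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"sync"
)

// store is the data shared by the client-facing and replication goroutines.
type store struct {
	mu   sync.RWMutex
	data map[string]string
}

func newStore() *store { return &store{data: make(map[string]string)} }

func (s *store) set(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[k] = v
}

func (s *store) get(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[k]
	return v, ok
}

// replicationLoop applies propagated "SET key value" lines from the master.
// It sends no replies; it only reports each applied key on the channel.
func replicationLoop(master net.Conn, s *store, applied chan<- string) {
	sc := bufio.NewScanner(master)
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 3 && strings.EqualFold(f[0], "SET") {
			s.set(f[1], f[2])
			applied <- f[1]
		}
	}
}

// serveClient answers "GET key" lines from a regular client connection.
func serveClient(client net.Conn, s *store) {
	sc := bufio.NewScanner(client)
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 2 && strings.EqualFold(f[0], "GET") {
			if v, ok := s.get(f[1]); ok {
				fmt.Fprintf(client, "%s\n", v)
			} else {
				fmt.Fprint(client, "(nil)\n")
			}
		}
	}
}

// demo issues a GET before and after the master propagates a SET.
func demo() (string, string) {
	s := newStore()
	masterSide, replicaSide := net.Pipe()
	clientSide, serverSide := net.Pipe()
	applied := make(chan string, 1)

	go replicationLoop(replicaSide, s, applied) // role 2: replication connection
	go serveClient(serverSide, s)               // role 1: regular clients

	replies := bufio.NewReader(clientSide)
	fmt.Fprint(clientSide, "GET foo\n")
	before, _ := replies.ReadString('\n')

	fmt.Fprint(masterSide, "SET foo 123\n")
	<-applied // wait until the replica has stored the value

	fmt.Fprint(clientSide, "GET foo\n")
	after, _ := replies.ReadString('\n')

	masterSide.Close()
	clientSide.Close()
	return strings.TrimSpace(before), strings.TrimSpace(after)
}

func main() {
	before, after := demo()
	fmt.Println("before propagation:", before)
	fmt.Println("after propagation:", after)
}
```

Neither goroutine waits on the other: the client connection is answered with whatever the store holds at that moment, which is why a GET that arrives before the propagated SET sees nil.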

Thanks @rohitpaulk
Unfortunately I’m out of options, so I’ll just switch to another track. Locally all of this works: I managed to connect 3 replicas, and they all receive propagated commands and have the data ready for Redis clients. To me this looked like an issue with the test, because I already pass the replica requirements locally, and all handshakes complete smoothly, with proper logging and attention. I wouldn’t be here if I hadn’t done that first, I assure you, since it wouldn’t be fair to complain in that case.

As for the blocking part, I meant the replicas blocking until they finish the handshake and start the replication loop, reusing the connection, as you said, which is exactly what I’ve done as well. I hoped that adding some sleep on both master and replica would stop multiple commands from arriving in the same read buffer, but of course that’s not the case since the test boots another master. Got that bit now, thanks.
But the replica still has to block if it receives a GET foo command while it’s doing the handshake. I tried running the handshake in a goroutine and also in a blocking way, and the result is the same: the GET commands are sent while the handshake is in progress.
I managed to solve that, and now I have to parse multiple commands from the same read operation in the replication loop. Every test run yields a different result there, which is why I tried delaying things.

That is a great observation. I looked into this and it seems to have solved the problem. :grinning:

I’m faced with the same issue as others, in stage 13 my replica receives all 3 requests at the same time and now I have to find a way to parse that.

I feel frustrated to have to tweak my code to pass this particular stage’s tests. It’s the only one where I’m receiving many requests in one TCP stream, and like others I noticed that it is not my master sending them, because however much delay I put between requests, it gets ignored :frowning:

I am glad it helped with your issue. That took me way too long to figure out… what worked was drawing it out on a piece of paper and tracking each and every request and response in both master and replicas with logging… another issue was that it was passing “sometimes”, which made things even more confusing… but that’s how software is, I suppose haha… it was fun nonetheless.

Can confirm that the issue was indeed the RDB file being squashed together with the SET commands. I needed to parse the remaining handshake buffer and apply the replicated commands to make it work. Thanks @nishojib!



Note: I’ve updated the title of this post to include the stage ID (#YG4). You can learn about the stages rename here: Upcoming change: Stages overhaul