[Python][#MG6] Clone command - stuck on tree checkout

I’m facing an error during the tree checkout step, almost at the end of the clone operation and would like some help debugging.

For context, cloning is already working for a small repository :tada:

(.venv) ➜  codecrafters-git-python git:(master) ✗ rm -rf foobar && ./your_program.sh clone https://github.com/eaverdeja/foobar.git foobar
(.venv) ➜  codecrafters-git-python git:(master) ✗ tree foobar
foobar
├── a_bigger_file.py
├── bar
├── foo
└── foobar
    └── barfoo

2 directories, 4 files
(.venv) ➜  codecrafters-git-python git:(master) ✗ cat foobar/foobar/barfoo
barfoo!

It doesn’t work, however, for one of the sample repositories:

$ rm -rf git-sample-1 && ./your_program.sh clone https://github.com/codecrafters-io/git-sample-1 git-sample-1

...
  File "/Users/eaverdeja/Projects/codecrafters-git-python/app/commands.py", line 277, in _read_git_object
    with open(obj_path, "rb") as file:
         ~~~~^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.git/objects/78/1007c281d79173c6c166b753376147af4ace50'

It fails specifically when trying to look for the entry corresponding to create_content.py. For some reason, the hash I compute from the data incoming from the packfile is different than the hash for this entry on the tree object.

What’s weird is that my implementation of hash-object seems correct (which is reused when creating the objects incoming from the packfile).

(.venv) ➜  git-sample-1 git:(master) ../codecrafters-git-python/your_program.sh hash-object create_content.py
781007c281d79173c6c166b753376147af4ace50
(.venv) ➜  git-sample-1 git:(master) g hash-object create_content.py
781007c281d79173c6c166b753376147af4ace50

I’m kinda lost here, so help is appreciated. Here’s my code:
https://github.com/eaverdeja/codecrafters-git-python


Hi @eaverdeja, could you confirm if the create_content.py file you cloned is identical to the original version?

Here’s the file I get from the packfile:

And here’s the original version:

I’m not sure if the difference is just formatting and the files are essentially equal or if this means they are different somehow.

A disclaimer is that the packfile parsing was mostly claude-generated since I was struggling a lot with it. I reviewed the code, but I might have missed something there.

Another thing to note is that my small sample repo also has this very same file and I didn’t run into issues cloning it. I gave it a different name, but it’s the same file:

@eaverdeja I cannot access your repo. Is it maybe private? This one: https://github.com/eaverdeja/codecrafters-git-python

@eaverdeja Your issue must be with the ref deltas.

The Git protocol does not only send blobs and trees; it also sends ref deltas (REF_DELTA objects). Although these ultimately resolve to a regular object such as a blob, they have a parent (base) object and are designed to minimize storage requirements by referencing existing data instead of storing duplicate content.

See the ref deltas highlighted in yellow and the final version highlighted in blue.

You can check the packfile and its hashes with git verify-pack -v pack-hash.pack (in the image, this is the line after the ls).

If you need help understanding them, here is a guide: 5. Parsing REF_DELTA | Git-protocol.
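
To make the idea concrete, here is a rough sketch of how a packfile entry header can be parsed and where a REF_DELTA shows up. It is not code from either of our repos: the function names are made up for illustration, and it assumes the whole packfile is already in memory as bytes.

```python
# Object type numbers used in packfile entry headers.
OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag", 6: "ofs_delta", 7: "ref_delta"}


def read_object_header(buf: bytes, pos: int) -> tuple[str, int, int]:
    """Parse the type/size header of one packfile entry starting at pos.

    Returns (type_name, uncompressed_size, position_after_header).
    """
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07  # bits 4-6 hold the object type
    size = byte & 0x0F             # low 4 bits are the first size bits
    shift = 4
    while byte & 0x80:             # MSB set means another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return OBJ_TYPES[obj_type], size, pos


def read_ref_delta_base(buf: bytes, pos: int) -> tuple[str, int]:
    """For a ref_delta entry, the 20-byte base SHA-1 follows the header."""
    return buf[pos:pos + 20].hex(), pos + 20
```

After the base SHA comes the zlib-compressed delta data, which still has to be applied to the base object before you get a real blob or tree.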


Just made the repo public, sorry about that.

I’ll take a look at your suggestion tomorrow, thank you!

@i27ae15 you were right - part of the issue was that I wasn’t handling REF_DELTA objects.

I managed to progress a bit further and can:

  1. Resolve deltas for my tiny toy repo and still clone it successfully
  2. Resolve a few deltas for the sample repo used by the tester

However, I still fail midway through resolving deltas on the sample repo. For several deltas, the base SHA isn’t found.

Here are the logs from my program:

Number of base objects: 255
Number of delta objects to resolve: 9

Iteration 1:
Attempting to resolve delta with base SHA: 6a9f27650d6d08a9307f28a3e1697b32dc250a8a
Successfully resolved delta, new object SHA: 5a201b017b9c92745491d72a7301de7ec773f782
Attempting to resolve delta with base SHA: 5a201b017b9c92745491d72a7301de7ec773f782
Successfully resolved delta, new object SHA: 781007c281d79173c6c166b753376147af4ace50
Attempting to resolve delta with base SHA: 4913deec9d3e73f35eed54d56f970d612af219e6
Base object not found for SHA: 4913deec9d3e73f35eed54d56f970d612af219e6
Attempting to resolve delta with base SHA: 23a4932a19c2fbdabee41dbad4b85da7b126e4c8
Base object not found for SHA: 23a4932a19c2fbdabee41dbad4b85da7b126e4c8
Attempting to resolve delta with base SHA: d8abbebf29755ea44ce29796cfec22454d9e5015
Base object not found for SHA: d8abbebf29755ea44ce29796cfec22454d9e5015
Attempting to resolve delta with base SHA: c540373c3c78ea9ee2f132e9db4a6969eb78e3ae
Base object not found for SHA: c540373c3c78ea9ee2f132e9db4a6969eb78e3ae
Attempting to resolve delta with base SHA: 69c5abd9ee0dc23be6956f80aac885f8c361b6c6
Base object not found for SHA: 69c5abd9ee0dc23be6956f80aac885f8c361b6c6
Attempting to resolve delta with base SHA: 7f1c8ab9f7aa0c000ed27975b403789ccbc82f2a
Base object not found for SHA: 7f1c8ab9f7aa0c000ed27975b403789ccbc82f2a
Attempting to resolve delta with base SHA: d9a9200e88020014481414b104789c4bc9cf4faa
Base object not found for SHA: d9a9200e88020014481414b104789c4bc9cf4faa
Remaining deltas after iteration: 7

Iteration 2:
Attempting to resolve delta with base SHA: 4913deec9d3e73f35eed54d56f970d612af219e6
Base object not found for SHA: 4913deec9d3e73f35eed54d56f970d612af219e6
Attempting to resolve delta with base SHA: 23a4932a19c2fbdabee41dbad4b85da7b126e4c8
Base object not found for SHA: 23a4932a19c2fbdabee41dbad4b85da7b126e4c8
Attempting to resolve delta with base SHA: d8abbebf29755ea44ce29796cfec22454d9e5015
Base object not found for SHA: d8abbebf29755ea44ce29796cfec22454d9e5015
Attempting to resolve delta with base SHA: c540373c3c78ea9ee2f132e9db4a6969eb78e3ae
Base object not found for SHA: c540373c3c78ea9ee2f132e9db4a6969eb78e3ae
Attempting to resolve delta with base SHA: 69c5abd9ee0dc23be6956f80aac885f8c361b6c6
Base object not found for SHA: 69c5abd9ee0dc23be6956f80aac885f8c361b6c6
Attempting to resolve delta with base SHA: 7f1c8ab9f7aa0c000ed27975b403789ccbc82f2a
Base object not found for SHA: 7f1c8ab9f7aa0c000ed27975b403789ccbc82f2a
Attempting to resolve delta with base SHA: d9a9200e88020014481414b104789c4bc9cf4faa
Base object not found for SHA: d9a9200e88020014481414b104789c4bc9cf4faa
Remaining deltas after iteration: 7

Failed to resolve these deltas' base SHAs:
- 4913deec9d3e73f35eed54d56f970d612af219e6
- 23a4932a19c2fbdabee41dbad4b85da7b126e4c8
- d8abbebf29755ea44ce29796cfec22454d9e5015
- c540373c3c78ea9ee2f132e9db4a6969eb78e3ae
- 69c5abd9ee0dc23be6956f80aac885f8c361b6c6
- 7f1c8ab9f7aa0c000ed27975b403789ccbc82f2a
- d9a9200e88020014481414b104789c4bc9cf4faa

So it managed to resolve 2 deltas, but not the other 7. I tried using g cat-file -p on these hashes on the repo but all of them yielded fatal: Not a valid object name {hash}. The successfully resolved deltas evaluated to the contents of create_content.py.
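
The multi-pass idea behind those logs can be sketched roughly like this. It is a simplified illustration, not the code in my repo: `resolve_ref_deltas`, `object_sha` and the `apply_delta` callback are hypothetical names, and the sketch assumes all non-delta objects are already in a dict keyed by SHA.

```python
import hashlib


def object_sha(obj_type: str, data: bytes) -> str:
    """SHA-1 of a git object: header '<type> <size>\\0' followed by the content."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


def resolve_ref_deltas(objects, deltas, apply_delta):
    """Resolve ref deltas in passes until nothing more can be resolved.

    objects: dict mapping sha -> (type, data) for already-known objects
    deltas: list of (base_sha, delta_bytes)
    apply_delta: callable (base_data, delta_bytes) -> resolved data (assumed)
    Returns the deltas whose base was never found.
    """
    pending = list(deltas)
    while pending:
        unresolved = []
        for base_sha, delta in pending:
            base = objects.get(base_sha)
            if base is None:
                # The base may itself be a delta resolved in a later pass.
                unresolved.append((base_sha, delta))
                continue
            base_type, base_data = base
            data = apply_delta(base_data, delta)
            # The resolved object inherits the type of its base.
            objects[object_sha(base_type, data)] = (base_type, data)
        if len(unresolved) == len(pending):
            break  # no progress in this pass, so stop retrying
        pending = unresolved
    return pending
```

If the returned list is non-empty even though the pack is self-contained, the base SHAs were probably read from the wrong position in the packfile rather than being truly missing.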

Link to the relevant piece of code:

Here’s me trying to look for the hashes in the packfile (without success lol)

I feel like I’m close! Almost there.

@eaverdeja The issue is how you are decompressing the data from the ref delta in the _read_compressed_data method. You have this code:

    def _read_compressed_data(self) -> bytes:
        """Read zlib compressed data from the current position."""
        decompressor = zlib.decompressobj()
        data = b""
        while True:
            byte = self.stream.read(1)
            if not byte: # This might be the problem
                break
            try:
                data += decompressor.decompress(byte)
                if decompressor.eof:
                    break
            except zlib.error:
                self.stream.seek(-1, 1)  # Go back one byte
                break

        return data

You keep reading until there are no more bytes, which can cause problems because, once the ref delta is finished, there can still be more bytes after it. You can see this here:

    def _parse_object(self) -> PackObject:
        """Parse a single object from the packfile."""
        start_offset = self.offset
        size, obj_type = self._read_varint()

        # For delta objects, read the base offset/reference
        base_sha = None
        if obj_type == "ref_delta":
            base_sha = self._read_bytes(20).hex()  # Base object SHA-1
            print("base sha1 from ref_delta: ", base_sha)
            ref_data = self._read_compressed_data()
            print("-" * 50)
            print(ref_data)
            print("-" * 50)

Which prints:

base sha1 from ref_delta:  6a9f27650d6d08a9307f28a3e1697b32dc250a8a
--------------------------------------------------
b'\x9f\n\xb6\n\x908\x7f    "humpty",\n    "dumpty",\n    "horsey",\n    "donkey",\n    "yikes",\n    "monkey",\n    "doo",\n    "scooby",\n    "dooby",\n    "v\x1banilla",\n]\n\nrandom.seed(2)\n\xb1\xb3.\x01\xb3\xe5\x01\x0c\x03\x93\xf5\x04*'

As you can see there are more bytes after the last human-readable part random.seed(2)\n.

This affects the Git objects that follow: because their first byte is read from the wrong position, other types of objects can be mistaken for ref deltas.
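
One common way to find exactly where a compressed entry ends is to let zlib tell you how much input it did not need. Here is a minimal sketch, assuming the packfile is held in memory as bytes (`decompress_at` is a name I made up, not something from your code):

```python
import zlib


def decompress_at(buf: bytes, pos: int) -> tuple[bytes, int]:
    """Inflate one zlib stream starting at pos.

    Returns (decompressed_data, position_right_after_the_stream).
    """
    dec = zlib.decompressobj()
    data = dec.decompress(buf[pos:])
    if not dec.eof:
        raise ValueError("truncated zlib stream")
    # unused_data holds the bytes that came after the end of this stream,
    # so subtracting its length gives the number of bytes actually consumed.
    consumed = len(buf) - pos - len(dec.unused_data)
    return data, pos + consumed
```

The returned position is where the next entry header starts, so every following object is read from the right byte.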

You have to stop reading the ref delta based on what its header and instructions tell you, namely:

  1. Where to copy/add from (an offset into the base object)
  2. The size of the information to copy
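
As a sketch of what those two pieces of information look like in practice, this is roughly how the decompressed delta data can be applied to its base. It follows the delta layout as I understand it (two size varints, then copy/insert instructions); the function names are only for illustration.

```python
def _read_size(delta: bytes, pos: int) -> tuple[int, int]:
    """Read a little-endian base-128 size; return (size, new position)."""
    size, shift = 0, 0
    while True:
        byte = delta[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return size, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target object from its base and the decompressed delta."""
    base_size, pos = _read_size(delta, 0)
    target_size, pos = _read_size(delta, pos)
    if base_size != len(base):
        raise ValueError("delta does not match this base object")
    out = bytearray()
    while pos < len(delta):
        op = delta[pos]
        pos += 1
        if op & 0x80:
            # Copy instruction: bits 0-3 say which offset bytes are present,
            # bits 4-6 say which size bytes are present.
            offset = size = 0
            for i in range(4):
                if op & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if op & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000  # a size of zero means 64 KiB
            out += base[offset:offset + size]
        elif op:
            # Insert instruction: the next `op` bytes are literal data.
            out += delta[pos:pos + op]
            pos += op
        else:
            raise ValueError("reserved delta instruction 0")
    if len(out) != target_size:
        raise ValueError("resolved object has the wrong size")
    return bytes(out)
```

The trailing binary bytes in the printed output above are instructions of this kind, not leftover garbage.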

Also, FYI, there are three ref deltas in git-sample-1.

You are pretty close to solving it!


I honestly have no idea how to do this. I’m in way over my head here haha.

But I’ll keep scratching my beard and will get to the solution eventually, thanks for the help.

@eaverdeja You can follow the guide I have on this: 5. Parsing REF_DELTA | Git-protocol

Also, if it’s any help, here is my solution for that (it’s in C++, though): git/src/file/pack/delta.cpp at master · i27ae15/git · GitHub


Thanks for the links! They were helpful.

I managed to progress a bit further - I can now find & resolve the 3 REF_DELTA objects in the git-sample-1 repo.

Number of base objects: 329
Number of delta objects to resolve: 3
Attempting to resolve delta with base SHA: 6a9f27650d6d08a9307f28a3e1697b32dc250a8a
Successfully resolved delta, new object SHA: 5a201b017b9c92745491d72a7301de7ec773f782
Attempting to resolve delta with base SHA: 5a201b017b9c92745491d72a7301de7ec773f782
Successfully resolved delta, new object SHA: 781007c281d79173c6c166b753376147af4ace50
Attempting to resolve delta with base SHA: 0f99f9c5b83b010cfbd67870502df7b293ec0e37
Successfully resolved delta, new object SHA: 2a7a45d39bd312e00c01f5972063b7ca12b6bd28
Remaining deltas after iteration: 0

Verifying the packfile gives me 329 non-delta objects and 3 delta objects, so everything seems aligned until now.
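
A quick cross-check for that count is the packfile header itself, which stores the total number of entries. A minimal sketch, assuming `buf` starts right at the `PACK` signature (any protocol framing in front of it already stripped) and using a function name I made up:

```python
import struct


def read_pack_header(buf: bytes) -> tuple[int, int]:
    """Return (version, object_count) from the 12-byte packfile header."""
    # 4-byte signature, then two big-endian 32-bit integers.
    signature, version, count = struct.unpack(">4sII", buf[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    return version, count
```

For git-sample-1 the count should equal the base objects plus the deltas, i.e. 329 + 3 = 332.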

The new issue arises when checking out and reading the recently written git objects - there’s a tree that wasn’t written to .git/objects for some reason.

  File "/Users/eaverdeja/Projects/codecrafters-git-python/app/commands.py", line 237, in _checkout_tree
    type_str, data = _read_git_object(tree_sha1)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eaverdeja/Projects/codecrafters-git-python/app/commands.py", line 263, in _read_git_object
    with open(obj_path, "rb") as file:
         ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '.git/objects/f0/88b44e824793b7aaf2f65c0919c4287d500188'

From a fresh git clone of the sample repo, this SHA does exist and is a tree:

➜  git-sample-1 git:(master) ✗ g cat-file -p f088b44e824793b7aaf2f65c0919c4287d500188
100644 blob ab87edd434a2c46290ecbd9799b3bd2b6525f6d3    donkey
100644 blob 9f0eabb707754276cf5fa8492bc5ae2b1becbc4b    doo
100644 blob 578e86f46ec2d281b72c817c54c4c16bc2ff9e08    dooby
100644 blob a28840bd27bf82659e591c61f484f30714728249    dumpty
100644 blob e1c1cade21c7761bdb226666ae0d5ed70f7e3cdd    horsey
100644 blob 4a21229f266438606fa578ea9b478013b89e77b0    humpty
100644 blob 409b28ad925a57dd4b521d04274416301fec8828    monkey
100644 blob bd8a3fd65926c98bde51900ce0ab287197154e34    scooby
100644 blob 7f487faaebd80604675fc12b9e41f8c25cdeea76    vanilla
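
For reference, the raw body of a tree object like this one can be walked with something along these lines. This is a simplified sketch rather than the exact code from my repo, and `parse_tree` is just an illustrative name:

```python
def parse_tree(data: bytes) -> list[tuple[str, str, str]]:
    """Return (mode, name, sha) for each entry in a raw tree object body.

    Each entry is laid out as b"<mode> <name>\\0" followed by a 20-byte SHA-1.
    """
    entries = []
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)
        nul = data.index(b"\0", space)
        mode = data[pos:space].decode()
        name = data[space + 1:nul].decode()
        sha = data[nul + 1:nul + 21].hex()  # binary SHA, not hex text
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries
```

Checkout then recurses into every entry whose mode is 40000 (a subtree), which is where a missing tree object makes the read fail.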

Running tree on my cloned repo’s .git/objects yields me 188 directories, 334 files. On a fresh git clone of the repo I get 187 directories, 332 files.

Seems weird that there’s just a single object missing? Running diff --brief git-sample-1/.git/objects ../git-sample-1/.git/objects yields the following (left is my clone, right is git clone):

Only in git-sample-1/.git/objects: 23
Only in git-sample-1/.git/objects: 4c
Only in git-sample-1/.git/objects: ff
Only in ../git-sample-1/.git/objects: info
Only in ../git-sample-1/.git/objects: pack

So it seems like I have extra objects (or wrongly named ones). How could that be happening? These extra objects are invalid btw:

(venv) ➜  codecrafters-git-python git:(master) ✗ cd git-sample-1 && g cat-file -p 23575504f27d489deee7ad72bfc9c2a185d4eb49
fatal: invalid object type

I’m so close I can even get tests passing - which is amazing but not quite fulfilling as I need to ignore errors when reading git objects to do so. I’d much prefer to get a proper working solution.

[tester::#MG6] Running tests for Stage #MG6 (Clone a repository)
[tester::#MG6] $ ./your_program.sh clone https://github.com/codecrafters-io/git-sample-3 <testDir>
[your_program]
[your_program] Number of base objects: 300
[your_program] Number of delta objects to resolve: 1
[your_program] Attempting to resolve delta with base SHA: 614a4a38d7b3dd6d34df0b99110b81ea32bef5a6
[your_program] Successfully resolved delta, new object SHA: b64fa8e5c5fcea1ecfd0cf36986a7b450656d944
[your_program] Remaining deltas after iteration: 0
[tester::#MG6] $ git cat-file commit 23f0bc3b5c7c3108e41c448f01a3db31e7064bbb
[tester::#MG6] Commit contents verified
[tester::#MG6] Reading contents of a sample file
[tester::#MG6] File contents verified
[tester::#MG6] Test passed.

@eaverdeja congrats on having all tests passing! It is extremely difficult!

From how you describe the issue, the first thing that comes to mind is that you might be treating a ref delta that points to a tree as a blob. Ref deltas can resolve to either blobs or trees.

In git-sample-1 there are three deltas, 2 blob objects and 1 tree object. The two blobs are at the beginning and the tree at the end, you can see this by running git verify-pack.
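
The reason the type matters is that it is part of what gets hashed and written. Here is a small sketch of writing a loose object (my own illustration in Python, with a hypothetical `write_loose_object` helper, not code from your repo):

```python
import hashlib
import os
import zlib


def write_loose_object(git_dir: str, obj_type: str, data: bytes) -> str:
    """Write an object under <git_dir>/objects and return its SHA-1."""
    # The type is part of the header, so it is also part of the hash.
    store = f"{obj_type} {len(data)}\0".encode() + data
    sha = hashlib.sha1(store).hexdigest()
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return sha
```

If a resolved tree delta were written with "blob" as its type, it would end up under a different SHA, and the tree entry that references it would point at a file that does not exist.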

I cannot take a deep look today, I am a bit busy with work.

Can you check this? If this is not the case, I will take a look tomorrow!

No rush at all, take a look at your leisure!

I’ll take some more time tonight to investigate and clean up some of the code. I’m still not happy with how I’m decompressing data, even though it works.

Edit:

Ok, I found out the issue with the extra objects - I was incorrectly creating git objects for ref_delta type objects. That’s fixed now.

I’m still missing the tree object f088b44e824793b7aaf2f65c0919c4287d500188 though, which is present on the sample repo, but not on my clone.

Running diff on the 2 repos indicates that I’m indeed missing a whole tree/subdirectory, with 9 files:

$ codecrafters-git-python git:(master) ✗ diff -bur git-sample-1/ ../git-sample-1

Only in ../git-sample-1/scooby/dooby: donkey
Only in ../git-sample-1/scooby/dooby: doo
Only in ../git-sample-1/scooby/dooby: dooby
Only in ../git-sample-1/scooby/dooby: dumpty
Only in ../git-sample-1/scooby/dooby: horsey
Only in ../git-sample-1/scooby/dooby: humpty
Only in ../git-sample-1/scooby/dooby: monkey
Only in ../git-sample-1/scooby/dooby: scooby
Only in ../git-sample-1/scooby/dooby: vanilla

Running tree tells us more about the diff:

$ codecrafters-git-python git:(master) - tree git-sample-1

41 directories, 191 files

---

$ Projects - tree git-sample-1


41 directories, 202 files

The difference of 11 is the 9 files from above plus the pack and index files. But why do the directory counts match if we’re missing a whole folder? Well, it turns out /scooby/dooby does exist on my side, but it’s empty!


From my program execution, I think the issue is still with how I’m decompressing data :confused:

I did try the simpler approach to decompression that I saw in multiple solutions, but it doesn’t work for me:

    def _read_compressed_data(self) -> bytes:
        dec = zlib.decompressobj()
        uncompressed_data = dec.decompress(self._read_bytes())   # breaks here with `incorrect header check`
        self.stream.seek(-len(dec.unused_data), 1)

        return bytes(uncompressed_data)

My program still fails here for a single tree object (which is probably f088b).

@eaverdeja Hello, sorry for the delay. We are planning to split one of our services out into its own microservice, so I’ve had a couple of meetings.

I think you are right saying that the decompressing is the issue because I see this:

On the left is the correct packfile and on the right is your implementation.

The highlighted line on the left is the tree you are missing. Of the two lines below the highlighted one, the line highlighted in white is what you are returning.

You are missing:

f088b44e824793b7aaf2f65c0919c4287d500188 tree   303 265 8147
ab87edd434a2c46290ecbd9799b3bd2b6525f6d3 blob   52 53 8412

I am not sure why this is happening, but again, my guess is that it has something to do with the decompression method.


I’ve tried multiple iterations of the _read_compressed_data method to no avail.
I think that despite this lingering issue, I’m happy with my implementation, so I’ll submit the challenge as-is.

We can close this thread as I see it. Thanks again @i27ae15 for all of the help here!

:partying_face: :partying_face: :partying_face:

ps:

obviously the 1st test run for submission had to fail, it looked specifically for the folder I didn’t clone :see_no_evil: :laughing:

[tester::#MG6] Running tests for Stage #MG6 (Clone a repository)
[tester::#MG6] $ ./your_program.sh clone https://github.com/codecrafters-io/git-sample-1 <testDir>
[your_program] Could not find git object f088b44e824793b7aaf2f65c0919c4287d500188
[tester::#MG6] $ git cat-file commit 3b0466d22854e57bf9ad3ccf82008a2d3f199550
[tester::#MG6] Commit contents verified
[tester::#MG6] Reading contents of a sample file
[tester::#MG6] open /tmp/worktree2105773131/test_dir/scooby/dooby/doo: no such file or directory
[tester::#MG6] Test failed
