Future-proofing my code

I feel like I’m not writing code that will survive even the next challenge, and that at any point I’ll have to redo the entire codebase for whatever challenge I start. Part of this stems from a lack of foresight into the project given to me, but I’m wondering if anyone has tips on getting around this.

Hi @Moosems,

I feel the same way as you do. However, without seeing your code, it’s hard to give any useful tips.

My hunch is that as long as we follow the code structure outlined in the Crafting Interpreters book, we’re probably fine.

It’s hard to give tips without examples, but in general, avoid hardcoding things. I’ve worked through implementing lexers before: build generically rather than building something just to pass the current challenge. For instance: you just saw a !. Is it ! or !=? You just saw the letter “f”. Is it “for”, “fun”, or something else? You could have a generic function that, given the string, your position in it, and the expected suffix, reports whether you matched that suffix (be careful applying this to the for/fun example, since a keyword can also be the prefix of a longer identifier). That way you’ve handled /*, //, !, and != at least.
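That generic helper might look something like this. This is a hypothetical sketch; the names `match_suffix`, `source`, `pos`, and `expected` are mine, not from anyone’s actual code:

```python
# Hypothetical sketch of the generic "did the expected suffix follow?" helper.
def match_suffix(source: str, pos: int, expected: str) -> bool:
    """Return True if `source` continues with `expected` right after index `pos`."""
    end = pos + 1 + len(expected)
    return source[pos + 1:end] == expected

# Usage: after seeing "!" at index i, decide between BANG and BANG_EQUAL.
source = "a != b"
i = source.index("!")
token = "BANG_EQUAL" if match_suffix(source, i, "=") else "BANG"
# For keywords like "for"/"fun" you would additionally need to check that the
# character after the match is not alphanumeric, or "fortune" becomes "for".
```

The same helper covers the two-character operators (`!=`, `==`, `<=`, `>=`) and comment openers (`//`, `/*`) with no per-token special cases.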

@Amerikranian @andy1li This is my code now:

"""git commit --allow-empty -a -m 'test'
git push origin master"""

from sys import argv, stderr
from typing import TypeAlias
from pathlib import Path

Token: TypeAlias = tuple[str, str, str]

class Tokenizer:
    # Maps the starting char to a tuple of: the name of that char's token, any
    # possible second-char match, and the name of the resulting two-char token
    valid_tokens: dict[str, tuple[str, str, str]] = {
        "(": ("LEFT_PAREN", "", ""),
        ")": ("RIGHT_PAREN", "", ""),
        "{": ("LEFT_BRACE", "", ""),
        "}": ("RIGHT_BRACE", "", ""),
        ",": ("COMMA", "", ""),
        ".": ("DOT", "", ""),
        "-": ("MINUS", "", ""),
        "+": ("PLUS", "", ""),
        ";": ("SEMICOLON", "", ""),
        "*": ("STAR", "", ""),
        "=": ("EQUAL", "=", "EQUAL_EQUAL"),
        "!": ("BANG", "=", "BANG_EQUAL"),
        ">": ("GREATER", "=", "GREATER_EQUAL"),
        "<": ("LESS", "=", "LESS_EQUAL"),
        # Special cases
        "/": ("SLASH", "/", "COMMENT"),
        "\t": ("PASS", "", ""),
        "\n": ("PASS", "", ""),
        " ": ("PASS", "", ""),
    }
    def __init__(self, file: Path) -> None:
        self.file_contents: str = file.read_text()  # read_text closes the file for us
        self.contents_len: int = len(self.file_contents)
        self.tokens: list[Token] = []
        self.line = 1  # Current line, 1-indexed
        self.pos = -1  # Index into the file; advance_char() increments before reading
        self.hadError = False
        self.finished = False

        while not self.finished:
            self.create_token()

    def advance_char(self) -> str:
        self.pos += 1
        if self.pos >= self.contents_len:
            return ""
        return self.file_contents[self.pos]

    def validate_char(self, char: str) -> bool:
        return char != ""

    def create_token(self) -> None:
        char = self.advance_char()
        token: Token
        if not self.validate_char(char):
            token = ("EOF", "", "null")
            self.tokens.append(token)
            self.finished = True
            return
        try:
            match: tuple[str, str, str] = Tokenizer.valid_tokens[char]
            if match[0] == "PASS":
                if char == "\n":
                    self.line += 1
                return

            if not match[1]:
                token = (match[0], char, "null")
                self.tokens.append(token)
                return

            if self.pos + 1 >= self.contents_len:
                token = (match[0], char, "null")
                self.tokens.append(token)
                return

            next_char = self.file_contents[self.pos + 1]
            if not next_char == match[1]:
                token = (match[0], char, "null")
                self.tokens.append(token)
                return

            token = (match[2], char + next_char, "null")
            if token[0] == "COMMENT":
                # A // comment runs to the end of the line, not the end of the
                # file: skip ahead so the next pass sees the newline (or EOF)
                newline = self.file_contents.find("\n", self.pos)
                self.pos = self.contents_len if newline == -1 else newline - 1
                return
            self.pos += 1
            self.tokens.append(token)
        except KeyError:
            print(
                f"[line {self.line}] Error: Unexpected character: {char}", file=stderr
            )
            self.hadError = True

def main():
    # You can use print statements as follows for debugging, they'll be visible when running tests.
    print("Logs from your program will appear here!", file=stderr)

    if len(argv) < 3:
        print("Usage: ./your_program.sh tokenize <filename>", file=stderr)
        exit(1)

    command = argv[1]
    filename = argv[2]

    if command != "tokenize":
        print(f"Unknown command: {command}", file=stderr)
        exit(1)

    tokenizer = Tokenizer(Path(filename))
    tokens: list[Token] = tokenizer.tokens
    for token in tokens:
        print(*token)
    if tokenizer.hadError:
        exit(65)


if __name__ == "__main__":
    main()

But it used to be this (I only rewrote it into the current form when I felt I couldn’t keep monkey-patching in each new feature):

Token: TypeAlias = tuple[str, str, str]

def tokenize(file_contents: str) -> tuple[list[Token], bool]:
    token_list: list[Token] = []
    error: bool = False
    skip: bool = False
    for i, char in enumerate(file_contents):
        if skip:
            skip = False
            continue
        match char:
            case "(":
                token_list.append(("LEFT_PAREN", "(", "null"))
            case ")":
                token_list.append(("RIGHT_PAREN", ")", "null"))
            case "{":
                token_list.append(("LEFT_BRACE", "{", "null"))
            case "}":
                token_list.append(("RIGHT_BRACE", "}", "null"))
            case "*":
                token_list.append(("STAR", "*", "null"))
            case ".":
                token_list.append(("DOT", ".", "null"))
            case ",":
                token_list.append(("COMMA", ",", "null"))
            case "+":
                token_list.append(("PLUS", "+", "null"))
            case "-":
                token_list.append(("MINUS", "-", "null"))
            case ";":
                token_list.append(("SEMICOLON", ";", "null"))
            case "/":
                token_list.append(("SLASH", "/", "null"))
            case "=":
                if not len(file_contents) < i + 2:
                    if file_contents[i+1] == "=":
                        token_list.append(("EQUAL_EQUAL", "==", "null"))
                        skip = True
                        continue
                token_list.append(("EQUAL", "=", "null"))
            case _:
                error = True
                # Count newlines before this index to get the 1-based line number
                line = file_contents.count("\n", 0, i) + 1
                print(
                    f"[line {line}] Error: Unexpected character: {char}",
                    file=sys.stderr,
                )

    token_list.append(("EOF", "", "null"))
    return (token_list, error)

Pretty solid code!

The first thing I noticed is that the valid_tokens tuples might be better off replaced with a Token data class, so that the matching logic stays in the create_token method.

I have something like this (a direct translation of the Java code in the book):
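The snippet itself didn’t come through in the thread, but a minimal sketch of such a translation might look like the following. The four fields mirror the book’s Token class; the TokenType members shown here are abbreviated:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class TokenType(Enum):
    LEFT_PAREN = auto()
    RIGHT_PAREN = auto()
    BANG = auto()
    BANG_EQUAL = auto()
    EOF = auto()
    # ...remaining types from the book

@dataclass(frozen=True)
class Token:
    type: TokenType
    lexeme: str    # the raw characters matched
    literal: Any   # the parsed value for numbers/strings, else None
    line: int      # kept around for error reporting later

    def __str__(self) -> str:
        return f"{self.type.name} {self.lexeme} {self.literal}"
```

Carrying the line number on every token pays off once the parser needs to report syntax errors.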


To be honest, the reason I did it the way I did was to avoid a massive match statement.

Yeah, I see.

May I ask the reasons why you tried to avoid match? What’re the pros and cons versus the current hardcoded valid_tokens?

IMHO, the matching logic has to live somewhere, in one form or another. If the matching criteria are simple and the results are fixed, the hash-map approach could be objectively better; if not, forcing it might hurt you in future challenges.

I personally feel like, while the dict of tokens is a neat idea, it’s a bit harder to reason through when things go wrong. Full disclosure: I don’t particularly like writing lexers (I find scanning to be relatively mindless), so I want to figure out what went wrong very quickly and move on. If this were to break, I’d personally have to think a bit harder about what went down and how.

That being said, there’s absolutely nothing wrong with your code. It’s arguably more concise (although I do second the suggestion to make a token class for syntax error reporting later). You can probably even wire up tokens to specific functions and make the main match even shorter.
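One way that “wire up tokens to specific functions” idea might look, sketched with hypothetical names (`handlers`, `scan_char`, and the `simple` helper are all my own illustration, not code from this thread):

```python
from typing import Callable, Optional

Token = tuple[str, str, str]

def simple(name: str, char: str) -> Token:
    """Build a single-character token in the (type, lexeme, literal) shape."""
    return (name, char, "null")

# Each character maps to a small callable; more complex cases (!, =, /) could
# map to functions that peek ahead before deciding which token to emit.
handlers: dict[str, Callable[[], Token]] = {
    "(": lambda: simple("LEFT_PAREN", "("),
    ")": lambda: simple("RIGHT_PAREN", ")"),
    "*": lambda: simple("STAR", "*"),
}

def scan_char(char: str) -> Optional[Token]:
    handler = handlers.get(char)
    return handler() if handler else None
```

The main loop then shrinks to a dictionary lookup plus one error branch, while each tricky character keeps its own self-contained logic.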