I feel like I’m not writing code that will stand up to even the next challenge, and that at any point I’ll have to redo all of my code for whatever challenge I start. Part of this stems from a lack of foresight into the project I’ve been given, but I’m wondering if anyone has tips on getting around this.
Hi @Moosems,
I feel the same way as you do. However, without seeing your code, it’s hard to give any useful tips.
My hunch is that as long as we follow the code structure outlined in the Crafting Interpreters book, we’re probably fine.
It’s hard to give tips without examples, but in general, avoid hardcoding stuff. I’ve worked through implementing lexers before. Build generically rather than building something just to pass the current challenge. For instance: you just saw a !. Is it ! or !=? You just saw the letter “f”. Is it “for”, “fun”, or something else? You could have a generic function that, given the string, your position in it, and the expected suffix, reports whether you matched that suffix or not (be careful doing this with the for/fun example). That way you’ve handled /*, //, !, and != at least.
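A rough sketch of the kind of helper I mean (untested; the name is just a placeholder):

def match_suffix(source: str, pos: int, expected: str) -> bool:
    # Report whether `expected` appears in `source` immediately after index `pos`.
    # `pos` is the index of the character you just consumed.
    end = pos + 1 + len(expected)
    return source[pos + 1:end] == expected


# After seeing "!" at index i, match_suffix(source, i, "=") tells you whether
# the token is BANG_EQUAL; the caller then advances past the matched suffix.
# For keywords like "for"/"fun" you'd additionally need to check that the
# character *after* the suffix isn't alphanumeric, or you'll match "fortune".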
@Amerikranian @andy1li This is my code now:
"""git commit --allow-empty -a -m 'test'
git push origin master"""
from sys import argv, stderr
from typing import TypeAlias
from pathlib import Path
Token: TypeAlias = tuple[str, str, str]
class Tokenizer:
    # Maps a starting char to a tuple containing the token name for that char,
    # any possible second char that extends it, and the name of that two-char token
    valid_tokens: dict[str, tuple[str, str, str]] = {
        "(": ("LEFT_PAREN", "", ""),
        ")": ("RIGHT_PAREN", "", ""),
        "{": ("LEFT_BRACE", "", ""),
        "}": ("RIGHT_BRACE", "", ""),
        ",": ("COMMA", "", ""),
        ".": ("DOT", "", ""),
        "-": ("MINUS", "", ""),
        "+": ("PLUS", "", ""),
        ";": ("SEMICOLON", "", ""),
        "*": ("STAR", "", ""),
        "=": ("EQUAL", "=", "EQUAL_EQUAL"),
        "!": ("BANG", "=", "BANG_EQUAL"),
        ">": ("GREATER", "=", "GREATER_EQUAL"),
        "<": ("LESS", "=", "LESS_EQUAL"),
        # Special cases
        "/": ("SLASH", "/", "COMMENT"),
        "\t": ("PASS", "", ""),
        "\n": ("PASS", "", ""),
        " ": ("PASS", "", ""),
    }
    def __init__(self, file: Path) -> None:
        self.file_contents: str = file.read_text()
        self.contents_len: int = len(self.file_contents)
        self.tokens: list[Token] = []
        self.line = 1  # Lines are 1-based
        self.pos = -1  # Index of the current char; advance_char() increments before reading
        self.hadError = False
        self.finished = False
        while not self.finished:
            self.create_token()
    def advance_char(self) -> str:
        self.pos += 1
        if self.pos >= self.contents_len:
            return ""
        return self.file_contents[self.pos]

    def validate_char(self, char: str) -> bool:
        if char == "":
            return False
        return True

    def create_token(self) -> None:
        char = self.advance_char()
        token: Token
        if not self.validate_char(char):
            token = ("EOF", "", "null")
            self.tokens.append(token)
            self.finished = True
            return
        try:
            match: tuple[str, str, str] = Tokenizer.valid_tokens[char]
            if match[0] == "PASS":
                if char == "\n":
                    self.line += 1
                return
            if not match[1]:
                token = (match[0], char, "null")
                self.tokens.append(token)
                return
            if self.pos + 1 >= self.contents_len:
                token = (match[0], char, "null")
                self.tokens.append(token)
                return
            next_char = self.file_contents[self.pos + 1]
            if not next_char == match[1]:
                token = (match[0], char, "null")
                self.tokens.append(token)
                return
            token = (match[2], char + next_char, "null")
            if token[0] == "COMMENT":
                token = ("EOF", "", "null")
                self.tokens.append(token)
                self.finished = True
                return
            self.pos += 1
            self.tokens.append(token)
        except KeyError:
            print(
                f"[line {self.line}] Error: Unexpected character: {char}", file=stderr
            )
            self.hadError = True
def main():
    # You can use print statements as follows for debugging, they'll be visible when running tests.
    print("Logs from your program will appear here!", file=stderr)

    if len(argv) < 3:
        print("Usage: ./your_program.sh tokenize <filename>", file=stderr)
        exit(1)

    command = argv[1]
    filename = argv[2]

    if command != "tokenize":
        print(f"Unknown command: {command}", file=stderr)
        exit(1)

    tokenizer = Tokenizer(Path(filename))
    tokens: list[Token] = tokenizer.tokens
    for token in tokens:
        print(*token)
    if tokenizer.hadError:
        exit(65)


if __name__ == "__main__":
    main()
But it used to be this (I only wrote the current version when I felt I couldn’t keep monkey-patching in each new feature):
import sys
from typing import TypeAlias

Token: TypeAlias = tuple[str, str, str]


def tokenize(file_contents: str) -> tuple[list[Token], bool]:
    token_list: list[Token] = []
    error: bool = False
    skip: bool = False
    for i, char in enumerate(file_contents):
        if skip:
            skip = False
            continue
        match char:
            case "(":
                token_list.append(("LEFT_PAREN", "(", "null"))
            case ")":
                token_list.append(("RIGHT_PAREN", ")", "null"))
            case "{":
                token_list.append(("LEFT_BRACE", "{", "null"))
            case "}":
                token_list.append(("RIGHT_BRACE", "}", "null"))
            case "*":
                token_list.append(("STAR", "*", "null"))
            case ".":
                token_list.append(("DOT", ".", "null"))
            case ",":
                token_list.append(("COMMA", ",", "null"))
            case "+":
                token_list.append(("PLUS", "+", "null"))
            case "-":
                token_list.append(("MINUS", "-", "null"))
            case ";":
                token_list.append(("SEMICOLON", ";", "null"))
            case "/":
                token_list.append(("SLASH", "/", "null"))
            case "=":
                if not len(file_contents) < i + 2:
                    if file_contents[i + 1] == "=":
                        token_list.append(("EQUAL_EQUAL", "==", "null"))
                        skip = True
                        continue
                token_list.append(("EQUAL", "=", "null"))
            case _:
                error = True
                line = file_contents.count("\n", 0, i) + 1
                print(
                    f"[line {line}] Error: Unexpected character: {char}",
                    file=sys.stderr,
                )
    token_list.append(("EOF", "", "null"))
    return (token_list, error)
Pretty solid code!
The first thing I noticed is that valid_tokens might be better off not hardcoded, and turned into a Token data class instead, so that the matching logic stays in the create_token method.
I have something like this (a direct translation of the Java code in the book):
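(Sketch of the general shape; abridged rather than the exact code.)

from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    LEFT_PAREN = auto()
    RIGHT_PAREN = auto()
    BANG = auto()
    BANG_EQUAL = auto()
    EOF = auto()
    # ... one member per token type


@dataclass
class Token:
    type: TokenType
    lexeme: str
    literal: object
    line: int


class Scanner:
    def __init__(self, source: str) -> None:
        self.source = source
        self.tokens: list[Token] = []
        self.start = 0    # first char of the lexeme being scanned
        self.current = 0  # char currently being considered
        self.line = 1

    def scan_tokens(self) -> list[Token]:
        while not self.is_at_end():
            self.start = self.current
            self.scan_token()
        self.tokens.append(Token(TokenType.EOF, "", None, self.line))
        return self.tokens

    def is_at_end(self) -> bool:
        return self.current >= len(self.source)

    def advance(self) -> str:
        char = self.source[self.current]
        self.current += 1
        return char

    def match(self, expected: str) -> bool:
        # Conditionally consume the next char if it's the expected one.
        if self.is_at_end() or self.source[self.current] != expected:
            return False
        self.current += 1
        return True

    def add_token(self, type: TokenType, literal: object = None) -> None:
        text = self.source[self.start:self.current]
        self.tokens.append(Token(type, text, literal, self.line))

    def scan_token(self) -> None:
        char = self.advance()
        match char:
            case "(":
                self.add_token(TokenType.LEFT_PAREN)
            case "!":
                self.add_token(TokenType.BANG_EQUAL if self.match("=") else TokenType.BANG)
            case "\n":
                self.line += 1
            # ... the remaining cases follow the same pattern
            case _:
                pass  # report "[line ...] Error: Unexpected character" here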
To be honest, the reason I did it the way I did was to avoid a massive match statement.
Yeah, I see.
May I ask why you tried to avoid match? What are the pros and cons versus the current hardcoded valid_tokens?
IMHO, the matching logic has to live somewhere, in one form or another. If the matching criteria are simple and the results are fixed, the hash map approach could be objectively better; if not, forcing it might hurt you in future challenges.
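For example, string literals don’t fit the fixed (name, second char, name) shape at all: you need a loop and error handling, which a match arm or a dedicated method handles naturally. A hypothetical helper, just to illustrate:

def scan_string(source: str, pos: int, line: int) -> tuple[str, int, int]:
    # Hypothetical helper: consume a string literal whose opening quote is at
    # `pos`; return (lexeme, position after the closing quote, updated line).
    end = pos + 1
    while end < len(source) and source[end] != '"':
        if source[end] == "\n":
            line += 1
        end += 1
    if end >= len(source):
        raise SyntaxError(f"[line {line}] Error: Unterminated string.")
    return source[pos:end + 1], end + 1, line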
I personally feel that, while the dict of tokens is a neat idea, it’s a bit harder to reason through when things go wrong. Full disclosure: I don’t particularly like writing lexers (I find scanning to be relatively mindless), so I want to figure out what went wrong very quickly and move on. If this were to break, I would personally have to think a bit harder about what went down and how.
That being said, there’s absolutely nothing wrong with your code. It’s arguably more concise (although I do second the suggestion to make a token class for syntax error reporting later). You can probably even wire up tokens to specific functions and make the main match even shorter.
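An untested sketch of what I mean by wiring characters up to functions (all names invented):

class DispatchScanner:
    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0
        self.tokens: list[tuple[str, str, str]] = []
        # Map each leading character to the bound method that finishes its token.
        self.handlers = {
            "(": lambda: self.simple("LEFT_PAREN", "("),
            ")": lambda: self.simple("RIGHT_PAREN", ")"),
            "!": lambda: self.one_or_two("!", "=", "BANG", "BANG_EQUAL"),
            "=": lambda: self.one_or_two("=", "=", "EQUAL", "EQUAL_EQUAL"),
            # ... and so on
        }

    def simple(self, name: str, lexeme: str) -> None:
        self.tokens.append((name, lexeme, "null"))

    def one_or_two(self, first: str, second: str, one: str, two: str) -> None:
        # Emit the two-char token if the next char matches, else the one-char token.
        if self.pos < len(self.source) and self.source[self.pos] == second:
            self.pos += 1
            self.tokens.append((two, first + second, "null"))
        else:
            self.tokens.append((one, first, "null"))

    def scan(self) -> list[tuple[str, str, str]]:
        while self.pos < len(self.source):
            char = self.source[self.pos]
            self.pos += 1
            handler = self.handlers.get(char)
            if handler is not None:
                handler()
            # else: report "[line ...] Error: Unexpected character" here
        self.tokens.append(("EOF", "", "null"))
        return self.tokens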