12:50, when you said ‘state now’ and ‘state next’ you identified a problem I have been having with my own project that had been throwing me problems for a good part of the year! Thank you!!!’
I'm brazilian computer engineering student and I love to see your videos because they show me the side beyond the theory, I head about Finite State Machine in my class of "Aspectos teoricos da computação" and we started with Finit Automaton, love it!
Hi there, I am also a Brazilian and a university professor. I've been following OLC for quite a while and even used some of his examples in class. If you're interested, check out my playlists. Valeu.
Excellent topic! and inspiring to my current project of parsing C++ preprocessor in Python (to convert it into JSON, YAML, CSV, SQL, etc.). And state machines are so fundamental to all kinds of programming that it never hurts to revisit and review the potential of this straightforward but powerful concept.
I’m designing an assembler for a home brew instruction set architecture and coincidentally, OLC starts this series on designing custom programming languages! It’s like you’ve sensed that a viewer was in need and came to hold my hand. lol Love your videos and thanks!
Some time ago i've made my first compiler, and probably has the most simple tokenizer ever. Its basicly divide all the strings were there is some type of "separator" like spaces, operators and marker simbols, then for the classification i only checked if it matched with the predicted symbols of the language, if its not a simbol its a number , variable or a error (my language only has variable and number as data ), so all i needed is to check if it was numeric or if the variable regex was a match. I got really proud of it :D
This is one of those "minefield" programming tasks; now we need it to parse numbers formatted with commas instead of decimal points (e.g. 3.1415927 vs 3,1415927) i.e. make it international. Also maybe apostrophes e.g. $8'100.56 (oh yeah, currencies and units too!)
@javidx9 Absolutely. This is an example of one of the instances where (in a commercial environment) I believe a proper set of requirements should be documented with sound definitions of syntactically acceptable inputs and expected outputs. I think your teaching style is excellent - well paced and balanced with an emphasis on the fundamentals, leaving the viewer with the knowledge to approach solving problems for themselves. I'm looking forward to the rest of the series. Thanks.
Muahaha! This takes me back. I've written multiple programming languages just for giggles. I remember I became almost OCD with a language where everything is objects and code would be full of monads. What I mean is something like: 123.If > 0 .Then(...) .Else(..) I wrote programming language based on XML. I even wrote my own virtual machine. Come to think of it, I've written .NET virtual machine to track execution of programs. That was back when I had the drive to write stuff on my own time. Then I understood that being professional means you need to be paid to write code 😁
I remember this one chick I had the biggest crush on who tried to get me to learn about monads. I ended up hating monads and don't ever want to touch them again
Lol. Monads are handy in certain situations. I implemented this Result class which has methods OnSuccess and OnFailure. So when something returns Result type, I can write something like this: var result = ...; result.OnFailure(e => Log(e)); I find this much more clear than if(!e.Success) { .. } Luckily my wife isn't in software biz 😂
Awesome t-shirt! Ya seem like you would be an awesome dad! Edit: Love that you leave errors in the code! Reinforces that everyone makes mistakes. Thanks for the video!
That reminds me of when i made a json parser. Surprised to see that smart people actually do it similar to how i did it. Definitely have to revisit the project and see if i can improve something
I can't help but think that a "pure" FSM would deal with decimal points by adding one or more new states, e.g. "seen digits but no point", "seen digits then point then zero or more digits", "seen point" etc
Nice implemenation of simple language. Takes me back to learning lex/yacc (or bison for open source folks), and 'antlr' for Java platforms. Using those, taught myself a lot about 'Backus-Naur' form of describing programming languages. It dovetails nicely with your FSM description of your development. I note that some of your if-checks are very order sensitive but you don't really mention it. You check white-space first, and numerics before symbols to ensure the right tokens are recognized first. And so far you don't differentiate between integers and reals. This can be another fun bit to explore a little. Then we could have fun with '' as shift operators only valid on integer types. Or maybe supporting octal by recognizing a leading zero?? A lot of fun bits to mess around with. :)
I've made compilers using Flex and Bison, but this is very interesting. Learning the craft from scratch. If you ever write a book about (game)programming I'll definitely buy it It may be a nice idea to add some learning order in your video's
just FYI: I'm no Hungarian, but I do know a few lang-related things, and one of those is that the Hungarian 'S' is technically "sz", as just a simple 's' is pronounces "esh". So the name "Thomas" (in ENG) would be "Tomas" but pronounces "Tomash". Amazing vid! I was always quite interested in such lang-related parts, and making my own tokenizer + parser + AST processor. My inspiration primarily comes actually from the Groovy language which includes the "groovyConsole" utility within its SDK and allows a dev to inspect the AST and even the CST! Freakin awesome! Cheers!
When he said "Hungarian S" he didn't mean "S in the Hungarian language". Hungarian notation in computer science is a naming convention for variables that includes the type of the variable (or some indication of the type) in the name. So in the case of sMyVariable, the "s" at the beginning of the name indicates it's a struct or a string ( en.wikipedia.org/wiki/Hungarian_notation ) When he sayd "let's drop the Hungarian S" he just means that he's renaming the variable without indication of the type in it. It's not common anymore these days, particularly not when it comes to strongly typed languages like C++, but Javid likes to use it.
@@chillyvanilly6352 They are, but not for that reason. If you look through Simonyi's original article on the notation, "st" refers to strings, and "sz" refers to zero-terminated strings.
Good video so thanks. But gets less excited because you're using ever more features of C++...Will that language ever be completed??! C++ IMO is constantly getting bigger and bigger and bigger and over complicated for the problem in hand...No wonder it takes 6 years to produce a computer game these days. Some will say ah but you don't have to use every feature. But many will, so you have the situation when people looking to get into C++ will be overwhelmed by 20 versions of code when they ask for help! Even John Carmack has said he codes in C mainly and adds just a bit of C++.
I've never managed to get myself all the way through one of your series, but just seeing you post videos brings a smile to my face. Have a good day, sir, and keep doing what you do!
@13:27 yes, i stumbled on this problem with kind of FSM which i discovered myself (not knowing it has a name), sometimes i introduced next state, sometimes depended on break behaviour in swich / case constuct, but nextstate is easier to underestand, because switch/case may became unreadable if you try to not repeat code
*Summary* This video focuses on building a more robust parser for the DIY programming language using a finite state machine (FSM). *Key points:* * *(**0:00**)* Introduction and recap of the limitations of the previous shunting yard algorithm. * *(**7:28**)* Implementing the FSM in code. * *(**7:47**)* Housekeeping: Adding standard libraries, updating the `Symbol` struct to `Token`, and creating the `Compiler` class. * *(**12:02**)* Implementing the FSM states (`new token`, `numeric literal`, `operator`, `complete token`). * *(**17:47**)* Creating a custom "is digit" function for performance and flexibility. * *(**21:35**)* Handling whitespace in the input. * *(**26:54**)* Adding parenthesis handling and a parenthesis balance checker. * *(**32:01**)* Introducing symbols for variables, function names, etc., and the corresponding FSM state. * *(**36:18**)* Handling hexadecimal and binary numeric literals. * *(**40:35**)* Adding additional token types for future use (parenthesis, scope, separator, string literal, keyword). * *(**41:24**)* Implementing the `string literal` state. * *(**46:35**)* Integrating the shunting yard algorithm with the new token-based system. * *(**47:10**)* Demonstrating the improved calculator with different numeric types and syntax checking. *Result:* A more sophisticated calculator that can handle complex expressions with different numeric types and basic syntax checking. Future videos will build upon this foundation to implement more advanced programming language features. i used gemini 1.5 pro to summarize the transcript
Are you planning to add line and character counters so that the error reporting can give the exact location of an error? seems like a very simple addition that provides a nice feature many compilers have.
OK ..i agree your tokenizer work well and seems that produce rock solid array of tokens ..or at least array of UDT ...so next step will be parser i guess..it is kind of hard to follow allthis because i dont use C++ i find is hard ... can you btw explain how function with parms can be called multiple times in a recursive call i mean how to 2 or more recursive calls know which call is already executed ,,thanks
I just watched a video about category theory, and that diagram at 7:35 looked very familiar. Containing all of the features of a category: objects, morphisms and even endomorphisms. If i said something wrong here please correct me. My knowledge on the subject consists of a 5 minute youtube video and using bartosz milewski's course on category theory to sleep.
I understand every single thing that you discuss in your compiler videos, but I know that I would never be able to implement said things on my own. And I’d feel like a total fraud if I tried copying your code just to end up putting it on my resume. This is my dilemma as a programmer. I love your videos. Any tips for this issue?
Firstly, thanks and I appreciate the support! Secondly, people only learn anything through copying or being shown. Once the basics are there, your skills can develop on their own and that's where novel solutions and engineering stem from. In my experience through the community around this channel, I've come to learn that many folk expect to be able to code like someone with 30 years practice in just 2 years. I don't know your experience level, but don't underestimate your abilities right now, the first step is to acknowledge the things you can't do, at least then you know what to work on.
@@javidx9 thank you so much for your advice. I actually recently graduated with a bachelor’s degree in software engineering but I can’t escape this imposter syndrome. I still have yet to land my first professional programming job. Your advice gives me something to surely think about however. I feel like there’s tons of stuff I can’t do. But I guess it also helps to know the type of stuff that I actually want to do. You’re awesome man!
It's so funny to me that you care so much about the speed of the code for checking if it's a digit, yet for checking if it's operator you just use this massive `contains` function from the standard xd Also it feels like there should be "stateLast" which i think could make some of the exceptions more readable. Not sure if intended but expression ".005" is considered valid numeric literal and also it's possible to start and end the expression with separator
Nice! Just did a system update and the calculator app (kcalc on KDE) no longer parses decimal numbers without the leading zero. ".1" is invalid input and it's annoying. I dove into the code but there's probably 2 dozen cpp files and the whole thing seems too far gone.
2:10. Not insane. If you're building a bit mask you might well mix bases, depending on context. I must confess, I've never used pi in a bit mask though
No that would be a jump table. A lookup table would be indexing an array, which is what he is doing. And will obviously be faster than creating a function with a switch statement that returns true/false.
@@rosen8757 but he uses the lookup table to jump around the code, why not use something like: switch (c) { case ' 0 ' . . . ' 9 ': case ' . ': // do stuff }
@@Raspredval1337 it will be the same just the other way around, with your version he would need to check which state machine state he is in in every // do stuff. As it is now the outer switch is for the state machine state and inside he check tokens.
I would like to point out a few things I would do differently: 1) Instead of Finite State Machine you could've just have a lexer class with a nextToken method, which would peek a char and decide what to do, for each variant just call a blocking function that returns a token 2) Maybe this is to keep things simple in the first episodes, but error handling with exceptions... In a programming language you'd probably want to get all errors at once (multiple error reports). So I'd store them in a vector. Note: This may not be very applicable for lexer, but for parsers error recovery is a thing. 3) Indexing char iterator with zero. It's a valid approach, but if this iterator was a stream directly read from a file (or simpler - linked list or set iterator), this probably won't work. Not all iterators implement indexing 4) Considering #1 and #3, I'd not collect tokens into a vector, but rather parse them immediately. This gets rid of memory and performance overhead and to make parsers easily you'd normally only need one token lookahead (peek a token and maybe consume it) 5) Parenthesis - as well as checking after the loop, you really should check for the count to be positive in closing parenthesis, so combinations like ")123(" are not accepted. But I would probably do it in the parser 6) Parser (more related to previous episode). I'd make a recursive decent one, it's normally more applicable for programming languages Yeah, this sounds kinda harsh, but I don't want to hurt. Just pointing some stuff out :)
12:02 I'am doing my own lang and I've come up with a hack, to use function pointers instead of state enums. The code looks something like this: struct StateData; using voidfptr = void*; /* some compilers allow to store function pointers as void*, but the C-standard says that one should only use a function pointer type to store a function address. So created a wrapper around a dummy function pointer type for my project that serves as a void*, but for functions */ voidfptr NewToken(StateData&); voidfptr NumericLiteral(StateData&); voidfptr OperatorLiteral(StateData&); voidfptr CompleteToken(StateData&); int main() { using StateProc = voidfptr (*)(StateData&); StateData data = {/*init data*/}; StateProc lpfnNextState = NewToken; while (lpfnNextState != nullptr) { lpfnNextState = lpfnNextState(data); } } voidfptr NewToken(StateData& state_data) { int c = /* get cur symbol*/; switch (c) { case '0'...'9': return NumericLiteral; case '+': case '-': case '*': case '/': return OperatorLiteral; case EOF: return nullptr; } }
@@karbovskiy_dmitriy modern cpus are quite smart about virtual calls, as long as the code fits into the instruction cache. The actual slow part of the virtual call is the cache miss, not the call itself
@@Raspredval1337 the slow part is indirect call prediction. Virtual calls are hard to predict, especially if the destination address changes all the time (as it will). Predictable virtual calls are the calls from different sites or repeating patterns of calls.
@@karbovskiy_dmitriy any conditional jump would invoke branch prediction, no matter if it's a switch or a virtual call, it makes no difference to the cpu. The main benefit of the switch is cache locality.
@@Raspredval1337 in your example lpfnNextState(data) is an indirect call from the same calling site with the same parameters. Regular branch prediction is easily predicted because of both static and dynamic predictions and it can be optimised for out-of-order execution and pipelining. Indirect branching leads to unpredictable addresses so it stalls the pipeline. The choice to take the branch is binary. The destination addresses in indirect branching are hard to predict, especially from the same calling site. Intel Software Optimisation Guide in the section 3.4.1.5 explains why indirect calls with multiple possible destinations are slow. If there is no BTB entry for this call, the default behaviour is fall-through (in terms of out-of-order execution, which is bad for FSM). BTB only holds 1 entry for this call, making it impossible to consistently predict the target address. They even suggest breaking indirect branching into direct calls for better prediction rates.
this made me realize how unclever, stupid, and genuinely worthless i am. ive never organically developed anything without several resources, websites, and other people's code. this is natural selection in action, i am not cut out for this world at all lol.
I feel like every video about a advanced topic has some "experts" waiting who want to feel superior. They already know everything but still watch the video with their notepad open to comment every little "mistake". He never claimed that this is best practise or an educational tutorial video. He is doing it for free and for fun. Just sit back, relaxe and watch it as entertainment.
Making a custom lexer isn't that hard though, and is quite fun. "Just use xxx" is a solution to almost everything shown on this channel, that's not why people come here. Generating the parser (not necessarily the lexer) using something like bison is likely the best option if your goal is to get a usable language though, since it gets exponentially more spaghetti as you add more to the grammar.
comparison operator does an implicit subtraction in x86_64 so i don't think there is any benefit of doing this. Except of making the code less readable
@@maksymiliank5135 Subtraction is slower How does this make it less readable? I'm transforming the result of removing the lower bound of the range from the character we want to check, then casting it to unsigned to ensure the comparison thereafter does not consider negatives. Also if the comparison is doing subtraction under the hood then all the MORE reason to cast the result to unsigned. Makes clear to the compiler that no jump is needed for the result wanted, just throw the result of the 1st subtraction into an unsigned subtraction to see if the result is < 0.
This is bad advice: - exceptions are a plague of bad solutions. They don't solve any real problem and they rot the code. I've implemented multiple parsers for existing and my own languages. It has been exactly 0 times that exceptions had a signle use case. This is a bad example for an educational video. Error codes are compact and meaningfull, and also local to the problem; - your token struct is way too large. Just the std::string in x64 Release build is 32 bytes. This token is 48 (without the duplicate "type" member), which almost the whole cache line. At least a union would be useful; - 32:32: this throw statement just crashes the lexer instead of handling the error. The robust implementation would report the lexical error and continue the analysis. The same applies to other lexical error, not covered in this video. This is a bad excuse to justify the use of exceptions. Correction: this is not a parser, this is a lexical analyser, or lexer. LUT with constexpr is a good advice. std::string.find is an extremely bad implementation that shouldn't be even mentioned (that's 4 procedure calls in Release mode, 2 if the string is stored as const).
why not program all operators as unary operators? so if you have 5, it is the unary +5 etc... So this whole video is about constructing a lexical analyzer? Cause you've not called this part of your language that at all.
I don't know how to interpret what a unary divide means. Correct. I've chosen not to use any compiler terminology, or established techniques, tricks or algorithms. Buy a book about compiler design if formal training is preferred. I code things for fun.
12:50, when you said ‘state now’ and ‘state next’ you identified a problem I have been having with my own project that had been throwing me problems for a good part of the year! Thank you!!!’
I know you’re a good person because you put opening and closing braces on their own lines.
I think he’s a good person but I do hate this about his coding style
@@johnbryan1049 Just take it as something else you can (and should) learn
@@maskettaman1488 yeah I used to use that style. I switched one day like 15 years ago and didn’t look back. It was just easier to read for me.
100% braces on their own lines.
I'm brazilian computer engineering student and I love to see your videos because they show me the side beyond the theory, I head about Finite State Machine in my class of "Aspectos teoricos da computação" and we started with Finit Automaton, love it!
Hi there, I am also a Brazilian and a university professor. I've been following OLC for quite a while and even used some of his examples in class. If you're interested, check out my playlists. Valeu.
Brazil
i find it very impressive how your setup never really changes.. i rearrange my monitors and computers every 6 months..
impressive stability
FSMs are one of my favourite CS Concepts. Happy to see you cover them
Excellent topic! and inspiring to my current project of parsing C++ preprocessor in Python (to convert it into JSON, YAML, CSV, SQL, etc.). And state machines are so fundamental to all kinds of programming that it never hurts to revisit and review the potential of this straightforward but powerful concept.
I’m designing an assembler for a home brew instruction set architecture and coincidentally, OLC starts this series on designing custom programming languages! It’s like you’ve sensed that a viewer was in need and came to hold my hand. lol
Love your videos and thanks!
Seems fitting to call the language OLC++; OLC# if one feels maverick.
OLCaml
SUPER DADDIO! I love it 😁
Some time ago i've made my first compiler, and probably has the most simple tokenizer ever. Its basicly divide all the strings were there is some type of "separator" like spaces, operators and marker simbols, then for the classification i only checked if it matched with the predicted symbols of the language, if its not a simbol its a number , variable or a error (my language only has variable and number as data ), so all i needed is to check if it was numeric or if the variable regex was a match. I got really proud of it :D
This is one of those "minefield" programming tasks; now we need it to parse numbers formatted with commas instead of decimal points (e.g. 3.1415927 vs 3,1415927) i.e. make it international. Also maybe apostrophes e.g. $8'100.56 (oh yeah, currencies and units too!)
Yeah it's an excellent point, and wonderfully illustrates just how complex "real" compilers need to be
@javidx9 Absolutely. This is an example of one of the instances where (in a commercial environment) I believe a proper set of requirements should be documented with sound definitions of syntactically acceptable inputs and expected outputs. I think your teaching style is excellent - well paced and balanced with an emphasis on the fundamentals, leaving the viewer with the knowledge to approach solving problems for themselves. I'm looking forward to the rest of the series. Thanks.
Muahaha! This takes me back. I've written multiple programming languages just for giggles. I remember I became almost OCD with a language where everything is objects and code would be full of monads. What I mean is something like:
123.If > 0
.Then(...)
.Else(..)
I wrote programming language based on XML. I even wrote my own virtual machine. Come to think of it, I've written .NET virtual machine to track execution of programs.
That was back when I had the drive to write stuff on my own time. Then I understood that being professional means you need to be paid to write code 😁
I remember this one chick I had the biggest crush on who tried to get me to learn about monads. I ended up hating monads and don't ever want to touch them again
Lol. Monads are handy in certain situations. I implemented this Result class which has methods OnSuccess and OnFailure. So when something returns Result type, I can write something like this:
var result = ...;
result.OnFailure(e => Log(e));
I find this much more clear than if(!e.Success) { .. }
Luckily my wife isn't in software biz 😂
I was just trying to write my own compiler-ish program and this video came out. OLC is always there when you need
You always upload exactly what I’m looking for at the perfect time!
Awesome t-shirt! Ya seem like you would be an awesome dad! Edit: Love that you leave errors in the code! Reinforces that everyone makes mistakes. Thanks for the video!
That reminds me of when i made a json parser. Surprised to see that smart people actually do it similar to how i did it. Definitely have to revisit the project and see if i can improve something
I can't help but think that a "pure" FSM would deal with decimal points by adding one or more new states, e.g. "seen digits but no point", "seen digits then point then zero or more digits", "seen point" etc
Am looking forward to more of these videos, writing my own language was interesting until I got utterly stuck at handling recursion
Nice implemenation of simple language. Takes me back to learning lex/yacc (or bison for open source folks), and 'antlr' for Java platforms. Using those, taught myself a lot about 'Backus-Naur' form of describing programming languages. It dovetails nicely with your FSM description of your development.
I note that some of your if-checks are very order sensitive but you don't really mention it. You check white-space first, and numerics before symbols to ensure the right tokens are recognized first.
And so far you don't differentiate between integers and reals. This can be another fun bit to explore a little. Then we could have fun with '' as shift operators only valid on integer types. Or maybe supporting octal by recognizing a leading zero?? A lot of fun bits to mess around with. :)
I've made compilers using Flex and Bison, but this is very interesting. Learning the craft from scratch.
If you ever write a book about (game)programming I'll definitely buy it
It may be a nice idea to add some learning order in your video's
just FYI: I'm no Hungarian, but I do know a few lang-related things, and one of those is that the Hungarian 'S' is technically "sz", as just a simple 's' is pronounces "esh".
So the name "Thomas" (in ENG) would be "Tomas" but pronounces "Tomash".
Amazing vid! I was always quite interested in such lang-related parts, and making my own tokenizer + parser + AST processor. My inspiration primarily comes actually from the Groovy language which includes the "groovyConsole" utility within its SDK and allows a dev to inspect the AST and even the CST! Freakin awesome!
Cheers!
When he said "Hungarian S" he didn't mean "S in the Hungarian language". Hungarian notation in computer science is a naming convention for variables that includes the type of the variable (or some indication of the type) in the name. So in the case of sMyVariable, the "s" at the beginning of the name indicates it's a struct or a string ( en.wikipedia.org/wiki/Hungarian_notation )
When he sayd "let's drop the Hungarian S" he just means that he's renaming the variable without indication of the type in it.
It's not common anymore these days, particularly not when it comes to strongly typed languages like C++, but Javid likes to use it.
@@captainufo4587 no no, I am fully aware, but I can tell you for a fact that this notation is actually done by prefixing var-names with `sz`
@@chillyvanilly6352 They are, but not for that reason. If you look through Simonyi's original article on the notation, "st" refers to strings, and "sz" refers to zero-terminated strings.
your videos make me very happy
Ooo right up my street. Grabs some popcorn and off we go!
Good video so thanks. But gets less excited because you're using ever more features of C++...Will that language ever be completed??! C++ IMO is constantly getting bigger and bigger and bigger and over complicated for the problem in hand...No wonder it takes 6 years to produce a computer game these days. Some will say ah but you don't have to use every feature. But many will, so you have the situation when people looking to get into C++ will be overwhelmed by 20 versions of code when they ask for help! Even John Carmack has said he codes in C mainly and adds just a bit of C++.
Awesome video, can't wait for more!
I've never managed to get myself all the way through one of your series, but just seeing you post videos brings a smile to my face.
Have a good day, sir, and keep doing what you do!
Me too. He's the bob ross of programing tutorials.
thankya kindly!
@13:27 yes, i stumbled on this problem with kind of FSM which i discovered myself (not knowing it has a name), sometimes i introduced next state,
sometimes depended on break behaviour in swich / case constuct, but nextstate is easier to underestand, because switch/case may became unreadable
if you try to not repeat code
*Summary*
This video focuses on building a more robust parser for the DIY programming language using a finite state machine (FSM).
*Key points:*
* *(**0:00**)* Introduction and recap of the limitations of the previous shunting yard algorithm.
* *(**7:28**)* Implementing the FSM in code.
* *(**7:47**)* Housekeeping: Adding standard libraries, updating the `Symbol` struct to `Token`, and creating the `Compiler` class.
* *(**12:02**)* Implementing the FSM states (`new token`, `numeric literal`, `operator`, `complete token`).
* *(**17:47**)* Creating a custom "is digit" function for performance and flexibility.
* *(**21:35**)* Handling whitespace in the input.
* *(**26:54**)* Adding parenthesis handling and a parenthesis balance checker.
* *(**32:01**)* Introducing symbols for variables, function names, etc., and the corresponding FSM state.
* *(**36:18**)* Handling hexadecimal and binary numeric literals.
* *(**40:35**)* Adding additional token types for future use (parenthesis, scope, separator, string literal, keyword).
* *(**41:24**)* Implementing the `string literal` state.
* *(**46:35**)* Integrating the shunting yard algorithm with the new token-based system.
* *(**47:10**)* Demonstrating the improved calculator with different numeric types and syntax checking.
*Result:*
A more sophisticated calculator that can handle complex expressions with different numeric types and basic syntax checking. Future videos will build upon this foundation to implement more advanced programming language features.
i used gemini 1.5 pro to summarize the transcript
That's cool... is this an automated message?
@@dude2542 no. i just copy and paste the transcript manually for some of the more interesting videos i watch
lakaban has a couple of articles on compact lexer table representations that are a banger! ❤
Are you planning to add line and character counters so that the error reporting can give the exact location of an error? seems like a very simple addition that provides a nice feature many compilers have.
Great, as always
OK ..i agree your tokenizer work well and seems that produce rock solid array of tokens ..or at least
array of UDT ...so next step will be parser i guess..it is kind of hard to follow allthis because i dont use C++
i find is hard ...
can you btw explain how function with parms can be called multiple times in a recursive call
i mean how to 2 or more recursive calls know which call is already executed ,,thanks
Thank you Javid!
I would be interested in your solution for a pure C implementation. 🍀
What if you implemented the lut as a bitfield rather than an array?
Also how do determine precedence without parentheses between (x++)+2 and x+(++2)?
26:45 has that answer
As always, you post a great vid! :)
Thanks for the video! It really motivates me to do my job =)
I just watched a video about category theory, and that diagram at 7:35 looked very familiar. Containing all of the features of a category: objects, morphisms and even endomorphisms. If i said something wrong here please correct me. My knowledge on the subject consists of a 5 minute youtube video and using bartosz milewski's course on category theory to sleep.
Nice video! Keep up the great work.
You t-shirt is so cool! I want one! :)
Yayy he's back
I understand every single thing that you discuss in your compiler videos, but I know that I would never be able to implement said things on my own. And I’d feel like a total fraud if I tried copying your code just to end up putting it on my resume. This is my dilemma as a programmer. I love your videos. Any tips for this issue?
Firstly, thanks and I appreciate the support! Secondly, people only learn anything through copying or being shown. Once the basics are there, your skills can develop on their own and that's where novel solutions and engineering stem from. In my experience through the community around this channel, I've come to learn that many folk expect to be able to code like someone with 30 years practice in just 2 years. I don't know your experience level, but don't underestimate your abilities right now, the first step is to acknowledge the things you can't do, at least then you know what to work on.
@@javidx9 thank you so much for your advice. I actually recently graduated with a bachelor’s degree in software engineering but I can’t escape this imposter syndrome. I still have yet to land my first professional programming job. Your advice gives me something to surely think about however. I feel like there’s tons of stuff I can’t do. But I guess it also helps to know the type of stuff that I actually want to do. You’re awesome man!
You are legend
43:16 That should trigger at newlines too. I think in javascript you can use newlines directly only when the opening character is `.
It's a raw string. Literally. There are no escape characters either
@@maksymiliank5135 But as indicated earlier this code is intended to be used on files later which means newlines need to be handled
maybe he wants to allow multi line strings with just quotes? :D
It's so funny to me that you care so much about the speed of the code for checking if it's a digit, yet for checking if it's operator you just use this massive `contains` function from the standard xd
Also it feels like there should be "stateLast" which i think could make some of the exceptions more readable.
Not sure if intended but expression ".005" is considered valid numeric literal and also it's possible to start and end the expression with separator
Nice! Just did a system update and the calculator app (kcalc on KDE) no longer parses decimal numbers without the leading zero. ".1" is invalid input and it's annoying. I dove into the code but there's probably 2 dozen cpp files and the whole thing seems too far gone.
Did I miss the bit where you explained why you use "digit" as a synonym for "character"?
I did what?!! Oh nooooo!
Hey Javid!!!! What a treat to see you again haha
Diggin the T-Shirt :D
Instead of Flexing your c++ muscles, I thought that you were going to talk about using Flex (Lexical analyser generator) ;)
41:38, why are you escaping the double quote in a character literal?
trust me, this video is not long enough
Cool. Show them lexx and yacc
Reinventing Lex - the lexical analyzer?
Incredible next le lexer?
wait, is this language going to be actually compiled or interpreted? and if so, compiled to what arch?
17:30 it's compiled, that's what he said
2:10.
Not insane. If you're building a bit mask you might well mix bases, depending on context.
I must confess, I've never used pi in a bit mask though
Instead of "fancy", "decorated" is a better word for the numeric literal state where it is open if binary or hex.
You have nice handwriting :)
18:53 aren't _switch_ statements the built-in lookup tables? It looks way nicer than that monstrosity
No that would be a jump table. A lookup table would be indexing an array, which is what he is doing. And will obviously be faster than creating a function with a switch statement that returns true/false.
@@rosen8757 but he uses the lookup table to jump around the code, why not use something like:
switch (c) {
case ' 0 ' . . . ' 9 ':
case ' . ':
// do stuff
}
@@Raspredval1337 it will be the same just the other way around, with your version he would need to check which state machine state he is in in every // do stuff. As it is now the outer switch is for the state machine state and inside he check tokens.
Hi! Can I use a dictionary for checking if a character is or is not a digit or operator?
Inefficiënt. Just check if it’s part of the string of operators
Just use isdigit and some switch statements, most efficient.
How was first compiler compiled ?????
By hand!
@@javidx9 can you make video about it
I would like to point out a few things I would do differently:
1) Instead of Finite State Machine you could've just have a lexer class with a nextToken method, which would peek a char and decide what to do, for each variant just call a blocking function that returns a token
2) Maybe this is to keep things simple in the first episodes, but error handling with exceptions... In a programming language you'd probably want to get all errors at once (multiple error reports). So I'd store them in a vector. Note: This may not be very applicable for lexer, but for parsers error recovery is a thing.
3) Indexing char iterator with zero. It's a valid approach, but if this iterator was a stream directly read from a file (or simpler - linked list or set iterator), this probably won't work. Not all iterators implement indexing
4) Considering #1 and #3, I'd not collect tokens into a vector, but rather parse them immediately. This gets rid of memory and performance overhead and to make parsers easily you'd normally only need one token lookahead (peek a token and maybe consume it)
5) Parenthesis - as well as checking after the loop, you really should check for the count to be positive in closing parenthesis, so combinations like ")123(" are not accepted. But I would probably do it in the parser
6) Parser (more related to previous episode). I'd make a recursive decent one, it's normally more applicable for programming languages
Yeah, this sounds kinda harsh, but I don't want to hurt. Just pointing some stuff out :)
9:20 my hungarian ass was real confused for a second.
24:01 (:3)
:3
12:02 I'am doing my own lang and I've come up with a hack, to use function pointers instead of state enums. The code looks something like this:
struct StateData;
using voidfptr = void*;
/* some compilers allow to store function pointers as void*, but
the C-standard says that one should only use a function pointer type
to store a function address. So created a wrapper around a dummy
function pointer type for my project that serves as a void*, but for functions
*/
voidfptr NewToken(StateData&);
voidfptr NumericLiteral(StateData&);
voidfptr OperatorLiteral(StateData&);
voidfptr CompleteToken(StateData&);
int main() {
using StateProc = voidfptr (*)(StateData&);
StateData data = {/*init data*/};
StateProc lpfnNextState = NewToken;
while (lpfnNextState != nullptr) {
lpfnNextState = lpfnNextState(data);
}
}
voidfptr NewToken(StateData& state_data) {
int c = /* get cur symbol*/;
switch (c) {
case '0'...'9':
return NumericLiteral;
case '+':
case '-':
case '*':
case '/':
return OperatorLiteral;
case EOF:
return nullptr;
}
}
So basically a virtual call. That's slow.
@@karbovskiy_dmitriy modern cpus are quite smart about virtual calls, as long as the code fits into the instruction cache. The actual slow part of the virtual call is the cache miss, not the call itself
@@Raspredval1337 the slow part is indirect call prediction. Virtual calls are hard to predict, especially if the destination address changes all the time (as it will). Predictable virtual calls are the calls from different sites or repeating patterns of calls.
@@karbovskiy_dmitriy any conditional jump would invoke branch prediction, no matter if it's a switch or a virtual call, it makes no difference to the cpu. The main benefit of the switch is cache locality.
@@Raspredval1337 in your example lpfnNextState(data) is an indirect call from the same calling site with the same parameters. Regular branch prediction is easily predicted because of both static and dynamic predictions and it can be optimised for out-of-order execution and pipelining. Indirect branching leads to unpredictable addresses so it stalls the pipeline. The choice to take the branch is binary. The destination addresses in indirect branching are hard to predict, especially from the same calling site.
Intel Software Optimisation Guide in the section 3.4.1.5 explains why indirect calls with multiple possible destinations are slow. If there is no BTB entry for this call, the default behaviour is fall-through (in terms of out-of-order execution, which is bad for FSM). BTB only holds 1 entry for this call, making it impossible to consistently predict the target address. They even suggest breaking indirect branching into direct calls for better prediction rates.
разбитие на объекты по условию, после определение типа объекта...
:)
Waffle
this made me realize how unclever, stupid, and genuinely worthless i am. ive never organically developed anything without several resources, websites, and other people's code. this is natural selection in action, i am not cut out for this world at all lol.
Just call the language Pixel.
I feel like every video about a advanced topic has some "experts" waiting who want to feel superior. They already know everything but still watch the video with their notepad open to comment every little "mistake". He never claimed that this is best practise or an educational tutorial video. He is doing it for free and for fun. Just sit back, relaxe and watch it as entertainment.
41:23 you forgot octals :)
This is the longest regular expression informercial I've ever seen.
You can use ready made compiler compiler. It’s easier
Boooooooo
Making a custom lexer isn't that hard though, and is quite fun. "Just use xxx" is a solution to almost everything shown on this channel, that's not why people come here.
Generating the parser (not necessarily the lexer) using something like bison is likely the best option if your goal is to get a usable language though, since it gets exponentially more spaghetti as you add more to the grammar.
taking shortcuts makes for shit videos
18:11 One better: if ( (unsigned)(c - '0') < 10 ) { ... }
comparison operator does an implicit subtraction in x86_64 so i don't think there is any benefit of doing this. Except of making the code less readable
@@maksymiliank5135 Subtraction is slower How does this make it less readable? I'm transforming the result of removing the lower bound of the range from the character we want to check, then casting it to unsigned to ensure the comparison thereafter does not consider negatives.
Also if the comparison is doing subtraction under the hood then all the MORE reason to cast the result to unsigned. Makes clear to the compiler that no jump is needed for the result wanted, just throw the result of the 1st subtraction into an unsigned subtraction to see if the result is < 0.
@@zxuijithe compiler will compile down to the exact same assembly with both versions.
@@rosen8757 If they're the only 2 conditionals then sure, I'd take you're word for it but not when there's extras to consider like c >= 'A' && c
This is bad advice:
- exceptions are a plague of bad solutions. They don't solve any real problem and they rot the code. I've implemented multiple parsers for existing and my own languages. It has been exactly 0 times that exceptions had a signle use case. This is a bad example for an educational video. Error codes are compact and meaningfull, and also local to the problem;
- your token struct is way too large. Just the std::string in x64 Release build is 32 bytes. This token is 48 (without the duplicate "type" member), which almost the whole cache line. At least a union would be useful;
- 32:32: this throw statement just crashes the lexer instead of handling the error. The robust implementation would report the lexical error and continue the analysis. The same applies to other lexical error, not covered in this video. This is a bad excuse to justify the use of exceptions.
Correction: this is not a parser, this is a lexical analyser, or lexer.
LUT with constexpr is a good advice. std::string.find is an extremely bad implementation that shouldn't be even mentioned (that's 4 procedure calls in Release mode, 2 if the string is stored as const).
why not program all operators as unary operators? so if you have 5, it is the unary +5 etc...
So this whole video is about constructing a lexical analyzer? Cause you've not called this part of your language that at all.
I don't know how to interpret what a unary divide means.
Correct. I've chosen not to use any compiler terminology, or established techniques, tricks or algorithms. Buy a book about compiler design if formal training is preferred. I code things for fun.
First comment