How regexes got catastrophic

Поділитися
Вставка
  • Опубліковано 17 січ 2025

КОМЕНТАРІ • 217

  • @focando-lol
    @focando-lol Місяць тому +243

    i understand nothing but lowkey enjoying what is happening

    • @Musikvidedo
      @Musikvidedo Місяць тому +10

      if you keep at it you'll understand most of it pretty soon

    • @NithinJune
      @NithinJune Місяць тому +7

      kay is so pretty how am i supposed to focus 😵‍💫

    • @JordanManfrey
      @JordanManfrey Місяць тому +7

      That’s most developers experience using RegEx in a nutshell

    • @v0id_d3m0n
      @v0id_d3m0n Місяць тому +1

      It makes more sense if you've watched the first two videos :)

    • @MarcosElMalo2
      @MarcosElMalo2 11 днів тому +1

      Kay had a very nice voice.

  • @yaboi1525
    @yaboi1525 Місяць тому +148

    Please don't stop uploading. I found this channel by chance and it has to be one of the best things happening this year

    • @moosehunter248
      @moosehunter248 Місяць тому +3

      Sounds like you should consider sponsoring the channel so it doesn’t happen 😊

  • @Julienraptor01
    @Julienraptor01 Місяць тому +136

    I now understand why i'm good at regex where most people fail
    I never use backreferences
    Like idk why but they never felt right logically
    Even when i made crazy things like a youtube URL parser to clean those in regex, i've found ways to just do it without backreference when i could have used some
    And it's kinda cause when i'm building the regex, i'm running it in my mind and backreference just makes that impossible
    Like tracking what it does become too complex
    So big thanks for this vid, very informative and great !

    • @Eunakria
      @Eunakria Місяць тому +17

      for most CFGs I've always found it easiest to just build a tokenizer with regex and a parser with other techniques
      leaving the abstract computer science world and entering practical software engineering world - it is possible to express any CFG as just that, a parser, but the tokenizer/parser distinction often yields faster, more readable code with advisable properties for most practical applications of parsers (e.g. DSLs such as SQL or serialization formats such as JSON)

    • @JordanManfrey
      @JordanManfrey Місяць тому +1

      I’ve never used them because when I learned them in school they would bring up stuff like that as being dangerous bullshit

    • @Eunakria
      @Eunakria Місяць тому

      @@JordanManfrey I've had teachers and profs like this before. totally disconnected from the real world. they treat regexes like the devil, while pretending that memory safety errors in languages like C/C++ are a non-issue.
      regexes are an incredibly powerful tool if you know how and where to use them. definitely, part of "how and where" is knowing when you as a programmer, or your team, may be too error-prone to use the tool in confidence. I definitely would avoid using regex, or at least seriously consider and verify my use of it, in any high-availability or high-reliability context; but it's not so bad, all things said and done

    • @AySz88
      @AySz88 Місяць тому +5

      ​​@@EunakriaBizarrely enough, I've more often encountered new learners going the other direction - they "in practice" want to abuse regex features, because it's built in, and looks powerful, so it must be fast. (Think "I built a calculator in Minecraft" style shenanigans, and Parker's usual stuff for that matter.) And parsing and tokenizing is more of an abstraction for them, especially if they never got a sense for how to use the toolkit of functional programming.

    • @Noname-67
      @Noname-67 Місяць тому +1

      Maybe your brain is simulating a finite automaton when doing regex.

  • @wojciechostrowicz
    @wojciechostrowicz Місяць тому +25

    YT algorithms just decided that my constant work with regex deserves this video. Thank you algorithm.
    It was very pleasant to watch.

  • @rkvkydqf
    @rkvkydqf Місяць тому +6

    I can't believe this video has only 31K views. All this work the amazing visualizations, the quality of the explanations, the lined exercises in the description. I truly hope all this work would get rewarded some day. Thank you so much.

  • @9_-_-_-_-_swo
    @9_-_-_-_-_swo Місяць тому +47

    i got a (.*?) tattoo when i barely knew what regexes were lol. thank you so much for this series, it has done wonders for my ability to live with that decision (and is also some of the best comp sci content I've ever seen on this platform

    • @NickiRusin
      @NickiRusin Місяць тому +1

      that's a really good idea for a tattoo

    • @TVDaJa
      @TVDaJa Місяць тому

      Someone please explain for my lil brain

    • @SimonBuchanNz
      @SimonBuchanNz Місяць тому +20

      ​@@TVDaJaaccepts anything, but makes everything after it backtrack as much as possible if it fails. An incredibly efficient way to be as inefficient as possible!

    • @9_-_-_-_-_swo
      @9_-_-_-_-_swo Місяць тому +2

      @@SimonBuchanNz u can just put it between stuff

    • @SimonBuchanNz
      @SimonBuchanNz Місяць тому

      @@9_-_-_-_-_swo yeah, it's fine for input you know is going to be a reasonable size.

  • @JasonWalter
    @JasonWalter Місяць тому +12

    Reminded of what JWZ said: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

  • @LeeDanielCrocker
    @LeeDanielCrocker Місяць тому +72

    FYI, Rust's standard regex library uses an O(mn) algorithm without backreferences.

    • @GregWhiteley
      @GregWhiteley Місяць тому +10

      Puzzled by this. Rust has _no_ standard regex library, and rust-lang discussion groups trumpets this as a feature, not a bug. Or has something changed.

    • @albertweber1617
      @albertweber1617 Місяць тому +8

      The de-facto default regex crate "regex" does not implement look-around or backreferences

    • @LeeDanielCrocker
      @LeeDanielCrocker Місяць тому +8

      @@GregWhiteley The regex crate is not de jure standard, but de facto the one everyone uses.

  • @erik_with_a_kay
    @erik_with_a_kay Місяць тому +6

    Where [regex] blood makes [regex] blood unclean! First-time watcher and this was great.

  • @jholloway77
    @jholloway77 8 днів тому

    You are seriously a great teacher!
    Wonderful video and wonderful discussion on the topic

  • @AstonishedByTheLackOfCake
    @AstonishedByTheLackOfCake Місяць тому +21

    I've always had a sneaking suspicion that regexes were too versatile and did too many things to actually be performant, turns out I was right
    though I did not expect back references to break the entire concept of a language being regular
    will probably use Google's regex engine in the future since I barely ever use backreferences anyway

  • @dr.strangelove5622
    @dr.strangelove5622 Місяць тому +19

    This is amazing!! I read about Thompson's algorithm last week when I was studying non-deterministic automata and the fact that regex engines in most modern and/or popular programming languages are slower than it and suffer from exponential blowup for longer expressions (if I remember correctly). The visualizations of algos was great and helpful in understanding them. Thumbs up for that!!
    All this increases my respect for these giants: programming all these using ed, the standard editor, on a teletype connected to a computer which was much slower than our present day handheld gadgets.

  • @somdudewillson
    @somdudewillson Місяць тому +9

    My personal position is that once one has spent a few hours trying to convince various HTML parsing libraries to _only_ parse the input string and stop reformatting it, parsing with a regex starts looking pretty good.

  • @Niki1A_
    @Niki1A_ Місяць тому +43

    Why not use the lockstep algorithm, when the regex has no backreference? It would be easy to just store a boolean with the automaton, that indicates whether it has backreferences, and pick the algorithm accordingly. This would limit catastrophic cases to regexes that actually use backreferences, which could be taught as something to be avoided.

    • @Rock6Sixes
      @Rock6Sixes Місяць тому +10

      wanted to comment this, but looked through the comments first. Seems very obvious, but I can't think of the reason why. Maybe the automata structures get reused when possible and you can't know where to use which engine.

    • @rngwrldngnr
      @rngwrldngnr Місяць тому +11

      It's possible some systems do, these days. The problematic cases all have backreferences, so it wouldn't actually help in those edge cases.

    • @Ziraya0
      @Ziraya0 Місяць тому +1

      The only reason I can imagine is that with and without backreferences would be two literally different algorithms and it's going to be hard to make the algorithms otherwise have exactly the same outputs for the same patterns & inputs. I've never used a backreference before, I can't really conceptualize why I would, but in part that's because I already know that regex can't parse html or email addresses, so I've never tried to do a thing you would use them for with regex; the obvious answer to me is basically what you said, but instead of a flag, the language has two literally distinct regex implementations that you're responsible for picking between, and writing regex that produces the result you want for the one you're using. One is fast, one has this weird trick.
      I do a lot of manual regexing stuff in text editors, spreadsheets and microsoft's PowerRenamer, so I'm already used to tailoring patterns and methodology to the environment I'm in. PowerRenamer in particular has an alternate engine it can use that's supposedly faster, but they're not 100% congruent in features and behaviors. In particular, PowerRenamer's default, a wrapping of .Net regex does something wrong with lookbacks so if you want lookbacks to work, you have to use the other one.

    • @PizzaRollExpert
      @PizzaRollExpert Місяць тому +5

      This might arguably be the best solution, but it's still far from ideal. Adding a backreference would completely change the performance characteristics of the regex, which is pretty unintuitive. It's better to have lockstep by default and force the programmer to use some special syntax for regexes with backtracking to make it clear that they have different performance characteristics. You might also want to use the ones with backtracking in some special cases anyway, since they can have a better best-case performance on certain input strings.

    • @ThomasWilde
      @ThomasWilde Місяць тому +1

      They can and do. That boolean flags the difference between O(n) and O(2^n). That's the problem.

  • @faldarith
    @faldarith Місяць тому +3

    Thank you for sharing the stackoverflow answer, I’ve never felt so seen

  • @4l6ag3
    @4l6ag3 Місяць тому

    What a great overview of this! Great refresher of things I haven't though much about since college, and explained more concisely than any of my professors managed to.

  • @nolanmccarthy3335
    @nolanmccarthy3335 Місяць тому +2

    I'm really drunk and I don't know what you're talking about but I'm enjoying it

    • @mmmhorsesteaks
      @mmmhorsesteaks Місяць тому +1

      Best way to enjoy regexes, really.

  • @sardonyx001
    @sardonyx001 17 днів тому

    I studied this 4 years ago at uni and totally forgot this is how we reason about regexes! Super helpful video thank you!

  • @bntegor
    @bntegor Місяць тому +5

    10:25 this transition cracked me up haha great video!

  • @JamesChurchill
    @JamesChurchill Місяць тому +9

    I use a regex to finite state machine code generation tool occasionally for tricky problems, and it's always violently obvious when I've accidentally added nondeterminism to my input regex - the state machine that comes out of the other side blows up spectacularly.

  • @fugoogle_was_already_taken
    @fugoogle_was_already_taken Місяць тому +32

    Chomsky hirearchy is not a guideline. If you adhere to proper definitions, it is provable. My formal languages teacher would tear you apart for this 😂

  • @drelephanttube
    @drelephanttube Місяць тому +17

    Haven't even watched this yet but a new Kay Lack video is automatic thumbs up.

  • @Tumbolisu
    @Tumbolisu Місяць тому +6

    I first learned about the concept of regular expressions at university. (yeah, my interests are not in parsing text, so i literally never saw them until then.) I was taught about the whole hierarchy, with context-free grammars and all that. a year later, i am working on a python project made by other students across the years, and it uses regular expressions all over the place. i sit down and properly learn about them, immediately notice that almost all of the code can be improved (such as some regexes potentially matching the wrong things), and of course, notice that the entire concept of matching groups and back references just... doesn't make sense. it's called a regular expression, but it actually isn't? the fact that every regex engine is forced to use the backtracking algorithm Precisely Because modern every-day regex isn't a regular language is just the cherry on top.
    this video mentions that people want to use regex to match html tags... just write a parser!? html is dead simple, it's the easiest thing on the planet to parse. i have manually written parsers for more complex things for fun.
    besides, i never liked the way regex strings look. they are just this ugly mess of punctuation and backslashes. i know some people who exclusively let ChatGPT write regex strings for them because they trust an LLM more with that mess than themselves.

  • @stephenJpollei
    @stephenJpollei Місяць тому +1

    This was a fantastic video and I did the first exercise and found strings that caused issues that made them go over 5 seconds. Most of the strings were well under 80 characters and I think the max was around 300. I likely wasn't efficient with the 300 one and could have made a shorter one.
    I think that this really made me understand the issue much better, but I will have to look into some more.
    Thank you, for making such an excellent set of resources.

  • @websiteuser7926
    @websiteuser7926 Місяць тому +2

    great video, i read that russ cox page years ago and it was fascinating. its surprising that so many languages still use slow regex algorithms

  • @chrispy2117
    @chrispy2117 Місяць тому +1

    Do i know enough to understand this video? No, absolutely not. Did your voice and explanations make this a really calming video? Yes, 100%. So despite the fact I'm probably not your target audience, I'm absolutely dropping a sub 😄

  • @MichaelLouard
    @MichaelLouard Місяць тому +2

    I can only validate you by experience and not the full level of theory you know, but I’ve had to refactor SEVERAL notebooks containing regex once I learned about what you’re talking about (albeit not as well as you’ve explained it here)
    please take this as a sign you are not just teaching people who know nothing, but also people who know some, but need to be better (people like me!)
    Thanks!

  • @yash1152
    @yash1152 Місяць тому +5

    0:40 lemme guess, backtracking is that least efficient method most Rgex engines use to support the backreference matching, with exception of RE2 (in c++) made by google's engineer with doesnt support that feature.
    Fun fact, backref matching is not a regex capability in first place.

  • @ouroya
    @ouroya Місяць тому +1

    watching this video is making me very glad i've spent the past few months reimplementing my friend's 32-line sed across multiple hundred json files into a program that actually parses and understands json properly. who would've guessed that this works infinitely better and is infinitely more capable?

    • @PauldeVrieze
      @PauldeVrieze 11 днів тому

      Json, like many other formats, is recursive and as such not regular by definition. Writing a (simple) parser is generally superior (but ruined by the dragon book).

  • @gronki1
    @gronki1 Місяць тому +7

    I love your channel! Just recently wrote my first parser in Fortran, language everyone tells me that I should not waste my life on, but it was possible!

    • @boombeet678
      @boombeet678 Місяць тому

      Actually a really good idea by the way, many people would recommend more modern alternatives like Python or Julia in terms of languages (even though they have enough differences not to necessarily be considered straight up replacements) for it but considering the time in which it came out, I would say FORTRAN is more akin to Lisp than COBOL in terms of it's use cases. Alike Haskell, probably wonderful to look at not necessarily for it being a swiss army knife (sometimes even getting outclassed in terms of specialization when it comes to which language is "the right tool for the job") but for what it does theoretically and technically, as a construct that holds up throughout time in that fashion.
      I'm wishing I also get to using it sooner or later alike Python and Julia, although the latter two from what I've gathered are probably going to come first for me. Doesn't change the fact that it's a language to take notes from nonetheless.

  • @BlueIsLeet
    @BlueIsLeet Місяць тому +20

    Use the KISS approach when creating your regexes and you'll be fine. For anything extremely complicated, use a parser library.

    • @mage3690
      @mage3690 Місяць тому +5

      Or just parse. It's not that hard, really.

    • @AnimeGIFfy
      @AnimeGIFfy Місяць тому +1

      Using regex is like the opposite of KISS. Parsing is as simple as you can get

    • @bensalemi7783
      @bensalemi7783 25 днів тому +1

      Some things are problems that need solutions. Some things are solutions that are looking for a problem.
      RegExes are problems trying to convince you they are a solution.

    • @PauldeVrieze
      @PauldeVrieze 11 днів тому

      Non-regular doesn't mean complex. Even math expressions with parentheses are not regular.

  • @generessler6282
    @generessler6282 13 днів тому

    Excellent rundown. I've been using RE2 for years. You might have mentioned that a Thompson non-deterministic automaton can be converted to a deterministic one, where at most one state is active at a time. See "subset construction." This is what lex/flex and similar tools do. Run time is linear in the input length. Zero penalty for number of states. Of course there's no free lunch. Pathologically bad regex'es can yield minimum machines (yes the algorithm can yield the unique minimum-state machine) still exponential in regex size. But unless you are doing something silly like computing regexes on the fly, this is easily caught at compile time. There exist libraries that provide such recognizers (vice flex et al that generate code).

  • @JackSlinger10
    @JackSlinger10 Місяць тому +2

    Really interesting stuff, thank you for sharing.

  • @PatFarrellKTM
    @PatFarrellKTM Місяць тому +1

    I want to meet Kay Lack Turing Award Winner. I'll wait a bit. So much fun watching early 'go' videos about how Russ Cox, Rob Pike, Ken Thompson and Rob Griessimer used their decades of experience. Plus lots more refugees from Bell Labs....

  • @HerbieBancock
    @HerbieBancock Місяць тому +2

    "I object to binary searches because I'm non-binary" -Pasty Zoomer

  • @Jalae
    @Jalae Місяць тому +6

    seems like plan9 is the branch of reality where ken thompson and ross cox fixed their regexps. I was confused why it was always greedy and didn't have back references.

  • @bjorgemeulemeester1398
    @bjorgemeulemeester1398 Місяць тому +3

    Stumbled here somehow. Quality of production and depth of content are absolutely unmatched.

  • @NithinJune
    @NithinJune Місяць тому +11

    yikes i think i missed some prerequisites

  • @yoctometric
    @yoctometric Місяць тому +2

    Thank you so much for producing this kind of content! Seconding those who enjoyed the transition to the html srackoverflow post. You explain things very well

  • @Leadbraw
    @Leadbraw Місяць тому +1

    keep up the great work

  • @benarcher372
    @benarcher372 Місяць тому +1

    Great video! I liked very much "... from somebody called Rob Pike..." 🙂I'm sure he doesn't mind!

  • @brandonyong301
    @brandonyong301 Місяць тому

    Cool gem of a channel

  • @doomcake2020
    @doomcake2020 15 днів тому

    This is fantastic!

  • @nik90001
    @nik90001 Місяць тому

    I think it's pretty common in search stuff to evaluate regexes as finite automata. This lets them intersect the automata with the search index to iterate all marching strings. That's nice because you can OR together the list of matching documents for all strings and get the list of matching documents for the whole regex.
    Also interesting, there is another sort of attack. Usually these algorithms want a DFA but regex make an NFA. Converting between the two *can* use a ton of memory. I accidentally crashed search on Wikipedia using it. That was exciting.

  • @mrflipmrflip
    @mrflipmrflip 12 днів тому

    One appealing use of regexes: they're terse and (if you follow house conventions) readable. If I have to gate on a string enum, I can write a case statement spread out over lines, or say `if (str == this or str == that or ....` to column 140..., or construct an array or a set and then test on it. But the expression ` if (str ~= /^(this|that|the other)/)` is readable and it's easy to reason about its impact (test against the others if you'll put it in a tight loop, otherwise whatevs)

  • @checkmate080
    @checkmate080 Місяць тому +1

    video got me to start using this option in C#:
    [GeneratedRegex("[^a-zA-Z0-9]", RegexOptions.NonBacktracking)]
    from microsoft docs:
    Enable matching using an approach that avoids backtracking and guarantees linear-time processing in the length of the input.

  • @remrevo3944
    @remrevo3944 5 днів тому

    You can definitely generate text from regexes with backreferences.
    The actual hard thing is to generate is text that *doesn't* (as in can't ever) match the regex.

  • @catcatcatcatcatcatcatcatcatca
    @catcatcatcatcatcatcatcatcatca 11 днів тому

    From now on whenever I use wildcards both around and inside a capture group within the interactive regex substitution feature of my text editor, I’ll make sure to thank my modern CPU for not only tolerating my expressions, but previewing them while I type.

  • @hugmynutus
    @hugmynutus Місяць тому +2

    14:30 appreciate this story. I never realized it was Stream Ed, which is cool as I use sed a lot.

  • @hanpham5790
    @hanpham5790 Місяць тому

    I love your videos and contents so much Kay!! :^)

  • @ThePlodger
    @ThePlodger 12 днів тому +1

    I do actually understand regexes, but to get to that point required implementing a couple of regex engines from scratch...

  • @ms-fk6eb
    @ms-fk6eb Місяць тому +18

    I hate this, the kind of "yes, but actually no" type stuff that, like the SO answer, consumes all that is pure and beautiful

    • @neoeno4242
      @neoeno4242  Місяць тому +3

      would you mind saying a little more about this @ms-fk6eb?

    • @logickedmazimoon6001
      @logickedmazimoon6001 Місяць тому +3

      nothing is black and white, nothing is known for sure, our concept of objective is how repeatable something is in one or more cases but there will always be one or more cases that repeatable thing is not... repeatable. When we say "yes" in response to a "does this happen?" actually means, "More than likely yes"

    • @Criz454
      @Criz454 Місяць тому +2

      @@logickedmazimoon6001 things literally are known for sure but ok

    • @logickedmazimoon6001
      @logickedmazimoon6001 Місяць тому +1

      @@Criz454 In what sense?

    • @wumi2419
      @wumi2419 Місяць тому +1

      ​@@Criz454no. Everything has it's limits of applicability. Any physical model breaks apart or is too complicated to be computed. Sometimes both.

  • @fugoogle_was_already_taken
    @fugoogle_was_already_taken Місяць тому +4

    Why are nondetermenistic automata not compiled to deterministic ones? Wouldnt it solve the algorithmic complexity?

    • @imacds
      @imacds Місяць тому +4

      I haven't done the math so I am not sure about this, but my guess would be that the resulting deterministic automata would be the same as traversing with backtracking through the nondeterministic one. The memory consumption of a deterministic automata can also grow quite quickly, so it would likely let you write OutOfMemory dos attacks.

    • @Vaaaaadim
      @Vaaaaadim Місяць тому +10

      Given an NFA you can make a DFA that is equivalent, however the number of states in the DFA may be exponentially large.

    • @schwingedeshaehers
      @schwingedeshaehers Місяць тому

      how can it grow quickly? you just need to store the state and the rest string​@@imacds

    • @JordanManfrey
      @JordanManfrey Місяць тому

      @@Vaaaaadimyeah this, it just sets off the “I’m creating a monster” alarm bells so you don’t do it

    • @moarjank
      @moarjank Місяць тому

      It's mathematically impossible, as far as we're aware

  • @NillKitty
    @NillKitty Місяць тому +1

    I love your hair and your accent. But to echo others, how do you do your animations?

  • @FROZENbender
    @FROZENbender 19 днів тому

    yeeah this is why I avoid regexes in most languages. as soon as they break the limits of their chomsky hierarchy I can't seem to wrap my head around them. they turn into this brutally terse unreadable embedded language that I find impossible to maintain.
    great explanations in this video. I was under the assumption that regex parsing already has exponential time complexity but learning that that is only actually tied to the "bonus" features that are the very reason I avoid regex was enlightening!

  • @blacklistnr1
    @blacklistnr1 26 днів тому

    regex defaults always seemed weird to me, the usual case is N serial chunks of: start parsing when x, stop parsing when y, check that what is captured along the way follows some rules.
    I feel like this idea could be expressed so much better if regex didn't try to be so smart, it would also make the runtime faster while avoiding the troublesome edge cases.
    I'd be really interested to find out if there's a legitimate production use case for the smarter regexes (even the 'a repeated pattern from start to end' one you mentioned seems far-fetched to me)

  • @neilsvedberg
    @neilsvedberg Місяць тому

    I think raku's approach to more verbose regexes is a step in the right direction.

  • @LeeSmith-cf1vo
    @LeeSmith-cf1vo 11 днів тому

    It seems that the choice of algorithm is a trade-off between CPU and memory.
    While the backtracking algorithm may have a worst-case performance problem (i.e. catastrophic backtracking) , the other algorithm very likely has a worst-case memory problem (i.e. too many concurrent states)
    It also seems to me that its easier to accidentally create memory-eating regexes with the 2nd algorithm (e.g. /.*/) than it is to accidentally create a cpu-starving regex with the 1st.
    So yes, even if you ignore the possible feature set, there are still pros and cons.

  • @sommanker
    @sommanker Місяць тому +1

    Ehat about lookaheads and lookbehinds?

  • @MaxG628
    @MaxG628 28 днів тому

    Setting aside the maintenance of two code paths, it seems like a regex library should be able to detect if back references are used, and if not, use the more efficient algorithm. Are there any of that do that?

  • @Sandromatic
    @Sandromatic Місяць тому

    It seems like a good idea to do lockstep by default and then go to backtracking if a back reference is included in the regex? Most of the time you don't use back references, but if you do it's very useful, but obviously then causes the inability to use the efficient algorithm. Since regexes are usually compiled ahead of time, the algorithm can be chosen ahead of time depending on if back references are included. Just include a warning in your documentation that backrefs can cause major slowdown.

  • @fugoogle_was_already_taken
    @fugoogle_was_already_taken Місяць тому

    Did not know about redos, thx

  • @maxmustermann5590
    @maxmustermann5590 Місяць тому +2

    I love you kay

  • @Lantalia
    @Lantalia Місяць тому

    The presence of backreferences are a signpost that you have moved beyond the lexer and should be writing a parser

  • @Il_Dilettante
    @Il_Dilettante Місяць тому +1

    I'm a feynman bro except instead of feynman it's regex
    Thank God for RegEx

  • @dungeonrobot
    @dungeonrobot Місяць тому

    Great video! Not sure how it’s only at 1k views.
    Very disappointed to discover after this video that the game engine I was using only supports PCRE2

  • @brendand3765
    @brendand3765 Місяць тому

    Is there a way to analyze regexp to use the backtracking algorithm only when back references are used ? It doesn't to seem to be used that often.

  • @peterc-s6423
    @peterc-s6423 Місяць тому

    1. use a regex that matches if a regex causes catastrophic backtracking
    2. if it doesnt, match to see if it uses backreferences
    3. if it uses backreferences, use backtracking
    4. if it doesnt use backreferences, use lockstep

  • @Flourish38
    @Flourish38 Місяць тому

    Being unable to negate generation in the same way you can negate recognition is reminiscent of how SAT is in NP, but UNSAT is in EXP because there’s no way to read a sufficient proof in non-exponential time.

  • @vitalyl1327
    @vitalyl1327 10 днів тому

    A simple rule: every time you think you should use regular expressions, you'd better use PEG instead.

  • @Lucas-pj9ns
    @Lucas-pj9ns Місяць тому

    thanks for the interesting video! i used to hate regex cause it looked like arcane magic but the theory is so interesting. I've been trying to figure out how to implement capturing groups and I'm a bit stuck. I have some ideas about storing more state and I've optimized it so that it only takes O(capture groups^2) more memory but I don't know if that's good or not. I skimmed Russ Cox's papers you linked and I can't seem to find the part about implementing capture groups. Could you point me to which section talks about capture groups or some other link about that?

    • @neoeno4242
      @neoeno4242  Місяць тому

      Hi Lucas! This is the one in which he talks about it: swtch.com/~rsc/regexp/regexp2.html - roughly the idea is to treat each active state as a ‘thread’ and then maintain a list of all the unique threads active. You don’t have to go the whole way implementing a virtual machine to get something to work but it’s an interesting exercise. If that doesn’t help though drop me an email at hello@0de5.net and I can share some code.

  • @rogo7330
    @rogo7330 Місяць тому +1

    I just avoid regexes as first thing when I parse something. If you ever thought to parse XML *or* HTML (they are quite not the same and I hate it) with regexes, consider to read specs on them, realise that people who wrote all that crap are just sociopaths and hate you, and go back to just splitting file and writing your automata. At least in that case you will find a bug right away or find it more easielly, almost always you will know what parts of the spec you are ignored, and in best case it will be much simpler and faster since you can say «nagh, I know I parsing this xml from that program, I just gonna be sure to preserve information that program produces and just skip parsing of the XML all-together».

  • @i_am_lambda
    @i_am_lambda Місяць тому

    Why not scan the regex for back references and use lockstep if none exist?

    • @rngwrldngnr
      @rngwrldngnr Місяць тому +2

      Some tools probably do, but the specific website-taking-down cases were all backreferences related, so it wouldn't help with the really icky edge cases.

  • @mike94560
    @mike94560 8 днів тому

    This is all well and good. However not all languages implement REGEX syntax the same way. I used a chart to translate for the language I am writing in.

  • @bensalemi7783
    @bensalemi7783 25 днів тому

    If I wanted to right unreadable code, I'd enter an obfuscated C contest.
    I don't care what the theory is. I don't care how efficient you can make a regEx implementation. If you want to make things as efficient as possible you can always write in assembler, but people don't do that unless they absolutely have to (or for fun) and the reasons are obvious. For those same reasons, try to avoid regEx and realize they are from another time and solved a different problem than the ones we need to solve today.

  • @02052645
    @02052645 Місяць тому +1

    OK, but I need to know: can you parse HTML with a regex?

    • @stephenJpollei
      @stephenJpollei Місяць тому +3

      I think at best you can do a poor-man's tokenizer somewhat sanely using regexps. The whole grammar includes full recursion etc so no you can't properly parse arbitrary html just using a regexp. You can scrape stuff from html that follows known pattern or template but that can be extremely fragile when even minor changes to the format is done.

  • @zweitekonto9654
    @zweitekonto9654 Місяць тому +5

    what's a regular language tho. I don't think you ever got into the definitions in the series.

    • @drdca8263
      @drdca8263 Місяць тому

      A regular language is one that can be recognized by a finite state automaton

  • @rich_in_paradise
    @rich_in_paradise Місяць тому

    Maybe because I learnt compilers (Dragon book, yay!) before I programmed in a language (or used an OS) that had regexes, I always thought people coverted their NFAs into DFAs before recognition.

  • @mysteryman7877
    @mysteryman7877 Місяць тому

    Wait, do you issue regex licenses?

  • @ricardojunior4334
    @ricardojunior4334 27 днів тому

    "Someone called Rob Pike"

  • @JohnSmall314
    @JohnSmall314 Місяць тому

    Has anyone done the quantum computation version of this?

  • @caniggiasyabil470
    @caniggiasyabil470 Місяць тому +4

    bro you are GORGEOUS omg

    • @denisvolin7489
      @denisvolin7489 9 днів тому

      Oh, I was struggling to identify that ... whatever it is... Thank you, for clearing that it is he.

    • @caniggiasyabil470
      @caniggiasyabil470 8 днів тому

      @@denisvolin7489 What do you mean by that? If you're here to stir up some tiny discourse, stop it and take it elsewhere. The dude, as I perceived, clearly puts effort into his appearance, beautiful hair, radiant skin, and a cute pink sweater. That alone deserves appreciation.

  • @jammin023
    @jammin023 Місяць тому +1

    You keep saying "in big-O notation..." but then giving the time complexity's *name* rather than its big-O notation... e.g. quadratic time would be O(n²)

  • @viniciusrolandcrisci272
    @viniciusrolandcrisci272 Місяць тому +6

    how do you create your animations?

  • @v0id_d3m0n
    @v0id_d3m0n Місяць тому

    That stackoverflow answer is hilarious. Thanks for sharing xD

  • @studogYT
    @studogYT Місяць тому +2

    Perl predates Linux by 3 or 4 years.

  • @Klosterhasi
    @Klosterhasi Місяць тому

    10:25
    no way its actually been there the whole time holy shit

  • @disasterarea9341
    @disasterarea9341 12 днів тому

    damn i built a process at work that uses dynamically generated regexes for data processing but im ngl i have no idea whats going on in this video lol. doesnt make any sense to me apart from the O(n) stuff, i come from a maths background + learning programming on the job, not compsci

  • @bellaF
    @bellaF 20 днів тому

    i agree

  • @omegahaxors9-11
    @omegahaxors9-11 Місяць тому

    I still don't know what a regex is.

    • @williamdrum9899
      @williamdrum9899 Місяць тому

      Regular Expression. It's a computer program that lets you tell if a piece if text fits a common format, such as an email address, a phone number, etc. Theoretically it works with any kind of text pattern. The downside is that it looks nonsensical to a human reader

  • @Sub0x-x40
    @Sub0x-x40 Місяць тому

    thank the Gods for gpt doing all my regex when I need it

  • @gasun1274
    @gasun1274 Місяць тому +1

    r u enby

  • @tommihommi1
    @tommihommi1 Місяць тому +3

    most things done with regexes should instead be done with plain, readable, testable, debuggable code.

  • @randomchannel-px6ho
    @randomchannel-px6ho Місяць тому

    How they got catastrophic xD

  • @stupidteous
    @stupidteous Місяць тому

    u look like finlay christie

  • @Tobiky
    @Tobiky Місяць тому

    banger

  • @ducemano
    @ducemano Місяць тому +2

    Love me some diversity in the mathmetics field

  • @killerwrox
    @killerwrox Місяць тому

    Why do you know so much about regex?

  • @yxyk-fr
    @yxyk-fr Місяць тому +1

    0:35 yeah let's hack !!!

  • @leeschumacher8285
    @leeschumacher8285 Місяць тому

    G … N … U ?!?