
Rust Web Development (Search Engine Ep.02)

  • Published Aug 18, 2024
  • References:
    - Source Code: github.com/tso...
    Timestamps:
    00:00:00 - Intro
    00:00:42 - TF-IDF Recap
    00:02:41 - What we are developing
    00:03:10 - Web UI
    00:03:29 - Syncthing
    00:04:18 - About Electron
    00:04:41 - What I wanted to do today
    00:05:03 - RuSt Is ThE mOsT lIkEd LaNgUaGe
    00:05:34 - Subs
    00:06:21 - Picking the Web Framework
    00:07:46 - tiny-http
    00:09:54 - Adding tiny-http to the dependencies
    00:11:04 - Introducing subcommands
    00:13:08 - Recursive indexing
    00:15:19 - Subs
    00:15:29 - Plans to replace JSON with Sqlite
    00:16:14 - Creating a new subcommand
    00:17:41 - English is weird
    00:18:53 - Starting up the Server
    00:21:08 - Handling incoming Requests
    00:24:39 - The Power of Simplicity
    00:25:32 - Serving HTML
    00:26:07 - Google y r u so bad?
    00:27:21 - Setting up correct Content-Type
    00:31:03 - First Try God Cooder dab-dab-dab-dab-dab
    00:31:48 - Please don't take me seriously
    00:32:50 - Designing the query form
    00:33:38 - Unbaking the HTML
    00:36:46 - How I handle errors these days
    00:40:36 - tiny-http Response ownership approach
    00:41:14 - Finishing HTML unbaking and fixing compilation errors
    00:43:25 - Changed the shirt and ready to implement the router
    00:47:09 - Rust developers are Java developers in disguise
    00:47:52 - Continue implementing router
    00:49:26 - Easy route aliases
    00:49:51 - Adding index.js
    00:53:06 - Factoring out static files serving
    00:55:41 - Why do I copy-paste code so much
    00:56:04 - Separation of different method
    00:56:48 - Factoring out serving 404
    00:57:22 - Rambling about code aesthetics
    00:57:47 - Going through compilation errors
    00:58:29 - Our small Web Framework
    00:58:58 - Subs
    00:59:26 - REST API for Search
    01:00:05 - Getting the Body of HTTP Request
    01:01:54 - Reading everything from Reader
    01:03:03 - Struggling to convert bytes to string 'cause Rust
    01:06:33 - Fixing compilation errors
    01:07:36 - JavaScript fetch()
    01:10:27 - Got some data from the Client!
    01:10:52 - Refactoring JavaScript code
    01:11:17 - Designing the search query format
    01:12:33 - Being Web Developer is hard!
    01:13:18 - Tokenizing Search Query
    01:14:48 - Tokenizer must convert characters to upper case
    01:15:46 - Subs
    01:16:01 - Compiler Assisted Refactoring
    01:18:44 - Tokenizer can handle punctuation
    01:19:12 - Loading up the Document Index
    01:21:26 - TF-IDF recap
    01:23:04 - Implementing tf computation
    01:29:11 - Constructing a smaller set of documents for testing
    01:31:11 - Computing tf for each document
    01:35:31 - Sorting documents by TF
    01:37:12 - Rustaceans are scared of floats lol
    01:39:32 - TF reflects the relevancy!
    01:40:08 - Studying IDF
    01:42:03 - Rediscovering logarithmic scale
    01:44:20 - Math is like Programming from alternative universe
    01:44:56 - Implementing IDF computation
    01:46:37 - Suskell
    01:49:30 - Combining TF and IDF
    01:50:39 - Rustaceans were right! Floats are scary!
    01:51:51 - Negative IDF bug
    01:56:23 - Is that a mistake in Wikipedia?!
    01:57:56 - Fixing the denominator adjustment
    01:59:37 - TF-IDF works!
    02:00:43 - Computing final ranking
    02:02:26 - Testing on bigger data
    02:06:37 - Trying other queries
    02:08:23 - We need stemming
    02:09:18 - The importance of owning your data
    02:10:02 - What is stemming
    02:11:02 - Performance sucks and I don't know why
    02:12:14 - UI/UX improvements ideas
    02:13:39 - --release
    02:15:23 - Outro
    02:15:33 - Smooch

COMMENTS • 74

  • @dromedda6810
    @dromedda6810 1 year ago +19

    as a fancy overcomplex web developer, it feels good to finally be able to understand what tsoding is glabbering on about

  • @anafabula
    @anafabula 1 year ago +14

    17:38 "cargo tree" shows a tree of all dependencies

  • @lamprospitsillou6325
    @lamprospitsillou6325 1 year ago +18

    Thank you so much for indexing the streams! Must have been a lot of work

  • @dimitardimitrov3421
    @dimitardimitrov3421 1 year ago +8

    What an incredible series! You explain things very well and work on cool projects! Here is me hoping for more Rust streams! Keep up the good work!

  • @jacobpolicano7289
    @jacobpolicano7289 9 months ago +5

    I think maybe some of the slowness is from calling idf() for every single document at 1:49:45 instead of just computing it upfront for the given tokens. Doesn't that blow up to exponential? Love your content! Just subscribed :)
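
A minimal sketch of the hoisting suggested here, assuming the index is a map from document name to per-term counts (the `Index` and `TermFreq` aliases and the `rank` function are hypothetical names, not taken from the stream's source). If `idf` scans every document, calling it inside the per-document loop costs roughly Q·D² work for Q query tokens and D documents, which is quadratic rather than exponential; computing it once per token brings that down to about Q·D.

```rust
use std::collections::HashMap;

// Hypothetical index shape: document name -> (term -> occurrence count).
type TermFreq = HashMap<String, usize>;
type Index = HashMap<String, TermFreq>;

/// Term frequency: the share of the document's tokens that are `term`.
fn tf(term: &str, doc: &TermFreq) -> f32 {
    let total: usize = doc.values().sum();
    if total == 0 {
        return 0.0;
    }
    *doc.get(term).unwrap_or(&0) as f32 / total as f32
}

/// Inverse document frequency. It scans every document, so each call is O(D).
fn idf(term: &str, index: &Index) -> f32 {
    let n = index.len() as f32;
    let m = index.values().filter(|doc| doc.contains_key(term)).count() as f32;
    // The max(1.0) guard against division by zero is an assumption of this sketch.
    (n / m.max(1.0)).log10()
}

/// Ranks documents, computing idf once per query token instead of once per document.
fn rank(query: &[&str], index: &Index) -> Vec<(String, f32)> {
    let idfs: Vec<f32> = query.iter().map(|t| idf(t, index)).collect();
    let mut result: Vec<(String, f32)> = index
        .iter()
        .map(|(name, doc)| {
            let score: f32 = query.iter().zip(&idfs).map(|(t, w)| tf(t, doc) * w).sum();
            (name.clone(), score)
        })
        .collect();
    result.sort_by(|a, b| b.1.total_cmp(&a.1));
    result
}

fn main() {
    let mut index = Index::new();
    index.insert("a".to_string(), TermFreq::from([("rust".to_string(), 2), ("web".to_string(), 2)]));
    index.insert("b".to_string(), TermFreq::from([("rust".to_string(), 1), ("c".to_string(), 3)]));
    index.insert("c".to_string(), TermFreq::from([("c".to_string(), 4)]));
    for (name, score) in rank(&["rust"], &index) {
        println!("{name}: {score}");
    }
}
```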

  • @trikynguyen9757
    @trikynguyen9757 8 months ago +2

    you are probably the teacher that everyone wants to have when they start learning programming. Watching you arguing and explaining things is such a great experience.

  • @remrevo3944
    @remrevo3944 1 year ago +24

    I think it might be worth calculating the rankings of the words at the indexing step and then removing values that are below a certain cutoff. That might increase search performance considerably.

  • @1vader
    @1vader 1 year ago +14

    1:39:20 There actually is a total_cmp function which implements the total order predicate specified in the IEEE standard which specifies an order for all floats, even NaNs and Infinity. So you can just do `vec.sort_by(f32::total_cmp)`. Edit: Although watching on, I guess it panicking on NaNs is maybe not a bad idea 😅
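
A short sketch of what this looks like in practice. `f32::total_cmp` is in the standard library (stable since Rust 1.62); the `sort_by_rank` helper and the `(String, f32)` result shape are assumptions made for illustration. The alternative mentioned in the edit, `partial_cmp(..).unwrap()`, panics when a comparison involves NaN, which can be useful for surfacing a broken score early.

```rust
/// Sorts (document, score) pairs from highest to lowest score.
/// `total_cmp` gives every float a position, so no unwrap is needed.
fn sort_by_rank(results: &mut [(String, f32)]) {
    results.sort_by(|(_, a), (_, b)| b.total_cmp(a));
}

fn main() {
    // Plain floats: the function path can be passed directly.
    let mut scores = vec![0.25_f32, 1.5, -3.0, 0.0];
    scores.sort_by(f32::total_cmp);
    println!("{scores:?}");

    // Ranked search results, best match first.
    let mut results = vec![
        ("a.html".to_string(), 0.1_f32),
        ("b.html".to_string(), 0.7),
        ("c.html".to_string(), 0.3),
    ];
    sort_by_rank(&mut results);
    println!("{results:?}");
}
```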

  • @RuslanKovtun
    @RuslanKovtun 1 year ago +11

    To avoid division by zero you could increase both N and M by 1.
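
A sketch of this adjustment, assuming N is the total number of documents and M is the number of documents containing the term (the comment does not define them, so that reading is an assumption, as are the function names). Incrementing both keeps the ratio at or above 1, so the logarithm never goes negative, whereas incrementing only the denominator yields a negative value when every document contains the term.

```rust
/// IDF with both counts incremented by one.
/// `n`: total documents, `m`: documents containing the term (assumed meaning).
fn idf_smoothed(n: usize, m: usize) -> f32 {
    ((n as f32 + 1.0) / (m as f32 + 1.0)).log10()
}

/// For contrast: adjusting only the denominator goes negative when m == n.
fn idf_denominator_only(n: usize, m: usize) -> f32 {
    (n as f32 / (m as f32 + 1.0)).log10()
}

fn main() {
    println!("term in no documents:  {}", idf_smoothed(10, 0));
    println!("term in all documents: {}", idf_smoothed(10, 10));
    println!("denominator-only, term in all documents: {}", idf_denominator_only(10, 10));
}
```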

  • @hedlund
    @hedlund 1 year ago +6

    Oooh! You being you, I'm just assuming Tauri, Actix, et al., won't be the stars of this show. Right or wrong I'm looking forward to watching this tonight :)
    On an unrelated note, I'd just like to thank you, mate. I know I may not look it, but I've found myself picking up the pieces of my mind and trying to stick them back together more times than I can count now. I don't have a single fucking clue why, but you're a tremendous help to that process. So, thank you, very, very much.
    Edit: And on yet another note - thanks for exposing me to Pretzel! Shit's absolutely awesome :)

  • @shekharxparmar
    @shekharxparmar 1 year ago +3

    you're single-handedly making me love rust more even though I already like the language. Watching someone else program and seeing how they look up documentation or solve a problem is amazing. great vid

  • @YOOOOOOOOOOOOOOOOOOOOOOOOOOOO
    @YOOOOOOOOOOOOOOOOOOOOOOOOOOOO Рік тому +71

    How long did this take to time stamp?

  • @xelaxander
    @xelaxander 1 year ago +11

    You can use the "include_str!" macro to directly dump text files as string literals into the source code.

    • @guywald1
      @guywald1 1 year ago +8

      True (and great macro) but then he would lose the ability to change HTML/CSS/JS and simply refresh the web page without rebuilding.

  • @nofeah89
    @nofeah89 1 year ago +1

    I've learnt now simplicity is the key

  • @Mirko_ddd
    @Mirko_ddd 1 year ago +3

    I don't use rust but it is fascinating to see dudes like Alexey do stuff. Congrats

  • @pinchoboo736
    @pinchoboo736 1 year ago +10

    Can you not just precalculate TF-IDF for all tokens or is there something i am missing?

    • @vitfirringur
      @vitfirringur 1 year ago +6

      Yup, you can. You calculate the idf value for each token, then for each document you multiply each token's tf by its idf and store that with the index, or something like that.
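
One way the index-time precomputation described in this thread could look, as a rough sketch. The map shapes and the names `precompute` and `search` are hypothetical; the stream's actual index format is not shown here. Each document's tf-idf weights are computed once, so a query only has to look up and sum stored values.

```rust
use std::collections::HashMap;

// Hypothetical shapes: term -> count per document, and term -> stored tf-idf.
type TermFreq = HashMap<String, usize>;
type Weights = HashMap<String, f32>;

/// Computes tf-idf for every (document, term) pair once, at indexing time.
fn precompute(index: &HashMap<String, TermFreq>) -> HashMap<String, Weights> {
    let n = index.len() as f32;
    // Document frequency: in how many documents each term appears.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in index.values() {
        for term in doc.keys() {
            *df.entry(term.as_str()).or_insert(0) += 1;
        }
    }
    index
        .iter()
        .map(|(name, doc)| {
            let total: usize = doc.values().sum();
            let weights: Weights = doc
                .iter()
                .map(|(term, &count)| {
                    let tf = count as f32 / total as f32;
                    let idf = (n / df[term.as_str()] as f32).log10();
                    (term.clone(), tf * idf)
                })
                .collect();
            (name.clone(), weights)
        })
        .collect()
}

/// Query time: sum the stored weights of the query tokens for each document.
fn search(query: &[&str], weights: &HashMap<String, Weights>) -> Vec<(String, f32)> {
    let mut ranked: Vec<(String, f32)> = weights
        .iter()
        .map(|(name, w)| {
            let score: f32 = query.iter().map(|t| w.get(*t).copied().unwrap_or(0.0)).sum();
            (name.clone(), score)
        })
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}

fn main() {
    let mut index: HashMap<String, TermFreq> = HashMap::new();
    index.insert("a".to_string(), TermFreq::from([("rust".to_string(), 2), ("web".to_string(), 2)]));
    index.insert("b".to_string(), TermFreq::from([("rust".to_string(), 1), ("c".to_string(), 3)]));
    index.insert("c".to_string(), TermFreq::from([("c".to_string(), 4)]));
    let weights = precompute(&index);
    println!("{:?}", search(&["rust", "web"], &weights));
}
```

The trade-off is that the stored weights depend on the whole collection, so adding or removing documents means recomputing them.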

  • @sher1x165
    @sher1x165 1 year ago +5

    Please, can you provide the name of your chair?

  • @TheAmadeus4
    @TheAmadeus4 1 year ago

    Timestamps! Yay, thank you Mr tsoding 😄

  • @bouhaddamohammeddjaoued2381
    @bouhaddamohammeddjaoued2381 11 days ago

    Stringifying a string is a next level problem

  • @norndev
    @norndev 1 year ago

    King of the keyboard shortcut

  • @chrly00
    @chrly00 1 year ago +1

    Nice series, I have been a lurker for some time, but now converted into a subscriber (I am a bit picky about my subscriptions, and beginning to realize that I should be even more "pickier"). I really enjoy your sarcasm, skills, and train of thought. Continue to do what you enjoy, you are brilliant!

  • @MonkeeSage
    @MonkeeSage 1 year ago +2

    47:00 bro you didn't check what is convertible to StatusCode. There is a From impl for all the ints -> StatusCode on that docs page... you could have just written with_status_code(404)

  • @noahwinslow3252
    @noahwinslow3252 1 year ago +1

    The difference between a few and quite a few is the same as a minute to a hot minute

  • @Zielino
    @Zielino 1 year ago +5

    suskell

  • @TheDuerden
    @TheDuerden 1 year ago +2

    Why is your lexer noting punctuation? It isn't really of any value for searching. I need to go watch your lexer video, but my thought on the time is that the work is 1,611 * average number of words/punctuation per doc * 5, which is not 8,055 but somewhere in the region of 400,000 even if there are only 50 words/punctuation per doc, which I think is a really low count for the docs you are indexing. I expect they really run to many hundreds of words. I was planning to go watch the lexer JSON creation now and see whether I am missing something. Great content btw, nothing like this out there anywhere from what I can tell!

  • @ac130kz
    @ac130kz 1 year ago +4

    TCP on localhost is total bloat anyways, unix sockets is the way

  • @inujung8224
    @inujung8224 10 months ago

    1:37:08 i just love watching others' reactions to floats not implementing Ord in rust. oh yea, i've been there too lol.

  • @rupen42
    @rupen42 5 months ago

    I think Wikipedia is just taking the unadjusted formula as the canonical one. Pretty sure it shouldn't matter anyway, since what matters is the ordering. Your adjustment however may break the ordering in some cases, so it's a small deviation from the algorithm.
    Also, in math "log" is assumed to be natural log, but that doesn't matter here (again because we only care about the relative score).

  • @MateHomolya
    @MateHomolya 1 year ago +1

    @25:25 there is a joke I like, when Americans wanted to write in space they designed a zero G ballpoint pen. The Russians used a pencil.

    • @tildessmoo
      @tildessmoo 4 months ago

      That's actually a myth. Both countries used pencils at first and abandoned them as soon as possible, because graphite floating around a spacecraft is a disaster.

  • @rodelias9378
    @rodelias9378 1 year ago

    That was great!! Thanks man

  • @justinpeter5752
    @justinpeter5752 1 year ago +1

    there's no way that the number of documents containing a word could be greater than the total number of documents. There is a bug which counted more documents containing a term than the total number of documents. 1:57:41

  • @rian0xFFF
    @rian0xFFF 1 year ago +3

    Mozilla docs are very good

  • @naplesnola
    @naplesnola 1 year ago

    I love this man 😂

  • @LennyBakkalian
    @LennyBakkalian 1 year ago +1

    1:12:36 Nowadays you usually do this via proper frontend (Angular, React, Svelte etc...) and backend frameworks that do this for you.

    • @jodufan8754
      @jodufan8754 1 year ago +2

      Proper Frontend (Framework) "Angular" KEKW

    • @LennyBakkalian
      @LennyBakkalian 1 year ago

      @@jodufan8754 i rate you as someone who watches videos from influencers who claim that react is the "best" framework and angular sucks, but don't give any reasons. Angular is actually used as much in enterprise applications as react. The only difference with react is that angular tells you how to build apps (which bothers most influencers who have no idea because they've never worked in a company with a large code base). I am annoyed by these clueless people who want to rate a framework without any reason. You probably think of Angular as AngularJS because you listen to the influencers and blindly blurt out their unqualified opinion. One advantage of Angular in my eyes is that the basic functionalities like (Routing, HttpClient, Guards, Auth, SSR, DateFormatting and much much more) are builtin and force the programmers to stick to guidelines instead of building a techstack themselves like with react, which makes it much more difficult to integrate new employees.
      And just because you see a graph somewhere where Angular is "the most hated" framework doesn't mean that this statistic is somehow relevant. e.g. many still reference angular with angularjs or have NEVER worked with angular and still give their opinion. If angular was so hated, there wouldn't be as many companies using it as react. Sure angular has some weak points, but so does every framework.

    • @jodufan8754
      @jodufan8754 1 year ago +1

      @@Tolrias i sadly know how common it is

  • @kawaikaede2269
    @kawaikaede2269 1 year ago +1

    ❤‍🔥

  • @j4n1x19
    @j4n1x19 1 year ago +3

    I do believe that using Log10 for the idf is false here. The regular natural logarithm should be the correct one.

    • @Kartoflaszman
      @Kartoflaszman 1 year ago +10

      I don't think the base matters, it's the nature of the function -- converting a wide range of data into a smaller one -- that is important. So imo the base should be chosen to the one that is the fastest to compute (probably 2)

    • @hypnogri5457
      @hypnogri5457 1 year ago

      @@Kartoflaszman @j4n1x19
      To clarify, we use the logarithm not to scale down the number but to increase the weight of the token based on its "informational content" in the set of documents. This content is proportional to the number of bits needed to locate the token in the dataset.
      To illustrate this, let's use log base 2 because all logarithms differ only by a constant factor. Suppose a token appears in one out of 32 documents. To find it, we would need 5 bits (binary search). We get this number by calculating log2(32/1) = 5. Therefore, we are essentially scaling up the token weights based on their "importance," or more specifically, their "informational content."
      So it's not just a random logarithm that we use to scale down the values. We are using the logarithm because the logarithm is needed for the correct mathematical definition of the information content in Shannon bits (look up information theory on Wikipedia). Of course, the ranking algorithm will probably still perform even without the logarithm, but it wouldn't be a mathematically sound ranking, and it would probably (as you theorized) be overweighting rare tokens. I hope you now understand why the logarithm was used and not some other function that squishes numbers.

    • @mishaerementchouk
      @mishaerementchouk 1 year ago +1

      This tf-idf thing specifies how much information the search terms provide for identifying particular documents. The base of the logarithm (2, e, or 10) just defines in which units we measure this information: bits (shannons), nats, or dits (hartleys). Surely, for some cases, some particular units may be preferable, in the same way that buying jewelry in a store calls for grams, while tons are better for discussing annual jewelry production in the world. Considering that the factors relating bits, nats, and dits are rather moderate, this doesn't really matter.
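
The point made across this thread can be checked numerically. By the change-of-base identity, log_b(x) = ln(x) / ln(b), so for any base greater than 1 the IDF values differ only by a positive constant factor and the resulting order is the same. The sketch below assumes a collection of 32 documents and made-up document counts; the function names are hypothetical.

```rust
/// IDF of a term found in `m` of `n` documents, using logarithm base `base`.
fn idf(n: f32, m: f32, base: f32) -> f32 {
    (n / m).log(base)
}

/// Indices of `doc_counts`, ordered from the rarest term (highest IDF) to the most common.
fn order_by_idf(n: f32, doc_counts: &[f32], base: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..doc_counts.len()).collect();
    order.sort_by(|&a, &b| idf(n, doc_counts[b], base).total_cmp(&idf(n, doc_counts[a], base)));
    order
}

fn main() {
    let n = 32.0;
    // Illustrative data: three terms appearing in 16, 1, and 4 documents.
    let doc_counts = [16.0, 1.0, 4.0];
    for base in [2.0, std::f32::consts::E, 10.0] {
        println!("base {base}: order {:?}", order_by_idf(n, &doc_counts, base));
    }
    println!("log2(32/1) = {}", idf(32.0, 1.0, 2.0));
}
```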

  • @ilovepeaceandplaying8917
    @ilovepeaceandplaying8917 1 year ago

    your videos are amazing, my bad I didn't have time to watch all of it, because of poor programming job

  • @Kniffel101
    @Kniffel101 1 year ago +2

    At one point it sounded like you didn't know how to implement smoothstep. It's a lerp between two parabolas. This here is the most straightforward explanation of it I've seen:
    ua-cam.com/video/60VoL-F-jIQ/v-deo.html
    Maybe it'll help you or someone else reading this at some point! =D
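
A small sketch of the construction described in this comment, under the assumption that the two parabolas are t² (ease-in) and 1 - (1 - t)² (ease-out). Blending them with t as the weight expands to the familiar polynomial 3t² - 2t³. The function names are illustrative only.

```rust
/// Linear interpolation between `a` and `b`.
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a + (b - a) * t
}

/// Smoothstep built as a lerp between an ease-in and an ease-out parabola.
fn smoothstep(t: f32) -> f32 {
    let t = t.clamp(0.0, 1.0);
    let ease_in = t * t;
    let ease_out = 1.0 - (1.0 - t) * (1.0 - t);
    lerp(ease_in, ease_out, t)
}

fn main() {
    for i in 0..=4 {
        let t = i as f32 / 4.0;
        println!("smoothstep({t}) = {}", smoothstep(t));
    }
}
```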

  • @fgdou
    @fgdou 1 year ago

    Log != Log 10
    Because Log = Log base exponential I think

  • @Izzy_ez_
    @Izzy_ez_ 1 year ago

    you're actually funny lmao

  • @polioann
    @polioann 1 year ago

    How to be smart and productive like u?

  • @judahmatende3769
    @judahmatende3769 9 months ago

    rust is a slow language
    you heard it here first

  • @Nodsaibot
    @Nodsaibot 1 year ago +1

    Wonderful watching the low level code wizards struggle with HTML xD

  • @pyMarek
    @pyMarek 1 year ago

    Last

  • @user-pm2ru6ir6n
    @user-pm2ru6ir6n 1 year ago +1

    Rust is terrible. In terms of syntax, and in terms of speed ... I vomit ... C is clean )

    • @GegoXaren
      @GegoXaren 1 year ago +1

      RAmen, Brother.
      Not to mention the micro dependency hell that languages like Rust promote.

    • @dmitriidemenev5258
      @dmitriidemenev5258 1 year ago +9

      Rust's syntax is more complex to express things that can't be expressed in C (e.g. lifetimes). Double colons in paths (such as in core::mem::take) are used in C++ namespaces too. Generics are needed for higher abstractions. Macros are needed for elimination of code repetition.
      Rust is a complex tool that allows you to tackle complex tasks easily. If you have a good library, lots of stuff becomes simple.

    • @dmitriidemenev5258
      @dmitriidemenev5258 1 year ago +4

      @@GegoXaren Small libraries are good because they're more reusable than big ones. Isn't that the philosophy behind Linux CLI tools?

    • @GegoXaren
      @GegoXaren 1 year ago

      @@dmitriidemenev5258
      Stop shilling for a language that, for all functional purposes, does not support dynamic linking, and can't for all functional reasons use system libraries.
      The idea of Cargo is just as flawed as NPM. A package manager that does not actually manage packages, and updating the "libraries" in cargo does not actually mean that any program that uses them is actually updated. It is a flawed system. It is a broken system.
      Not to mention the fact that you require exponentially more RAM for each line of code to compile, because you can, functionally, only do unity builds, and not staged builds

    • @dmitriidemenev5258
      @dmitriidemenev5258 1 year ago

      @@GegoXaren
      > Stop shilling for a language that, for all functional purposes, does not support dynamic linking, and can't for all functional reasons use system libraries.
      "cargo:rustc-link-lib=LIB" does provide a way to link dynamically, together with the #[link] attribute for an extern block. The canonical way of using system dependencies in Rust is `system-deps` or `vcpkg` (whose name now is a misnomer as it supports pkg-config too).
      > The idea of Cargo is just as flawed as NPM. A package manager that does not actually manage packages, and updating the "libraries" in cargo does not actually mean that any program that uses them is actually updated. It is a flawed system. It is a broken system.
      It's a system that does not break the user's code. Whatever compiled once will compile the next time too. Any update comes with a certain risk of breakage and the user should decide whether they want to update their dependency.
      > Not to mention the fact that you require exponentially more RAM for each line of code to compile, because you can, functionally, only do unity builds, and not staged builds
      If by staged builds you mean incremental compilation, rustc does support that. Rustc in general does consume quite a lot of RAM during compilation, yet not for the reason you think it does. Rust's procedural macros are ubiquitous and they make the lives of developers easier because they eliminate boilerplate code. However, they come at a compile cost and a small piece of Rust code can expand 500x. Macro expansion is rarely optimized in Rust because "who cares about compile-time performance anyway?".

  • @karl4813
    @karl4813 1 year ago +2

    What do you tsink about war in Ukraine? And how is it going in Russia in general now with sanctions?

    • @friren_elf
      @friren_elf 1 year ago

      Now it's no coca cola, it's dobryi cola.

    • @karl4813
      @karl4813 1 year ago

      @@friren_elf kurwa