How to Build ML Solutions (w/ Python Code Walkthrough)
- Published Jun 6, 2024
- 👉 More on Full Stack Data Science: • Full Stack Data Science
This is the 4th video in a series on Full Stack Data Science. Here, I explain why experimentation is critical to the ML lifecycle and walk through the development of a semantic search tool for my YouTube videos.
More Resources:
💻 Example Code: github.com/ShawhinT/UA-cam-B...
🤖 RAG: • How to Improve LLMs wi...
📚Text Embeddings: • Text Embeddings, Class...
References:
[1] / software-2-0
[2] arxiv.org/abs/2012.07919
--
Book a call: calendly.com/shawhintalebi
Homepage: shawhintalebi.com/
Socials
/ shawhin
/ shawhintalebi
/ shawhint
/ shawhintalebi
The Data Entrepreneurs
🎥 UA-cam: / @thedataentrepreneurs
👉 Discord: / discord
📰 Medium: / the-data
📅 Events: lu.ma/tde
🗞️ Newsletter: the-data-entrepreneurs.ck.pag...
Support ❤️
www.buymeacoffee.com/shawhint
Introduction - 0:00
Why ML is Different - 0:39
Role of Experimentation - 3:04
Semantic Search (Design Choices) - 5:09
Example Code: Semantic Search of YT Videos - 8:17
Preview of Final Product - 10:06
Step 1: Experimentation & Evaluation - 11:17
Step 2: Build Video Index - 34:14
Step 3: Build UI - 35:49
What's Next? - 43:43
More on Full Stack Data Science 👇
👉 Series Playlist: ua-cam.com/play/PLz-ep5RbHosWmAt-AMK0MBgh3GeSvbCmL.html
💻 Example Code: github.com/ShawhinT/UA-cam-Blog/tree/main/full-stack-data-science/data-science
Brilliant, thanks
Great video, really interesting.
A question on the encoding process. Does condensing transcripts into an embedding with 384 dimensions lose much information, or does the encoding process truncate the text at a point?
How would something like this manage a lengthy transcript where you cover several different topics?
Does the embedding get too "noisy" in that case to be able to really stand above your threshold if only perhaps 5 lines out of 100 contain the information relating to the search?
That's a great question. Whether (much) information is lost depends on the specific use case. For example, if your text chunks simply say "True" or "False", then even a one-dimensional embedding preserves all the information. However, as you're describing, the longer the chunks, the more information can be lost. This is why experimentation is so critical: you can't really know 1) how much "information" an embedding preserves and 2) how that impacts your use case, without just trying it out.
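One common workaround for the "5 relevant lines out of 100" problem is to embed overlapping chunks and score per chunk, so a buried passage isn't diluted into a single transcript-level vector. Below is a minimal sketch of that idea. To keep it self-contained, the `embed` function here is a stand-in (a deterministic bag-of-words hash projection), not a real sentence encoder; it only mimics the shape of a 384-dimensional, unit-normalized embedding. The chunk sizes and the toy transcript are illustrative, not from the video.

```python
import zlib
import numpy as np

def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping word windows, so a relevant
    passage isn't split invisibly across a chunk boundary."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text, dim=384):
    """Stand-in encoder: deterministic bag-of-words hash projection.
    NOT semantically meaningful -- it only mimics the output shape
    (384-dim, unit-normalized) of a real sentence-embedding model."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def best_chunk_score(query, chunks):
    """Cosine similarity of the query against each chunk; the
    embeddings are unit vectors, so the dot product is the cosine."""
    q = embed(query)
    return max(float(q @ embed(c)) for c in chunks)

# A long "transcript": 200 filler words with one relevant sentence buried inside.
transcript = " ".join(
    [f"filler{i}" for i in range(100)]
    + "semantic search with text embeddings of video transcripts".split()
    + [f"filler{i}" for i in range(100, 200)]
)

query = "semantic search text embeddings"
whole_score = float(embed(query) @ embed(transcript))  # one vector for the whole transcript
chunked_score = best_chunk_score(query, chunk_text(transcript))
# The buried passage scores higher per chunk than when diluted into one vector.
```

In practice you would swap `embed` for a real model, e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode` (one common choice of 384-dimensional encoder, though I can't confirm it's the one used here). Such models also silently truncate input beyond their maximum sequence length, which is another reason long transcripts benefit from chunking rather than being embedded whole.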