Text Embeddings Reveal (Almost) As Much As Text

  • Published 6 Jun 2024
  • This paper outlines how, under certain circumstances, text embeddings can be used to reconstruct the original embedded text. (A rough sketch of the iterative inversion idea appears just before the comments.)
    OUTLINE:
    0:00 - Intro
    6:50 - Vec2Text: Iterative Embedding Inversion
    12:20 - How to train this?
    21:20 - Experimental results
    26:10 - How can we prevent this?
    31:20 - Some thoughts on sequence lengths
    Paper: arxiv.org/abs/2310.06816
    Abstract:
    How much private information do text embeddings reveal about the original text? We investigate the problem of embedding *inversion*, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub.
    Authors: John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology
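  • A rough sketch of the paper's iterative inversion loop described in the abstract above (hypothetical Python; embed, hypothesizer, and corrector are stand-ins for the embedding model, the initial-hypothesis model, and the trained correction model, not the authors' actual code):

        import numpy as np

        def invert_embedding(target_emb, embed, hypothesizer, corrector, n_steps=50):
            """Iteratively refine a text hypothesis until its embedding approaches target_emb."""
            # Step 0: a first guess conditioned only on the target embedding.
            text = hypothesizer(target_emb)
            for _ in range(n_steps):
                current_emb = embed(text)
                if np.allclose(current_emb, target_emb):
                    break  # re-embedding matches the target: the hypothesis likely recovered the text
                # The corrector sees the current guess, its embedding, and the target,
                # and proposes an edited text whose embedding should land closer.
                text = corrector(text, current_emb, target_emb)
            return text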

COMMENTS • 93

  • @AlignmentLabAI
    @AlignmentLabAI 6 months ago +103

    What's hilarious is that this is a revelation despite the point of text embeddings being to represent the text as perfectly as possible.

    • @terjeoseberg990
      @terjeoseberg990 6 months ago +1

      Exactly.

    • @scign
      @scign 6 months ago +18

      I bet this paper comes from the same authors as the hit classic "A blurry representation of a photo reveals (almost) as much as the original photo"...

    • @amanbansal82
      @amanbansal82 5 months ago +2

      Ikr, I genuinely don't understand why this topic was studied.

    • @donnychan1999
      @donnychan1999 5 months ago +2

      I think it depends on the model. For IR systems, the point of embeddings is to map queries and corresponding passages to nearby locations; that does not necessarily have to be invertible.

    • @seadude
      @seadude 5 months ago +1

      @amanbansal82 Agreed. Seems like the authors applied pattern recognition to embeddings to determine the plaintext inputs. It's *cool*, but isn't that the same process used whenever trying to determine the input for any encoded text where the algorithm is unknown? What would be truly paper-worthy is if the authors reverse-engineered the embedding model algorithm using only the vector output! (I didn't read the paper, just armchair quarterbacking here)

  • @lobiqpidol818
    @lobiqpidol818 6 months ago +126

    Wow it's almost as if text embeddings are some sort of mathematical representation of the text they are based on.

    • @terjeoseberg990
      @terjeoseberg990 6 months ago +8

      LOL

    • @sunnohh
      @sunnohh 5 months ago +3

      Almost as if this were a parlor trick or something😂

    • @yakmage8085
      @yakmage8085 5 months ago +9

      Yeah, it's not a huge revelation, sure, but the work should be applauded for its results, if only as a benchmark paper. 50 steps, 32 tokens, 92%. Awesome, thanks

  • @agenticmark
    @agenticmark 6 months ago +8

    LOVE your videos! You make these papers come alive.

  • @mohammadxahid5984
    @mohammadxahid5984 6 months ago +7

    YK is back with a paper summary. Thank you, YK

  • @jddes
    @jddes 6 months ago +19

    Your paper breakdowns are great, I'd love more

  • @krzysztofwos1856
    @krzysztofwos1856 5 months ago +16

    Could this become a benchmark for embeddings? If you can develop an embedding that preserves the information for longer sequences, it would be a more useful embedding than one that does not.

  • @holthuizenoemoet591
    @holthuizenoemoet591 5 months ago

    I like this kind of research, and it's amazing that it hasn't been done before

  • @chenmarkson7413
    @chenmarkson7413 5 days ago

    27:35 I found it really interesting how the nuance of a text embedding correlates with the highest "spatial frequency" it has, as if information is constructed from an overlay of frequencies, as in a Fourier transform.

  • @avb_fj
    @avb_fj 6 months ago +20

    Lesson: if you can, anonymize your text and remove PII before putting your embeddings into a third party vector db…

    • @bgaRevoker
      @bgaRevoker 5 months ago

      I wonder if scrambling the dimensions (consistently across all vectors) wouldn't preserve the desirable properties (similarity search, distance) while adding a way to obfuscate the initial content. (See the sketch at the end of this thread.)

    • @whatisrokosbasilisk80
      @whatisrokosbasilisk80 5 months ago +1

      @bgaRevoker I guess you'd have to assume that the embedding model and permutation matrix are unknown to an attacker.
      The permutation space is factorial in the length of the vector, so it isn't unthinkable that this would actually be viable encryption.
      However, if your shuffled vector is leaked, it's possible to statistically reconstruct it, especially if the attacker knows key features in your dataset.
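      A minimal sketch of the dimension-scrambling idea (hypothetical NumPy toy, not from the paper): one fixed, secret permutation applied to every stored vector leaves dot products, norms, and therefore cosine similarities and distances unchanged, so similarity search still works on the scrambled vectors.

          import numpy as np

          rng = np.random.default_rng(0)

          # Toy "embeddings": 4 vectors of dimension 8, plus a query.
          embeddings = rng.normal(size=(4, 8))
          query = rng.normal(size=8)

          # One fixed, secret permutation of the dimensions, reused for every vector.
          perm = rng.permutation(8)
          scrambled_embeddings = embeddings[:, perm]
          scrambled_query = query[perm]

          def cosine(a, b):
              return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

          # Similarity scores are identical before and after scrambling,
          # because permuting coordinates preserves dot products and norms.
          original_scores = [cosine(e, query) for e in embeddings]
          scrambled_scores = [cosine(e, scrambled_query) for e in scrambled_embeddings]
          assert np.allclose(original_scores, scrambled_scores)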

  • @SloanMosley
    @SloanMosley 6 months ago +4

    I realised this when I was considering it for search data. When I realised that it was not secure, I wound up setting the metadata to the actual text to save on a database query. But this isn't that worrisome. If you were getting your embeddings from an API, then you have already lost privacy. And if you can host your own embedding model, Chroma DB is easy enough!

  • @HoriaCristescu
    @HoriaCristescu 6 months ago +21

    The text uses 32 × 15 b = 480 b, while the embedding has 1536 × 4 b = 6144 b, which corresponds to 6144 / 15 b ≈ 409 tokens

    • @herp_derpingson
      @herp_derpingson 6 months ago +5

      No, that's an apples-to-oranges comparison. Let's say I have the number array [3,5,7,9,11,...101]; that's 50 numbers which, at one byte each, would take 50 × 1 = 50 bytes.
      However, we can see that the points just lie on the line y = 2x + 1, so all we really need to store are the m and c values; then we can use this function to generate the entire series. So this series can be compressed down to just 2 bytes.
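      A tiny illustration of this point (hypothetical Python, not from the paper or the comment): the whole 50-number series is determined by its slope and intercept, so storing just those two values is enough to regenerate it.

          import numpy as np

          series = np.arange(3, 102, 2)        # [3, 5, 7, ..., 101]: 50 numbers, ~50 bytes as uint8
          x = np.arange(1, 51)
          m, c = np.polyfit(x, series, 1)      # fit y = m*x + c; recovers m = 2, c = 1
          reconstructed = m * x + c            # regenerate the entire series from just (m, c)
          assert np.allclose(reconstructed, series)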

    • @TheRyulord
      @TheRyulord 6 months ago

      Different embedding models produce different sized embeddings but yeah, even small embeddings will usually be larger than the text they embed.

    • @AntoshaPushkin
      @AntoshaPushkin 6 months ago

      Even if we assume only 1 bit of relevant information is stored in each float, it's still a lot: 1536 / 15 ≈ 102 tokens. That's not just a sentence but a short paragraph of text of roughly 3-5 sentences

    • @Restrocket
      @Restrocket 5 months ago

      @herp_derpingson but you have no information about the function. You can't reconstruct this function from just two numbers; it could be any other type of function, like y=2x^(1)

    • @Terszel
      @Terszel 5 months ago

      @Restrocket the amount of information you'd need to distinguish this function from all other possible functions is infinite. In the domain of the function y = mx + b you only need 2 bytes, but in other domains you might need more or less. This is why it's impossible to determine how many parameters are optimal for learning a specific function, since it may not even be possible to represent the data in the domain you chose (e.g. y = a/x^b)

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 6 months ago +11

    Why is this surprising? Vector embeddings consist of a huge amount of data compared to, say, the ASCII byte representation of the text per token, like orders of magnitude more. Embeddings are basically the opposite of compression; they're sort of a maximal representation of the text.

    • @Laszer271
      @Laszer271 5 months ago +1

      Embedding models were never trained to keep a "maximal representation of the text". That's just a characteristic that naturally emerges during training. It's not surprising that we can decode the encoded sentence, but it's interesting to see how much we could decode. The method used is also quite interesting and very simple. You can try to think of what other tasks could benefit from a language model iterating on its own predictions.

  • @jsalsman
    @jsalsman 6 months ago +6

    Are there any open source text embeddings with a context window as large as OpenAI's Ada (2048 tokens)?

  • @makhalid1999
    @makhalid1999 5 months ago +1

    BRO IS BACK WITH REGULAR UPLOADS 🎉🎉🎉

  • @mattanimation
    @mattanimation 6 months ago +1

    radical dude, thanks.

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q 5 months ago +1

    "Text Embeddings Reveal (Almost) As Much As Text"
    Wow, good to know they do the thing we designed them to do. Oh wait, you thought they were some sort of secure representation... you're fired.

  • @ahmadalis1517
    @ahmadalis1517 5 months ago +3

    Hi Yannic, please consider making a video on this paper: "Representation Engineering: A Top-Down Approach to AI Transparency". It is one of the most interesting papers this year!

  • @bentationfunkiloglio
    @bentationfunkiloglio 6 months ago +5

    Really interesting topic. Great video.
    In a way, the embedding acts as lossy compression, kinda-sorta. "Uncompressing" requires one to extract information encoded as statistical relationships... or so I'll claim. :)

    • @andybrice2711
      @andybrice2711 6 months ago +1

      I'm not sure it even is compression though. Isn't it just a transformation? I think each word is a 512-dimensional vector, which is probably about 2048 bytes.

    • @drdca8263
      @drdca8263 6 months ago +1

      @andybrice2711 Depends on the length of the text then, I guess?
      Or...
      Well, a compression algorithm is allowed to make some inputs larger...

    • @present-bk2dh
      @present-bk2dh 5 months ago +2

      @andybrice2711 each word is a 512-dimensional vector, but how much of those 512 dimensions is about the word? You seem to be making a big assumption.

    • @AntoshaPushkin
      @AntoshaPushkin 5 months ago +2

      Umm... the 512-dimensional vector is just an internal representation. If the vocabulary is 32k entries, each token is just 15 bits of information

    • @bentationfunkiloglio
      @bentationfunkiloglio 5 months ago +2

      I'm definitely using the word "compression" very loosely. The research is interesting because the results were surprising. The vectors contain more info than expected and more info than is (perhaps) required. In particular, the surprise was that short text passages could be recovered. Some of the info isn't explicitly encoded in tokens; to extract it, one must know how the data was originally encoded.
      Seems "compression-y" to me.
      This stands in contrast to one-way transformations, where the original input data cannot be extracted post-transform.

  • @StephenRoseDuo
    @StephenRoseDuo 6 months ago +2

    Wait, seriously tho, why is this surprising?

  • @Laszer271
    @Laszer271 5 months ago +1

    Could we make it even dumber? Disregard the model M1 and only train M0. Make M0 do a couple of predictions, take the 2 best, and tell an LLM in which direction it should go from these predictions to predict the real thing (e.g. is the real sentence embedding between the two or closer to one than the other, should we move even further in the direction of one of those predictions, etc.). Make it do many iterations with beam search and see how the result compares to the results from this paper. Could be an interesting experiment and not even very expensive to pull off.
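    A very rough sketch of what that proposal might look like (purely hypothetical; embed, hypothesizer, and llm_refine are placeholder callables, not anything from the paper):

        import numpy as np

        def invert_greedy(target_emb, embed, hypothesizer, llm_refine, n_iters=50, beam=2):
            # Step 0: let the hypothesizer (M0) propose several candidate texts.
            candidates = hypothesizer(target_emb, n_candidates=8)
            for _ in range(n_iters):
                # Keep the `beam` candidates whose re-embeddings are closest to the target.
                best = sorted(candidates,
                              key=lambda t: np.linalg.norm(embed(t) - target_emb))[:beam]
                distances = [float(np.linalg.norm(embed(t) - target_emb)) for t in best]
                # Ask an LLM for new guesses, telling it the best hypotheses and
                # which of them re-embeds closer to the target (the "direction" hint).
                candidates = llm_refine(best, distances)
            return min(candidates, key=lambda t: np.linalg.norm(embed(t) - target_emb))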

  • @baz813
    @baz813 5 months ago

    @yannic I, for one, would be happy to have this Patreon perk of the notes!

  • @ethansmith7608
    @ethansmith7608 5 months ago +5

    UnCLIP showed you could decode image embeddings with high fidelity, and CapDec did the same for text embeddings. Not only does this feel like a plain-sight revelation, but it's been done at least two times that I'm aware of so far.

  • @brandomiranda6703
    @brandomiranda6703 6 months ago +3

    Why is this surprising?

  • @Kram1032
    @Kram1032 5 months ago +1

    Not far into the video, but given that names are often very relevant for how completions ought to go - they may be highly informative concepts - I'm not at all surprised you can invert to get back to names. I'm guessing that, generally speaking, anything highly informative for continuation is going to get a lot of attention and will not be fully abstracted away in the embedding.

  • @PMX
    @PMX 6 months ago +4

    But... vector databases are *supposed* to be able to recover the information; they are often used as auxiliary memory for LLMs, so you can have a smaller model paired with a vector database containing the information you want to query, and the retrieved information is inserted into the LLM context for it to answer based on "stored facts", reducing the chances of hallucinations. Being able to retrieve names, dates, etc. accurately is an intended use, not something unexpected.

    • @endlessvoid7952
      @endlessvoid7952 6 months ago +9

      That's because vector databases store the text alongside the embedding. The embedding is used for search, and then the stored text is returned. The approach in the video recovers the text purely from an embedding.

  • @ChocolateMilkCultLeader
    @ChocolateMilkCultLeader 6 months ago +1

    Another massive win for the noisy input gang

  • @woongda
    @woongda 6 months ago +3

    Use a sentence transformer, much better than Ada, and host your model and vectors in-house to keep them safe.

  • @gr8ape111
    @gr8ape111 6 months ago +8

    Oh no the vectors we trained to compress the text can be decompressed to reconstruct the text!!!
    Anyways...

    • @andybrice2711
      @andybrice2711 6 months ago +7

      WHY IS THE INFORMATION RETRIEVAL MACHINE RETRIEVING THE INFORMATION WE GAVE IT?!?!

    • @gr8ape111
      @gr8ape111 6 months ago

      @andybrice2711 it's a mystery

  • @SamplePerspectiveImporta-hq3ip
    @SamplePerspectiveImporta-hq3ip 5 months ago +1

    This procedure actually kind of reminds me of diffusion.

  • @herp_derpingson
    @herp_derpingson 6 months ago

    Where do you find all these papers?

    • @islandfireballkill
      @islandfireballkill 6 months ago +2

      This paper is on arXiv, which is the de facto standard for AI stuff.

  • @seadude
    @seadude 5 months ago +1

    Do embedding models advertise themselves as cryptographic hash functions? If so, this would be news. If not, protect them as you would plaintext information.

  • @Bengt.Lueers
    @Bengt.Lueers 6 months ago +2

    It's important to know the privacy implications of embedding models: you can optimize the input string for reconstruction of the embedding until you match it, which means you've found the original input.

    • @ekstrapolatoraproksymujacy412
      @ekstrapolatoraproksymujacy412 6 months ago +1

      No, this would only be true if the embeddings were calculated and stored with infinite precision

  • @kevinaud6461
    @kevinaud6461 6 months ago +5

    Have you heard of the game Semantle? It is virtually this exact concept, but you do it as a human.

    • @oncedidactic
      @oncedidactic 5 months ago

      I was thinking the same thing!

  • @drzhi
    @drzhi 6 months ago +7

    ❤️ Amazing content with several valuable takeaways:
    00:00 Text embeddings can reveal almost as much information as the original text.
    02:09 Text embedding inversion can reconstruct the original text.
    06:35 The quality of the initial hypothesis is crucial for text embedding inversion.
    12:00 Editing models are used to refine the text hypothesis.
    14:32 The success of text embedding inversion depends on the absence of collisions in embedding space.
    21:20 Training a model for each step in the inversion process can be simplified.
    21:24 Text embeddings contain a significant amount of information.
    26:10 Adding noise to embeddings can prevent exact text reconstruction (sketched just after this list).
    26:15 The level of noise in embeddings affects reconstruction and retrieval.
    31:20 The length of sequences can impact reconstruction performance.
    35:24 The longer the sequence length, the more difficult it becomes to represent the index of each token.
    Crafted by Notable AI Takeaways.
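    A minimal sketch of the noise defense mentioned at 26:10 (hypothetical NumPy; the noise level and the re-normalization are illustrative choices, not values from the paper): Gaussian noise is added to each embedding before it is stored, trading inversion accuracy against retrieval quality.

        import numpy as np

        def noised_embedding(emb, noise_level, rng=None):
            # Add isotropic Gaussian noise (and re-normalize here for cosine retrieval),
            # so nearest-neighbor search mostly still works while exact inversion gets harder.
            rng = rng or np.random.default_rng()
            noisy = emb + noise_level * rng.normal(size=emb.shape)
            return noisy / np.linalg.norm(noisy)

        rng = np.random.default_rng(0)
        emb = rng.normal(size=1536)
        emb /= np.linalg.norm(emb)
        protected = noised_embedding(emb, noise_level=0.01, rng=rng)
        print(float(emb @ protected))  # cosine similarity stays close to 1 for small noise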

  • @noot2981
    @noot2981 6 months ago +1

    Really interesting, but anyone with a business perspective should have taken this into account anyway. Don't put any truly confidential data into your vector database. That still leaves a huge opportunity for enterprise search. Nice overview though!

    • @whatisrokosbasilisk80
      @whatisrokosbasilisk80 5 months ago

      If I'm running it locally and implement the security controls that I do on my other databases - why would I give af?

  • @alxsmac733
    @alxsmac733 5 months ago

    I truly don't understand how people are surprised by this. Text embeddings are basically the opposite of something like cryptographic hashes.

  • @hasko_not_the_pirate
    @hasko_not_the_pirate 5 months ago +1

    Now I want to know what comes out when you feed it random vectors. Maybe deep insights about humanity are hidden there. 😄

  • @user-yd6mp6vw2c
    @user-yd6mp6vw2c 6 months ago +1

    check information theory

  • @TheMemesofDestruction
    @TheMemesofDestruction 6 months ago +1

    Dank AI Memes Inc. ^.^

  • @andybrice2711
    @andybrice2711 6 months ago +11

    I still don't really understand why this is a surprise. We told a machine to learn a bunch of information. And now we're like "OH MY GOD WHY IS IT TELLING PEOPLE THE INFORMATION??!!!!" Obviously that was going to happen. We never taught it to keep secrets. We never even taught it the notion of secrets.

    • @user-se3zz1pn7m
      @user-se3zz1pn7m 6 months ago

      Totally agree

    • @dinoscheidt
      @dinoscheidt 6 months ago +4

      Yup. It's like you lossily compressed a large image into a JPEG and are surprised you can make out large parts of what the original raw image was. It's literally called an encoder-decoder in LLMs

    • @SiiKiiN
      @SiiKiiN 6 months ago +2

      It seems like there is this false assumption that embeddings revealing data is bad. If you need embeddings but don't want the data to be revealed, just preprocess the text to replace any specific words with generic ones

    • @andybrice2711
      @andybrice2711 6 months ago

      @SiiKiiN I can see why it might be useful to develop an LLM which innately knows how to keep secrets.
      Like, imagine the power and good that could be achieved by using it to find patterns in medical records.
      But yeah, unless you deliberately engineer that in, I don't know why anyone ever expected them not to leak training data.

    • @andybrice2711
      @andybrice2711 6 months ago +1

      @dinoscheidt Are embeddings even compression though? Converting a word into a 512-dimensional vector is surely making it bigger?

  • @hanyanglee9018
    @hanyanglee9018 5 months ago

    Diffusion is all you need.

  • @kmdsummon
    @kmdsummon 6 months ago +1

    Text contains information. Embeddings + model contain partial (or even full) information from the text. I am a bit unsure what is so surprising about discovering that you can invert an embedding back to text... It's kind of like finding out that you can recover 90% of an original PNG image from a JPEG. The method of exactly how they do it is useful to know, though.

  • @imagiro1
    @imagiro1 5 months ago

    I wonder what that means for hashes, as they can be seen as multidimensional vectors derived from a text. Kind of similar to text embeddings.

    • @whatisrokosbasilisk80
      @whatisrokosbasilisk80 5 months ago

      Not even remotely: embeddings preserve information, while hashes are deliberately lossy and designed to be irreversible.

  • @Bluelagoonstudios
    @Bluelagoonstudios 5 months ago

    While this is going on, the EU has already made some rules for AI, but I doubt whether they will be enough. The problem in the EU is that every country has to vote on laws, and there are many countries that are as good as dictatorships. That is a big weakness in its final conclusions. I think we need more EU as a whole, so laws get streamlined more.

  • @SloanMosley
    @SloanMosley 6 months ago +1

    I’ve said this for so long

  • @JoshBuckm
    @JoshBuckm 5 months ago

    Seems like this could be a nice product if used to reverse engineer prompts used to generate social media posts.

  • @jatinkashyap1491
    @jatinkashyap1491 5 months ago +1

    And here's poor me, believing all those years that it was common sense 🙂

  • @servrcube6932
    @servrcube6932 6 months ago +1

    second