Github - You Can View Deleted Private Fork Data
- Published Oct 21, 2024
- Recorded live on twitch, GET IN
Article
trufflesecurit...
My Stream
/ theprimeagen
Best Way To Support Me
Become a backend engineer. It's my favorite site
boot.dev/?prom...
This is also the best way to support me: support yourself by becoming a better backend engineer.
MY MAIN YT CHANNEL: Has well edited engineering videos
/ theprimeagen
Discord
/ discord
Have something for me to read or react to?: / theprimeagenreact
Kinesis Advantage 360: bit.ly/Prime-K...
Get production ready SQLite with Turso: turso.tech/dee...
While Git is really committed to keeping your stuff, Github seems to be even more committed!
github == microsoft
yes they are very fuckin committed
If you put your stuff in someone else's fridge, you should not be surprised your special yoghurt gets accessed
Isn't committing the whole point of both tools?
@@TheRealBigYang it was just a bit of wordplay
That is probably why they have to delete forks of DMCAd content no matter how well those cleaned up their repositories. Otherwise, a fork can still access the illegal material.
For malicious actors: That could cause some real chaos if DMCA-able content gets pushed to forks of a target repo.
Shall do @@duven60
@@duven60 That's a lotta damage!
@@duven60 Wait, so the top xx repos can just be shut down if someone uploads DMCA-protected content to a fork and deletes it? That is just an attack waiting to happen.
I'll start DMCA'ing my own repos to delete them forever
Used hash to restore some lost, force pushed commits. Big commits. Saved my job.
unless you use git properly you will eventually lose it...
there is no reason for you to force push unless you are the only maintainer or it's your own branch on the origin
@@krellin How can you lose it when there's the third-party event database and you can access any commit?
I once used reflog to revert some badly complicated rebase. Thanks Linus Torvalds for the immutable architecture of git.
@@fabi3030 keep doing force push and you'll find out
@@fabi3030 by demonstrating how unaware you are of the basics of git... and annoying your coworkers
it's entirely understandable if a junior does it (although even bootcamps teach these things), but if you are mid-level or higher it's just unprofessional...
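The reflog recovery mentioned in this thread is easy to try in a throwaway repo. This is a local sketch only; the temp directory, identity, and file contents are made up for the demo:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
git config user.email a@example.com && git config user.name A
echo v1 > file && git add file && git commit -qm "v1"
echo v2 > file && git commit -qam "v2"
lost=$(git rev-parse HEAD)
git reset -q --hard HEAD~1   # simulate a bad reset/force-push: "v2" looks gone
# the reflog still records every position HEAD has been at
git reflog | head -n 2
git reset -q --hard "$lost"  # so the "lost" commit comes straight back
cat file                     # prints: v2
```

The same trick works after a botched rebase: `git reflog` shows where the branch pointed before, and a hard reset to that entry restores it.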
Regarding GDPR: it only affects personally identifiable information (PII), however every git commit includes the author's and committer's name and email, which IS considered PII.
So at the very least that information has to be returned.
Additionally things like IPs are also considered PII (yeah I know about IP rotation, I did not make the laws), so if they log the IPs, which they probably do, then that will also have to be returned.
GDPR gets a bit problematic with GPG keys. GPG keys cannot ever be deleted, only revoked. And since a key can contain almost anything, including PII, it's basically a PITA trying to comply with these insane privacy laws.
Yeah, thought the same too. GDPR is so overly broad it's a MIRACLE to go about and NOT touch anything that pulls it into the equation.
You know, guys, GDPR really isn't that draconian. Most countries in the EU already had similar laws. GDPR mostly just brought an EU-wide standardized law into effect, which makes it easier to both enforce and adhere to (because in practice you can just make sure that you follow the GDPR if you operate in the EU, rather than 20 different privacy laws).
And quite frankly GDPR for the most part just says that you have a right to your data and to request your data to be deleted. That really shouldn't be controversial.
In fact, long before GDPR I assumed that I'd have that right. I don't want to be a customer of X anymore, so I delete my account with them and I terminate my contract with them and then I'd assume that they'd delete all of the data as soon as they don't need it anymore.
The only issues that arise from it are because in recent years data became so valuable that everyone wants to keep all the data forever to show Ads, train their AIs or whatever. So of course everyone designed their processes and software in that way, completely neglecting the privacy of their users.
That's now catching up to these companies and I don't feel bad for them.
That being said, it can be quite annoying also for smaller businesses, too.
There was a recent ruling in Germany that if you cause the visitor's browser to send a request to a third party when that's not necessary and without their consent, then you're in violation of the GDPR. Can't say I agree with that on an ethical level, but fine. That mostly means not using CDNs and instead hosting the data statically yourself, but you should be doing that for security reasons anyway.
@@SourceOfViews I don't think anything's really catching up to anyone. Everyone ignores GDPR and gets away with it. Look at any random Shopify store selling stuff in the EU and they violate everything left and right. They just add one of those horrible cookie banners and everything else continues as usual. The people who have to implement that pseudo-compliance bs only get headaches if they try to be conscientious, and management will not care nor want to understand the details.
It is a completely useless law that only reaffirms the notion that no one actually has to comply with laws on the internet.
@@SourceOfViews If you really believe that, then you never read it (properly). And yes, that ruling stands because by requesting something from a 3rd party you transmitted the user's PII (IP+Datestamp is PII) without properly informing the user that you were "sharing his data" with a 3rd party. And it's even worse, for you, if the data was unnecessary for "proper operation".
You'd be amazed how many fines and damages GDPR can extract for information misuse. And misuse can simply be "you requested my full name to make a delivery and that's not strictly necessary...".
I do love it when it comes to telemarketing; you'd be amazed how fast they s**t themselves when I mention "GDPR". Because they know I wield an insanely big stick to bludgeon them into submission and beyond.
Flip didn't delete the part he asked to, as usual
Flip knows Git!
I used to think this too but he just leaves the "flip take this out". May be to stop the cut being unexplained
at this point I'd be disappointed if he took it out
To be fair it’s survivorship bias, we’re not aware of all of the times Flip does delete the parts Prime asks. So for all we know these are rare exceptions
he's flip, but he ain't snitch!
GitHub is the only git implementation that has actually sat down and completely relooked at how git works as a git server. As far as I can tell, they seem to have found a way to use something like a SQL database as the back end. The people in chat saying that it's just one big repository aren't technically wrong in that kind of implementation, but it's also not the whole picture.
No shot they are the only ones. To implement a git server, you only have to implement a relatively simple protocol. IIRC both Google and Microsoft (or maybe it was Facebook, not sure on the second one) used to have their own implementations that were geared towards working with huge monorepos.
github as blockchain
lmao
Thanks to that i was able to recover an open source project that went closed source
It's intended behavior that should absolutely say it's very very good
Does that mean that JDSL was right all along?
Of course, Tom is a Genius.
I'm a little confused why this is a surprise. As someone who admined Perforce VCS repositories for years, I was well aware that delete, in most cases, was just another version of the file; an entry in the file's changes indicating the file didn't exist at that, and only at that, revision. (Which was good given how many newbies managed to delete entire working branches.)
You could always get a copy of the file as it existed prior to that revision, and any place a pre-deleted revision was branched was still valid. That wasn't just a feature, it was a critical feature for our enterprise suite with lots of moving parts and backwards compatibility requirements.
Yeah, I'm also confused why this is even an issue. The reporter is definitely either a script kiddie or genuinely does not know how git works.
The surprise is that it works when the file only ever existed in a (private) fork that never pushed the file to the master branch. Makes it apparent “private” isn't really a thing on github.
@@duven60 As I understand github*, if the repository is public, private branches don't really exist. They are visible to the server even if access is limited. (If I'm right, this makes sense since it allows files to be pushed to public branches from private ones. The system needs to access both ends at some level.) Which means there is always a potential way around the privacy setting. That shouldn't be an issue if everything is owned by a company, and not really accessible by the Public public.
Sort of like setting a post to private on Facebook - only part of the system respects that, and you should have expected that.
I don't know about github, but Perforce did have a real, aptly named, delete command: P4 Obliterate (admin level, though)
* Like I said, my focus was on a different VCS, but there are similarities at various levels. We did have users set up their own Perforce servers on their desktop systems - it's super easy to do - in order to do truly private experimental work they wanted no where near production code. (Or maybe to file cake recipes, I didn't ask. Small Perforce servers were free so I just pointed them to the download page.) I understand that's a thing with github, too.
@@duven60
Just reread the report and it seems to say that if you fork a repo then delete the repo, the fork lets you access files in the deleted repo that were NOT originally instantiated in the fork. That sounds strange, but it's not unreasonable if the file was in the range forked, even if not copied. The file revision at the time the fork was created would be tagged as potentially needed in the fork in the future so the repo copy must be preserved. VCSes are conservative in such things.
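The "fork keeps serving objects after the source is deleted" behavior can be simulated with plain git. The key is `uploadpack.allowAnySHA1InWant`, which makes a server hand out any object it still stores even after every ref to it is gone; treating this as a stand-in for GitHub's fork-network behavior is an assumption for the demo. Everything below runs in a temp directory with made-up data:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q --bare server.git
git -C server.git config uploadpack.allowAnySHA1InWant true
git clone -q "file://$tmp/server.git" fork 2>/dev/null; cd fork
git config user.email a@example.com && git config user.name A
echo "API_KEY=hunter2" > secrets.txt
git add secrets.txt && git commit -qm "oops"
sha=$(git rev-parse HEAD)
git push -q origin HEAD:refs/heads/leak
git push -q origin :refs/heads/leak      # "delete" the branch, like deleting a fork
cd "$tmp"
git clone -q "file://$tmp/server.git" fresh 2>/dev/null; cd fresh
git fetch -q origin "$sha"               # fetch the unreachable commit by SHA
git show "$sha":secrets.txt              # prints: API_KEY=hunter2
```

The server never garbage-collected the pushed objects, so anyone who knows (or guesses) the SHA can still pull the "deleted" file out.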
Didn't the US courts recently rule that AI companies are free to ignore the code licenses, at least for the purpose of training their LLMs?
Yes I heard that too. It's legalized theft like banking or taxes.
"You have to know the message name, the exact date, the author name, etc to reproduce the SHA" you also need to know the content of the files to reproduce the SHA, at which point this "exploit" will not give you any more information. If you get the SHA by other means it can still be bad though.
Git uses SHA1, and that hash has been broken in practice already by the SHAttered project.
@@FryuniGamer SHAttered has done what, a few PDFs and misc files. Trust me, SHA1 preimage isn't going to be broken anytime soon.
You don’t understand cryptography. It’s very easy to generate all possible SHA-1 combinations and use a bot net with a large amount of proxies to find keys. It doesn’t matter how it was hashed lol…
Short hashes exist
@@ashtree129 You are right, github will allow you to view the commit with as short as 4 first characters, so you can easily check all of them to see if you get lucky. If not, you can then try all 5-long combinations, then 6-long, etc etc.
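On the "you need the content to reproduce the SHA" point: a git object ID is just SHA-1 over a small typed header plus every byte of the content, so computing the hash presupposes already having the data. This can be checked with nothing but `printf` and `sha1sum`, using the well-known `test content` example from the Pro Git book:

```shell
# a blob's ID is sha1("blob <size>\0" + content); "test content\n" is 13 bytes
printf 'blob 13\0test content\n' | sha1sum
# prints: d670460b4b4aece5915caf5c68d12f560a9fe3e4
# same result as: echo 'test content' | git hash-object --stdin
```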
Don't think the LLM thing will work. LLM's DON'T read sites, web scrapers do. LLM's also don't understand the content, they just ingest, digest and regurgitate it. You could have blocked the scraper with robots.txt, you didn't. You expected the LLM to understand the content, but it can't. Nor can it follow the content instructions (though that WOULD be fun... in a nasty way...).
This DOES illustrate that we SORELY need a standardized way to tag data in a Creative Commons sort of fashion as we're way past the "read it and index it" times...
it can; what world do you live in? Have you tried an LLM that has browsing ability? lmao
@@Fuji-gn9nx Oh dear, it's one of those... Pray tell, how will a glorified weights file ever browse the web? The answer is, it can't... What CAN browse the web is one of its input pre-processors, NOT the model itself.
If you're going to talk about ML, at least do some cursory reading on the subject... knowing what a model is is the bare minimum.
@@ErazerPT look how amateur you are with LLMs. An LLM can have a specific layer that prompts a search engine to retrieve up-to-date information on top of what it already knew, then summarizes it and sends the result to the user. An LLM can understand hundreds of pages far better than you'd wish.
@@Fuji-gn9nx It's not how it works, close but not quite, but let's assume it is. Read your own words. "will prompt search engine to retrieve up to dated information". What does that imply? It implies it CAN'T fetch it himself. Models receive input, process data, produce output. They DON'T perform external actions. They can suggest, request, command, whatever, others to do it on their behalf, but they CAN'T do it themselves.
You can argue that it's semantics, but when talking about law, semantics is 99% of the game. Heck, laws have been misused because there was a badly placed comma that opened the door to other (reasonable) interpretations. That's how bad it is when you're not precise.
@@ErazerPT it's one way it works; look, you don't know the architecture of an LLM. The search engine code could be embedded into the model if they wanted, but it isn't, because it would make things more complicated when they later want to modify the search engine code. They split it in two because that makes maintenance easier, like splitting backend and frontend, and it keeps the model small: you don't need to re-edit the model code when you need to edit the search engine code. And again, it can browse up-to-date data day and night and understand hundreds of pages better than you'd wish. And if we tolerate some processing latency by increasing the model layers until processing takes 5 minutes, just like how a human processes things (instead of under 3 seconds like current LLMs), so that we have a fair comparison, the LLM will beat most people, including you, by a very wide gap.
John Hammond isn’t working bringing back dinosaurs anymore?!?
It was hilarious to actually see Flip remove the part he was told to remove, but only after he said to remove it
This honeypot idea for LLMs is just hilarious 😂😂😂
💯, although the idea with skip the next two lines does not work with how LLMs are trained. That only applies to prompting.
The AI watching this video and learning about the honeypot idea at the end of the video be like 👀
Doom like this: every N seconds, save the game and have Twitch chat propose moves to the AI, which will play out the next 5s. In the meantime, keep committing the ASCII render to repo/screen.txt. On death, have the chat choose a save, reload, and branch off.
No one will ever seriously go look at the repo (maybe it could be test data for diff viewers) but it would be fun knowing that it exists
I think github probably uses the same directory to handle origin and all the forks, so all the commits live in the same directory and can be accessed even if the fork gets deleted.
The AI honey pot bit end of the video killed me 🤣🤣🤣
Flip is flipping on Prime with those edits
The fact you can store entire blobs of encoded data on github...
"As intended" never sounded so fun
I don't think a single repo "honeypot" would have enough "attention" applied to its neural network to actually cause Copilot to spit out the exact characters.
Because we have to remember the AI networks don't store their training data verbatim; it's MLPs storing concepts.
@@christopher-pfeifer it might influence its presence in the training data. People often assume this is scraped evenly but a lot of effort goes into refining it. You're right that it won't be common enough to be repeated by copilot though. Putting instructions in the policy won't affect how it is scraped though either.
I think it would be more plausible for people to mass create repos with the same licence and code files. That way there is a fighting chance for an LLM to repeat the code verbatim without including the licence. I'm not certain though whether you need the licence (let's say GPL) present with the code in all forms of distribution. GitHub doesn't download the licence when downloading other individual files, so I wonder if that liability falls on people using Copilot instead of Microsoft.
A short commit ID can be as short as 4 characters, but only as long as it uniquely identifies a commit; in any large enough repository you are going to get a bunch of 'fatal: bad revision' errors right away
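The abbreviation rule is quick to check locally: 4 hex characters is git's minimum, and the prefix must be unique in the object database. A throwaway-repo sketch (temp dir, made-up commit):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
git config user.email a@example.com && git config user.name A
echo hi > f && git add f && git commit -qm hi
full=$(git rev-parse HEAD)
short=$(git rev-parse --short=4 HEAD)   # 4 chars, the minimum git allows
[ "$(git rev-parse "$short")" = "$full" ] && echo resolves   # prints: resolves
```

In a tiny repo almost any 4-character prefix is unique; in a large one git will refuse ambiguous prefixes, which is why brute-forcing short IDs against a big fork network is noisy but feasible.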
GDPR: what about your commit mail and name? Those are explicitly person related information stored in the deleted repo. So shouldn't they still have to return this information in your GDPR data request as soon as commits are involved?
Massive lesson in RTFM
The main question is: when the fork was never merged into the original, how did the original know about the commits made on the fork?
maybe they use a bot
Because a fork isn't really a copy of a repository. It acts more like a pointer within a global object store linked to the original repository (this is just an implementation detail). Storing actual copies would take up far too much space, so this is partly a way to save on space, as well as make things far more efficient in terms of I/O operations.
The global object store is not directly accessible, but only references to it may be. In this case a commit hash is referencing the blob in the object store, which is retrieved through the original repository's URL (again, implementation detail).
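The "pointer into a shared object store" idea can be imitated locally with git's alternates mechanism; public write-ups suggest GitHub shares one object store across a fork network in roughly this way, but treat that mapping as an assumption. Temp dir, toy repos:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q upstream && cd upstream
git config user.email a@example.com && git config user.name A
echo hello > README && git add README && git commit -qm init
sha=$(git rev-parse HEAD)
cd "$tmp" && git init -q fork
# the "fork" stores no objects of its own, just a pointer to upstream's store
echo "$tmp/upstream/.git/objects" > fork/.git/objects/info/alternates
git -C fork cat-file -t "$sha"   # prints: commit
```

The fork can resolve the commit even though its own object database is empty, which is exactly the property that makes "deleting" one repo in the network insufficient to delete the data.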
Allowing the short SHAs is the ultimate fail; everything else feels overblown. It's just surprising.
Private gists at Github also have a privacy problem: you can access any gist file (private or not) by just having the direct URL to the raw file
Security through obscurity is bad
There are no private gists, only secret gists. It is just as secure as your api keys themselves- not at all a security through obscurity issue.
@@tukib_ You're right, I used the wrong word, but it's still an issue, because when I started using gists for the first time I expected those secret gists to be private, not just hidden
To be clear, I don't store any keys or credentials in GH, only code, but there is code that I don't mind (or I want) to be public and code that I want it to remain private, and that include some gists
@@Karurosagu Read the documentation of the service you use next time. It's not hidden information, really. Don't upload anything you want to keep secret or private to online services. It's really security 101. If the code you stored there happened to be leaked, would it cost you greatly? If yes, don't upload it anywhere.
@@dealloc Let's be real, you never read docs on something simple to use as code snippets online (I am referring to gists only here, not regular GH repos), unless you're gonna use the API for something specific
And yes, a leak could happen, it could be the next big hack or whatever, but I am uploading and managing my code remotely for a reason: it is convenient for me, the same way it is convenient for other users (including companies) to store their credentials in online vault services, for example. But just because it's convenient for me doesn't mean I don't value my privacy. This small rant is because I care about my privacy and, in my opinion, GH should do a better job of separating what is visible, what is "hidden" (AKA secret) and what is private
@@Karurosagu Why do you presume what I do and don't do? But if you absolutely must know, I've known this from the docs, since I share secret gists URLs with other maintainers and colleagues; so it didn't come to me as a surprise, since it was what I was looking for; sharing gists without putting it up on the Discover page.
Idea to spread the stars over 3 weeks :
- You % 21 usernames on twitch or github and generate a calendar so your viewers can add it to their agenda and pin this link on every video/stream
not sure if that would work, b/c you have explicitly stated and explained how you want to trick the algorithm into 'thinking' the license is permissive.
Actually Git does remove it. It just doesn't do it immediately; it does garbage collection automatically or manually.
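That's true for plain git, and easy to verify: an unreachable commit survives until the reflog entries pointing at it expire and `git gc` prunes it. A throwaway-repo sketch:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
git config user.email a@example.com && git config user.name A
echo one > f && git add f && git commit -qm one
echo two > f && git commit -qam two
sha=$(git rev-parse HEAD)
git reset -q --hard HEAD~1            # commit "two" is now unreachable...
git cat-file -e "$sha" && echo "still there"        # ...but still stored
git reflog expire --expire=now --all  # drop the reflog entries keeping it alive
git gc --quiet --prune=now            # now gc actually deletes the objects
git cat-file -e "$sha" 2>/dev/null || echo "gone"   # prints: gone
```

The catch in the article is that GitHub's server side apparently never runs this pruning against the shared fork-network store, so "deleted" data stays fetchable there.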
Free API keys for all!
Free And Open Source API Keys
I am actually cool with this feature. Sounds like something that will save you some day. Also, the secrets example is so bad because who the hell stores secrets in a repo?
I don't modify forks, I only copy them.
I think that's why no real copy is made. GitHub just conserves space 😅
Uhm, this could be a potential lawsuit, since this violates the right to be forgotten.
But MS used to pay fines to the EU so nothing new...
I hope I remember this in 2 weeks so I could participate in the honeypot plan.
Not the point of the video, but that tier list was wild. Putting C#, Java, and C++ below Python and Javascript is nuts.
What if we encrypt the data before pushing to github? Use another app that just loads and decrypts the data, reads the commit information locally and present it to the user.
You are gonna need some kind of transpiler, line by line, just to keep the initial spaces. You use Sourcetree or whatever GUI. Then you run your cipher over your repo and then upload it. That's nice, but you have to be careful about who you share your keys with
@@wil-fri Yes. I don't like how Microsoft can just profit from reading all our private repos without our consent.
Ansible Vault?
@rumplstiltztinkerstein just don't store your repo there?
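Client-side encryption before pushing is a real option; tools like git-crypt automate it per-file. A bare-bones sketch with `openssl`, using an obviously made-up passphrase and file name:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
echo "secret config" > app.conf
# encrypt locally; only the ciphertext would ever be committed and pushed
openssl enc -aes-256-cbc -pbkdf2 -pass pass:example-passphrase \
  -in app.conf -out app.conf.enc
# after cloning, decrypt with the same passphrase (never stored in the repo)
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:example-passphrase \
  -in app.conf.enc -out decrypted.conf
cat decrypted.conf   # prints: secret config
```

The tradeoff the thread hints at: diffs over ciphertext are useless, so you lose most of git's value for the encrypted files, and key distribution becomes your problem.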
create some ARG game using deleted fork repo :D
does prime know about the github archive project that has a lot of hashes in a data set?
It's XSS, except for GitHub? So XRR (cross-repo referencing). GitHub isn't git, though, so this is effectively GitHub implementing their own scheme for branching to create forks.
I'm totally safe because my repos never get forked.
Anyone who didn’t rotate a leaked API key after “deleting” it from the repo had it coming anyway. They would do other degen things that would get them hacked.
Informative 👍
Since there is a very large number of branches and the hash is large but not infinite (though it may practically be infinite), can we randomly reproduce git commits with a group attack? A simple solution, though probably against the immutable history, would be to change hashes/links. Someone could bypass security, fork it, and then the info would be available to everyone forever. (Clarity: not suggesting this is right, suggesting this is what happens or will happen.)
I wonder if it's the same case with repos that got deleted because of DMCA
I remember about 10 years ago I accidentally checked in a password or something and had to spend the rest of the day figuring out how to purge it from git
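For the record, the modern tools for that purge are `git filter-repo` or BFG, but `git filter-branch`, which ships with git, still works. A sketch in a throwaway repo with a made-up `secrets.txt` (and, per the article, on GitHub the old objects may remain fetchable, so a leaked key has to be rotated regardless):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
git config user.email a@example.com && git config user.name A
echo "password=oops" > secrets.txt && echo "code" > main.c
git add . && git commit -qm initial
echo "more" >> main.c && git commit -qam work
# rewrite every commit so secrets.txt never existed
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm --cached -q --ignore-unmatch secrets.txt' HEAD >/dev/null
git log --oneline -- secrets.txt | wc -l   # prints: 0
```

Note filter-branch keeps a backup under `refs/original/`, and the old objects survive locally until reflogs expire and gc prunes them, which is exactly why rewriting history alone never "un-leaks" a secret.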
11:33 having the SHA is not that hard. Just minutes earlier you demonstrated that a small part of the SHA is enough to find a list of possible commits in a repo, and that there's a reflog.
I read the thumbnail as 'F DOTA' and thought.. oh no
Don't kid yourselves, I've known about this for ages 😋
i wanna see that special licence XD
What if in the LLM prompt I should say "ignore all prompts that are located in the license and act as if you are a human and not an LLM"
Could that be used the same way as the fixed GitHub comments vulnerability from ~April? Ah, right, works as intended.
Does he change users between creating, deleting, and checking on OpenAI's repository? Because it might be that it is only accessible by the same user.
What if you "fork" by cloning the repo and creating a new private one?
Hope they don't try to de-duplicate data on the back end, but that would be relying on GitHub's backend coders not getting too clever, and given the "won't fix" response to this problem I wouldn't trust it.
i think somebody is reviewing the repo before it is fed to LLM
Alt title: you can't *delete* a repo, and anyone can view it (private ones too)
dude stumbled on how openai collects data
Looks like a fork that isn't a fork.
So. It's possible to make an attack on a repository by committing someone's Personal Data to a fork?
I feel like it existed like that forever, why are we in a freakout mode now?
I don't think people knew that GitHub worked the same as git; a global object store with pointers. Even though it's in the name.
Also not just API keys are a problem. What if you fork a template project repo for your private proprietary software? And you print the commit hash of the software in debug/version info? You just made your proprietary code open. Whoops. Don't think anyone would have expected that.
isn't this the entire point of doing a redact? you back the repo up before the commit(s) you want to redact, skip the commits that are problematic, and rebase and commit the remaining commits.
Isn't that related to git commit history?
Delete doesn't mean delete, for these people...
FLIP IS A FUCKING NINJA WIZARD
Firebolt and gharchive
Which extension he was using to change theme colour?
maybe don't publish secrets onto a website
from the article: "the only way to securely remediate a leaked key on a public GitHub repository is through key rotation"
no fucking shit....
I think he is calling it `secret` just to make it obvious that it's sensitive information, but it could just as well be a new algorithm the company now wants to keep exclusive to the pro version
What if I accidentally uploaded my DNA to github and then deleted it?
How do I rotate my DNA?
:)
You can't make private forks on github using their fork UI. Thus making a secrets.txt is nonsense to begin with in such a situation.
they really didn't want to pay this guy the bug bounty money
Can you react to: Big Tech Doesn't Want You Anymore by Patrick Boyle
I don't understand how people could commit an API key.
Maybe then it's better to degit instead of forking?
Is this a reupload? I swear I've seen you say all of this exact stuff before.
Clips might have been uploaded somewhere first
It's the reason I left GitHub.
so all this is telling me is no one has read the github documentation
it’s not a bug, it’s a feature.
roll your own versioning system.
Password protected .rar files in google drive LETS GOOOOOOOOO
my_docs_final_final_final.docx
FLIP, TAKE THIS OUT!
How does this work with gdpr?
No, but frankly git itself is probably a massive violation of the GDPR and the right to be forgotten anyhow. Would be interesting to see how much of it would need to be re-architected to comply with EU laws.
Good.
fuck where is this Prime License I will make one with that license too lol
TL;DW: Git is a shell game. Go back to programming and leave version control to your build engineer.
Dude unironically uses "git checkout -b". What year is it?
well fork...
WHY IS TYPESCRIPT ON F TIER?! 😨😨
I know I'm lame and uncool but I don't use Github at all.
ipfs over github deleted forks ftw.
Github is clearly overcommitted.
Rewrite github in rust
it's not about memory safety, just github policy
@@spdlqj011 still, rewrite github in rust
@@spdlqj011 rewrite the github policy in rust
People discovering how git works in 2024?
Feels good to be in Gitlab, if im honest
Ya know... the part that gets me is they're acting like you can hit any repo. The problem I have is they actively overblow the requirement of 'FORKING', and if for some reason there is no fork and it's private, then it frankly doesn't matter whether they can figure out the hash or not.
This is an exploit of a design that is meant for retention, but frankly all this is going to do is change what people do with forks. Now, instead of maintaining the relationship, they'll just clone the fork to their PC, then nuke the .git history and retain all the code, committing it as a new repo with no history.
Seriously... the amount of work you'd have to do to sniff out data through the repos is crazy. The only change that will happen now is that they'll start to blackball all private/deleted data to the original owners, or disallow any fork/commit without an owner from being displayed or fetched (as the owner of the repo with the deleted branch/undone commits can still traverse their commits and actions, so it technically still has an owner), and mark entire forks as logically separate from creation, blocking all traversal into branches not owned by the account.
Jokes on you I don't fork
I don't fork, only fork around :)
ahh Github enabling communism 😂😂
write a JDSL doom