Does anyone else ever fantasize about what we could do with our computers if half of our development time and half of the computational energy didn't have to be dedicated to security?
No. What I fantasize about is what we could do with our computers if "hello world" compiled to less than ten gigabytes (with 29345 files in tow, all irreplaceably ESSENTIAL!) these days. And yes, this is hyperbole, but it unfortunately holds true for ANYTHING more complex than "hello world".
Simple fix. Switch to TempleOS! No security, the user is king! But ya, we also have built-in protected memory addresses and stuff for operating systems.
@@AttilaAsztalos roll everything yourself. I built my own graphics framework to replace SDL2 in my FOSS projects. One of my builds dropped from 8 MB to 2 MB. I replaced GLM (which is practically an industry standard for graphics programming) with a custom linear algebra library and gained a 30x FPS boost in my UI. Granted, I don't need several thousand FPS for something that isn't a game engine, but it still amazes me how inefficient GLM was considering how widely used it is and how easy it was to reimplement.
@@vrclckd-zz3pv I would if I could. But I'm a chip monkey, not a code monkey, and condemned to remain so (yes I did try). On the other hand, the full-3D full-parametric CAD I use almost daily is a SINGLE FILE, and is, in total, smaller than ten MEGABYTES. It's called "Solvespace", in case you were wondering... Now, maybe, you understand why half-terabyte abominations make me mad.
This has been a 'solved' problem on IBM Power for decades: in big-endian mode there are memory tagging extensions that use a hashed page table. Some other great stuff in Linux, like wipe/fill on free and alloc, works great on Power too. In little-endian mode the overhead becomes heavier because the check bits go at the 'wrong' end of your address, so you need 'software' solutions like XORing or putting the check bits where they eat into allocatable memory. It's good that Arm is bringing this feature back into general-purpose processors; it's been a thing on IBM Power for like 30 years.
Yeah! That's one of the techniques that can be used to avoid buffer overflows: make sure you only copy as much into the buffer as it has space available.
Yes. However, he's just using the obviously-unsafe gets() as an example. Real-life code is much more complicated, and mistakes happen; it's not always that easy. For example, many low-level protocols are of the form "2 bytes encoding the length, followed by data". Now if the attacker lies about the length of the data... There are tons of other ways to trick software into reading more (or less) than it should.
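For anyone who wants to see the length-prefix pitfall concretely, here's a minimal sketch; the names (parse_msg, MAX_PAYLOAD) are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 64

/* Wire format: 2-byte big-endian length, then `len` bytes of data. */
int parse_msg(const uint8_t *pkt, size_t pkt_len, uint8_t *out /* MAX_PAYLOAD bytes */)
{
    if (pkt_len < 2)
        return -1;
    uint16_t len = (uint16_t)((pkt[0] << 8) | pkt[1]);

    /* BUG: trusts the attacker-controlled length field. If len > MAX_PAYLOAD
       this overflows `out`; if len > pkt_len - 2 it also reads out of bounds. */
    memcpy(out, pkt + 2, len);

    /* FIX: validate against both the output buffer and the actual packet:
       if (len > MAX_PAYLOAD || (size_t)len > pkt_len - 2) return -1;       */
    return (int)len;
}
```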
You still have the problem that the CPU assumes that you are returning to the same place you came from. A buffer overrun hack abuses the CALL and RET machine code operations. When the CPU executes a function return, it assumes the top 8 bytes of the stack are the correct address to return to and will go there; it doesn't check whether the value of those bytes has changed since the last CALL instruction wrote them there.
I have been under the impression that we don't access physical addresses anymore? That raises the question: do raw pointers perform a jump instruction to the address they point to, or do they move the instruction pointer to said address?
Hey, stupid question that you probably won't see, but somewhat relevant to the video: how much of a security risk is CUDA allowing you to create a memory buffer that is visible to both the GPU and CPU? I'm asking myself that question because (from my understanding) it's allocating memory and directly giving access to the physical address of the memory pointer so that both the GPU and CPU know exactly where the data is. The GPU driver is then supposed to keep this memory range allocated (cudaMallocHost) until you deallocate it (cudaFreeHost); however, I remember a while back a bug where the driver never deallocated the memory even if we deallocated it in the program. The result was that my benchmark program for GPU testing was able to crash a 128 GB computer with 2 Xeons and 2 top-of-the-line GPUs at the time with SLI, split across 2 NUMA banks. It also worked on a standard laptop. The test was specifically benchmarking memory transfer between CPU & GPU, and by running it enough times it was able to reserve most of the memory to itself even after we stopped the process. I followed the RAII principle, using classes for all of the GPU-related memory acquisition and using the standard library for anything else. I even managed to make a template that allocates the correct type of memory for a CPU lambda that you want to send to the GPU; however, the issue was still there afterwards.
Cryptographic authentication of pointers... that's, um, interesting. Refusing to make fundamental changes to the architectures we still use leads to all kinds of delirious approaches. Completely separating return stacks from data stacks would solve 90% of all security issues due to stack corruption. You could still corrupt data, but not make the code go where it shouldn't. And if you add to that proper validation of your inputs everywhere it's at least moderately critical, you solve probably 99% of issues. There will always be this annoying 1% left, but it may be delusional to think we can get to 100% anyway. And these changes would imply a bit more work for sure, both on CPU design and on software, so we prefer "enhanced status quo" (= do everything as usual, but with new tools).
@@joseoncrack To be fair, proper validation of input may very well solve 99% of the issues on its own. The issue with a return-specific stack alone is that it wouldn't prevent JOP, or calls through pointers, so it would just change how stuff gets exploited. That might solve 90% of current issues, but not 90% of all issues.
...Wouldn't shift-arithmetic on the pointer (sig = iptr >> 35) reveal it? And if the signing algo is public knowledge (it _will_ be whether it's intended or not), you could do shenanigans like (iptr = getSig(badptr)
It's stored in the actual value of the pointer; you only ever have access to the virtual memory addresses of your program. You ask to sign 00000000, the value in the return register; the MMU maps that to 00001024, where the first four bits are reserved for things other than addressing. It sets the signature in the bits it doesn't need, but in your program you still read 00000000 thanks to the MMU. If you don't use the MMU then you probably could, supposing the instruction is even supported in that mode, but it wouldn't make much sense to use real mode and care about this.
@@WackoMcGoose signed != encrypted. You want to verify the pointer didn't change, not obfuscate the pointer. If you strip the pointer of its signature, while you get access to its original value (which was never hidden anyway), you still can't return to it, since the signature check will not match.
In C programming, the return 0; statement at the end of the main function serves a specific purpose:
Indicates Successful Termination: It signals to the operating system that the program has executed successfully and terminated without encountering any errors.
Exit Code: The value returned by the main function (in this case, 0) is known as the "exit code" or "return code." The operating system uses this code to understand the status of the program's execution.
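A minimal demonstration; after running the program, a POSIX shell shows the exit code via echo $?:

```c
#include <stdio.h>

int main(void)
{
    puts("hello");
    return 0;   /* exit code 0 = success; a nonzero value signals an error */
}
```

For example: `./a.out; echo $?` prints "hello" and then 0.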
Is pointer signing enough? It's still possible to do an arbitrary function call:
1. Find an exploitable function.
2. Somehow read a signed pointer to that function.
3. Overwrite the stack with a valid frame mimicking a call to that function.
4. Overwrite LR with our signed pointer.
5. Return.
One thing people often don't know: the absolute minimum requirement to jump to an arbitrary address in x86 (I'm aware of other methods, but this one is very simple) is just 6 bytes, equivalent to "push [32-bit address]; ret;" in assembly. This jumps to an arbitrary address, as "ret" is surprisingly more complicated than one may expect.
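For reference, those 6 bytes as a raw byte sequence (standard x86 encodings: 0x68 is push imm32, 0xC3 is ret); the target address 0x12345678 is just a placeholder:

```c
/* 6-byte "jump anywhere" sequence for 32-bit x86 */
const unsigned char gadget[6] = {
    0x68, 0x78, 0x56, 0x34, 0x12,  /* push 0x12345678 (imm32, little-endian) */
    0xC3                           /* ret: pops 0x12345678 into EIP */
};
```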
Maybe I'm missing something, but what prevents an attacker from exploiting something else to get their pointer signed? The clang API has to result in actual machine code; why can't they execute the same code? I suppose that if, currently, the most common path to exploitation is ROP gadgets, then you can't start the execution of those via retaa. But I'll be surprised if no other path exists.
@bryankadzban1159 If an attacker gets the ability to execute code, this becomes an irrelevant measure. However, this is supposed to prevent an attacker from gaining execution in the first place. If the way an attacker wants to get execution of their code is to change a pointer to something else, this prevents it. The attacker hasn't yet gained execution to ask the system to sign the pointer; they just have the ability to write an essentially infinite amount of data that they want to use to manipulate the way execution happens. They can't pre-sign the pointer because the key necessary for signing is part of the system. So in the end, if implemented correctly, an attacker never gains access.
3:03 If a hacker changed the value of the return address, they would do it inside of the "gets" function, meaning the "return" inside "gets" would execute the malicious code, not the one inside main. That's also why I think your code at 7:12 is not secure, since you check the MAC only after "gets" returns, which would be after the malicious code executes.
5:40 - It’s pretty common for the virtual & physical addresses to share the bottom N bits (i.e. they aren’t remapped), where N relates to the page size, as in 2^N. Your example is different - maybe you could have used 0x9e90 vs 0xface in the physical example? Actually, for a 4K page size that’s just the 0xe90 part (12 bits). #JustThinkingOutLoud
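In code terms, for 4 KiB pages only the upper bits get remapped; a tiny sketch:

```c
#include <stdint.h>

/* With 4 KiB pages (N = 12), translation replaces only the upper bits;
   the low 12 bits pass through unchanged. */
static inline uint64_t page_offset(uint64_t addr) { return addr & 0xFFFu; }

/* e.g. a virtual address ending in ...9e90 and its physical counterpart
   share the offset 0xe90 within the page */
```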
I wonder how this handles systems where the programmer, e.g., has an array of some struct referenced by a char pointer and adds 3*sizeof(the_struct) to some position in the array to get the bytes of the 4th struct instance after the current pointer.
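For reference, the pattern being described looks roughly like this (struct name invented). As deployed, return-address signing doesn't touch ordinary data-pointer arithmetic like this; data pointers are only signed if you explicitly opt in:

```c
#include <stddef.h>

struct the_struct { int a; double b; };

/* Hop over 3 whole structs, byte-wise, to reach the 4th instance after cur. */
struct the_struct *fourth_after(char *cur)
{
    return (struct the_struct *)(cur + 3 * sizeof(struct the_struct));
}
```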
I'm curious about how that signing works. Maybe it works against overwriting pointers using byte-by-byte or number-over-pointer writes. But I doubt it can save us from situations where code already writes pointers based on input. It sounds more like randomization of memory layout, but with more predictable behavior.
If I remember correctly, the IBM AS/400 and its descendants have done this for decades. Their pointers are extremely long. Disclaimer: my memory comes from the dim and distant past. Please correct me if I am mistaken.
When it comes to these types of vulnerabilities, C/C++ is always mentioned, but I wanted to know: can programs written in Rust, Python, or JavaScript (Node.js/Deno/Bun) be hacked by changing pointers? Or do those languages already use some sort of encryption and not need more protection than what's out of the box?
@@XCanG Well, they all can, kinda. Low-level languages are more vulnerable, because pointers are exposed. Rust does a bit better because pointer manipulation is restricted to unsafe blocks, and this is enforced by the compiler (that's however not perfect, so there are ways). Higher-level languages usually provide a runtime which takes care of pointer manipulation and prevents the user from introducing such issues (JS/Python/Java/C# run in a kind of virtual machine). However, said runtime can have buffer overflow exploits, since runtimes are usually written in lower-level languages.
@@Darkyx94 A runtime language doesn't even need ROP/messing with pointers. They use JIT, so an attacker could just write the instructions they want to execute into RWX memory.
@Masq_RRade @@nowave7 Off-topic: there is no such thing as "it's just bytes, you can interpret them however you want" in C, because of the strict aliasing rule; at least not without using functions that implicitly create objects, like memcpy.
So if I understand correctly, it prevents buffer overflows, but not out-of-bounds reads or anything else. Return address spoofing might also be a little bit tricky after performing hooks.
I wonder if 128-bit machines are going to come sooner than we think. If mitigations like signed pointers become a thing, then the more bits in addresses etc., the better. The more bits for encryption, the better, right? Or I guess, the better the algorithm, since FS EC is fewer bits than RSA, right, like 256 bits upwards?
Unlike normal encryption, pointer encryption is not the first line of defense. A single wrong guess crashes the program, so any attack that doesn't always work will be detected very quickly and the vulnerability in the code can be fixed. This means that the encryption doesn't have to be very strong. The downside of using more bits is of course that at the very least one needs more silicon area on the CPU chip. In addition larger hashes take longer so there can be a performance penalty. As long as the hash is quick enough that doesn't matter because memory access is slow compared to other CPU tasks, but if the time rises above a certain amount of clock cycles there will be an impact. (Kind of like putting an extra object into a box that still has space is no issue, but if the extra object becomes too big you need to get a bigger box)
In that sense, doesn't it mean the "malicious pointer" still exists somewhere? So that malicious function can still have a signed pointer? How does signing the pointer help in that case?
Things are signed with a discriminator. The mal-pointer elsewhere would likely be signed with a different discriminator than the pointer at the location we were trying to replace.
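A rough sketch of address-blended discriminators, assuming clang's arm64e pointer-authentication builtins; key numbering and exact spellings vary by toolchain, so treat this as illustrative only:

```c
#include <stdint.h>

typedef void (*callback_t)(void);

void *sign_for_slot(callback_t fn, void *slot_address)
{
    /* Blend the storage address with a small constant tag so the same
       function pointer gets a different signature in every slot. */
    uint64_t disc = __builtin_ptrauth_blend_discriminator(slot_address, 0x1234);

    /* Key 0 is the IA (instruction, A) key on arm64e. */
    return __builtin_ptrauth_sign_unauthenticated((void *)fn, 0, disc);
}

callback_t auth_from_slot(void *signed_fn, void *slot_address)
{
    uint64_t disc = __builtin_ptrauth_blend_discriminator(slot_address, 0x1234);

    /* Faults (or yields an unusable pointer) if the signature or the
       discriminator doesn't match what was used at signing time. */
    return (callback_t)__builtin_ptrauth_auth(signed_fn, 0, disc);
}
```

So a signed pointer copied from somewhere else fails authentication here, because it was signed against a different slot address.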
Wouldn't it be simpler to just create an instruction for assigning pointer ranges (and an instruction for removing those ranges)? The CPU could just store the addresses in a dedicated cache, or dedicate a part of the existing cache to the addresses. It already has something for page ranges, so why not something for pointers within those pages?

Personally, I would like dynamic memory from malloc/new/etc. to only be accessible via dedicated functions like memcpy or something. For example, if you create a list with malloc, the memory given should not be directly addressable; instead, to get any element in the list you'd have to use memcpy to copy it into a stack variable.

I'm currently designing an arena allocator with just such a property: it doesn't hand out addresses but offsets from the base address (see the sketch below). My malloc replacement, made for testing purposes, just adds the base address back, because I can tell the arena on creation AND on growth that I don't want it to move. At the moment it uses actual memory, since I need to iron out bugs, but later, once I've got it working, I'll convert it to a variant that uses something similar to /proc/self/mem. If I could make access to the file/pages require a key, then I could protect them both from externally controlled reads/writes. The problem would be: how would debuggers be able to work on it then 🤔
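A minimal sketch of that offset-handle idea (all names invented, bookkeeping simplified): the arena hands out offsets, and reads go through a bounds-checked copy instead of a raw pointer:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *base;   /* backing storage */
    size_t   cap;    /* total capacity  */
    size_t   used;   /* bytes handed out so far */
} arena_t;

typedef size_t handle_t;             /* offset from base, never a pointer */
#define BAD_HANDLE ((size_t)-1)

handle_t arena_alloc(arena_t *a, size_t n)
{
    if (n > a->cap - a->used) return BAD_HANDLE;
    handle_t h = a->used;            /* the "address" is just an offset */
    a->used += n;
    return h;
}

/* The only way to read: copy out into caller-owned storage. */
int arena_read(const arena_t *a, handle_t h, void *dst, size_t n)
{
    if (h >= a->used || n > a->used - h) return -1;   /* bounds check */
    memcpy(dst, a->base + h, n);
    return 0;
}
```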
It's called CHERI and it's still being developed. On 64-bit it extends pointers to 128 bits and encodes some properties (length and allowed access) into the pointer itself. Unfortunately there's a ton of poorly written shovelware that assumes "void *" is the same as "unsigned long", and lots of code that needs to be recompiled.
Ahh, I wish the title of the video was properly explored. Naturally, it does a great job explaining what pointer signing is about, but I wish there was a discussion section in the video focusing on adoption and whether it ever makes sense to sign *all* pointers. Cool topic though!
I was thinking it would be baked into the compiler, and it is, just not for all the use cases. It really looks clunky for, let's call them user-introduced pointers, for lack of a better term.
It's just one more assembly instruction; it might be a few cycles slower than the return call that doesn't check, but we are probably wasting way more cycles per call doing other kinds of checks to mitigate this problem. Resource-hungry malware like Denuvo is being installed into hardware to stop these kinds of exploits; if we can do away with any of those, we end up reclaiming wasted overhead.
Yes, but there's a performance penalty since it's an optional feature. ARM TrustZone is a parallel system that is mostly invisible to the executing environment.
In the video it looked like it required additional manual code? It also sounded like it does more work on the chip to sign, verify, and authorize all the pointers? A shadow stack is just a compare of pointers in a separate hardware pointer stack and doesn't require additional code, iirc?
@@nightshade427 Yes, it's "just" that, but it's not a separate system. It's built into the chip, and it's dark silicon unless it's used. That means sometimes it's using features that are LOTO until a normal thread can use them, so that's the performance penalty, which also precludes side-channel attacks.
The really secure way is to hold return pointers inside registers instead of on the stack. I'm too lazy to search for the source, but I'm pretty sure Intel and Arm are both working on this.
certainly makes sense, but also only helps for the immediate return of the current function. All layers above would still need to store their return pointers on the stack.
Another mitigation is implementing a shadow stack that only holds saved return addresses. Calling a routine pushes both to the actual stack and the shadow stack, returning from a routine pops it from the stack and checks against the shadow stack. This is already implemented by Intel processors
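Conceptually, hardware shadow stacks (e.g. Intel CET) behave like this software model; real implementations do the bookkeeping transparently as part of CALL/RET, this sketch just makes the check visible:

```c
#include <stdint.h>
#include <stdlib.h>

static uintptr_t shadow[1024];   /* side stack holding only return addresses */
static size_t    shadow_top;

void on_call(uintptr_t return_address)       /* models what CALL does */
{
    if (shadow_top < 1024)
        shadow[shadow_top++] = return_address;
}

uintptr_t on_return(uintptr_t stack_copy)    /* models what RET does */
{
    uintptr_t saved = shadow[--shadow_top];
    if (saved != stack_copy)                 /* main stack was tampered with */
        abort();                             /* control-protection fault */
    return saved;
}
```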
Firstly, as someone has said, I wonder what kind of impact this has on code execution. Because this is entirely done in hardware, from what I've read, I think the performance might not be that bad. However, I also KNOW that this kind of pointer authentication has already been successfully bypassed. Moreover, if you can modify the return address, there's a high likelihood that someone can modify retaa so that it returns without authentication.
About your last point: retaa is an instruction located in the part of memory where the program is stored which is usually write-protected. While the actual return address is simply a value on the stack. If an attacker manages to remove write protection on the program they already have arbitrary code execution and don't need to mess with returns anymore.
The whole point is the hacker wants the program to return to an address they specify. If you put the original pointer back then that won't happen. You also can't just put the "signature" (top bits) on top of your own pointer as that will fail the check.
This would require a separate data structure/memory location to store return addresses in. Forth does something similar with its distinction between the data and return stacks. Say the programmer was holding a function pointer in a stack variable and calling that later. Having a separated stack isn't going to stop the user from getting an arbitrary execute primitive if there's a buffer overflow on the stack. I don't think the extra security potential it adds outweighs the performance impacts it might have.
@@imciviled So I'm not an expert on this at all but I don't think you've described the mechanism correctly. My understanding is there is no separate pointer storage. The pointers are in the usual places but simply with additional decoration (hashed high bits). This decoration is done by the kernel when requested and when the function return pointer is popped with the secure version of that instruction then the kernel verifies the decorated pointer and throws an exception if it doesn't pass. You are right that this will certainly impose performance restrictions simply due to the extra work needed to compute and verify decorated pointers. There will also likely be cache locality impacts.
retaa checks the upper bits for signature. If you overwrite it with the original address, the part it checks would essentially be "0000", which is an invalid signature and cause an exception.
I don't agree. I feel we need a new executable format entirely, with a new kernel memory layout that is more secure. PE/COFF and ELF both came out around 1992 or 1993. And about 64-bit: aren't our CPUs 48-bit or 57-bit for practical reasons?
Someone explain what I am not understanding. Can this not just be implemented across all architectures when the code for physical and virtual memory is written? Why is this specific to ARM?
While the operating system kernel plays an important role in memory mapping, most of the heavy lifting is done by the memory management unit in the CPU. This is essential for good performance. That means that some features need support in the MMU hardware to be effectively implemented.
Modern CPUs and their NPUs allow direct memory access. So they seem like a perfect attack vector for privileged hardware (inside the CPU) that can run code easily... like from a website doing some NN inference locally via WebNN or something.
How can you disable NX with a simple write? You would need to write to the page tables in kernel memory, which won't work unless you already exploited some other bug….
The buffer is stored on the stack, the same part of memory where the return address is located. However the instructions are stored in a different (usually write-protected) part of the memory so they can't be changed by this kind of attacks.
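Right: page permissions live in kernel-owned page tables, so changing them takes a system call the kernel can refuse, not a stray write. A minimal POSIX sketch of how permissions are actually changed:

```c
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 4096;

    /* writable but NOT executable (the usual W^X arrangement) */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    /* the only legitimate way to flip a page to executable; the kernel
       mediates this and hardened systems can deny it outright */
    if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0)
        perror("mprotect");

    munmap(p, len);
    return 0;
}
```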
A failed authentication causes a program crash. If an attack crashes your program 9 times out of 10, the likelihood that the vulnerability is detected and fixed before any real damage is done is extremely high.
I don't understand how this works. If the return address is pushed onto the stack before control switches to the function, how did they overwrite it by overflowing a buffer that is located after the stack address storing the return address? Are stack buffers written in the reverse order they are allocated in? Or is the return address written to the stack after the variables and buffer are allocated? I don't see any reason why that would be the way it works. Would be glad if someone shed light.
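A quick way to see the ordering on a typical downward-growing stack (layout is compiler- and ABI-dependent, and __builtin_frame_address is a GCC/clang extension, so treat this purely as illustration):

```c
#include <stdio.h>

void victim(void)
{
    char buf[16];
    /* The stack grows DOWN, so buf lands at a LOWER address than the frame
       base; the saved return address sits just above the frame base. Writing
       past the end of buf therefore moves UP, toward the return address. */
    printf("buf starts at        %p\n", (void *)buf);
    printf("frame base is around %p\n", __builtin_frame_address(0));
}

int main(void) { victim(); return 0; }
```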
@@rb1471 This mitigates exploits where no instruction injection is done. The exploits being mitigated here are ROP (return-oriented programming) and JOP (jump-oriented). The idea is to corrupt data so the existing instructions behave differently (ROP corrupts the return address, JOP corrupts the jump target). Memory space is usually protected so that you either have write or execute access to it, but never both. This means that return/jump addresses must be stored in RW memory space (the return address is on the stack). The feature shown here adds a way to confirm the address did not change before jumping/returning to it.
Could you explain how triggering exceptions, like the hypervisor does, changes execution states? Memory accesses that aren't already in the currently running program's address space should cause an exception, which is a hardware-enforced security measure at the operating system level. The Xen Project comes to mind, but I'm not sure if they actually use that mechanism. Most hardware has VT hardware that has gone unused, because operating systems programmers felt it was unnecessary. I'm not so sure of that, but if what I imagine is true, then it would make a whole lot of old tech already a viable option for a more secure operating system without glow worms up in yo sheeeeeeeeeeit. Stay frosty, pals.
Does the program crash if someone tampers with it? Then it would be a quick way to exploit and crash other programs like anti-malware… If not, 29 bits is not enough, as you could just keep guessing, and within a few minutes you'd likely gain an overflow.
The program crashes. But an attacker can only crash the program that has the vulnerability, not other programs. They could crash this program anyway if they wanted to.
I don't get it. I thought all of this was prevented by using virtual memory? ARM is used for lots of embedded systems, so maybe there is some benefit there where you don't already have segmentation faults.
Memprotect is broken and will be fixed… With all pages handling user input permanently marked as non-executable, in addition to a memory-safe language, there is much security to be gained. Hashing pointers needs hardware AND compiler support; it's going to take time.
Do you think these kinds of mitigations lead to some indifference towards vulnerable code? "It's fine, it'll be caught by pointer authentication anyway", and then a decade later when someone finds a flaw in the pointer authentication we suddenly have a ton of known vulnerable programs out there. It just feels like a significant number of developers who aren't that mindful about security are going to use it as an excuse to not write secure code in the first place.
What is the cost of using such features? Also, the adoption of such a feature will take time, so we won't see it across the industry for a long while, but it's good that the feature exists.
So the real question is how good the hashing function is: can the key be recovered just from leaking 2 or 3 pointers? As it seems that there will be a compromise between this and execution speed, I fear the worst...
29 bits isn't exactly a "strong" MAC. It's better than nothing, I suppose. MACs are awesome though. Stick one on the end of a public-facing ID and you can reject requests with inaccurate/mistyped/brute-forced information without needing to check your database, catching the bad information 99.99999% of the time. This can be applied to license codes, API keys, or any sort of ID that you distribute to another party or system.
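A toy sketch of that ID+MAC trick. The keyed hash here is FNV-1a for illustration only and is NOT cryptographically secure; a real system would use something like HMAC-SHA256 truncated to the tag size:

```c
#include <stdint.h>
#include <stdbool.h>

static uint32_t toy_mac(uint64_t id, uint64_t key)   /* NOT secure, demo only */
{
    uint64_t h = 1469598103934665603ull ^ key;       /* FNV offset basis ^ key */
    for (int i = 0; i < 8; i++) {
        h ^= (id >> (8 * i)) & 0xFF;
        h *= 1099511628211ull;                       /* FNV-1a prime */
    }
    return (uint32_t)(h & 0x1FFFFFFF);               /* keep 29 bits, like PAC */
}

/* Tag in the top 29 bits; assumes the raw id fits in the low 35 bits. */
uint64_t issue_id(uint64_t id, uint64_t key)
{
    return ((uint64_t)toy_mac(id, key) << 35) | id;
}

bool check_id(uint64_t tagged, uint64_t key)         /* no database lookup */
{
    uint64_t id = tagged & ((1ull << 35) - 1);
    return (tagged >> 35) == toy_mac(id, key);
}
```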
Everyone in security: You should never use gets. People from other languages: So they're going to fix it, or deprecate it, right? Security people: :) Other people: .... RIGHT?!?!?
grab a yubikey at yubi.co/lowlevellearning-2024 and secure yourself online with two-factor authentication. Thanks again Yubico for sponsoring today's video!
Don't forget to tell people to NOT use honey on that nice little affiliate link lmao. Liked the vid!
Isn't the Yubikey 5 series vulnerable to side channel attacks? 😄
ninjalab.io/wp-content/uploads/2024/09/20240903_eucleak.pdf
Isn't the Yubikey 5 series vulnerable to side-channel attack? (Ninjalab's EUCLEAK).
Edit: I see the firmware version is 5.7, where the vulnerability has been patched.
@@StephenKingston To my knowledge that requires physical access. You likely already know, but for others reading my comment who are unaware: a system that is in use, and therefore at least partially decrypted but likely effectively unencrypted, is basically hacked if a malicious actor gains physical access. There is very little you can do. So basically, in a situation where you could physically grab the Yubikey and take it, you could also digitally copy it.
What would I need to do if I lost a Yubikey? What if I didn't know that I had lost it?
So we are going to essentially compute hashes on addresses every single time we want to use said pointer? I wonder what kind of impact on speed it could have.
Probably a very small impact; each memory access already involves things like virtual-to-physical address translation and caching, so they can probably squeeze hashing in there without increasing latency.
@@nowave7 If it's built into the CPU instruction set, theoretically it could be done with little or no performance impact.
I think you can already try it out on ARM devices (or Apple Macs) with pointer authentication
Shouldn’t it have none? I only understand the theory and haven’t implemented this yet, but isn’t this akin to putting a primitive data type onto a pointer?
Good point. Profile it and check, I'd love to know the results.
Security is always a tradeoff, so you have to think if it's worth it or not.
And there I was thinking, after reading the title, that using "signed" pointers meant allowing them to have negative values, and I could not think how that would be feasible.
This makes more sense.
me too ! signed for signature
there is no such thing as a signed pointer..... it is a binary number, a pointer is just a collection of bits..., you could designate any bit a stupid name...
like "sign bit", or "hacker defeat bit", but it means FA, because it is only down to the WAY YOU interpret those bits that gives them intrinsic functionality.
you could take that "pointer" or signed pointer, then put it into a box that says ASCII characters in 8-bit groups. does that now mean the pointer has magically transitioned into ASCII?
nope, it means you just redefined how you want it interpreted.
now let's say you wrote a shitty ASCII interpretation routine that only assumes 6 bits are passed and the top 2 are always zero bits.... and that "magic hacker proof pointer" potentially just became your worst nightmare
Lol, lmao even
Actually, since existing x86-64 implementations don't implement a full 64 address bits and the architectural specification requires that all unimplemented bits have the same value for a pointer to be valid, x86-64 pointers are effectively signed in the positive/negative sense. If you consider them unsigned, the set of valid pointers consists of two separated subsets at opposite ends of the address space. If you consider them to be signed, it's all one set in the middle of the address space (immediately on either side of zero).
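That property is easy to express in code; a sketch of the canonical-address check for 48 implemented bits (it relies on arithmetic right shift of signed values, which mainstream compilers provide):

```c
#include <stdbool.h>
#include <stdint.h>

/* A canonical x86-64 pointer must equal the sign-extension of its low 48
   bits, so valid addresses form one contiguous range around 0 when viewed
   as signed integers. */
static bool is_canonical_48(uint64_t va)
{
    return (uint64_t)((int64_t)(va << 16) >> 16) == va;
}
```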
lol, same.
6:30 - So ARM is limiting their virtual address space per program? 29 bits of a 64-bit address space being reserved would mean you can use up to just over 34 billion addresses (2^35). While that may seem like a lot for most programs, I do wonder if it is kinda a "640k is all you'll need" sort of situation.
My answer disappeared; I'm not sure if it's just pending, or because I added sources in the form of links 😂
Just in case, here is the gist (sourceless, sorry):
A Qualcomm paper on PAC and the Linux kernel docs mention that the PAC signature size depends on the virtual address space.
Qualcomm mentions a default value of 39 bits, leaving 24 bits for the signature (~16.8 million values).
Basically, the signature is stored in currently unused bits.
The clang docs also mention that nothing prevents a signed pointer from having the same size as an unsigned pointer.
That's just convenient for backward compatibility reasons, and because the Arm hardware support preserves the pointer size.
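A back-of-envelope sketch of those numbers; the -1 accounts for bit 55, which AArch64 uses to select the upper/lower address half (top-byte-ignore, if enabled, would eat further bits, so treat the figures as illustrative):

```c
#include <stdio.h>

int main(void)
{
    int vas[] = {39, 48, 52};                  /* common Linux VA_BITS configs */
    for (int i = 0; i < 3; i++) {
        int pac_bits = 64 - vas[i] - 1;        /* -1: bit 55 range selector */
        printf("VA %2d bits -> %2d PAC bits (%llu distinct signatures)\n",
               vas[i], pac_bits, 1ull << pac_bits);
    }
    return 0;
}
```

With 39-bit virtual addresses this gives 24 PAC bits, matching the ~16.8 million figure above; growing the address space shrinks the signature.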
You got sponsored by Yubico? 😳😳 Bro, they don't even put out ads. You basically bagged the most legit sponsor any YTber has ever had.
At this point we should disallow everyone who is not a level 3 magician access to a C compiler
Ken Thompson hack. There goes the rabbit hole.
Good old IBM Days
No, but maybe you want to limit the number of people that write software that gets distributed, especially OS stuff. You could put resources into checking the software. There is a lot of private and educational use of software.
Or maybe we should help people become a level 3 magician instead of scaring them away from foundations. In the long term this seems more sustainable
Ah yes, so then C just dies with the wizards, because novices are not allowed to learn it with practical knowledge.
Thanks, I will talk about this in my next Bumble date
I think you meant Grindr.
FurryMate
will wait for screenshot
If you find a girl, listening to this, go get her and never leave her. Good luck soldier
I once found a weakness in this where you can use tail call optimisation and exfiltration and longjmp (if you can line all those up) as a pointer signing widget. Tail call optimisation causes your choice of return address to be rewritten with a predictable corruption of a correct signature before being passed to the next function, but this doesn't cause a fault, yet. If you can then exfiltrate that address, you can fix it yourself to get your pointer with a correct signature. After that you need to avoid the exception, which you can do via longjmp(). Next time around, use the pointer you exfiltrated last time, with the correction, and return normally rather than via longjmp() to the destination of your choice.
I’m using this channel to practice understanding spoken English, but idk why nobody is talking about these features in my native language (Spanish). That’s cool man, you really know a lot of this stuff.
I appreciate your content buddy
The audio in this vid is ever so slightly out of sync, just FYI
I was so distracted by this, not because it bothers me but because I was trying to figure out if I'm just stupid or if it's out of sync, because I felt something wasn't quite right but was too stupid to actually figure out if it was..
@@robshaw2639 I just noticed, god damn
Thank God I haven't noticed it yet with my headache.
@@dennisestenson7820 are you sure? Because I watched it on a computer. I don't think it has anything to do with what you're saying. I think it's just like 50ms out of sync.
Looks correct on a Pixel 8 Pro. Weird.
00:25 wait, YOU have to ask for money to get your files back? Isn't that the ransomer's part? 😂
Freudian slip?
I worked on the z/OS kernel for many years; this problem was solved by verifying pointers, by making sure something in the operating system points to them. No storage is copied into directly.
Do you think it's still in use today or has it been replaced with hardware pointer verification?
@@YolandaPlayne It is still in use today. That code was written in the early 80s.
@@thecodemachine I've worked with compilers that use it, compiling code from the '80s. Lots of hidden undocumented features in the compiler made modifying the code error-prone, with no way to figure out why other than trial and error.
The video is more about memory safety within a single execution context, not really about kernel/user memory space isolation
The boundaries aren't really clear with modern processors. There are multiple contexts that could be considered kernel and user spaces, each with their own security features and simultaneous threads.
Seems like this would interfere with NaN packing that many dynamically typed languages use in order to store pointers and other data types inside the diagnostic information portion of NaN floating point numbers. (Basically turning 64-bit floats that are NaN into enums that store both type information and value).
Won't that only (mostly?) be for its own language-specific data types, and less so pointers expected to "interface" with the rest of the system? Like, I don't think a pointer to a JavaScript object is going to be passed to anything outside the JS VM. I'm hardly an expert here though, corrections are welcome.
@@mnxs I'm no expert myself, but wouldn't things like function pointers and the like still need to be passed around in NaN boxes, if, for example, you're passing a function implemented in C as a callback parameter to a JS function?
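For readers unfamiliar with the trick this thread is discussing, a minimal NaN-boxing sketch in C; it assumes pointers fit in 48 bits, which is exactly the assumption that signature bits in the high bits of a pointer would violate:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define QNAN    0x7FF8000000000000ull   /* quiet NaN with empty payload     */
#define TAG_PTR 0x0001000000000000ull   /* invented "this is a pointer" tag */

static double box_ptr(void *p)
{
    /* assumes the address fits in the low 48 bits of the NaN payload */
    uint64_t bits = QNAN | TAG_PTR | (uint64_t)(uintptr_t)p;
    double d;
    memcpy(&d, &bits, sizeof d);        /* type-pun safely via memcpy */
    return d;
}

static void *unbox_ptr(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (void *)(uintptr_t)(bits & 0x0000FFFFFFFFFFFFull);
}

int main(void)
{
    int x = 42;
    assert(*(int *)unbox_ptr(box_ptr(&x)) == 42);
    return 0;
}
```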
You can even put a hash in it that has an expiration time, so dangling pointers don't stick around for days and days and days.
how are you gonna define that time?
and where are you gonna get it...
ever had your motherboard battery run out?
@@stevesteve8098 It only has to be valid for the length of a kernel function call. It is not unreasonable to require kernel functions to re-get memory if a bad return code is returned.
So a stack alloc
Not as cool as my 42GB-consuming jvm tho.
@@thecodemachine I don't know if that's true, necessarily. I'm hardly an expert in these things, but I don't see why a long-running server application can't have shared resources or something that is allocated for, say, the lifetime of the program, that it'd have to make calls with (the pointer of).
Stack limit register might have been even better option. Imagine a hidden register that could be modified only by special, for example, "branch, link and limit" instruction family, to point to address of return address value in stack and any access beyond it that is not by 'ret' instruction will generate an exception. Should be cheap to implement at hardware level and transparent to use, but hell to implement in backwards-compatible way though, since having an extra value alongside return address would definitely be an ABI-breaking change.
@@zetash1be441 Why should backwards compatibility be a big issue? Just make it into an instruction set extension, like AVX.
Branching to a dynamic address can be fun and useful, e.g. for coroutines.
You could even have a CPU mode (or a preprocessor script) where the semantics of the old branch instructions are "updated" if your executable doesn't need this form of dynamic dispatch.
Oh, wait, your idea has another indirection? That's genius!
Correction at 5:34: the PML4, PDPT, PD, PT tables are NOT inside the CPU! The PDB inside the CR3 register is the only thing inside the CPU that bootstraps the Virtual2Physical address translation process (apart from the general purpose register containing the virtual memory address). All the tables are inside RAM, except for the TLB cache.
On x86 (in 32-bit mode it's in cr3, in 64-bit mode in cr4 iirc), not on ARM; there it's probably in another register.
@ItsCOMMANDer_ what? Both in 32 bit and 64 bit mode it's in the upper bits of cr3, and it's saved as a pfn (page frame number): since every translation table is 1 page size, the lower 12 bits are reserved for other flags. Anyway, the point is that translation tables are not in CPU, excluding the cache, and this should be pretty much the same for ARM architecture, but with some differences in the implementation of the page tables with respect to x86
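For reference, the 4-level walk being described slices the virtual address like this (x86-64, 4 KiB pages); the tables themselves live in RAM, as the correction says:

```c
#include <stdint.h>

typedef struct {
    unsigned pml4, pdpt, pd, pt, offset;
} va_parts;

static va_parts split_va(uint64_t va)
{
    return (va_parts){
        .pml4   = (va >> 39) & 0x1FF,  /* 9 bits: index into PML4            */
        .pdpt   = (va >> 30) & 0x1FF,  /* 9 bits: index into PDPT            */
        .pd     = (va >> 21) & 0x1FF,  /* 9 bits: index into page directory  */
        .pt     = (va >> 12) & 0x1FF,  /* 9 bits: index into page table      */
        .offset =  va        & 0xFFF,  /* 12 bits: offset within 4 KiB page  */
    };
}
```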
A simple solution is to disallow modifying pointers altogether. On the AS/400, for example, *only* the platform firmware can hand out pointers. You can't create pointers out of thin air, and you can't modify them. Once you perform arithmetic on a pointer, it's no longer a pointer but just a number. In addition, pointers are typed, and they are tagged with an owner and with access restrictions.
Another nice feature of the AS/400 is that only the OS can compile to native code. You cannot write native code, and you can't compile to native code. You can only compile to intermediate code, the OS then compiles to native code. And the intermediate code is "safe", i.e., it contains no instructions that could violate memory safety, pointer safety, or type safety.
This idea also exists in Microsoft Research's Singularity OS. All code is delivered as high-level byte code, together with a manifest that contains a list of all the privileged actions the code needs to perform. The compiler will first prove that the code only uses the privileges listed in the manifest, otherwise it will reject the code. Only then will it compile it to native code. And just like OS/400, the compiler is a privileged OS service that cannot be invoked by a user.
Singularity uses what they call SIPs (Software-Isolated Processes) for process isolation. In fact, all of Singularity runs in Ring 0 in a single address space. Why? Because the language itself already guarantees stronger isolation properties than the CPU and MMU can anyway, so why bother? Each SIP has its own object space with its own garbage collector. There are no instructions in the byte code which can access memory or manipulate pointers. All data exchange between SIPs is through message-passing. Messages are only ever owned by one SIP: before sending, they are exclusively owned by the sender, after sending, they are exclusively owned by the receiver - there is no shared memory, ever.
SIPs define their messaging protocols statically, this allows the compiler to generate code that actually operates on a shared memory region. So, you effectively get the safety and semantics of message-passing shared-nothing concurrency with fully isolated processes, but with the performance characteristics of shared-mutable-memory concurrency using lightweight threads.
Sadly, because of backwards-compatibility requirements with C, approaches like this never see the light of day.
So according to this bullshit... it's not possible to "hack" MS programs, which we all know is nonsense... Ring 0 has been hacked so many times it's now more open than a clown's pocket.
Plus your fantasy makes ABSOLUTELY NO correct assumptions about "glitching"
How do arrays work if you're not allowed to do pointer arithmetic?
This is fascinating - have you got a recommended article I can google to read up on this? Or have you written a blog article or something on it, perhaps?
The concept (although Microsoft) is very interesting- taking the lowest level of computing and just whacking the OS right on top of it sounds like (for most these days) an absolute nightmare.
If what you’ve stated is anywhere near accurate then it’s an interesting approach and I want to understand it and the caveats that come with it (some of which you’ve listed already)
@@TheBleggh as I understand it, indexing an array requires a system call. This call will check that the offset is valid. The way the architecture works, this isn't as slow as it sounds. You don't have address spaces or context switching any more, iirc.
It's possible to do this and be compatible with C. The CHERI project out of Cambridge and ARM Morello both accomplish this.
As for performance, your CPU is already doing this check when it does virtual address translation. If the objects are finely grained enough, it's basically free.
Random bits on an integer isn't really encryption and doesn't stop anything from looking for pointers with undefined behavior.
sign != encrypt
Pointer tricks like this are terrible. Back when I was porting Linux userland apps to aarch64, we noticed certain apps and libraries abuse the upper bits of 64-bit pointers. The problem was that 64-bit ARM was enabling 52-bit addressable memory... some supercomputer somewhere required that much space, but I digress... We were not able to compile things like Mozilla's wares because of these clever hacks. These are nice until one day the entire address space is needed for memory. Yes people, even on 64-bit machines, not the entire 64 bits are given over to memory, because that's crazy talk... but every 12 years or so, it seems, two more bits go to memory. These kinds of tricks will eventually be gone.
on big-endian systems like IBM Power with tagged memory, the tricks do not eat into allocatable memory like that - only one bit per many bytes to hash the page table. in little-endian the address is the 'wrong' way around, so there's more overhead, i think
@foobarf8766 that's not how I remember it working - not the big-endian IBM machines with tagged memory, but rather abusing pointers. Most 64-bit architectures simply don't use the full 64 bits; it's 48 bits, 52 bits, etc., so there are 12 or 16 bits left over for any number of tricks, such as pointer hashes, but it could be anything. Like I mentioned before, when we raised the boundary from 48 to 52 bits, it suddenly broke certain things, because a handful of people had realized they could do pointer tricks. Perhaps this is where IBM Power BE would help, with memory tags? Incidentally, I was also on the team that bootstrapped IBM POWER9 little-endian to Linux, but I don't remember the tag stuff - possibly because it simply wasn't broken by ideas like these pointer hashes... We didn't have to fix anything in the port.
This makes sense. E2K uses 128-bit descriptor "addresses" for secure mode, 64-bit addresses for normal mode, and 32-bit for x86 compatibility. And E2K has separate 4-bit tags per 64 bits of data/address.
Thanks for the video - It’s a very interesting concept.
My only hot take question is theoretical and perhaps naive since I’m not well versed in security -
Since we expect compiled executables to be portable between machines that are of identical platform and spec, wouldn’t the executable require the decryption keys to be bundled in? And if yes, how would those be protected?
I had to comment because this is a project I worked on for a while (not in industry, in school). I even have a Verilog circuit rigged up to do just this sort of verification in "real time" behind the scenes. Definitely a cool idea; I'd love to see it brought to fruition. As someone else mentioned, there is of course overhead associated with it, but there is a litany of mechanisms you can use to distribute that cost across idle cycles or preemptively, as opposed to in-lining it like I was doing.
Isn't this similar to what CHERI is doing? I believe CHERI is a bit more exhaustive and/but requires the software to be aware of it
Mint video
Can anyone help me to set this theme in vim
Does anyone else ever fantasize about what we could do with our computers if half of our development time and half of the computational energy didn't have to be dedicated to security?
I fantasize about the day when base64 becomes obsolete. Also hex, but that would require bytes to be 4bits wide
No. What I fantasize about is what we could do with our computers if "hello world" would compile to less than ten gigabytes (with 29345 files in tow, all irreplaceably ESSENTIAL!) these days. And yes, this is hyperbole, but it unfortunately holds true for ANYTHING more complex than "hello world".
Simple fix. Switch to TempleOS! No security, the user is king! But ya, we also have inbuilt protected memory addresses and stuff for operating systems.
@@AttilaAsztalos roll everything yourself. I built my own graphics framework to replace SDL2 in my FOSS projects. One of my builds dropped from 8 MB to 2 MB. I replaced GLM (which is practically an industry standard for graphics programming) with a custom linear algebra library and gained a 30x FPS boost in my UI. Granted, I don't need several thousand FPS for something that isn't a game engine, but it still amazes me how inefficient GLM was considering how widely used it is and how easy it was to reimplement.
@@vrclckd-zz3pv I would if I could. But I'm a chip monkey, not a code monkey, and condemned to remain so (yes I did try). On the other hand, the full-3D full-parametric CAD I use almost daily is a SINGLE FILE, and is, in total, smaller than ten MEGABYTES. It's called "Solvespace", in case you were wondering... Now, maybe, you understand why half-terabyte abominations make me mad.
This has been a 'solved' problem on IBM Power for decades: in big-endian mode there are memory tagging extensions that use a hashed page table. Some other great stuff in Linux, like wipe/fill on free and alloc, works great on Power too. In little-endian the overhead becomes heavier, because the check bits go at the 'wrong' end of your address, so you need 'software' solutions like XOR-ing or putting the check bits where they eat into allocatable memory. It's good that ARM is bringing this feature back into general-purpose processors; it's been a thing on IBM Power for like 30 years.
I'm just learning, but would it be possible to limit the length of what the user is able to input, to prevent an overflow before it can happen?
Yeah! That's one of the techniques that can be used to avoid buffer overflows: make sure you only copy as much into the buffer as it has space available.
Yes. However, he's just using the obviously-unsafe gets() as an example. Real-life code is much more complicated, and mistakes happen; it's not always that easy. For example, many low-level protocols are of the form "2 bytes encoding the length, followed by the data". Now if the attacker lies about the length of the data... There are tons of other ways to trick software into reading more (or less) than it should.
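To make that concrete, here is a minimal sketch of the bounded-copy idea (illustrative only, not from the video; fgets is the standard bounded replacement for gets):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char buf[64];
        /* gets(buf) would copy unbounded input over the stack; fgets caps
           the copy at the buffer size, so overlong input is truncated. */
        if (fgets(buf, sizeof buf, stdin) != NULL) {
            buf[strcspn(buf, "\n")] = '\0';  /* drop the trailing newline */
            printf("read: %s\n", buf);
        }
        return 0;
    }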
You still have the problem that the CPU assumes you are returning to the same place you came from. A buffer overrun hack abuses the CALL and RET machine code operations. When the CPU executes a function return, it assumes the top 8 bytes of the stack are the correct address to return to and will go there; it doesn't check whether the value of those bytes has changed since the last CALL instruction wrote them there.
@@williamdrum9899 very well explained
I have been under the impression that we don't access physical addresses anymore. That raises the question: do raw pointers perform a jump instruction to the address they point to, or do they move the instruction pointer to said address?
Hey, stupid question that you probably won't see, but somewhat relevant to the video: how much of a security risk is it that CUDA allows creating a memory buffer visible to both the GPU and CPU? I'm asking because (from my understanding) it allocates memory and directly hands out the physical address of the memory pointer, so that both the GPU and CPU know exactly where the data is.
The GPU driver is then supposed to keep this memory range allocated (cudaMallocHost) until you deallocate it (cudaFreeHost); however, I remember a while back a bug where the driver never deallocated the memory even after we deallocated it in the program.
The result was that my benchmark program for GPU testing was able to crash a 128 GB computer with 2 Xeons and 2 top-of-the-line GPUs (at the time) in SLI, split across 2 NUMA banks. It also worked on a standard laptop. The test specifically benchmarked memory transfers between CPU & GPU, and running it enough times it was able to reserve most of the memory for itself even after we stopped the process.
I followed the RAII principle, using classes for all of the GPU-related memory acquisition and the standard library for everything else. I even managed to write a template that allocates the correct type of memory for a CPU lambda that you want to send to the GPU, but the issue was still there afterwards.
You can use IOMMU to ensure that the CPU and the GPU see the same addresses. Support is improving!
afaik Microsoft's kernel allocator doesn't actually zero on free, but that's optional on Linux too - you need to turn it on
Cryptographic authentication of pointers... that's, um, interesting. Refusing to make fundamental changes to the architectures we still use leads to all kinds of delirious approaches.
Completely separating return stacks from data stacks would solve 90% of all security issues due to stack corruption. You could still corrupt data, but not make the code go where it shouldn't. And if you add to that proper validation of your inputs everywhere it's at least moderately critical, you solve probably 99% of issues. There will always be this annoying 1% left, but it may be delusional to think we can get to 100% anyway. And these changes would imply a bit more work for sure, both in CPU design and in software, so we prefer the "enhanced status quo" (= do everything as usual, but with new tools).
@@joseoncrack to be fair, proper validation of input may very well solve 99% of the issues on its own.
The issue with a return-specific stack alone is that it wouldn't prevent JOP, or calls through pointers, so it would just change how stuff gets exploited. That might solve 90% of current issues, but not 90% of all issues.
5:37 bro thought he could slide that in there without us noticing
...Wouldn't shift-arithmetic on the pointer (sig = iptr >> 35) reveal it? And if the signing algo is public knowledge (it _will_ be whether it's intended or not), you could do shenanigans like (iptr = getSig(badptr)
@@WackoMcGoose you would need the secret key to produce a valid tag on attacker controlled pointer data.
It's stored in the actual value of the pointer; you only ever have access to the virtual memory addresses of your program. You ask to sign 00000000, the value in the return register; the MMU maps that to 00001024, where the first four bits are reserved for things other than addressing. It sets the signature in the bits it doesn't need, but in your program you still read 00000000, thanks to the MMU.
If you don't use the mmu then you probably could, supposing the instruction is even supported in that mode, but it wouldn't make much sense to use real mode and care about this.
@@WackoMcGoose signed != encryption
You want to verify the pointer didn't change, not obfuscate the pointer.
If you strip the pointer from its signature, while you get access to its original value (which was never hidden anyway), you still can't return to it, since the signature check will not match
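For anyone who wants to see the shape of the scheme, here is a toy model in C (purely illustrative: real ARM PAC uses the QARMA block cipher with per-process keys held by the hardware and salts the MAC with context such as the stack pointer; every name and constant below is made up):

    #include <stdint.h>
    #include <stdio.h>

    #define VA_BITS  48
    #define SIG_MASK (~((1ULL << VA_BITS) - 1))   /* the unused upper bits */

    static const uint64_t key = 0x5eedf00dcafe1234ULL;  /* hypothetical secret */

    static uint64_t toy_mac(uint64_t addr, uint64_t salt) {
        return ((addr ^ salt) * key) & SIG_MASK;  /* stand-in for a real MAC */
    }

    static uint64_t sign_ptr(uint64_t addr, uint64_t salt) {
        return addr | toy_mac(addr, salt);        /* tag lives in the top bits */
    }

    static int auth_ptr(uint64_t p, uint64_t salt) {
        uint64_t addr = p & ~SIG_MASK;
        return (p & SIG_MASK) == toy_mac(addr, salt);
    }

    int main(void) {
        uint64_t sp  = 0x7ffdeadb0000ULL;            /* pretend salt (SP) */
        uint64_t ret = sign_ptr(0x400a3cULL, sp);
        printf("%d\n", auth_ptr(ret, sp));           /* 1: valid */
        printf("%d\n", auth_ptr(ret ^ 0x1000, sp));  /* 0: tampered address */
        return 0;
    }

Stripping the tag (p & ~SIG_MASK) recovers the address, as the comment above says, but forging a tag for a different address requires the key.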
@lowleveltv at what point does the pointer get signed? Is there an opportunity to change the pointer at a stage before it gets signed?
Once that buffer overflow has been patched, another one will crop up somewhere. That's unfortunately life.
lol, authentication overflow - or maybe someone will write code that is Turing-complete using only authenticated pointers, like the movfuscator
In C programming, the return 0; statement at the end of the main function serves a specific purpose:
1. Indicates successful termination: it signals to the operating system that the program has executed successfully and terminated without encountering any errors.
2. Exit code: the value returned by the main function (in this case, 0) is known as the "exit code" or "return code". The operating system uses this code to understand the status of the program's execution.
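A minimal illustration of that (nothing video-specific here):

    #include <stdio.h>

    int main(void) {
        puts("done");
        return 0;  /* reported to the OS; `echo $?` in a POSIX shell prints 0 */
    }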
JWT is going to be brought to the whole next level lol
5:38 Cool Babe Face
😂 I see what you did there
Is pointer signing enough? It's still possible to do an arbitrary function call:
1. Find an exploitable function.
2. Somehow read signed pointer to that function.
3. Overwrite the stack with a valid frame mimicking call to that function.
4. Overwrite LR to our signed pointer.
5. Return.
One thing people often don't know is the absolute minimum requirement to jump to an arbitrary address in x86 (I'm aware of other methods, but this one is very simple): all you need is 6 bytes, equivalent to "push [32-bit address]; ret;" in assembly. This jumps to an arbitrary address, as "ret" is surprisingly more complicated than one may expect.
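Sketched as bytes in C (the target 0x12345678 is just a placeholder): push with a 32-bit immediate is opcode 0x68 and ret is 0xC3, which is where the 6 bytes come from.

    /* "push imm32; ret": ret pops the pushed immediate into the
       instruction pointer, transferring control there. */
    unsigned char stub[6] = {
        0x68, 0x78, 0x56, 0x34, 0x12,  /* push 0x12345678 (little-endian) */
        0xC3                           /* ret */
    };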
Maybe I'm missing something, but what prevents an attacker from exploiting something else to get their pointer signed? The clang API has to result in actual machine code; why can't they execute the same code?
I suppose that if currently, the most common path to exploitation is rop gadgets, then you can't start the execution of those via retaa. But I'll be surprised if no other path exists.
@bryankadzban1159 If an attacker gets the ability to execute code, this becomes an irrelevant measure.
However, this is supposed to prevent an attacker from gaining execution in the first place.
If the way an attacker wants to get execution of their code is to change a pointer to anything else, this prevents it.
The attacker hasn't yet gained execution to ask the system to sign the pointer; they just have the ability to write an essentially infinite amount of data that they want to use to manipulate the way execution happens.
They can't pre-sign the pointer due to the key necessary for signing being part of the system. So in the end, if implemented correctly, an attacker never gains access.
3:03 If the hacker changed the value of the return address, they would do it inside the "gets" function, meaning the "return" inside "gets" would execute the malicious code, not the one inside main.
That's also why I think your code at 7:12 is not secure, since you check the MAC only after "gets" returns, which would be after the malicious code executes.
@@bartolhrg7609 gets writes to the address given to it as a parameter. Why would its return address be overwritten?
5:40 - it's pretty common for the virtual & physical addresses to share the bottom N bits (i.e. they aren't remapped), where N relates to the page size, as in 2^N. Your example is different - maybe you could have used 0x9e90 vs 0xface in the physical example? Actually, for a 4K page size that's just the 0xe90 part (12 bits). #JustThinkingOutLoud
Reminds me of the IBM 7040 which could set a word without updating the parity bit. It was used to detect uninitialized variables.
Holy shit... is signing in Ring 0? That sounds mind-blowing.
I wonder how this handles code where the programmer, e.g., has an array of some struct referenced by a char pointer and adds 3*sizeof(the_struct) to some position in the array to get the bytes of the 4th struct instance after the current pointer.
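The pattern in question, as a minimal sketch (struct name and fields are hypothetical):

    #include <stddef.h>

    struct the_struct { int id; double val; };

    /* Byte-wise arithmetic over a struct array: stepping a char pointer by
       3*sizeof(struct the_struct) reaches the element 3 slots ahead. */
    void walk(struct the_struct *arr) {
        char *p = (char *)arr;
        struct the_struct *ahead =
            (struct the_struct *)(p + 3 * sizeof(struct the_struct));
        (void)ahead;  /* equivalent to &arr[3] */
    }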
The pointer values are typically constrained to effective 48 bits by forcing the most significant bits to be all zeroes or all ones.
Not pointers, virtual addresses
I'm curious about how that signing works. Maybe it works against overwriting pointers using byte-by-byte or number-over-pointer writes. But I doubt it can save us in situations where the code already writes pointers based on input.
It sounds more like randomization of the memory layout, but with more predictable behavior.
If I remember correctly, the IBM AS/400 and its descendants have done this for decades. Their pointers are extremely long.
Disclaimer: my memory comes from the dim and distant past. Please correct me if I am mistaken.
When it comes to these types of vulnerabilities, C/C++ is always mentioned, but I wanted to know: can programs written in Rust, Python, or JavaScript (Node.js/Deno/Bun) be hacked by changing pointers? Or do those languages already use some sort of protection and don't need anything more than what they provide out of the box?
@@XCanG well they all can, kinda.
Low-level languages are more vulnerable because pointers are exposed.
Rust does a bit better because pointer manipulation is restricted to unsafe blocks, and this is enforced by the compiler (that's not perfect, however, so there are ways around it).
Higher-level languages usually provide a runtime which takes care of pointer manipulation and prevents the user from exposing such issues (JS/Python/Java/C# run in a kind of virtual machine). However, said runtime can itself have buffer overflow exploits, since runtimes are usually written in lower-level languages.
@@Darkyx94 Runtime language doesn't even need to do ROP/messing with pointer. They use JIT, so an attackers could just write the instructions they want to execute into RWX memory.
@@kienH yeah, but OP specifically asked about changing pointers
1:02 that is not a pointer to data, that is a function pointer!
It's all just bytes
@@Masq_RRade With different access permissions...
@@Masq_RRade Well sure, even an address is data, it's all data, it just depends how you interpret it. :)
@@roeetoledano6242 The page can be read/execute, it can be read/write, doesn't matter. It's all data. It's all just bytes
@@Masq_RRade @@nowave7 Off-topic: there is no such thing as 'it's just bytes, you can interpret it however you want' in C, because of the strict aliasing rule - at least not without using functions that implicitly create objects, like memcpy.
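A minimal example of that point, assuming nothing beyond standard C:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = 1.0f;
        uint32_t bits;
        /* *(uint32_t *)&f would violate strict aliasing (undefined behavior);
           memcpy implicitly creates the object and is well-defined. */
        memcpy(&bits, &f, sizeof bits);
        printf("0x%08x\n", (unsigned)bits);  /* prints 0x3f800000 */
        return 0;
    }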
So if I understand correctly, it prevents buffer overflows, but not out-of-bounds reads or anything else. Return address spoofing might also be a little bit tricky after performing hooks.
I wonder if 128-bit machines are going to come sooner than we think. If mitigations like signed pointers become a thing, then the more bits in addresses etc., the better. The more bits for encryption, the better, right? Or I guess, the better the algorithm - since FS EC needs fewer bits than RSA, right? Like 256 bits and upwards?
I heard that 128-bit is very inefficient, which is why they don't exist today
Unlike normal encryption, pointer encryption is not the first line of defense. A single wrong guess crashes the program, so any attack that doesn't always work will be detected very quickly and the vulnerability in the code can be fixed. This means that the encryption doesn't have to be very strong.
The downside of using more bits is of course that at the very least one needs more silicon area on the CPU chip. In addition larger hashes take longer so there can be a performance penalty. As long as the hash is quick enough that doesn't matter because memory access is slow compared to other CPU tasks, but if the time rises above a certain amount of clock cycles there will be an impact. (Kind of like putting an extra object into a box that still has space is no issue, but if the extra object becomes too big you need to get a bigger box)
didn't the Xbox 360 have something similar, where the cache was used to hold a signature of the memory to detect tampering?
I want my pointers to have a tracking id and signature verification... heck throw in the insurance too!
“Babyface” got me
In that sense, doesn't it mean the "malicious pointer" still exists somewhere?
So that malicious function can still have a signed pointer?
How does signing the pointer help in that case?
Things are signed with a discriminator. The mal-pointer elsewhere would likely be signed with a different discriminator than the pointer at the location we were trying to replace.
Wouldn't it be simpler to just create an instruction for assigning pointer ranges (and an instruction for removing those ranges)? The CPU could store the addresses in a dedicated cache, or dedicate part of the existing cache to them; it already has something for page ranges, so why not something for pointers within those pages. Personally, I would like dynamic memory from malloc/new/etc. to only be accessible via dedicated functions like memcpy or something. For example, if you create a list with malloc, the memory given should not be directly addressable; instead, to get any element of the list you'd have to use memcpy to copy it into a stack variable.
I'm currently designing an arena allocator with just such a property: it doesn't hand out addresses but offsets from the base address. The malloc replacement I made for testing purposes just adds the base address back, because I can tell the arena on creation AND on growth that I don't want it to move. At the moment it uses actual memory, since I need to iron out bugs, but later, once I've got it working, I'll convert it to a variant that uses something similar to /proc/self/mem. If I could make access to the file/pages require a key, then I could protect them both from externally controlled reads/writes. The problem would be how debuggers could work on it then 🤔
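For readers curious what that looks like, a minimal sketch of such an offset-based arena (all names hypothetical; no growth or key protection here):

    #include <stddef.h>

    typedef size_t arena_handle;  /* an offset from the base, not an address */

    typedef struct {
        char  *base;
        size_t used, cap;
    } arena;

    /* returns a handle, or (arena_handle)-1 when full */
    arena_handle arena_alloc(arena *a, size_t n) {
        if (n > a->cap - a->used) return (arena_handle)-1;
        arena_handle h = a->used;
        a->used += n;
        return h;
    }

    /* the single point where an address is materialized; bounds-checked */
    void *arena_resolve(arena *a, arena_handle h) {
        return (h < a->used) ? a->base + h : NULL;
    }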
It's called CHERI, and it's still being developed. On 64-bit it extends pointers to 128 bits and encodes some properties (length and allowed access) into the pointer itself. Unfortunately, there's a ton of poorly written shovelware that assumes "void *" is the same as "unsigned long", and lots of code that needs to be recompiled.
Ahh, I wish the title of the video was properly explored.
Naturally, it does a great job explaining what pointer signing is about, but I wish there was a discussion section in the video focusing on adoption and whether it ever makes sense to sign *all* pointers.
Cool topic though!
what is your colorscheme in vim? I thought it was kanagawa but yours is more readable than mine. I mean eye friendly. Cool video. take care!!!
sounds like a lot of overhead
I was thinking it would be baked into the compiler, and it is, just not for all the use cases. It really looks clunky for, let's call them user-introduced pointers, for lack of a better term.
I was referring to the execution speed, but yeah, that too. Although I'm sure they'll improve the programmer experience over time.
compute power scales much faster than memory speed, so it probably doesn't hurt much + accelerators on the chip do the hard work
for my favorite government backdoor
it's just one more assembly instruction; it might be a few cycles slower than the return that doesn't check, but we are probably wasting way more cycles per call doing other kinds of checks to mitigate this problem. Resource-hungry malware like Denuvo is being installed into hardware to stop these kinds of exploits; if we can do away with any of those, we end up reclaiming wasted overhead.
would a shadow stack achieve something similar?
Yes, but there's a performance penalty since it's an optional feature. ARM TrustZone is a parallel system that is mostly invisible to the executing environment.
in the video it looked like it required additional manual code? it also sounded like it does more work on the chip to sign, verify, and authorize all the pointers? a shadow stack is just a compare of pointers in a separate hardware pointer stack and doesn't require additional code, iirc?
@@nightshade427 Yes, it's "just" that, but it's not a separate system. It's built into the chip, and it's dark silicon unless it's used.
That means sometimes it's using features that are LOTO until a normal thread can use it so that's the performance penalty which also precludes side-channel attacks.
The really secure way is to hold return pointers in registers instead of on the stack. I am too lazy to search for the source, but I am pretty sure Intel and ARM are both working on this.
certainly makes sense, but it also only helps for the immediate return of the current function. All layers above would still need to store their return pointers on the stack.
Another mitigation is implementing a shadow stack that only holds saved return addresses. Calling a routine pushes the return address to both the actual stack and the shadow stack; returning from a routine pops it from the stack and checks it against the shadow stack.
This is already implemented by Intel processors
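For illustration, a minimal software model of that shadow-stack check (helper names are hypothetical; the hardware versions, e.g. Intel's, do this transparently on every call/ret):

    #include <assert.h>
    #include <stdint.h>

    #define SHADOW_DEPTH 1024
    static uintptr_t shadow[SHADOW_DEPTH];
    static int top;

    void on_call(uintptr_t ret_addr) {  /* mirrors the push on CALL */
        assert(top < SHADOW_DEPTH);
        shadow[top++] = ret_addr;
    }

    void on_ret(uintptr_t ret_addr_from_stack) {  /* mirrors the check on RET */
        assert(top > 0);
        /* a mismatch means the on-stack return address was corrupted */
        assert(shadow[--top] == ret_addr_from_stack);
    }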
yeyeh, YubiKey is awesome, but the services that don't allow you to register a second device as a backup just suck.
ARM already has a similar spec called "memory tagging extension" - what is the advantage of this approach vs MTE?
Firstly, as someone else has said, I wonder what kind of impact this has on code execution. Because this is entirely done in hardware, from what I've read, I think the performance might not be that bad.
However, I also KNOW that this kind of pointer authentication has already been successfully bypassed.
Moreover, if you can modify the return address, there's a high likelihood that someone can modify retaa so that it returns without authentication.
About your last point: retaa is an instruction located in the part of memory where the program code is stored, which is usually write-protected, while the actual return address is simply a value on the stack. If an attacker manages to remove write protection from the program, they already have arbitrary code execution and don't need to mess with returns anymore.
@entcraft44 That's a good point about execute versus write. Although, I do this all the time on the x86 architecture with WriteProcessMemory().
I feel like I am missing something. Would it not be possible to "fix" the pointer by overwriting it with the original address before "retaa"?
The whole point is the hacker wants the program to return to an address they specify. If you put the original pointer back then that won't happen. You also can't just put the "signature" (top bits) on top of your own pointer as that will fail the check.
This would require a separate data structure/memory location to store return addresses in. Forth does something similar with its distinction between the data and return stacks.
Say the programmer was holding a function pointer in a stack variable and calling it later. Having a separate stack isn't going to stop the attacker from getting an arbitrary-execute primitive if there's a buffer overflow on the stack. I don't think the extra security potential outweighs the performance impact it might have.
@@imciviled So I'm not an expert on this at all, but I don't think you've described the mechanism correctly. My understanding is that there is no separate pointer storage. The pointers are in the usual places, simply with additional decoration (hashed high bits). This decoration is done by the CPU when requested (the kernel manages the keys), and when the function return pointer is popped with the secure version of that instruction, the decorated pointer is verified and an exception is thrown if it doesn't pass. You are right that this will impose some performance cost, simply due to the extra work needed to compute and verify decorated pointers. There will also likely be cache locality impacts.
retaa checks the upper bits for the signature. If you overwrite it with the original address, the part it checks would essentially be "0000", which is an invalid signature and causes an exception.
I don't agree. I feel we need a new executable format entirely with a new kernel memory layout that is more secure.
PE/COFF and ELF both came out around 1992 or 1993.
about 64-bit: aren't our CPUs' address spaces 48-bit or 57-bit for practical reasons?
Someone explain what I am not understanding. Can this not just be implemented across all architectures when the code for physical and virtual memory is written? Why is this specific to ARM?
While the operating system kernel plays an important role in memory mapping, most of the heavy lifting is done by the memory management unit in the CPU. This is essential for good performance. That means that some features need support in the MMU hardware to be effectively implemented.
What's the font you're using tho?
stdio stands for standard input/output
modern CPUs and their NPUs allow direct memory access, so they seem like a perfect attack vector: privileged hardware (inside the CPU) that can run code easily... like from a website doing some NN inference locally via WebNN or something.
How can you disable NX with a simple write? You would need to write to the page tables in kernel memory, which won't work unless you've already exploited some other bug…
Remapping the stack using mprotect
@@rodneynsubuga6275 please explain how mprotect on the user stack gives access to the kernel page tables
Isn't this the same exact mechanism targeted by the PACMAN exploit for M1 macs? Did ARM fix the vulnerability?
"Hackers are everywhere"
Yeah, they might even be wearing my skeleton
What would stop the hacker just overflowing the buffer to replace `retaa` with `ret`?
The buffer is stored on the stack, the same part of memory where the return address is located. However, the instructions are stored in a different (usually write-protected) part of memory, so they can't be changed by this kind of attack.
16 bits is very small for a MAC, what's to stop the attacker from just brute forcing it?
A failed authentication causes a program crash. If an attack crashes your program 9 times out of 10, the likelihood that the vulnerability is detected and fixed before any real damage is done is extremely high.
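Back-of-the-envelope numbers for that, assuming a uniformly random b-bit tag and a crash on every wrong guess:

    #include <stdio.h>

    int main(void) {
        /* an attacker burns roughly 2^(b-1) crashes on average
           before a guess lands */
        for (int b = 8; b <= 24; b += 8)
            printf("%2d-bit tag: ~%llu expected crashes\n",
                   b, (unsigned long long)1 << (b - 1));
        return 0;
    }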
Interestingly, I first heard about this at university, in a security course. It didn't really go much deeper than this, tbf.
MTE (memory tagging extension) and mseal (which makes memory mappings essentially unmodifiable) are better approaches to exploit protection, in my opinion
Which theme do you use in your vim editor
I don't understand how this works. If the return address is pushed onto the stack before control switches to the function, how do they overwrite it by overflowing a buffer that is located after the stack address storing the return address? Are stack buffers written in the reverse order they are allocated in? Or is the return address written to the stack after the variables and buffer are allocated? I don't see any reason why that would be the way it works. Would be glad if someone shed light on this.
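A rough picture of the layout being asked about (it varies by ABI and compiler, so this is a sketch, not gospel):

    #include <stdio.h>

    /*   higher addresses
         | saved return address |  <- an overflowing write lands here LAST
         | saved frame pointer  |
         | char buf[16]         |  <- input is written starting here, moving UP
         lower addresses

       The stack grows DOWNWARD as frames are pushed, but writes *within* a
       buffer advance upward, toward the caller's saved state. */
    void demo(void) {
        char buf[16];
        fgets(buf, sizeof buf, stdin);  /* bounded, so it stops before the frame */
    }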
btw, what neovim setup are you using? the colorscheme looks great
When they inject instructions, couldn't they just remove the authentication instructions, since they're changing code paths anyway?
@@rb1471 this mitigates against exploits where no instruction injection is done.
The exploits being mitigated here are ROP (return-oriented programming) and JOP (jump-oriented programming).
The idea is to corrupt data so the existing instructions behave differently (ROP corrupts a return address, JOP corrupts a jump target).
Memory space is usually protected so that you either have write or execute access to it, but never both.
This means that return/jump addresses must be stored in RW memory (the return address is on the stack).
The feature shown here adds a way to confirm the address did not change before jumping/returning to it.
Could you explain how triggering exceptions, like into the hypervisor, changes execution states? Memory accesses that aren't in the currently running program's address space should cause an exception, which is a hardware-enforced security measure at the operating-system level. The Xen Project comes to mind, but I'm not sure if they actually use that mechanism. Most hardware has VT capabilities that have gone unused, because operating systems programmers felt they were unnecessary. I'm not so sure of that, but if what I imagine is true, then it would make a whole lot of old tech already a viable option for a more secure operating system, without glow worms up in yo sheeeeeeeeeeit.
Stay frosty, pals.
Does the program crash if someone tampers with it? Then it would be a quick way to exploit and crash other programs, like anti-malware…
If not, 29 bits is not enough, as you could just keep guessing, and within a few minutes you would likely get a working overflow
The program crashes. But an attacker can only crash the program that has the vulnerability, not other programs. They could crash this program anyway if they wanted to.
I don't get it. I thought all of this was prevented by using virtual memory? ARM is used for lots of embedded systems, so maybe there is some benefit there where you don't already have segmentation faults.
when i was younger I thought for a while that signed integers have a signature (:
Memprotect is broken and will be fixed… With all pages handling user input permanently marked as non-executable, in combination with a memory-safe language, there is much security to be gained. Hashing pointers needs hardware AND compiler support; it's going to take time.
The Intel iAPX432 from 1981 and Capability based memory protection from the 1970s has entered the chat
Do you think these kinds of mitigations lead to some indifference towards vulnerable code? "It's fine, it'll be caught by pointer authentication anyway", and then a decade later when someone finds a flaw in the pointer authentication we suddenly have a ton of known vulnerable programs out there. It just feels like a significant number of developers who aren't that mindful about security are going to use it as an excuse to not write secure code in the first place.
Am I a traitor if I turned my back on Rust in favor of C++? I have heard it is going to get compiler features similar to a borrow checker.
what is the cost of using such features? also, the adoption of such a feature will take time, so we won't see it across the industry for a long while, but it's good that the feature exists.
So the real question is how good the hashing function is - whether you can recover the key just from leaking 2 or 3 pointers. As it seems there will be a compromise between this and execution speed, I fear the worst...
Signed pointers seem like a great idea, if only you didn't have to do it yourself in the source code. A compiler extension would be better.
Lol, we already have shadow stacks? This would be way slower
29 bits isn't exactly a "strong" MAC. It's better than nothing, I suppose.
MACs are awesome though. Stick 'em on the end of a public-facing ID and you can reject requests with inaccurate/mistyped/brute-forced information without needing to check your database whether the information was correct, 99.99999% of the time. Can be applied to license codes, API keys, or any sort of ID that you distribute to another party or system.
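A toy sketch of that trick (illustrative only: a real system would use HMAC with a proper hash, not this keyed FNV-1a stand-in, and all names here are made up):

    #include <stdint.h>
    #include <stdio.h>

    /* append a truncated keyed hash to an ID so malformed or guessed IDs
       can be rejected without a database lookup */
    static uint32_t toy_tag(const char *id, uint64_t key) {
        uint64_t h = 14695981039346656037ULL ^ key;  /* FNV-1a basis, keyed */
        for (; *id; id++) { h ^= (unsigned char)*id; h *= 1099511628211ULL; }
        return (uint32_t)(h >> 48);                  /* 16-bit tag */
    }

    int main(void) {
        uint64_t key = 0x1234abcd5678ef90ULL;        /* server-side secret */
        char issued[64];
        snprintf(issued, sizeof issued, "user-4711-%04x",
                 (unsigned)toy_tag("user-4711", key));
        printf("issued ID: %s\n", issued);
        return 0;
    }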
##WHICH theme he is using in vim
*_0xc00lbabeface_* you got me xD
Everyone in security: You should never use gets.
People from other languages: So they're going to fix it, or deprecate it, right?
Security people: :)
Other people: .... RIGHT?!?!?
Does rust do this?
Who control the key?
I need all my gigacycles.
Very cool information.