This wasn't bad, but it missed or simplified a lot about the actual exe content. Exe files (or PE files) are organised in sections. There are several different sections, and usually only one contains your code. Other sections may contain resources, text or, much more importantly, the import / export tables. While an EXE file usually does not have an export section, it usually has an import section. Its content is essentially a special "contract" between the OS and your application. When the OS starts your program, it creates a new process and loads your file into that process's memory. The OS then scans through the import table, looks up the shared libraries and imported function names, dynamically loads those DLLs into your process and resolves the requested functions. That way your program actually has access to certain functions that are either part of the OS or of some other utility libraries. The export table usually only exists when compiling a DLL file, which internally is also a PE file. Of course the export section serves the opposite purpose of the import section: the OS can look up a function or other symbol that the library exports when loading the DLL for an application. You can actually trim a lot of the unnecessary code out of a PE file. In the past I used a very small assembler called Flat Assembler (FASM). There you could even create your own MZ stub without all that message stuff that nobody needs anymore. In the demoscene it was even common to have the MZ and PE headers overlap. The MZ header contains a special value that determines the position of the PE header. By cleverly offsetting the PE header you could (re)use otherwise unused or irrelevant bytes. I created key hook DLLs that were only 2 KB in size. Unfortunately Windows expects sections to be placed at a certain alignment, so you cannot shrink it too much; there has to be some empty space between sections. Though the demoscene usually makes use of almost every byte. 
Since Windows does not care about the content of that alignment padding, you can fill it with your own data. The fun thing about FASM is that its source code is written in its own assembly dialect, so it can assemble its own source code to produce itself. Of course it's open source. Just google for flat assembler or FASM.
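For anyone who wants to poke at the "special value" in the MZ header: it is the 32-bit field at offset 0x3C (usually called e_lfanew). Here is a rough Python sketch of how a tool could follow it to the PE signature. The helper names `find_pe_header` and `make_stub` are made up for this example, and the "images" are fake header-only byte strings, not runnable programs. The second stub only mimics the demoscene-style overlap by pointing e_lfanew into the MZ header area.

```python
import struct

def find_pe_header(image: bytes) -> int:
    """Return the file offset of the PE signature, or raise ValueError."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: 32-bit little-endian offset stored at 0x3C in the MZ header
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return pe_offset

def make_stub(pe_offset: int) -> bytes:
    """Build a fake, non-runnable header that is just big enough to parse."""
    image = bytearray(max(0x40, pe_offset + 4))
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, pe_offset)
    image[pe_offset:pe_offset + 4] = b"PE\0\0"
    return bytes(image)

normal = make_stub(0x80)      # PE header placed after a conventional DOS stub
overlapped = make_stub(0x04)  # PE header tucked inside the MZ header area

print(hex(find_pe_header(normal)), hex(find_pe_header(overlapped)))
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them to `find_pe_header`.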
@@diobrando7642 Windows doesn't officially support direct syscalls from user programs (the syscall numbers are undocumented and can change between builds), but they are technically possible iirc. But you cannot replace whole libraries with some syscalls.
I did actually handcraft an EXE file. I did that as a part of writing a simple compiler for a stack-based language. The hardest part was making the correct header. That took 6 hours - mostly because Windows never tells you what went wrong, it just refuses to load the file. But once I got that figured out, it was pretty smooth sailing.
@@gabrielschilive7675 I tried to answer you, but my comment got deleted for links. I used "PE Format" documentation from Microsoft Learn. I also used "Tiny PE" research by Alex Sotirov. I didn't use most tricks from his research, but I used it as a guide, which fields are important, and which I can ignore.
@@gabrielschilive7675 You could also look for a copy of Inno Pascal. It's open source and a very simple Pascal compiler that directly generates PE executable files. It doesn't have every feature of Pascal, but it's super easy to understand the code.
The definition I learnt from school: 1st generation is machine code, 2nd generation is assembly language, and both are low level. Newer generations including C are high level
@@YaySyu If you're educated on the basics of how programming languages and computers work, C is actually easier to learn than Python, as it straight up has fewer features, and thus less to learn. What makes it so difficult for many people to learn is that the functionality it does have is a lot more powerful and closer to the hardware than Python's, meaning people have to learn about things like memory management.
i don't think it's true that windows is still running on dos nowadays though. that's my only critique. it's running on the NT kernel now and has for a long time. I think that message about not running in dos mode was made for the time before home versions of windows used the NT kernel, so pre windows XP.
No, what he meant was that if you run the PE exe under DOS, it runs a DOS part of the exe to display an error message; if the same exe is run under Windows NT, it jumps to the Windows part of the exe. He didn't mean that Windows NT runs on DOS, which it doesn't. On a side note, NT's DOS support is via a virtual machine called NTVDM, and 16 bit Windows apps run via NTVDM and an extension called WOW. 64 bit Windows runs 32 bit Windows apps under WOW64. Windows 10 disabled NTVDM/WOW by default and Windows 11 removed it, so if you want DOS and 16 bit Windows support you have to use a virtual machine running an older version of Windows or something like DOSBox or WINE.
@@somacruz8272 Parallel to DOS. Windows 9x is an extension of Windows/386 and Windows 3.x, which were a layer on top of the MS-DOS kernel. Windows NT on the other hand has always been its own kernel.
@@somacruz8272 If you look at any book on the internals of NT, a good book on operating systems with examples, or even Wikipedia, you'll see that NT was a completely new architecture with the lowest API layer being the Hardware Abstraction Layer (HAL). At first only companies really used NT; it was for servers and powerful workstations. It wasn't until Windows XP replaced MS-DOS and Windows ME that Windows was built completely on the NT/2000 architecture and became mainstream for home users, until it became the de facto standard.
@@andrewcrook6444 Right, almost all "newer" operating systems switch the processor into protected mode very early during boot and use it, well, more properly :) Windows 95 and 98 also switched to protected mode (in which normal x86 real mode code no longer works), but they didn't really care much about proper separation between kernel and userspace. It was sorta there, but they still supported v86 mode to run 16 bit DOS applications natively. Those essentially bypassed almost all security measures of the CPU. With WinXP (which was built on NT) things got much better security wise, though a lot of low level drivers could still easily bring down the whole system. BSODs were much rarer on XP, but still possible if some driver went haywire. Even with WinXP we still got DOS support, but only in the 32 bit version. In the 64 bit version the support for 16 bit binaries was dropped, which was a huge deal for me at the time as I was still occasionally using my good old Borland Pascal 7 ^^. Though things have changed a lot since then. It was quite a journey.
7:06 A common misconception about Python is that each line is read and executed in real time. What actually happens is that the interpreter first compiles the source to bytecode, which is kept in memory and then executed by the Python Virtual Machine.
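You can watch the compile-then-execute split happen from inside Python itself. This is only a small sketch using the standard `compile`, `exec` and `dis` tools; the source string and the `<example>` filename are placeholders, and the exact opcode names differ between Python versions, so none are listed here.

```python
import dis

source = "x = 2 + 3\nprint(x)\n"

# Step 1: compile the whole source to a code object (bytecode), nothing runs yet
code = compile(source, "<example>", "exec")
opnames = [ins.opname for ins in dis.get_instructions(code)]

# Step 2: hand the bytecode to the virtual machine
namespace = {}
exec(code, namespace)

print(len(code.co_code), "bytes of bytecode,", len(opnames), "instructions")
```

Calling `dis.dis(code)` prints the full listing if you want to read the instructions one by one.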
@@jeremiefaucher-goulet3365 i think the point is that its not being translated from python language to bytecode, but from a more easily translated, compiled intermediate, like Java is.
Okay, I just want to say that one of the reasons for the big size of the .exe file is the compile mode: Debug. You can basically see three int 3 (breakpoint interrupt) instructions right after the end of the "main" function. They are inserted by the compiler to catch execution running off the end of the function (if you, for example, forgot to write the "ret" instruction). Debug mode generates a terrible amount of auxiliary code to help you with debugging. All your actions, even in assembly, are checked by debugging instruments at runtime to help you find mistakes. So for pure research you should disable all of the debug utilities in the project settings (part of them is still used even in "Release" mode). But even with that, this video was interesting, thank you for your work.
An EXE file also contains icons, bitmaps, cursors and dialog definitions. The function LoadBitmapA, for example, loads a bitmap resource stored inside the current exe file. Many of these resources can be viewed (and sometimes edited) with PE Explorer or similar programs.
I'm sure several people have pointed it out by now, but the extra code you were seeing is from the CRT (C Runtime,) since despite being written in assembly, you were compiling your program as a C program. Before main is called, there has to be code to do things like, take in the command line and split it up into argc/argv, set up thread local storage, set up floating point numbers, etc. On Windows, this stuff is done by the executable itself, not the system. The code to do it is inserted by the compiler in a way that's transparent to programmers. You can turn it off, but then you'll have to implement those features yourself if you want to use them.
This is one time i wish i could double like a video. It's a bit oversimplified for more advanced computer users, but for the layman just wanting to learn more this is fantastic.
I remember when MS DOS had a debugger. It was fun to start the debugger and tell it to just "go". Debug would dutifully attempt to execute whatever the IP register was pointing to. The machine would jump off a cliff if it could and you told it to
For those interested, I find Dave's Garage "The World's Smallest Windows App" video a fantastic explanation of how you can take out everything but the bare minimum from a PE. Very interesting video by the way, although I feel like the viewer is left with more questions than answers. Anyways, keep up the good work and I hope to see your channel grow.
Amazing video! I never really thought about exe files that way before - you explain it so well! I always learn something cool from your channel - thank you!
the MZ at the beginning of DOS executables stands for "Mark Zbikowski"... who was one of the main developers responsible for developing the file format.
CS student so I have a few notes on this: 6:29 When I first learned about Assembly I thought the same. This is NOT true however. Assembly is extremely hard to grasp on a physical scale. It is only when you get into the meat and nitty gritty details of how a processor _actually_ functions that you realize just how close Assembly code actually is to pure machine code. All Assembly effectively does is take a command in (like 'mov') and translate it into 1s and 0s. There is a 5060 page thick "Intel 64 and IA-32 Architectures Software Developer's Manual" for x86 Assembly detailing what exactly each instruction means, but basically "mov eax, 0x5" gets translated _directly_ into "0xb8 0x05 0x00 0x00 0x00" in hexadecimal, with b8 being the opcode that means 'move the following value to the eax register' and the remaining four bytes being the 32-bit value 5 in little-endian order. The instructions that are read are directly sent to something like the processor's ALU and directly fed into the connected multiplexer. So the "add" instruction you put in actually controls that specific multiplexer in that specific register. Now while this is not punching bits into a machine by hand, you are really not gonna come any closer to controlling the pure bare bones hardware than this. 7:36 I presume you are referring to Python in this case, because believe me when I say that every single one of us sucks at Assembly compared to the magic a compiler performs. A compiler is capable of spitting out insanely optimized Assembly code, to the point where the only people on this planet capable of writing faster Assembly code than it are the people that actually program the damn things. Compilers do things like higher polynomial functions and division by invariant multiplication to make your code _way_ faster than you could ever do. And those are just some of the incredibly genius ways your code can be improved upon. To _really_ understand the full math a compiler uses to fold and optimize your code you basically need a PhD in Math and Computer Science. 
All in all that topic is a thing you can really sink time into. :)
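To make the "opcode plus operand" idea concrete, here is a tiny Python sketch that hand-assembles one instruction, mov eax, imm32, for 32-bit or 64-bit code. The function name `encode_mov_eax_imm32` is made up for this example; the encoding itself (opcode B8 followed by the immediate as four little-endian bytes) is the standard one from the Intel manual.

```python
import struct

def encode_mov_eax_imm32(value: int) -> bytes:
    """Hand-assemble 'mov eax, <value>' for 32-bit / 64-bit x86 code."""
    # B8 is 'mov r32, imm32' with register number 0 (eax) folded into the opcode;
    # the immediate follows as 4 bytes, least significant byte first.
    return bytes([0xB8]) + struct.pack("<I", value & 0xFFFFFFFF)

encoded = encode_mov_eax_imm32(5)
hex_form = encoded.hex(" ")
print(hex_form)  # b8 05 00 00 00
```

A disassembler does the reverse lookup, which is why it can show the raw bytes right next to the mnemonic.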
One thing I glossed over in the video that I should have gone into more detail with is that the compiling process is still doing a lot of work even from the assembly level. For example, when writing assembly code you can still have variables with their own set name, and that's an abstraction that will be dealt with by the compiling process. Even the MOV instruction in assembly has a couple of different machine code equivalents depending on the addressing mode and what it wants to do. So I agree that assembly is close to machine code, but it isn't always a 1:1 translation process. And then as I mentioned with the program I made, the linker added all the code for creating the window and turned my short program into a 48 KB program. And I think that is the point I was trying to make, not that assembly is significantly different from the machine code of the final executable, but that the compiling process, something that most programmers probably never give a second thought to, is doing a significant amount of work to deliver that final EXE.
And don't forget to add to this that most (C) compilers are built in a two-step process, first compiling ("bootstrapping") itself using whatever tools are available, and then in a second pass, compiling itself using itself, because of well-known own optimizations etc. (and to verify everything is working, too).
@@InkboxSoftware Oh yes absolutely. I mean there's several mov instructions depending on register and type of value you are handling. But at the end of the day your instruction really does get a 1:1 translation into a final binary instruction which makes this incredibly cool to use. Given the context it's very true though that the .asm file you put into the compiler is not 100% just what you will get out of it (as you rightly pointed out). The way this statement sounded to me was giving me more of a "Assembly is basically a more complex form of C" type of vibe (I hope you get what I mean by that haha). But in the context that the .asm file is not 100% all you are getting it is very much true. :)
@@InkboxSoftware Finkel - Funk is actually right. Assembly is a 1:1 translation to machine code. Yes, most assembly dialects support certain macros or some simple simplifications, but those are merely syntactic sugar. What you've seen in your disassembly is just boilerplate code that was generated by your linker, since you actually use C++. Try using an actual assembler that directly spits out the exe file. Another thing, which I mentioned in another comment, is that the PE file format has a lot of additional headers and sections that need to be initialized as well. Those are not really machine code, as they're just part of the actual PE format. Call it metadata. This metadata isn't interpreted by the CPU but by your OS. When you start an application, there's a lot going on on the OS side before the actual execution of your code starts. Though that's all part of the OS. Classical COM files under DOS only contained raw machine code from the very first byte. So you can write a program with just a few bytes and it would work (under DOS). COM files were always loaded at offset 0x0100 within their segment, so absolute memory references were actually possible that way. The address was literally the position in the file + 256 (==0x100). I can recommend looking at FASM, which is a very slim low level assembler that directly outputs whatever you want (MZ, PE, COFF, ELF). Note: What most assemblers do for you is convert relative memory addresses or the addresses of labels. That's where the source differs from the actual output. Though in the end the position of a label simply denotes the address. So the compiled code just contains a numeric offset at that place. Some decompilers would actually create fake labels for those, but of course they can't reconstruct the label (or variable) names, as they don't exist in the compiled code. 
Of course we talk about native x86 / x64 code here and not IL code (intermediate language) which is generated by .NET and can only run with the .NET framework which does the final JIT compilation on the target system. So Finkel is right. Most decompilers usually show the actual bytes that make up that opcode right next to the actual instruction in assembly. It's a literal 1:1 mapping. There are "high-level" assemblers which give you support for some high level features like if statements, loops and simple data structures. Though those do not represent the actual hardware assembly language.
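The COM addressing rule mentioned above (position in the file + 0x100) is easy to show with a hand-built image. This Python sketch assembles a tiny "Hi" program byte by byte and computes the address of its message label the way an assembler would. The names `label_address` and `program` are invented for the example, and the resulting bytes would only run under DOS or an emulator such as DOSBox.

```python
COM_LOAD_OFFSET = 0x100  # DOS loads a .COM image at offset 0x100 of its segment

def label_address(file_offset: int) -> int:
    """Address a label gets at run time: its position in the file + 0x100."""
    return COM_LOAD_OFFSET + file_offset

message = b"Hi$"   # DOS function 09h prints up to the '$' terminator
code_size = 8      # the four instructions below add up to 8 bytes
msg_addr = label_address(code_size)  # the message sits right after the code

program = (
    bytes([0xB4, 0x09])                           # mov ah, 9
    + bytes([0xBA]) + msg_addr.to_bytes(2, "little")  # mov dx, msg
    + bytes([0xCD, 0x21])                         # int 21h
    + bytes([0xC3])                               # ret
    + message
)

print(hex(msg_addr), program.hex(" "))
```

The label name is gone from the output; only the number 0x0108 survives, stored as the bytes 08 01. That is why a disassembler has to invent its own label names.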
In fact Assembly is just machine code represented by words; each word (operation) has a value in 1s and 0s. I could be wrong, but I believe that the first 5 characters are related to the operation to do, then there is the address and then the numbers to do the operation with. However Python is different: the Python compiler compiles the file to bytecode, then the bytecode gets run by the Python VM.
you can create much smaller executables if you use masm for instance. it doesn't add all the "unnecessary" stuff if you dont need it, and you can set the data blocks yourself, optimising your executable.
There also used to be a tool called EXE2BIN which would then strip a lot of the stuff from an .EXE file and generate a .COM file which was much smaller. I don't know if you can still build and run .COM files in Windows. I haven't used Windows on the regular for years but back when I was learning to code in 16bit assembler that was how I made tools that I could add to a 1.44Mb bootable diskette for trouble shooting.
I just want to say that this was an amazing video. You could have just stopped after the theoretical first section, like most other videos do, but you went the extra mile and showed how it works in practice. Honestly, if the rest of your work is just half as good as this one, you've got potential for blowing up!
And this is why we need for the Community to release things more like dev tools instead of production apps. To better understand how things works internally, and to improve them.
Well, he said it near the end of the video: "Computers are so fast today, we don't even need to optimize it to nth levels the further we are from the raw materials." If someone somewhere had the capability and time to tweak everything, then dev tools would be common for sure.
If I remember correctly, in DOS you could also create COM files as well as EXE. I think these were basically executables for small programs like command line utilities.
I like to create tiny com files with a little help from debug and i put all instructions to build a routine into batch files. Most of my batch files have to start with one or more parameter attached to build the routine.
COM files are 16 bit only. I used to write x86 assembly to test antivirus heuristics. The average computer user really does not know how lucky we are to have so many protections like the NX bit and ASLR. Those, plus the move to the NT kernel with Windows XP and later, closed off so many virus opportunities that existed in 9x.
@@mattrogers6646 In DOS we can switch from 16 bit mode into 32 bit mode and 64 bit mode with a com file, and we can start up all cores of a multicore system. I like to use the undocumented 16 bit BIG real mode (unreal mode) with a segment size of 4 GB for the DS, ES, FS and GS segments and 64 KB for the CS and SS segments, and to open the 21st address line (A20) to write directly into the linear framebuffer using VBE graphics modes with address size prefixes on 80386+ CPUs.
I have been writing code since the late 70s. Back then we had to be super efficient with our logic because computers were so slow and limited. Modern languages allow me to focus on the problem I’m solving, almost ignoring the computer resources. Modern computers and IDEs are simply amazing.
Inside an EXE are:
- a header that tells the required CPU type, the minimal version of Windows and the position of the first instruction.
- a section table that tells which parts of the file should be loaded into memory, and where.
- an import table that tells the names of the DLL files that must be loaded, and the names of the functions that your program needs. Windows will create an array of function pointers that point to those functions.
- your machine code, your constants, and the initial values of the global variables.
And that's basically it.
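The header fields from that list can be read with a few lines of Python. This is just a sketch: `read_pe_summary` and `make_fake_image` are names invented here, and the test image is a fake header with made-up values (x64, 3 sections, entry point RVA 0x1000), not a real program. The field offsets follow Microsoft's PE Format documentation.

```python
import struct

MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64", 0xAA64: "ARM64"}

def read_pe_summary(image: bytes) -> dict:
    """Read CPU type, section count and entry point from PE header bytes."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4  # 20-byte COFF file header follows the signature
    machine, section_count = struct.unpack_from("<HH", image, coff)
    optional = coff + 20  # optional header; AddressOfEntryPoint is at +16
    (entry_rva,) = struct.unpack_from("<I", image, optional + 16)
    return {
        "cpu": MACHINE_NAMES.get(machine, hex(machine)),
        "sections": section_count,
        "entry_rva": entry_rva,
    }

def make_fake_image() -> bytes:
    """Header-only test data with made-up values; not a runnable EXE."""
    pe_offset = 0x40
    image = bytearray(pe_offset + 4 + 20 + 20)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, pe_offset)
    image[pe_offset:pe_offset + 4] = b"PE\0\0"
    struct.pack_into("<HH", image, pe_offset + 4, 0x8664, 3)
    struct.pack_into("<I", image, pe_offset + 4 + 20 + 16, 0x1000)
    return bytes(image)

summary = read_pe_summary(make_fake_image())
print(summary)
```

The section table and import table take more code to walk, but they are reached the same way: fixed offsets and counts stored in these headers.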
7:00 python is actually compiled before being executed, just not into actual assembly, but into an intermediary bytecode format the python interpreter then runs. this is what’s inside the .pyc files you sometimes see in the pycache directory, just python bytecode. a better example would be javascript, which is partially compiled and partially interpreted
@@puppergump4117 No, python source code is generally more cross version than the bytecode format, and the compiled .pyc file is often several times bigger than the source file. It's just to precompile everything, so you don't have to do JIT compilation
Python is actually compiled... in a way. While the entry file is only compiled in-memory, the imported libraries are (if the source has a newer date than the saved byte-code (if any), it will recompile it). They are compiled in a similar fashion to Java Executables, but in this instance, as platform-specific, Python byte-code (on the Windows installer, there is an option to precompile the standard library).
7:41 It's surprising how fast computers are nowadays. Back in the 1980s, assembly code would've been the only option for programmers to code fast programs, as the processors inside 80s computers like the Commodore 64, ZX Spectrum or Atari XL/XE (maybe even the Atari ST or Commodore Amiga) were simply too slow to make a program run in an interpreted language (most often BASIC), and compilers were scarce for these computers.
@@matrix01234567899 Not just the CPUs, but the compilers as well. There's a limit to what a CPU by itself can optimize, but if everything is lined up for it perfectly it can eat at the code like no other.
This is not how I remember the 1980s. I wrote in assembly in exactly two circumstances. First for learning assembly programming (CDC 7600 and PDP-11). Second for programming a microcontroller (Intel 8048). Other than that, it was Fortran, Cobol, C, C++, Pascal, LISP, APL, Prolog, Turing, etc., etc. This was on mainframes, minis, workstations, and microcomputers. Understanding assembly was a great help in understanding how machines and compilers work. But actual use of assembly in university or industry was fairly rare.
@@matrix01234567899 Almost certainly. In the past, optimizing assembly was mostly a matter of various tricks in arithmetic (like bit shifting instead of multiplying) and optimizing memory usage. Those things really do not matter anymore, and optimization is a matter of streamlining instructions so they can be better executed in parallel (even in single threads, processors can do multiple instructions simultaneously), optimizing memory usage not in terms of space but cache access, and various other aspects that can only be done by rigorous calculation and not by reasoning like in the old days.
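For anyone who hasn't seen the old arithmetic tricks, here is a small Python sketch of two of them: a shift instead of a multiply, and the related compiler trick of replacing division by a constant with a multiply and a shift. The function names are made up; the constant 0xAAAAAAAB is (2^33 + 1) / 3, which is why the second function is only valid for unsigned 32-bit inputs.

```python
def times_eight(x: int) -> int:
    """Multiply by 8 with a shift: x * 2**3 == x << 3."""
    return x << 3

def div3_u32(x: int) -> int:
    """Divide an unsigned 32-bit value by 3 without a divide instruction.

    Only valid for 0 <= x < 2**32: multiply by the 'magic' reciprocal
    0xAAAAAAAB and keep the bits above position 33.
    """
    return (x * 0xAAAAAAAB) >> 33

samples = [0, 1, 2, 3, 100, 65535, 2**31, 2**32 - 1]
shift_ok = all(times_eight(x) == x * 8 for x in samples)
divide_ok = all(div3_u32(x) == x // 3 for x in samples)
print(shift_ok, divide_ok)
```

A compiler picks the magic constant and shift count for each divisor automatically, which is part of why hand-tuning this kind of thing rarely pays off anymore.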
This was so much simpler in the 1980s. The .EXE was just a series of records, each with a memory address to which the machine code was to be loaded, a length of the record and the machine code to be loaded. I believe there was an entry point address too to indicate where the CPU should start executing. That was it. That's all that was necessary. For a real challenge, analyze the .OBJ format. It's way WAY more complex. I am a recovering bit twiddler. 😊
Cool I'd like to add that while you *can* get faster code by writing it in assembly, you should have some faith in the compiler, they're very smart these days. And I'm pretty sure the people who wrote them are smarter than me too.
Interesting to note that the PE executable format is almost identical to the Unix COFF format, which is the predecessor to the modern ELF format used in Linux and many other operating systems today. PE is in fact sometimes known as PE/COFF.
2:38 "Yes, Windows is just DOS at the core still." This isn't true in any sense. The MZ followed by the "This program cannot be run in DOS mode" Is there strictly for compatibility and does not affect the function of the program whatsoever.
For everyone correcting the video saying that “windows is no longer based on DOS”: true, but, he didn’t say windows is based on DOS. He said the EXE file format hasn’t changed since the DOS days.
An exe file is actually an archive format similar to .apk or .zip, this is demonstrated by opening a .exe file on a linux filesystem, it will display the contents as if it were an archive, and there you will find the icon, binary, etc.
Opening it in what? Different tools will do different things with it. If you cat it on the command line, you'll see what looks like garbage (and likely also mess up terminal settings), because it's binary data. There aren't any core *nix tools I'm aware of that analyze .exe files, aside from the superficial analysis done by the "file" command (whose purpose is to identify many different types of files based on their content).
@@UFO_researcher So it's one of the graphical file managers. Well, I figured that much. It's almost certainly based on a file association, which can change if, for instance, you install new applications. If you were to install Wine, for instance, double-clicking might allow you to run such executables instead of just analyzing them. The point is, you can't assume everyone has the same setup. Just talking about "opening" a file, by itself, isn't as helpful as you assume. You could, however, try to see the name of the application that opens.
Got this recommended. I'm currently working on a presentation about how a computer works. I think this may be a good visualisation to show the difference between "code types".
If you create a console application, it works by sending text through streams (stdin, stdout and stderr), and rendering the console window is done by the operating system (typically by conhost.exe). Also, a disassembly doesn't show you what is inside the exe file, but what is inside RAM. If you really want to see what is inside this exe, open it using 7-zip.
> Also disasembly dont show you what is inside exe file, but inside RAM. There's two type of analysis, static and dynamic. A disassembler could be produce fake disassembly because code may changed at runtime. > If you really want to see what is inside this exe, open it using 7-zip 7-zip is not for analyzing executable
Your knowledge is excellent, but I'll have to point out that there's a hole in your story: 1. first: an exe is a file, 2. yes, it contains a header and the machine code, 997: but then the multitude of exe formats? The hole is how the operating system starts the program: 3a. first the OS gets a command to start the program, 3b. it loads the program into memory, then it finds the addresses in the header and translates those to physical addresses (more or less relocation, and similar address translations), 3c. it looks up which libraries (DLLs) the program requires, checks whether those are loaded into memory, loads them if not, then finds the correct physical addresses of those DLLs and writes them into the program at the appropriate locations, 4. it finds the program entry and starts executing machine code from there. The exe file variants emerge from there.
New Sub! Dryden, Michigan I solely sub'd for your Effortless ADHD Transition at "4:09" About to turn 40, never treated for my extreme adhd growing up, That is exactly how I learned PCs in the 80/90s. Learning how to replace the SOL.EXE icon in Windows 3.0 MME somehow turns into finding EVERY MsDos manual to teach myself QBasic within the same 60min lol
Just a little thing about the conclusion: most compilers optimize code way beyond our level of knowledge. To handcraft assembly code which is actually faster requires extensive knowledge of the targeted instruction set(s), so the best way to optimise is actually to ask the compiler to do it and then maybe optimise the generated code. Great video though.
Lots of secrets are in these files, still unexplored. For example, I found a secret in chgcolor, that is a monitor driver for Dos. When all colors are defined by the user’s decision, games may have strange colours. When the reset is chosen then a restart, the b/w laptops can recognize 14 colours instead of the default 6. 2 colours remain missing, the lcd doesn’t recognize it in Dos mode. The 256 colour games will look much better, and more details can be seen using 256 kb video ram. Windows users need the wdl disks, 16 grays driver can recognize 16 different colours in Windows.
Right, com files did not have any header whatsoever. However they were always 16 bit DOS applications, so they are no longer supported on 64 bit systems. WinXP (32 bit version) did still support the execution of DOS and com files.
@@matrix01234567899 Yes, that's true, but running Win 10 on a 32 bit machine would be pure madness ^^. 32 bit systems can only address 4GB of RAM. That's barely an option nowadays. Windows alone would chew that up :D But yes, you're right. Almost all 32 and 64 bit CPUs (with the exception of AMD Ryzen) when running in 32 bit mode do still support the virtual 8086 mode. Though how good the support is depends on the actual application. Certain exotic hardware stuff may break old code. The best solution is usually to just use DOSBox and emulate a machine.
@@Bunny99s On Win10 even less than 4GB is not madness if you don't run a web browser or other modern demanding software; the OS itself (even Win11) is OK with this amount of RAM. To be correct, it was a decision made by Microsoft to stop supporting 16-bit apps on 64-bit systems; the CPU itself doesn't block this option. When running 32-bit apps on a 64-bit OS, or 16-bit apps on a 32-bit OS, the CPU changes modes many, many times a second as the OS does context switching.
PEs don't really do fat binaries (since the header only allows specifying one machine type), the "portable" just meaning that the format itself is processor agnostic.
Mac file forks are awesome in the way they work to get around this. Or at least they used to. I haven't messed around since OS 7.5 really. Also, Visual Basic 4.5 was the only Visual Basic to include a compile to .exe built in. The more you know 🌈
Hey! Nice to see a fellow old Mac fan. I programmed a bit on pre-OSX Mac operating systems starting at System 7.0, and ending with the release of OSX. And yes, I completely agree... the Mac Resource fork and Data Fork paradigm was *amazing* and way ahead of its time. Remember using ResEdit??? :D Gooood times! Unfortunately, Resource/Data forks as they existed pre-OSX aren't implemented in OSX.
I remember days, when we had just .COM files (CP/M era!). Then that became too limiting, being basically tied to just one hardware. And too small. So, .EXE was introduced, to allow choices in linking. Then more and more libraries to be linked, until the different versions of the .EXE were required. I essentially stopped bothering after MS-DOS 6.2. Still have Microsoft Macro Assembler 5.0, though.
Apart from assembly really not being needed for performance anymore, most programmers will also fail at attempts to write assembly by hand that would outperform the optimizations done by modern compilers. Outside of embedded work, there really is no reason to write anything in assembly anymore. What's added to your EXE in this video is the C++ runtime code. You get a version tailored to console usage, but none of that is needed to run your sample program. If you would use an assembler (like NASM) instead of Visual Code, it will run just fine using only your code translated to machine language (well, and the PE header)
Heads up - no one...NO ONE, EVER, wrote programs in machine code. Why? simple - each instruction code's mnemonic was known at the CPU design stage, and remember, those mnemonics (and operands) formed the 1:1 machine code:assembly language instruction set. Assembly language = 1:1 human-readable (mnemonic) version of machine code CPU instructions and operands. Understand that assembly language mnemonics were constructed at the same time as the instruction set was constructed - the designers never, ever expected programmers to memorise the numeric equivalents when those much, much easier to remember mnemonics were also available. So, in those 'old days', every machine code program was actually written on paper in assembly language, and beside each assembly code line, the equivalent machine code instruction/opcode was written. Hand-written labels for branches/jumping/data was also used, obviously. Then, when it came to the time when the program would be entered into the computer, that was when the machine code equivalent was used...entering in all those numbers. I know this because I used to do it many years ago, and if you think about it, it makes 100% sense ;)
So, what was the first assembler written in? It had to be written in machine code. The programs for the first computers were written in machine code, with Assembly language developed later on because machine code was too difficult to write long programs in. A prime example was the Univac computer. Programs were written in machine code in the 1950s and no assembler was created until 1960.
@@palmercolson7037 You have failed to comprehend 100% of my post - that's quite an impressive achievement. Your reply is somewhat confused... "The programs for the first computers were written in machine code with Assembly language developed later on because machine code was too difficult to write long programs in." --> It's the other way round! Assembly languages are orders of magnitude 'easier' to write in than pure machine code, for self-evident reasons. I'm also not talking about feeding assembly language into those early computers. You also asked "So, what were the first assemblers written in?" My post perfectly answers this (hint: assembly language)... Please re-read my post. Google anything you're not sure about.
@@ChrisM541 The very first assemblers were humans (of the female kind oddly enough). You are correct though. They would use tables on paper to correlate numbers to the mnemonics written on paper; basically the same thing as what assemblers still do.
Do note that the idea that writing in ASM makes your code faster is just a myth. Most hand-written ASM isn't as efficient as the programmer might want, and it's not portable either. You also still rely on an assembler and a linker to turn the ASM into object code and then an executable, which can add some overhead of its own. Most modern toolchains are extremely powerful, but in general ASM sacrifices portability and is just a burden on the programmer. Meanwhile, C and C++ compilers have come so far that they do beat hand-written ASM built with older tools like NASM. People might also argue about binary size, but this isn't the 90s: having 2 to 4 TB of disk space is normal nowadays, while the binary will be under 500 KB (even without stripping debug info). If you for some reason want to write ASM, a better approach is to embed it inside a C or C++ program (inline ASM). However, do that only if you know what you're doing, as it might not be the best possible way to achieve performance.
All Windows versions from XP onwards are not based on DOS. However, since Windows 95, 98 and Me run on top of DOS, the "This program cannot be run in DOS mode." error handler needs to be there to stop DOS from trying to run invalid code and crashing. This error handler even exists in UEFI boot files, since they are based on the PE format. Another strange similarity UEFI has to Windows is part of the EFI shell many PCs have built-in, to troubleshoot and provide basic functionality when the PC has no operating system installed. If you type a command incorrectly in the EFI shell, often the error message that appears is a near verbatim copy of the error you get in Windows' legacy CMD command prompt, "???? is not recognized as an internal or external command, operable program or script file."
With right compiler settings I managed to get exe size to 2kb and on linux 0.5 kb binary. so yes, you have a lot of noise there. I don't think anyone writes exes manually however modifying exe with hex editor is not uncommon.
To clear up the Windows vs DOS versions, here is a summary (from the wiki):
Windows: 1.0, 2.0, 3.x, 4.x (95, 98, Me) - boots DOS before Windows.
Windows_NT: 3.x, 4.x, 2000, XP, Vista, 7, 8.x, 10, 11 - boots straight into Windows. It does not contain any DOS code, save perhaps in the NTVDM component.
The notion that Windows_NT has any DOS code at its core is simply not true.
"Yes, Windows is still just DOS at the core." -- no. That's just backwards COMPATIBILITY with MS-DOS. That doesn't mean it still IS MS-DOS at its core.
You're forgetting that while running stuff on an operating system you are never really programming the CPU directly, so you'll never be able to create your program without all the OS related fluff for it to work in that environment.
4:00 "...I don't know 64-bit assembly so I don think that I'll be [hard cut] So first I had to get familiar with x64 assembly ..." Gave me a good laugh. :)
4:11 I'm glad I learned Assembly in college and didn't have to search stuff online on my own the first time I learned it 😅. It's been ~5 years since I've last touched an assembly program (in Linux too, not even Windows 🥲) and I can't even find anything remotely close to the pdfs and ppts that were shared in the class. I regret not saving all the documents they gave us somewhere on my computer 😫. Good job figuring out how to make it work, but I noticed that your code looks very different from what I learned 🤔. Probably because ASM Linux and Windows are that different 🤷♂️.
Outside of OS differences, the main difference between any binaries would be the intended architecture. I’m sure it’s more complicated but that’s the basic difference, so fat binaries aren’t very common.
The exe (PE or PE+) may not have any machine (unmanaged) code in it actually. It could have zero machine code and instead have intermediary language (IL) code that targets the common language runtime (CLR) converting it into managed code that gets compiled just in time.
Iirc C and C++ dont actually use ASM as an intermediary step. Most compilers like GCC and Clang will translate it into their respective IR which is then processed by the backend (libgcc or llvm) and then transformed directly into an object. I think MSVC does this too. Don't quote me on that, but iirc thats how it works.
You’re mostly right, but you failed to mention how most exe need to talk to dynamic link library files, or DLLs for specific functions or functionality, and these files cannot be ran in win 32 mode so in reality exe and dynamic link, library files often work in conjunction
Actually, Windows hasn't been DOS at its core since Windows Me. Windows XP and later are based on Windows NT, which doesn't use DOS.
Windows xp still incorporated 9x for compatibility reasons. But it was based on nt
Indeed. The purpose of the little dos program at the beginning is to provide a nice error for anyone trying to run it from DOS.
and yet the files introduced in windows 10/11 (edge) still have the "this cant be run in dos mode" thing
@@earthblob2058 yknow that would probably explain why me was so buggy
it might’ve been an NT os rushed into being ported to 9x
@@quickhakkeramcfunk explained already why this is.. it’s to ensure proper error handling for DOS based operating systems that try to run it
This wasn't bad, but missed or simplified a lot about the actual exe content. Exe files (or PE files) are organised in sections. There are different sections in it and usually only one contains your code. There are other sections which may contain resources, text or, much more importantly, the import / export tables. While an EXE file usually does not have an export section, it usually has an import section. Its content is essentially a special "contract" between the OS and your application. When the OS starts your program, it creates a new process and takes care of loading your file into that process's memory. The OS will scan through the import table, look up the shared libraries and imported function names, dynamically load those DLLs into your application and resolve the requested functions. That way your program actually has access to certain functions that are either part of the OS or some other utility libraries. The export table usually only exists when compiling a DLL file, which internally is also a PE file. Of course the export section serves the opposite purpose of the import section: the OS can look up a function or other symbol that the library exports when loading the DLL for an application.
You can actually trim out a lot of the unnecessary code from a PE file. In the past I used a very small assembler called Flat Assembler (FASM). There you could even create your own MZ stub without all that message stuff that nobody needs anymore. In the demoscene it was even common to have the MZ and PE headers overlap. The MZ header contains a special value that determines the position of the PE header. By cleverly offsetting the PE header you could (re)use otherwise unused or irrelevant bytes. I created key hook DLLs that were only 2 KB in size. Unfortunately Windows expects sections to be placed at a certain alignment, so you cannot shrink the file too much, and there ends up being some empty space between sections. The demoscene usually makes use of almost every byte, though: since Windows does not care about the content of that alignment padding, you can fill it with your own data.
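The "special value" mentioned above is the 32-bit field at offset 0x3C of the MZ header (usually called e_lfanew). Here is a minimal Python sketch of how a loader-style lookup could follow it. The helper names `find_pe_header` and `make_stub` are made up for illustration, and the two images are header-only stubs that would not actually run.

```python
import struct

E_LFANEW_OFFSET = 0x3C  # position of the PE-header pointer inside the MZ header


def find_pe_header(image: bytes) -> int:
    """Return the file offset of the PE signature, or raise ValueError."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew is stored as a 32-bit little-endian value.
    (e_lfanew,) = struct.unpack_from("<I", image, E_LFANEW_OFFSET)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return e_lfanew


def make_stub(pe_offset: int) -> bytes:
    """Build a hypothetical header-only image (not a runnable program)."""
    image = bytearray(max(pe_offset + 4, 0x40))
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, E_LFANEW_OFFSET, pe_offset)
    image[pe_offset:pe_offset + 4] = b"PE\0\0"
    return bytes(image)


conventional = make_stub(0x80)  # PE header placed after a DOS stub
overlapped = make_stub(0x04)    # PE header folded into the MZ header itself
print(hex(find_pe_header(conventional)), hex(find_pe_header(overlapped)))
```

The second stub shows the overlap trick in its simplest form: nothing stops the pointer from aiming back into the MZ header, as long as the bytes it lands on still form a valid PE signature.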
The fun thing about FASM is that its source code is written in its own assembly dialect. So it can compile its own source code to produce itself. Of course it's open source. Just google for flat assembler or FASM.
Thank you for taking the time to write this out, I'll check out FASM.
I have a simple question, wouldn't it be simpler to use syscalls/interrupts to call for OS functionalities?
Came here to say this, well said
@@diobrando7642 Windows doesn't support syscalls by user programs, but they are theoretically possible iirc. But you cannot replace whole libraries with some syscalls.
@@diobrando7642 you shouldnt use direct syscalls, the syscall numbers change on windows
I did actually handcraft an EXE file. I did that as a part of writing a simple compiler for a stack-based language.
The hardest part was making the correct header. That took 6 hours - mostly because Windows never tells you what went wrong, it just refuses to load the file. But once I got that figured out, it was pretty smooth sailing.
the real strategy is just to make your own OS that has its own header format for executables so you know the header
If I may ask, how did you do that? Like, what resources did you use?
@@gabrielschilive7675 I tried to answer you, but my comment got deleted for links.
I used "PE Format" documentation from Microsoft Learn.
I also used "Tiny PE" research by Alex Sotirov. I didn't use most tricks from his research, but I used it as a guide, which fields are important, and which I can ignore.
@@gabrielschilive7675 You could also look for a copy of Inno Pascal. It's open source and a very simple Pascal compiler that directly generates PE executable files. It doesn't have every feature of Pascal, but it's super easy to understand the code.
@@anon_y_mousse Thank you. I had not thought about a compiler or linker before. Good idea!
Rarely these days do you hear people refer to C as high level, but I'm always glad when it is.
The definition I learnt from school: 1st generation is machine code, 2nd generation is assembly language, and both are low level. Newer generations including C are high level
@@Joker-fj8hg Same. But now C is said to be low level compared to things like Python, which I suppose it is.
@@steamrangercomputing As someone learning python who just glanced at some c, yeah. There's a learning curve..........
@@YaySyu If you're educated on the basics of how programming languages and computers work, C is actually easier to learn than Python, as it straight up has less features, and thus less to learn.
What makes it so difficult for many people to learn is that what functionally it does have is a lot more powerful and closer to the hardware than Python, meaning people have to learn about things like memory management.
It's merely a spectrum. Assembly is a high level language compared to binary codes, and python is a high level language compared to those two.
i don't think its true that windows is still running on dos nowadays though. thats my only critique. its running on the NT kernel now and has for a long time. I think that message about not running in dos mode was made for the time before home versions of windows used the NT kernel, so pre windows XP.
No, what he meant was that if you run the PE exe under DOS, it runs the DOS part of the exe to display an error message; if the same exe is run under Windows NT, it jumps to the Windows part of the exe. He didn’t mean that Windows NT runs on DOS, which it doesn’t. On a side note, NT’s DOS support is via a virtual machine called NTVDM, and 16-bit Windows apps run via NTVDM and an extension called WOW. 64-bit Windows runs 32-bit Windows apps under WOW64. Windows 10 disabled NTVDM/WOW by default and Windows 11 removed it, so if you want DOS and 16-bit Windows support you have to use a virtual machine running an older version of Windows, or something like DOSBox or Wine.
@@somacruz8272 Parallel to DOS. Windows 9x is an extension of Windows/386 and Windows 3.x, which were a layer on top of the MS-DOS kernel. Windows NT on the other hand has always been its own kernel.
@@somacruz8272 if you look at any book on the internals of NT, or a good book on operating systems with examples, or even wikipedia, you'll see that NT was a completely new architecture, with the lowest API layer being the Hardware Abstraction Layer (HAL). At first only companies really used NT; it was for servers and powerful workstations. It wasn't until Windows XP replaced MS-DOS and Windows ME that Windows was completely built on the NT/2000 architecture and became mainstream for home users, until it became the de facto standard.
@@somacruz8272 No it wasn’t. Win95 was, but win2000 only shares UI
@@andrewcrook6444 Right, almost all "newer" operating systems usually switch the processor into protected mode very early during boot and use it, well, more properly :) Windows 95 and 98 did also switch to protected mode (in which normal x86 real mode code would no longer work), but those didn't really care much about proper separation between kernel and userspace. It was sorta there, but they still supported v86 mode to run 16 bit DOS applications natively. Those essentially breached almost all security measures of the CPU.
With WinXP (which was built on NT) things got much better security wise. Though a lot of low level drivers could still easily bring down the whole system. BSODs were much rarer on XP, but still possible if some driver wreaked havoc. Even with WinXP we still got DOS support, but only in the 32 bit version. In the 64 bit version the support for 16 bit binaries was dropped. Which was a huge deal for me at the time, as I was still occasionally using my good old Borland Pascal 7 ^^. Though things have changed a lot since then. It was quite a journey.
7:06 A common misconception about Python is that each line is read and executed in real time, but what actually happens is that the interpreter first compiles the source to bytecode, which is kept in memory and then executed by the Python Virtual Machine in real time.
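You can watch that compile step happen with nothing but the standard library. A small sketch (the source string and the `<demo>` file name are arbitrary placeholders):

```python
import dis

source = "total = 1 + 2\n"

# Step 1: the whole source is compiled to a code object holding bytecode.
code = compile(source, "<demo>", "exec")
print(type(code.co_code))  # the raw bytecode is a bytes object

# Step 2: the virtual machine executes that bytecode.
namespace = {}
exec(code, namespace)
print(namespace["total"])

# Human-readable listing of the bytecode; the exact opcodes
# differ between Python versions.
dis.dis(code)
```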
It's all a question of perspective. That bytecode is still "interpreted" in real time by the PVM.
@@jeremiefaucher-goulet3365 i think the point is that its not being translated from python language to bytecode, but from a more easily translated, compiled intermediate, like Java is.
@@malikcurriah241 Exactly like Java bytecode yes. The JVM still needs to interpret and execute that intermediary bytecode at run time.
well, a .py file is still a .txt file with a fancy hat on.
@@TheOzumat You mean with a fancy skin on. :P
Okay, I just want to say that one of the reasons for the big size of the .exe file is the compile mode: Debug. You can basically see three "int 3" (breakpoint interrupt) instructions right after the end of the "main" function. They are inserted by the compiler to catch execution running past the end of the function (if you, for example, forgot to write the "ret" instruction). Debug mode generates a terrible amount of auxiliary code to help you with debugging. All your actions, even in assembly, are checked by debugging instruments at runtime to help you find mistakes. So for pure research you had better disable all of the debug utilities in the project settings (some of them are still used even in "Release" mode). But even with that, this video was interesting, thank you for your work.
An EXE file can also contain icons, bitmaps, cursors and dialog definitions. The function LoadBitmapA, for example, can load a bitmap stored inside the current exe file. Many of these resources can be viewed (and sometimes edited) with PE Explorer or similar programs.
I'm sure several people have pointed it out by now, but the extra code you were seeing is from the CRT (C Runtime,) since despite being written in assembly, you were compiling your program as a C program.
Before main is called, there has to be code to do things like, take in the command line and split it up into argc/argv, set up thread local storage, set up floating point numbers, etc. On Windows, this stuff is done by the executable itself, not the system. The code to do it is inserted by the compiler in a way that's transparent to programmers. You can turn it off, but then you'll have to implement those features yourself if you want to use them.
So it's basically the "startup sequence" that most game console devs had to use before running their actual game logic
This is one time i wish i could double like a video.
It's a bit oversimplified for more advanced computer users, but for the layman just wanting to learn more this is fantastic.
I remember when MS DOS had a debugger. It was fun to start the debugger and tell it to just "go". Debug would dutifully attempt to execute whatever the IP register was pointing to. The machine would jump off a cliff if it could and you told it to
For those interested, I find Dave's Garage "The World's Smallest Windows App" video a fantastic explanation of how you can take out everything but the bare minimum from a PE.
Very interesting video by the way, although I feel like the viewer is left with more questions than answers. Anyways, keep up the good work and I hope to see your channel grow.
saaame. i love that channel ^~^, also buizel is cute af. best pokemon.
Dave is a genius. The time it took him to write that program blew my mind!
Amazing video! I never really thought about exe files that way before - you explain it so well! I always learn something cool from your channel - thank you!
the MZ at the beginning of DOS executables stands for "Mark Zbikowski"... who was one of the main developers responsible for developing the file format.
CS student so I have a few notes on this:
6:29 When I first learned about Assembly I thought the same. This is NOT true however. Assembly is extremely hard to grasp on a physical scale. It is only when you get into the meat and nitty gritty details of how a processor _actually_ functions that you realize just how close Assembly code actually is to pure machine code. All Assembly effectively does is take a command in (like 'mov') and translate it into 1s and 0s. There is a 5060 page thick "Intel 64 and IA-32 Architectures Software Developer’s Manual" for x86 Assembly detailing what exactly each instruction means, but basically "mov eax, 0x5" gets translated _directly_ into "0xb8 0x05 0x00 0x00 0x00" in hexadecimal, with b8 being the opcode for 'move the following 32-bit value to the eax register' and the other four bytes being the value 5 in little-endian order. The instructions that are read are decoded and sent to something like the processor's ALU and fed into the connected multiplexers. So, simplified, the "add" instruction you put in actually controls those specific multiplexers for those specific registers.
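To make the mnemonic-to-bytes mapping concrete, here is a tiny Python sketch that encodes just this one instruction form. `encode_mov_eax_imm32` is a made-up helper, not part of any assembler. Note that in 32-bit and 64-bit code the immediate always occupies four bytes, so the full instruction is five bytes long.

```python
import struct


def encode_mov_eax_imm32(value: int) -> bytes:
    """Encode `mov eax, imm32`: opcode B8, then the immediate as 4 little-endian bytes."""
    return bytes([0xB8]) + struct.pack("<I", value & 0xFFFFFFFF)


print(encode_mov_eax_imm32(5).hex(" "))  # b8 05 00 00 00
```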
Now while this is not punching in bits into a machine by hand, you are really not gonna come any closer to controlling the pure bare bones hardware than this.
7:36 I presume you are referring to Python in this case because believe me when I say that every single one of us sucks at Assembly compared to the magic a compiler performs. A compiler is capable of spitting out insanely optimized Assembly code to the point where the only people on this planet capable of writing faster Assembly code than it are the people that actually program the damn things. Compilers do things like higher polynomial functions and division by invariant multiplication to make your code _way_ faster than you could ever do. And those are just some of the incredibly genius ways your code can be improved upon. To _really_ understand the full math a compiler uses to fold and optimize your code you basically need a PhD in Math and Computer Science.
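"Division by invariant multiplication" sounds scarier than it is. A sketch of the idea for one fixed divisor, using Python integers to stand in for a 32-bit register: dividing an unsigned 32-bit number by 3 can be replaced by one multiplication with a precomputed "magic" constant plus a shift. The constant below is ceil(2^33 / 3); the names `MAGIC`, `SHIFT` and `div3` are just for illustration.

```python
MAGIC = 0xAAAAAAAB  # ceil(2**33 / 3)
SHIFT = 33


def div3(n: int) -> int:
    """Divide an unsigned 32-bit integer by 3 with a multiply and a shift."""
    if not 0 <= n < 2**32:
        raise ValueError("n must fit in an unsigned 32-bit integer")
    return (n * MAGIC) >> SHIFT


for n in (0, 1, 2, 3, 100, 0xFFFFFFFE, 0xFFFFFFFF):
    assert div3(n) == n // 3
print(div3(100))  # 33
```

A compiler picks a different constant and shift for every divisor it sees, which is exactly the kind of bookkeeping that is tedious to do by hand.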
All in all that topic is a thing you can really sink time into. :)
One thing I glossed over in the video that I should have gone into more detail with is that the compiling process is still doing a lot of work even from the assembly level. For example, when writing assembly code you can still have variables with their own set name, and that's an abstraction that will be dealt with by the compiling process. Even the MOV instruction in assembly has a couple of different machine code equivalents depending on the addressing mode and what it wants to do. So I agree that assembly is close to machine code, but it isn't always a 1:1 translation process. And then as I mentioned with the program I made, the linker added all the code for creating the window and turned my short program into a 48 KB program. And I think that is the point I was trying to make, not that assembly is significantly different from the machine code of the final executable, but that the compiling process, something that most programmers probably never give a second thought to, is doing a significant amount of work to deliver that final EXE.
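The point about MOV having several machine code equivalents can be shown even for the simplest register-to-register case. A sketch in Python, assuming 32-bit operands and covering only the two opcodes 89 and 8B (the helper names are hypothetical): both byte sequences below mean `mov eax, ebx`, and the assembler simply picks one.

```python
# Register numbers as used in the ModRM byte of 32-bit x86 instructions.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModRM fields (2 + 3 + 3 bits) into one byte."""
    return (mod << 6) | (reg << 3) | rm


def mov_reg_reg(dst: str, src: str) -> tuple:
    """Return both valid encodings of `mov dst, src` for 32-bit registers."""
    d, s = REG[dst], REG[src]
    store_form = bytes([0x89, modrm(0b11, s, d)])  # 89 /r: MOV r/m32, r32
    load_form = bytes([0x8B, modrm(0b11, d, s)])   # 8B /r: MOV r32, r/m32
    return store_form, load_form


for encoding in mov_reg_reg("eax", "ebx"):
    print(encoding.hex(" "))  # 89 d8, then 8b c3
```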
And don't forget to add to this that most (C) compilers are built in a two-step process, first compiling ("bootstrapping") itself using whatever tools are available, and then in a second pass, compiling itself using itself, because of well-known own optimizations etc. (and to verify everything is working, too).
@@InkboxSoftware Oh yes absolutely. I mean there's several mov instructions depending on register and type of value you are handling. But at the end of the day your instruction really does get a 1:1 translation into a final binary instruction which makes this incredibly cool to use. Given the context it's very true though that the .asm file you put into the compiler is not 100% just what you will get out of it (as you rightly pointed out). The way this statement sounded to me was giving me more of a "Assembly is basically a more complex form of C" type of vibe (I hope you get what I mean by that haha). But in the context that the .asm file is not 100% all you are getting it is very much true. :)
@@InkboxSoftware Finkel - Funk is actually right. Assembly is a 1:1 translation to machine code. Yes, most assembly dialects support certain macros or some simple simplifications, but those are merely syntactic sugar. What you've seen in your disassembly is just boilerplate code that was generated by your linker, since you actually use C++. Try using an actual assembler that directly spits out the exe file. Another thing, which I mentioned in another comment, is that the PE file format has a lot of additional headers and sections that are / need to be initialized as well. Those are not really machine code as it's just part of the actual PE format. Call it metadata. This metadata can't be interpreted by the CPU but by your OS. When you start an application, there's a lot going on on the OS side before the actual execution of your code starts. Though that's all part of the OS.
Classical COM files under DOS only contained raw machine code from the very first byte. So you can write a program with just a few bytes and it would work (under DOS). COM files were always loaded at offset 0x0100 within their memory segment, so absolute memory references were actually possible that way. An address was literally the position in the file + 256 (==0x100).
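Here is what "position in the file + 0x100" looks like in practice: a Python sketch that assembles a tiny COM image by hand. The instruction bytes are standard 16-bit x86 encodings and DOS function 09h prints a `$`-terminated string, but `build_com` is a made-up helper and the result is an untested illustration, not something I have run on real DOS.

```python
import struct

LOAD_OFFSET = 0x100  # DOS places the file contents at this offset in the segment


def build_com(message: bytes) -> bytes:
    """Assemble a minimal .COM image that prints `message` via int 21h, function 09h."""
    code_len = 8  # 2 + 3 + 2 + 1 bytes of instructions, see below
    msg_addr = LOAD_OFFSET + code_len  # file position of the text + 0x100
    code = (
        b"\xB4\x09"                              # mov ah, 09h  (print string)
        + b"\xBA" + struct.pack("<H", msg_addr)  # mov dx, msg_addr
        + b"\xCD\x21"                            # int 21h
        + b"\xC3"                                # ret          (back to DOS)
    )
    assert len(code) == code_len
    return code + message + b"$"  # function 09h stops at the '$'


image = build_com(b"Hello")
print(len(image), image.hex(" "))
```

The message sits at file position 8, so the address baked into the `mov dx` instruction is 0x0108. No header, no relocation, nothing else.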
I can recommend looking at FASM which is a very slim low level assembler which directly outputs whatever you want (MZ, PE, COFF, ELF).
Note: What most assemblers do for you is converting relative memory addresses or the addresses of labels. That's where the source differs from the actual output. Though in the end the position of a label simply denotes the address, so the compiled code just contains a numeric offset at that place. Some decompilers would actually create fake labels for those, but of course they can't reconstruct the label (or variable) names, as they don't exist in the compiled code. Of course we talk about native x86 / x64 code here and not IL code (intermediate language), which is generated by .NET and can only run with the .NET framework, which does the final JIT compilation on the target system.
So Finkel is right. Most decompilers usually show the actual bytes that make up that opcode right next to the actual instruction in assembly. It's a literal 1:1 mapping. There are "high-level" assemblers which give you support for some high level features like if statements, loops and simple data structures. Though those do not represent the actual hardware assembly language.
In fact Assembly is just machine code represented by words; each word (computation) has a '1' and '0' value. I could be wrong, but I believe that the first 5 characters are related to the operation to do, then there is the address and then the numbers to do the operation with. However Python is different: the Python compiler compiles the file to bytecode, then the bytecode gets run by the Python VM
That "Gesundheit" killed me🤣🤣
Is that a tf2 reference
@@puppergump4117 no
you can create much smaller executables if you use masm for instance. it doesn't add all the "unnecessary" stuff if you dont need it, and you can set the data blocks yourself, optimising your executable.
There also used to be a tool called EXE2BIN which would then strip a lot of the stuff from an .EXE file and generate a .COM file which was much smaller. I don't know if you can still build and run .COM files in Windows. I haven't used Windows on the regular for years but back when I was learning to code in 16bit assembler that was how I made tools that I could add to a 1.44Mb bootable diskette for trouble shooting.
I just want to say that this was an amazing video. You could have just stopped after the theoretical first section, like most other videos do, but you went the extra mile and showed how it works in practice. Honestly, if the rest of your work is just half as good as this one, you've got potential for blowing up!
And this is why we need for the Community to release things more like dev tools instead of production apps. To better understand how things works internally, and to improve them.
Well, he said it in the near end of the video.
"Computers are so fast today, we don't even need to optimize it to nth levels the further we are from the raw materials."
Well, if someone somewhere has the capability and the time to tweak things, then dev tools would be common for sure.
One of the best videos that I have ever seen about "how the stuff works".
If I remember correctly, in DOS you could also create COM files as well as EXE. I think these were basically executables for small programs like command line utilities.
Yep. COM files were just pure machine code and data. No header, nothing.
I like to create tiny com files with a little help from debug and i put all instructions to build a routine into batch files. Most of my batch files have to start with one or more parameter attached to build the routine.
COM files are 16 bit only. I used to write x86 assembly to test antivirus heuristics. The average computer user really does not know how lucky we are to have so many hardware protections like NX bit, ASLR, and the move to NT kernel with Windows XP and later prevented so many virus opportunities that existed in 9X.
@@mattrogers6646 In DOS we can switch from 16 bit mode into 32 bit mode and 64 bit mode with a com file, and we can start up all cores of a multicore system. I like to use the undocumented 16 bit BIG real mode (unreal mode) with a segment size of 4 GB for the DS, ES, FS and GS segments and 64 KB for the CS and SS segments, and to open the 21st address line (A20) to write directly into the linear framebuffer using VBE graphics modes with the address size prefix on 80386+ CPUs.
@@mattrogers6646 If we boot MS DOS from a self made CD ROM a virus can’t infect our system files.
I really like the accompanying visuals you included at the end!
Surprised there's no mention that most of the time you can open an exe as a zip archive and see it broken down into smaller pieces.
yo i luv the europe analogy it made so much sense 😩👏
Stellar analysis! I learned a lot. 💜 Thanks.
"Gesundheit" that really caught me offguard, as a german. but it is the most realistic reply. just "Bless you"
I have been writing code since the late 70s. Back then we had to be super efficient with our logic because computers were so slow and limited. Modern languages allow me to focus on the problem I’m solving, almost ignoring the computer resources. Modern computers and IDEs are simply amazing.
Was interesting and entertaining to watch, even though I knew what's "inside" and had my expectations about the video. 🙂
I would still really like to know, What's inside a .EXE file!
Inside an EXE are:
- a header, that tells the required CPU type, minimal version of Windows and a position of the first instruction.
- a section table, that tells which parts of the file should be loaded in memory, and where.
- an import table, that tells the names of DLL files that must be loaded, and the names of the functions that your program needs. Windows will create an array of function pointers, that point to those functions.
- your machine code, your constants, and the initial values of the global variables.
And that's basically it
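If you want to poke at those pieces yourself, here is a rough Python sketch that reads the header and section table. Field offsets follow Microsoft's "PE Format" documentation; `parse_pe` and `build_demo_image` are made-up helpers, and the demo image is a synthetic header-only blob with hypothetical values (one `.text` section, entry point 0x1000), not a runnable program. Walking the import table is left out to keep it short.

```python
import struct

COFF_HEADER = struct.Struct("<HHIIIHH")         # 20 bytes, right after "PE\0\0"
SECTION_ENTRY = struct.Struct("<8sIIIIIIHHI")   # 40 bytes per section


def parse_pe(image: bytes) -> dict:
    """Extract machine type, entry point and section list from a PE image."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_off,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    machine, n_sections, _, _, _, opt_size, _ = COFF_HEADER.unpack_from(image, pe_off + 4)
    opt_off = pe_off + 4 + COFF_HEADER.size
    # AddressOfEntryPoint sits 16 bytes into the optional header.
    (entry_point,) = struct.unpack_from("<I", image, opt_off + 16)
    sections = []
    table_off = opt_off + opt_size
    for i in range(n_sections):
        fields = SECTION_ENTRY.unpack_from(image, table_off + i * SECTION_ENTRY.size)
        name = fields[0].rstrip(b"\0").decode("ascii")
        sections.append((name, fields[2], fields[3]))  # name, virtual address, raw size
    return {"machine": machine, "entry_point": entry_point, "sections": sections}


def build_demo_image() -> bytes:
    """Synthetic headers only; every value here is a hypothetical example."""
    mz = bytearray(0x40)
    mz[0:2] = b"MZ"
    struct.pack_into("<I", mz, 0x3C, 0x40)            # PE header follows directly
    optional = bytearray(0xF0)
    struct.pack_into("<H", optional, 0, 0x20B)        # magic: PE32+
    struct.pack_into("<I", optional, 16, 0x1000)      # AddressOfEntryPoint
    coff = COFF_HEADER.pack(0x8664, 1, 0, 0, 0, len(optional), 0x0022)
    text = SECTION_ENTRY.pack(b".text", 0x10, 0x1000, 0x200, 0x200, 0, 0, 0, 0, 0x60000020)
    return bytes(mz) + b"PE\0\0" + coff + bytes(optional) + text


info = parse_pe(build_demo_image())
print(hex(info["machine"]), hex(info["entry_point"]), info["sections"])
```

To try it on a real file, replace `build_demo_image()` with the bytes of any .exe read in binary mode.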
Roller Coaster Tycoon was coded in assembly, which made it run on every computer back then. I put this as a W for assembly language
7:00 python is actually compiled before being executed, just not into actual assembly, but into an intermediary bytecode format the python interpreter then runs. this is what’s inside the .pyc files you sometimes see in the pycache directory, just python bytecode.
a better example would be javascript, which is partially compiled and partially interpreted
Is that just to make it portable while reducing the size
@@puppergump4117 No, python source code is generally more cross version than the bytecode format, and the compiled .pyc file is often several times bigger than the source file. It's just to precompile everything, so you don't have to do JIT compilation
Python is actually compiled... in a way. While the entry file is only compiled in memory, imported modules are compiled and cached on disk (if the source has a newer date than the saved bytecode, if any, it gets recompiled). They are compiled in a similar fashion to Java class files, except that Python bytecode is specific to the interpreter version (in the Windows installer, there is an option to precompile the standard library).
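The on-disk cache is easy to reproduce by hand with the standard `py_compile` module. A small sketch, where the module name `demo_module` and its content are arbitrary placeholders:

```python
import importlib.util
import pathlib
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "demo_module.py"
    src.write_text("VALUE = 42\n")

    # Compile to a cached bytecode file, the same kind `import` would create.
    pyc_path = pathlib.Path(py_compile.compile(str(src), doraise=True))
    pyc_name = pyc_path.name
    header = pyc_path.read_bytes()[:4]

print(pyc_name)                               # file name carries the interpreter tag
print(header == importlib.util.MAGIC_NUMBER)  # first 4 bytes identify the bytecode version
```

The interpreter tag in the file name and the magic number at the start of the file are how Python notices that a cached file was produced by a different version and needs to be rebuilt.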
Yeah, the python interpreter converts your code to bytecode, then it gets executed by the python VM
wow u derserve way more visibility this is a really great video thx !
7:41 It's surprising how fast computers are nowadays. Back in the 1980s, assembly code would've been the only option for programmers to code fast programs, as the processors inside 80s computers like the Commodore 64, ZX Spectrum or Atari XL/XE (maybe even the Atari ST or Commodore Amiga) were simply too slow to make a program run in an interpreted language (most often BASIC), and compilers were scarce for these computers.
Nowadays CPUs are so complicated that if you are not experienced in assembly, your assembly code will probably be slower than compiled C++.
@@matrix01234567899 Not just the cpu's, but the compilers as well. There's a limit to what a cpu by itself can optimize, but if everything is lined up for it perfectly it can eat at the code like no other
This is not how I remember the 1980s. I wrote in assembly in exactly two circumstances. First for learning assembly programming (CDC 7600 and PDP-11). Second for programming a microcontroller (Intel 8048). Other than that, it was Fortran, Cobol, C, C++, Pascal, LISP, APL, Prolog, Turing, etc., etc. This was on mainframes, minis, workstations, and microcomputers. Understanding assembly was a great help in understanding how machines and compilers work. But actual use of assembly in university or industry was fairly rare.
@@matrix01234567899 Almost certainly. In the past, optimizing assembly was mostly a matter of various tricks in arithmetic (like bit shifting instead of multiplying) and optimizing memory usage. Those things really do not matter anymore, and optimization is a matter of streamlining instructions so they can be better executed in parallel (even in single threads, processors can do multiple instructions simultaneously), optimizing memory usage not in terms of space but cache access, and various other aspects that can only be done by rigorous calculation and not by reasoning like in the old days.
Assembly was mostly used for 80s computer games since you needed that extra speed
This was so much simpler in the 1980s. The .EXE was just a series of records, each with a memory address to which the machine code was to be loaded, a length of the record and the machine code to be loaded. I believe there was an entry point address too to indicate where the CPU should start executing. That was it. That's all that was necessary.
For a real challenge, analyze the .OBJ format. It's way WAY more complex.
I am a recovering bit twiddler. 😊
This is a good introduction to computer architecture
I wrote a lot in asm and hex a long time ago. This is a great video to see
Cool
I'd like to add that while you *can* get faster code by writing it in assembly, you should have some faith in the compiler, they're very smart these days. And I'm pretty sure the people who wrote them are smarter than me too.
Pretty cool. Would be nice to see what's inside of a .deb package as well. Pretty interesting stuff
Afaik that's a tar.xz with a different name? Maybe not exactly that, but it was a compressed tar of some sort.
The linux equivalent of an exe file is an elf file.
@@thepiratepeter4630 wait. So not .run / .sh file?
@@tilsgee a .sh file is just a shell script, it's not a binary
It is just a Linux executable file packed with some meta data like the software repository of that executable
Interesting to note that the PE executable format is almost identical to the Unix COFF format, which is the predecessor to the modern ELF format used in Linux and many other operating systems today. PE is in fact sometimes known as PE/COFF.
This video is criminally underrated
2:38 "Yes, Windows is just DOS at the core still."
This isn't true in any sense. The MZ header followed by "This program cannot be run in DOS mode" is there strictly for compatibility and does not affect the function of the program whatsoever.
What if you delete the MZ header manually? It still runs, right?
@@ravhi1000 You have to do some additional configuring, but yeah, it can be removed.
For everyone correcting the video saying that “windows is no longer based on DOS”: true, but, he didn’t say windows is based on DOS. He said the EXE file format hasn’t changed since the DOS days.
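For anyone who wants to see how the DOS part and the Windows part of an EXE hang together: the DOS header starts with "MZ" and stores, at offset 0x3C, the file offset of the "PE\0\0" signature. A minimal sketch, assuming a header-only fake image built in memory; the helper names `find_pe_header` and `make_fake_image` are made up, and the sketch assumes the PE header sits at offset 64 or later:

```python
import struct

def find_pe_header(data: bytes) -> int:
    """Return the file offset of the PE signature, or raise ValueError."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew: 32-bit little-endian offset stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    return pe_offset

def make_fake_image(pe_offset: int) -> bytes:
    """Build a hypothetical header-only image, for demonstration only."""
    dos_header = bytearray(64)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, pe_offset)
    # Padding where the "cannot be run in DOS mode" stub would normally sit.
    stub = b"\x00" * (pe_offset - len(dos_header))
    return bytes(dos_header) + stub + b"PE\0\0"

image = make_fake_image(0x80)

if __name__ == "__main__":
    print(hex(find_pe_header(image)))
```

To try it on a real file, read the file's bytes with `open(path, "rb").read()` and pass them to `find_pe_header`.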
An exe file is actually an archive format similar to .apk or .zip, this is demonstrated by opening a .exe file on a linux filesystem, it will display the contents as if it were an archive, and there you will find the icon, binary, etc.
Opening it in what? Different tools will do different things with it. If you cat it on the command line, you'll see what looks like garbage (and likely also mess up terminal settings), because it's binary data. There aren't any core *nix tools I'm aware of that analyze .exe files, aside from the superficial analysis done by the "file" command (whose purpose is to identify many different types of files based on their content).
@@fllthdcrb I don't know, I just double clicked in ubuntu.
@@UFO_researcher So it's one of the graphical file managers. Well, I figured that much. It's almost certainly based on a file association, which can change if, for instance, you install new applications. If you were to install Wine, for instance, double-clicking might allow you to run such executables instead of just analyzing them. The point is, you can't assume everyone has the same setup. Just talking about "opening" a file, by itself, isn't as helpful as you assume. You could, however, try to see the name of the application that opens.
im pretty sure that was an abstraction
you dont know what you are talking about
Ahhhhh.. the Altair 8800 .I remember those days. Simple and direct.
Got this recommended. I'm currently working on a presentation about how a computer works. I think this may be a good visualisation to show the difference between "code types".
If you create a console application, it works by sending text through streams (stdin, stdout and stderr), and rendering the console window is done by the operating system (typically by conhost.exe).
Also, the disassembly doesn't show you what is inside the exe file, but what is inside RAM.
If you really want to see what is inside this exe, open it using 7-zip
> Also disasembly dont show you what is inside exe file, but inside RAM.
There are two types of analysis: static and dynamic.
A disassembler could produce fake disassembly because the code may be changed at runtime.
> If you really want to see what is inside this exe, open it using 7-zip
7-zip is not for analyzing executables
Yes, I forgot about the legendary 7-zip.
@@ufufuawa401 I meant the disassembly in Visual Studio he used in the video
No, 7-zip isn't, but it is still better than just guessing
1:07 "C is high level." No, it isn't. If you relate it to assembly or machine code it is, but in fact it is a low-level programming language.
in terms of operating systems it's high level
@@MaxCE Yes, it is. It’s low level compared to python, and it’s high level compared to operating systems. We’re both right
Neat video! Now i'm looking for one that explains exactly the same thing but for Linux machines ^^
Your knowledge is excellent, but I have to point out that there's a hole in your story: 1. first: an exe is a file, 2. yes, it contains a header and the machine code, 997: but then the multitude of exe formats? The hole is how the operating system starts the program: 3a. first the OS gets a command to start the program, 3b. it loads the program into memory, then it finds the addresses in the header and translates those to physical addresses (more or less relocation, and similar address translations), 3c. it looks up which libraries (DLLs) the program requires, checks whether those are loaded into memory, loads them if not, then finds the correct physical addresses of those DLLs and writes them into the program at the appropriate locations, 4. it finds the program entry point and starts executing machine code from there. The exe file variants emerge from there.
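Steps 3b to 4 above can be sketched as a toy model. This is not how the real Windows loader is implemented: every address, the dictionary layout, and the function names `load_library`, `resolve_imports` and `start_program` are made up for illustration. Only `kernel32.dll` and its exports `ExitProcess` and `WriteFile` are real names.

```python
# Hypothetical "DLLs" on disk: name -> (base address, {exported function: offset}).
AVAILABLE_LIBRARIES = {
    "kernel32.dll": (0x7000_0000, {"ExitProcess": 0x1000, "WriteFile": 0x2000}),
}
LOADED_LIBRARIES = {}

def load_library(name):
    """Step 3c, first half: load the library only if it is not in memory yet."""
    if name not in LOADED_LIBRARIES:
        LOADED_LIBRARIES[name] = AVAILABLE_LIBRARIES[name]
    return LOADED_LIBRARIES[name]

def resolve_imports(import_table):
    """Step 3c, second half: turn (library, function) names into addresses."""
    resolved = {}
    for library, function in import_table:
        base, exports = load_library(library)
        resolved[(library, function)] = base + exports[function]
    return resolved

def start_program(image):
    """Steps 3b and 4: place the image, resolve imports, report where to jump."""
    entry = image["load_address"] + image["entry_offset"]
    addresses = resolve_imports(image["imports"])
    return entry, addresses

# Hypothetical program image with one imported function.
image = {
    "load_address": 0x0040_0000,
    "entry_offset": 0x1000,
    "imports": [("kernel32.dll", "ExitProcess")],
}
entry, addresses = start_program(image)

if __name__ == "__main__":
    print(hex(entry), {k: hex(v) for k, v in addresses.items()})
```

The point of the sketch is the order of operations: the program's own code only starts once every imported name has been replaced by an address.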
Really interesting, looking forward for more.
Short answer: "A bunch of 1s and 0s that represent micro processor instructions", simple as that.
Slightly less short answer: "A bunch of 1s and 0s that represent micro processor instructions, that also aren't allowed to run in DOS mode"
@@smc415 Touché my friend.
creates programs that adds 2 hard coded numbers together, gets an executable that wouldn't fit on a NES cartridge. What a time we live in
New Sub! Dryden, Michigan
I solely sub'd for your Effortless ADHD Transition at "4:09"
About to turn 40, never treated for my extreme adhd growing up, That is exactly how I learned PCs in the 80/90s.
Learning how to replace the SOL.EXE icon in Windows 3.0 MME somehow turns into finding EVERY MsDos manual to teach myself QBasic within the same 60min lol
Just a little thing about the conclusion : most compilers optimize code way beyond our level of knowledge. To handicraft an assembly code which is actually faster requires extensive knowledge of the targeted instruction(s) set(s), so the best way to optimise is actually to ask the compiler to do it and then maybe optimise the generated code. Great video though
Lots of secrets are in these files, still unexplored. For example, I found a secret in chgcolor, that is a monitor driver for Dos. When all colors are defined by the user’s decision, games may have strange colours. When the reset is chosen then a restart, the b/w laptops can recognize 14 colours instead of the default 6. 2 colours remain missing, the lcd doesn’t recognize it in Dos mode. The 256 colour games will look much better, and more details can be seen using 256 kb video ram. Windows users need the wdl disks, 16 grays driver can recognize 16 different colours in Windows.
Awesome video!
DOS executable COM files need nothing more than machine code, but the file size is limited to 64 KB.
Right, COM files did not have any header whatsoever. However, they were always 16-bit DOS applications, so they are no longer supported on 64-bit systems. WinXP (32-bit version) did still support the execution of DOS and COM files.
@@Bunny99s All 32-bit Windows versions support running 16-bit COMs. 32-bit Windows 10 supports it, but 64-bit Windows XP does not.
@@matrix01234567899 Yes, that's true, but running Win 10 on a 32 bit machine would be pure madness ^^. 32 Bit systems can only address 4GB of ram. That's nowadays barely an option anymore. Windows alone would chew that up :D
But yes, you're right. Almost all 32 and 64 bit CPUs (with the exception of AMD Ryzen) when running in 32 bit mode do still support the virtual 8086 mode. Though how well the support is depends on the actual application. Certain exotic hardware stuff may break old code. The best solution is usually to just use DOSBox and emulate a machine.
@@Bunny99s On Win10 even less than 4GB is not madness if you don't run a web browser or other modern demanding software; the OS itself (even Win11) is OK with this amount of RAM.
To be correct, it was a decision made by Microsoft to stop supporting 16-bit apps on 64-bit systems; the CPU itself doesn't block this option. When running 32-bit apps on a 64-bit OS, or 16-bit apps on a 32-bit OS, the CPU changes modes many, many times a second as the OS does context switching.
PEs don't really do fat binaries (since the header only allows specifying one machine type); the "portable" just means that the format itself is processor agnostic.
Mac file forks are awesome in the way they work to get around this. Or at least they used to. I haven't messed around since OS 7.5 really. Also, Visual Basic 4.5 was the only Visual Basic to include a compile to .exe built in.
The more you know 🌈
Hey! Nice to see a fellow old Mac fan. I programmed a bit on pre-OSX Mac operating systems starting at System 7.0, and ending with the release of OSX. And yes, I completely agree... the Mac Resource fork and Data Fork paradigm was *amazing* and way ahead of its time. Remember using ResEdit??? :D Gooood times!
Unfortunately, Resource/Data forks as they existed pre-OSX aren't implemented in OSX.
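Back to the point at the top of this thread that a PE header can only name one machine type: the 20-byte COFF file header that follows the "PE\0\0" signature has exactly one 16-bit `Machine` field. A minimal sketch that decodes such a header; the sample header is fabricated for illustration, and `parse_coff_header` is a made-up helper name:

```python
import struct

# A few well-known values of the COFF "Machine" field.
MACHINE_NAMES = {0x014C: "x86 (i386)", 0x8664: "x64 (AMD64)", 0xAA64: "ARM64"}

def parse_coff_header(header: bytes) -> dict:
    """Decode the 20-byte COFF file header that follows the PE signature."""
    (machine, n_sections, _timestamp, _symtab_ptr,
     _n_symbols, opt_header_size, _characteristics) = struct.unpack(
        "<HHIIIHH", header[:20])
    return {
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "sections": n_sections,
        "optional_header_size": opt_header_size,
    }

# Hypothetical header: an x64 image with 3 sections and a 240-byte optional header.
sample = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x0022)
info = parse_coff_header(sample)

if __name__ == "__main__":
    print(info)
```

There is one slot for the CPU type and no list of alternatives, which is why a single PE file targets a single architecture.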
I remember days, when we had just .COM files (CP/M era!). Then that became too limiting, being basically tied to just one hardware. And too small. So, .EXE was introduced, to allow choices in linking. Then more and more libraries to be linked, until the different versions of the .EXE were required. I essentially stopped bothering after MS-DOS 6.2. Still have Microsoft Macro Assembler 5.0, though.
Fascinating stuff.. thanks for sharing!
all my EXEs live in TXTas
Remember the good old COM files for dos? 64kb of raw machine code with no header
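To make "raw machine code with no header" concrete, here is a minimal sketch that builds a complete COM program byte by byte. It assumes the classic DOS conventions: the image is loaded at offset 0x100, and INT 21h function 09h prints a string terminated by `$`. The message text and the `write_com` helper are made up, and running the result needs DOS or an emulator such as DOSBox.

```python
LOAD_OFFSET = 0x100  # DOS loads a .COM image at offset 0x100 of its segment

message = b"Hello from a COM file!$"  # function 09h prints up to the '$'

code_length = 8  # total size of the four instructions below
message_address = LOAD_OFFSET + code_length  # the text sits right after the code

program = bytes([
    0xB4, 0x09,                                  # mov ah, 09h  (print string)
    0xBA, message_address & 0xFF,                # mov dx, message_address
          message_address >> 8,                  #   (16-bit, low byte first)
    0xCD, 0x21,                                  # int 21h      (call DOS)
    0xC3,                                        # ret          (back to DOS)
]) + message

def write_com(path: str) -> None:
    """Write the image to disk; the whole file is the program, nothing else."""
    with open(path, "wb") as f:
        f.write(program)

if __name__ == "__main__":
    print(len(program), "bytes:", program.hex(" "))
```

The first byte of the file is the first instruction that runs; there is no signature, no section table and no import table.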
Python is indeed somewhat "compiled" before being interpreted.
3:55 the captions say toes
youtube : What's inside a .exe file?
_Me at 3 am: Lets find out!_
You may be interested in "A smallest PE executable (x64) with every byte executed" . It has only 268 bytes.
Apart from assembly really not being needed for performance anymore, most programmers will also fail at attempts to write assembly by hand that would outperform the optimizations done by modern compilers. Outside of embedded work, there really is no reason to write anything in assembly anymore. What's added to your EXE in this video is the C++ runtime code. You get a version tailored to console usage, but none of that is needed to run your sample program. If you would use an assembler (like NASM) instead of Visual Code, it will run just fine using only your code translated to machine language (well, and the PE header)
HAHA! the ending,😂 very interesting i had no idea about decompiling, now i'm down the rabbit hole
Next time someone in Germany sneezes, I won't say "Gesundheit" but "$A9 $38 $8D $00".
😂
Heads up - no one...NO ONE, EVER, wrote programs in machine code. Why? simple - each instruction code's mnemonic was known at the CPU design stage, and remember, those mnemonics (and operands) formed the 1:1 machine code:assembly language instruction set. Assembly language = 1:1 human-readable (mnemonic) version of machine code CPU instructions and operands. Understand that assembly language mnemonics were constructed at the same time as the instruction set was constructed - the designers never, ever expected programmers to memorise the numeric equivalents when those much, much easier to remember mnemonics were also available.
So, in those 'old days', every machine code program was actually written on paper in assembly language, and beside each assembly code line, the equivalent machine code instruction/opcode was written. Hand-written labels for branches/jumping/data was also used, obviously. Then, when it came to the time when the program would be entered into the computer, that was when the machine code equivalent was used...entering in all those numbers.
I know this because I used to do it many years ago, and if you think about it, it makes 100% sense ;)
So, what was the first assembler written in? It had to be written in machine code. The programs for the first computers were written in machine code, with assembly language developed later on because machine code was too difficult to write long programs in. A prime example was the Univac computer. Programs were written in machine code in the 1950s and no assembler was created until 1960.
@@palmercolson7037 You have failed to comprehend 100% of my post - that's quite an impressive achievement.
Your reply is somewhat confused...
"The programs for the first computers were written in machine code with Assembly language developed later on because machine code was to difficult to write long programs in. "
--> It's the other way round! Assembly languages are orders of magnitude 'easier' to write in than pure machine code, for self-evident reasons.
I'm also not talking about feeding assembly language into those early computers.
You also asked "So, what were the first assemblers written in?" My post perfectly answers this (hint: assembly language)...
Please re-read my post. Google anything you're not sure about.
@@ChrisM541 The very first assemblers were humans (of the female kind oddly enough). You are correct though. They would use tables on paper to correlate numbers to the mnemonics written on paper; basically the same thing as what assemblers still do.
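The paper-table lookup described in this thread is easy to sketch. The three 6502 opcodes below are real (`A9` = LDA immediate, `8D` = STA absolute, `60` = RTS); the tuple format of the listing, the `assemble` function and the target address `$D000` are made up for illustration.

```python
# Tiny lookup table in the spirit of the paper tables described above.
# Only three 6502 instructions are covered.
OPCODES = {
    ("LDA", "imm"): 0xA9,   # LDA #value  : load the accumulator with a constant
    ("STA", "abs"): 0x8D,   # STA address : store the accumulator to memory
    ("RTS", None): 0x60,    # RTS         : return from subroutine
}

def assemble(program):
    """Translate (mnemonic, addressing mode, operand) rows into machine code."""
    out = bytearray()
    for mnemonic, mode, operand in program:
        out.append(OPCODES[(mnemonic, mode)])
        if mode == "imm":
            out.append(operand & 0xFF)            # one-byte constant
        elif mode == "abs":
            out += operand.to_bytes(2, "little")  # 6502 addresses are low byte first
    return bytes(out)

listing = [
    ("LDA", "imm", 0x38),
    ("STA", "abs", 0xD000),
    ("RTS", None, None),
]
machine_code = assemble(listing)

if __name__ == "__main__":
    print(machine_code.hex(" "))  # a9 38 8d 00 d0 60
```

Each row of the listing maps to its bytes independently, which is what made doing this by hand on paper practical.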
a thing that windows s mode can't run but windows e mode , windows and wine can run
Literally every .exe creepypasta: main character evil lol
Me after watching this: wait, it’s all just data?
Other .exe files: always had been
Do note that "writing in ASM will make it faster" is just a myth. Most hand-written ASM isn't as efficient as the programmer might want. It's not portable either. And after all, you still need to rely on linkers to turn the ASM code into object code, which might have some performance overhead. Most modern linkers are extremely powerful nowadays; however, in general ASM isn't practical because it sacrifices portability and is just a burden to the programmer.
Meanwhile, C and C++ compilers have gotten so good that they beat hand-written ASM and those old NASM linkers. People might also argue about the binary size, but this isn't the '90s, and having 2 to 4 TB of disk space is normal nowadays, while the binary will just be under 500 KB (without stripping debug info).
If you, for some reason, want to write ASM, a better approach is to embed ASM inside a C or C++ program (inline ASM). However, only do so if you know what you're doing, as it might not be the best possible way to achieve performance.
Actually, it all depends on the platform
All Windows versions from XP onwards are not based on DOS. However, since Windows 95, 98 and Me run on top of DOS, the "This program cannot be run in DOS mode." error handler needs to be there to stop DOS from trying to run invalid code and crashing. This error handler even exists in UEFI boot files, since they are based on the PE format.
Another strange similarity UEFI has to Windows is part of the EFI shell many PCs have built-in, to troubleshoot and provide basic functionality when the PC has no operating system installed.
If you type a command incorrectly in the EFI shell, often the error message that appears is a near verbatim copy of the error you get in Windows' legacy CMD command prompt, "???? is not recognized as an internal or external command, operable program or script file."
How much i read this "This program cannot be run in the DOS Mode" 😂
With the right compiler settings I managed to get the exe size down to 2 KB, and on Linux a 0.5 KB binary. So yes, you have a lot of noise there. I don't think anyone writes exes manually; however, modifying an exe with a hex editor is not uncommon.
Interesting video.
Thanks
To clear up Windows vs. DOS versions, here is a summary (from the wiki):
Windows: Windows 1.0, 2.0, 3.x, 4.x (95, 98, Me) - boots DOS before Windows
Windows_NT: 3.x, 4.x, 2000, XP, Vista, 7, 8.x, 10, 11 - boots straight into Windows. It does not contain any DOS code, save perhaps in the NTVDM component. The notion that Windows_NT has any DOS code at its core is simply not true.
"Yes, Windows is still just DOS at the core." -- no. That's just backwards COMPATIBILITY with MS-DOS.
That doesn't mean it still IS MS-DOS at its core.
bro imagine having to put every byte of data into your code. thank god I wasnt alive trying to do this back then smh.
You're forgetting that while running stuff on an operating system you are never really programming the CPU directly, so you'll never be able to create your program without all the OS related fluff for it to work in that environment.
4:00 "...I don't know 64-bit assembly so I don't think that I'll be [hard cut] So first I had to get familiar with x64 assembly ..."
Gave me a good laugh. :)
great video!
4:11 I'm glad I learned Assembly in college and didn't have to search stuff online on my own the first time I learned it 😅. It's been ~5 years since I've last touched an assembly program (in Linux too, not even Windows 🥲) and I can't even find anything remotely close to the pdfs and ppts that were shared in the class. I regret not saving all the documents they gave us somewhere on my computer 😫. Good job figuring out how to make it work, but I noticed that your code looks very different from what I learned 🤔. Probably because ASM Linux and Windows are that different 🤷♂️.
last time i concerned myself with .exe files was when security in games was so bad that you could just add a jump command and crack the game...
inside of a .exe file is a demonic hedgehog demon
Bruh
its a joke
"demonic hedgehog demon" implies that there are also non-demonic hedgehog demons, meaning there is such a thing as a non-demonic demon
Outside of OS differences, the main difference between any binaries would be the intended architecture. I’m sure it’s more complicated but that’s the basic difference, so fat binaries aren’t very common.
Thanks for the video!
Great video. Talking about super low level code, have you seen how the game Roller Coaster Tycoon was made in assembly? Could do a great video.
The exe (PE or PE+) may not have any machine (unmanaged) code in it actually. It could have zero machine code and instead have intermediary language (IL) code that targets the common language runtime (CLR) converting it into managed code that gets compiled just in time.
Iirc C and C++ don't actually use ASM as an intermediary step. Most compilers like GCC and Clang will translate it into their respective IR, which is then processed by the backend (libgcc or llvm) and then transformed directly into an object. I think MSVC does this too. Don't quote me on that, but iirc that's how it works.
It's funny to think that Windows 10 is just built up on DOS. It's like a turtle trying to carry a skyscraper.
it's not actually. windows 10/11 are both based on NT
That „Gesundheit“ got me laughing
5:10 : is the size of the .exe also related to the cluster size of the file system or not ?
also exes contain some metadata, like an icon that's embedded in the exe, a description of the program that's stored in the exe, and other things
You’re mostly right, but you failed to mention that most exes need to talk to dynamic link library files, or DLLs, for specific functions or functionality, and these files cannot be run in Win32 mode, so in reality exe and dynamic link library files often work in conjunction
i subscribed just for the end