Making the smallest Windows application

It has been a while since I started following Dave Plummer's amazing adventures on his YouTube channel, Dave's Garage (take a look at his channel right now). He recently posted a video about making the "smallest Windows App in x86 ASM", where he was able to squeeze an entire (usable) Windows application into just 1488 bytes.

TL;DR: Jump to the last section

0. The main idea

1488 bytes is pretty damn good, but can we do better? A large amount of space in an executable is usually wasted by:
  1. Strings
  2. Statically linked libraries
One can decrease the impact of the latter by enabling link-time optimization (LTO). But is that enough? And what about strings? Can we really avoid them?

0.1. DavePl application

Dave Plummer, in his video, made a really valid point: it's quite trivial to make a small application when you just call "MessageBox".
So the challenge was to borrow his assembly source code and convert it to C (so that we end up with an application that behaves exactly the same) then write some more glue code and see what happens.

1. Finding kernel32.dll base address

In Win32, as soon as you can use a couple of APIs exported by kernel32.dll you can do virtually everything (I'm talking about LoadLibrary and GetProcAddress).
kernel32.dll is automatically loaded into the address space of every Windows process, right before the process starts executing. This means that, technically, we should be able to poke around in the process address space to find the kernel32.dll APIs that we need.

1.1. Understanding the InMemoryOrderModuleList

Before a process is actually executed by Windows, the kernel fills a struct in memory called TEB (Thread Environment Block), which contains a lot of information about the current thread and leads to information about the process itself. Now bear with me:
  1. On 32-bit x86, the FS segment register points to the TEB
  2. The TEB contains a pointer to the Process Environment Block (PEB)
  3. The PEB contains a pointer to the PEB_LDR_DATA struct
  4. The PEB_LDR_DATA struct contains an InMemoryOrderModuleList
InMemoryOrderModuleList contains something really useful: the base address where kernel32.dll is loaded into memory!

According to Microsoft docs:
InMemoryOrderModuleList is the head of a [circular] doubly-linked list that contains the loaded modules for the process. Each item in the list is a pointer to an LDR_DATA_TABLE_ENTRY.
Now replace "modules" with DLLs and you can get an idea of what we are going to do. Actually, InMemoryOrderModuleList contains pointers to LIST_ENTRYs, and each LIST_ENTRY is embedded in an LDR_DATA_TABLE_ENTRY. Pay attention to the first PVOID field of that struct: it sits before the LIST_ENTRY, which is why we can't simply cast the LIST_ENTRY pointer to an LDR_DATA_TABLE_ENTRY pointer, but need to take care of that "extra padding".

You can find a partial definition of LDR_DATA_TABLE_ENTRY in winternl.h. The problem is that said definition only exposes very few of the actual fields. However, playing with Microsoft-provided debug symbols, you can find a lot more "undocumented" fields that we need. I decided to define my own version of the struct (shown in section 1.3). As you can notice, there's a 116-byte padding field called pad. This is because there are a few dozen fields between DllBase and BaseNameHashValue that we don't really care about.
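To make the "extra padding" point concrete, here is a minimal, portable sketch of what CONTAINING_RECORD does. The types list_entry and module_entry, the macro ENTRY_FROM_LINK and the function dll_base_from_link are hypothetical stand-ins (not the real Windows definitions), so that the snippet builds anywhere:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for LIST_ENTRY and LDR_DATA_TABLE_ENTRY. */
typedef struct list_entry {
    struct list_entry *Flink;
    struct list_entry *Blink;
} list_entry;

typedef struct {
    void      *Reserved1[2];        /* the "extra padding" before the link */
    list_entry InMemoryOrderLinks;  /* the list points here, not at the start */
    void      *DllBase;
} module_entry;

/* Same idea as CONTAINING_RECORD: step back by the offset of the field. */
#define ENTRY_FROM_LINK(link) \
    ((module_entry *)((uint8_t *)(link) - offsetof(module_entry, InMemoryOrderLinks)))

/* Returns the DllBase of the entry that owns the given list link. */
void *dll_base_from_link(list_entry *link) {
    assert(link != NULL);
    return ENTRY_FROM_LINK(link)->DllBase;
}
```

A plain cast of the link pointer would land two pointers too far into the struct, which is why the offset has to be subtracted first.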

1.2. What is a hash?

To put it simply: a hash function maps an input \(A\) of arbitrary length to a fixed-size value \(B\): \[B = hash\_func(A)\] Moreover, hash functions are engineered in such a way that it is very likely that different \(A\)s produce different \(B\)s. This (almost always) means that, letting \(B_1=hash\_func(A_1)\) and \(B_2=hash\_func(A_2)\) then: \[B_1 = B_2 \Longrightarrow A_1 = A_2\]

1.3. The initAPI function

Enough theory. Where's the code?
#define KERNEL32DLL_HASH 0x536CD652u

typedef struct {
    PVOID Reserved1[2];
    LIST_ENTRY InMemoryOrderLinks;
    PVOID Reserved2[2];
    PVOID DllBase;
    uint8_t pad[116];
    ULONG BaseNameHashValue;
} MY_LDR_DATA_TABLE_ENTRY;

void initAPI(...) {
    PPEB peb = (PPEB)__readfsdword(0x30); /* offset 0x30 of the TEB holds the PEB pointer */
    uintptr_t kernel32Base = 0;

    for (
        PLIST_ENTRY ptr = peb->Ldr->InMemoryOrderModuleList.Flink;
        kernel32Base == 0;
        ptr = ptr->Flink)
    {
        MY_LDR_DATA_TABLE_ENTRY *e = CONTAINING_RECORD(ptr, MY_LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        
        if (e->BaseNameHashValue == KERNEL32DLL_HASH)
            kernel32Base = (uintptr_t)e->DllBase;
    }

    ...
}
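A couple of things you might want to know:
  1. __readfsdword(0x30) reads the 4 bytes at offset 0x30 from the FS segment base, that is, the PEB pointer stored in the TEB
  2. CONTAINING_RECORD subtracts the offset of InMemoryOrderLinks from the LIST_ENTRY pointer, which takes care of the "extra padding" mentioned earlier
So, we now know where kernel32.dll is loaded in memory. Let's move on.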

2. Parsing the export tables

Now you might ask, what's the point of knowing a DLL base address? Well, given a DLL base address, we can:
  1. Parse the DLL headers
  2. Find the pointers to the exported functions we want to use
  3. Use them

2.1. Our own hashing function

We need a hashing function. It has to be reasonably good and short, so that it doesn't take a lot of space. I ended up choosing \(djb2\) (read more here):
uint32_t djb2(uint8_t* str) {
    uint32_t hash = 5381;
    uint8_t c;

    while (c = *(str++))
        hash = ((hash << 5u) + hash) + c;

    return hash;
}
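As a quick sanity check, the same algorithm can be run on an API name. This is a self-contained sketch: the name djb2_str is hypothetical and only keeps the snippet separate from the function above, and it takes a const char * for convenience.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same algorithm as djb2 above: hash = hash * 33 + c, starting from 5381.
   The name djb2_str is hypothetical, used only for this sketch. */
uint32_t djb2_str(const char *str) {
    assert(str != NULL);
    uint32_t hash = 5381u;
    for (const unsigned char *p = (const unsigned char *)str; *p != 0; p++)
        hash = ((hash << 5) + hash) + *p;   /* wraps modulo 2^32 */
    return hash;
}
```

With this sketch, djb2_str("LoadLibraryA") evaluates to 0x5FBFF0FB, which is the first entry of the Kernel32Hashes array in the next section.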

2.2. The APIs we need to call

We need to call different APIs from different DLLs.

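We can proceed by computing the hash of each API name. For example, from kernel32.dll we need LoadLibraryA, GetModuleHandleA, GetCommandLineA, GetStartupInfoA and ExitProcess. Then save these hashes for later in an array, in the same order as the function pointers in the struct defined below: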
uint32_t Kernel32Hashes[] = {
    0x5FBFF0FBu,
    0x5A153F58u,
    0xB511FC4Du,
    0x348B7545u,
    0xB769339Eu
};
Let's define a function pointer type for each API. In this example:
typedef HMODULE(__stdcall* GetModuleHandleA_t)(LPCSTR);
typedef LPSTR(__stdcall* GetCommandLineA_t)();
typedef void(__stdcall* GetStartupInfoA_t)(LPSTARTUPINFOA);
typedef void(__stdcall* ExitProcess_t)(UINT);
typedef HMODULE(__stdcall* LoadLibraryA_t)(LPCSTR);
Where __stdcall is a calling convention, used by almost all Windows APIs.

Now these function pointers need to be saved somewhere: we can use a struct.
#pragma pack(push, 1)
typedef __declspec(align(1)) struct {
    LoadLibraryA_t _LoadLibraryA;
    GetModuleHandleA_t _GetModuleHandleA;
    GetCommandLineA_t _GetCommandLineA;
    GetStartupInfoA_t _GetStartupInfoA;
    ExitProcess_t _ExitProcess;

    ...
} API;
#pragma pack(pop)
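Notice how we used some directives to the compiler: #pragma pack(push, 1) and __declspec(align(1)) ask it not to insert any padding between the fields, so that the struct can be treated as a plain array of pointers. The leading underscore is needed to avoid name clashes with the Windows headers and their #defines.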
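The reason the packing matters is that the struct is later filled through a void ** as if it were an array. Here is a minimal, portable sketch of that idea. The names mini_api, add_impl, neg_impl and fill_mini_api are hypothetical, and the snippet assumes a platform where function pointers and void * have the same size and can be converted into each other (true on x86 Windows and on POSIX systems, but not guaranteed by ISO C):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*add_t)(int, int);
typedef int (*neg_t)(int);

#pragma pack(push, 1)
typedef struct {
    add_t _add;
    neg_t _neg;
} mini_api;
#pragma pack(pop)

static int add_impl(int a, int b) { return a + b; }
static int neg_impl(int a) { return -a; }

/* Fills the struct slot by slot through a void** view, the way a resolver
   would. This relies on there being no padding between the pointer fields. */
void fill_mini_api(mini_api *api) {
    assert(api != NULL);
    assert(sizeof(mini_api) == 2 * sizeof(void *));
    void **slots = (void **)&api->_add;
    slots[0] = (void *)add_impl;
    slots[1] = (void *)neg_impl;
}
```

After fill_mini_api(&api), calling api._add(2, 3) goes through the first slot and api._neg(4) through the second one.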

2.3. Understanding the PE format

First off: RVA stands for Relative Virtual Address. It's an offset to a part of the DLL, relative to the DLL base address: adding it to the base address gives a pointer in memory.

Now, this is what we are going to do:
  1. Find the DllBase
  2. Move forward by 0x3C bytes to read the PE header position
  3. Go to the PE header
  4. Move forward by 0x78 bytes to find the Export table RVA
  5. Go to the Export table
  6. Skip the first 0x18 bytes (we don't need them)
  7. Save Exported functions (the number of exported function names, i.e. the length of the Name pointer table)
  8. Save Address table RVA
  9. Save Name pointer table RVA
  10. Save Ordinal table RVA
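The steps above can be sketched in portable C on a byte buffer that stands in for the mapped DLL. The names read_u32, export_info and read_export_info are hypothetical, and the offsets are the ones listed above, which assume a 32-bit (PE32) image:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value at base + offset without alignment assumptions
   (host byte order; x86 is little-endian, like the PE format). */
static uint32_t read_u32(const uint8_t *base, uint32_t offset) {
    uint32_t v;
    memcpy(&v, base + offset, sizeof v);
    return v;
}

typedef struct {
    uint32_t number_of_names;
    uint32_t address_table_rva;
    uint32_t name_pointer_table_rva;
    uint32_t ordinal_table_rva;
} export_info;

/* Follows steps 2-10 on an image mapped at `image`. */
export_info read_export_info(const uint8_t *image) {
    assert(image != NULL);
    uint32_t pe_offset  = read_u32(image, 0x3C);             /* steps 2-3 */
    uint32_t export_rva = read_u32(image, pe_offset + 0x78); /* steps 4-5 */
    uint32_t fields     = export_rva + 0x18;                 /* step 6 */
    export_info info;
    info.number_of_names        = read_u32(image, fields + 0x0); /* step 7 */
    info.address_table_rva      = read_u32(image, fields + 0x4); /* step 8 */
    info.name_pointer_table_rva = read_u32(image, fields + 0x8); /* step 9 */
    info.ordinal_table_rva      = read_u32(image, fields + 0xC); /* step 10 */
    return info;
}
```

Using memcpy instead of pointer casts keeps the sketch free of alignment issues; the real code below reads the same four fields by overlaying a packed struct.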

2.4. How to find an exported function pointer

Let's say we need to find the pointer to a function called "Function1". We need to:
  1. Linearly scan the Export name pointer table until we find the function name "Function1"
  2. At the same (logical) offset, in the Export ordinal table, find the ordinal number of the function
  3. Use the ordinal as a (logical) offset in the Export address table
  4. Congrats! You found the function pointer
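The lookup above can be sketched on a small in-memory model of the three tables. The export_tables struct and the find_export function are hypothetical and compare names directly with strcmp; the real code in the next section compares hashes instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory model of the three export tables. */
typedef struct {
    const char *const *names;     /* export name pointer table */
    const uint16_t    *ordinals;  /* export ordinal table, parallel to names */
    const uint32_t    *addresses; /* export address table, indexed by ordinal */
    size_t             name_count;
} export_tables;

/* Returns the address-table entry for `wanted`, or 0 if it is not exported. */
uint32_t find_export(const export_tables *t, const char *wanted) {
    assert(t != NULL && wanted != NULL);
    for (size_t i = 0; i < t->name_count; i++) {
        if (strcmp(t->names[i], wanted) == 0) {   /* step 1 */
            uint16_t ordinal = t->ordinals[i];    /* step 2 */
            return t->addresses[ordinal];         /* step 3 */
        }
    }
    return 0;
}
```

Note that the index in the name table and the index in the address table are generally different: the ordinal table is what links the two.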

2.5. The findFunc function

#pragma pack(push, 1)
struct dll_info {
    uint32_t exported_functions;
    uintptr_t address_table_RVA;
    uintptr_t name_pointer_table_RVA;
    uintptr_t ordinal_table_RVA;
};
#pragma pack(pop)

void findFunc(uintptr_t dllBase, uint32_t* hashes, void** ptrs, size_t size) {
    uintptr_t PE_RVA = *(uintptr_t*)((uint8_t*)dllBase + 0x3Cu);
    uintptr_t PE = dllBase + PE_RVA;

    uintptr_t export_table_RVA = *(uintptr_t*)((uint8_t*)PE + 0x78u);
    struct dll_info *dll = (struct dll_info*)((uint8_t*)dllBase + export_table_RVA + 0x18u);

    uintptr_t name_pointer_table_entry_RVA = dll->name_pointer_table_RVA;
    uint32_t i, j;

    uintptr_t ordinal_function_RVA;
    uint16_t ordinal_function;
    uintptr_t function_RVA;

    for (i = 0; i < dll->exported_functions; i++, name_pointer_table_entry_RVA += 4) {
        uintptr_t function_name_RVA = *(uintptr_t*)((uint8_t*)dllBase + name_pointer_table_entry_RVA);
        uint8_t* function_name = (uint8_t*)dllBase + function_name_RVA;
        uint32_t function_hash = djb2(function_name);

        for (j = 0; j < size; j++) {
            if (function_hash == hashes[j]) {
                ordinal_function_RVA = dll->ordinal_table_RVA + i * 2;
                ordinal_function = *(uint16_t*)((uint8_t*)dllBase + ordinal_function_RVA);
                function_RVA = *(uintptr_t*)((uint8_t*)dllBase + dll->address_table_RVA + ordinal_function * 4);

                ptrs[j] = (uint8_t*)dllBase + function_RVA;

                break;
            }
        }
    }
}
findFunc takes as its inputs:
  1. The base address where the DLL is loaded
  2. An array of function names' hashes
  3. An (empty) array of function pointers
  4. The length of the arrays
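Every function name exported by the DLL is hashed with \(djb2\) and compared with each hash in the array passed as the second parameter. If it's a match, the pointer to that function is stored in the void ** array at the same index. For example, after the function is executed: \[hashes[0]=djb2(\text{"LoadLibraryA"})=\text{0x5FBFF0FB} \Longrightarrow ptrs[0] = \text{function pointer to LoadLibraryA}\]
A couple of things you might want to understand:
  1. Entries in the name pointer table and in the address table are 4 bytes wide, while entries in the ordinal table are 2 bytes wide: hence the += 4, the i * 2 and the ordinal_function * 4
  2. Reading RVAs through uintptr_t only works because the target is 32-bit x86, where uintptr_t is 4 bytes wide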
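The hash-matching part of findFunc can be isolated from the PE parsing. Below is a minimal, portable sketch under that assumption: hash_name and resolve_by_hash are hypothetical names, and the exported names and their values are passed in as plain arrays instead of being read from a DLL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* djb2, as in section 2.1 (hypothetical helper name). */
static uint32_t hash_name(const char *s) {
    uint32_t h = 5381u;
    while (*s)
        h = ((h << 5) + h) + (unsigned char)*s++;
    return h;
}

/* For every name whose hash appears in `hashes`, stores the matching value
   in the slot with the same index. Returns how many slots were filled. */
size_t resolve_by_hash(const char *const *names, void *const *values, size_t count,
                       const uint32_t *hashes, void **slots, size_t size) {
    assert(names && values && hashes && slots);
    size_t filled = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t h = hash_name(names[i]);
        for (size_t j = 0; j < size; j++) {
            if (h == hashes[j]) {
                slots[j] = values[i];
                filled++;
                break;
            }
        }
    }
    return filled;
}
```

The slot index comes from the position of the hash in the array, not from the position of the name in the export table, which is why the order of the hashes has to match the order of the fields in the API struct.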

2.6. Getting the kernel32.dll APIs

Now everything is set up to load the pointers to the kernel32.dll functions into the API struct.
Since the API struct is packed we can modify our initAPI function as follows:
void initAPI(API* api) {
    PPEB peb = (PPEB)__readfsdword(0x30);
    uintptr_t kernel32Base = 0;

    for (
        PLIST_ENTRY ptr = peb->Ldr->InMemoryOrderModuleList.Flink;
        kernel32Base == 0;
        ptr = ptr->Flink)
    {
        MY_LDR_DATA_TABLE_ENTRY *e = CONTAINING_RECORD(ptr, MY_LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        
        if (e->BaseNameHashValue == KERNEL32DLL_HASH)
            kernel32Base = (uintptr_t)e->DllBase;
    }

    findFunc(kernel32Base, Kernel32Hashes, (void**)&api->_LoadLibraryA, ARRAYSIZE(Kernel32Hashes));
}
Where ARRAYSIZE is a macro imported by including windows.h, which returns the length of a static array.

2.7. Getting the user32.dll and gdi32.dll APIs

We actually need other functions exported by user32.dll and gdi32.dll.

The problem is that we don't know the base addresses of these DLLs, nor are they automatically loaded like kernel32.dll. We can use a trick however: the LoadLibrary function exported by kernel32.dll.
LoadLibrary loads a DLL by its file name and returns an HMODULE, which, in fact, is just a PVOID, pointing to the DLL base address: we can cast it to a uintptr_t and use it in our findFunc.

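We also need to define the function pointer types for the new APIs, add them to the API struct and compute the hashes of their names (User32Hashes and Gdi32Hash). The initAPI function needs to be changed accordingly: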
void initAPI(API* api) {
    PPEB peb = (PPEB)__readfsdword(0x30);
    uintptr_t kernel32Base = 0;

    for (
        PLIST_ENTRY ptr = peb->Ldr->InMemoryOrderModuleList.Flink;
        kernel32Base == 0;
        ptr = ptr->Flink) 
    {
        MY_LDR_DATA_TABLE_ENTRY *e = CONTAINING_RECORD(ptr, MY_LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        
        if (e->BaseNameHashValue == KERNEL32DLL_HASH)
            kernel32Base = (uintptr_t)e->DllBase;
    }

    findFunc(kernel32Base, Kernel32Hashes, (void**)&api->_LoadLibraryA, ARRAYSIZE(Kernel32Hashes));

    HMODULE user32 = api->_LoadLibraryA("USER32.DLL");
    findFunc((uintptr_t)user32, User32Hashes, (void**)&api->_LoadIconA, ARRAYSIZE(User32Hashes));

    HMODULE gdi32 = api->_LoadLibraryA("GDI32.DLL");
    findFunc((uintptr_t)gdi32, &Gdi32Hash, (void**)&api->_SetBkMode, 1);
}
Where in the last findFunc call we just pass 1 as the array size, since we need just 1 function in gdi32.dll. You can find the new function pointers definition here and the new hashes here.

3. Producing a small executable

3.1. Visual Studio configuration

Clearly, we might want to tweak MSVC and the linker to produce a small binary. I made a new configuration called "MinRel" based on the standard "Release" one. After playing around with the configuration I ended up with these command line parameters:

Compiler

/permissive- /ifcOutput "MinRel\" /GS- /analyze- /W3 /Gy /Zc:wchar_t- /Gm- /O1 /Ob0 /sdl- /Fd"MinRel\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /fp:except- /errorReport:prompt /GF /WX- /Zc:forScope /arch:IA32 /Gd /Oy /Oi /MD /FC /Fa"MinRel\" /nologo /Zl /Fo"MinRel\" /Os /Fp"MinRel\MinHW.pch" /diagnostics:column

Linker

/OUT:"C:\Users\Davide\source\repos\MinHW\MinRel\MinHW.exe" /MANIFEST:NO /PDB:"C:\Users\Davide\source\repos\MinHW\MinRel\MinHW.pdb" /DYNAMICBASE:NO /MACHINE:X86 /ENTRY:"main" /WINMD:NO /OPT:REF /SAFESEH:NO /INCREMENTAL:NO /PGD:"C:\Users\Davide\source\repos\MinHW\MinRel\MinHW.pgd" /SUBSYSTEM:WINDOWS /MANIFESTUAC:NO /ManifestFile:"MinRel\MinHW.exe.intermediate.manifest" /LTCGOUT:"MinRel\MinHW.iobj" /OPT:ICF /ERRORREPORT:PROMPT /NOLOGO /ALIGN:16 /NODEFAULTLIB /TLBID:1 /MERGE:.rdata=.text /MERGE:.data=.text /EMITPOGOPHASEINFO /RELEASE /STUB:"$(MSBuildProjectDirectory)\stub.bin"
You can see some undocumented or uncommon flags here, such as /ALIGN:16, /MERGE, /EMITPOGOPHASEINFO and /STUB. Even though I'm quite sure we don't actually need all of these flags, I ended up with a 1312-byte EXE.

3.2. Removing some bytes

We can cut some extra bytes by opening the EXE with a hex editor. The last 0x5D bytes of the executable are all \(0\)s: we can delete them.
We also need to change the SizeOfCode field in the IMAGE_OPTIONAL_HEADER. To put it simply, just go to offset 0x8C and decrease whatever you find (0x90 in my case) by the number of bytes we removed earlier (0x5D).
We end up with a fully working 1219-byte executable, roughly \(18.1\%\) smaller than the original 1488-byte one. Not bad. But can we do better?
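For the record, the arithmetic behind these numbers can be checked with a couple of tiny helpers. This is only a sketch, and the function names trimmed_size and reduction_permille are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* New size after dropping `trimmed` trailing bytes from a file or field. */
uint32_t trimmed_size(uint32_t size, uint32_t trimmed) {
    assert(trimmed <= size);
    return size - trimmed;
}

/* Size reduction in tenths of a percent, rounded to the nearest integer. */
uint32_t reduction_permille(uint32_t original, uint32_t reduced) {
    assert(original > 0 && reduced <= original);
    return ((original - reduced) * 1000u + original / 2u) / original;
}
```

Here trimmed_size(1312, 0x5D) gives 1219, trimmed_size(0x90, 0x5D) gives 0x33 (the new SizeOfCode value in my case), and reduction_permille(1488, 1219) gives 181, that is 18.1%.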

3.3. What about assembly?

Now someone might ask: "why didn't you use assembly?". Well, I did.
Here is a gist with some code I wrote to show a MessageBox. For your own pleasure, a GIF (joking, it's a looped video) of me writing assembly at really high speed. (The video has been sped up to 10x)

Now the funny thing: after compiling the source code with MASM (ML actually), I ended up with a binary with exactly the same size as the C one.
The thing is that, today, compilers are pretty damn good, so there isn't really any point in writing assembly (even if there might be some exceptions).

4. Asking DavePl

Here comes the point where I asked for help. I reached out to Dave Plummer himself to tell him about my final EXE size and ask him if he knew other ways to reduce it. Sure enough, he promptly gave me some pieces of advice, which the next two sections build on. Thanks Dave!

4.1. Avoiding the stack

Generally speaking, moving a struct from the stack to the BSS or the data section generates a smaller binary. This is because if you initialize a struct on the stack, a MOV instruction has to be used to fill each field of the struct, and this takes quite some space. On the other hand, if the struct is stored in the BSS or the data section, the whole section is just copied into memory in one go when the executable starts.
I immediately moved the WNDCLASSEX to the global scope. This saved a bunch of bytes (I can't remember exactly how many).

4.2. Using Crinkler

In its GitHub repo, the about section states that:
Crinkler is an executable file compressor (or rather, a compressing linker) for Windows for compressing small demoscene executables. As of 2020, it is the most widely used tool for compressing 1k/4k/8k intros.
I was quite skeptical at the beginning: approximately 1200 bytes is very small by today's standards, maybe we can't go lower. But guess what? There's always room for improvement.

After a few minutes reading the manual, I ended up executing the following command in the same directory where Visual Studio generates the .obj files:
Crinkler.exe /NODEFAULTLIB /ENTRY:main /SUBSYSTEM:WINDOWS /TINYHEADER /NOINITIALIZERS /UNSAFEIMPORT /ORDERTRIES:1000 /TINYIMPORT /LIBPATH:"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.18362.0\um\x86" MinHW.obj api.obj kernel32.lib
Let's analyze some flags:

Notice how we need to link kernel32.lib because Crinkler emits some code that needs to be linked against it. If you have a Windows 10 SDK installed, you can find your x86 kernel32.lib file under:
C:\Program Files (x86)\Windows Kits\10\Lib\<version>\um\x86
Replace <version> with the newest version you have installed.

Read the following section to know the latest size.

5. Conclusions

The final size is:

874 bytes

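This means that the new executable is roughly \(41.3\%\) smaller than Dave's original 1488-byte one and roughly \(28.3\%\) smaller than my 1219-byte one. Wow!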

5.1. Compile it on your own

I made a zip that you can use to compile your own 874-byte executable. To use it:
  1. Extract the zip somewhere
  2. Find kernel32.lib as discussed in the previous section and edit build.bat accordingly
  3. Make sure you have Visual Studio with MSVC + Windows 10 SDK installed
  4. Search your computer for x86 Native Tools Command Prompt for VS 2019, or for x64_x86 Cross Tools Command Prompt for VS 2019 if you want to cross compile on an x86_64 system, and open it
  5. cd into the directory where you extracted the zip
  6. Execute build.bat
If everything goes well, after a bunch of warnings, you will find the resulting executable in the extracted directory.

5.2. Further improvements

If you have some ideas on how to shrink the executable even more, email me at info.davide99@gmail.com.

5.3. Special thanks

Obviously, special thanks go to Dave Plummer. He gave me a great coding adventure to work with and some great tips. Thank you for all the effort you put into things.