Making the smallest Windows application possible

0. The main idea

1488 bytes is pretty damn good, but can we do better? A large amount of space in an executable is usually wasted by:

Strings
Statically linked libraries

One can decrease the latter's impact by enabling link-time optimization (LTO). But is that enough? And what about strings? Can we really avoid them?

0.1. DavePl application

Dave Plummer, in his video, made a really valid point: it's quite trivial to make a small application when you just call "MessageBox".
So the challenge was to borrow his assembly source code and convert it to C (so that we end up with an application that behaves exactly the same) then write some more glue code and see what happens.

1. Finding `kernel32.dll` base address

In Win32, once you can call a couple of APIs exported by kernel32.dll (I'm talking about LoadLibrary and GetProcAddress), you can do virtually everything.
kernel32.dll is automatically loaded into every Windows process's address space before the process starts executing. This means that, technically, we should be able to poke around in the process address space to find the kernel32.dll APIs that we need.

1.1. Understanding the InMemoryOrderModuleList

Before a thread starts executing, Windows fills in a struct in memory called the Thread Environment Block (TEB), which contains a lot of information about the thread and, indirectly, about the process it belongs to. Now bear with me:

On 32-bit Windows, the FS segment register points to the TEB
The TEB contains a pointer to the Process Environment Block (PEB)
The PEB contains a pointer to the PEB_LDR_DATA struct
The PEB_LDR_DATA struct contains an InMemoryOrderModuleList

InMemoryOrderModuleList contains something really useful: the base address where kernel32.dll is loaded into memory!

According to Microsoft docs:

InMemoryOrderModuleList is the head of a [circular] doubly-linked list that contains the loaded modules for the process. Each item in the list is a pointer to an LDR_DATA_TABLE_ENTRY.

Now replace "modules" with DLLs and you can get an idea of what we are going to do.

Actually, InMemoryOrderModuleList contains pointers to LIST_ENTRYs. Each LIST_ENTRY is wrapped in an LDR_DATA_TABLE_ENTRY.

Pay attention to the first PVOID field (Reserved1 in the struct defined in section 1.3): that's the reason why we can't blindly cast the LIST_ENTRY to an LDR_DATA_TABLE_ENTRY, but need to take care of that "extra padding".

You can find a partial definition of LDR_DATA_TABLE_ENTRY in winternl.h. The problem is that said definition only exposes very few of the actual fields. However, playing with Microsoft-provided debug symbols, you can find a lot more "undocumented" fields that we need. I decided to define my own version of the struct (shown in section 1.3). As you can notice, it contains a 116-byte padding field named pad. This is because there are a few dozen fields between DllBase and BaseNameHashValue that we don't really care about.

1.2. What is a hash?

To put it simply:

Let \(A\) be a varying length bit string
\(hash\_func(A)=B\)
\(B\) is a fixed-length bit string, called hash

Moreover, hash functions are engineered in such a way that it is very likely that different \(A\)s produce different \(B\)s. This (almost always) means that, letting \(B_1=hash\_func(A_1)\) and \(B_2=hash\_func(A_2)\) then: \[B_1 = B_2 \Longrightarrow A_1 = A_2\]

1.3. The `initAPI` function

Enough theory. Where's the code?

#define KERNEL32DLL_HASH 0x536CD652u

typedef struct {
    PVOID Reserved1[2];
    LIST_ENTRY InMemoryOrderLinks;
    PVOID Reserved2[2];
    PVOID DllBase;
    uint8_t pad[116];
    ULONG BaseNameHashValue;
} MY_LDR_DATA_TABLE_ENTRY;

void initAPI(...) {
    PPEB peb = (PPEB)__readfsdword(0x30);
    uintptr_t kernel32Base = 0;

    for (
        PLIST_ENTRY ptr = peb->Ldr->InMemoryOrderModuleList.Flink;
        kernel32Base == 0;
        ptr = ptr->Flink)
    {
        MY_LDR_DATA_TABLE_ENTRY *e = CONTAINING_RECORD(ptr, MY_LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        
        if (e->BaseNameHashValue == KERNEL32DLL_HASH)
            kernel32Base = (uintptr_t)e->DllBase;
    }

    ...
}

A couple of things you might want to know:

I had to typedef my LDR_DATA_TABLE_ENTRY with a MY_ prefix to avoid name clashes
__readfsdword is an MSVC intrinsic that behaves like:

MOV EAX, FS:[0x30]
MOV peb, EAX
0x30 is the offset of the PEB pointer in the TEB
CONTAINING_RECORD is a macro that "casts" the LIST_ENTRY pointer to an LDR_DATA_TABLE_ENTRY pointer, taking care of the "extra padding"
e->BaseNameHashValue contains the hash of the base name of the DLL file. In this case: \(hash\_func(“kernel32.dll”)=0x536CD652\)
We don't care what hash function Windows internally uses to compute the hash
e->DllBase contains the base address where the DLL is loaded into memory. If NULL, then we are dealing with the list head.
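
One caveat about the loop above: the list is circular, so if no entry ever matches, the walk goes on forever. The following is a portable sketch of the same walk that stops after one full lap. It does not use the Windows headers: `list_entry`, `module_entry`, `CONTAINER_OF` and `find_module_base` are hypothetical stand-ins for LIST_ENTRY, MY_LDR_DATA_TABLE_ENTRY, CONTAINING_RECORD and the loop in initAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for LIST_ENTRY: a node of a circular doubly-linked list. */
typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

/* Stand-in for MY_LDR_DATA_TABLE_ENTRY; the links are NOT the first field. */
typedef struct {
    void *reserved[2];
    list_entry links;
    void *dll_base;
    uint32_t name_hash;
} module_entry;

/* Same idea as CONTAINING_RECORD: subtract the field offset from the field address. */
#define CONTAINER_OF(ptr, type, field) \
    ((type *)((char *)(ptr) - offsetof(type, field)))

/* Returns the base of the first entry whose hash matches,
   or NULL once the walk is back at the list head. */
void *find_module_base(list_entry *head, uint32_t wanted_hash) {
    assert(head != NULL);
    for (list_entry *p = head->flink; p != head; p = p->flink) {
        module_entry *e = CONTAINER_OF(p, module_entry, links);
        if (e->name_hash == wanted_hash)
            return e->dll_base;
    }
    return NULL;
}
```

In the real loader data the head is the InMemoryOrderModuleList field itself, so the equivalent stop condition in initAPI would be `ptr != &peb->Ldr->InMemoryOrderModuleList`.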

So, we now know where kernel32.dll is loaded in memory. Let's move on.

2. Parsing the export tables

Now you might ask, what's the point of knowing a DLL base address? Well, given a DLL base address, we can:

Parse the DLL headers
Find the pointers to the exported functions we want to use
Use them

2.1. Our own hashing function

We need a hashing function. It has to be reasonably good and short, so that it doesn't take a lot of space. I ended up choosing \(djb2\) (read more here):

uint32_t djb2(uint8_t* str) {
    uint32_t hash = 5381;
    uint8_t c;

    while (c = *(str++))
        hash = ((hash << 5u) + hash) + c;

    return hash;
}

2.2. The APIs we need to call

We need to call different APIs from different DLLs.

We can proceed by computing the hash of each API name. For example, from kernel32.dll:

\(djb2(“LoadLibraryA”)=\text{0x5FBFF0FB}\)
\(djb2(“GetModuleHandleA”)=\text{0x5A153F58}\)
\(djb2(“GetCommandLineA”)=\text{0xB511FC4D}\)
\(djb2(“GetStartupInfoA”)=\text{0x348B7545}\)
\(djb2(“ExitProcess”)=\text{0xB769339E}\)
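
The values above can be reproduced offline with a small helper, so that only the numbers end up in the final executable. This is a minimal sketch: `print_hashes` is a hypothetical name, and the function is meant to be built as a separate tool, not linked into the tiny EXE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Same djb2 as above: hash = hash * 33 + c, truncated to 32 bits. */
uint32_t djb2(const char *str) {
    uint32_t hash = 5381;
    unsigned char c;

    assert(str != NULL);
    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + c;
    return hash;
}

/* Prints one array initializer line per name, with the name as a comment. */
void print_hashes(const char *const *names, size_t count) {
    for (size_t i = 0; i < count; i++)
        printf("0x%08Xu, /* %s */\n", (unsigned)djb2(names[i]), names[i]);
}
```

For instance, `djb2("LoadLibraryA")` evaluates to 0x5FBFF0FB, matching the first entry in the list.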

Then save these hashes for later in an array:

uint32_t Kernel32Hashes[] = {
    0x5FBFF0FBu,
    0x5A153F58u,
    0xB511FC4Du,
    0x348B7545u,
    0xB769339Eu
};

Let's define a function pointer type for each API, in this example:

typedef HMODULE(__stdcall* GetModuleHandleA_t)(LPCSTR);
typedef LPSTR(__stdcall* GetCommandLineA_t)();
typedef void(__stdcall* GetStartupInfoA_t)(LPSTARTUPINFOA);
typedef void(__stdcall* ExitProcess_t)(UINT);
typedef HMODULE(__stdcall* LoadLibraryA_t)(LPCSTR);

Where __stdcall is a calling convention, used by almost all Windows APIs.

Now these function pointers need to be saved somewhere: we can use a struct.

#pragma pack(push, 1)
typedef __declspec(align(1)) struct {
    LoadLibraryA_t _LoadLibraryA;
    GetModuleHandleA_t _GetModuleHandleA;
    GetCommandLineA_t _GetCommandLineA;
    GetStartupInfoA_t _GetStartupInfoA;
    ExitProcess_t _ExitProcess;

    ...
} API;
#pragma pack(pop)

Notice how we used some directives to the compiler:

#pragma pack(push, 1) & #pragma pack(pop) to avoid putting extra padding between our struct fields (C compilers never reorder struct fields, so the order is preserved either way)
__declspec(align(1)) to force a 1 byte alignment (instead of 4), to save space

Again, the underscore prefix is needed to avoid name clashes and to keep Microsoft's #defines from messing things up.

2.3. Understanding the PE format

First off: RVA stands for Relative Virtual Address. It is an offset from the DLL base address to some part of the DLL in memory, so adding an RVA to the base address yields an actual pointer.

Now, this is what we are going to do:

Find the DllBase
Move forward by 0x3C bytes to read the PE header position
Go to the PE header
Move forward by 0x78 bytes to find the Export table RVA
Go to the Export table
Skip the first 0x18 bytes (we don't need them)
Save Exported functions (the number of exported function names, NumberOfNames in the export directory)
Save Address table RVA
Save Name pointer table RVA
Save Ordinal table RVA
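
The steps above can be sketched as plain offset reads from a byte buffer. This is a portable illustration, not the code used in the project: `read_u32`, `export_fields` and `locate_export_fields` are hypothetical names, and the 0x78 offset assumes a 32-bit (PE32) image, since the data directories sit further into the header in 64-bit images.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value at a byte offset, in host byte order.
   PE fields are little-endian, which matches x86. */
uint32_t read_u32(const uint8_t *base, uint32_t offset) {
    uint32_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

/* Hypothetical holder for the four values saved in the last steps. */
typedef struct {
    uint32_t number_of_names;
    uint32_t address_table_rva;
    uint32_t name_pointer_table_rva;
    uint32_t ordinal_table_rva;
} export_fields;

/* Follows the listed steps for a 32-bit image mapped at `image`. */
export_fields locate_export_fields(const uint8_t *image) {
    export_fields out;
    uint32_t pe_offset, export_rva;

    assert(image != NULL);
    pe_offset  = read_u32(image, 0x3C);             /* PE header position */
    export_rva = read_u32(image, pe_offset + 0x78); /* export table RVA */

    /* Skip the first 0x18 bytes, then read four consecutive 32-bit fields. */
    out.number_of_names        = read_u32(image, export_rva + 0x18);
    out.address_table_rva      = read_u32(image, export_rva + 0x1C);
    out.name_pointer_table_rva = read_u32(image, export_rva + 0x20);
    out.ordinal_table_rva      = read_u32(image, export_rva + 0x24);
    return out;
}
```

Using memcpy for the reads avoids unaligned pointer dereferences, at the cost of a few more bytes than the direct casts a size-optimized build would use.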

2.4. How to find an exported function pointer

Let's say we need to find the pointer to a function named "Function1". We need to:

Linearly scan the Export name pointer table until we find the function name "Function1"
At the same (logical) offset, in the Export ordinal table, find the ordinal number of the function
Use the ordinal as a (logical) offset in the Export address table
Congrats! You found the function pointer
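
The lookup above boils down to two array indexings. Here is a minimal sketch on a simplified in-memory model, where the three tables are ordinary C arrays and not RVAs into a mapped DLL; `export_tables` and `lookup_export` are hypothetical names, and plain string comparison stands in for the hash comparison used later.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the three export tables. */
typedef struct {
    const char *const *names;   /* name pointer table, one entry per named export */
    const uint16_t *ordinals;   /* ordinal table, parallel to `names` */
    const uint32_t *addresses;  /* address table, indexed by ordinal */
    uint32_t name_count;
} export_tables;

/* Returns the address-table entry for `wanted`, or 0 if the name is not exported. */
uint32_t lookup_export(const export_tables *t, const char *wanted) {
    assert(t != NULL && wanted != NULL);
    for (uint32_t i = 0; i < t->name_count; i++) {
        if (strcmp(t->names[i], wanted) == 0) {
            uint16_t ordinal = t->ordinals[i]; /* same index as the name */
            return t->addresses[ordinal];      /* ordinal indexes the address table */
        }
    }
    return 0;
}
```

The indirection through the ordinal table matters because the name table and the address table are not necessarily in the same order.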

2.5. The `findFunc` function

#pragma pack(push, 1)
struct dll_info {
    uint32_t exported_functions;
    uintptr_t address_table_RVA;
    uintptr_t name_pointer_table_RVA;
    uintptr_t ordinal_table_RVA;
};
#pragma pack(pop)

void findFunc(uintptr_t dllBase, uint32_t* hashes, void** ptrs, size_t size) {
    uintptr_t PE_RVA = *(uintptr_t*)((uint8_t*)dllBase + 0x3Cu);
    uintptr_t PE = dllBase + PE_RVA;

    uintptr_t export_table_RVA = *(uintptr_t*)((uint8_t*)PE + 0x78u);
    struct dll_info *dll = (struct dll_info*)((uint8_t*)dllBase + export_table_RVA + 0x18u);

    uintptr_t name_pointer_table_entry_RVA = dll->name_pointer_table_RVA;
    uint32_t i, j;

    uintptr_t ordinal_function_RVA;
    uint16_t ordinal_function;
    uintptr_t function_RVA;

    for (i = 0; i < dll->exported_functions; i++, name_pointer_table_entry_RVA += 4) {
        uintptr_t function_name_RVA = *(uintptr_t*)((uint8_t*)dllBase + name_pointer_table_entry_RVA);
        uint8_t* function_name = (uint8_t*)dllBase + function_name_RVA;
        uint32_t function_hash = djb2(function_name);

        for (j = 0; j < size; j++) {
            if (function_hash == hashes[j]) {
                ordinal_function_RVA = dll->ordinal_table_RVA + i * 2;
                ordinal_function = *(uint16_t*)((uint8_t*)dllBase + ordinal_function_RVA);
                function_RVA = *(uintptr_t*)((uint8_t*)dllBase + dll->address_table_RVA + ordinal_function * 4);

                ptrs[j] = (uint8_t*)dllBase + function_RVA;

                break;
            }
        }
    }
}

findFunc takes as its inputs:

The base address where the DLL is loaded
An array of function names' hashes
An (empty) array of function pointers
The length of the arrays

Every function name exported by the DLL is hashed with \(djb2\) and compared with each hash in the array passed as the second parameter. If it's a match then the pointer to that function is stored in the void ** array. For example, after the function is executed:

\[hashes[0]=djb2(“LoadLibraryA”)=\text{0x5FBFF0FB} \Longrightarrow ptrs[0] = \text{function pointer to LoadLibraryA}\]

A couple of things you might want to understand:

struct dll_info is used to read its 4 fields from the DLL all at once (they are contiguous)
Sometimes a cast to uint8_t * is used because we can't do pointer arithmetic on void *
In the for body, we multiply i by 2 because sizeof(ordinal) = 2
Also, we multiply ordinal_function by 4 because sizeof(address) = 4
I used quite a lot of variables, but the compiler will optimize most of them out
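The algorithm is \(O(n \cdot m)\), where \(n\) is the number of names exported by the DLL and \(m\) is the number of hashes we are looking for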

2.6. Getting the `kernel32.dll` APIs

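Now everything is set up to load the function pointers to the kernel32.dll functions into the API struct.
Since the API struct is packed, its function pointers sit contiguously in memory in the same order as the hashes, so we can pass the address of its first field as the pointer array and modify our initAPI function as follows: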

void initAPI(API* api) {
    PPEB peb = (PPEB)__readfsdword(0x30);
    uintptr_t kernel32Base = 0;

    for (
        PLIST_ENTRY ptr = peb->Ldr->InMemoryOrderModuleList.Flink;
        kernel32Base == 0;
        ptr = ptr->Flink)
    {
        MY_LDR_DATA_TABLE_ENTRY *e = CONTAINING_RECORD(ptr, MY_LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        
        if (e->BaseNameHashValue == KERNEL32DLL_HASH)
            kernel32Base = (uintptr_t)e->DllBase;
    }

    findFunc(kernel32Base, Kernel32Hashes, &api->_LoadLibraryA, ARRAYSIZE(Kernel32Hashes));
}

Where ARRAYSIZE is a macro imported by including windows.h, which returns the length of a static array.

2.7. Getting the `user32.dll` and `gdi32.dll` APIs

We actually need other functions exported by user32.dll and gdi32.dll.

The problem is that we don't know the base address of these DLLs, and they are not automatically loaded like kernel32.dll. We can use a trick however: the LoadLibraryA function exported by kernel32.dll.
LoadLibraryA loads a DLL by its file name and returns an HMODULE, which, in fact, is just a PVOID pointing to the DLL base address: we can cast it to a uintptr_t and use it in our findFunc.

We also need to:

Compute the function names' hashes we need from user32.dll and gdi32.dll
Add some new function pointer definitions
Add some new fields in the API struct to save the pointers

The initAPI function needs to be changed accordingly:

void initAPI(API* api) {
    PPEB peb = (PPEB)__readfsdword(0x30);
    uintptr_t kernel32Base = 0;

    for (
        PLIST_ENTRY ptr = peb->Ldr->InMemoryOrderModuleList.Flink;
        kernel32Base == 0;
        ptr = ptr->Flink) 
    {
        MY_LDR_DATA_TABLE_ENTRY *e = CONTAINING_RECORD(ptr, MY_LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        
        if (e->BaseNameHashValue == KERNEL32DLL_HASH)
            kernel32Base = (uintptr_t)e->DllBase;
    }

    findFunc(kernel32Base, Kernel32Hashes, &api->_LoadLibraryA, ARRAYSIZE(Kernel32Hashes));

    HMODULE user32 = api->_LoadLibraryA("USER32.DLL");
    findFunc((uintptr_t)user32, User32Hashes, &api->_LoadIconA, ARRAYSIZE(User32Hashes));

    HMODULE gdi32 = api->_LoadLibraryA("GDI32.DLL");
    findFunc((uintptr_t)gdi32, &Gdi32Hash, &api->_SetBkMode, 1);
}

In the last findFunc call we pass 1 as the array size, since we need just one function from gdi32.dll. You can find the new function pointer definitions here and the new hashes here.

3. Producing a small executable

3.1. Visual Studio configuration

Clearly, we might want to tweak MSVC and the linker to produce a small binary. I made a new configuration called "MinRel" based on the standard "Release" one. After playing around with the configuration I ended up with these command line parameters:

Compiler

/permissive- /ifcOutput "MinRel\" /GS- /analyze- /W3 /Gy /Zc:wchar_t- /Gm- /O1 /Ob0 /sdl- /Fd"MinRel\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /fp:except- /errorReport:prompt /GF /WX- /Zc:forScope /arch:IA32 /Gd /Oy /Oi /MD /FC /Fa"MinRel\" /nologo /Zl /Fo"MinRel\" /Os /Fp"MinRel\MinHW.pch" /diagnostics:column

Linker

/OUT:"C:\Users\Davide\source\repos\MinHW\MinRel\MinHW.exe" /MANIFEST:NO /PDB:"C:\Users\Davide\source\repos\MinHW\MinRel\MinHW.pdb" /DYNAMICBASE:NO /MACHINE:X86 /ENTRY:"main" /WINMD:NO /OPT:REF /SAFESEH:NO /INCREMENTAL:NO /PGD:"C:\Users\Davide\source\repos\MinHW\MinRel\MinHW.pgd" /SUBSYSTEM:WINDOWS /MANIFESTUAC:NO /ManifestFile:"MinRel\MinHW.exe.intermediate.manifest" /LTCGOUT:"MinRel\MinHW.iobj" /OPT:ICF /ERRORREPORT:PROMPT /NOLOGO /ALIGN:16 /NODEFAULTLIB /TLBID:1 /MERGE:.rdata=.text /MERGE:.data=.text /EMITPOGOPHASEINFO /RELEASE /STUB:"$(MSBuildProjectDirectory)\stub.bin"

You can see some undocumented or uncommon flags here:

/DYNAMICBASE:NO → removes the relocation table from the EXE
/ENTRY:"main" → sets a custom entry point
/ALIGN:16 → sets the alignment of each section in the EXE; we can't go lower than 16 bytes
/NODEFAULTLIB → avoids linking the default libraries (such as the C runtime)
/EMITPOGOPHASEINFO → removes some additional debug information (undocumented)
/MERGE → merges different sections
/RELEASE → along with /EMITPOGOPHASEINFO removes all debug information
/STUB → specifies a custom DOS stub (a small custom MZ EXE)

Even if I'm quite sure we don't actually need all these flags, I ended up with a 1312 byte EXE.

3.2. Removing some bytes

We can cut some extra bytes by opening the EXE with a hex editor.

As you can see, the last 0x5D bytes of the executable are all \(0\)s: we can delete them.
We also need to change the SizeOfCode field in the IMAGE_OPTIONAL_HEADER. To put it simply, just go to offset 0x8C and decrease the value you find there (0x90 in my case) by the number of bytes we removed earlier (0x5D).

We end up with a fully working 1219-byte executable.
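
As a quick check of the arithmetic, 0x5D is 93 in decimal, so trimming the 1312-byte file gives:

\[1312 - \text{0x5D} = 1312 - 93 = 1219 \text{ bytes}\]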

So, the executable is roughly \(18.1\%\) smaller than the original 1488 bytes one. Not bad. But can we do better?

3.3. What about assembly?

Now someone might ask: "why didn't you use assembly?". Well, I did.
Here is a gist with some code I wrote to show a MessageBox. For your own pleasure, a GIF (joking, it's a looped video) of me writing assembly at really high speed.

(The video has been sped up to 10x)

Now the funny thing: after compiling the source code with MASM (ML actually), I ended up with a binary with exactly the same size as the C one.
The thing is that, today, compilers are pretty damn good, so there isn't really any point in writing assembly (even if there might be some exceptions).

4. Asking DavePl

This is the point where I asked for help. I reached out to Dave Plummer himself to tell him about my final EXE size and ask whether he knew other ways to reduce it. Sure enough, he promptly gave me some advice:

Move structs to the BSS
Use another linker called Crinkler

Thanks Dave!

4.1. Avoiding the stack

Generally speaking, moving a struct from the stack to the BSS or the data section generates a smaller binary. If you initialize a struct on the stack, the compiler has to emit a MOV instruction for each field, and those instructions take quite some space. If the struct lives in the data section instead, its initial values are part of the image and the loader maps them into memory in one go when the executable starts; a struct in the BSS takes no space in the file at all, because it is simply zero-filled at startup.
I immediately moved the WNDCLASSEX to global scope. This saved a bunch of bytes (I can't remember exactly how many).
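
To make the three placements concrete, here is a small portable sketch. `window_config` is a hypothetical stand-in for a struct like WNDCLASSEX, and the field values are arbitrary; the size effect described in the comments is what a compiler typically does, not something this snippet measures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Windows struct such as WNDCLASSEX. */
typedef struct {
    unsigned size;
    unsigned style;
    const char *class_name;
} window_config;

/* Initialized global: the values are stored in the data section of the image,
   so no instructions are needed to fill the fields at run time. */
window_config g_config = { sizeof(window_config), 3u, "MinHW" };

/* Uninitialized global: lives in the BSS and is zero-filled at startup. */
window_config g_scratch;

/* Stack variant: the compiler typically emits one store per field on every call. */
unsigned style_from_stack(void) {
    window_config local = { sizeof(window_config), 3u, "MinHW" };
    return local.style;
}

/* Reads the class name from a config; a BSS-resident one starts out NULL. */
const char *class_name_or_default(const window_config *cfg) {
    assert(cfg != NULL);
    return cfg->class_name ? cfg->class_name : "(unset)";
}
```

Comparing the assembly listings (the /Fa output) of the global and stack variants is a simple way to see how many bytes each placement costs in a given build.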

4.2. Using Crinkler

In its GitHub repo, the about section states that:

Crinkler is an executable file compressor (or rather, a compressing linker) for Windows for compressing small demoscene executables. As of 2020, it is the most widely used tool for compressing 1k/4k/8k intros.

I was quite skeptical at the beginning: approximately 1200 bytes is very small by today's standards, maybe we can't go lower. But guess what? There's always room for improvement.

After a few minutes reading the manual, I ended up executing the following command in the same directory where Visual Studio generates the .obj files:

Crinkler.exe /NODEFAULTLIB /ENTRY:main /SUBSYSTEM:WINDOWS /TINYHEADER /NOINITIALIZERS /UNSAFEIMPORT /ORDERTRIES:1000 /TINYIMPORT /LIBPATH:"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.18362.0\um\x86" MinHW.obj api.obj kernel32.lib

Let's analyze some flags:

/NODEFAULTLIB, /ENTRY and /SUBSYSTEM behave exactly the same as with the Microsoft linker
/TINYHEADER → uses an alternative compression algorithm which is beneficial for extremely small files
/NOINITIALIZERS → disables some C++-related features. We are using plain C, we don't need them
/UNSAFEIMPORT → avoids displaying a MessageBox if an import fails, generating a smaller executable
/ORDERTRIES → specifies how many section reordering iterations Crinkler will try
/TINYIMPORT → enables a more compact function importing scheme
/LIBPATH → adds a new library search path

Notice how we need to link kernel32.lib because Crinkler emits some code that needs to be linked against it. If you have a Windows 10 SDK installed, you can find your x86 kernel32.lib file under:

C:\Program Files (x86)\Windows Kits\10\Lib\<version>\um\x86

Replace <version> with the newest version you have installed.

The next section reports the final size.

5. Conclusions

The final size is:

874 bytes

This means that the new executable is:

\(28.3\%\) smaller than the previous 1219 bytes one
\(41.3\%\) smaller than the previous 1488 bytes one
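
Both percentages follow from the relative size reduction, computed from the sizes quoted above:

\[1 - \frac{874}{1219} \approx 0.283 \qquad 1 - \frac{874}{1488} \approx 0.413\]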

Wow!

5.1. Compile it on your own

I made a zip that you can use to compile your own 874-byte executable. To use it:

Extract the zip somewhere
Find kernel32.lib as discussed in the previous section and edit build.bat accordingly
Make sure you have Visual Studio with MSVC + Windows 10 SDK installed
Open the x86 Native Tools Command Prompt for VS 2019 (or the x64_x86 Cross Tools Command Prompt for VS 2019 if you want to cross-compile on an x86_64 system)
cd into the directory where you extracted the zip
Execute build.bat

If everything goes well, after a bunch of warnings, in the extracted directory you will get:

MinRel\MinHW.exe → the EXE linked without Crinkler
out.exe → the EXE linked with Crinkler

5.2. Further improvements

If you have some ideas on how to shrink the executable even more, email me at [email protected].

5.3. Special thanks

Obviously, special thanks go to Dave Plummer. He gave me a great coding adventure to work with and some great tips. Thank you for all the effort you put into things.
