This is Part 7 — and the final part — of my Cortex‑M7 without hardware series.
Parts 1 through 6 built minimal bare-metal projects that would boot on Renode and QEMU. Minimal was the key word: the linker script had no alignment or symbol exports, the startup code skipped .data copying and .bss zeroing, and main.cpp had to talk to the host through hand-rolled inline assembly. Those shortcuts were fine for proving the concept, but they left the project in a state where much of C, and especially C++, was technically broken.
This post ties up all the loose ends. By the end, both projects have:
- Working initialized globals: .data is copied from flash to RAM at startup
- Guaranteed zero-initialized globals: .bss is zeroed before main()
- C++ global constructors: .init_array is iterated so static objects are constructed
- Standard library I/O: printf and exit work without hand-rolled assembly
The minimal versions are still available as git tags (blog-minimal for Renode [1], blog-minimal-0.1.2 for QEMU [2]), so you can diff them against the blog-completed tags on both projects to see every change.
The linker–startup contract
Part 1 introduced the linker script and Part 2 the startup code. Together they form a contract: the linker script defines the memory layout and exports boundary symbols, and the startup code uses those symbols to prepare the runtime environment before main() executes.
Part 2 explicitly listed what a production Reset_Handler does:
- Copy .data from flash (LMA) to RAM (VMA)
- Zero .bss
- Run early system init
- Call main()
The minimal startup simply called main() to get the application running. With the other steps missing, any global variable with an initial value would hold whatever garbage happened to be in RAM at power-on, and any zero-initialized global was not guaranteed to be zero. That was tolerable when main.cpp had no globals and used no standard library functions. It stops being tolerable the moment you write int counter = 42; at file scope or call printf.
Both the Renode and QEMU projects get the same structural changes described below. The only differences are the memory addresses (0x08000000 for STM32F746, 0x00000000 for MPS2-AN500) and the flash/RAM sizes; the pattern is identical.
Linker script: from minimal to complete
The minimal linker script from Part 1 was 10 lines inside the SECTIONS block. The complete version is roughly 120 lines. Here is what was added and why.
Alignment
The minimal script had no alignment directives. The complete version uses . = ALIGN(4) or . = ALIGN(8) throughout. The ALIGN builtin function returns the location counter (.) rounded up to the next multiple of its argument. [3] Writing . = ALIGN(8); before a section’s content therefore inserts padding bytes (if needed) so that the first byte that follows sits on an 8-byte boundary. The .isr_vector section uses two ALIGN directives, one before and one after the content:
.isr_vector :
{
    . = ALIGN(8);
    KEEP(*(.isr_vector))
    . = ALIGN(8);
} > FLASH
The first guarantees that the table starts on an 8-byte boundary. The second pads the section’s tail so the location counter is still aligned when the next section (.text) begins; without it, if the vector table’s total size is not a multiple of 8, .text could start at an unaligned address.
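To make the rounding concrete, here is a small host-side model of what ALIGN computes. It is only a sketch: align_up and padding_for are hypothetical helper names, and the addresses are made-up examples rather than values taken from the project's map file.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of ld's ALIGN(n): round addr up to the next multiple of
// alignment. The alignment must be a non-zero power of two.
inline std::uint32_t align_up(std::uint32_t addr, std::uint32_t alignment)
{
    assert(alignment != 0u && (alignment & (alignment - 1u)) == 0u);
    return (addr + alignment - 1u) & ~(alignment - 1u);
}

// Number of padding bytes the linker would insert to reach that boundary.
inline std::uint32_t padding_for(std::uint32_t addr, std::uint32_t alignment)
{
    return align_up(addr, alignment) - addr;
}
```

With these helpers, a section tail at 0x08000044 needs 4 bytes of padding to reach 0x08000048, while an address that is already a multiple of 8 is left unchanged.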
On Cortex-M7, unaligned accesses to normal memory generally work but can cost an extra bus cycle. More importantly, some structures have hard alignment requirements: the Armv7-M architecture requires the vector table to be aligned to a power of two that covers its size, with a minimum of 128 bytes. [4] Explicit alignment in the linker script prevents subtle faults.
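The vector table requirement can be worked out with a few lines of host-side C++. This is a sketch under my reading of the Armv7-M rule (alignment is a power of two no smaller than four bytes per entry, and never below 128 bytes); vector_table_alignment is a hypothetical helper, not something from the project.

```cpp
#include <cassert>
#include <cstdint>

// Required alignment, in bytes, for a vector table with the given number of
// 32-bit entries (initial MSP plus exception and interrupt vectors).
inline std::uint32_t vector_table_alignment(std::uint32_t entries)
{
    assert(entries > 0u);
    std::uint32_t alignment = 128u;  // assumed architectural minimum
    while (alignment < entries * 4u)
    {
        alignment *= 2u;  // next power of two
    }
    return alignment;
}
```

For the 16-entry table used in this post the answer is the 128-byte minimum, which a table placed at the start of flash satisfies automatically. A table with 114 entries (456 bytes) would need 512-byte alignment.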
.rodata as a separate section
The minimal script bundled read-only data (*(.rodata*)) inside .text. It works, but the complete version gives it its own section:
.rodata :
{
    . = ALIGN(4);
    *(.rodata)
    *(.rodata*)
    . = ALIGN(4);
} > FLASH
Separating .rodata makes the map file easier to read, which is nice and all, but it also enables memory protection. The Cortex-M7 includes an optional Memory Protection Unit (MPU), which lets you define regions of memory with specific access permissions: read-only, read/write, executable, or non-executable (the XN, “execute never”, bit). [5]
When .rodata lives inside .text, it is not possible to distinguish code from constant data, forcing the entire region to be marked executable, which means string literals and const globals become executable too. By placing .rodata in its own section, you can configure separate MPU regions: .text as read-only and executable, .rodata as read-only and execute-never. This is a common hardening step that prevents malicious (or accidental) code execution from data regions.
.ARM.extab and .ARM.exidx
These are the C++ exception unwinding tables. Even though both projects are compiled with the -fno-exceptions flag, the ARM toolchain and/or the embedded C library (newlib) may still emit these sections. If the linker script does not place them, the linker treats them as orphan sections and positions them by its own heuristics, which can produce warnings or an unexpected layout. [6]
.ARM.extab :
{
    *(.ARM.extab* .gnu.linkonce.armextab.*)
} > FLASH
.ARM :
{
    __exidx_start = .;
    *(.ARM.exidx*)
    __exidx_end = .;
} > FLASH
The __exidx_start and __exidx_end symbols mark the bounds of the exception index table. The unwinder in the runtime support library uses them to locate the table, so linking against library code that references them fails with “undefined reference” errors if they are missing.
.preinit_array, .init_array, .fini_array
These sections hold tables of function pointers that the startup code iterates through:
- .preinit_array: low-level platform init that must run before global constructors
- .init_array: C++ global constructors and __attribute__((constructor)) functions
- .fini_array: C++ global destructors (called at exit)
.init_array :
{
    PROVIDE_HIDDEN(__init_array_start = .);
    KEEP(*(SORT(.init_array.*)))
    KEEP(*(.init_array*))
    PROVIDE_HIDDEN(__init_array_end = .);
} > FLASH
KEEP is essential here: the linker’s --gc-sections pass cannot see direct references to these entries (they are looked up by address, not by symbol), so without KEEP the linker would garbage-collect them. SORT orders the prioritized input sections by name, so constructors with explicit priorities run in ascending order. GCC zero-pads the priority in the section name (for example .init_array.00100 before .init_array.00200), which is why a name sort matches the numeric order. [7]
PROVIDE_HIDDEN creates the boundary symbols (__init_array_start, __init_array_end), which are ABI-standard names that the startup code uses to iterate the table. These symbols do not come from any object file; they exist only because the linker script defines them. Without the assignment, the startup code would get an “undefined reference” error. A plain assignment like __init_array_start = .; would work too, but PROVIDE_HIDDEN adds two safety layers. PROVIDE only defines the symbol if no object file already defines it, avoiding “multiple definition” clashes if, for example, a C library exports the same name. HIDDEN marks the symbol with STV_HIDDEN visibility so it stays internal and does not leak into the dynamic symbol table. [8] The same pattern applies to .preinit_array and .fini_array.
.data with LMA/VMA split and boundary symbols
The minimal script had .data : { *(.data*) } > RAM AT > FLASH, which is correct placement but there is no way for the startup code to find the data. The complete version adds the symbols the copy loop needs:
_sidata = LOADADDR(.data);
.data :
{
    . = ALIGN(4);
    _sdata = .;
    *(.data)
    *(.data*)
    . = ALIGN(4);
    _edata = .;
} > RAM AT> FLASH
_sidata is the load address in flash where the initial values are stored. [7] _sdata and _edata mark the RAM destination. The startup code copies from _sidata into the range _sdata to _edata.
.bss with boundary symbols
Same idea, but simpler: there is nothing to copy, just a region (_sbss to _ebss) that the startup code fills with zeros:
.bss :
{
    . = ALIGN(4);
    _sbss = .;
    *(.bss)
    *(.bss*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
} > RAM
_estack and _end
Two important symbols that sit outside any section:
_estack = ORIGIN(RAM) + LENGTH(RAM);
This replaces the hardcoded 0x20000000 + 320 * 1024 that was in Part 2’s startup.c. The vector table references _estack as the initial MSP value.
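As a quick sanity check of that arithmetic, the same computation can be written as compile-time constants. The names are hypothetical, and the origin and length are the values from the hardcoded expression in Part 2, not read from the actual MEMORY block.

```cpp
#include <cstdint>

// Assumed RAM region, mirroring the hardcoded expression from Part 2.
constexpr std::uint32_t kRamOrigin = 0x20000000u;   // ORIGIN(RAM)
constexpr std::uint32_t kRamLength = 320u * 1024u;  // LENGTH(RAM)

// _estack is the first address past the end of RAM; the stack grows down.
constexpr std::uint32_t kEstack = kRamOrigin + kRamLength;

static_assert(kEstack == 0x20050000u, "320 KiB above the RAM origin");
static_assert(kEstack % 8u == 0u, "initial stack pointer is 8-byte aligned");
```

Letting the linker compute this from ORIGIN and LENGTH means the value follows the MEMORY definition automatically if the RAM size ever changes.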
. = ALIGN(8);
_end = .;
_end marks the first free address after all static allocations. The _sbrk syscall (covered later) uses it as the heap starting point.
Startup: from C to C++
The minimal startup.c was 25 lines: a 4-entry vector table and a Reset_Handler that simply called main(). The complete startup.cpp is 117 lines.
Why .cpp?
The startup now iterates C++ init arrays and uses reinterpret_cast, auto, and an asm label on a function declaration. Using C++ for the startup file keeps the project in a single language.
Full vector table
The minimal version had 4 entries (MSP, Reset, NMI, HardFault). The complete version has all 16 core Cortex-M exception entries:
extern "C" __attribute__((used, section(".isr_vector")))
const uintptr_t vector_table[] = {
    reinterpret_cast<uintptr_t>(&_estack),           // Initial stack pointer
    reinterpret_cast<uintptr_t>(Reset_Handler),      // Reset
    reinterpret_cast<uintptr_t>(NMI_Handler),        // NMI
    reinterpret_cast<uintptr_t>(HardFault_Handler),  // HardFault
    reinterpret_cast<uintptr_t>(MemManage_Handler),  // MemManage
    reinterpret_cast<uintptr_t>(BusFault_Handler),   // BusFault
    reinterpret_cast<uintptr_t>(UsageFault_Handler), // UsageFault
    0,                                               // Reserved
    0,                                               // Reserved
    0,                                               // Reserved
    0,                                               // Reserved
    reinterpret_cast<uintptr_t>(SVC_Handler),        // SVCall
    reinterpret_cast<uintptr_t>(DebugMon_Handler),   // Debug monitor
    0,                                               // Reserved
    reinterpret_cast<uintptr_t>(PendSV_Handler),     // PendSV
    reinterpret_cast<uintptr_t>(SysTick_Handler),    // SysTick
};
Each handler is declared as a weak alias to Default_Handler:
extern "C" void NMI_Handler(void) __attribute__((weak, alias("Default_Handler")));
extern "C" void HardFault_Handler(void) __attribute__((weak, alias("Default_Handler")));
extern "C" void MemManage_Handler(void) __attribute__((weak, alias("Default_Handler")));
// ... and so on for all exceptions
The weak attribute means application code can define its own HardFault_Handler and the linker will use that instead, without touching the startup file. [9] This is the standard pattern used by vendor startup files (ST’s HAL, NXP’s SDK, etc.).
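The fallback half of this mechanism can be tried on a desktop machine. The sketch below assumes GCC or Clang on an ELF target such as Linux, because the alias attribute is not available everywhere. The counting Default_Handler and the demo function are stand-ins for illustration; a real default handler would typically hang or log instead.

```cpp
#include <cassert>

int default_hits = 0;  // counts fallback invocations (illustration only)

extern "C" void Default_Handler(void)
{
    ++default_hits;
}

// No strong definition exists in this program, so both names resolve to
// Default_Handler at link time.
extern "C" void NMI_Handler(void) __attribute__((weak, alias("Default_Handler")));
extern "C" void HardFault_Handler(void) __attribute__((weak, alias("Default_Handler")));

void demo_weak_alias()
{
    default_hits = 0;
    NMI_Handler();
    HardFault_Handler();
    assert(default_hits == 2);  // both calls landed in the default handler
}
```

Overriding works across translation units: a non-weak HardFault_Handler defined in another source file would win at link time, and only NMI_Handler would still reach the default.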
.data copy and .bss zero-fill
The two loops that the minimal startup deliberately skipped:
void Reset_Handler(void)
{
    // Copy .data from flash to RAM.
    auto src = &_sidata;
    auto dst = &_sdata;
    while (dst < &_edata)
    {
        *dst++ = *src++;
    }
    // Zero-fill .bss.
    dst = &_sbss;
    while (dst < &_ebss)
    {
        *dst++ = 0;
    }
These run before any C or C++ code that might read a global variable. Without the .data copy, an int x = 42; at file scope would read as whatever random value happened to be in RAM. Without the .bss zero, a static int count; might not start at zero.
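The effect of the two loops is easy to check on a host machine by swapping the linker symbols for ordinary arrays. This is a model, not the project's startup code: init_ram and its parameters are hypothetical stand-ins for _sidata, _sdata, _edata, _sbss, and _ebss.

```cpp
#include <cassert>
#include <cstdint>

// Same two loops as the startup code, with the boundaries passed in.
inline void init_ram(const std::uint32_t* load,
                     std::uint32_t* data_begin, std::uint32_t* data_end,
                     std::uint32_t* bss_begin, std::uint32_t* bss_end)
{
    for (std::uint32_t* dst = data_begin; dst < data_end;)
    {
        *dst++ = *load++;  // copy initial values from the "flash" image
    }
    for (std::uint32_t* dst = bss_begin; dst < bss_end;)
    {
        *dst++ = 0u;  // zero-fill
    }
}

inline void demo_init_ram()
{
    const std::uint32_t flash_image[2] = {42u, 7u};  // plays the role of _sidata
    // "RAM" starts out full of garbage, as it would at power-on.
    std::uint32_t ram[4] = {0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu};
    init_ram(flash_image, ram, ram + 2, ram + 2, ram + 4);
    assert(ram[0] == 42u && ram[1] == 7u);  // .data holds its initial values
    assert(ram[2] == 0u && ram[3] == 0u);   // .bss is zero
}
```

Word-sized copies work here because the linker script aligns both ends of .data and .bss to 4 bytes.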
Constructor iteration
After memory is initialized, the startup iterates the pre-init and init arrays:
    // Run pre-init callbacks.
    for (auto fn = __preinit_array_start; fn < __preinit_array_end; fn++)
    {
        (*fn)();
    }
    // Run C++ global constructors.
    for (auto fn = __init_array_start; fn < __init_array_end; fn++)
    {
        (*fn)();
    }
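This is what makes file-scope C++ objects with non-trivial constructors (for example, a class whose constructor computes a value at runtime) get constructed before main() runs. An object with a constant initializer, such as std::array<int, 5> data = {1, 2, 3, 4, 5}, does not need this step; its values arrive through the .data copy instead.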
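The iteration itself is nothing more than a walk over a table of function pointers. The host-side sketch below fakes that table with a plain array; ctor_a, ctor_b, fake_init_array, and run_init_array are hypothetical names, and on the target the table would be whatever the linker collected between __init_array_start and __init_array_end.

```cpp
#include <cassert>
#include <vector>

using InitFn = void (*)();

std::vector<int> call_log;  // records the order in which entries ran

void ctor_a() { call_log.push_back(1); }
void ctor_b() { call_log.push_back(2); }

// Stand-in for the table the linker builds from .init_array input sections.
InitFn fake_init_array[] = {ctor_a, ctor_b};

// Call every entry in [begin, end) in table order.
inline void run_init_array(InitFn* begin, InitFn* end)
{
    for (InitFn* fn = begin; fn < end; ++fn)
    {
        (*fn)();
    }
}

void demo_run_init_array()
{
    call_log.clear();
    run_init_array(fake_init_array, fake_init_array + 2);
    assert(call_log.size() == 2u);
    assert(call_log[0] == 1 && call_log[1] == 2);  // table order preserved
}
```

Because the entries run strictly in table order, the order in which the linker script lays out the input sections determines the order of construction.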
The app_main / asm label trick
Calling main() directly from startup code with -Wpedantic triggers a diagnostic: “ISO C++ forbids taking the address of function ::main.” The workaround:
extern int app_main(void) asm("main");
This binds the C++ name app_main to the linker symbol main. The startup calls app_main(), the linker resolves it to the actual main function, and the compiler stays happy, so there is no need to disable the warning flag.
WFI after main() returns
Instead of a bare while(1), the post-main loop issues the wfi (Wait For Interrupt) instruction:
    app_main();
    while (true)
    {
        asm volatile("wfi");
    }
}
On real hardware this puts the core into a low-power sleep state. On emulators it is harmless. Either way it is better than burning cycles in a busy loop.
Making the standard library work
With the linker script and startup in place, the next step is hooking up the C standard library so that printf, malloc, and exit actually work.
Dropping -ffreestanding
Part 3 compiled with -ffreestanding because there was no standard library backend. That flag tells the compiler not to assume a hosted environment: standard library functions may be missing or may not have their usual meaning, so code cannot rely on even printf being available.
Now that we are providing the necessary syscall stubs (see below), -ffreestanding is removed. The project is a hosted environment, though it still uses -nostartfiles because the startup code is custom.
--specs=nano.specs
The linker flags now include --specs=nano.specs. This tells GCC to link against newlib-nano instead of full newlib. Newlib-nano provides smaller implementations of printf, sprintf, malloc, and friends, at the cost of some features: for example, printf has no floating-point (%f) support by default. For a microcontroller with 1 MB of flash, this matters: the full newlib printf can add tens of kilobytes to the binary. [10]
A “specs file” is a GCC mechanism for bundling default compiler/linker options into a named configuration. nano.specs redirects the C library and system call implementations to their newlib-nano counterparts. [11]
Syscalls: what newlib needs
Newlib-nano is a C library, but it does not know how a specific target does I/O or allocates memory. It delegates those operations to a set of POSIX-like functions that the application (we) must provide. Without these, linking with printf or malloc will produce “undefined reference” errors.
The backend-independent stubs live in syscalls_common.cpp:
_sbrk — called by malloc to grow the heap. Even a simple printf needs this because newlib uses malloc internally for formatting buffers:
void* _sbrk(ptrdiff_t incr)
{
    extern char _end; // Linker symbol — first address after BSS.
    static char* heap_end = &_end;
    char* prev = heap_end;
    register char* sp asm("sp");
    if (heap_end + incr > sp)
    {
        return reinterpret_cast<void*>(-1);
    }
    heap_end += incr;
    return prev;
}
The heap starts at _end (the linker symbol we defined earlier) and grows upward toward the stack. The collision check (heap_end + incr > sp) prevents malloc from corrupting the stack — if the heap would overlap, _sbrk returns -1 and malloc returns NULL.
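The same bump-allocation logic can be exercised on a host by replacing _end and the stack pointer with the two ends of a fixed buffer. This is a sketch, not the project's implementation: BumpHeap is a hypothetical class, and the limit is a fixed address rather than the live stack pointer.

```cpp
#include <cassert>
#include <cstddef>

// Minimal model of _sbrk: the heap grows upward from start and must never
// cross limit (which stands in for the stack pointer).
class BumpHeap
{
public:
    BumpHeap(char* start, char* limit) : heap_end_(start), limit_(limit) {}

    void* sbrk(std::ptrdiff_t incr)
    {
        // Comparing sizes avoids forming a pointer past the buffer.
        if (incr > limit_ - heap_end_)
        {
            return reinterpret_cast<void*>(-1);  // out of memory
        }
        char* prev = heap_end_;
        heap_end_ += incr;  // negative incr (shrinking) is not range-checked here
        return prev;
    }

private:
    char* heap_end_;
    char* limit_;
};

inline void demo_bump_heap()
{
    static char arena[16];
    BumpHeap heap(arena, arena + 16);
    assert(heap.sbrk(8) == arena);      // first block starts at the heap base
    assert(heap.sbrk(8) == arena + 8);  // second block follows directly
    assert(heap.sbrk(1) == reinterpret_cast<void*>(-1));  // would hit the limit
}
```

A failed request leaves the heap end untouched, so later, smaller requests can still succeed if they fit.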
_fstat — reports every descriptor as a character device (S_IFCHR), which makes the C library use line-buffered I/O suitable for a serial console.
_isatty — returns 1 (true) for all descriptors, ensuring stdout is line-buffered rather than fully buffered. This gives immediate output per newline.
_read, _close, _lseek — minimal stubs that return EOF, error, or zero respectively. There is no input source and no file system on bare metal.
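To give a feel for how small these stubs are, here is a host-compilable sketch. The functions live in a hypothetical stub_sketch namespace with made-up names so they do not collide with the host C library; on the target they would be the extern "C" functions _fstat, _isatty, _read, _close, and _lseek. The exact return values are my assumption of what "EOF, error, or zero" means, not a copy of syscalls_common.cpp.

```cpp
#include <cassert>
#include <sys/stat.h>

namespace stub_sketch
{
// Every descriptor looks like a character device (a terminal-like stream).
inline int fstat_stub(int /*fd*/, struct stat* st)
{
    st->st_mode = S_IFCHR;
    return 0;
}

// Every descriptor is reported as a terminal.
inline int isatty_stub(int /*fd*/) { return 1; }

// No input source: zero bytes read signals end of file.
inline int read_stub(int /*fd*/, char* /*buf*/, int /*len*/) { return 0; }

// No file system: closing always fails.
inline int close_stub(int /*fd*/) { return -1; }

// Streams are not seekable: the position is always zero.
inline int lseek_stub(int /*fd*/, int /*offset*/, int /*whence*/) { return 0; }
}  // namespace stub_sketch

inline void demo_stubs()
{
    struct stat st{};
    assert(stub_sketch::fstat_stub(1, &st) == 0);
    assert(S_ISCHR(st.st_mode));  // reported as a character device
    assert(stub_sketch::isatty_stub(1) == 1);
}
```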
Output backends
The output-specific stubs (_write and _exit) live in a separate file so they can be swapped per target.
Semihosting (syscalls_semihosting.cpp) — used by both the Renode and QEMU projects. This is the same ARM semihosting protocol from Part 6, but now routed through the standard _write interface instead of inline assembly in main.cpp:
int _write(int fd, const char* buf, int len)
{
    (void) fd;
    for (int i = 0; i < len; i++)
    {
        (void) semihosting_call(0x03, &buf[i]); // SYS_WRITEC
    }
    return len;
}
Note the switch from SYS_WRITE0 (op 0x04, null-terminated string) in Part 6 to SYS_WRITEC (op 0x03, single character). The C library calls _write with a buffer and length — not a null-terminated string — so writing one character at a time through SYS_WRITEC is the correct fit.
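The difference is easy to demonstrate on a host with a fake semihosting layer. In the sketch below, fake_semihosting_call and host_console are hypothetical stand-ins that only model SYS_WRITEC by appending to a string; nothing here traps to a debugger.

```cpp
#include <cassert>
#include <string>

std::string host_console;  // stand-in for the debugger console

// Fake of the semihosting trap: only SYS_WRITEC (0x03) is modelled.
inline int fake_semihosting_call(int op, const void* arg)
{
    if (op == 0x03)
    {
        host_console.push_back(*static_cast<const char*>(arg));
    }
    return 0;
}

// Same shape as _write: one SYS_WRITEC call per byte in the buffer.
inline int write_chars(int fd, const char* buf, int len)
{
    (void) fd;
    for (int i = 0; i < len; i++)
    {
        (void) fake_semihosting_call(0x03, &buf[i]);
    }
    return len;
}

inline void demo_write_chars()
{
    host_console.clear();
    const char raw[3] = {'H', 'i', '\n'};  // deliberately not null-terminated
    assert(write_chars(1, raw, 3) == 3);
    assert(host_console == "Hi\n");  // exactly len bytes were emitted
}
```

Passing the same unterminated buffer to a SYS_WRITE0-style call would read past its end, which is why the length-based loop is the safer match for _write.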
UART (syscalls_uart.cpp) — an alternative backend available in the QEMU project. Instead of semihosting, it writes directly to the MPS2-AN500 UART0 peripheral at 0x40004000. This is selected at CMake configure time:
set(OUTPUT_BACKEND "semihosting" CACHE STRING "Output backend: semihosting or uart")
Pass -DOUTPUT_BACKEND=uart to switch. QEMU then uses -serial stdio instead of -semihosting-config to route UART output to the terminal.
C++ runtime stubs
One more file: cxx_runtime.cpp provides two small stubs that the C++ ABI requires:
extern "C"
{
    void __cxa_pure_virtual()
    {
        while (true)
            ;
    }
    void __cxa_deleted_virtual()
    {
        while (true)
            ;
    }
}
__cxa_pure_virtual is called if a pure virtual function is invoked at runtime, which typically happens when a virtual call reaches an abstract base class while the object is still being constructed or is already being destroyed. __cxa_deleted_virtual is called if a deleted virtual function is invoked. Normally libstdc++ provides these, but its implementations pull in std::terminate(), which drags in the exception and demangling machinery and adds tens of kilobytes to the binary. These minimal stubs just hang so a debugger can catch the error, keeping the binary small. [12]
The payoff: the new main.cpp
After all that infrastructure, main.cpp becomes:
#include <cstdio>
#include <cstdlib>
int main()
{
    printf("Hello from Cortex-M7 on Renode!\n");
    printf("Value: %d\n", 42);
    exit(0);
}
Compare this to Part 6’s main.cpp, which had to define semihosting_call, sh_print, and sh_exit inline — roughly 30 lines of assembly-heavy code just to print a string and shut down.
All that complexity has moved into reusable infrastructure files (startup.cpp, syscalls_*.cpp, cxx_runtime.cpp). The application code now reads like a normal C++ program. That is the whole point.
Build and run
The CMake changes are modest:
- Five new source files added to add_executable
- -ffreestanding removed
- --specs=nano.specs added to linker flags
- Post-build objcopy step generates .hex and .bin files alongside the ELF (useful for flash programmers and some emulators that want raw binary)
- The Renode project dynamically generates .resc scripts in the build directory
- The QEMU project adds the OUTPUT_BACKEND cache variable
Renode
git clone \
--depth 1 \
--branch blog-completed --single-branch \
https://gitlab.com/sorhanp/cortex-m7-renode.git \
cortex-m7-renode-blog-completed
cd cortex-m7-renode-blog-completed
cmake --preset arm-none-eabi-debug
cmake --build --preset arm-gcc-debug-build --target run-renode
Expected output (Renode):
Hello from Cortex-M7 on Renode!
Value: 42
QEMU
git clone \
--depth 1 \
--branch blog-completed --single-branch \
https://gitlab.com/sorhanp/cortex-m7-qemu.git \
cortex-m7-qemu-blog-completed
cd cortex-m7-qemu-blog-completed
cmake --preset arm-none-eabi-debug
cmake --build --preset arm-gcc-debug-build --target run-qemu
Expected output (QEMU):
Hello from Cortex-M7 on QEMU!
Value: 42
What we built
Here is a quick summary of what went from “missing” to “present”:
| What was added | Why it matters |
|---|---|
| Alignment directives in linker script | Prevents bus faults and ensures vector table correctness |
| .rodata, .ARM.extab, .ARM.exidx sections | Proper placement of read-only data and C++ unwind tables |
| .preinit_array / .init_array / .fini_array | C++ global constructors and destructors run automatically |
| .data copy loop in startup | Initialized globals have their correct values |
| .bss zero-fill in startup | Zero-initialized globals are guaranteed zero |
| Full 16-entry vector table with weak aliases | All core exceptions are handled; application can override any handler |
| Newlib-nano via --specs=nano.specs | printf, malloc, exit work out of the box |
| POSIX syscall stubs | Newlib’s backend for heap, I/O, and process control |
| C++ ABI stubs | Virtual function safety without pulling in libstdc++ exception machinery |
Over seven posts, this series went from a linker script that mapped flash and RAM for an STM32F746 to a fully hosted C++ environment. From there, the minimal vector table and startup code got the CPU running. CMake cross-compiled it, the map file proved the layout was correct, and Renode and QEMU gave us execution and debugging — all without soldering a single pin.
Now the foundation is solid. The runtime is initialized, the standard library works, and main.cpp is just normal C++. From here you could add HAL drivers, an RTOS, or complex application logic and the environment would support it.
The full source is on GitLab: Renode project [1] and QEMU project [2].
References
[1] Cortex-M7 Renode project: https://gitlab.com/sorhanp/cortex-m7-renode
[2] Cortex-M7 QEMU project: https://gitlab.com/sorhanp/cortex-m7-qemu
[3] GNU ld, ALIGN built-in function: https://sourceware.org/binutils/docs/ld/Builtin-Functions.html
[4] Arm Architecture Reference Manual, Vector Table Offset Register (VTOR): https://developer.arm.com/documentation/ddi0489/f/system-control/system-control-register-descriptions/vector-table-offset-register
[5] Arm Cortex-M7, MPU Type Register and region attributes: https://developer.arm.com/documentation/ddi0489/f/memory-protection-unit
[6] ARM exception handling ABI (.ARM.exidx / .ARM.extab): https://github.com/ARM-software/abi-aa/blob/main/ehabi32/ehabi32.rst
[7] GNU ld, KEEP, SORT, and LOADADDR: https://sourceware.org/binutils/docs/ld/Input-Section-Keep.html
[8] GNU ld, PROVIDE and PROVIDE_HIDDEN: https://sourceware.org/binutils/docs/ld/PROVIDE.html
[9] GCC attribute weak and alias: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
[10] Newlib and newlib-nano: https://sourceware.org/newlib/
[11] GCC spec files: https://gcc.gnu.org/onlinedocs/gcc/Spec-Files.html
[12] Itanium C++ ABI, __cxa_pure_virtual: https://itanium-cxx-abi.github.io/cxx-abi/abi.html