This is Part 7 — and the final part — of my Cortex‑M7 without hardware series.

Parts 1 through 6 built minimal bare-metal projects that would boot on Renode and QEMU. Minimal was the key word: the linker script had no alignment or symbol exports, the startup code skipped .data copying and .bss zeroing, and main.cpp had to talk to the host through hand-rolled inline assembly. Those shortcuts were fine for proving the concept, but they left the project in a state where much of C, and especially C++, was technically broken.

This post ties up all the loose ends. By the end, both projects have:

  1. Working initialized globals: .data is copied from flash to RAM at startup
  2. Guaranteed zero-initialized globals: .bss is zeroed before main()
  3. C++ global constructors: .init_array is iterated so static objects are constructed
  4. Standard library I/O: printf and exit work without hand-rolled assembly

The minimal versions are still available as git tags (blog-minimal for Renode 1, blog-minimal-0.1.2 for QEMU 2) so it is possible to diff them against the blog-completed tags on both projects to see every change.

The linker–startup contract

Part 1 introduced the linker script and Part 2 the startup code. Together they form a contract: the linker script defines the memory layout and exports boundary symbols, and the startup code uses those symbols to prepare the runtime environment before main() executes.

Part 2 explicitly listed what a production Reset_Handler does:

  1. Copy .data from flash (LMA) to RAM (VMA)
  2. Zero .bss
  3. Run early system init
  4. Call main()

The minimal startup did only the last step: it called main to get the application running. Without the other steps, any global variable with an initial value would hold whatever garbage happened to be in RAM at power-on, and any zero-initialized global was not guaranteed to be zero. That was tolerable when main.cpp had no globals and used no standard library functions. It stops being tolerable the moment you write int counter = 42; at file scope or call printf.

Both the Renode and QEMU projects get the same structural changes described below. The only difference is the memory addresses (0x08000000 for STM32F746, 0x00000000 for MPS2-AN500) and the flash/RAM sizes — the pattern is identical.

Linker script: from minimal to complete

The minimal linker script from Part 1 was 10 lines inside the SECTIONS block. The complete version is roughly 120 lines. Here is what was added and why.

Alignment

The minimal script had no alignment directives. The complete version uses . = ALIGN(4) or . = ALIGN(8) throughout. The ALIGN builtin function returns the location counter (.) rounded up to the next multiple of its argument. 3 Writing . = ALIGN(8); before a section’s content therefore inserts padding bytes (if needed) so that the first byte that follows sits on an 8-byte boundary. The .isr_vector section uses two ALIGN directives — one before and one after the content:

.isr_vector :
{
    . = ALIGN(8);
    KEEP(*(.isr_vector))
    . = ALIGN(8);
} > FLASH

The first directive ensures the section starts on an 8-byte boundary. The second pads the section’s tail so the location counter is aligned when the next section (.text) begins. Without it, if the vector table’s total size is not a multiple of 8, .text could start at an unaligned address.
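The rounding that ALIGN performs is ordinary integer arithmetic. As a minimal host-side sketch (align_up and padding_for are hypothetical helper names, not part of either project, and the alignment is assumed to be a power of two), it can be modeled like this:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic behind the linker's ALIGN(n).
inline std::uintptr_t align_up(std::uintptr_t address, std::uintptr_t alignment)
{
  // The bit trick below only works for non-zero powers of two.
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (address + alignment - 1) & ~(alignment - 1);
}

// Number of padding bytes that ". = ALIGN(n);" would insert at `address`.
inline std::uintptr_t padding_for(std::uintptr_t address, std::uintptr_t alignment)
{
  return align_up(address, alignment) - address;
}
```

For example, align_up(0x080001C4, 8) yields 0x080001C8 (4 bytes of padding), while an already aligned address such as 0x08000000 is returned unchanged.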

On Cortex-M7, unaligned accesses to normal memory generally work but can cost extra bus cycles, and some instructions (such as LDRD, LDM, and STM) fault on unaligned addresses regardless. More importantly, certain structures (like the vector table) must be aligned because the architecture requires it. 4 Explicit alignment prevents subtle faults.

.rodata as a separate section

The minimal script bundled read-only data (*(.rodata*)) inside .text. It works, but the complete version gives it its own section:

.rodata :
{
    . = ALIGN(4);
    *(.rodata)
    *(.rodata*)
    . = ALIGN(4);
} > FLASH

Separating .rodata makes the map file easier to read, but it also makes memory protection practical. The Cortex-M7 includes an optional Memory Protection Unit (MPU), which lets you define regions of memory with specific access permissions: read-only, read/write, executable, or non-executable (the XN, “execute never”, bit). 5

When .rodata lives inside .text, it is not possible to distinguish code from constant data, forcing the entire region to be marked executable, which means string literals and const globals become executable too. By placing .rodata in its own section, you can configure separate MPU regions: .text as read-only and executable, .rodata as read-only and execute-never. This is a common hardening step that prevents malicious (or accidental) code execution from data regions.

.ARM.extab and .ARM.exidx

These are the C++ exception unwinding tables. Even though both projects are compiled with the -fno-exceptions flag, the ARM toolchain and/or Embedded C libraries (newlib) may emit these sections internally. If the linker script does not place them, the linker may warn or silently produce a broken ELF. 6

.ARM.extab :
{
    *(.ARM.extab* .gnu.linkonce.armextab.*)
} > FLASH

.ARM :
{
    __exidx_start = .;
    *(.ARM.exidx*)
    __exidx_end = .;
} > FLASH

The __exidx_start and __exidx_end symbols are required by the C++ runtime unwinder (__aeabi_unwind_cpp_pr*). Without these symbols, “undefined reference” errors will pop up when linking against library code that references them.

.preinit_array, .init_array, .fini_array

These sections hold tables of function pointers that the startup code iterates through:

  • .preinit_array: low-level platform init that must run before global constructors
  • .init_array: C++ global constructors and __attribute__((constructor)) functions
  • .fini_array: C++ global destructors (called at exit)

The .init_array output section, for example:

.init_array :
{
    PROVIDE_HIDDEN(__init_array_start = .);
    KEEP(*(SORT(.init_array.*)))
    KEEP(*(.init_array*))
    PROVIDE_HIDDEN(__init_array_end = .);
} > FLASH

KEEP is essential here — the linker’s --gc-sections pass cannot see direct references to these entries (they are looked up by address, not by symbol), so without KEEP the linker would garbage-collect them. SORT ensures priority-ordered constructors (e.g. .init_array.100, .init_array.200) run in the correct order. 7
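To see why the ordering matters, here is a small sketch of two prioritized constructors. It relies on the GCC/Clang constructor(priority) attribute, where the compiler places each function pointer into a priority-suffixed .init_array.* input section. The names ctor_order, ctor_count, init_first, and init_second are made up for illustration:

```cpp
#include <cassert>

// Hypothetical log used only to observe the order of execution.
// Plain ints are zero-initialized statically, so they are safe to touch
// from constructors that run before main().
int ctor_order[2];
int ctor_count;

// Priorities 0-100 are reserved for the implementation; lower numbers run first.
// Deliberately defined in the "wrong" source order to show that priority wins.
__attribute__((constructor(102))) static void init_second()
{
  assert(ctor_count < 2);
  ctor_order[ctor_count++] = 102;
}

__attribute__((constructor(101))) static void init_first()
{
  assert(ctor_count < 2);
  ctor_order[ctor_count++] = 101;
}
```

With a linker script that sorts the prioritized entries, init_first runs before init_second even though it appears later in the source, so ctor_order ends up as {101, 102} by the time main() starts.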

PROVIDE_HIDDEN creates the boundary symbols (__init_array_start, __init_array_end) which are ABI-standard names that the startup code uses to iterate the table. These symbols do not come from any object file — they exist only because the linker script defines them. Without the assignment, the startup code would get an “undefined reference” error. A plain assignment like __init_array_start = .; would work too, but PROVIDE_HIDDEN adds two safety layers: PROVIDE only defines the symbol if no object file already defines it, avoiding “multiple definition” clashes if, for example, a C library exports the same name; HIDDEN marks the symbol with STV_HIDDEN visibility so it stays internal and does not leak into the dynamic symbol table. 8 The same pattern applies to .preinit_array and .fini_array.

.data with LMA/VMA split and boundary symbols

The minimal script had .data : { *(.data*) } > RAM AT > FLASH, which places the section correctly but gives the startup code no way to find it. The complete version adds the symbols the copy loop needs:

_sidata = LOADADDR(.data);

.data :
{
    . = ALIGN(4);
    _sdata = .;
    *(.data)
    *(.data*)
    . = ALIGN(4);
    _edata = .;
} > RAM AT> FLASH

_sidata is the load address in flash where the initial values are stored. 7 _sdata and _edata mark the RAM destination. The startup code copies from _sidata to _sdata..._edata.

.bss with boundary symbols

Same idea, but simpler: there is nothing to copy, just a region (_sbss..._ebss) for the startup code to fill with zeros:

.bss :
{
    . = ALIGN(4);
    _sbss = .;
    *(.bss)
    *(.bss*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
} > RAM

_estack and _end

Two important symbols that sit outside any section:

_estack = ORIGIN(RAM) + LENGTH(RAM);

This replaces the hardcoded 0x20000000 + 320 * 1024 that was in Part 2’s startup.c. The vector table references _estack as the initial MSP value.

. = ALIGN(8);
_end = .;

_end marks the first free address after all static allocations. The _sbrk syscall (covered later) uses it as the heap starting point.

Startup: from C to C++

The minimal startup.c was 25 lines: a 4-entry vector table and a Reset_Handler that simply called main(). The complete startup.cpp is 117 lines.

Why .cpp?

The startup now iterates C++ init arrays and uses reinterpret_cast, auto, and an asm label on a function declaration. Using C++ for the startup file keeps the project in a single language.

Full vector table

The minimal version had 4 entries (MSP, Reset, NMI, HardFault). The complete version has all 16 core Cortex-M exception entries:

extern "C" __attribute__((used, section(".isr_vector")))
const uintptr_t vector_table[] = {
  reinterpret_cast<uintptr_t>(&_estack),           // Initial stack pointer
  reinterpret_cast<uintptr_t>(Reset_Handler),      // Reset
  reinterpret_cast<uintptr_t>(NMI_Handler),        // NMI
  reinterpret_cast<uintptr_t>(HardFault_Handler),  // HardFault
  reinterpret_cast<uintptr_t>(MemManage_Handler),  // MemManage
  reinterpret_cast<uintptr_t>(BusFault_Handler),   // BusFault
  reinterpret_cast<uintptr_t>(UsageFault_Handler), // UsageFault
  0,                                               // Reserved
  0,                                               // Reserved
  0,                                               // Reserved
  0,                                               // Reserved
  reinterpret_cast<uintptr_t>(SVC_Handler),        // SVCall
  reinterpret_cast<uintptr_t>(DebugMon_Handler),   // Debug monitor
  0,                                               // Reserved
  reinterpret_cast<uintptr_t>(PendSV_Handler),     // PendSV
  reinterpret_cast<uintptr_t>(SysTick_Handler),    // SysTick
};

Each handler is declared as a weak alias to Default_Handler:

extern "C" void NMI_Handler(void)        __attribute__((weak, alias("Default_Handler")));
extern "C" void HardFault_Handler(void)  __attribute__((weak, alias("Default_Handler")));
extern "C" void MemManage_Handler(void)  __attribute__((weak, alias("Default_Handler")));
// ... and so on for all exceptions

The weak attribute means application code can define its own HardFault_Handler and the linker will use that instead — without touching the startup file. 9 This is the standard pattern used by vendor startup files (ST’s HAL, NXP’s SDK, etc.).

.data copy and .bss zero-fill

The two loops that the minimal startup deliberately skipped:

void Reset_Handler(void)
{
  // Copy .data from flash to RAM.
  auto src = &_sidata;
  auto dst = &_sdata;
  while (dst < &_edata)
  {
    *dst++ = *src++;
  }

  // Zero-fill .bss.
  dst = &_sbss;
  while (dst < &_ebss)
  {
    *dst++ = 0;
  }

These run before any C or C++ code that might read a global variable. Without the .data copy, an int x = 42; at file scope would read as whatever random value happened to be in RAM. Without the .bss zero, a static int count; might not start at zero.
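The effect of the two loops is easy to check on a PC. The following is only a host-side model, not the project's startup code: ordinary arrays stand in for flash and RAM, the pointers are passed as parameters instead of coming from linker symbols, and copy_data, zero_bss, and simulate_reset are hypothetical names. It also assumes the regions are handled as 32-bit words, which is what the ALIGN(4) padding in the linker script makes safe:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of the .data copy loop: flash image -> RAM.
inline void copy_data(const std::uint32_t* src, std::uint32_t* dst, const std::uint32_t* dst_end)
{
  assert(src != nullptr && dst != nullptr);
  while (dst < dst_end)
  {
    *dst++ = *src++;
  }
}

// Model of the .bss zero-fill loop.
inline void zero_bss(std::uint32_t* dst, const std::uint32_t* dst_end)
{
  assert(dst != nullptr);
  while (dst < dst_end)
  {
    *dst++ = 0;
  }
}

// Simulates a reset: "RAM" starts out full of garbage, then both loops run.
// Returns true when .data matches the flash image and .bss is all zero.
inline bool simulate_reset()
{
  const std::uint32_t flash_image[3] = {42u, 7u, 0xCAFEF00Du};
  std::uint32_t ram_data[3] = {0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu};
  std::uint32_t ram_bss[4]  = {0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu};

  copy_data(flash_image, ram_data, ram_data + 3);
  zero_bss(ram_bss, ram_bss + 4);

  bool ok = ram_data[0] == 42u && ram_data[1] == 7u && ram_data[2] == 0xCAFEF00Du;
  for (std::size_t i = 0; i < 4; i++)
  {
    ok = ok && ram_bss[i] == 0u;
  }
  return ok;
}
```

Skipping either call in simulate_reset leaves the 0xDEADBEEF pattern in place, which is the host equivalent of reading an uninitialized global on the target.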

Constructor iteration

After memory is initialized, the startup iterates the pre-init and init arrays:

  // Run pre-init callbacks.
  for (auto fn = __preinit_array_start; fn < __preinit_array_end; fn++)
  {
    (*fn)();
  }

  // Run C++ global constructors.
  for (auto fn = __init_array_start; fn < __init_array_end; fn++)
  {
    (*fn)();
  }

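This is what makes C++ objects at file scope that need dynamic initialization (for example, an object whose constructor cannot be evaluated at compile time) get constructed before main() runs. A constant-initialized aggregate such as std::array<int, 5> data = {1, 2, 3, 4, 5} is typically placed in .data instead, so it depends on the copy loop rather than on .init_array.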
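A short sketch shows which kind of global relies on which part of the startup. Counter, read_initial_value, boot_counter, plain_global, and check_globals are hypothetical names for illustration, and an optimizing compiler is allowed to turn the dynamic initialization into a static one:

```cpp
#include <cassert>

// Hypothetical helper: not constexpr, so the compiler is not required to
// evaluate it at compile time.
inline int read_initial_value()
{
  return 42;
}

struct Counter
{
  Counter() : value(read_initial_value()) {}
  int value;
};

// Usually needs dynamic initialization: the compiler emits a small function
// that runs Counter::Counter() and stores a pointer to it in .init_array.
Counter boot_counter;

// Constant-initialized: the value 7 lives in .data and is restored by the copy loop.
int plain_global = 7;

// Intended to be called from main() as a sanity check of the startup code.
inline void check_globals()
{
  assert(boot_counter.value == 42); // fails if .init_array was never iterated
  assert(plain_global == 7);        // fails if .data was never copied
}
```

On a startup file that skips the init array loop, boot_counter.value would still hold whatever the .bss zero-fill left there (0), while plain_global would be unaffected.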

The app_main / asm label trick

Calling main() directly from startup code with -Wpedantic triggers a diagnostic: “ISO C++ forbids taking the address of function ::main.” The workaround:

extern int app_main(void) asm("main");

This binds the C++ name app_main to the linker symbol main. The startup calls app_main(), the linker resolves it to the actual main function, the compiler stays happy, and there is no need to disable the warning flag.

WFI after main() returns

Instead of a bare while(1), the post-main loop issues the wfi (Wait For Interrupt) instruction:

  app_main();

  while (true)
  {
    asm volatile("wfi");
  }
}

On real hardware this puts the core into a low-power sleep state. On emulators it is harmless. Either way it is better than burning cycles in a busy loop.

Making the standard library work

With the linker script and startup in place, the next step is hooking up the C standard library so that printf, malloc, and exit actually work.

Dropping -ffreestanding

Part 3 compiled with -ffreestanding because there was no standard library backend. That flag tells the compiler “do not assume any standard library functions exist” — which means even printf is off limits.

Now that we are providing the necessary syscall stubs (see below), -ffreestanding is removed. The project is now built as a hosted environment, though it still uses -nostartfiles because the startup code is custom.

--specs=nano.specs

The linker flags now include --specs=nano.specs. This tells GCC to link against newlib-nano instead of full newlib. Newlib-nano provides smaller implementations of printf, sprintf, malloc, and friends, at the cost of some features: for example, printf does not support floating-point conversions (%f) by default. For a microcontroller with 1 MB of flash, this matters: the full newlib printf can add tens of kilobytes to the binary. 10

A “specs file” is a GCC mechanism for bundling default compiler/linker options into a named configuration. nano.specs redirects the C library and system call implementations to their newlib-nano counterparts. 11

Syscalls: what newlib needs

Newlib-nano is a C library, but it does not know how a specific target does I/O or allocates memory. It delegates those operations to a set of POSIX-like functions that the application (we) must provide. Without these, linking with printf or malloc will produce “undefined reference” errors.

The backend-independent stubs live in syscalls_common.cpp:

_sbrk — called by malloc to grow the heap. Even a simple printf needs this because newlib uses malloc internally for formatting buffers:

void* _sbrk(ptrdiff_t incr)
{
  extern char    _end; // Linker symbol — first address after BSS.
  static char*   heap_end = &_end;
  char*          prev     = heap_end;

  register char* sp asm("sp");
  if (heap_end + incr > sp)
  {
    return reinterpret_cast<void*>(-1);
  }

  heap_end += incr;
  return prev;
}

The heap starts at _end (the linker symbol we defined earlier) and grows upward toward the stack. The collision check (heap_end + incr > sp) prevents malloc from corrupting the stack — if the heap would overlap, _sbrk returns -1 and malloc returns NULL.

_fstat — reports every descriptor as a character device (S_IFCHR), which makes the C library use line-buffered I/O suitable for a serial console.

_isatty — returns 1 (true) for all descriptors, ensuring stdout is line-buffered rather than fully buffered. This gives immediate output per newline.

_read, _close, _lseek — minimal stubs that return EOF, error, or zero respectively. There is no input source and no file system on bare metal.
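As a rough illustration of how small these stubs can be, here is a minimal sketch. The signatures follow the commonly documented newlib stub prototypes, but the exact bodies and return values are assumptions rather than the projects' actual code. In particular, _read returning 0 is one way to signal end-of-file, and EBADF is an arbitrary choice of error code:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/stat.h>

extern "C"
{
// Report every descriptor as a character device.
int _fstat(int fd, struct stat* st)
{
  (void) fd;
  assert(st != nullptr);
  st->st_mode = S_IFCHR;
  return 0;
}

// Treat every descriptor as a terminal so stdout is line-buffered.
int _isatty(int fd)
{
  (void) fd;
  return 1;
}

// No input source: reading 0 bytes tells the caller it hit end-of-file.
int _read(int fd, char* buf, int len)
{
  (void) fd;
  (void) buf;
  (void) len;
  return 0;
}

// Nothing can be closed on bare metal: report an error.
int _close(int fd)
{
  (void) fd;
  errno = EBADF;
  return -1;
}

// No seekable files: always report offset zero.
int _lseek(int fd, int offset, int whence)
{
  (void) fd;
  (void) offset;
  (void) whence;
  return 0;
}
}
```

The extern "C" block matters: newlib looks these functions up by their unmangled C names, so a C++ source file has to opt out of name mangling for them.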

Output backends

The output-specific stubs (_write and _exit) live in a separate file so they can be swapped per target.

Semihosting (syscalls_semihosting.cpp) — used by both the Renode and QEMU projects. This is the same ARM semihosting protocol from Part 6, but now routed through the standard _write interface instead of inline assembly in main.cpp:

int _write(int fd, const char* buf, int len)
{
  (void) fd;
  for (int i = 0; i < len; i++)
  {
    (void) semihosting_call(0x03, &buf[i]); // SYS_WRITEC
  }
  return len;
}

Note the switch from SYS_WRITE0 (op 0x04, null-terminated string) in Part 6 to SYS_WRITEC (op 0x03, single character). The C library calls _write with a buffer and length — not a null-terminated string — so writing one character at a time through SYS_WRITEC is the correct fit.

UART (syscalls_uart.cpp) — an alternative backend available in the QEMU project. Instead of semihosting, it writes directly to the MPS2-AN500 UART0 peripheral at 0x40004000. This is selected at CMake configure time:

set(OUTPUT_BACKEND "semihosting" CACHE STRING "Output backend: semihosting or uart")

Pass -DOUTPUT_BACKEND=uart to switch. QEMU then uses -serial stdio instead of -semihosting-config to route UART output to the terminal.
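For a feel of what a UART backend involves, here is a hedged sketch of a blocking transmit routine. The register layout is an assumption based on the CMSDK APB UART found on MPS2 images, the names CmsdkUart and uart_write are hypothetical, and baud rate setup (which real hardware needs) is left out. Passing the register block as a pointer keeps the routine testable on a PC with a fake struct:

```cpp
#include <cassert>
#include <cstdint>

// Assumed register layout of the CMSDK APB UART.
struct CmsdkUart
{
  volatile std::uint32_t data;      // 0x00: write to transmit, read to receive
  volatile std::uint32_t state;     // 0x04: bit 0 = TX buffer full
  volatile std::uint32_t ctrl;      // 0x08: bit 0 = TX enable
  volatile std::uint32_t intstatus; // 0x0C: interrupt status / clear
  volatile std::uint32_t bauddiv;   // 0x10: baud rate divider (not programmed here)
};

constexpr std::uint32_t kTxFull   = 1u << 0;
constexpr std::uint32_t kTxEnable = 1u << 0;

// Hypothetical helper: blocking write of `len` bytes to `uart`.
inline int uart_write(CmsdkUart* uart, const char* buf, int len)
{
  assert(uart != nullptr && buf != nullptr);
  uart->ctrl = uart->ctrl | kTxEnable;
  for (int i = 0; i < len; i++)
  {
    while ((uart->state & kTxFull) != 0)
    {
      // Wait until the transmitter can accept another byte.
    }
    uart->data = static_cast<std::uint8_t>(buf[i]);
  }
  return len;
}
```

On the target, a _write stub could forward to uart_write(reinterpret_cast<CmsdkUart*>(0x40004000), buf, len), using the UART0 base address mentioned above.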

C++ runtime stubs

One more file: cxx_runtime.cpp provides two small stubs that the C++ ABI requires:

extern "C"
{
void __cxa_pure_virtual()
{
  while (true)
    ;
}

void __cxa_deleted_virtual()
{
  while (true)
    ;
}
}

__cxa_pure_virtual is called if a pure virtual function is invoked at runtime, which typically happens when a virtual call is made from a base-class constructor or destructor, while the derived override is not in effect. __cxa_deleted_virtual is called if a deleted virtual function is invoked. Normally libstdc++ provides these, but its implementations pull in std::terminate(), which drags in the exception and demangling machinery and can add tens of kilobytes to the binary. These minimal stubs just hang so a debugger can catch the error, keeping the binary small. 12

The payoff: the new main.cpp

After all that infrastructure, main.cpp becomes:

#include <cstdio>
#include <cstdlib>

int main()
{
  printf("Hello from Cortex-M7 on Renode!\n");
  printf("Value: %d\n", 42);
  exit(0);
}

Compare this to Part 6’s main.cpp, which had to define semihosting_call, sh_print, and sh_exit inline — roughly 30 lines of assembly-heavy code just to print a string and shut down.

All that complexity has moved into reusable infrastructure files (startup.cpp, syscalls_*.cpp, cxx_runtime.cpp). The application code now reads like a normal C++ program. That is the whole point.

Build and run

The CMake changes are modest:

  • Five new source files added to add_executable
  • -ffreestanding removed
  • --specs=nano.specs added to linker flags
  • Post-build objcopy step generates .hex and .bin files alongside the ELF (useful for flash programmers and some emulators that want raw binary)
  • The Renode project dynamically generates .resc scripts in the build directory
  • The QEMU project adds the OUTPUT_BACKEND cache variable

Renode

git clone \
  --depth 1 \
  --branch blog-completed --single-branch \
  https://gitlab.com/sorhanp/cortex-m7-renode.git \
  cortex-m7-renode-blog-completed
cd cortex-m7-renode-blog-completed
cmake --preset arm-none-eabi-debug
cmake --build --preset arm-gcc-debug-build --target run-renode

Expected output (Renode):

Hello from Cortex-M7 on Renode!
Value: 42

QEMU

git clone \
  --depth 1 \
  --branch blog-completed --single-branch \
  https://gitlab.com/sorhanp/cortex-m7-qemu.git \
  cortex-m7-qemu-blog-completed
cd cortex-m7-qemu-blog-completed
cmake --preset arm-none-eabi-debug
cmake --build --preset arm-gcc-debug-build --target run-qemu

Expected output (QEMU):

Hello from Cortex-M7 on QEMU!
Value: 42

What we built

Here is a quick summary of what went from “missing” to “present”:

| What was added | Why it matters |
| --- | --- |
| Alignment directives in linker script | Prevents bus faults and ensures vector table correctness |
| .rodata, .ARM.extab, .ARM.exidx sections | Proper placement of read-only data and C++ unwind tables |
| .preinit_array / .init_array / .fini_array | C++ global constructors and destructors run automatically |
| .data copy loop in startup | Initialized globals have their correct values |
| .bss zero-fill in startup | Zero-initialized globals are guaranteed zero |
| Full 16-entry vector table with weak aliases | All core exceptions are handled; application can override any handler |
| Newlib-nano via --specs=nano.specs | printf, malloc, exit work out of the box |
| POSIX syscall stubs | Newlib’s backend for heap, I/O, and process control |
| C++ ABI stubs | Virtual function safety without pulling in libstdc++ exception machinery |

Over seven posts, this series went from a linker script that mapped flash and RAM for an STM32F746 to a fully hosted C++ environment. Along the way, the minimal vector table and startup code got the CPU running, CMake cross-compiled the project, the map file proved the layout was correct, and Renode and QEMU gave us execution and debugging, all without soldering a single pin.

Now the foundation is solid. The runtime is initialized, the standard library works, and main.cpp is just normal C++. From here you could add HAL drivers, an RTOS, or complex application logic and the environment would support it.

The full source is on GitLab: Renode project 1 and QEMU project 2.

References