EduOS Debugging Log — Hardest Errors

A record of the toughest bugs encountered while bringing EduOS from a hang at boot to a fully rendered desktop. Ranked by difficulty.


1. 🔴 Uninitialized .bss Section (Silent Triple Fault)

Difficulty: Extreme — No error output, no crash screen, just a black void.

Linking with --oformat binary produces a flat image, and nothing zeroes the .bss section (uninitialized and zero-initialized globals) before the kernel starts. Every global array (the 256-entry IDT, the 256-entry interrupt handler table, the GDT entries, the TSS) held leftover garbage instead of zeros. The CPU loaded corrupt IDT entries, jumped to garbage addresses, and triple-faulted before a single line of C code ever ran.

Why it was hard: There was zero diagnostic output. The VGA screen was black (mode 0x13 was set by the bootloader), but the kernel never executed init_serial(), so no serial logs either. The only clue was the absence of output.

Fix: Added a BSS zeroing loop in kernel_entry.asm using __bss_start/__bss_end linker symbols and rep stosd.
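
For illustration, the same zeroing step can be sketched in C. This is not the author's code (the actual fix lives in kernel_entry.asm and uses rep stosd); the function name zero_words is hypothetical, and both pointers are assumed to be 4-byte aligned.

```c
#include <stdint.h>

/* Zero every 32-bit word in [start, end), which is the job rep stosd does
 * in the assembly entry stub. Hypothetical helper; assumes both pointers
 * are 4-byte aligned and that start <= end. */
void zero_words(uint32_t *start, uint32_t *end)
{
    /* The volatile pointer discourages the compiler from turning this loop
     * into a memset call, which a freestanding kernel may not provide. */
    for (volatile uint32_t *p = start; p < end; ++p) {
        *p = 0;
    }
}
```

In a kernel, the two arguments would be the addresses of the __bss_start and __bss_end linker symbols. Doing this in assembly before any C code runs avoids depending on a C environment that has not been set up yet.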


2. 🔴 PIC Mask Restore → General Protection Fault

Difficulty: Hard — Crash appeared random and unrelated to any specific code.

After PIC remapping, the old code restored the BIOS default interrupt masks (port_byte_out(0x21, a1)). The BIOS had all 16 IRQs enabled, but only 3 had IDT gates registered (IRQ0 timer, IRQ1 keyboard, IRQ12 mouse). The moment sti enabled interrupts, a stray hardware IRQ (like IRQ6/floppy controller) fired, hit a null IDT entry, and triggered an immediate General Protection Fault.

Why it was hard: The GPF appeared to come from sti or init_window_manager(), making it look like a code bug. The real cause (a hardware IRQ with no handler) was invisible — you had to reason about which IRQs the BIOS leaves unmasked and cross-reference that with registered IDT gates.

Fix: Replaced the PIC mask restore with explicit masks: 0xF8 for PIC1 (unmasks only IRQ0, IRQ1, and IRQ2, where IRQ2 is the cascade line to PIC2) and 0xEF for PIC2 (unmasks only IRQ12).
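
A minimal sketch of how those two mask bytes follow from the list of IRQs that actually have IDT gates. The type pic_masks_t and the function pic_masks_for are hypothetical names used only for this example; they are not part of EduOS.

```c
#include <stdint.h>

typedef struct {
    uint8_t master; /* value to write to the PIC1 data port (0x21) */
    uint8_t slave;  /* value to write to the PIC2 data port (0xA1) */
} pic_masks_t;

/* Build both mask bytes from a 16-bit set of IRQs that have handlers
 * (bit n set = IRQ n should be delivered). A set bit in a PIC mask
 * register disables that line, so each byte is the inverted enable set. */
pic_masks_t pic_masks_for(uint16_t enabled_irqs)
{
    /* IRQ8-15 reach the CPU through IRQ2 on the master PIC, so the
     * cascade line must stay unmasked whenever any of them is enabled. */
    if (enabled_irqs & 0xFF00u) {
        enabled_irqs = (uint16_t)(enabled_irqs | (1u << 2));
    }

    pic_masks_t m;
    m.master = (uint8_t)~(enabled_irqs & 0xFFu);
    m.slave  = (uint8_t)~(enabled_irqs >> 8);
    return m;
}
```

Passing the set {IRQ0, IRQ1, IRQ12} yields 0xF8 for the master and 0xEF for the slave, matching the values in the fix. Deriving the masks from the handler list keeps the two from drifting apart when a new handler is added.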


3. 🟠 Missing Bare-Metal GCC Flags (Potential SSE/PIE Crashes)

Difficulty: Medium — Would manifest as random crashes in optimized builds.

The kernel was compiled with just -O2 -ffreestanding. Without -fno-pie, GCC could emit position-independent code referencing a Global Offset Table that doesn't exist. Without -mno-sse -mno-mmx, the compiler could emit SIMD instructions while the FPU/SSE state was never initialized — causing an Invalid Opcode (#UD) exception that, without proper IDT entries, would triple-fault silently.

Why it was hard: Omitting these flags doesn't always cause problems. Whether a crash occurs depends on what code GCC decides to optimize and which instructions it selects, so the crash might appear in one build and disappear in the next.

Fix: Added -fno-pie -fno-stack-protector -mno-sse -mno-mmx -mgeneral-regs-only to all GCC invocations.
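
One way to catch a missing flag early is to inspect GCC's predefined macros, which reflect the active code-generation options. The sketch below is an assumption about how such a check could look, not something EduOS is known to contain; the enum and function names are hypothetical.

```c
/* Bit flags for code-generation features this kernel cannot support.
 * Hypothetical names, used only for this example. */
enum {
    UNSAFE_PIC             = 1u << 0, /* position-independent code (needs a GOT) */
    UNSAFE_SSE             = 1u << 1, /* compiler may emit SSE instructions */
    UNSAFE_MMX             = 1u << 2, /* compiler may emit MMX instructions */
    UNSAFE_STACK_PROTECTOR = 1u << 3  /* stack canaries need runtime support */
};

/* Report which unsupported features the current compiler flags enable,
 * based on macros GCC predefines for each option. */
unsigned unsafe_codegen_features(void)
{
    unsigned found = 0;
#if defined(__PIC__) || defined(__PIE__)
    found |= UNSAFE_PIC;
#endif
#if defined(__SSE__)
    found |= UNSAFE_SSE;
#endif
#if defined(__MMX__)
    found |= UNSAFE_MMX;
#endif
#if defined(__SSP__) || defined(__SSP_STRONG__) || defined(__SSP_ALL__)
    found |= UNSAFE_STACK_PROTECTOR;
#endif
    return found;
}
```

With the flags from the fix, the function should return 0. In a real kernel the same conditions could be turned into #error directives so that a misconfigured build fails at compile time instead of at boot.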


4. 🟠 Incomplete IDT Registration (3 of 32 Exception Gates)

Difficulty: Medium — Any CPU exception became a silent death spiral.

Only ISR gates 0, 1, and 14 were registered in the IDT. Exceptions such as General Protection Fault (vector 13), Invalid Opcode (vector 6), Double Fault (vector 8), and Stack-Segment Fault (vector 12) hit null IDT entries. The CPU cannot dispatch through a null gate, so it raised a further fault that also had no handler, and the nesting escalated to a triple fault and a silent reset.

Why it was hard: Without exception handlers, there's no way to know which exception occurred or where. The CPU just resets, and you're left guessing.

Fix: Registered all 32 CPU exception gates (ISR 0–31) and added a human-readable exception name table for serial diagnostics.
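
A sketch of what such a name table might look like. The names follow the x86 architecture's vector assignments; the array and function names are hypothetical, and vectors without an entry here are reported as "Reserved" as a simplification.

```c
#include <stddef.h>

/* Names for the architecturally defined x86 exception vectors.
 * Entries left out of the initializer are NULL. */
static const char *const exception_names[32] = {
    [0]  = "Divide Error",
    [1]  = "Debug",
    [2]  = "Non-Maskable Interrupt",
    [3]  = "Breakpoint",
    [4]  = "Overflow",
    [5]  = "Bound Range Exceeded",
    [6]  = "Invalid Opcode",
    [7]  = "Device Not Available",
    [8]  = "Double Fault",
    [9]  = "Coprocessor Segment Overrun",
    [10] = "Invalid TSS",
    [11] = "Segment Not Present",
    [12] = "Stack-Segment Fault",
    [13] = "General Protection Fault",
    [14] = "Page Fault",
    [16] = "x87 Floating-Point Exception",
    [17] = "Alignment Check",
    [18] = "Machine Check",
    [19] = "SIMD Floating-Point Exception",
    [20] = "Virtualization Exception",
    [21] = "Control Protection Exception",
};

/* Look up a printable name for an exception vector (hypothetical helper). */
const char *exception_name(unsigned vector)
{
    if (vector >= 32) {
        return "Not a CPU exception";
    }
    return exception_names[vector] != NULL ? exception_names[vector] : "Reserved";
}
```

A common exception handler can then print exception_name(vector) over serial together with the faulting address, which turns a silent reset into a readable report.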


5. 🟡 QEMU Floppy Geometry Mismatch

Difficulty: Medium — Appeared as a regression between builds.

The OS image was ~20 KB, far smaller than a 1.44 MB floppy. QEMU infers floppy geometry from the image size, so it could select a format with only 9 sectors/track instead of 18. The disk_load.asm code assumed 18 sectors/track, so reads beyond sector 9 would silently fail or return wrong data.

Why it was hard: The behavior was inconsistent between builds (depended on exact image size) and between QEMU versions.

Fix: Padded os-image.bin to exactly 1,474,560 bytes (2,880 sectors of 512 bytes, the standard 1.44 MB size) to force correct geometry.
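
The arithmetic behind the padding, and the reason a wrong sectors/track value breaks reads, can be sketched as follows. The geometry constants are the standard 1.44 MB values; the function and type names are hypothetical.

```c
#include <stdint.h>

/* Standard 1.44 MB floppy geometry. */
enum { SECTOR_BYTES = 512, CYLINDERS = 80, HEADS = 2, SECTORS_PER_TRACK = 18 };

/* Total size of a full 1.44 MB image in bytes. */
uint32_t floppy_image_bytes(void)
{
    return (uint32_t)CYLINDERS * HEADS * SECTORS_PER_TRACK * SECTOR_BYTES;
}

/* Zero bytes to append to an image; 0 if it is already full size or larger. */
uint32_t floppy_pad_bytes(uint32_t image_bytes)
{
    uint32_t full = floppy_image_bytes();
    return image_bytes < full ? full - image_bytes : 0;
}

typedef struct {
    uint32_t cylinder;
    uint32_t head;
    uint32_t sector; /* 1-based, as BIOS int 0x13 expects */
} chs_t;

/* Map a logical block address to CHS for a given sectors/track value. */
chs_t lba_to_chs(uint32_t lba, uint32_t sectors_per_track)
{
    chs_t c;
    c.cylinder = lba / (HEADS * sectors_per_track);
    c.head     = (lba / sectors_per_track) % HEADS;
    c.sector   = lba % sectors_per_track + 1;
    return c;
}
```

For a 20,480-byte image, floppy_pad_bytes returns 1,454,080. The mapping also shows the mismatch: logical block 9 is cylinder 0, head 0, sector 10 with 18 sectors/track, but cylinder 0, head 1, sector 1 with 9 sectors/track, so a loader and an emulator that disagree on geometry read different data.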


6. 🟡 Debug Halt Left in Production Code

Difficulty: Easy to find, easy to miss — Classic "works on my machine" bug.

A vga_hang(2) call on line 37 of kernel.c filled the VGA framebuffer with green and halted the CPU forever. It was a debug checkpoint that was never removed.

Fix: Deleted the function and the call.


Key Takeaway

In OS development, the hardest bugs produce no output at all. The CPU triple-faults, resets, or hangs — and you're left staring at a black screen with no clues. The fix is always the same: build layers of diagnostics first (serial logging, VGA pixel markers, boot stage markers), then the actual bugs become visible.
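
As one possible shape for a VGA pixel marker, the sketch below paints a small block per boot stage along the top of a mode 0x13 framebuffer (320x200, one byte per pixel). The function name, marker size, and layout are assumptions made for this example, not the markers EduOS actually uses.

```c
#include <stdint.h>

enum { VGA_WIDTH = 320, VGA_HEIGHT = 200, MARKER_SIZE = 4 };

/* Paint a MARKER_SIZE x MARKER_SIZE block for the given boot stage.
 * Stage n occupies columns n*MARKER_SIZE onward in the top rows, so the
 * last block drawn shows how far boot progressed before a hang. */
void mark_boot_stage(volatile uint8_t *fb, unsigned stage, uint8_t color)
{
    if (stage >= VGA_WIDTH / MARKER_SIZE) {
        return; /* marker would fall off the right edge */
    }
    unsigned x0 = stage * MARKER_SIZE;
    for (unsigned y = 0; y < MARKER_SIZE; ++y) {
        for (unsigned x = 0; x < MARKER_SIZE; ++x) {
            fb[y * VGA_WIDTH + x0 + x] = color;
        }
    }
}
```

In a kernel, fb would point at the mode 0x13 framebuffer at physical address 0xA0000. Because the function only writes to memory, it works before the IDT, serial port, or any other subsystem is initialized.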