Using gcc's profiling approach on this tiny target is a performance hit. Moreover, the JTAG hardware facility does not perform any profiling functions. However, gdb's built-in simulator is available, and there we can do anything.
We define a new section, .profiler, which holds all profiling information. We also define a new pseudo operation, .profiler, which instructs the assembler to add a new profile entry to the object file. The profile entry is recorded at the current address.
Pseudo operation format:
.profiler flags,function_to_profile [, cycle_corrector, extra]
where:
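flags
     a string made up of the following characters:
     s    function entry
     x    function exit
     i    function is in init section
     f    function is in fini section
     l    library call
     c    libc standard call
     d    stack value demand
     I    interrupt service routine
     P    prologue start
     p    prologue end
     E    epilogue start
     e    epilogue end
     j    long jump / sjlj unwind
     a    an arbitrary code fragment
     t    extra parameter saved (a constant value such as the frame size)
function_to_profile
     a function address
cycle_corrector
     a value to be added to the cycle counter; zero if omitted
extra
     any extra parameter; zero if omitted
For example: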
     .global fxx
     .type fxx,@function
     fxx:
     .LFrameOffset_fxx=0x08
     .profiler "scdP", fxx     ; function entry.
     			  ; we also demand that the stack value be saved
       push r11
       push r10
       push r9
       push r8
     .profiler "cdpt",fxx,0, .LFrameOffset_fxx  ; check stack value at this point
     					  ; (this is a prologue end)
     					  ; note that the extra parameter holds
     					  ; the frame size
       mov r15,r8
     ...
     .profiler cdE,fxx         ; check stack
       pop r8
       pop r9
       pop r10
       pop r11
     .profiler xcde,fxx,3      ; exit adds 3 to the cycle counter
       ret                     ; because the 'ret' insn takes 3 cycles
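
The flag letters listed above can be checked mechanically before a source file reaches the assembler. The following is only a sketch in Python, not part of the assembler: the helper name unknown_profiler_flags is hypothetical, and the accepted set is taken directly from the flag letters given in this section. The assembler performs its own checks, which may differ.

```python
# Flag letters for .profiler, copied from the list in this section.
PROFILER_FLAGS = "sxiflcdIPpEejat"


def unknown_profiler_flags(flags):
    """Return the characters of `flags` that are not .profiler flag letters.

    Hypothetical helper for illustration; surrounding double quotes,
    as in "scdP", are ignored.
    """
    stripped = flags.strip('"')
    return [ch for ch in stripped if ch not in PROFILER_FLAGS]


print(unknown_profiler_flags('"scdP"'))  # flags from the function entry above
print(unknown_profiler_flags("xcdz"))    # 'z' is not in the flag set
```

An empty list means every character is a known flag letter; anything else names the offending characters.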