Tuesday, July 17, 2012

Better timings

Since my last post, I've had a look at the ARM-on-ARM emulator with a view to improving the speed.

Using gprof, I found that:
  1. The image translation from RISC OS Sprite to the screen mode was unoptimised (i.e. it made two unoptimised function calls per pixel).  I improved this, but it made only a small overall difference, since the translation happens after the emulator has done the rendering work.
  2. gprof is useless for following what's going on in the emulator's generated code.
Building the emulator to dump its state after every emulated instruction, then doing a bit of fiddling with grep, sed, sort and uniq, let me find the instructions the renderer uses most (about 10000 times each when rendering the ACORN file).  One case stood out: the not-taken path of a conditional branch instruction always did a lot of work (hash table lookup, etc.), but the existing branch fixup code could be reused to improve it.

With that six- or seven-line code change, the BeagleBoard now renders the celtic_knot3 file in a little over 37s, down from 85s.
x86 PC: 14s, BeagleBoard: 37s
My RISC PC is not cooperating at the moment, but the same file takes about 55s to render on its 200MHz StrongARM.

The branch fixup code currently avoids having to clear the code cache: the generated code calls the fixup routine via LDR pc, [pc, #...], having first stored the code address in a scratch register.  That way, only the word loaded by that instruction has to be changed to point to the generated code instead of the fixup routine, and the next time the instruction is reached the fixup routine is bypassed (although the setup for the call remains).
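Roughly, as a sketch (the structure layout and names here are illustrative, not the emulator's actual data layout), the mechanism looks like this in C:

#include <stdint.h>

/* Hypothetical layout of one generated call site.  The LDR pc, [pc, #...]
   instruction jumps through the literal word stored near it. */
struct branch_site {
    uint32_t setup;    /* puts the code address in a scratch register      */
    uint32_t ldr_pc;   /* LDR pc, [pc, #offset] -- jumps via 'literal'     */
    uint32_t literal;  /* initially the address of the fixup routine       */
};

/* Called (through the literal) the first time this branch is reached. */
static void fixup_branch(struct branch_site *site, uint32_t generated_code)
{
    /* Only a data word changes, so the instruction cache stays valid; the
       next execution of the LDR pc jumps straight to the generated code,
       bypassing this routine (the setup instruction still runs). */
    site->literal = generated_code;
}

The per-execution cost that remains is the setup and the indirect load through the literal.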

Further possible improvements to be tried:
  1. Modify the first instruction of the fixed-up code to be a proper branch to the target address, as well as making the current change; then, if the code falls out of the code cache by itself, it will be quicker the next time it is run.
  2. Clear the ARM code cache explicitly, so that the faster code is used straight away.  This may be slower, because of the overhead of a system call the first time the branch occurs.  (A rough sketch of both ideas follows this list.)
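Here is that sketch (illustrative only, not code from the emulator; the branch encoding and GCC's __builtin___clear_cache are standard):

#include <stdint.h>

/* Rewrite the first word of a fixed-up call site as "B target" and make the
   change visible to the CPU.  The offset is in words and allows for the ARM
   pipeline (PC reads as the instruction's address plus 8). */
static void patch_to_direct_branch(uint32_t *site, uint32_t *target)
{
    int32_t offset = (int32_t)(target - site) - 2;
    *site = 0xEA000000u | ((uint32_t)offset & 0x00FFFFFFu);   /* B <target> (AL) */

    /* Improvement 2: flush explicitly so the direct branch is used on the very
       next execution.  On ARM Linux this ends up as a system call, which is
       the overhead mentioned above. */
    __builtin___clear_cache((char *)site, (char *)(site + 1));
}

Without the explicit flush (improvement 1 alone), the patched branch only takes effect once the stale instruction drops out of the cache of its own accord.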
The ARM-on-ARM emulator code is available on the SourceForge ROLF project site, at: http://ro-lf.svn.sourceforge.net/viewvc/ro-lf/ROLF/rolf/Libs/Compatibility/arm_arm_emulator.c?view=log

If you compile with -DSTANDALONE, it will create an executable that takes the name of a file containing ARM instructions and runs them (only really useful under gdb, to see what's going on).

The handling of unaligned memory accesses is still incorrect (except on my custom kernel, which fixes up the accesses in the old-fashioned way).
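Roughly speaking (a sketch, assuming the classic pre-ARMv6 rotated-load semantics that RISC OS code expects), the behaviour being fixed up looks like this in C:

#include <stdint.h>

/* Old-style ARM unaligned LDR: load the aligned word, then rotate it right
   by 8 bits for each byte of misalignment (little-endian memory assumed). */
static uint32_t old_style_unaligned_ldr(const uint8_t *memory, uint32_t addr)
{
    uint32_t aligned = addr & ~3u;
    uint32_t word = (uint32_t)memory[aligned]
                  | (uint32_t)memory[aligned + 1] << 8
                  | (uint32_t)memory[aligned + 2] << 16
                  | (uint32_t)memory[aligned + 3] << 24;
    uint32_t rotate = (addr & 3u) * 8;
    return rotate ? (word >> rotate) | (word << (32 - rotate)) : word;
}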

Update: http://ro-lf.svn.sourceforge.net/viewvc/ro-lf/ROLF/rolf/Libs/Compatibility?view=tar downloads the whole ROLF compatibility library, including the include files and disassembly code.  (I can't check this at the moment, but...) The following should generate an executable on an ARM system:

tar xf ro-lf-Compatibility.tar.gz
cd Libs/Compatibility/  # Possibly other subdirectory
touch config.h  # Usually generated by the ROLF configure routine
gcc -o standalone_emulator arm_arm_emulator.c arm_d*.c -DARCH_ARM -DSTANDALONE -DDISASSEMBLE -Iincludes -I.

Thursday, July 05, 2012

AWRender on BeagleBoard, with timings

The AWRender module is now working under ROLF on the BeagleBoard.

It's a long story, but it looks like the problem wasn't with unaligned accesses after all, as I'd thought it was (the newest module version doesn't make any).

The symptoms were twofold: Viewer crashed with a problem in the dynamic linker as soon as it tried to open an ArtWorks image, and the !AWRender BASIC program displayed just a small part of the image (and showed unaligned accesses occurring).

It wasn't until I built from svn sources on the x86 platform again that I noticed that the BASIC program generated exactly the same (wrong) output.  The implication was that the problem was (a) independent of the emulator and (b) previously fixed, but lost (the program had previously been working, as can be seen from screenshots on this blog).  This started me looking at the Viewer problem, and I eventually noticed that ARMLinux locates executables at 0x8000 (the same as RISC OS), rather than up above the 128MB boundary.  A build explicitly locating the executable in the same area resulted in properly displayed images (although, strangely, the celtic_knot3 image appears lighter on the BB than on the x86 PC when both are viewed via tightvnc on the same monitor).

This is what Linux's 'uname -a' and /proc/cpuinfo tell me about both systems:

BeagleBoard:

Linux (none) 2.6.39.1 #10 SMP Wed Jun 13 21:00:36 CEST 2012 armv7l GNU/Linux

Processor : ARMv7 Processor rev 3 (v7l)
processor : 0
BogoMIPS : 490.52

Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x1
CPU part : 0xc08
CPU revision : 3

Hardware : OMAP3 Beagle Board
Revision : 0020
Serial : 0000000000000000


PC:
Linux Microknoppix 2.6.32.6 #8 SMP PREEMPT Thu Jan 28 10:51:16 CET 2010 i686 GNU/Linux

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) Dual Core Processor 5050e
stepping : 2
cpu MHz : 2593.613
cache size : 512 KB
bogomips : 5189.36

The same goes for core 1 (the emulator is single-threaded, so the second core will make minimal difference to rendering speed).

The AMD processor appears to be over ten times faster than the BB's ARM, going by BogoMIPS.

The test I'm using is to display celtic_knot3, a file of 355812 bytes.  The rendering time is in the tens of seconds and is displayed in the titlebar of the Viewer window, so the file access time (in the milliseconds) is of no consequence.

The initial times I'm getting are:

BeagleBoard 85s, PC 15s (to the nearest second).

This is disappointing; the (200MHz) RISC PC manages it in 55s (see here).