Thursday, November 24, 2011

Booting Beagleboard by BBC BASIC

The RPC is the only machine I have left with a serial port (well, except the BBC Micro that's coming up to its 30th birthday next year, and it can't quite manage 115200 baud).

Below is a very short BBC BASIC program that assembles some code and uploads it to the BB's internal memory. You have to hold down the USR button on power-up/reset to get it to talk. (The slowest part of development was working out that the initial boot sequence uses the 8E1 protocol, not the 8N1 that x-loader, U-Boot and Linux use.)
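To make the 8E1/8N1 difference concrete, here is a minimal Python sketch of the bits that go on the wire for one byte in each format. It is not part of the BBC BASIC program, and the function names are my own, purely for illustration.

```python
def even_parity_bit(byte):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") & 1


def frame_bits(byte, parity="E"):
    """Bits for one serial frame: start bit, 8 data bits (LSB first),
    an optional even parity bit, then one stop bit."""
    bits = [0]                                    # start bit
    bits += [(byte >> i) & 1 for i in range(8)]   # data, least significant first
    if parity == "E":
        bits.append(even_parity_bit(byte))        # 8E1 only
    bits.append(1)                                # stop bit
    return bits


# 8E1 frames are one bit longer than 8N1 frames, so a receiver set to
# the wrong format misreads the parity bit as the stop bit.
print(len(frame_bits(0x02, "E")), len(frame_bits(0x02, "N")))
```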

I wanted to get into the boot process as early as possible without continuously re-flashing the NAND flash, and without plugging an SD card into the somewhat worn-out socket.

Brunel OS is an idea I've been thinking about for years. The concept is based on an Eiffel-inspired language that handles multi-threading by simply allowing only one thread at a time access to any object, and by allowing threads to pass through multiple memory maps (somewhat different from the process model of other operating systems I've had experience of).

So, an example of loading a very small program into internal RAM of the BeagleBoard from a Risc PC:

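REM Serial status: new status = (old AND &fffffe00) EOR &106
SYS"OS_SerialOp", 0, &106, &fffffe00
REM Data format: 8 data bits, 1 stop bit, even parity (8E1)
SYS"OS_SerialOp", 1, %011000
REM Receive and transmit baud rates: code 18 is 115200 baud
SYS"OS_SerialOp", 5, 18
SYS"OS_SerialOp", 6, 18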

PROCassemble

REM Receive ASIC ID (and ignore it, for the time being)
FOR R%=1 TO 58
  REPEAT
    SYS"OS_SerialOp", 4 TO ,J%;F%
  UNTIL (F% AND 2) = 0

  PRINT RIGHT$( "0"+STR$~J%, 2 );
NEXT

PRINT "Sending boot command"
PROCput( 2 )
PROCput( 0 )
PROCput( 3 )
PROCput( &f0 )
PRINT "Boot command sent"
DIM B% 4
!B%=P%-code%
PROCput( B%?0 )
PROCput( B%?1 )
PROCput( B%?2 )
PROCput( B%?3 )
PRINT "Length sent ";!B%
T%=0
REM NB: the loop includes P% itself, so it sends one byte more than the length announced above
FOR Q%=code% TO P%
  SYS"OS_SerialOp", 4 TO ,J%;F%
  IF (F% AND 2) = 0 THEN
    PROCshow( J% )
  ENDIF
  PROCput( ?Q% )
  T%+=1
NEXT
PRINT "Transmission finished "; T%; " bytes"

REM Kind of terminal (one way, at the moment)
REPEAT
  REPEAT
    SYS"OS_SerialOp", 4 TO ,J%;F%
  UNTIL (F% AND 2) = 0
  PROCshow( J% )
UNTIL 0

END


DEF PROCput( C% )
REPEAT
SYS"OS_SerialOp", 3, C% TO ; F%
UNTIL (F% AND 2) = 0
ENDPROC

DEF PROCshow( C% )
    R%+=1
    IF C%>31 OR C%=10 OR C%=13 THEN
      VDU C%
    ELSE
      PRINT "<";~C%;">"
    ENDIF
ENDPROC


DEF PROCassemble
DIM code% 1000
FOR I%=0 TO 1
P%=code%
[OPT 3*I%
  MOV r9, #&49000000          ;r9 = &49020000, built in two steps because
  ORR r9, r9, #&00020000      ;the constant does not fit a single immediate
  ADR r1, welcome
.write_loop
  LDRB r2, [r1], #1
  CMP r2, #0
  STRNEB r2, [r9]  ;This can overflow the buffer, but it's not important yet
  BNE write_loop

.loop
  B loop
.welcome
  EQUS "Welcome to Brunel OS 0.0.0-1"+CHR$10+CHR$13
  EQUB 0
  ALIGN
]
NEXT
ENDPROC

Output:
04010501343007561302010012150100000000000000000000000000000000000000001415010000000000000000000000000000000000000000Sending boot command
Boot command sent
Length sent 64
Transmission finished 65 bytes
Welcome to Brunel OS 0.0.0-1
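
For anyone without a Risc PC to hand, the byte stream the program sends once the 58-byte ASIC ID has arrived can be sketched in Python. This is only an illustration of the framing used in the listing above (the four boot command bytes 2, 0, 3, &f0, then a four-byte little-endian length, then the code itself); the names are hypothetical and nothing here talks to a serial port.

```python
import struct

# The bytes 02 00 03 F0 sent by PROCput, read as a little-endian word.
BOOT_COMMAND = 0xF0030002
# Number of ASIC ID bytes the BASIC program reads before replying.
ASIC_ID_LENGTH = 58


def build_boot_stream(image):
    """Boot command, image length, then the image, as sent by the listing."""
    header = struct.pack("<II", BOOT_COMMAND, len(image))
    return header + bytes(image)


stream = build_boot_stream(bytes(64))
print(stream[:8].hex())
```

Note that the stream carries exactly as many image bytes as the length field says. The BASIC loop runs from code% TO P% inclusive, which is why the output reports a length of 64 but 65 bytes transmitted.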

Friday, November 04, 2011

Alignment and modern ARM processors

Back when RISC OS was in its heyday, and the ARM processor was new, reading or writing a word at a non-word aligned address didn't cause an exception or cause two memory words to be read (and half discarded); instead it would read the word at (address & ~3) (i.e. the word containing the byte addressed) and load it into the register, rotated so that the addressed byte was the least significant byte in the register (IIRC).

ARM Architecture Reference Manual:

load single word ARM instructions are architecturally defined to rotate right the word-aligned data transferred by a non word-aligned address one, two or three bytes depending on the value of the two least significant address bits.

Modern ARM processors, when faced with this case, can trigger an alignment exception, allowing the OS to fix up the instruction so that it behaves in the (admittedly more high-level-language-friendly) way of reading the four bytes starting at the address and treating them as a single word; some hardware can even perform the fixup automatically.
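
The two behaviours can be compared with a small Python model. This is a sketch assuming little-endian memory held in a bytes object; the function names are mine and it models only the word load, not the exception path.

```python
def ror32(value, bits):
    """Rotate a 32-bit value right by the given number of bits."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def ldr_traditional(memory, address):
    """Old behaviour: load the word containing the addressed byte, then
    rotate it so that the addressed byte ends up least significant."""
    base = address & ~3
    word = int.from_bytes(memory[base:base + 4], "little")
    return ror32(word, 8 * (address & 3))


def ldr_modern(memory, address):
    """Fixed-up behaviour: the four bytes starting at the address."""
    return int.from_bytes(memory[address:address + 4], "little")


memory = bytes(range(0x10, 0x18))        # bytes 10 11 12 13 14 15 16 17
print(hex(ldr_traditional(memory, 1)))   # 0x10131211
print(hex(ldr_modern(memory, 1)))        # 0x14131211
```

Both loads agree on the least significant byte and on every aligned address; they differ only in which bytes fill the rest of the register, which is exactly what old code can come to depend on.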

Unfortunately, that means that some RISC OS (read: old) code breaks.

There are two Linux features that appear to have a bearing on the problem:
  1. /proc/cpu/alignment (allows you to set whether an alignment exception should be fixed up or a SIGBUS sent to the process).  Unfortunately, that has at least two problems: firstly, it's system-wide, and so may cause other programs to break; secondly, it doesn't register unaligned LDRs (probably because the hardware performs the fixup).
  2. The prctl system call has the following options defined in linux/prctl.h: PR_SET_UNALIGN and PR_GET_UNALIGN, with the possible values PR_UNALIGN_NOPRINT (for silent fixup) and PR_UNALIGN_SIGBUS (to signal the exception to the process for fixing up).  This is a per-process feature but, unfortunately, these options are not implemented in the ARM kernel.
Ideally, I think I'd like to use the prctl approach and have a third value to set the system to, PR_UNALIGN_ARM_TRADITIONAL, which would simply work the way ARM processors used to.  How practical that is remains to be seen (especially in the presence of multi-core processors; the behaviour has to be settable on a per-core basis).
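
A quick way to check whether a given kernel accepts the prctl option is to call it and look at the result. The sketch below does that from Python through ctypes; the numeric values are the ones from linux/prctl.h, while the helper name and its return convention are my own. It assumes a Linux C library that exports prctl and returns None anywhere else.

```python
import ctypes

# Values from linux/prctl.h
PR_GET_UNALIGN = 5
PR_SET_UNALIGN = 6
PR_UNALIGN_NOPRINT = 1   # silently fix up unaligned accesses
PR_UNALIGN_SIGBUS = 2    # deliver SIGBUS on unaligned accesses


def try_set_unalign(mode):
    """Ask the kernel to change this process's unaligned-access policy.

    Returns 0 on success, the errno value on failure, or None when
    prctl cannot be found at all (for example, not on Linux).
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        prctl = libc.prctl
    except (OSError, AttributeError, TypeError):
        return None
    zero = ctypes.c_ulong(0)
    if prctl(PR_SET_UNALIGN, ctypes.c_ulong(mode), zero, zero, zero) == 0:
        return 0
    return ctypes.get_errno()


print(try_set_unalign(PR_UNALIGN_NOPRINT))
```

On a kernel that does not implement the option for the current architecture, I would expect the call to fail with EINVAL rather than succeed.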

Wednesday, November 02, 2011

ARM Emulation on ARM - progress

The emulation approach I talked about in my last post is working well enough that most BASIC seems to work well (and the missing parts are probably due to omissions in the ROLF libraries, not down to the emulator code).  !Edit (from a RISC OS 4.02 ROM image) starts up, brings up an IconBar icon and opens a window (whose text can be read, provided it's not in system font), but for some reason the characters typed each get put on a new line!

There are still some bugs to be ironed out, and the (unoptimised) speed is less than I'd hoped.  Actually it's about 100 times slower than my 200MHz RiscPC!

Still, I have several features left to implement, such as a hash table for cache searching, fixing up jumps to non-local code so that the second and subsequent jumps make no search, compiling the library with gcc optimisation on (which made a surprisingly large difference on the x86 ARM emulator) and, if all else fails, reorganising the instruction identification mechanism to be more efficient.

Update 19:39.  Implemented a simple three instruction hash, for a better than 10x speedup.  Now only 10 times slower than the 15 year old machine!
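
For illustration, here is the kind of lookup I mean, sketched in Python rather than the emulator's own code. The table size, the hash (shift out the two always-zero address bits, then mask) and the direct-mapped replacement policy are all assumptions made for the sketch, not a description of the real implementation.

```python
TABLE_BITS = 10
TABLE_SIZE = 1 << TABLE_BITS


def slot_for(arm_address):
    """Cheap hash: ARM instructions are word aligned, so drop the low
    two bits and keep TABLE_BITS bits of what is left."""
    return (arm_address >> 2) & (TABLE_SIZE - 1)


class TranslationCache:
    """Direct-mapped cache from ARM address to translated block."""

    def __init__(self):
        self.tags = [None] * TABLE_SIZE
        self.blocks = [None] * TABLE_SIZE

    def insert(self, arm_address, block):
        i = slot_for(arm_address)
        self.tags[i] = arm_address      # a colliding entry is simply replaced
        self.blocks[i] = block

    def lookup(self, arm_address):
        i = slot_for(arm_address)
        if self.tags[i] == arm_address:
            return self.blocks[i]
        return None                     # miss: caller must translate the code


cache = TranslationCache()
cache.insert(0x8000, "translated block for &8000")
print(cache.lookup(0x8000))
print(cache.lookup(0x8004))
```

A lookup is one shift, one mask and one compare, instead of a search through every cached block.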

Update 19:55.  -O4 optimisation gives a 30% speedup (a test that used to take 167cs, now takes 115cs), although that's probably more about turning off some debug than anything else.

Update 00:21.  One fixup approach takes the time down to 78cs (for 1 million times around a FOR...NEXT loop).
Initially, I thought I'd have to clear the cache (which involves a system call, plus whatever the OS does) for each fixup, but I realised that I could use LDR pc, [pc,...] to load the destination address using the data cache (where the destination address would initially be the fixup routine and later the actual code address).  I might try some other approaches in the morning.  (I just noticed some debug output still in there, so I deleted it and... the time went UP to 106cs.  I'm going to bed.)
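
The idea behind the fixup can be modelled in a few lines of Python. This is only an analogy: the attribute that gets overwritten stands in for the data word read by LDR pc, [pc,...], and the class and parameter names are hypothetical.

```python
class JumpSlot:
    """One jump site. `destination` plays the part of the data word:
    it starts out pointing at the fixup routine and is overwritten
    with the real code address the first time the jump is taken."""

    def __init__(self, target_address, resolve):
        self.target_address = target_address
        self._resolve = resolve             # slow search of the code cache
        self.destination = self._fixup

    def _fixup(self):
        code = self._resolve(self.target_address)
        self.destination = code             # patch data only, no code is modified
        return code()

    def jump(self):
        return self.destination()


searches = []

def resolve(address):
    searches.append(address)                # record each slow search
    return lambda: "running code for %#x" % address

slot = JumpSlot(0x8000, resolve)
slot.jump()
slot.jump()
print(len(searches))                        # only the first jump searched
```

Because only a data word changes, no instructions are rewritten and so there is no need to clear the instruction cache after the fixup.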