Execution speed
Calculations
Examples
Optimizing for speed
Speed test program
Instruction issues
External instructions
XOP
X
C
Timing diagrams
Electrical characteristics
The TMS9900 is the CPU, i.e. the brain of the TI-99/4A. This microprocessor executes a machine language program located in memory and controls all the other chips in the computer. It's a real 16 bits microprocessor, which means it has 16 data lines and an address space of 2^^16 bytes, i.e. 64K. To accomodate all these lines, TI had to create an extra-large 64-pins chip, that was quite a novelty by that time. They could even afford the luxury of having 5 non-connected pins!
+----+-----+----+
Vbb |1 o 64| HOLD*
Vcc |2 63| MEMEN*
WAIT |3 T 62| READY
LOAD* |4 M 61| WE*
HOLDA |5 S 60| CRUCLK
RESET* |6 59| Vcc
IAQ |7 9 58| nc
PHI1 |8 9 57| nc
PHI2 |9 0 56| D15
A14 |10 0 55| D14
A13 |11 54| D13
A12 |12 53| D12
A11 |13 52| D11
A10 |14 51| D10
A9 |15 50| D9
A8 |16 49| D8
A7 |17 48| D7
A6 |18 47| D6
A5 |19 46| D5
A4 |20 45| D4
A3 |21 44| D3
A2 |22 43| D2
A1 |23 42| D1
A0 |24 41| D0
PHI4 |25 40| Vss
Vss |26 39| nc
Vdd |27 38| nc
PHI3 |28 37| nc
DBIN |29 36| IC0
CRUOUT |30 35| IC1
CRUIN |31 34| IC2
INTREQ* |32 33| IC3
+---------------+
Power supply
Vbb: -5V Vcc: +5V Both pins (2 and 59) must be connected.
Vdd: +12V Vss: Ground. Both pins (26 and 40) must be connected.
Clock
Phi1-Phi3 are 4 input pins that receive the same signal from the
TMS9904 clock generator, with one exception: each signal is shifted by
1/4 of a phase with respect the the previous one.
Data bus
D0-D15: These 16 pins are used to read or write data. Note that
contrarily to almost anybody else, TI made D0 the most significant bit
(weight >8000) and D15 the least significant bit (weight >0001).
Address bus
A0-A14: These 15 output pins are used the specify the address of
32Kwords in memory. Each word is two bytes long (since we have 16 data
lines), thus we are effectively addressing 64Kbytes. On the TI-99/4A,
the
data bus is multiplexed as 2 x 8 bits for almost every purpose except
accessing
the console ROMs and the scratch-pad RAM. The multiplexing circuitery
controls
a pseudo-address line A15 that indicated whether the 8-bit data bus
contains
the most significant byte of the 16-bit bus (A15=0) ot the least
significant
byte (A15=1).
Bus control
MEMEN* Memory enable. When active low, this pin indicated that
the
TMS9900 wants to access memory and has placed a valid address on the
address
bus.
DBIN Data bus in. When active (high) this pin indicates that the TMS9900 is ready to accept data.
WE* Write enable. When active (low) this pin indicates that the TMS9900 has placed valid data on the bus.
CRU control
CRUCLK CRU clock. When active (high) this pin indicates the the
TMS9900 is performing a CRU operation (or an external instruction).
CRUOUT This output pin contains the data that the TMS9900 sends out during CRU operations.
CRUIN This pin is used by the TMS9900 to input data during CRU operations.
Interrupt control
INTREQ* When active (low) this input pin signals the TMS9900
that
an interrupt is pending. If it accepts the interrupt, the TMS9900 will
perform BLWP @>0000 through BLWP @>003C
depending
on the interrupt level.
IC0-IC3 These 4 input pins indicate the level of the interrupt (0-15). The LIMI instruction can be used to define the "cutoff" level, above which the TMS9900 will ignore interrupts. On the TI-99/4A those pins are hardwired as low,low,low,high which means all interrupts are level 1.
LOAD* Non-maskable interrupt. When active low, this pin forces the TMS9900 to perform a BLWP @>FFFC interrupt. If it remains low, interrupts continue to be issued, thus is should not remain low for more than one instruction.
IAQ Instruction acquisition. This output pin indicates that the TMS9900 is acquiring an instruction. It can be used to detect illegal opcodes or to prevent LOAD* to last longer than one instruction. On the TI-99/4A, IAQ and HOLDA are combined via an OR gate and presented to the peripheral port. However, the flex cable connector does not carry that signal to the PE-Box.
RESET* When active (low) this input pin resets the TMS9900 and inhibits WE* and CRUCLK. RESET* must remain low for at least 3 clock cycles. As soon as it becomes high again, the TMS9900 performs a BLWP @>0000 (note that it is the same vector as interrupt level 0).
Memory control
HOLD* When active (low) this pins tells the TMS9900 that a DMA
controller
wants to perform Direct Memory Access. The TMS9900 set D0-D15, A0-A14,
MEMEN*, DBIN and WE* in high impedance state (isolated) then activate
HOLDA
and waits until HOLD* becomes high again. On the TI-99/4A this pin is
hardwired
high, which means we cannot perform DMA operations.
HOLDA Hold acknowledge. This output pin is used to tell the DMA controller that the it can perform direct memory access.
READY When active (high) this input pin tells the TMS9900 that the memory is ready to read or write data. If it's low, the TMS9900 enters a wait state and suspends all operations until READY becomes high again. On the TI-99/4A this line is used to multiplex the data bus, i.e. handle it as two 8-bit bytes, instead of one 16-but word. WAIT When active (high) this output pin indicates that the TMS9900 is now in wait state.
To execute an instruction, the TMS9900 must first fetch it from memory, which takes a few clock cycles (depending on the memory), then is must execute the instruction wich takes more clock cycles (depending on the instruction) and may require fetching one or two arguments from the memory (again, more clock cycles according to the addressing mode and memory type).
On the TI-99/4A, there are two kinds of memory: 16-bit and 8-bit. Console ROMs (address >0000-1FFF) and the RAM scratch-pad (address >8300-83FF) are the only 16-bit memories. All the rest, including peripheral cards, and memory-mapped devices (GROM, VDP, sound and speech chips) are accessed in a byte-wise manner. This requires multiplexing the 16-bit data bus in two 8-bits chunks. An electronic circuitery in the console takes care ot that burden and uses the READY line to halt the TMS9900 until the peripheral has received/sent the second data byte. Therefore, accessing such a memory results in 4 wait states for each memory access.
The table below can be used to calculate how long an instruction takes to be executed.
Instruction | Clock cycles | Memory access |
Source | Destination |
---|---|---|---|---|
A | 14 | 4 | Y | Y |
AB | 14 | 4 | Y | Y |
ABS(pos) (neg) |
12 | 2 | Y | - |
14 | 3 | Y | - | |
AI | 14 | 4 | - | - |
ANDI | 14 | 4 | - | - |
B | 8 | 2 | Y | - |
BL | 12 | 3 | Y | - |
BLWP | 26 | 6 | Y | - |
C | 14 | 3 | Y | Y |
CB | 14 | 3 | Y | Y |
CI | 14 | 3 | - | - |
CKOF | 12 | 1 | - | - |
CKON | 12 | 1 | - | - |
CLR | 10 | 3 | Y | - |
COC | 14 | 3 | Y | - |
CZC | 14 | 3 | Y | - |
DEC | 10 | 3 | Y | - |
DECT | 10 | 3 | Y | - |
DIV (ovf) (no ovf) |
16 | 3 | Y | - |
92-124 (1) | 6 | Y | - | |
IDLE | 12 | 1 | - | - |
INC | 10 | 3 | Y | - |
INCT | 10 | 3 | Y | - |
INV | 10 | 3 | Y | - |
Jump (taken)
(not taken) |
10 | 1 | - | - |
8 | 1 | - | - | |
LDCR | 20 +2*bits | 3 | Y | - |
LI | 12 | 3 | - | - |
LIMI | 16 | 2 | - | - |
LREX | 12 | 1 | - | - |
LWPI | 10 | 2 | - | - |
MOV | 14 | 4 | Y | Y |
MOVB | 14 | 4 | Y | Y |
MPY | 52 | 5 | Y | - |
NEG | 12 | 3 | Y | - |
ORI | 14 | 4 | - | - |
RSET | 12 | 1 | - | - |
RTWP | 14 | 4 | - | - |
S | 14 | 4 | Y | Y |
SB | 14 | 4 | Y | Y |
SBO | 12 | 2 | - | - |
SBZ | 12 | 2 | - | - |
SETO | 10 | 3 | Y | - |
Shift
(disp in R0) |
12 +2*disp | 3 | - | - |
20 +2*disp | 4 | - | - | |
SOC | 14 | 4 | Y | Y |
SOCB | 14 | 4 | Y | Y |
STCR (1-7)
(8 bits) (9-15 bits) (16 bits) |
42 | 4 | Y | - |
44 | 4 | Y | - | |
58 | 4 | Y | - | |
60 | 4 | Y | - | |
STST | 8 | 2 | - | - |
STWP | 8 | 2 | - | - |
SWPB | 10 | 3 | Y | - |
SZC | 14 | 4 | Y | Y |
SZCB | 14 | 4 | Y | Y |
TB | 12 | 2 | - | - |
X (note 2) | 8 | 2 | Y | - |
XOP | 36 | 8 | Y | - |
XOR | 14 | 4 | Y | - |
Illegal | 6 | 1 | - | - |
Interrupts | 22 | 5 | - | - |
Reset | 26 | 5 | - | - |
Notes
1) DIV execution time, when no overflow occurs, depends on the partial
quotient after each clock cycle during execution.
2) For X, add this time to the execution time of the instruction found
at the source address, minus 4 clock cycles and 1 memory access.
For each source and destination arguments (if any) add the
following:
Address mode | Clock cycles |
Memory access |
---|---|---|
Rx | 0 | 0 |
*Rx | 4 | 1 |
*Rx+ (byte)
(word) |
6 | 2 |
8 | 2 | |
@>xxxx | 8 | 1 |
@>xxxx(Rx) | 8 | 2 |
Note
For the *Rx+ addressing mode, the number of clock cycles depends on
whether
the register must be incremented by 1 or by 2. The byte-oriented
operations
increment it by 1 and use 6 clock cycles, these are: AB, CB, MOVB, SB,
SOCB and SZCB. In addition, the LDCR and STCR are considered as byte
operations
if they transfer 1 to 8 bits (with 9 to 16 bits they are word
operations
and use 8 clock cycles).
LIMI 2
The LIMI instruction uses 16 clock cycles and 2 memory access
operations:
one to fetch the LIMI instruction, one to fetch the immediate value 2.
Execution: 16 cycles
1 mem access to read LIMI }
1 mem access to read 2 } add 8 cycles if program is in slow mem.
This adds up to:
16 cycles if the instruction is in the ROMs or the scratch-pad,
16+2*4=24 clock cycles otherwise (remember, there are 4 wait states per
memory access).
With a 3 MHz clock that cycles every 333 nanoseconds, this boils down to 16*333 ns = 5.33 microseconds or 24*333 ns = 8 microseconds, depending on which memory the instruction is in. Tip: divide the number of clock cycles by 3 to find the execution time in microseconds (with a 3 MHz clock).
CLR R2
The CLR instruction uses 10 clock cycles and 3 memory access
operations.
Depending on which memory it is in, it may require from 10 to 22 clock
cycles. But CLR also takes an argument that we must consider. In this
case,
the argument is a register and the clock cycles needed for its access
can
be found in table 1.
Execution: 10 cycles
1 mem access to read CLR: add 4 cycles if program is in slow mem.
1 mem access to read R2 }
1 mem access to write R2 } Add 8 cycles if workspace is in slow mem
CLR @TEST
The CLR instruction itself still uses 10 or 22 clock cycles to execute,
but now dealing with the argument requires 8 extra clock cycles and 1
extra
memory access operations (to read the address of TEST from the word
following
CLR in the program).
Execution: 10 cycles
Argument @xxx: 8 cycles
1 mem access to read CLR }
1 mem access to read TEST } Add 8 cycles if program is in slow mem
1 mem access to read the contents of @TEST }
1 mem access to write to @TEST } Add 8 cycles if TEST is in slow mem
If the address TEST is not in the ROMs nor in the scratch-pad, it will add 8+2*4=16 clock cycles to the execution time. Otherwise it just adds 8 cycles. We thus have:
10+8 = 18 cycles if program and TEST are both in fast memory.
10+8+8 = 26 cycles if program in slow memory, but TEST is in fast memory.
10+8+8 = 26 cycles if program is in fast memory, but TEST is in slow memory.
10+8+8+8 = 34 cycles if both program and TEST are in slow memory.
CLR *R2
Fetching the source argument from the address found in register R2
requires
4 clock cycles and 1 memory access operation. Depending whether the
workspace
is in the scratch-pad or not (no workspace should ever be in ROM!),
this
may require 4 additional clock cycles. Depending whether the target
address
found in R2 (i.e. *R2) is in slow or fast memory, we may have to add 4
more cycles.
Execution: 10 cycles
Argument *Rx: 4 cycles
1 mem access to read CLR: Add 4 cycles if program is in slow mem
1 mem access to read R2: Add 4 cycles if workspace is in slow mem
1 mem access to read the contents of *R2 }
1 mem access to write to *R2 } Add 8 cycles if R2 points to slow mem
So the total could be:
10+4 = 14 cycles if CLR, workspace and *R2 are all in fast memory.
10+4+4 = 18 if either CLR or the workspace is in slow memory.
10+4+4+4 = 22 if both CLR and workspace are in slow memory
10+4+8 = 22 if R2 points to slow memory
10+4+8+4 = 26 if R2 points to slow memory and either CLR or WS is in slow memory
10+4+8+4+4 = 30 if everything is in slow memory.
LDCR R1,7
The LDCR instruction itself requires 20 clock cycles. There may be
additonal
cycles require to retrieve the instruction, or the register. Writing
the
CRU definitely implies 2 additional clock cycles per bit transfered: in
this case we are transfering 7 bits, thus we'll eat 2*7=14 cycles.
Execution: 20 cycles
1 mem access to read LDCR: Add 4 cycles if program is in slow mem
1 mem access to read R1: Add 4 cycles if workspace is in slow mem
1 cru write operation: Add 2 cycles per bit transfered (here: 2*7=14)
LDCR *R1+,7
Now this one gets tricky: the LDCR instruction requires 20+2*7=14 clock
cycles plus whatever may be needed for the memory access operations,
just
as above. But now we must allocate time to increment R1 after
execution.
Since we are transfering only one byte (that is, less than one byte:
only
7 bits), R1 will be incremented by 1, which requires 6 clock cycles and
2 memory accesses. Depending on where the workspace is, this could add
up to 6+2*4=14 cycles.
Execution: 20 cycles
1 mem access to read LDCR: Add 4 cycles if program is in slow mem
1 mem access to read R1: Add 4 cycles if workspace is in slow mem
1 mem access to read *R1: Add 4 cycles if source is in slow mem
Incrementing R1: 6 cycles (byte operation)
1 mem access to write back R1: Add 4 cycles if workspace is in slow mem
1 cru write operation: Add 2 cycles per bit transfered (here: 7*2=14)
LDCR *R1+,14
Here, we are transfering 14 bits, thus the LCDR instruction takes
20+2*14=48
cycles, plus what's needed for the memory access. In addition, since we
are transfering more that 1 byte, R1 will be incremented by two, which
requires 8 clock cycles instead of the 6 mentionned above.
Execution: 20 cycles
1 mem access to read LDCR: Add 4 cycles if program is in slow mem
1 mem access to read R1: Add 4 cycles if workspace is in slow mem
1 mem access to read *R1: Add 4 cycles if source is in slow mem
Incrementing R1 by two: 8 cycles (word operation)
1 mem access to write R1: Add 4 cycles if workspace is in slow mem
1 cru write operation: Add 2 cycles per bit transfered (here: 14*2=28 cycles)
SRL R2,4
The SRL shift operation requires 12 clock cycles, plus 2 cycles for
each
position shifted. Since the displacement is 4 in this example, it will
require 12+2*4=20 cycles. Plus the number of cycles required for 3
memory
access operations: 0 to12 depending on the position of the program and
the workspace.
Execution: 12 cycles
Shifting: 2 cycles per position
1 mem access to read SRL: Add 4 cycles if program is in slow mem
1 mem access to read R2 }
1 mem access to write R2 } Add 8 cycles if workspace is in slow mem
SRL R2,0
Here, we are fetching the displacement from R0 (let's say it contains
5).
This indirect shift operation requires 20+2*5=30 cycles, and 4 memory
access
operations instead of 3, as we must now fetch R0.
Execution: 12 cycles
Shifting: 2 cycles per position
1 mem access to read SRL: Add 4 cycles if program is in slow mem
1 mem access to read R2 }
1 mem access to read R0 } Add 12 cycles if workspace is in slow mem
1 mem access to write R2 }
These calculations are a pain to perform, aren't they? I'm playing with the idea to write an optimisation helper, i.e. a program that would read an assembly source file, and produce a corresponding output file with the execution times listed as comments. But I don't know when I will have time for that. Anyone aware of such a program around here?
Now we can see what are the cycle-hungry operations: DIV, MPY, LDCR, STCR, XOP and BLWP.
That's why it is often wise to perform a multiplication using shifts and additions rather than MPY:
MPY R0,R8 |
Requires 72 cycles to execute (52 in 16-bits memory). And that's the fastest MPY.
Now if R0 contains 8, we could have written:
SLL R8,3 |
Which does the same, but only uses 30 cycles (18 in fast memory).
To multiply by ten, we could do:
SLL R8,1 Multiply by two |
This requires 58 cycles in fast memory and 114 in slow memory. True, this is slower than the initial MPY, but we may have a use for the intermediary result in R9 (that is, R8 times two).
For the same reason, many programers avoid calling subroutines with BLWP-RTWP and favor BL-B *R11, at least in critical regions of their programs.
We can't do much about LDCR and STCR, but this
is
less of a problem: these instructions are rarely used anyhow, and the
limiting
factor may well be the hardware they are addressing (although not very
likely: TTLs are fast).
All this theory is impressive, but we'd like to verify whether it is true in the real word. Let's write a little test program and time its execution with a stop watch (you may want to time it automatically, with the TMS9901 timer, but that's another story).
START LWPI >A800 Load our workspace ** 1 ** DELAY LI R1,100 You can change this value RET LWPI >20BA Assuming editor/assembler workspace |
The test program first copies the timed routine in memory. This could be the scratch pad memory or the memory expansion. Then it executes the delay loop. Once it is done, it returns to the caller. I have assumed that this is the Editor/Assembler cartridge (oe Funnelweb). If it's not, modify the return instrucutions accordingly.
Now let's do some measurements. First of all, assemble the program with the B @RET in line ** 3 **. The delay loops will be skipped and the program will return immediately. This allows us to account for the time it takes the Editor/Assembler module to enter our program, and to display the <press any key> message when returning from it. As you'll see, this is so fast that we cannot time it..
Then let's comment out line ** 3 ** and time our program in four
different
situations: Modify line ** 1 ** to use a workspace in the memory
expansion
(>A800) or in the scratch-pad (>83E0). For each of those, modify
line ** 2 ** to copy the delay loop in the memory expansion (>B000)
or in the scratch-pad (>8300). Write the resulting times in this
table:
Program ______ Worskpace |
Memory expansion | Scratch-pad |
Memory expansion | 89 sec (100%) | 62 sec (70%) |
Scratch-pad | 62 sec (70%) | 44 sec (49%) |
Of course, the faster way is to have both the program and the workspace in the scratch-pad: in our case, it's twice as fast as the slowest solution. Unfortunately, this is not practical as the scratch-pad is only 256 bytes long. And many of those bytes have special meanings. Most of the time however, it is possible for you to place your workspace in the scratch-pad: LWPI >8300 for instance. This will substantially speed up your program (in our case, by 30%), especially if you are carefull to reserve your registers for frequently used variables.
Now, if there are some speed-critical routines in your program (such a scrolling the screen left/right in an arcade game), and if they are small enough, you could copy them in the scratch-pad as we did above and execute them there. Say at >8320, not to overwrite your own workspace.
But anyhow, there is more to gain with the workspace than with the program. Consider the instruction A R1,R2 for instance: it requires one memory operation to ftech the instruction from the program memory, and three accesses to the workspace (to read R1, to read R2 and to write back R2). So by placing the workspace in fast memory you gain three times more than by plancing the instruction itself in fast memory.
To speed up the program itself, you could optimize your code to avoid using those instructions that require a lot of time to execute, as discussed above. And often an improved algorithm will do better than nay optimization trick!
It is not the purpose of these pages to teach assembly language. Thus I won't discuss in detail the meanings of each and every instructions (see my assembly language primer). However, there are a few that are worth noticing.
There are five so-called external instructions: LREX (Load and Restart EXecution), CKOF (clock off), CKON (clock on), RSET (reset) and IDLE. The first four were used in the 990 microcomputer and have no special meaning on the TMS9900. They just place a special code on address lines A0-A2 and send a pulse on the CRUCLK pin. RSET also set the interrupt mask to zero, just like a LIMI 0.
IDLE puts the TMS9900 in an idle state in which it remains until an interrupt, a RESET* or a LOAD* signal occurs. During that time, the processor repeatedly places the special code for IDLE on lines A0-A2 and pluses the CRUCLK pin.
The special codes are the following:
Instruction | A0 | A1 | A2 | A3-A14 |
---|---|---|---|---|
LREX | H | H | H | n/a |
CKOF | H | H | L | n/a |
CKON | H | L | H | n/a |
RSET | L | H | H | n/a |
IDLE | L | H | L | n/a |
CRU operations | L | L | L | From R12 |
These instructions should not be used on the TI-99/4A because neither the console nor any peripheral card I know of bothers with decoding lines A0-A2 to distinguish a CRU operations from an external instructions. In other words, the external instructions would be mistaken for CRU operations and could cause havoc.
We could however make use of them if we were to make a slight modification to the console: use a 74LS138 decoder to intercept the CRUCLK line and let the signal through only if the CRU code is present on lines A0-A2. The same decoder would activate five different lines, one for each external instruction, that would allow us to trigger five external devices.
+---------+ |
Note that we'll also need an inverter to make CRUCLK active high after the decoder. Very conveniently, there is a 74LS04 in the console with 3 non-connected inverters. It is located just above the TMS9900, right were we need it.!
Why would we need such a circuit?
- We could use the IDLE instruction to enter an idle state
and
wait for an interrupt.
- We could use a line (RSET* jumps to mind) the send a reset signal the
the TMS9904 clock driver. This will perform a hardware reset
programmatically,
as opposed to BLWP @>0000 that only performs a software
reset
(i.e. does not physically reset the peripheral chips).
- We could use the CKON* and CKOF* lines to switch the clock speed, by
feeding the appropriate signals to the TMS9904.
- More generally, we could use those line to activate any kind of
hardware
we want.
For instance, I was told (by Anders Persson) that the "Cortex"
computer, which was sold as a kit in the 80s, made use of these
instructions for the following:
- RSET caused a hardware reset
- IDLE lit an external LED. The system would issues lots of IDLE when
it had nothing to do, so you could see that.
- CKON and CKOF enabled and disabled a memory mapper system, in the
style of the one used by the SuperAMS card.
- LREX triggered a delayed interrupt after two instructions (via a
series of flip-flops clocked by the IAQ line). This was used by a
debugger to execute single instructions, by setting up a pseudo-RTWP
pointing at a given instruction, and then doing LREX RTWP. The RTWP
would go back to the desired instruction, which was executed, and then
the interrupt would fire and the debugger was back in control. Nifty,
isn't it?
The XOP (eXtended OPeration) instruction is kind of a
special
BLWP. It takes a source argument and an operation number from
0 to 15. The XOP instructions uses that number to perform a BLWP
to one among 16 vectors located in memory addresses >0040-007F.
In addition, it places the content of the source argument in the R11
register
of the new workspace. Finally, bit 6 (weight >0200) is set in the
status
register while a XOP instruction is executed. Note that the
TMS9900
does not test the INTREQ* interrupt request pin after a XOP operation.
The main advantage of XOP is that it only requires one word. Therefore, we could use it to replace any word in a program and interrupt its execution. That's how my debugger RIP v.2 works: to set a breakpoint is saves the content of a memory address and replaces it with an XOP 1. Execution of this XOP results in activating RIP that can then execute the saved instruction and/or ask the used what to do. True, BLWP *Rx is only one word long, and so is BLWP Rx, but both expect special values in the registers. The first one want a pointer to the WR and PC vectors in Rx, the second want the WR vector in Rx and the PC vector in the next register. XOP is much more convenient.
The drag is that all vectors for XOPs are in ROM memory, and only
three
of them (two with some consoles) have usefull values. That's because
the
GPL interpreter code begins right there. However, there are some eight
empty words at the end of the console ROM, TI could just have shifted
up
the whole stuff and provide us with 4 more XOPs. Oh well, we can do
with
the first three. Not to mention that some of the following vectors
happen
to contain usefull values.
Address | XOP | WR | PC | Comments |
---|---|---|---|---|
>0040 | 0 | >280A | >0C1C | Enters the extended GPL card |
>0044 | 1 | >FFD8 | >FFF8 | Very usefull for us |
>0048 | 2 | >83A0 | >8300 | Very usefull for us, but not always present |
>004C | 3 | >1100 | >06A0 | |
>0050 | 4 | >0864 | >06A0 | |
>0054 | 5 | >0864 | >C90D | |
>0058 | 6 | >8300 | >C342 | Could be used, although not meant to be so |
>005C | 7 | >D11D | >C180 | Dangerous (odd WR) |
>0060 | 8 | >DB46 | >0402 | Pops inside the keyscan routine with wrong WS |
>0064 | 9 | >0B60 | >83ED | |
>0068 | 10 | >0402 | >5802 | |
>006C | 11 | >011B | >837C | |
>0070 | 12 | >0300 | >0002 | |
>0074 | 13 | >0300 | >0000 | |
>0078 | 14 | >D25D | >1105 | Pops inside XML >0E with wrong WS |
>007c | 15 | >D109 | >09C4 | Pops intp the ISR (sprite motion) with wrong WS |
Now, any WR value that maps to the console ROM (below >2000) is useless as it will result in loosing the return address. I suspect that odd workspace addresses may also cause havoc. The same is true for PC values: we don't want to branch to ROM routines. Odd PC values have less importance since the TMS9900 will ignore the least significant bit anyway.
XOP 0 is hardwired to switch on a peripheral card whose CRU base address should be >1B00, then enters its ROM at address >4028, after having changed the worskpace to >2800. My guess is that this card was meant to implement extra GPL opcodes, but I don't think it has ever been released. Note that XOP 0 will not check whether the card is here or not before branching. If there is no such card, the TI-99/4A will crash. Now this is the ideal instruction to use to implement a debugger board...
XOP 1 is extremely usefull for us. All we need to do is to place a B @MYPROG at location >FFF8 and XOP 1 will enter our program. Note that this will preserve the LOAD interrupt vectors at locations >FFFC-FFFF.
XOP2 Be carefull about XOP 2: some consoles do not support it... If you decide to use it, you should probably place a B @WHERE instruction right at >8300 or soon after, since scratch-pad RAM is a precious resource. Also, having our workspace at >83A0 may disturb the data stack of the GPL interpreter...
XOP 6 The vectors are not guarantied to have these values on each and every console...
XOP 8, XOP 14, XOP 15 land in the middle of various routines
in the console ROMs. These routines expect a workspace of >83E0
which
won't be the case. What happens then depends on the contents of the
workspace
in use, of the >83E0 workspace and possibly of other bytes in the
scratch-pad.
It is not unconceivable that you can come up with a valid combination
that
would do something usefull, but why bother?
Another (admitedly not so usefull) trick with the XOP instructions is to use them to build a routine that can be called either with BLWP or with BL. This relies on the fact that XOP opcodes have a value of >2Cxx, which can serve as a valid workspace. See, like this:
MYSUB XOP R0,1 This is equivalent to DATA >2C40 (WS vector) MYSB1 ... Do something *--------------------------------------------------- *--------------------------------------------------- END |
If the procedure is called with BLWP @MYSUB it will treat the word at MYSUB as a worskpace pointer (of value >2C40) and begin execution of the subroutine at MYSB1 with a workspace of >2C40. Return to the caller is performed by a plain vanilla RTWP
If it is called with BL @MYSUB it executes XOP 1 which immediately branches to BLS with workspace >FFD8. BLS is a routine that converts BLs into BLWPs: it gets the PC vector from the data word following the XOP 1 instruction (in this case MYSB1) and puts it in R11 of workspace >2C40 (that receives the source argument R0 upon XOP execution). Then it copies the user's worskpace pointer and status (saved by XOP 1) into R13 and R15 of workspace >2C40. It gets the return address from the R11 in the user's workspace and places it in R14 of workspace >2C40. Finally it branches to the called subroutine, which will never be aware of all the above: it can just assumed it was called with BLWP, access parameters accordingly and return with RTWP.
That's a helluva slow way to call a subroutine, but it might be usefull in cases...
This instruction can be used to simulate another instruction: just place the corresponding code in the source argument of the X (eXecute) instruction.
Example:
LI R0,>37C3 >37C3 means STCR R3,15 |
If the executed instruction has operands, they will be fetched from the words following X, which somewhat limitates the usefullness of X. If it were to fetch the operands from the words following the operand of the X instruction, it would be a wonderfull way to write a debugger: place a memory pointer in R1 and then do: X *R1+ Ok,ok it's not that simple: we must trap the jumps and branches, and account for instructions that use R1, but you get the idea. Unfortunately, that's not the way X works...
Nevertheless, X may be usefull to replace a test in a frequently executed loop. For instance:
MOV @CLEAR,R2 SK1 LI R0,>2000 This loop is executed >2000 times CLEAR CLR *R1+ Simple alternative to calculating the values |
We could have written:
LI R0,>2000 |
But the first way is much faster, since we don't have to repeat the test at each execution of the loop.
Appart for comparison, this instruction can also be used to increment a register by four:
C *Rx+,*Rx+ |
This uses only one word of memory as opposed to the equivalent :
INCT Rx |
Note that the corresponding CB instruction would increment the register by two, but there is no advantage over a plain vanilla INCT in this case.
The TMS9900 is meant to be fed a 3 MHz clock signal (that's right, 3 not 30) by the TMS9904 clock generator. This signal comes on 4 different lines, each one being shifted by a quarter of a phase with respect to the previous one. A graphical representation of these signals looks like this:
|____ 333 ns |____
__/ 45 \___________________/ \______ Phi 1
_____ _____
________/ \____________________/ Phi 2
| 83 | _____
_______________/ \___________________ Phi 3
12 12_____
______________________/ \____________ Phi 4
5
The period of the clock, i.e the time between two pulses in a given phase is 333 nanoseconds for a 3 Megaherz clock (a nanosecond is a billionth of a second). Each pulse is high for about 45nanoseconds, with a rising time of 12 ns and a falling time of 12 ns (note that my graph is not drawn to scale).
Pulses in the next phase are 83 ns behind those in the first phase and there is a 5 ns lag time between the end of one pulse and the start of the corresponding pulse on the next phase. At least, that's what the data manual says, but if you add up durations: 12+45+12+5 you get 74 ns, not 83 ns! So were are the missing 9 nanoseconds? Your guess is mine...
Note that it is possible to crank up clock speed, upto 4 Mhz at least, without risking to fry the TMS9900. Such modifications the the TI-99/4A have been described (including in those pages), and are known to work. Of course any process that relies on execution speed to time an external device (such as disk access) will be messed up...
Below are some timing diagrams for the memory bus of the TMS9900. Note that these will be different in the PE-Box, due to the multiplexing of the data bus.
_ _ _ _ _ _ _
_| |_____| |_____| |_____| |_____| |_____| |_____| |_____ Phi 1
_ _ _ _ _ _ _
___| |_____|a|_____| |_____| |_____| |_____| |_____| |___ Phi 2
_ _ _ _ _ _ _
_____| |_____| |_____| |_____| |_____| |_____| |_____| |_ Phi 3
_ _ _ _ _ _ _
_______| |_____| |_____| |_____| |_____| |_____| |_____| Phi 4
____________ ____________________________
\_______________/ MEMEN*
_______________
____________/ \____________________________ DBIN
d d
_________________________________________________________ WE*
XXXXXXXXXXXX/ Valid address \XXXXXXXXXXXXXXXXXXXXXXXXXXXX A0-A14
______
XXXXXXXXXXXXXXX/ ^b \XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX READY
e e
_________________________________________________________ WAIT
CPU driven | Input mode |r| Input ^c | CPU driven D0-D15
________________
____________/ if instruction \___________________________ IAQ
Notes
a) The cycle begins and ends on the rising edge of Phi 2 pulses.
b) Inputs should be ready at least 30 ns before the rising edge of the
next Phi 1 pulse.
c) Inputs should remain valid for at least 10 ns after the falling edge
of the Phi 1 pulse.
d) Propagation delays are at most 30 ns for MEMEN*, DBIN, WE* and WAIT.
e) Propagation delays for all other outputs are at most 40 ns.
r) Read data
_ _ _ _ _ _ _
_| |_____| |_____| |_____| |_____| |_____| |_____| |_____ Phi 1
_ _ _ _ _ _ _
___| |_____|a|_____| |_____| |_____| |_____| |_____| |___ Phi 2
_ _ _ _ _ _ _
_____| |_____| |_____| |_____| |_____| |_____| |_____| |_ Phi 3
_ _ _ _ _ _ _
_______| |_____| |_____| |_____| |_____| |_____| |_____| Phi 4
____________ ____________________
\_______________________/ MEMEN*
_________________________________________________________ DBIN
___________________ ________________________
\____________/ WE*
XXXXXXXXXXXXXX/ Valid address \XXXXXXXXXXXXXXXXXXXX A0-A14
___
XXXXXXXXXXXXXXXXX\^b___/ d^\XXXXXXXXXXXXXXXXXXXXXXXXXXXXX READY
_______
______________________/c \__________________________ WAIT
CPU driven | CPU write data | CPU driven D0-D15
_________________________________________________________ IAQ
Notes
a) The cycle begins with the rising edge of Phi2.
b) The READY line is tested on the rising edge of the next Phi 1. It
should
be high at least 40 ns before that time..
c) If it's low, the TMS9900 enters a wait state and
d) retest the READY line at each Phi 1 pulse, until it is high again.
_ _ _ _ _ _ _
_| |_____| |_____| |_____| |_____| |_____| |_____| |_____ Phi 1
_ _ _ _ _ _ _
___| |_____|a|_____| |_____| |_____| |_____| |_____| |___ Phi 2
_ _ _ _ _ _ _
_____| |_____| |_____| |_____| |_____| |_____| |_____| |_ Phi 3
_ _ _ _ _ _ _
_______| |_____| |_____| |_____| |_____| |_____| |_____| Phi 4
XXXXXXXXXXXX/ Bit 1 address | Bit 2 address \XXXXXXXXXXXX A0-A14
___ ___
___________________|b |___________|c |_________________ CRUCLK
d d d d
XXXXXXXXXXXX/ Bit 1 value | Bit 2 value \XXXXXXXXXXXX CRUOUT
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX CRUIN
Notes
a) The cycle begins on the rising edge of a Phi 2 pulse.
b) A CRUCLK pulse is issued at the next Phi 2 pulse, until the end of
the
Phi 3 pulse.
c) A similar CRUCLK pulses is issued for each following bit.
d) Propagation delays are at most 30 ns for CRUCLK (40 ns for other
outputs).
_ _ _ _ _ _ _
_| |_____| |_____| |_____| |_____| |_____| |_____| |_____ Phi 1
_ _ _ _ _ _ _
___| |_____|a|_____| |_____| |_____| |_____| |_____| |___ Phi 2
_ _ _ _ _ _ _
_____| |_____| |_____| |_____| |_____| |_____| |_____| |_ Phi 3
_ _ _ _ _ _ _
_______| |_____| |_____| |_____| |_____| |_____| |_____| Phi 4
XXXXXXXXXXXX/ Bit 1 address | Bit 2 address \XXXXXXXXXXXX A0-A14
_________________________________________________________ CRUCLK
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX CRUOUT
____ ____
XXXXXXXXXXXXXXXXXXXXXXX/ ^b \XXXXXXXXXX/ ^c \XXXXXXXXXXXX CRUIN
Notes
a) The cycle begins on the rising edge of a Phi 2 pulse.
b) The CRUIN line is sampled at the rising edge of the second Phi 1
pulse
following the Phi 2 pulse.
c) Following bits are sampled on the rising edge of every second Phi 1
pulse.
d) No CRUCLK is generated during CRU input.
Parameter | Min | Nom | Max | Unit |
---|---|---|---|---|
Vbb | 5.25 | -5 | -4.75 | Volts |
Vcc | 4.75 | 5 | 5.25 | Volts |
Vdd | 11.4 | 12 | 12.6 | Volts |
Vss | - | 0 | - | Volts |
High level input | 2.2 | 2.4 | Vcc+1 | Volts |
Ditto for clock | Vdd-2 | - | Vdd | Volts |
Low level input | -1.0 | 0.4 | 0.8 | Volts |
Ditto for clocks | -0.3 | 0.3 | 0.6 | Volts |
Free-air temperature | 0 | 25 | 70 | `C |
Parameter | Test conditions | Min | Nom | Max | Unit |
---|---|---|---|---|---|
Data bus input current | Vss to Vcc | - | 50 | 100 | uAmp |
Clock input current | -0.3V to 12.6V | - | 25 | 75 | uAmp |
Other pins input current | Vss to Vcc | - | 1 | 10 | uAmp |
High level output voltage | -0.4 mAmp | 2.4 | - | Vcc | Volts |
Low level output voltage | 3.2 mAmp | - | - | 0.65 | Volts |
2.0 mAmp | - | - | 0.50 | Volts | |
Supply current from Vbb | - | - | 0.1 | 1 | mAmp |
Supply current from Vcc | - | - | 50 | 75 | mAmp |
Supply current from Vdd | - | - | 25 | 45 | mAmp |
Data bus capacitance | Vbb=-5 f=1Mhz | - | 15 | 25 | pF |
Clock 1 capacitance | Vbb=5 f=1Mhz | - | 100 | 150 | pF |
Clock 2 capacitance | Vbb=5 f=1Mhz | - | 150 | 200 | pF |
Clock 3 capacitance | Vbb=5 f=1Mhz | - | 100 | 150 | pF |
Clock 4 capacitance | Vbb=5 f=1Mhz | - | 100 | 150 | pF |
Other input capacitance | Vbb=5 f=1 MHz | - | 10 | 15 | pF |