## Oldskool: tiled texture mapping

While reading “Inside the Machine” (see my earlier post), I thought about the times (many many years ago!) when I was doing (demo)coding on my old 486 (then substituted by an AMD K5), using nice little tricks to optimize the execution of inner loops and writing assembly code day and night. In ’97 I wrote a little doc, which is still lying around in my hard drive (and somewhere non the net), on the subject of fast texture mapping by using cache-friendly tiling techniques. Just for fun I decided to restore this doc and put it in this blog. Keep in mind it’s almost 10 years old! :-)

Read on…

**Tiled Texture Mapping for pow2 Texture Sizes**

Author: TheGlide/SpinningKids – June 1998

*Introduction*

I assume here you know the basics of texture mapping, as eplained in *fatmap* and *fatmap2* docs by MRI/Doomsday (see on MRI’s homepage), since the present document is just an extension of those.

This doc is about doing texture mapping using texture maps stored as tiles, namely 8×8 pixels tiles. Storing the maps this way can improve very much cache access. Most of the time we have to traverse the texture through non-horizontal lines, and this causes many cache misses. The worst situation happens when we have to traverse the texture vertically: each texel we access will be on a different row and this will require, from the processor side, a whole cache line load, which is very slow.Storing the texture in 8×8 tiles ensures that every tile fits in two 32 bytes cache lines (on the Pentium), and as we traverse the texture we have a greater chance to read from the same cache line for a longer time.

Let’s assume for the moment that you have 256×256 textures.

So the U (x-axis) and V (y-axis) coordinates take up 8 bits:

`U : xxxxxxxx`

V : xxxxxxxx

*Tiling – method 1*

The first way to tile the map in 8×8 tiles is this one:

where numbers 0, 1, 2… indicate the order by which the 8×8 tiles are stored in memory (each tile being stored in 64 consecutive addresses in memory).

This way we can go from the original (U, V) coordinates to the ones in the tiled map with the following relationship:

`U : xxxxxXXX -> U' = 00000xxxxx000XXX`

V : xxxxxXXX -> V' = xxxxx00000XXX000

i.e. in C code:

`U' = (U&0x7)|((U<<3)&0x7c0);`

V' = ((V<<3)&0x38)|((V<<8)&f800);

Basically the lower 3 bits of both U and V (XXX, above) are used to address the texel inside a single tile, whereas the 5 upper bits (xxxxx, above) are used to select the tile inside the texture. The C code to convert normal texture coordinates (U,V) to tiled-texture (U’, V’) coordinates is the following:

`U' = (U&0x7)|((U<<3)&0x7c0);`

V' = ((V<<3)&0x38)|((V<<8)&f800);

This code enables us to convert a straight texture to a tiled texture:

`tiledtmap [U'+V'] = tmap [U+V*256]`

*Tiling – method 2 – the better method*

There’s another way to tile a texture map in 8×8 tiles. This one, which uses a different ordering of the tiles:

With this tiling method we get from the (U, V) coordinates within the original texture to (U’, V’) coordinates relative to the tiled map with this method:

`u : xxxxxXXX -> u' = xxxxx00000000XXX`

v : xxxxxXXX -> v' = 00000xxxxxXXX000

The corresponding C code is:

`u' = (u&0x7)|((u<<8)&0xf800);`

v' = (v<<3);

and as before it can be readily plugged in a converter from straight textures to tiled textures.

The code really ‘looks better’ than the first method above. It is easier and faster to convert from V to V’. That’s why we will choose this second tiling method.

Now, we could easily get our usual texture map scanline filler, put those relations inside the inner loop, and see the result. Slooow! At the expense of a little overhead, we can get a loop that is really little and optimized. So what can we do to directly use U’ and V’ in the loop and the corresponding dU’ and dV’, and read directly from the tiled texture?

We convert all of our starting U and V, and the corresponding deltas (dU, dV) that are calculated in the texture mapper before entering the inner loop (all quantities in 8.16 fixed point format, xxx is the integer part, XXX is the fractional part):

`U : xxxxxxxx,XXXXXXXXXXXXXXXx -> U' = xxxxx00000000xxx,0XXXXXXXXXXXXXXX`

V : xxxxxxxx,XXXXXXXXXXXXXXXx -> V' = 00000xxxxxxxx000,0XXXXXXXXXXXXXXX

dU : xxxxxxxx,XXXXXXXXXXXXXXXx -> dU' = xxxxx11111111xxx,1XXXXXXXXXXXXXXX

dV : xxxxxxxx,XXXXXXXXXXXXXXXx -> dV' = 00000xxxxxxxx111,1XXXXXXXXXXXXXXX

We have to fill the gaps in (dU’, dV’) with 1s because when we add them to the current (U’, V’) values we must propagate the carry from the lower bits to the bits that lie after the gap. After the addition we must not forget to mask out the 1s from the (U’, V’) we obtain.

Of the 16 bit fractional part we keep only the upper 15 bits. There’s a valid reason to do this: when calculating the offset to access the texel we add U’ and V’ and shift right by 16. If we kept all of the fractional bits, an hypotetical carry would propagate to the integer part, thus influencing the offset value. Keeping instead only the upper 15 bits of the fractional part, and putting a 1 bit gap between fractional and integer part the problem gets solved automatically. If this explanation seems harsh, look at the ‘picture’ of U’/V’ above.

Now, an hypothetical tiled tmap scanline filler would look like (sorry for the bad formatting, I still have to find a proper way to format/indent C code in WordPress :-):

`void tiledtmapline (int u, int v, int du, int dv,`

int run, const unsigned char * vid, const unsigned char * tmap) {

// on entry u,v,du,dv are in 8.16 format

u = (( u<<8)&0xf8000000) | ( u&0x70000) | (( u>>1)&0x7fff);

du = ((du<<8)&0xf8000000) | (du&0x70000) | ((du>>1)&0x7fff) | 0x7f88000;

v = (( v<<3)&0x07f80000) | (( v>>1)&0x7fff);

dv = ((dv<<3)&0x07f80000) | ((dv>>1)&0x7fff) | 0x78000;

vid+=run;

for (run=-run;run;run++) {

*(vid+run) = tmap [((unsigned int)(u+v)>>16)];

u =(u+du)&0xf8077fff; // addition + masking out the 1s in the gaps

v =(v+dv)&0x07f87fff; // same as above

}}

*Extending to pow2 textures*

Now comes the cool part. We will extend all the formulas we have developed to other texture dimensions (actually always power of 2). Let’s look at the U’ and V’ formats:

` U : xxxxxXXX -> U' = xxxxx00000000XXX`

V : xxxxxXXX -> V' = 00000xxxxxXXX000

bits 0-2 (LSB) of U’ and bits 3-5 of V’ are the coordinates in the single 8×8 tile. Since we always use 8×8 tiles, those fields won’t change in bitwidth. Let’s look at the remaining 5 bits of U’ (bits 11-16) and V’ (bits 6-10). 5 bits are need for 32 tiles. 32tiles*8pixels = 256 pixels.

It takes a minute to understand that by varying the number of those bits we can account for different texture sizes. With 4 bits we get 16 tiles, that is a 16*8=128 pixels width/height texture. Here are a couple of cases to make everything more clear:

- 128×128 tiled map ( = 16tiles x 16 tiles): U’ = 00xxxx0000000XX / V’ = 000000xxxxXXX000

- 64×64 tiled map ( = 8tiles x 8tiles): U’ = 0000xxx000000XXX / V’ = 0000000xxxXXX000

and so on…

So how can we handle all those cases in the formulas we wrote above? Easy: we simply need a parameter that tells us the number of bits for the ‘inter-tile’ addressing, and the corresponding mask. In formulas this will look like:

`// u,v,du,dv 16.16 fixed point quantities`

// bits = tile addressing bits

// mask = tile addressing bit mask

ushift = (3+bits);

umask = (mask<<(16+6+bits));

vmask = (mask<<(16+6))|0x380000;

dumask = vmask|0x8000;

u = (( u<<ushift)&umask)|( u&0x70000)|(( u>>1)&0x7fff);

du = ((du<<ushift)&umask)|(du&0x70000)|((du>>1)&0x7fff)|dumask;

v = (( v<<3)&vmask)|(( v>>1)&0x7fff);

dv = ((dv<<3)&vmask)|((dv>>1)&0x7fff)|0x78000;

and that’s all.

Here are the correct bits & mask values for the different texture sizes:

`size / bits / mask`

256x256 5 0x1f

128x128 4 0xf

64x64 3 0x7

32x32 2 0x3

16x16 1 0x1

8x8 0 0

The inner loop then looks like:

`innerumask = umask|0x77fff;`

innervmask = vmask|0x07fff;

vid+=run;

for (run=-run;run;run++) {

*(vid+run) = tmap [((unsigned int)(u+v)>>16)];

u =(u+du)&innerumask;

v =(v+dv)&innervmask;

}

And you got it! That’s a tiled texture mapper ready to handle any power of 2 texture size, subdvided in 8×8 tiles. ushift, umask, vmask, innerumask and innervmask do not need to be calculated at each scanline obviously as they depend solely on the dimensions of the texture. But a little overhead still remains; that’s true especially when you use this scanline filler in a perspective correct texture mapper that linearly interpolates every 16 pixels. One last thing to note is that wrapping is still allowed with this method.

*More extensions*

An obvious limit of the method I presented is that you can apply it to textures with a maximum dimension of 256×256 texels. Extending beyond this limit is not a problem: you only have to trade some bits from the fractional part, so they can be used to address more texels :)

That’s all folks!

## Leave a Reply