README for NV30 Fragment Compiler

Eric Chan <ericchan@cs.stanford.edu>

Initial date:
  10-11-2001

Last updated:
  $Author: ericchan $
  $Date: 2002/01/14 18:15:24 $

------------------------------------------------------------

OVERVIEW

This document describes the NV30 Fragment Compiler (nv30fc).

------------------------------------------------------------

HOW TO USE IT

To use nv30fc, specify "nv30f" as the fragment backend. For
example, in scviewer.in:

  codegen x86 nv30f

Before compiling each shader, nv30fc looks for the following
file in the same directory as the executable:

  nv30f_compile_options.txt

This text file allows you to set a number of compiler
options. In particular, you can control debugging output,
print instruction and resource use statistics, enable and
disable specific optimizations, or disable optimizations
altogether. See the sample "nv30f_compile_options.txt" file
for documentation. If this option file is not found, then a
default compiler configuration is used.

------------------------------------------------------------

OVERVIEW OF NV30FC DESIGN

Instruction selection is performed just once. nv30fc then
calls [pass_split] with the psnode_t DAG, which decomposes
the DAG into passes. The root node of each pass is marked,
except for the root of the entire shader, which is always
unmarked. Thus, if N is the number of marks, the number of
passes is N+1. [pass_split] is given a number of callbacks
so that it decomposes passes properly for the nv30
architecture. One of these callbacks, [nv30f_ps_map], is
likely to be called often: it calls [nv30f_test_compile] to
determine whether a subtree of psnode_t's will compile to a
single rendering pass. The compilation phase itself is
described below.
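
The relationship between marks and passes can be illustrated
with a short C sketch. sketch_psnode is a hypothetical
stand-in for psnode_t; its fields are assumptions, not the
real layout in [nv30f_prog.h]. Because the graph is a DAG,
shared subtrees are visited only once when counting marks.

```c
#include <stddef.h>

/* Hypothetical stand-in for psnode_t; the real type and field names differ. */
typedef struct sketch_psnode {
    int marked;                      /* nonzero if this node roots a pass */
    int visited;                     /* scratch flag for the traversal */
    size_t nkids;
    struct sketch_psnode **kids;     /* operands (children) in the DAG */
} sketch_psnode;

/* Count marked nodes reachable from n, visiting shared nodes only once. */
static size_t count_marks(sketch_psnode *n)
{
    size_t i, marks;

    if (n == NULL || n->visited)
        return 0;
    n->visited = 1;
    marks = n->marked ? 1 : 0;
    for (i = 0; i < n->nkids; i++)
        marks += count_marks(n->kids[i]);
    return marks;
}

/* The shader root is never marked, so N marks yield N+1 passes. */
size_t count_passes(sketch_psnode *root)
{
    return count_marks(root) + 1;
}
```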

After [pass_split] returns with a successful decomposition,
nv30fc assembles the passes (see Pass Assembly, below) and
generates final code for each pass.

In summary, we have:

1. Instruction selection.

2. Decomposition of DAG into passes.

3. Assembly of passes and final code generation.

------------------------------------------------------------

DATA STRUCTURES AND INSTRUCTION SELECTION

Several data structures are used to represent the shader
during compilation and pass decomposition.

The most important data structures, data types, and
constants are listed in [nv30f_prog.h].

The input to nv30fc is the intermediate representation of a
shader. This is a linked list of sym_t's, where each
fragment sym_t represents a tree of nodes (of type
node_t). The dependencies are ordered so that if A depends
on B, then B comes before A in the linked list. Hence the
last fragment sym_t represents the top-most tree. This is
the standard intermediate representation that all backends
see. sym_t and node_t are defined in [sym.h].
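
The ordering invariant on the sym_t list can be expressed as
a small check. sketch_sym is a hypothetical, simplified
stand-in for sym_t (the real type in [sym.h] is richer, and
real dependencies are found by walking node_t trees). The
sketch verifies that every dependency precedes its user and
returns the last fragment sym, which is the top-most tree.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for sym_t (see [sym.h] for the real one). */
typedef struct sketch_sym {
    const char *name;
    int is_fragment;                 /* nonzero for fragment syms */
    size_t ndeps;
    struct sketch_sym **deps;        /* syms whose trees this one references */
    struct sketch_sym *next;         /* next sym in the list */
} sketch_sym;

/* Return 1 if sym s appears strictly before sym target in the list. */
static int appears_before(const sketch_sym *list, const sketch_sym *s,
                          const sketch_sym *target)
{
    for (; list != NULL && list != target; list = list->next)
        if (list == s)
            return 1;
    return 0;
}

/* Check the ordering invariant: every dependency precedes its user. */
int deps_ordered(const sketch_sym *list)
{
    const sketch_sym *s;
    size_t i;

    for (s = list; s != NULL; s = s->next)
        for (i = 0; i < s->ndeps; i++)
            if (!appears_before(list, s->deps[i], s))
                return 0;
    return 1;
}

/* Given that invariant, the last fragment sym is the top-most tree. */
const sketch_sym *top_fragment(const sketch_sym *list)
{
    const sketch_sym *top = NULL;

    for (; list != NULL; list = list->next)
        if (list->is_fragment)
            top = list;
    return top;
}
```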

nv30fc first performs instruction selection. The goal is to
create a new DAG for this shader, where each node in the DAG
represents an nv30f hardware instruction. We use lburg to
perform instruction selection, but lburg needs more per-node
state than node_t provides. Hence, as a preprocess to
instruction selection, nv30fc builds a mirror DAG in which
each node is an [ir_node_t] (see the top of
[nv30f_instruction.bg2]). Each [ir_node_t] points back to
its associated node_t.

Equipped with node_t + ir_node_t information, nv30fc uses
lburg to perform instruction selection. The output is a new
DAG whose node type is nv30f_inode_t (see
[nv30f_prog.h]). Ideally, each inode would correspond to a
single hardware instruction. Unfortunately, there are
complications. Some shading language operations, such as
JOIN, translate into hardware as multiple instructions with
the same target variable. For example:

  float3 color = { r, g, b };

This is equivalent to a float3 JOIN operator, which might
translate to:

  MOV R0.x, R1.x;
  MOV R0.y, R2.x;
  MOV R0.z, R3.x;

The key point is that each instruction writes a different
component of the same variable. This information is not
well represented in tree/DAG form, so each inode actually
maintains a list of instructions: specifically, a linked
list of nv30f_opnode_t (see [nv30f_prog.h]). Each opnode is
exactly one hardware instruction, no more and no less. If an
inode contains 3 opnodes, then all 3 instructions must be
emitted within a single pass. In other words, although an
inode may contain multiple instructions, it is treated as an
atomic, indivisible unit. For this reason, very few inodes
contain multiple opnodes. A high-level operation such as
NORMALIZE translates into 3 inodes, each with a single
opnode: DP3, RSQ, MUL. The float3 JOIN operation, by
contrast, translates into a single inode with 3 MOV opnodes.
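
The atomicity of inodes can be sketched in C as follows. The
types are hypothetical simplifications of nv30f_opnode_t and
nv30f_inode_t, and the per-pass instruction budget passed to
inode_fits is an illustrative parameter, not an nv30fc API.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for nv30f_opnode_t and
   nv30f_inode_t; the real definitions in [nv30f_prog.h] differ. */
typedef struct sketch_opnode {
    const char *opcode;              /* e.g. "MOV", "DP3" */
    char dst_component;              /* component written: 'x'..'w' */
    struct sketch_opnode *next;      /* next instruction in the same inode */
} sketch_opnode;

typedef struct sketch_inode {
    sketch_opnode *ops;              /* all emitted in one pass, never split */
} sketch_inode;

/* Number of hardware instructions the inode contributes to a pass. */
size_t inode_length(const sketch_inode *in)
{
    size_t n = 0;
    const sketch_opnode *op;

    for (op = in->ops; op != NULL; op = op->next)
        n++;
    return n;
}

/* Because an inode is atomic, it fits in a pass only if all of its
   opnodes fit; `limit` is a hypothetical per-pass instruction budget. */
int inode_fits(const sketch_inode *in, size_t used, size_t limit)
{
    return used + inode_length(in) <= limit;
}
```

For the float3 JOIN above, one inode holds three MOV opnodes
(x, y, z), so with 2 of 4 slots already used it does not fit,
even though a single-opnode inode such as NORMALIZE's DP3
would.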

Finally, as a post-process to instruction selection, one
more DAG is built, with node type psnode_t (see
[nv30f_prog.h]). psnode_t is the generic node type used to
solve the multipass "pass decomposition" problem; here it
abstracts the instruction selection DAG. The [data] field of
each psnode_t points to the associated nv30f_inode_t, and the
[psnode] field of each nv30f_inode_t points back to the
associated psnode_t.
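
The one-to-one pairing of psnodes and inodes can be sketched
as below. Only the two cross-link fields are modeled; the
type names are hypothetical, and copying the DAG edges into
the psnodes is omitted.

```c
#include <stdlib.h>

struct sketch_psnode;

/* Hypothetical stand-ins: only the cross-link fields are modeled. */
typedef struct sketch_inode {
    int id;
    struct sketch_psnode *psnode;    /* back-link, like the [psnode] field */
} sketch_inode;

typedef struct sketch_psnode {
    sketch_inode *data;              /* like the [data] field */
} sketch_psnode;

/* Build exactly one psnode per inode and link each pair in both
   directions. Returns a caller-owned array of n psnodes (release it with
   free), or NULL on allocation failure. */
sketch_psnode *build_psnodes(sketch_inode *inodes, size_t n)
{
    sketch_psnode *ps = calloc(n ? n : 1, sizeof *ps);
    size_t i;

    if (ps == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        ps[i].data = &inodes[i];
        inodes[i].psnode = &ps[i];
    }
    return ps;
}
```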

So, to summarize: the input to nv30fc is a linked list of
sym_t/node_t's as intermediate code. This is temporarily
augmented with ir_node_t's. The combination is passed
through lburg, which produces a DAG of nv30f_inode_t's.
Finally, a psnode_t DAG is built, with exactly one psnode_t
for every nv30f_inode_t.

After all of this work, nv30fc is ready to begin compiling.

------------------------------------------------------------

COMPILATION

When invoked from the main program, nv30fc executes the
following eight steps in order. Some of these steps may be
disabled using flags in the configuration file. The relevant
nv30f* source files are listed in brackets for reference.

1. Initialization.  nv30fc sets compiler flags from the
   optional configuration file (see above), or uses the
   defaults if no file is found. [passgen, options]

2. Code generation.  nv30fc emits instructions from the
   inode DAG described above. [instremit, prog]

3. Verification.  nv30fc rewrites instructions as needed to
   obey nv30 hardware constraints (e.g. an instruction cannot
   source more than 4 unique constant values). [verify]

4. Assignment of constant, per-begin, and per-vertex
   temporaries.  nv30fc assigns constants using the DEFINE
   construct, assigns per-begin values using local program
   parameters and the DECLARE construct, and assigns
   per-vertex values using the input attribute
   registers. [calloc, balloc, valloc]

5. Optimization.  Before register allocation, nv30fc
   performs the following optimizations:

   * Collapse MUL, ADD into MAD.
   * Remove unnecessary MOV instructions.
   * Remove dead code.
   * Vectorize (parallelize) scalar operations.

   [optimize]

6. Register assignment.  nv30fc computes liveness
   information and assigns registers by graph coloring with
   simplification. [ralloc]

7. Optimization.  After register allocation, nv30fc performs
   the following optimizations:

   * Remove unnecessary MOV instructions.
   * Vectorize (parallelize) scalar operations.
   * Compute last value directly into output register.

   [optimize]

8. Output.  The final code is formatted, printed, and loaded
   using functions exported by the NV_fragment_program OpenGL
   extension. [passgen, prog]
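
The MUL/ADD collapse from step 5 can be illustrated with a
minimal peephole sketch. The instruction representation
(sketch_insn, integer registers, no write masks, swizzles, or
precision) is hypothetical, and the real [optimize] pass may
use different legality checks. The sketch fuses an adjacent
"MUL t, a, b; ADD d, t, c" pair into "MAD d, a, b, c" when t
has no other readers, assuming t is a temporary rather than
an output register.

```c
#include <stddef.h>

enum sketch_op { OP_MUL, OP_ADD, OP_MAD, OP_NOP };

/* Hypothetical three-address form: registers are plain integers and
   write masks, swizzles, and precision are ignored. */
typedef struct {
    enum sketch_op op;
    int dst;
    int src[3];                      /* MUL/ADD use src[0..1]; MAD uses all 3 */
} sketch_insn;

static int nsrc(enum sketch_op op)
{
    return op == OP_MAD ? 3 : (op == OP_NOP ? 0 : 2);
}

/* Count how many source operands in the program read register r. */
static size_t uses_of(const sketch_insn *code, size_t n, int r)
{
    size_t i, uses = 0;
    int k;

    for (i = 0; i < n; i++)
        for (k = 0; k < nsrc(code[i].op); k++)
            if (code[i].src[k] == r)
                uses++;
    return uses;
}

/* Fuse "MUL t, a, b; ADD d, t, c" (or "ADD d, c, t") into "MAD d, a, b, c"
   when t is read only by that ADD. Assumes t is a temporary, not an
   output. The MUL becomes a NOP for later dead-code removal. Returns the
   number of fusions performed. */
size_t fuse_mad(sketch_insn *code, size_t n)
{
    size_t i, fused = 0;

    for (i = 0; i + 1 < n; i++) {
        sketch_insn *mul = &code[i], *add = &code[i + 1];
        int t = mul->dst, other;

        if (mul->op != OP_MUL || add->op != OP_ADD)
            continue;
        if (uses_of(code, n, t) != 1)
            continue;
        if (add->src[0] == t)
            other = add->src[1];
        else if (add->src[1] == t)
            other = add->src[0];
        else
            continue;
        add->op = OP_MAD;
        add->src[0] = mul->src[0];
        add->src[1] = mul->src[1];
        add->src[2] = other;
        mul->op = OP_NOP;
        fused++;
    }
    return fused;
}
```

Checking the reader count over the whole program is
conservative: it may skip some legal fusions, but it never
drops a value that another instruction still needs.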

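Step 6 mentions graph coloring with simplification. The
sketch below shows the classic simplify/select scheme
(Chaitin style, with Briggs-style optimistic spilling) on a
hypothetical adjacency-matrix interference graph. Building
the graph from liveness information is omitted, and [ralloc]
may differ in its spill heuristics and data structures.

```c
#include <stddef.h>

#define SKETCH_MAX_TEMPS 32

/* Hypothetical interference graph over virtual temporaries: adj[i][j] is
   nonzero when temps i and j are simultaneously live (adj[i][i] == 0). */
typedef struct {
    size_t n;
    unsigned char adj[SKETCH_MAX_TEMPS][SKETCH_MAX_TEMPS];
} sketch_igraph;

/* Count neighbors of v that have not yet been removed from the graph. */
static size_t live_degree(const sketch_igraph *g, size_t v,
                          const unsigned char *removed)
{
    size_t j, deg = 0;

    for (j = 0; j < g->n; j++)
        if (!removed[j] && g->adj[v][j])
            deg++;
    return deg;
}

/* Color g with k >= 1 registers. On return color[i] is a register in
   [0, k), or -1 if temp i must be spilled. Returns the spill count. */
size_t color_graph(const sketch_igraph *g, int k, int color[])
{
    size_t stack[SKETCH_MAX_TEMPS], sp = 0, spills = 0, i, j;
    unsigned char removed[SKETCH_MAX_TEMPS] = {0};

    /* Simplify: repeatedly remove a node with fewer than k remaining
       neighbors; if there is none, optimistically remove the node of
       highest degree as a potential spill. */
    while (sp < g->n) {
        size_t pick = g->n, best = 0;

        for (i = 0; i < g->n; i++) {
            size_t deg;

            if (removed[i])
                continue;
            deg = live_degree(g, i, removed);
            if (deg < (size_t)k) {
                pick = i;
                break;
            }
            if (pick == g->n || deg > best) {
                pick = i;
                best = deg;
            }
        }
        removed[pick] = 1;
        stack[sp++] = pick;
    }

    /* Select: pop nodes in reverse order and give each the lowest
       register not already used by a colored neighbor. */
    for (i = 0; i < g->n; i++)
        color[i] = -1;
    while (sp > 0) {
        size_t v = stack[--sp];
        int c;

        for (c = 0; c < k; c++) {
            for (j = 0; j < g->n; j++)
                if (g->adj[v][j] && color[j] == c)
                    break;
            if (j == g->n)
                break;               /* register c is free for v */
        }
        if (c < k)
            color[v] = c;
        else
            spills++;
    }
    return spills;
}
```

For example, three mutually interfering temporaries color
cleanly with k = 3 but force one spill with k = 2.
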
------------------------------------------------------------

PASS ASSEMBLY PROCESS

In this stage, nv30fc takes a DAG that has already been
decomposed into passes and finalizes those passes. In
particular, the input is a DAG of psnode_t's whose pass
boundaries are marked. The output is a linked list of
pass_t's that can be fed to the rest of the RTSL system.
Each pass_t is actually an nv30f_pass_t, which contains
information such as local parameters and the fragment
program id.

Pass assembly involves ordering the passes, assigning
registers (texture units) to the results of all passes
except the final one, and generating final code (including
restores / texture fetches) for each pass.

This is all accomplished in [nv30f_buildpass.c].
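
Ordering the passes and assigning texture units can be
sketched as a depth-first post-order walk over a pass
dependency graph. The graph representation and function
names are hypothetical, the graph is assumed acyclic, and
giving every non-final pass its own unit is a simplification;
[nv30f_buildpass.c] may order passes or reuse units
differently.

```c
#include <stddef.h>

#define SKETCH_MAX_PASSES 16

/* Hypothetical pass graph: dep[p][q] is nonzero when pass p reads the
   framebuffer result of pass q (via a texture fetch). Must be acyclic. */
typedef struct {
    size_t n;
    unsigned char dep[SKETCH_MAX_PASSES][SKETCH_MAX_PASSES];
} sketch_passgraph;

static void visit(const sketch_passgraph *g, size_t p,
                  unsigned char *seen, size_t *order, size_t *len)
{
    size_t q;

    if (seen[p])
        return;
    seen[p] = 1;
    for (q = 0; q < g->n; q++)
        if (g->dep[p][q])
            visit(g, q, seen, order, len);
    order[(*len)++] = p;             /* post-order: dependencies first */
}

/* Order passes so every pass follows the passes it reads, ending with
   final_pass, and give each earlier pass its own texture unit.
   Returns the number of passes placed in order[]. */
size_t order_passes(const sketch_passgraph *g, size_t final_pass,
                    size_t order[], int texunit[])
{
    unsigned char seen[SKETCH_MAX_PASSES] = {0};
    size_t len = 0, i;

    visit(g, final_pass, seen, order, &len);
    for (i = 0; i < g->n; i++)
        texunit[i] = -1;             /* -1: result is not read as a texture */
    for (i = 0; i + 1 < len; i++)
        texunit[order[i]] = (int)i;  /* one unit per non-final pass */
    return len;
}
```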

------------------------------------------------------------

RUNTIME PROCESS

At render time, the NV_fragment_program extension is
enabled and the program generated by nv30fc is bound.
Per-begin values are also bound at run time using the
program parameter functions in the API. After each non-final
pass, the framebuffer is copied to a texture for use in
subsequent passes.

------------------------------------------------------------

SOURCE FILES

balloc        Handles per-begin temporary assignment.
buildpass     Assembles final list of rendering passes.
calloc        Handles constant temporary assignment.
compile       Handles internal compiler stages.
dot           Converts instruction DAGs to .dot files.
instremit     Handles code generation.
instruction   Handles instruction selection.
liveness      Handles liveness analysis.
optimize      Handles optimizations.
options       Handles compiler options.

passgen       Entry point of nv30fc, interfaces with other
              compiler stages.
	      
prog          Handles various compiler bookkeeping details.

psnode        Handles interface between internal nv30fc
              representation and external, generic psnode_t
              data type.

ralloc        Handles register allocation.
valloc        Handles vertex interpolant allocation.
verify        Handles hardware constraint verification.

------------------------------------------------------------

KNOWN ISSUES

Shaders that use older RTSL operators, such as PREVCOLOR,
BUMPDIFF, and LUT, fail to compile. You can avoid these
operators when writing nv30 fragment shaders: you can
perform your own diffuse and specular bump-mapping
calculations, and nv30 supports dependent texture reads.


