Lecture 07: Rasterization (slides)

Learning Objectives

By the end of this lecture, you will be able to:

  • project your models from world space to screen space,
  • write vertex shaders to transform and/or pass vertex data through the rasterization pipeline,
  • write fragment shaders to assign pixel colors,
  • rasterize your models.

Rasterization (object-first) versus Ray Tracing (pixel-first).

Our main goal today is to start using a new rendering technique called rasterization. The main difference between rasterization and what we have been doing so far with ray tracing is that ray tracing is a pixel-first approach to rendering, whereas rasterization is object-first. Recall that, in ray tracing, we start by sending a ray through a pixel, see what it hits (if anything), and then calculate the pixel color using some lighting model. Here are the steps we had in Lecture 2:

Ray Tracing:

1. set up your image and place your observer.
2. for each pixel in your image:
  a. create a ray originating at your camera position that passes through this pixel.
  b. determine the closest object in your scene intersected by the ray.
  c. determine the color of the pixel, based on the intersection.

Rasterization, on the other hand, starts with the objects in our scene, projects them onto our screen, determines which pixels our projected object covers and then applies some lighting model. Here is what these steps look like:

Rasterization:

1. set up view and place your observer.
2. for each triangle in your models:
  a. project triangle into viewing space.
  b. determine which pixels this triangle covers.
  c. for each pixel covered by the triangle:
    i. determine the depth of this object fragment.
    ii. if (depth < minimum depth of this pixel):
          A. determine the color of this pixel.

For us, the newest part of this process is the projection of the model triangles (or other geometric shapes like lines or quads). Everything else, such as setting up a view and shading our models, is similar to what we did in ray tracing.

The rasterization pipeline.

In practice, modern graphics APIs define the steps above as a sequence of stages and provide us with ways to inject our own code into these stages. This code we will inject is called a shader. We'll issue commands to the API that will (1) write the necessary data to memory and (2) invoke the rasterization pipeline in order to actually draw our scenes. Before proceeding, it's important for us to understand some of the terminology of these stages:

  1. Vertex Processing: transforms vertices onto the screen by first transforming them into camera space, and then projecting them onto the screen.
  2. Rasterization: breaks the projected geometry into fragments. A fragment is just a pixel with a bit of extra information, like the depth of the rasterized geometry relative to the camera.
  3. Fragment Processing: determines the color of each fragment, e.g. using shading.

This pipeline is shown below where the gray boxes show the three main stages and the white boxes show which data is an input/output to each stage.

The first stage involves projecting the vertices of our geometric primitives to the screen, as shown by the dashed lines below (some of the dashed projection lines are omitted for clarity). In the example below, there are two primitives: one blue triangle and one red square. The rasterization process will also keep track of the depth of each fragment so that we can correctly render the blue triangle in front of the red square below.

After the vertex data is projected to the screen, the rasterization process determines which pixels are covered by our geometric primitives. Let's just focus on the blue triangle for now. Remember barycentric coordinates ($u, v, w$) from when we did ray-triangle intersections? If any of $u,\ v$, or $w$ were outside the $[0,1]$ range, then the ray did not intersect the triangle, because the values of the barycentric coordinates define a point outside the triangle. Similarly, we can check the barycentric coordinates of each pixel with respect to the projected triangle and again determine if the pixel is inside or outside the triangle. It would be wasteful to check every pixel, so we can optimize this by only checking the pixels in the bounding box of the projected triangle, which involves checking all the gray pixels (and blue ones) below.

   

As mentioned earlier, rasterizers also keep track of the depth of each fragment so that we can correctly render objects in order. To do so, we need to talk about the math behind projection, as well as the full vertex transformation pipeline.
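
To make steps 2(b) and 2(c) concrete, here is a minimal JavaScript sketch of the coverage and depth test for a single projected triangle. None of these names (rasterizeTriangle, setPixel, depthBuffer) come from WebGL; they are just for illustration, and the triangle vertices are assumed to already be in pixel coordinates with a depth value attached.

// A sketch of steps 2b and 2c for one projected triangle.
// p0, p1, p2: {x, y, depth} in pixel coordinates (already projected).
// depthBuffer: Float32Array of size nx * ny, initialized to Infinity.
// setPixel(i, j, u, v, w): callback that shades the covered pixel.
function rasterizeTriangle(p0, p1, p2, nx, ny, depthBuffer, setPixel) {
  // twice the signed area of the triangle (an "edge function")
  const area = (p1.x - p0.x) * (p2.y - p0.y) - (p1.y - p0.y) * (p2.x - p0.x);
  if (area === 0) return; // degenerate triangle

  // only check pixels inside the bounding box of the projected triangle
  const imin = Math.max(0, Math.floor(Math.min(p0.x, p1.x, p2.x)));
  const imax = Math.min(nx - 1, Math.ceil(Math.max(p0.x, p1.x, p2.x)));
  const jmin = Math.max(0, Math.floor(Math.min(p0.y, p1.y, p2.y)));
  const jmax = Math.min(ny - 1, Math.ceil(Math.max(p0.y, p1.y, p2.y)));

  for (let j = jmin; j <= jmax; j++) {
    for (let i = imin; i <= imax; i++) {
      const x = i + 0.5, y = j + 0.5; // pixel center
      // barycentric coordinates of the pixel center w.r.t. the triangle
      const u = ((p2.x - p1.x) * (y - p1.y) - (p2.y - p1.y) * (x - p1.x)) / area;
      const v = ((p0.x - p2.x) * (y - p2.y) - (p0.y - p2.y) * (x - p2.x)) / area;
      const w = 1 - u - v;
      if (u < 0 || v < 0 || w < 0) continue; // pixel is outside the triangle

      // interpolate the fragment depth and apply the depth test
      const depth = u * p0.depth + v * p1.depth + w * p2.depth;
      if (depth < depthBuffer[j * nx + i]) {
        depthBuffer[j * nx + i] = depth;
        setPixel(i, j, u, v, w); // determine the color of this pixel
      }
    }
  }
}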

From Object Space $\rightarrow$ World Space $\rightarrow$ Camera Space $\rightarrow$ Viewing Volume $\rightarrow$ Screen.

Our ultimate goal is to transform the surface points of our objects onto the screen. For models represented with a mesh, this involves transforming the vertices onto the screen. To do so, we'll break up the full transformation into 4 stages. The first stage involves placing the model in the scene and will feel familiar from what we did in Lecture/Lab 4. However, the last three stages are essentially the reverse of what we did in ray tracing. Remember that, in ray tracing, we were trying to express pixel coordinates in terms of how the scene is defined (the world space). In rasterization, we need to express our models in screen space, which first involves a transformation to the frame of reference of our camera.

Homogeneous coordinates & homogenization.

Before going into the four stages below, I want to make one more point about homogeneous coordinates. Previously, we had seen that homogeneous coordinates are a convenient way to express all our transformations as a single 4x4 matrix. Position vectors are represented in homogeneous coordinates by setting the fourth coordinate to 1, whereas direction vectors have a fourth coordinate of 0.

The property I want to highlight right now is the idea of homogenization. This refers to the process of dividing the first three coordinates by the fourth one. The following two points should be understood as representing the same Cartesian point using homogeneous coordinates, and we will thus interpret them as being equivalent:

$$ \left[\begin{array}{c} x \\ y \\ z \\ w \end{array}\right] \leftrightarrow \left[\begin{array}{c} x / w \\ y / w \\ z / w \\ 1 \end{array}\right] $$
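
As a tiny example in JavaScript (assuming glMatrix's vec4 is available as in the labs; the function name homogenize is just for illustration), homogenization looks like this:

// divide the first three coordinates by the fourth one
function homogenize(p) {
  // p is a glMatrix vec4 with p[3] != 0
  return vec4.fromValues(p[0] / p[3], p[1] / p[3], p[2] / p[3], 1.0);
}

const p = vec4.fromValues(2, 4, 6, 2);
console.log(homogenize(p)); // [1, 2, 3, 1] -- the same Cartesian point (1, 2, 3)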

Stage 1: Model Matrix - transforming models from object space to world space.

When we import a mesh (e.g. from a .obj file), its vertices are defined in object space. Similar to how we rotated the bird in Lab 04, we can transform the model to place it into the world space of our scene. This is the global 3d coordinate system where all objects in the scene are laid out. We previously called this transformation the model matrix, and we will do so again. So the first transformation is from object space to world space, which we will denote as $\mathbf{M}_m$ ($m$ for model).
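
For example, a model matrix that scales, rotates, and then translates a model could be built with glMatrix as follows (a sketch; the angle and offsets are arbitrary, and glMatrix is assumed to be loaded as in the labs):

// build M_m = T * R * S (read right-to-left: scale, then rotate, then translate)
let Mm = mat4.create(); // identity
mat4.translate(Mm, Mm, [0, 1, -5]);   // place the model in the world
mat4.rotateY(Mm, Mm, Math.PI / 4);    // rotate 45 degrees about the y-axis
mat4.scale(Mm, Mm, [0.5, 0.5, 0.5]);  // shrink the model by half

// transform an object-space vertex to world space
let vWorld = vec3.create();
vec3.transformMat4(vWorld, [1, 0, 0], Mm);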

Stage 2: View Matrix - transforming models from world space to camera space.

As mentioned earlier, our ray tracers required us to express pixel coordinates in world space, and we set up a change-of-basis matrix (with a translation by the eye $\vec{e} = (e_x, e_y, e_z)$) to do that (using the glMatrix targetTo function). Our camera matrix (in Lecture 4) was previously a combination of a change-of-basis $\mathbf{B}$, followed by a translation $\mathbf{T}$:

$$ \mathbf{C} = \left[\begin{array}{cccc} u_x & v_x & w_x & e_x \\ u_y & v_y & w_y & e_y \\ u_z & v_z & w_z & e_z \\ 0 & 0 & 0 & 1 \\ \end{array}\right] = \left[\begin{array}{cccc} 1 & 0 & 0 & e_x \\ 0 & 1 & 0 & e_y \\ 0 & 0 & 1 & e_z \\ 0 & 0 & 0 & 1 \\ \end{array}\right]\left[\begin{array}{cccc} u_x & v_x & w_x & 0 \\ u_y & v_y & w_y & 0 \\ u_z & v_z & w_z & 0 \\ 0 & 0 & 0 & 1 \\ \end{array}\right] = \mathbf{T}\mathbf{B}. $$

Our goal is a bit different now. We want to express the world coordinates of a model surface point in the frame of reference of the camera. This view matrix $\mathbf{M}_v$ ($v$ for view) is just the inverse of $\mathbf{C}$! We then have:

$$ \mathbf{M}_v = \mathbf{C}^{-1} = (\mathbf{T}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{T}^{-1}. $$

Since $\mathbf{B}$ is orthonormal (all vectors in the columns of $\mathbf{B}$ are orthogonal to each other and have a unit length), then $\mathbf{B}^{-1} = \mathbf{B}^T$. Also, $\mathbf{T}^{-1}$ is the inverse of the translation, which is a translation by $-\vec{e} = (-e_x, -e_y, -e_z)$. The view matrix is then

$$ \mathbf{M}_v = \left[\begin{array}{cccc} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ w_x & w_y & w_z & 0 \\ 0 & 0 & 0 & 1 \\ \end{array}\right] \left[\begin{array}{cccc} 1 & 0 & 0 & -e_x \\ 0 & 1 & 0 & -e_y \\ 0 & 0 & 1 & -e_z \\ 0 & 0 & 0 & 1 \\ \end{array}\right] = \left[\begin{array}{cccc} u_x & u_y & u_z & -\vec{u}\cdot\vec{e} \\ v_x & v_y & v_z & -\vec{v}\cdot\vec{e} \\ w_x & w_y & w_z & -\vec{w}\cdot\vec{e} \\ 0 & 0 & 0 & 1 \\ \end{array}\right]. $$

Reading right-to-left, this can also be interpreted as first translating by $-\vec{e}$ (bringing the eye to the origin) and then rotating to align with the camera axis vectors.
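
In code, we rarely build $\mathbf{M}_v$ by hand: glMatrix's mat4.lookAt returns exactly this view matrix, whereas mat4.targetTo (which we used in ray tracing) returns the camera matrix $\mathbf{C}$. A quick sketch, with arbitrary eye and target positions:

const eye = [0, 2, 5];     // camera position e
const target = [0, 0, 0];  // point the camera looks at
const up = [0, 1, 0];      // approximate up direction

// view matrix M_v: world space -> camera space
let Mv = mat4.create();
mat4.lookAt(Mv, eye, target, up);

// camera matrix C: camera space -> world space (what we used in ray tracing)
let C = mat4.create();
mat4.targetTo(C, eye, target, up);
// Mv is (numerically) the inverse of C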

Stage 3: Projection Matrix - projecting models to the viewing space.

So we have our models in the frame of reference of the camera. Now our goal is to project these points onto our image plane. We also want this projection to work within our 4x4 transformation matrix framework. Specifically, we will use a perspective projection, where the center of projection is the origin of the camera coordinate system (the eye), and the plane on which we are projecting is our image plane, which is a distance $d_n$ away from the eye. To perform the projection, we need to describe a point $(x, y, z)$ (expressed in the camera coordinate system) as a point $(x_p, y_p, d_n)$ in this plane. Note that the projected point has a positive $z$ coordinate ($+d_n$) even though visible points have $z < 0$ in camera space: we intend to flip the direction of the $z$-axis during the projection.

The coordinates $x_p$ and $y_p$ can be calculated by using similar triangles:

$$ \frac{x}{-z} = \frac{x_p}{d_n}, \quad \frac{y}{-z} = \frac{y_p}{d_n}. $$

Therefore,

$$ x_p = x \left(\frac{-d_n}{z}\right), \quad y_p = y \left(\frac{-d_n}{z}\right), \quad z_p = d_n. $$

Oh no, how are we going to express this using a transformation matrix?? The $\frac{1}{-z}$ part is not linear! Homogeneous coordinates to the rescue: this is where the homogenization part we covered earlier will help. What if we write our projection process, using a projection matrix $\mathbf{M}_p$ ($p$ for projection), like this?

$$ \mathbf{M}_p \left[\begin{array}{c} x \\ y \\ z \\ 1 \end{array}\right] = \left[\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & -\frac{1}{d_n} & 0 \\ \end{array}\right]\left[\begin{array}{c} x \\ y \\ z \\ 1 \end{array}\right] = \left[\begin{array}{c} x \\ y \\ -z \\ -\frac{z}{d_n} \end{array}\right] \rightarrow \mathrm{homogenize} \rightarrow \left[\begin{array}{c} -x d_n / z \\ -y d_n / z \\ d_n \\ 1 \end{array}\right] $$

Aha! If we always remember to homogenize our points after transforming them, then we get the projection we want. In the exercise below, try returning the projection matrix we just derived as a mat4. The aspect ratio of the canvas is 1 in case you need it, and the field-of-view (FOV) is set to 90 degrees. Try setting $d_n = 1$. Returning undefined will signal the exercise to use a more complete projection matrix (see below).
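
If you get stuck, the following sketch builds the simple projection matrix above with glMatrix (projectionMatrix is a placeholder name; in the exercise you would return this matrix from the provided function). Be careful: mat4.fromValues takes its arguments in column-major order, so each group of four values below is a column of $\mathbf{M}_p$.

function projectionMatrix() {
  const dn = 1.0; // distance to the image plane
  // entries are given in column-major order: one column per line below
  return mat4.fromValues(
    1, 0,  0,  0,        // first column
    0, 1,  0,  0,        // second column
    0, 0, -1, -1 / dn,   // third column
    0, 0,  0,  0         // fourth column
  );
}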



What do you notice about the result? While the mechanics of the projection work, we're projecting EVERYTHING onto the same plane. This means we can't tell what's in front of what.

Instead of projecting everything to a viewing (image) plane, we will use a viewing volume as outlined in light gray below. We will now call the image plane the near plane (hence, the $n$ subscript). The farthest plane in the viewing volume is called the far plane and is at a distance $d_f$ from the camera. For a vertical FOV $\alpha$, the perspective projection matrix is (derivation below):

$$ \mathbf{M}_p = \left[\begin{array}{cccc} \frac{1}{a\cdot \tan(\frac{1}{2}\alpha)} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\frac{1}{2}\alpha)} & 0 & 0 \\ 0 & 0 & \frac{(d_f + d_n)}{d_n - d_f} & \frac{2d_f d_n}{d_n - d_f} \\ 0 & 0 & -1 & 0 \\ \end{array}\right] $$

where $a = n_x / n_y$ is the aspect ratio of the canvas. Another advantage of using a viewing volume is that the near and far planes tell us which points should be discarded because they are not inside the viewing volume: any point whose homogenized $z$ value is not within $[-1, 1]$ is discarded. This space is also called clip space because geometric primitives may be clipped if they are outside the planes bounding the viewing volume.

Derivation of the projection matrix to a view volume. By convention, the near plane is assumed to have corners at $(x, y) = (\pm 1, \pm 1)$. The near plane will also be transformed to $z = -1$ and the far plane will be transformed to $z = +1$. The equations for $x_p$ and $y_p$ also involve scaling factors of $2/w$ and $2/h$ (where $w$ and $h$ are the image plane dimensions) to scale them into the $[-1, 1]$ range:

$$
x_p = -\frac{2d_n}{w z} x = \frac{1}{a\cdot\tan(\frac{1}{2}\alpha)}\left(-\frac{x}{z}\right), \quad y_p = -\frac{2d_n}{h z} y = \frac{1}{\tan(\frac{1}{2}\alpha)}\left(-\frac{y}{z}\right)
$$

since $h = 2d_n\tan(\frac{1}{2}\alpha)$ and $w = a \cdot h$. At this point, we can partially set up our projection matrix:

$$ \mathbf{M}_p = \left[\begin{array}{cccc} \frac{1}{a\cdot \tan(\frac{1}{2}\alpha)} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\frac{1}{2}\alpha)} & 0 & 0 \\ 0 & 0 & A & B \\ 0 & 0 & -1 & 0 \\ \end{array}\right] $$

Note that the fourth row means we plan to homogenize by $-z$. We just need to figure out the two unknown entries $A$ and $B$ in the third row. Expressing the equation for this row (after homogenization) as:

$$
z_p = \frac{1}{-z}\left(A z + B\right) = -A - \frac{B}{z},
$$

we can apply the fact that at $z = -d_n$, $z_p = -1$ and at $z = -d_f$, $z_p = +1$ to solve for $A$ and $B$:

$$
A = \frac{d_n + d_f}{d_n - d_f}, \quad B = \frac{2d_nd_f}{d_n - d_f}.
$$

Note that both of these values are negative since the near plane is closer than the far plane. Also note that this viewing volume is defined with a left-handed coordinate system.
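
A quick numeric sanity check of this derivation with glMatrix (the values are arbitrary): points on the near and far planes should homogenize to $z_p = -1$ and $z_p = +1$, respectively.

const fov = Math.PI / 2; // alpha = 90 degrees
const Mp = mat4.create();
mat4.perspective(Mp, fov, 1.0, 1.0, 10.0); // a = 1, d_n = 1, d_f = 10

// a point on the near plane (z = -d_n) and one on the far plane (z = -d_f)
const pNear = vec4.fromValues(0, 0, -1, 1);
const pFar = vec4.fromValues(0, 0, -10, 1);

let qNear = vec4.create(), qFar = vec4.create();
vec4.transformMat4(qNear, pNear, Mp);
vec4.transformMat4(qFar, pFar, Mp);

console.log(qNear[2] / qNear[3]); // -1 (near plane)
console.log(qFar[2] / qFar[3]);   // +1 (far plane)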

Stage 4: Screen Matrix - transforming the near plane of the viewing volume to the final image.

The last step is to transform our near plane (in the view space), which has corners at $(\pm 1, \pm 1)$, to our final image (HTML canvas), which has its origin at the top-left corner (with $y$ going down), a width of $n_x$ pixels and a height of $n_y$ pixels. This can be achieved with the following "screen" (also known as a "viewport") transformation $\mathbf{M}_s$ ($s$ for screen):

$$ \mathbf{M}_s = \left[\begin{array}{cccc}\frac{n_x}{2} & 0 & 0 & \frac{n_x}{2} \\ 0 & -\frac{n_y}{2} & 0 & \frac{n_y}{2} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array}\right] $$
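
For completeness, here is how $\mathbf{M}_s$ could be built with glMatrix (screenMatrix is just an illustrative name, and again the arguments are column-major). In practice, WebGL applies this viewport transformation for us, as noted at the end of this lecture, so this is only for illustration.

// screen (viewport) matrix for an nx-by-ny canvas, built column-by-column
function screenMatrix(nx, ny) {
  return mat4.fromValues(
    nx / 2,  0,       0, 0,   // first column
    0,      -ny / 2,  0, 0,   // second column
    0,       0,       1, 0,   // third column
    nx / 2,  ny / 2,  0, 1    // fourth column
  );
}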

Practical aspects of the transformation pipeline: glMatrix and nomenclature.

I strongly recommend using glMatrix as much as possible to avoid bugs when creating the matrices above, specifically:

  • mat4.lookAt for the view matrix $\mathbf{M}_v$,
  • mat4.perspective for the projection matrix $\mathbf{M}_p$,

which are further documented in the glMatrix API documentation. Please go back to the first exercise and try returning the result of mat4.perspective using a FOV of 90 degrees, an aspect ratio of 1, a near plane of $d_n = 10^{-3}$ and a far plane of $d_f = 1000$. What happens if you set $d_n = 1$ or $d_f = 1.5$?
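
For reference, here is what those glMatrix calls look like with the values from the exercise (a sketch; the variable names are just for illustration):

// projection matrix: FOV = 90 degrees, aspect ratio 1, d_n = 1e-3, d_f = 1000
let Mp = mat4.create();
mat4.perspective(Mp, Math.PI / 2, 1.0, 1e-3, 1000.0);

// view matrix for a camera at eye = (0, 0, 5), looking at the origin
let Mv = mat4.create();
mat4.lookAt(Mv, [0, 0, 5], [0, 0, 0], [0, 1, 0]);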

Some intermediate matrices are also used frequently, so it's good to know about some naming conventions:

  • the model-view matrix, $\mathbf{M}_v \mathbf{M}_m$, which takes points from object space directly to camera space,
  • the model-view-projection (MVP) matrix, $\mathbf{M}_p \mathbf{M}_v \mathbf{M}_m$, which takes points from object space all the way to clip space,
  • the normal matrix, the inverse-transpose of the model-view matrix, which is used to transform normal vectors.

Remember to read transformations from right-to-left!
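
Here is a sketch of how these intermediate matrices can be assembled with glMatrix (the variable names Mm, Mv, Mp, MV, MVP and N are just for illustration; remember that mat4.multiply(out, a, b) computes a * b, so the right-most transformation is applied first):

// placeholder model, view, and projection matrices so the snippet stands on its own
let Mm = mat4.create(), Mv = mat4.create(), Mp = mat4.create();
mat4.perspective(Mp, Math.PI / 2, 1.0, 0.1, 100.0);
mat4.lookAt(Mv, [0, 0, 5], [0, 0, 0], [0, 1, 0]);

// model-view matrix: object space -> camera space
let MV = mat4.create();
mat4.multiply(MV, Mv, Mm); // Mv * Mm

// model-view-projection (MVP) matrix: object space -> clip space
let MVP = mat4.create();
mat4.multiply(MVP, Mp, MV); // Mp * Mv * Mm

// normal matrix: inverse-transpose of the model-view matrix (as a 3x3)
let N = mat3.create();
mat3.normalFromMat4(N, MV);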


Introduction to WebGL - the rasterization API we will use.

WebGL (Web Graphics Library) provides an API to rasterize our models using the GPU. This makes it very fast and will allow us to develop some more interactive applications. We won't cover everything in WebGL and we'll mostly focus on writing shaders in our course. It will still be important to understand how a WebGL application is built from start to finish, and I'll provide a lot of starter code for doing this. In the coming weeks, we'll also discuss how to upload data to the GPU and issue rendering calls to WebGL.

One of the central concepts in WebGL is the idea of a context. Just like the "2d" context we used to assign pixel colors in an HTML canvas, we can retrieve the webgl context using:

let gl = canvas.getContext('webgl'); // the gl object is a WebGL context.

You can also pass "webgl2" to access the WebGL2 API. The context has several functions which will allow us to write data to the GPU, create shader programs, and issue rendering calls.
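
For example, compiling and linking a shader program with the context looks roughly like this (a sketch with minimal error handling; vertexSource and fragmentSource are GLSL source strings like the ones discussed below, and compileShaderProgram is just an illustrative name):

function compileShaderProgram(gl, vertexSource, fragmentSource) {
  // compile the vertex shader
  const vs = gl.createShader(gl.VERTEX_SHADER);
  gl.shaderSource(vs, vertexSource);
  gl.compileShader(vs);
  if (!gl.getShaderParameter(vs, gl.COMPILE_STATUS))
    throw new Error(gl.getShaderInfoLog(vs));

  // compile the fragment shader
  const fs = gl.createShader(gl.FRAGMENT_SHADER);
  gl.shaderSource(fs, fragmentSource);
  gl.compileShader(fs);
  if (!gl.getShaderParameter(fs, gl.COMPILE_STATUS))
    throw new Error(gl.getShaderInfoLog(fs));

  // link the two shaders into a program and activate it
  const program = gl.createProgram();
  gl.attachShader(program, vs);
  gl.attachShader(program, fs);
  gl.linkProgram(program);
  gl.useProgram(program);
  return program;
}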

Writing vertex and fragment shaders.

As mentioned earlier, WebGL offers us the ability to inject our own programs into the graphics pipeline to (1) process vertex data and (2) process fragments. These two steps are done using shaders, which are essentially programs that run on the GPU. Shaders are written in specific languages - for WebGL, they are written in the OpenGL Shading Language (GLSL). The syntax of GLSL is similar to C, and there are some custom types built into the language to help us with linear algebra. Actually, a lot of the way we manipulate vectors and matrices in GLSL will feel more natural than the way we've been doing it in JavaScript with glMatrix, primarily because of the way operators are overloaded.

For example, the following is valid GLSL to manipulate 3d vectors and 3x3 matrices:

vec3 u = vec3(1, 2, 3);
vec3 v = vec3(4, 5, 6);
vec3 u_normalized = normalize(u);
float u_length = length(u);
vec3 u_scaled = 2.0 * u; // 2 * u will result in a compiler error (2 is an integer, but 2.0 is a float)
vec3 u_plus_v = u + v;
vec3 u_minus_v = u - v;
float u_dot_v = dot(u, v);
vec3 u_cross_v = cross(u, v);
vec3 u_times_v_componentwise = u * v;
vec3 reflected = reflect(u, v); // built-in function that reflects the incident vector u about the normal v (v should be normalized)
vec3 p = vec3(1, 2, 3); // some position vector
vec4 p_homogeneous = vec4(p, 1.0); // homogeneous representation of p

float u_x = u.x; // or u[0], also u.r
vec2 u_xy = u.xy;

mat3 A = mat3(1, 0, 0, 0, 1, 0, 0, 0, 1); // 3x3 identity, can also use mat3(1.0)
mat3 A_inverse = inverse(A);
mat3 A_transpose = transpose(A);
vec3 A_times_u = A * u;

Each shader should be understood as a mini program with an entry point at the main() function. You can write additional functions to assist in your implementation just like you would in C. Variables can also be declared globally so the entire shader can use them. In the case of the vertex shader, think of the input as some vertex data. At the very least, a vertex shader must write to a special variable called gl_Position (a vec4). In the case of a fragment shader, the input is a fragment, and the required output is the color we want to assign to the fragment, which is assigned in a special, reserved variable called gl_FragColor. Note that gl_FragColor is also a vec4 where the first three components correspond to the RGB values (between 0-1). The fourth component of gl_FragColor controls the transparency (0 for transparent, 1 for opaque).
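
Putting this together, a minimal pair of shaders might look like the following, written here as JavaScript template strings since that is how they are handed to WebGL; the attribute and uniform names (a_Position, u_MVP) are just illustrative assumptions.

const vertexSource = `
attribute vec3 a_Position;   // input: vertex position in object space
uniform mat4 u_MVP;          // model-view-projection matrix

void main() {
  // required output: the clip-space position of this vertex
  gl_Position = u_MVP * vec4(a_Position, 1.0);
}`;

const fragmentSource = `
precision highp float;

void main() {
  // required output: the fragment color (RGB + opacity)
  gl_FragColor = vec4(0.1, 0.4, 0.8, 1.0); // an opaque blue
}`;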

Exercise: a ray tracer in a rasterizer?

To practice with GLSL, let's write a ray tracer in a fragment shader! This is done by rendering a full-screen quad that covers the entire near plane with corners at $(\pm 1, \pm 1)$. Assume the camera (at the origin) is looking down the $-z$ axis and the image plane is at a distance of $d = 1$. The fragment shader is processing a particular pixel. Try to render a unit sphere centered at the point $(0, 0, -2)$ and add a diffuse shading model. I have already extracted the x and y coordinates of the pixel and created the ray direction. The canvas has a width and height of 650.
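
If you want to check your work afterwards, here is a rough sketch of what such a fragment shader could look like (again as a template string). The variable names and structure will differ from the starter code in the exercise, the light is arbitrarily placed at the eye, and only the sphere intersection and diffuse shading are shown.

const raytraceFragmentSource = `
precision highp float;

void main() {
  // pixel -> point on the image plane (canvas assumed to be 650 x 650, d = 1)
  float x = -1.0 + 2.0 * gl_FragCoord.x / 650.0;
  float y = -1.0 + 2.0 * gl_FragCoord.y / 650.0;
  vec3 ray = normalize(vec3(x, y, -1.0)); // ray direction (origin at the eye)

  // intersect with a unit sphere centered at c = (0, 0, -2)
  vec3 c = vec3(0.0, 0.0, -2.0);
  float b = dot(ray, -c);                 // ray origin is (0, 0, 0)
  float disc = b * b - dot(c, c) + 1.0;   // discriminant of the quadratic
  if (disc < 0.0) {
    gl_FragColor = vec4(1.0, 1.0, 1.0, 1.0); // background
    return;
  }
  float t = -b - sqrt(disc);              // closest intersection

  // diffuse shading with a light at the eye (an arbitrary choice)
  vec3 p = t * ray;                       // surface point
  vec3 n = normalize(p - c);              // sphere normal
  float diffuse = max(dot(n, -ray), 0.0);
  gl_FragColor = vec4(diffuse * vec3(0.2, 0.4, 0.9), 1.0);
}`;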



Moving forward: always ask yourself which coordinate system am I currently in?

In the coming weeks, it will be very important to always ask yourself which frame of reference you are in when doing a particular calculation. The data you pass from a vertex shader to a fragment shader might be in world space, camera space or projection space. Lighting calculations will typically be done in camera space, so vertices should be transformed by the model-view matrix and the normals should be transformed by the inverse-transpose of the model-view matrix before doing a lighting calculation.

WebGL does the viewport (screen) transformation for you, so the gl_Position output (a required output of a vertex shader) should be transformed by the MVP (model-view-projection) matrix.
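
As a concrete example of keeping coordinate systems straight, a typical vertex shader (sketched below as a template string; the uniform and varying names are assumptions) transforms positions by the MVP matrix for gl_Position, but passes camera-space quantities to the fragment shader for lighting:

const lightingVertexSource = `
attribute vec3 a_Position;
attribute vec3 a_Normal;

uniform mat4 u_ModelViewProjection; // M_p * M_v * M_m
uniform mat4 u_ModelView;           // M_v * M_m
uniform mat3 u_NormalMatrix;        // inverse-transpose of the model-view matrix

varying vec3 v_Position; // camera-space position, for lighting in the fragment shader
varying vec3 v_Normal;   // camera-space normal

void main() {
  v_Position = (u_ModelView * vec4(a_Position, 1.0)).xyz;
  v_Normal = u_NormalMatrix * a_Normal;
  gl_Position = u_ModelViewProjection * vec4(a_Position, 1.0);
}`;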

Also, please remember what you are/are not allowed to do in each programming language we are using, specifically when it comes to linear algebra. GLSL has built-in types and functions for vectors and matrices, with operator-overloading to make expressions more intuitive (and more like how it is written mathematically). JavaScript (with glMatrix) does not.


© Philip Claude Caplan, 2023 (Last updated: 2023-10-26)