Lecture 02: Ray Tracing (slides)

Learning Objectives

By the end of this lecture, you will:

  • generate camera rays for a view directly aligned with the $z$-axis,
  • review some of the math used in computer graphics,
  • intersect rays with simple geometric primitives, such as planes, spheres and triangles.

Last week, we warmed up with some JavaScript and saw how to assign pixel colors in an HTML Canvas. This week, we will look at an approach for assigning pixel colors called ray tracing. We'll focus our attention on rendering simple scenes that contain spheres and a handful of triangles. One application would be approximating the meatballs in Cloudy with a Chance of Meatballs (top-right image); the frames in the movie were rendered with a tool called Arnold. Another important ray tracer to know about is Pixar's RenderMan.

Compared to other rendering techniques we will see later on, ray tracing has the advantage that it can produce very photorealistic effects. The disadvantage is that it can be much slower than other techniques. Modern GPUs, however, are making ray tracing a much more competitive choice when rendering scenes in real time.

The main idea of ray tracing

The main idea of ray tracing consists of sending "rays" from an observation point (e.g. a camera) through an image plane and into the (virtual) 3d world you want to visualize. The image plane is where the pixels are and we will send a ray through each pixel.

Ray tracing comes down to a few simple steps, which can be made more complicated if you want to achieve more interesting features:

  1. Set up your image and place your observer (which we will call a camera or eye).
  2. For each pixel in your image:
    a. Create a ray originating at your camera position that passes through this pixel.
    b. Determine the closest object in your scene intersected by the ray.
    c. Determine the color of the pixel, based on the intersection.

That's it! The hardest part, however, comes in step 2c, when we need to figure out the color of the pixel. This is hard because it could involve generating secondary rays due to reflection or refraction (more on this later). For now, we'll just assign a single color to each object in our scene and leave shading for the next lecture. Step 2b can be hard too (and hard to make efficient), but we will focus on simple geometries (planes, spheres and triangles) in our scenes.
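The steps above can be sketched as a short loop. This is just a skeleton: the helper names (`createRay`, `intersectScene`, `setPixel`) are placeholders that we will flesh out over the rest of this lecture, stubbed out here so the overall structure is clear.

```javascript
// A minimal sketch of the ray tracing loop (step 2). The helpers below are
// placeholders for the pieces we build in this lecture.
const nx = 4, ny = 3;                 // a tiny 4x3 "image"
const background = [255, 255, 255];   // background color (white)
const image = [];                     // flat array of pixel colors

function createRay(i, j) {
  // step 2a: a ray has an origin (the eye) and a direction (stubbed for now)
  return { origin: [0, 0, 0], direction: [0, 0, -1] };
}

function intersectScene(ray) {
  // step 2b: return the closest hit (with its object's color), or null on a miss
  return null; // empty scene for now
}

function setPixel(i, j, color) {
  image[j * nx + i] = color;
}

for (let j = 0; j < ny; j++) {
  for (let i = 0; i < nx; i++) {
    const ray = createRay(i, j);
    const hit = intersectScene(ray);
    const color = hit ? hit.color : background; // step 2c: flat colors for now
    setPixel(i, j, color);
  }
}
```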




In the demo above, click on the buttons to send rays from the camera into the scene. Note that each ray originates from the camera and its direction is determined by the pixel it passes through. If a ray intersects one of the circles, we store the closest intersection point. We can then determine the color of the pixel the ray passes through by analyzing the lighting in the scene (the mini yellow circle is a light), or by sending secondary rays that reflect off the circle, or refract through it. Here, we just color the pixel the same color as the intersected circle (if any); otherwise, we retain the background color (white in this case).

Setting up a camera and image plane.

To define rays, we first need to set up a camera to look at our scene. If you were taking a picture with a real camera, what are some of the parameters that define the view? For one, there is the position of the camera, which we will denote as $\vec{e}$ (for eye). We also need to specify the direction in which we are pointing the camera. For now, we will just let this be the $-z$-axis, i.e. $(0, 0, -1)$. We'll call this our gaze $\vec{g}$. This seems a bit restrictive, but we'll set up more general views in a few weeks.

Finally, when you look through a camera, you can't see 360 degrees around you - you can only see a fraction of that. We'll call this the image plane: it's the area that you can see in front of you and we will assign colors to each pixel in the image plane. We'll assume that our camera is looking right at the center of the image plane, and that the offset from our camera to the image plane is some distance $d$.


(image adapted from here)

In physical 3d space, the image plane has width $w$ and height $h$ (while having $n_x$ pixels along the width and $n_y$ pixels along the height). The aspect ratio of the image will be denoted $a = \frac{w}{h} = \frac{n_x}{n_y}$. Instead of specifying a value for $w$ and $h$, it is more common to specify a horizontal or vertical field-of-view (FOV, denoted by $\alpha$) and calculate $w$ and $h$ using trigonometry. The horizontal FOV ($\alpha_h$) is the angle between left and right sides of the image plane and the vertical FOV ($\alpha_v$) is the angle between the bottom and top of the image plane. Try to derive an expression for the width ($w$) and height ($h$) in terms of $d$, the FOV (either $\alpha_h$ or $\alpha_v$) and the aspect ratio $a$. Assume we are still looking at the center of the image plane.

   

Solution Note that $\frac{1}{2}\alpha_h$ is the angle between the image center and one of the sides of the image. Using trigonometry, we know that $d$ is related to half the width of the image plane: $$ \tan\left(\frac{\alpha_h}{2}\right) = \frac{\frac{1}{2}w}{d}. $$

Therefore, $w = 2d\tan\left(\frac{1}{2}\alpha_h\right)$ and $h = \frac{2d}{a}\tan\left(\frac{1}{2}\alpha_h\right)$.

Note that if we used the vertical field-of-view $\alpha_v$ (the angle between the top and bottom sides of the image plane), then we have the relationship
$$
\tan\left(\frac{\alpha_v}{2}\right) = \frac{\frac{1}{2}h}{d},
$$

and we would have $h = 2d\tan\left(\frac{1}{2}\alpha_v\right)$.
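These relationships translate directly into a few lines of code. Here is a small sketch (the function name `imagePlaneSize` is made up for this example; angles are in radians, $d$ is the eye-to-image-plane distance, and $a = n_x / n_y$ is the aspect ratio):

```javascript
// Compute the physical width and height of the image plane from the
// vertical field-of-view: h = 2 d tan(alphaV / 2), w = a h.
function imagePlaneSize(d, alphaV, a) {
  const h = 2 * d * Math.tan(0.5 * alphaV);
  const w = a * h;
  return { w, h };
}
```

For example, with $d = 1$ and a 90° vertical FOV, the image plane has height $h = 2\tan(45°) = 2$.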

Defining rays from the camera through each pixel.

The next thing we need to do is create rays! A ray is just a line, and lines are entirely defined by an origin (3d point) and a direction (3d vector). Our goal is to create a ray that passes through each pixel. The origin will always be the camera but the direction will depend on what kind of view you're trying to create:

  1. orthographic: Here, the direction vector is just the gaze direction - it's the same for every pixel. Orthographic views are good for engineering drawings but not so good for realistic renderings. We won't use these in our course.
  2. perspective: Here the direction vector is the vector from the camera to a pixel (represented in 3d space). This gives much more realistic views, so this is the one we'll use.


left: orthographic view, right: perspective view
(images from Interactive Computer Graphics)

For a perspective view, we need to compute the 3d coordinates of each pixel in the image plane. We've talked about the width and height of the image plane, but we haven't yet talked about the other ingredient we need: the coordinate system. For an HTML Canvas, the coordinate system starts at the top-left of the canvas. The x-direction goes from left-to-right and the y-direction goes from top-to-bottom.

For now, assume that our 3d (world) coordinate system has the origin at the eye, with the x-direction going from left-to-right and the y-direction going from bottom-to-top. Let's also assume that we are looking right at the center of the image plane. In our coordinate system, the right side of the image plane will have an x-coordinate of $\frac{w}{2}$ and the left side will have an x-coordinate of $-\frac{w}{2}$; the bottom and top of the image plane will have a y-coordinate of $-\frac{h}{2}$ and $\frac{h}{2}$, respectively.

So when we're processing a pixel (i, j) from the HTML Canvas, we need to transform it to our image-plane system. This involves a few things: (1) recentering the coordinates, (2) scaling the coordinates and (3) accounting for the fact that the HTML Canvas has the y-coordinates going downwards. Assuming that the pixels along the width are indexed from $[0, n_x - 1]$ and the pixels along the height are indexed from $[0, n_y - 1]$ (i.e. using 0-based indexing for $n_x$ and $n_y$ pixels), we can perform the transformation using:

$$ x = -\frac{w}{2} + w \frac{(i + 0.5)}{n_x}, \quad y = -\frac{h}{2} + h \frac{(n_y - 0.5 - j)}{n_y}. $$

Why the 0.5? The 0.5 isn't strictly necessary, but when we send a ray through a pixel, we are "sampling" the pixel at its center. Without the 0.5 in the equations above, you would sample each pixel at its top-left corner, which is fine too. You can also replace the 0.5 with a random number in $[0, 1)$ to generate a random sample within the pixel.

Finally, for a pixel (i, j), its 3d coordinates are $\vec{p} = (x, y, -d)$, with $x$ and $y$ given above and $d$ the distance we specified from the camera to the image plane.

Why $-d$ and not just $d$? When dealing with a physical system, we usually need to create a right-handed coordinate system. For our image plane, the x-axis goes from left-to-right and the y-axis goes from bottom-to-top. If you point your right-hand fingers along the x-axis and then "curl" them toward the y-axis, your right thumb will point along the z-axis that gives a right-handed system. Since the image plane is located on the opposite side of this z-axis, its z-coordinate is $-d$.

Therefore, for a perspective view, the direction from the camera to the pixel is $\vec{r} = \frac{\vec{p} - \vec{e}}{\lVert \vec{p} - \vec{e}\rVert}$. Our ray is described by the origin $\vec{e}$ and the direction vector $\vec{r}$. Note that we have normalized the direction of the ray since the magnitude doesn't really matter. Plus, a lot of the math we do soon will be more convenient if the ray direction has a length of 1.
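Putting the pixel transformation and the ray direction together gives something like the sketch below (the function name `pixelRay` is made up for this example; it assumes the camera looks down the $-z$-axis as described above):

```javascript
// Create a ray through pixel (i, j) for a perspective view.
// nx, ny: image resolution; w, h: image plane size; d: eye-to-plane distance.
function pixelRay(i, j, nx, ny, w, h, d, eye) {
  // transform Canvas pixel coordinates to image-plane coordinates
  const x = -w / 2 + w * (i + 0.5) / nx;
  const y = -h / 2 + h * (ny - 0.5 - j) / ny;
  // 3d coordinates of the pixel center: p = e + (x, y, -d)
  const p = [eye[0] + x, eye[1] + y, eye[2] - d];
  // direction r = (p - e) / ||p - e||
  let r = [p[0] - eye[0], p[1] - eye[1], p[2] - eye[2]];
  const len = Math.hypot(r[0], r[1], r[2]);
  r = [r[0] / len, r[1] / len, r[2] / len];
  return { origin: eye, direction: r };
}
```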

Intersecting rays with objects in your scene.

Now we need to do step 2b in the algorithm described above: intersect our rays with objects in our scene. We will focus on simple objects like planes, spheres and triangles. Before proceeding, let's describe our rays mathematically. Since a ray is a line, we can write any point on the ray as:

$$ \vec{x}(t) = \vec{e} + t\vec{r}, \quad t \ge 0. $$

Note that we require a non-negative value for the ray parameter $t$ since we don't want to include points that are behind the camera. The value $t = 0$ corresponds exactly with the location of the camera (our eye).

Ray-plane intersection.

Planes are useful to represent things like the ground. To describe an infinite plane in 3d, we need (1) a point on the plane ($\vec{c}$) and (2) a vector perpendicular to the plane (called the normal), $\vec{n}$. Any point $\vec{p}$ on the plane satisfies:

$$ \vec{n} \cdot (\vec{p} - \vec{c}) = 0 $$

To calculate the intersection of a ray with a plane, we can substitute our ray equation into the plane equation - i.e. replace $\vec{p}$ with $\vec{x}(t)$ - and solve for $t$:

$$ \vec{n} \cdot (\vec{e} + t\vec{r} - \vec{c}) = 0, \quad \rightarrow \quad \vec{n}\cdot(\vec{e} - \vec{c}) + t\vec{n}\cdot\vec{r} = 0 \quad \rightarrow \quad t = \frac{\vec{n}\cdot(\vec{c} - \vec{e})}{\vec{n}\cdot\vec{r}}. $$

What happens if the ray is parallel to the plane?

To infinity, and beyond!

When the plane and ray are parallel, then $\vec{n}$ and $\vec{r}$ are perpendicular and $\vec{n} \cdot \vec{r} = 0$. Mathematically, the expression we derived would go to infinity. Physically, this means that there is no intersection. It's a good idea to check if $\lvert \vec{n}\cdot\vec{r}\rvert \lt \epsilon$ (for some small value $\epsilon$) and skip the intersection calculation if the relation is satisfied.
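The $t$ formula and the parallel-ray check above fit in a short function. Here is a sketch (the function name `intersectPlane` is made up for this example; vectors are plain 3-element arrays):

```javascript
// Ray-plane intersection: t = n·(c - e) / n·r.
// Returns the ray parameter t, or undefined if the ray is (nearly) parallel
// to the plane or the intersection is behind the camera.
function intersectPlane(e, r, c, n) {
  const dot = (u, v) => u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
  const denom = dot(n, r);
  if (Math.abs(denom) < 1e-8) return undefined; // ray parallel to plane
  const t = (dot(n, c) - dot(n, e)) / denom;
  return t >= 0 ? t : undefined;                // require t >= 0
}
```

For example, looking straight down from $\vec{e} = (0, 1, 0)$ at a ground plane through the origin with normal $(0, 1, 0)$ gives $t = 1$.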

Ray-sphere intersection.

There's a common expression in physics and engineering: "assume a sphere," which is used to approximate a complicated physical system as a simpler sphere. It's said that this originated with theoretical physicists assuming that cows can be represented as spheres. We will assume a lot of things are spheres, like the meatballs in Cloudy with a Chance of Meatballs or the planets in the Solar System.

   

A sphere is defined entirely by a center $\vec{c}$ (a 3d point) and a radius $R$ (a scalar). Every point on a sphere satisfies the following property: the distance between a point on the sphere to the center of the sphere is equal to the radius. Mathematically, this can be expressed as:

$$ \lVert \vec{p} - \vec{c} \rVert = R. $$

where $\vec{p}$ is a point on the surface of the sphere. Note that the distance can also be written as $\lVert \vec{p} - \vec{c}\rVert = \sqrt{ (\vec{p} - \vec{c}) \cdot (\vec{p} - \vec{c})}$. The square-root is kind of annoying, so we will square both sides:

$$ (\vec{p} - \vec{c}) \cdot (\vec{p} - \vec{c}) = R^2. $$

Now, just like with the derivation for ray-plane intersections, we'll replace $\vec{p}$ with our ray equation ($\vec{x}(t)$). Our goal is to obtain an expression for $t$ that tells us at what point along the ray the sphere is intersected. Making the substitution and moving everything over to the left-hand side gives:

$$ \left(\vec{e} + \vec{r}t - \vec{c}\right) \cdot \left(\vec{e} + \vec{r}t - \vec{c} \right) - R^2 = (\vec{r}\cdot\vec{r})\ t^2 + 2\vec{r}\cdot(\vec{e} - \vec{c})\ t + (\lVert\vec{e}-\vec{c}\rVert^2 - R^2) = 0 $$

Notice that this is a quadratic equation in $t$ and that the term $\vec{r}\cdot\vec{r} = 1$ since we normalized our ray directions. We need to solve the equation

$$ t^2 + 2Bt + C = 0, \quad B = \vec{r}\cdot(\vec{e} - \vec{c}), \quad C = \lVert\vec{e}-\vec{c}\rVert^2 - R^2. $$

Using the quadratic formula gives:

$$ t = -B \pm \sqrt{B^2 - C}. $$

Note that we canceled a bunch of 2's. What happens if $B^2 - C \lt 0$? There is no intersection with the sphere, which you should check for. If $B^2 = C$, then the ray just grazes the sphere and intersects at one point. Now, for $B^2 - C \gt 0$, there are two possible solutions, which we will call $t_\min$ and $t_\max$:

$$ t_\min = -B - \sqrt{B^2 - C}, \quad t_\max = -B + \sqrt{B^2 - C}. $$

It might seem like we should always use $t_\min$, but we should actually check both possible solutions. Why? Well, what if your camera is located inside the sphere? Then $t_\min$ would be behind you and $t_\max$ is the one we should keep. It might not seem useful to place a camera inside a sphere, but this situation also arises when casting secondary rays to model refractive materials (more on this in a few weeks).
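The quadratic solution and both checks above can be sketched as follows (the function name `intersectSphere` is made up for this example; it assumes the ray direction $\vec{r}$ has unit length, since we normalized it):

```javascript
// Ray-sphere intersection: solve t^2 + 2Bt + C = 0 with
// B = r·(e - c) and C = ||e - c||^2 - R^2 (unit-length r assumed).
// Returns the smallest non-negative t, or undefined if there is no hit.
function intersectSphere(e, r, c, R) {
  const d = [e[0] - c[0], e[1] - c[1], e[2] - c[2]];  // e - c
  const B = r[0] * d[0] + r[1] * d[1] + r[2] * d[2];
  const C = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] - R * R;
  const disc = B * B - C;
  if (disc < 0) return undefined;   // ray misses the sphere
  const s = Math.sqrt(disc);
  const tmin = -B - s, tmax = -B + s;
  if (tmin >= 0) return tmin;       // camera outside: nearest intersection
  if (tmax >= 0) return tmax;       // camera inside the sphere
  return undefined;                 // sphere is entirely behind the camera
}
```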

In class, we'll practice performing ray-sphere intersections - please see the repl below.



Ray-triangle intersection.

In the image at the top-right of this lecture, we can model the ground and meatballs using a plane and spheres, respectively. But how should we model some of the characters? These more complicated surfaces can be represented by a bunch of triangles called a mesh. We'll talk about meshes later on, but the basic building block of a mesh is a triangle. Other shapes are possible too but triangles are the most flexible so that's what we'll focus on.

This means we need to be able to intersect rays with 3d triangles. We'll represent triangles by three 3d points: $\vec{a}$, $\vec{b}$ and $\vec{c}$. In order to do the intersection, we'll use something called barycentric coordinates, which is a way to represent a point within the triangle using two parameters, $u$ and $v$, where $0 \le u \le 1$, $0 \le v \le 1$ and $u + v \le 1$. This means any point $\vec{p}$ in the triangle can be represented as a linear combination of the triangle points with these barycentric coordinates.

$$ \vec{p}(u, v) = \vec{a} u + \vec{b} v + w \vec{c} $$

Wait, where did this $w$ come from? I thought we had two parameters? In barycentric coordinates, $w$ is related to $u$ and $v$ by $w = 1 - u - v$. We now have:

$$ \vec{p}(u, v) = \vec{a} u + \vec{b} v + (1 - u - v) \vec{c}. $$

   

To find the ray-triangle intersection, we'll set this expression equal to our ray equation:

$$ \vec{p}(u, v) = \vec{x}(t) \quad \rightarrow \quad \vec{a} u + \vec{b} v + (1 - u - v) \vec{c} = \vec{e} + \vec{r}\ t. $$

Note that we have 3 equations here (for each of the three components of the 3d vectors) and we have three unknowns: $u$, $v$ and $t$. We can express this as a system of equations using linear algebra:

$$ \left[ \begin{array}{ccc} (a_x - c_x) & (b_x - c_x) & - r_x \\ (a_y - c_y) & (b_y - c_y) & - r_y \\ (a_z - c_z) & (b_z - c_z) & - r_z \end{array}\right] \left[ \begin{array}{c} u \\ v \\ t \end{array}\right] = \left[ \begin{array}{c} e_x - c_x \\ e_y - c_y \\ e_z - c_z \end{array}\right] $$

I recommend using the glMatrix function in the mat3 namespace called invert to invert the matrix (doc) and then the transformMat3 function in the vec3 namespace to obtain the final solution (doc). Let $\mathbf{A}$ be the matrix on the left-hand side of the equation above, $\vec{b}$ the vector on the right-hand side, and $\vec{x} = [u, v, t]^T$ the solution we are looking for. Here are some steps to solve this system with glMatrix:

  1. Compute $\mathbf{A}^{-1}$ using mat3.invert.
  2. Compute $\vec{x} = \mathbf{A}^{-1} \vec{b}$ using vec3.transformMat3.

Once you've solved the system of equations you should check if $0 \le u \le 1$ and $0 \le v \le 1$ and if $0 \le 1 - u - v \le 1$ to see if there is an intersection with the triangle. You should also check that $t \ge 0$ as we did with the other shapes.
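The course suggests glMatrix's mat3.invert and vec3.transformMat3 for this; to keep the example below self-contained, the $3 \times 3$ system is instead solved directly with Cramer's rule (the function name `intersectTriangle` is made up for this example):

```javascript
// Ray-triangle intersection: solve the 3x3 system with columns
// (a - c), (b - c), -r and right-hand side (e - c) for [u, v, t].
// Returns { t, u, v } on a hit, or undefined on a miss.
function intersectTriangle(e, r, a, b, c) {
  const sub = (u, v) => [u[0] - v[0], u[1] - v[1], u[2] - v[2]];
  const dot = (u, v) => u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
  const cross = (u, v) => [
    u[1] * v[2] - u[2] * v[1],
    u[2] * v[0] - u[0] * v[2],
    u[0] * v[1] - u[1] * v[0],
  ];

  const col1 = sub(a, c), col2 = sub(b, c), col3 = [-r[0], -r[1], -r[2]];
  const rhs = sub(e, c);

  // determinant of the system matrix via the scalar triple product
  const det = dot(col1, cross(col2, col3));
  if (Math.abs(det) < 1e-12) return undefined; // ray parallel to triangle plane

  // Cramer's rule: replace one column at a time with the right-hand side
  const u = dot(rhs, cross(col2, col3)) / det;
  const v = dot(col1, cross(rhs, col3)) / det;
  const t = dot(col1, cross(col2, rhs)) / det;

  // check the barycentric constraints and that the hit is in front of the camera
  if (u < 0 || v < 0 || u + v > 1 || t < 0) return undefined;
  return { t, u, v };
}
```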

Summary.

We did a lot of math today, starting with some algebra, then doing operations with vectors and finally setting up a system of equations with matrices and vectors. We didn't really talk about Step 2c in the ray tracing algorithm. We'll talk about shading next week, but for now we'll just assign a constant color to each shape in our scene. In a few weeks, we'll also talk about more general views and how to make some of the intersection calculations faster.


© Philip Claude Caplan, 2023 (Last updated: 2023-11-29)