A Summary of LearnOpenGL


I’ve prepared a concise summary of the lectures from https://learnopengl.com/ for a quick reference.

This text explains graphics concepts without using visuals.

OpenGL

OpenGL is a cross-platform API for rendering graphics, maintained by the Khronos Group. It is a specification that must be implemented by GPU vendors (or third parties) and made available through graphics driver software.

If you’re building your project with CMake and want to use OpenGL in your C++ application, your CMakeLists.txt should include the following commands:

find_package(OpenGL REQUIRED)
target_link_libraries("${PROJECT_NAME}" PUBLIC OpenGL::GL)

GLFW

A graphics application requires a window to render graphics, but creating a window is OS-specific. GLFW is a cross-platform library that abstracts away the details and provides a simple API for creating native windows (and more).

The GLFW source code can be included in a Git repository as a submodule.

git submodule add https://github.com/glfw/glfw external/glfw

Then, it can be included in the build via add_subdirectory, since it contains a CMakeLists.txt that defines how it should be built.

add_subdirectory("${CMAKE_CURRENT_SOURCE_DIR}/external/glfw")
target_link_libraries("${PROJECT_NAME}" PUBLIC glfw)

GLAD

Due to the differences between implementations and platforms, OpenGL function addresses aren’t known at compile time. As a result, they need to be queried at runtime using platform-specific mechanisms. That’s why there are libraries like GLAD, which dynamically loads the function pointers.

To generate a GLAD loader:

  • Go to https://glad.dav1d.de/
  • Select a GL API version (3.3+ is recommended)
  • Select the Core profile
  • Enable Generate a loader option
  • Click the Generate button

You can also enable some OpenGL extensions that add new capabilities or optimizations, but they are not guaranteed to be supported by the graphics driver.
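
If you do enable extensions, it’s wise to verify at runtime that the driver actually supports them before using them. A minimal sketch using GLFW’s query helper (this requires a current OpenGL context, which is created later; the extension name is just an example):

if (glfwExtensionSupported("GL_ARB_texture_filter_anisotropic")) {
  // the extension's functions and enums are safe to use
}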

Download the generated zip file and extract it into external/glad. To keep things clean, we can define a static library called glad and associate the source and header files with it. Then, it can be added to the build just like we did with GLFW.

add_library(glad STATIC "${CMAKE_CURRENT_SOURCE_DIR}/external/glad/src/glad.c")
target_include_directories(glad PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/external/glad/include")
target_link_libraries(glad PUBLIC OpenGL::GL)
target_link_libraries("${PROJECT_NAME}" PUBLIC glad)

OpenGL Context

OpenGL is basically a state machine, and all rendering commands are executed based on the current state. An OpenGL context is a container that stores this state information, and GLFW will be responsible for creating one for us (during window creation). We have to set the current context, via glfwMakeContextCurrent; otherwise, GLAD won’t be able to resolve function pointers.

Only one context can be current on a thread at a time, and a context cannot be current on multiple threads at the same time.

For example, calling glClearColor(r,g,b,a) will update the state, and all subsequent glClear calls will use the specified color until another glClearColor call with a different set of arguments is made.
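
To make the state machine behavior concrete, here is a tiny sketch (the color values are arbitrary):

glClearColor(1.0f, 0.0f, 0.0f, 1.0f); // state change: clear color is now red
glClear(GL_COLOR_BUFFER_BIT);         // clears the color buffer to red
glClear(GL_COLOR_BUFFER_BIT);         // still red; the state persists until changed
glClearColor(0.0f, 0.0f, 1.0f, 1.0f); // state change: clear color is now blue
glClear(GL_COLOR_BUFFER_BIT);         // clears to blue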

How does GLAD load OpenGL functions?

When a program that links to a shared library is launched, the operating system maps that library’s segments (e.g., text and data) into the process’s virtual address space. In our case, an OpenGL stub library (e.g., opengl32.dll or libGL.so) is dynamically loaded for our application. When an OpenGL context is created, the OS then loads the vendor-specific driver library into the process memory. A loader function such as glfwGetProcAddress can query the driver and get the addresses of the actual implementations of the supported OpenGL functions. GLAD calls this loader for each OpenGL function name, iterating over a list for each version of the specification (up to the version we selected during GLAD generation), and stores the results in global function pointers that our code can call directly.
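
As an illustration of what GLAD automates, a single function pointer could be loaded manually roughly like this (a sketch; the typedef name is made up, and a context must already be current):

// what GLAD does for every function, sketched for one of them
typedef void (*PFN_MyClearColor)(float r, float g, float b, float a);
PFN_MyClearColor myClearColor = (PFN_MyClearColor)glfwGetProcAddress("glClearColor");
if (myClearColor != NULL)
  myClearColor(0.2f, 0.3f, 0.3f, 1.0f);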

Render Loop

Program statements are executed sequentially, and when the last one is done, the process terminates. We usually want to keep the application running until the user issues an exit command. A while loop is how this behavior can be implemented in code, and in computer graphics, this loop is often referred to as a render loop or game loop depending on the context. The exit condition in our case is the close flag of the GLFW window, and the glfwWindowShouldClose call returns true if this flag is set (e.g., after the user clicks on the close button). Every iteration of this loop (one cycle of work) is colloquially called a frame. A frame can also refer to the fully rendered 2D image — the output of that iteration.

Double Buffering

Updating a color buffer while displaying that same buffer on the screen would result in screen tearing or flickering. For this reason, GLFW creates two buffers, front and back, and draws to the back buffer while displaying the front buffer. When glfwSwapBuffers is called, the back buffer becomes the front and vice versa.

By buffers, we actually mean framebuffers. For now, we’re using the default ones created by GLFW. Later, we’ll manually create them for other purposes.

One thing to keep in mind is that we need to call glViewport with the correct width and height of the current window so that we draw to the entire buffer, but we can choose to use a smaller region as well. Another reason this call is important is that it sets the correct aspect ratio for viewport transformation, which will be discussed later.

It is advisable to register a callback function via glfwSetFramebufferSizeCallback so that the viewport is updated when the window is resized.

User Input

In addition to creating windows and contexts, GLFW can also receive input events. glfwPollEvents handles events that are currently in the event queue and calls the callbacks associated with those events. Callback functions can be registered using calls such as glfwSetMouseButtonCallback.
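
For example, a click handler could be registered like this (a sketch; the callback name is arbitrary):

void MouseButtonCallback(GLFWwindow* window, int button, int action, int mods) {
  if (button == GLFW_MOUSE_BUTTON_LEFT && action == GLFW_PRESS) {
    // react to the click, e.g., start dragging
  }
}
// during initialization, after the window is created:
glfwSetMouseButtonCallback(window, MouseButtonCallback);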

Minimal Application

With this knowledge, we can create our first OpenGL application that displays a single color background.

Some calls in this program might fail, but the necessary checks are omitted.

#include <glad/glad.h>
#include <GLFW/glfw3.h>
void FramebufferSizeCallback(GLFWwindow*, int, int);
int main() {
  glfwInit();
  GLFWwindow* window = glfwCreateWindow(800, 600, "LearnOpenGL", NULL, NULL);
  glfwMakeContextCurrent(window);
  glfwSetFramebufferSizeCallback(window, FramebufferSizeCallback);
  gladLoadGLLoader((GLADloadproc)glfwGetProcAddress);
  glClearColor(0.2f, 0.3f, 0.3f, 1.0f);
  while (!glfwWindowShouldClose(window)) {
    glClear(GL_COLOR_BUFFER_BIT);
    glfwSwapBuffers(window);
    glfwPollEvents();
  }
  glfwTerminate();
  return 0;
}
void FramebufferSizeCallback(GLFWwindow* window, int width, int height) {
  glViewport(0, 0, width, height);
}

Rendering Pipeline

The sequence of steps that OpenGL takes to transform 3D shapes into 2D images is called the rendering pipeline. This pipeline is designed as a streaming architecture and consists of several stages, each one requiring the output of the previous one as its input. These stages are highly specialized and can be executed in parallel on thousands of GPU cores today.

Running in parallel means that while stage-A processes its new input, stage-B works on A’s previous output, and so on.

Shaders

Each stage in the rendering pipeline executes small programs on GPU cores to perform its tasks. A program that runs on the GPU is called a shader. They can be written in one of the many shading languages that exist today. OpenGL Shading Language (GLSL) is the default for OpenGL and widely supported, so it will be the language of choice for us.

Abstract Pipeline Structure

The pipeline structure is defined by standards bodies, e.g., Khronos Group, and implemented in hardware by GPU vendors, e.g., Nvidia. The OpenGL rendering pipeline consists of the following stages (in the given order):

  • Vertex Specification
  • Vertex Shader
  • Tessellation
  • Geometry Shader
  • Vertex Post-Processing
  • Primitive Assembly
  • Rasterization
  • Fragment Shader
  • Per-Sample Operations

The programmable stages are the Vertex Shader, Tessellation (its control and evaluation shaders), the Geometry Shader, and the Fragment Shader.

Vertex Specification

A vertex is a collection of attributes associated with a point in space. These attributes can include position, normal direction, texture coordinates, tangent vector, color, etc.

Since this is the first stage in the pipeline, vertex data must be provided by the application. The vertex data can be as simple as an array of positions where each element is a float corresponding to a value on one of three axes $(x,y,z)$. For example, a triangle formed by vertices $A$, $B$, and $C$ can be defined as follows:

float vertices[] = {
  // x, y, z
  -0.5f, -0.5f, 0.0f, // A
  0.5f, -0.5f, 0.0f, // B
  0.0f, 0.5f, 0.0f // C
};

How will OpenGL know this array represents a triangle and not two lines ($AB$, $BC$, and no $CA$)? We will tell OpenGL how to connect these points when initiating draw calls.

Sending data from CPU to GPU memory is relatively slow, so we want to send the data once and keep it in GPU memory for as long as we need it. We can store large amounts of vertex data in GPU memory via Vertex Buffer Objects (VBOs). We can create such a buffer by calling glGenBuffers, which assigns an ID to the buffer so we can refer to it later.

We have to bind this buffer to a target such as GL_ARRAY_BUFFER before being able to modify it. Then, we can send the vertex data to GPU memory via glBufferData by specifying the target, data length, data itself, and the expected usage pattern. The pattern GL_STATIC_DRAW is optimal for data that doesn’t need to change frequently but will be read many times.

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0); // unbind the vbo

GLuint is just an alias for unsigned int.

If we forget to unbind the buffer, all subsequent operations with the same target will affect its state. In modern OpenGL (4.5+), we have Direct State Access (DSA) that allows us to modify object state without affecting the global state. The previous code can be rewritten using DSA as follows:

GLuint vbo;
glCreateBuffers(1, &vbo);
glNamedBufferData(vbo, sizeof(vertices), vertices, GL_STATIC_DRAW);

If a buffer was created using glGenBuffers, it may not be compatible with DSA.

We will keep using the pre-DSA ways of doing things to be compatible with version 3.3+.

Vertex Shader

The vertex shader is a programmable stage in the pipeline that handles the processing of individual vertices. It receives a single vertex and outputs a single vertex, performing transformations or other per-vertex calculations in between. One of its predefined outputs is gl_Position which is of type vec4, and it must be set in the shader.

Vertex data we stored in the previous section will be consumed by the vertex shader. For this purpose, we need to define a vertex input for each attribute in the buffer. Since we only have one attribute, the position, we define a single vec3 input. It’s advised to assign a location to each attribute manually (as opposed to letting OpenGL do it) so that we don’t have to query the locations later. The following is a simple vertex shader that directly outputs the input position without doing any transformations.

You must specify the shader version at the top of the shader using the version directive.

#version 330 core
layout(location = 0) in vec3 i_pos;
void main() {
  gl_Position = vec4(i_pos, 1.0);
}

Why gl_Position has a fourth component and why it’s set to 1.0 will be discussed later.

Vertex shader files usually have the extension .vert.

Whether we have a single attribute or multiple attributes, we have to tell OpenGL how to interpret the vertex data in memory. The way we do it is by calling glVertexAttribPointer with the attribute’s index (location), the number of components per attribute (3 for position), the data type, whether to normalize the data, the stride (distance between consecutive attributes), and the attribute’s offset in the buffer. After this, glEnableVertexAttribArray must be called with the correct location to activate the attribute.

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);
glBindBuffer(GL_ARRAY_BUFFER, 0);

The VBO only stores raw vertex data, and it doesn’t remember the attribute settings we just made. So, all of these steps must be repeated whenever we want to draw an object. For this reason, there are Vertex Array Objects, which can store all the state needed to supply vertex data. In the following code, VAO remembers every state change that was done while it was bound.

There is no need to re-send the buffer data; it’s already stored in the VBO.

GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);
glBindVertexArray(0); // unbind the vao

On OpenGL 3.3+ core profile, VAOs are mandatory (you must bind one before drawing).

Shader Compilation

It would not be feasible to pre-compile shaders compatible with many hardware-driver combinations. If there are many shaders or shader variations, it would make more sense to compile them at runtime on the target platform. Moreover, when a shader is compiled at runtime, hardware-specific optimizations can be applied by the graphics driver.

The following code reads a vertex shader from a file, creates a shader object by calling glCreateShader, providing the source code via glShaderSource, and compiles the shader into an intermediate representation.

#include <fstream>
#include <sstream>
#include <string>
// read the shader file
std::ifstream fs("example.vert");
std::stringstream ss;
ss << fs.rdbuf();
fs.close();
std::string vertexText = ss.str();
const char* vertexSource = vertexText.c_str();
// compile the shader
GLuint vertexShader = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vertexShader, 1, &vertexSource, NULL);
glCompileShader(vertexShader);
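
Since compilation happens at runtime, it’s good practice to check the result. A minimal sketch of such a check (error handling beyond logging is omitted):

GLint compiled = 0;
glGetShaderiv(vertexShader, GL_COMPILE_STATUS, &compiled);
if (compiled != GL_TRUE) {
  char log[1024];
  glGetShaderInfoLog(vertexShader, sizeof(log), NULL, log);
  // print or log the compile error, e.g., std::cerr << log << '\n';
}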

Fragment Shader

A fragment contains all the data that is needed to shade a pixel. A fragment shader usually has a single color output. Unlike the vertex shader, there are no predefined output variables (the old gl_FragColor is deprecated). OpenGL assigns location 0 to the first output by default, but it can also be specified manually, especially when there are multiple outputs. The following is a fragment shader that assigns a predefined color to the output. The fourth component in the color vector is the alpha value that is used in blending.

#version 330 core
out vec4 o_color;
void main() {
  o_color = vec4(1.0f, 0.5f, 0.2f, 1.0f);
}

The process to create and compile a fragment shader is almost the same as for a vertex shader, except for the shader type (GL_FRAGMENT_SHADER) passed to the glCreateShader function.

GLuint fragmentShader;
fragmentShader = glCreateShader(GL_FRAGMENT_SHADER);

Fragment shader files usually have the extension .frag.

Shader Program

A shader program is the final linked version of multiple shaders. During linking, outputs of each shader are linked to the inputs of the next shader (by their names, unless manually given locations), which can result in errors if there is a mismatch in types (e.g., vec3 vs vec4) or interpolation qualifiers (e.g., flat vs smooth). The following code creates a shader program, attaches both vertex and fragment shaders, which were compiled before, to this program, and deletes the shader objects since they’re no longer needed.

GLuint shaderProgram;
shaderProgram = glCreateProgram();
glAttachShader(shaderProgram, vertexShader);
glAttachShader(shaderProgram, fragmentShader);
glLinkProgram(shaderProgram);
glDeleteShader(vertexShader);
glDeleteShader(fragmentShader);
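
Linking can also fail (e.g., due to the interface mismatches mentioned above), so a similar status check can be done on the program object (a sketch):

GLint linked = 0;
glGetProgramiv(shaderProgram, GL_LINK_STATUS, &linked);
if (linked != GL_TRUE) {
  char log[1024];
  glGetProgramInfoLog(shaderProgram, sizeof(log), NULL, log);
  // print or log the link error
}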

To activate a shader, we call glUseProgram with the program ID. The VAO stores all the state needed to draw our triangle, so we bind it. Then, we make a draw call by telling OpenGL how to interpret the data to assemble primitives, i.e., we set the draw mode to GL_TRIANGLES (see the OpenGL primitive documentation for more detail). The glDrawArrays call accepts two more inputs: the start index in the enabled arrays, and the number of vertices to render. By default, OpenGL fills the interior (i.e., faces) of polygon primitives, but this behavior can be changed by calling glPolygonMode with a mode other than GL_FILL, e.g., GL_LINE, which draws only the outline.

A collection of vertices, edges that connect them, and faces that are formed by loops constitute a mesh.

glUseProgram(shaderProgram);
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, 3);
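
For instance, drawing the same triangle as a wireframe could look like this (GL_FRONT_AND_BACK applies the mode to both faces):

glPolygonMode(GL_FRONT_AND_BACK, GL_LINE); // outline only
glDrawArrays(GL_TRIANGLES, 0, 3);
glPolygonMode(GL_FRONT_AND_BACK, GL_FILL); // restore the default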

Multiple Shader Attributes

We’ve so far had only one attribute: position. A VAO can store multiple attributes and reference multiple VBOs if these attributes are stored in different buffers. In most cases, we can store all the attributes in a single VBO in interleaved format (e.g., position0, color0, position1, color1, …). However, if some attributes need to be updated more frequently than others, it might be better to store them in separate VBOs.

Let’s update our vertices array to include per-vertex color data:

float vertices[] = {
  // x, y, z, r, g, b
  0.5f, -0.5f, 0.0f, 1.0f, 0.0f, 0.0f, // A
  -0.5f, -0.5f, 0.0f, 0.0f, 1.0f, 0.0f, // B
  0.0f, 0.5f, 0.0f, 0.0f, 0.0f, 1.0f // C
};

The vertex shader needs to be updated to include this new color input and an output to pass the color data to the fragment shader.

#version 330 core
layout(location = 0) in vec3 i_pos;
layout(location = 1) in vec3 i_color;
out vec3 color;
void main() {
  gl_Position = vec4(i_pos, 1.0);
  color = i_color;
}

Similarly, the fragment shader needs to be updated to receive the color value from the vertex shader. We have to use the same name (color) and type (vec3) for both the vertex shader output and the fragment shader input.

#version 330 core
in vec3 color;
out vec4 o_color;
void main() {
  o_color = vec4(color, 1.0f);
}

Finally, we need to update the attribute pointers so that they point to the correct locations in the buffer. The second set of calls now has 1 as the index argument, and the stride has been doubled since one set of attributes is now 6 floats long (3 for position, 3 for color). The offset value for the color attribute pointer must be 3 floats to correctly skip the position attribute. Also, the vertices array, which is in CPU memory, has been updated to include color values; hence, we need to update the GPU memory by sending the new array via glBufferData.

glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// send the updated data
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
// position
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 6 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);
// color
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 6 * sizeof(float), (void*)(3 * sizeof(float)));
glEnableVertexAttribArray(1);
glBindVertexArray(0);

A VAO needs to associate the attribute layout with the VBO that stores the attribute data. Hence, we need to bind the VBO after binding the VAO and before calling glVertexAttribPointer.

Coordinate Spaces, Systems and Frames

Notice that so far, we’ve used points within the range $[-1,1]$ on all axes. Also, recall that our vertex shader did not do any transformations, and directly output the values we set via a VBO. For a vertex to be visible on the screen in OpenGL, it must be in Normalized Device Coordinates (NDC) after it is processed. NDC is a space where all coordinates are normalized to $[-1,1]^3$. These numbers, however, are merely percentages that need to be transformed to a coordinate frame, e.g., screen coordinates, to represent actual positions. Before we go any further, it’s important to define the terms “space”, “system” and “frame”.

A geometric space is an abstract framework that defines the geometric rules (axioms), e.g., how to measure distances or angles, or how lines behave, to represent physical space. A Euclidean space is one such space where Euclidean geometry rules apply; for example, distances are calculated using the Euclidean distance formula. A coordinate system describes how to uniquely specify each point in a space. A Cartesian coordinate system specifies points using real numbers called coordinates, which are the signed distances from perpendicular oriented lines called coordinate lines or axes. The point where these axes meet is called the origin. The direction vectors that represent these axes (e.g., $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$) form an orthogonal basis, meaning that they are mutually orthogonal, and any vector in this system can be represented as a finite linear combination of these basis vectors. A coordinate frame is a specific instance of a coordinate system with a defined origin and basis. In computer graphics, a coordinate space usually means a frame of reference in space (a coordinate frame).

In graphics applications, some calculations can be done more efficiently and are more intuitive in certain coordinate spaces. We move a vector from one space to another by applying a transformation. Before diving into transformations, it’s important to build a solid understanding of vectors and matrices.

Vector Operations

A vector ($\vec{v}$) is a 1D array of numerical components. It can be of size $n$, which is the number of components the vector has. In computer graphics, we usually use vectors of size up to 4.

\[\vec{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}\]

A vector space ($V$) defines a set of axioms (e.g., commutativity, associativity, etc.), and a set of vector operations (e.g., addition, multiplication, etc.) over a field (e.g., real numbers ($\mathbb{R}$)) in algebraic terms. A Euclidean space satisfies all the axioms of a vector space over the real numbers.

We can add or subtract a scalar ($c$) to or from a vector, or multiply or divide a vector by a scalar by simply applying this operation to each component of the vector.

\[c\vec{a} = c \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} c \cdot a_1 \\ c \cdot a_2 \\ c \cdot a_3 \end{bmatrix}\]

The length (magnitude) of a vector is defined as the square root of the sum of the squares of its components.

\[\|\vec{a}\| = \sqrt{a_1^2+a_2^2+a_3^2}\]

A vector can be normalized to obtain a unit vector (a vector with a length of 1) by dividing its components by its length. Unit vectors are easy to work with when we only care about a vector’s direction.

Normalizing does not change a vector’s direction.

\[\hat{a} = \frac{\vec{a}}{\|\vec{a}\|} = \frac{1}{\sqrt{a_1^2+a_2^2+a_3^2}} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}\]

Two vectors of the same size can be added or subtracted through component-wise addition or subtraction.

\[\vec{a} + \vec{b} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ a_3 + b_3 \end{bmatrix}\]

GLSL defines vector-vector multiplication as component-wise multiplication. However, there are more useful and specialized forms of multiplication: dot and cross products.

Dot Product

In graphics applications, it’s important to know how much two vectors align, i.e., whether they’re parallel, perpendicular, or somewhere in between. Dot product is the operation that tells us about this relationship. It can be calculated by summing the component-wise products. The same result can be obtained by multiplying the lengths of the two vectors and the cosine of the angle between them. The second method is more intuitive because this operation is defined, geometrically, as the length of one vector’s projection onto the other, multiplied by the other’s length. One can verify that the two equations are identical through the use of the law of cosines on a triangle formed by the two vectors (making an angle $\theta$) and their difference vector connecting both.

The result of a dot product is a scalar (not a vector).

\[\vec{a} \cdot \vec{b} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = a_1b_1 + a_2b_2 + a_3b_3 = \|\vec{a}\| \|\vec{b}\| \cos{\theta}\]

If two vectors are perpendicular, their dot product is zero ($\cos{90^\circ} = 0$).

Dot product is commutative, that is, $\vec{a}\cdot\vec{b}$ is equal to $\vec{b}\cdot\vec{a}$.

A geometric space, e.g., a Euclidean space, is a vector space plus an inner product that defines lengths of vectors, angles between vectors, or orthogonality. An inner product, e.g., the dot product, is an operation that takes two vectors and produces a single scalar in a way that encodes geometric meaning — it lets us talk about lengths, angles, and orthogonality inside a vector space.

Cross Product

This operation takes two non-parallel vectors as input and outputs a vector that is orthogonal to both inputs. It will prove useful in future chapters.

\[\vec{a} \times \vec{b} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \times \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} a_2b_3 - a_3b_2 \\ a_3b_1 - a_1b_3 \\ a_1b_2 - a_2b_1 \end{bmatrix}\]
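
As a quick sanity check of these operations, here is a small sketch using GLM (a math library introduced later in this text); the values are arbitrary:

#include <glm/glm.hpp>
glm::vec3 a(1.0f, 0.0f, 0.0f);
glm::vec3 b(0.0f, 1.0f, 0.0f);
float length = glm::length(a);           // 1.0
glm::vec3 unitA = glm::normalize(a);     // (1, 0, 0)
float alignment = glm::dot(a, b);        // 0.0 -> perpendicular
glm::vec3 orthogonal = glm::cross(a, b); // (0, 0, 1) -> orthogonal to both inputs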

Matrix Operations

A matrix ($M$) is a 2D array of elements, where each element is identified by its row and column indices. If a matrix has $m$ rows and $n$ columns, it’s an $m \times n$ matrix, and these are called the matrix dimensions. Matrices can be used to solve systems of linear equations; for example, one matrix can store the coefficients while another stores the variables.

If both dimensions are the same, then the matrix is called a square matrix.

A matrix-scalar product multiplies each element of the matrix by a scalar. Addition and subtraction can be done element-wise if both matrices have the same dimensions.

\[A + B = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 + 5 & 2 + 6 \\ 3 + 7 & 4 + 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}\]

Two matrices, $A$ and $B$, can be multiplied (in this order) if the number of columns in $A$ is equal to the number of rows in $B$. Every element in a row of $A$ is multiplied by the corresponding element in a column of $B$. Then, these products are summed up to obtain one element in the resulting matrix $C$. The result obtained from processing row $i$ of $A$ and column $j$ of $B$ will end up in the $i^{th}$ row and $j^{th}$ column of $C$. This implies that the resulting matrix has the same number of rows as $A$ and the same number of columns as $B$.

\[AB = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{bmatrix} \\ = \begin{bmatrix} 5 + 14 & 6 + 16 \\ 15 + 28 & 18 + 32 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}\]

Matrix multiplication is not commutative, that is, $AB$ is not the same as $BA$.

\[BA = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 5 \cdot 1 + 6 \cdot 3 & 5 \cdot 2 + 6 \cdot 4 \\ 7 \cdot 1 + 8 \cdot 3 & 7 \cdot 2 + 8 \cdot 4 \end{bmatrix} \\ = \begin{bmatrix} 5 + 18 & 10 + 24 \\ 7 + 24 & 14 + 32 \end{bmatrix} = \begin{bmatrix} 23 & 34 \\ 31 & 46 \end{bmatrix}\]

When a matrix is transposed, its rows become its columns and vice versa. If $M$ has the dimensions $m \times n$, $M^T$ (transpose of $M$) has the dimensions $n \times m$.

\[A^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}, B^T = \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix}\]

The transpose of a product is equal to the product of the transposes in reverse order.

\[(AB)^T = B^TA^T = \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 5 \cdot 1 + 7 \cdot 2 & 5 \cdot 3 + 7 \cdot 4 \\ 6 \cdot 1 + 8 \cdot 2 & 6 \cdot 3 + 8 \cdot 4 \end{bmatrix} \\ = \begin{bmatrix} 5 + 14 & 15 + 28 \\ 6 + 16 & 18 + 32 \end{bmatrix} = \begin{bmatrix} 19 & 43 \\ 22 & 50 \end{bmatrix}\]

Transformations

A vector is basically an $n \times 1$ matrix, if represented as a column vector (i.e., its components appear in the same column); hence, it can be multiplied by an $m \times n$ matrix ($M\vec{v}$). Through matrix multiplication, a vector can be transformed into another vector. We use matrices for transforming vectors because they allow us to combine multiple transformations into a single matrix, which we’ll see later on.

GPUs are very good at multiplying thousands of matrices in parallel.

An identity matrix is an $n \times n$ matrix that has 1s on its main diagonal (from top-left to bottom-right) and 0s elsewhere. When you multiply any compatible matrix or vector with it, you get the original matrix or vector back. So, it’s essentially a no-op transform.

\[IA = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 + 0 \cdot 3 & 1 \cdot 2 + 0 \cdot 4 \\ 0 \cdot 1 + 1 \cdot 3 & 0 \cdot 2 + 1 \cdot 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\]

A diagonal matrix has non-zero entries only along its main diagonal.

Scaling

We can change the length (and direction) of a vector by scaling it. This is achieved by multiplying individual components by a scalar. If the same scalar is used for all components, it is a uniform scale operation; otherwise, it’s called a non-uniform scale.

A positive uniform scale operation does not change a vector’s direction. If it’s negative, then the vector points the opposite way.

We would like to form a scale matrix ($S$) so that the scaling operation could be represented as a matrix-vector multiplication. To obtain that matrix, let’s first write a set of equations that describes scaling for a vector in a Euclidean space defined by Cartesian coordinates:

\[x' = S_x \cdot x \\ y' = S_y \cdot y \\ z' = S_z \cdot z\]

Since there are 3 equations, there should be 3 rows in the scale matrix to store the coefficients for each equation. Also, since a 3D vector is a $3 \times 1$ matrix, our matrix needs to have 3 columns to be compatible. So, this will be a $3 \times 3$ matrix. Let’s rewrite the equations so that each one has 3 coefficients (columns):

\[x' = S_x \cdot x + 0 \cdot y + 0 \cdot z \\ y' = 0 \cdot x + S_y \cdot y + 0 \cdot z \\ z' = 0 \cdot x + 0 \cdot y + S_z \cdot z\]

We want to scale each axis independently; hence, we want no contribution from other axes. For this purpose, we set the coefficients of other components to 0. In this type of scenario, we obtain a diagonal matrix. These equations can be written in matrix form as follows:

\[\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} S_x & 0 & 0 \\ 0 & S_y & 0 \\ 0 & 0 & S_z \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}\]

Translation

We can move (translate) a vector by adding another vector to it. Similar to scaling, we would love to represent this too as a matrix multiplication, which will help us combine both matrices into one. Again, let’s start by writing a set of equations that translate a vector:

\[x' = x + T_x \\ y' = y + T_y \\ z' = z + T_z\]

Wait… can we obtain $x+T_x$ through matrix multiplication? This seems impossible… and it is, in the same dimensional space. The reason is that matrix multiplication is a linear transformation; but, translation is an affine transformation.

A transformation ($L$) is linear if it satisfies the following condition, where $a$ and $b$ are scalars:

\[L(a\vec{u} + b\vec{v}) = aL(\vec{u}) + bL(\vec{v})\]

An affine transformation ($A$) has the following form, where $L$ is the linear part, and $\vec{c}$ is a constant vector (e.g., translation vector):

\[A(\vec{u}) = L\vec{u} + \vec{c}\]

This is not linear because:

\[A(\vec{u} + \vec{v}) = L(\vec{u} + \vec{v}) + \vec{c} \neq A(\vec{u}) + A(\vec{v}) = L(\vec{u}) + \vec{c} + L(\vec{v}) + \vec{c}\]

There is, however, an augmentation technique we can use to obtain a translation matrix. But first, let’s expand the equations to include all the components, which must have a corresponding coefficient in each row of this matrix. It’s obvious that these coefficients should be 0. On the other hand, translation amounts must be preserved; hence, they are multiplied by 1.

\[x' = 1 \cdot x + 0 \cdot y + 0 \cdot z + T_x \cdot 1 \\ y' = 0 \cdot x + 1 \cdot y + 0 \cdot z + T_y \cdot 1 \\ z' = 0 \cdot x + 0 \cdot y + 1 \cdot z + T_z \cdot 1\]

It looks like our vector is not $(x,y,z)$ anymore, but rather $(x,y,z,1)$. Similarly, each row appears to have one more coefficient that is the translation amount. Let’s try to convert this to a matrix multiplication using the available information:

\[\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}\]

This operation is valid because the translation matrix ($T$) has dimensions of $3 \times 4$ and it is multiplied by a $4 \times 1$ vector, and the resulting vector is of size $3 \times 1$. More importantly, it gives the correct result. So, we’ve finally obtained a translation matrix by introducing a new dimension.

When a 3D point is represented in a 4D projective space, the new coordinate system is referred to as homogeneous coordinates.

There is a problem though. The $w$ component we added to our original vector makes it impossible to perform multiplications with our scale matrix since the dimensions $3 \times 3$ and $4 \times 1$ are not compatible ($3\neq 4$). As a workaround, we could add one extra column of 0s to our scale matrix:

\[\begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} S_x \cdot x \\ S_y \cdot y \\ S_z \cdot z \end{bmatrix}\]

This seems to work, but we do not just perform one transformation on a vector and call it a day; it’s often necessary to apply a series of transformations to the same vector. Let’s say we intend to apply a translation next, can we do it? Notice that we no longer have a $4x1$ vector; we have lost the $w$ component, which makes it impossible to perform this operation. It’s clear that we have to preserve the 4D representation while operating on the vector.

What dimensions does the scale matrix need to have to produce a $4 \times 1$ vector when multiplied by a $4 \times 1$ vector? Yes, the answer is $4 \times 4$. But, what values should we have in this new row? The $w$ component of the result must be 1, which suggests $(0,0,0,1)$.

\[\begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} S_x \cdot x \\ S_y \cdot y \\ S_z \cdot z \\ 1 \end{bmatrix}\]

Now, let’s try to apply scaling followed by translation. When using column vectors, this chain of operations is written left to right, but performed right to left. It follows the nested functions analogy: $f(g(h(x))) = (f \circ g \circ h)(x)$.

\[TS\vec{v} = \begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \\ = \begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} S_x \cdot x \\ S_y \cdot y \\ S_z \cdot z \\ 1 \end{bmatrix} = \begin{bmatrix} S_x \cdot x + T_x \\ S_y \cdot y + T_y \\ S_z \cdot z + T_z \\ 1 \end{bmatrix}\]

Matrix multiplication is associative, that is, $(AB)C = A(BC)$; hence, we are free to combine any adjacent pair without changing the order. This allows us to collapse the entire transformation chain into one matrix and multiply the combined result with the vector. If the transformations are predefined or don’t change often, this can save a lot of CPU/GPU time and memory. Let’s combine the translation and scale matrices:

\[\begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} S_x & 0 & 0 & T_x \\ 0 & S_y & 0 & T_y \\ 0 & 0 & S_z & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}\]

If the vector were represented as a row vector, the multiplication would be written left to right in the reverse order, i.e., we would take the transpose of the transformation chain: $(TS\vec{v})^T=\vec{v}^TS^TT^T$. Notice that the vector dimensions become $1 \times 4$, and the transform matrices are of size $4 \times 4$, which explains the need to reverse the order to make them compatible for multiplication.

\[\begin{bmatrix} x & y & z & 1 \end{bmatrix} \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ T_x & T_y & T_z & 1 \end{bmatrix} \\ = \begin{bmatrix} x & y & z & 1 \end{bmatrix} \begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ T_x & T_y & T_z & 1 \end{bmatrix}\]

Rotation

One way to represent rotations is by using three separate rotations around coordinate axes, applied in a specific sequence. For example, we first rotate around $x$ by $\alpha$, then around $y$ by $\beta$, and finally around $z$ by $\gamma$. These are called Euler angles. In different industries, these rotations might have different names; for example, in avionics, rotations around $x$, $y$ and $z$ are called pitch, yaw and roll, respectively, given that $y$ is up. The following are the most common rotation matrices, derived for the right-handed basis orientation.

\[R_z R_y R_x = \begin{bmatrix} \cos{\gamma} & -\sin{\gamma} & 0 & 0\\ \sin{\gamma} & \cos{\gamma} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos{\beta} & 0 & \sin{\beta} & 0\\ 0 & 1 & 0 & 0 \\ -\sin{\beta} & 0 & \cos{\beta} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & \cos{\alpha} & -\sin{\alpha} & 0 \\ 0 & \sin{\alpha} & \cos{\alpha} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]

This representation is easy to understand and visualize, but it’s not perfect. Before we point out the issues, let’s make some observations. The first thing to notice is that a rotation around a certain axis preserves the coordinates on that axis (through multiplication by 1), which is expected. Another thing to notice is that the order matters due to the non-commutative nature of matrix multiplication, but in what order should we apply these rotations?

An important thing to know is that there are two types of rotations: intrinsic and extrinsic. These describe the frame of reference you’re rotating about, which completely changes how the same sequence of angles plays out. With intrinsic rotations, the object is rotated about its local frame, which means that each rotation causes the local coordinate axes to move; the next rotation in the sequence happens relative to the new orientation. On the other hand, extrinsic rotations are about a fixed frame, e.g., world frame, or the parent object’s frame.

When we talk about rotations, we usually mean intrinsic rotations.

The problems associated with the representation above are not obvious at first glance. To give you a clue: the first axis (rightmost in the matrix multiplication, outermost in a three-gimbal mechanism) can spin freely as it’s fixed in the world frame, and the middle one is orthogonal to both the first and the last by definition, but the first and last axes can align when the middle axis is at its extremes (e.g., at 90 degrees). When two axes align, rotations around both have the same effect; hence, we lose one degree of freedom, which is called gimbal lock. Changing the multiplication order does not prevent this from happening; it just changes the pair that gets aligned.

To avoid gimbal lock, we could limit the movement of the middle axis, and in some cases, we could get away with it. For example, in an FPS game, players rarely look straight up at the sky or straight down at the ground, and they won’t be bothered when the rotation hits its limits, since a human head couldn’t physically turn beyond those angles anyway. However, this is not a fix, just a mitigation. To eliminate the possibility of gimbal lock altogether, modern graphics applications represent rotations using quaternions.

Algebraic Explanation of Gimbal Lock

Let’s say we have rotated around the $y$-axis by ${90^\circ}$, then the transformation becomes:

\[\begin{bmatrix} \cos{\gamma} & -\sin{\gamma} & 0 & 0\\ \sin{\gamma} & \cos{\gamma} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & \cos{\alpha} & -\sin{\alpha} & 0 \\ 0 & \sin{\alpha} & \cos{\alpha} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\] \[\begin{bmatrix} \cos{\gamma} & -\sin{\gamma} & 0 & 0\\ \sin{\gamma} & \cos{\gamma} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & \sin{\alpha} & \cos{\alpha} & 0\\ 0 & \cos{\alpha} & -\sin{\alpha} & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\] \[\begin{bmatrix} 0 & \cos{\gamma}\sin{\alpha} - \sin{\gamma}\cos{\alpha} & \cos{\gamma}\cos{\alpha} + \sin{\gamma}\sin{\alpha} & 0 \\ 0 & \sin{\gamma}\sin{\alpha} + \cos{\gamma}\cos{\alpha} & \sin{\gamma}\cos{\alpha} - \cos{\gamma}\sin{\alpha} & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]

Using trigonometric identities, e.g., $\cos{\gamma}\sin{\alpha} - \sin{\gamma}\cos{\alpha} = \sin{(\alpha-\gamma)}$, we can rewrite this matrix as:

\[\begin{bmatrix} 0 & \sin{(\alpha-\gamma)} & \cos{(\alpha-\gamma)} & 0 \\ 0 & \cos{(\alpha-\gamma)} & -\sin{(\alpha-\gamma)} & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]

Notice that the final rotation matrix only depends on the difference $\alpha-\gamma$, not on $\alpha$ and $\gamma$ individually; they’re now coupled. Substituting $\theta$ for $\alpha-\gamma$, we can see that we now have two degrees of freedom, $\beta$ and $\theta$, one less compared to before: $\alpha$, $\beta$, $\gamma$.

Transformation Order

In a matrix-based system, the order matters because the multiplication operation is not commutative. When deciding on an order, we must consider in what space each operation should happen. There is no right or wrong answer — it all depends on what result we want to achieve at the end. Let’s analyze $TS\vec{v}$, which we derived before, in reverse order ($ST\vec{v}$).

\[\begin{bmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} S_x & 0 & 0 & S_x \cdot T_x \\ 0 & S_y & 0 & S_y \cdot T_y \\ 0 & 0 & S_z & S_z \cdot T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}\]

As you can see, if translation is applied first, then the translation vector is scaled as well. If it was rotation that followed translation, then the object would be rotated about a shifted origin, which would result in an arc-shaped movement. If scale follows rotation, then it’s applied with respect to the new orientation, which would make even a uniform scale look non-uniform. In many applications, we want to scale first, then rotate, and finally translate: $TRS\vec{v}$.

GLM

In graphics applications, it’s common, and often necessary, to perform matrix operations on the CPU. For example, in a game engine, object hierarchies are stored in CPU memory, and the transforms need to be re-calculated only when an object’s local transform or a parent transform changes, which can be done more efficiently on the CPU. The OpenGL Mathematics Library (GLM) is a header-only C++ math library that provides a large set of classes and functions that follow the same naming conventions and functionality as GLSL. It can be added to a CMake project as we did with GLFW.

git submodule add https://github.com/g-truc/glm external/glm
add_subdirectory("${CMAKE_CURRENT_SOURCE_DIR}/external/glm")
target_link_libraries("${PROJECT_NAME}" PUBLIC glm)

Now, we can include the required GLM headers and define the transformation matrices. We usually start with an identity matrix and call glm::rotate, glm::scale, or glm::translate to obtain a combined matrix. Since GLM, like OpenGL, works with column vectors (a vector is multiplied on the right of a matrix), we place the first transformation to apply at the right end of the multiplication.

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>
int main() {
  // vertex specification & shader creation
  // ...
  auto unit = glm::mat4(1.0f);
  // a uniform scale of .5
  auto scale = glm::scale(unit, glm::vec3(0.5f, 0.5f, 0.5f));
  // rotate 90 degrees around the z-axis
  auto rotate = glm::rotate(unit, glm::radians(90.0f), glm::vec3(0.0f, 0.0f, 1.0f));
  // translate by (.3, .2, .1)
  auto translate = glm::translate(unit, glm::vec3(0.3f, 0.2f, 0.1f));
  // construct the transform matrix
  auto transform = translate * rotate * scale;
  // ...
}

A transformation is typically defined per object — it applies to all vertices of that object. When a draw call (e.g., glDrawArrays) is issued, the GPU launches many shader invocations in parallel — one per vertex, fragment, etc. GLSL defines per-draw, read-only variables called uniforms, which are stored in a dedicated, broadcast-friendly area of GPU memory, so every invocation can read them cheaply.

Uniform variables are global to the program object; if both vertex and fragment shaders define the same uniform, the linker treats them as referring to the same data.

In our vertex shader, we can define a uniform of type mat4 for the transform matrix. Then, we multiply the position vector with this matrix to obtain the final position. Notice that we represent the position in homogeneous coordinates so that it’s compatible with the matrix, and the $w$ component is 1.0 since this is a position vector.

#version 330 core
layout(location = 0) in vec3 i_pos;
uniform mat4 transform;
void main() {
  gl_Position = transform * vec4(i_pos, 1.0);
}

To send the transform matrix to the shader, we first need to query its location via glGetUniformLocation. It’s advised to cache this location for later use since every communication with the graphics driver adds some latency. Then, we can send the transform data by calling glUniformMatrix4fv with the following arguments: location, the number of matrices to set, whether to transpose the matrix, and a pointer to the matrix data.

glUseProgram(shaderProgram);
GLint transformLoc = glGetUniformLocation(shaderProgram, "transform");
glUniformMatrix4fv(transformLoc, 1, GL_FALSE, glm::value_ptr(transform));

Common Transformations

In most 3D applications, we define a few different coordinate spaces (or frames) to make it easier to perform and reason about certain operations. We set initial vertex positions with respect to a local frame that’s defined per object (model) so that they can be created independently (by multiple artists). Then, we bring multiple objects into a common (world) frame so that we can talk about their relative properties and make them interact with each other. We are only interested in a portion of this world that is visible to a camera, so we bring everything to a new (view) frame where the camera is at the origin. Finally, we define a volume that contains all the visible vertices, and discard the rest.

The use of multiple spaces requires us to apply a sequence of transformations to obtain the final (screen) position. However, they make it easier to selectively update any of the transform matrices. For example, if a 3D scene contains multiple cameras, and we want to give the user the ability to switch between them, then we only need to update the view matrix and can keep the rest.

The matrix that transforms a point defined in an object’s local frame to a point in the world frame is called the model matrix, and it’s obtained the same way we constructed the transform matrix in the previous chapter. The matrix that moves points in the world frame to the camera’s local frame is called the view matrix. The process to construct a view matrix is a bit different because of how we represent a camera in computer graphics. Unlike object, world and view spaces, clip space does not define a coordinate frame — it defines a bounded volume. The coordinates in the clip space are only meaningful relative to the $w$ component, and their values are bounded by it. The matrix that transforms view space to clip space is called the projection matrix. Once everything is in clip space, an operation called perspective division maps clip space to NDC space.
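
To make the chain concrete, here is a sketch of how these matrices are typically combined on the CPU and sent to the vertex shader as a single mat4 uniform (like the transform uniform shown earlier). The camera values, the 800/600 aspect ratio, and mvpLoc are illustrative assumptions; glm::lookAt and perspective projection are discussed below.

glm::mat4 model = glm::mat4(1.0f); // local-to-world (identity for an object at the origin)
glm::mat4 view = glm::lookAt(glm::vec3(0.0f, 0.0f, 3.0f),  // camera position
                             glm::vec3(0.0f, 0.0f, 0.0f),  // target
                             glm::vec3(0.0f, 1.0f, 0.0f)); // world up
glm::mat4 projection = glm::perspective(glm::radians(45.0f), 800.0f / 600.0f, 0.1f, 100.0f);
glm::mat4 mvp = projection * view * model; // applied right to left: model, then view, then projection
glUniformMatrix4fv(mvpLoc, 1, GL_FALSE, glm::value_ptr(mvp));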

Camera

View space can be defined as the camera’s coordinate frame. GLM provides glm::lookAt to calculate the view matrix, which accepts three inputs: eye (camera position in world space), center (target position in world space), and up (positive $y$ direction in world space). GLM internally calculates a forward ($z$) vector by subtracting the camera position from the target position. Then, it obtains a right ($x$) vector by taking the cross product of the forward and up vectors. Because the provided up vector might not be orthogonal to both the forward and right vectors, GLM calculates a local up ($y$) vector by taking the cross product of these two vectors. The final three direction vectors with the camera at the origin constitute a new coordinate frame.

static constexpr glm::vec3 WORLD_UP = glm::vec3(0.0f, 1.0f, 0.0f);
auto position = glm::vec3(-2.0f, 2.0f, 2.0f);
auto target = glm::vec3(0.0f, 0.0f, 0.0f);
auto view = glm::lookAt(position, target, WORLD_UP);

The view matrix can also be constructed manually by first calculating the camera’s basis vectors, then forming the rotation and translation matrices, and finally combining them. The rotation part of the view matrix is the inverse of the camera’s orientation; because the basis vectors are orthonormal, the inverse is simply the transpose, which means each basis vector ends up in a row of the matrix. Since the glm::mat4 constructor accepts column vectors, we build the matrix from the basis vectors as columns and then transpose it.

Don’t forget to normalize the results, because we’re only interested in directions.

auto forward = glm::normalize(target - position);
auto right = glm::normalize(glm::cross(forward, WORLD_UP));
auto up = glm::cross(right, forward);
auto rotate = glm::transpose(glm::mat4(glm::vec4(right, 0.0f), glm::vec4(up, 0.0f), glm::vec4(-forward, 0.0f), glm::vec4(0.0f, 0.0f, 0.0f, 1.0f)));
auto translate = glm::translate(glm::mat4(1.0f), -position);
auto view = rotate * translate;

To make the camera the origin point of this new coordinate frame, we apply a translation that is equal to the negative of the position vector — their sum is $(0,0,0)$. OpenGL uses the right-handed coordinate system, and expects the camera to look along the negative $z$-axis in view space, i.e., objects in front of the camera should have negative $z$ coordinates. As a result, what’s considered “forward” in world space should map to negative $z$ in view space. To achieve this, we negate the forward vector when constructing the rotation matrix, regardless of its world space direction. Also, notice that translation is applied first, then rotation. This is because we want to rotate objects around the camera, not the world origin.

Camera Movement

In many applications, like most games, we have a moving camera that the user can control in some way. So far, we’ve hardcoded the arguments when creating a view matrix. In practice, this matrix can be updated as frequently as every frame, e.g., in an FPS game. To implement this, all we need to do is to call the glm::lookAt function in the render loop, and specify a dynamic target that is in front of the camera. We’ve learned that the target position is needed to calculate a forward vector. This calculation, however, can also be done in reverse — we can obtain a target by adding the forward vector to the current position. The forward vector must be initialized beforehand, though.

auto forward = glm::vec3(0.0f, 0.0f, -1.0f);
auto view = glm::lookAt(position, position + forward, up);

It would be nice if we could move the camera with key presses. Remember that GLFW can also receive input events — we can read the WASD keys to update the camera position. glfwGetKey is the function we call to read key status, which takes two inputs: the window pointer, and a key ID. The return value is compared against a predefined action, e.g., GLFW_PRESS, for confirmation.

Despite not being shown explicitly in some code blocks, if something has to be done every frame, e.g., reading inputs, it goes in the render loop.

static constexpr float speed = 0.05f;
if (glfwGetKey(window, GLFW_KEY_W) == GLFW_PRESS)
  position += forward * speed;
if (glfwGetKey(window, GLFW_KEY_S) == GLFW_PRESS)
  position -= forward * speed;
if (glfwGetKey(window, GLFW_KEY_D) == GLFW_PRESS)
  position += right * speed;
if (glfwGetKey(window, GLFW_KEY_A) == GLFW_PRESS)
  position -= right * speed;

One thing to notice is that the final position after 1 second depends on how many frames have been rendered, which is hardware-dependent. However, we usually want to have consistent results across different hardware. We can achieve this by scaling the increment amounts with the real time that passed to render the frame. Since we can’t calculate the frame time before the current frame ends, we use the most recent value instead. The time between two consecutive frames is called the delta time, and it can be calculated by subtracting the last frame’s time from the current time (at the beginning of the loop).

auto deltaTime = 0.0f;
auto lastFrame = 0.0f;
while (!glfwWindowShouldClose(window)) {
  float currentFrame = static_cast<float>(glfwGetTime());
  deltaTime = currentFrame - lastFrame;
  lastFrame = currentFrame;
  if (glfwGetKey(window, GLFW_KEY_W) == GLFW_PRESS)
    position += forward * speed * deltaTime;
  // ...
}

Even though we can move on the $xz$-plane, we are currently stuck with the same camera orientation — basis vectors do not change. To turn around, we would need to update the forward vector, and the others accordingly. Since we are talking about rotations, we could use Euler angles to represent them. We can use the avionics terms to describe the rotations: pitch, yaw, and roll. We will bind these to mouse movements. However, a mouse moves on a plane, so a roll movement is not possible. Front-back movement can represent pitch, and left-right movement can be interpreted as yaw. We can achieve $360^\circ$ coverage with combinations of pitch and yaw values.

Counter-clockwise rotations are considered positive in OpenGL. We usually represent rotations in radians, which is the format expected by GLM. A rotation that makes an angle $\beta$ (yaw) around the $y$-axis has a projection along the $x$-axis with length $\cos{\beta}$, and along the $z$-axis with length $\sin{\beta}$. Similarly, a rotation by an angle $\alpha$ (pitch) around the $x$-axis has a projection on the $xz$-plane with length $\cos{\alpha}$, and along the $y$-axis with length $\sin{\alpha}$. If we combine these, we obtain the following equations for the components of the forward vector.

// note: with forward initialized to (0, 0, -1), yaw should start at -90 degrees (and pitch at 0)
forward.x = cos(yaw) * cos(pitch);
forward.y = sin(pitch);
forward.z = sin(yaw) * cos(pitch);

If the forward vector is modified, the right and up vectors should be updated as well to re-orient the coordinate frame.

right = glm::normalize(glm::cross(forward, WORLD_UP));
up = glm::normalize(glm::cross(right, forward));

Remember that we can register callbacks for input events with GLFW during initialization. We can move the update logic related to camera axes into the cursor position callback. We will assume that they are defined globally, and are accessible in this function. In a serious project, we would need to re-structure this code.

In the following callback function, we calculate the difference in mouse position between calls, then scale it with a sensitivity term, and finally add the horizontal ($x$) difference to yaw and the vertical ($y$) difference to pitch. The $y$ difference is negated because screen coordinates range from top to bottom. Also, if you look at the calculation of the forward vector again, you’ll see that at pitch angles close to $\pm90^\circ$, both $x$ and $z$ components approach 0 while $y$ goes to 1, resulting in the forward vector aligning with the up vector. Consequently, the cross product used to calculate the right vector becomes unstable, oscillating near $(0,0,0)$, which causes sudden $180^\circ$ flips (i.e., “up” becomes “down” and vice versa). Hence, it’s advised to limit the pitch to some safe range like $[-89^\circ,89^\circ]$.

We change the cursor mode to GLFW_CURSOR_DISABLED during initialization so that the cursor becomes invisible and can’t leave the window, while allowing for unlimited mouse movement (i.e., position is no longer clamped to screen edges).

void CursorPosCallback(GLFWwindow* window, double xpos, double ypos) {
  static constexpr float PITCH_LIMIT = glm::radians(89.0f);
  static constexpr float SENSITIVITY = 0.001f;
  static auto firstEnter = true;
  static glm::vec2 mousePos;
  static glm::vec2 mouseLast;
  static glm::vec2 mouseDiff;
  mousePos.x = xpos;
  mousePos.y = ypos;
  if (firstEnter)
    mouseLast = mousePos;
  firstEnter = false;
  mouseDiff.x = (mousePos.x - mouseLast.x) * SENSITIVITY;
  mouseDiff.y = (mouseLast.y - mousePos.y) * SENSITIVITY;
  mouseLast = mousePos;
  yaw += mouseDiff.x;
  pitch += mouseDiff.y;
  pitch = std::clamp(pitch, -PITCH_LIMIT, PITCH_LIMIT);
  // update camera's basis vectors
  // update the view matrix
}
int main() {
  // ...
  glfwSetInputMode(window, GLFW_CURSOR, GLFW_CURSOR_DISABLED);
  glfwSetCursorPosCallback(window, CursorPosCallback);
  // ...
}

Since the cursor can enter the window at an arbitrary position, we initially set mouseLast to mousePos to prevent a sudden jump.
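
The two commented steps at the end of the callback could look like the following. This is just a sketch, assuming the camera globals mentioned earlier (yaw, pitch, position, forward, right, up, WORLD_UP, and view) are accessible.

void UpdateCameraVectors() {
  forward.x = cos(yaw) * cos(pitch);
  forward.y = sin(pitch);
  forward.z = sin(yaw) * cos(pitch);
  forward = glm::normalize(forward);
  right = glm::normalize(glm::cross(forward, WORLD_UP));
  up = glm::normalize(glm::cross(right, forward));
  // glm::lookAt builds a view matrix from an eye position, a target point, and an up vector
  view = glm::lookAt(position, position + forward, up);
}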

Projection

Clip space is the result of applying a projection matrix to a region of the view space defined by some boundaries. This bounded region is called the viewing volume, and any point inside this volume that survives the depth test will end up on the screen. In clip space, points are represented using homogeneous coordinates, i.e., $(x,y,z,w)$, and are not yet normalized, i.e., they’re not in the form $(x',y',z')$. The $w$ component was added for convenience — to enable translation to be expressed as matrix multiplication. At the projection stage, we repurpose this component to store depth information. But $z$ already represents depth (distance from the camera) in view space, so why do we need the $w$ component? After applying the projection, $z$ is no longer the original depth — it’s been remapped for the depth buffer (usually to the $[0,1]$ range). The perspective projection matrix therefore puts the (negated) view space $z$ value into $w$. Note that this is only needed when perspective projection is used — for the perspective division that happens after the projection matrix is applied. In orthographic projection, on the other hand, $w$ remains 1 throughout the pipeline. Now, let’s explore these two types of projection.

Orthographic Projection

This type of projection is an affine transformation — it preserves straight lines and ratios along a line (e.g., midpoints stay midpoints). It can be expressed as a combination of a linear transformation and a translation in Cartesian coordinates. To create an orthographic projection matrix, we first define a box-shaped viewing volume (a cuboid) bounded by six axis-aligned planes: near ($n$), far ($f$), left ($l$), right ($r$), bottom ($b$), and top ($t$). Then, we calculate the scaling factors and translation amounts that map each point in this volume to clip space ($[l,r]\times[b,t]\times[-f,-n] \rightarrow [-1,1]^3$), which is equal to NDC when orthographic projection is used.

$x_e$ is the eye (view) space, $x_c$ is the clip space, and $x_n$ is the NDC space coordinate.

\[\frac{x_c}{1-(-1)} = \frac{x_e-\frac{r+l}{2}}{r-l} \Rightarrow x_c = \frac{2x_e-(r+l)}{r-l} \\ y_c = \frac{2y_e-(t+b)}{t-b} \\ \frac{z_c}{1-(-1)} = -\frac{z_e+\frac{f+n}{2}}{f-n} \Rightarrow z_c = \frac{2z_e+(f+n)}{n-f}\]

We subtract the midpoint, e.g., $(r+l)\div2$, from each coordinate so that the points on the left map to $[-1,0]$ while those on the right map to $[0,1]$. The cuboid is usually centered on the $xy$-plane, i.e., $l$ and $b$ are equal to negative $r$ and $t$, respectively. By convention, near and far planes are given as positive distances, and since the camera looks down the negative $z$-axis, the volume spans $[-f,-n]$ along $z$ with its midpoint at $-(f+n)\div2$. As opposed to view space, NDC uses a left-handed coordinate system, i.e., far maps to 1, and near maps to $-1$. The scale along the $z$-axis is negated, because larger (less negative) $z$ coordinates represent points that are closer to the near plane. This set of equations can be written in matrix form as follows:

\[\begin{bmatrix} x_n \\ y_n \\ z_n \\ w_n \end{bmatrix} = \begin{bmatrix} x_c \\ y_c \\ z_c \\ w_c \end{bmatrix} = M_{orthographic}\vec{v_e} = \begin{bmatrix} \frac{2}{r-l} & 0 & 0 & -\frac{r+l}{r-l} \\ 0 & \frac{2}{t-b} & 0 & -\frac{t+b}{t-b} \\ 0 & 0 & -\frac{2}{f-n} & -\frac{f+n}{f-n} \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_e \\ y_e \\ z_e \\ w_e \end{bmatrix}\]

This matrix can be created by calling glm::ortho with the plane coordinates as inputs. To prevent stretching or squashing due to a mismatch in aspect ratio between the viewport and the viewing volume, we multiply the width (length along the $x$-axis) of the cuboid by the aspect ratio ($width/height$).

auto size = 2.0f;
auto aspect = 16.0f / 9;
auto near = 0.1f;
auto far = 100.0f;
auto left = -size * aspect;
auto bottom = -size;
glm::mat4 projection = glm::ortho(left, -left, bottom, -bottom, near, far);

Perspective Projection

Orthographic projection has its uses in many engineering applications since it preserves the relative sizes of objects, i.e., there is no sense of depth. However, in many other 3D applications, we want realistic results. In the context of projection, realism can be achieved by simulating the natural phenomenon of perspective foreshortening — objects appear smaller as they move farther away from the eye (camera). For example, a human close to the camera may appear the same height as a mountain that is miles away. It’s clear that a cuboid cannot represent this viewing volume — if it barely covers a human near the near plane, it will cover only a fraction of the mountain at the other end, since both faces have the same area. A more appropriate volume would have a pyramid-like shape, whose cross-sectional area grows with increasing distance. However, its top is cut off because the near plane distance is slightly greater than zero (for various reasons). This truncated shape is called a frustum.

In orthographic projection, we projected each component of a point independently through linear transformations (scale + translate). In perspective projection, we have to map points on both near and far planes (and those in between) to the same $[-1,1]$ range, which implies that the larger plane must be “squeezed” more, and the scale amount is proportional to the depth. Scaling by $z$ (depth) is not a linear transformation — it’s division by a component of the input vector. This cannot be represented as matrix multiplication; hence, it has to happen in a separate step called perspective division, after the projection matrix has been applied. Since we will lose the original $z_e$ value when we move from view space to clip space, we will store it in the $w_c$ component.

We can hypothetically project any point inside the frustum onto the near plane to have a better understanding of where each point will end up in the final image. In the process, we calculate the ratios between the $x_e$ and $y_e$ coordinates and their projections ($x_p$ and $y_p$) using the properties of similar triangles.

\[\frac{x_p}{x_e} = \frac{n}{-z_e} \Rightarrow x_p = x_e\frac{n}{-z_e} \\ \frac{y_p}{y_e} = \frac{n}{-z_e} \Rightarrow y_p = y_e\frac{n}{-z_e}\]

Once they’re on the near plane, we can linearly map both $x_p$ and $y_p$ to NDC ($x_n$ and $y_n$) just like we did in orthographic projection. However, we were supposed to go from eye space to NDC, so we need to rewrite the projected coordinates in terms of the eye space coordinates, which we calculated above. Remember that we can’t represent division by $-z_e$ in matrix form, which is the reason we’ll store the value in $w_c$. Hence, we can get rid of the $-z_e$ in the denominator by multiplying everything by it. Notice that the NDC coordinates multiplied by $-z_e$ are just the clip space coordinates.

\[x_n = \frac{2x_p}{r-l}-\frac{r+l}{r-l} \Rightarrow x_n = \frac{2nx_e}{-z_e(r-l)}-\frac{r+l}{r-l} \\ \Rightarrow -z_ex_n = \frac{2nx_e}{r-l}+\frac{z_e(r+l)}{r-l} = x_c \\ -z_ey_n = \frac{2ny_e}{t-b}+\frac{z_e(t+b)}{t-b} = y_c\]

Projecting $z_e$ onto the near plane would map every point to the same value, $-n$, so there is not much information to gain from that. We know that $z_n$ does not depend on $x_e$ or $y_e$, so we can write $z_c$ as a linear combination of $z_e$ and $w_e$. In eye space, $w$ is 1, so the equation can be written as follows:

\[z_n = \frac{z_c}{w_c} = \frac{az_e+bw_e}{-z_e} = \frac{az_e+b}{-z_e}\]

We want to map $[-n,-f]$ to $[-1,1]$ when transforming eye coordinates to NDC, which gives us two equations to solve for $a$ and $b$.

\[-1 = \frac{-na+b}{n},\, 1 = \frac{-fa+b}{f} \\ \Rightarrow a = -\frac{f+n}{f-n},\, b = -\frac{2fn}{f-n} \\ z_c = -z_ez_n = -\frac{f+n}{f-n}z_e -\frac{2fn}{f-n}\]

We now have all the elements needed to construct the perspective projection matrix. Notice that after the multiplication, the clip space $w$ component is equal to the negated eye space $z$, which is the value used in the perspective divide to normalize all components of the clip space vector. This final step is performed by the GPU, and the result is the NDC coordinates.

\[\begin{bmatrix} x_c \\ y_c \\ z_c \\ w_c \end{bmatrix} = M_{perspective}\vec{v_e} = \begin{bmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix} \begin{bmatrix} x_e \\ y_e \\ z_e \\ w_e \end{bmatrix}\]
auto fov = glm::radians(45.0f);
auto aspect = 16.0f / 9;
auto near = 0.1f;
auto far = 100.0f;
glm::mat4 projection = glm::perspective(fov, aspect, near, far);
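
To make the role of $w_c$ concrete, here is a small sketch (using GLM on the CPU) that applies the projection to a view-space point and performs the perspective divide manually; in the actual pipeline, the GPU does this divide after the vertex shader.

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

glm::mat4 projection = glm::perspective(glm::radians(45.0f), 16.0f / 9, 0.1f, 100.0f);
glm::vec4 viewPos{1.0f, 0.5f, -10.0f, 1.0f};        // a point 10 units in front of the camera
glm::vec4 clipPos = projection * viewPos;           // clipPos.w == -viewPos.z == 10
glm::vec3 ndcPos = glm::vec3(clipPos) / clipPos.w;  // perspective divide done manually; all components end up in [-1,1]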

Depth Buffer

Perspective projection remaps $z$ into a normalized range $[-1,1]$ in NDC; however, unlike orthographic projection, this mapping is nonlinear (by design), which gives higher precision to depths closer to the near plane. For example, a point halfway between the near and far planes of the frustum ends up very close to the far side of the cube, because most of the $[-1,1]$ range is spent on depths near the near plane. This mimics how we see in the real world — our eyes are more sensitive to depth changes nearby. On the other hand, it has side effects like depth fighting ($z$-fighting) in distant geometry, i.e., there is not enough precision to reliably determine which surface is in front of the other, and this leads to flickering or shimmering.
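
As a side note, the nonlinear value stored in the depth buffer can be mapped back to a linear view-space distance. Below is a small sketch of such a helper, assuming the default $[0,1]$ depth range and the same near/far planes used to build the projection matrix.

// depth is the [0,1] value sampled from the depth buffer
float LinearizeDepth(float depth, float near, float far) {
  float ndcZ = depth * 2.0f - 1.0f;  // back to NDC [-1,1]
  return (2.0f * near * far) / (far + near - ndcZ * (far - near));
}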

OpenGL stores per-fragment depth information in a depth buffer (or $z$-buffer). Just like the color buffer, a default depth buffer is created by GLFW. When depth testing is enabled, OpenGL compares a fragment’s depth with the existing value in the buffer; if the fragment is in front, the value in the buffer is overwritten. This way, objects closer to the camera are the ones that appear in the final image. Depth testing is disabled by default — an OpenGL capability can be enabled via a glEnable call by specifying its ID. When it’s enabled, the depth buffer should be included in the glClear calls to remove residual data from the previous frame.

glEnable(GL_DEPTH_TEST);
while (!glfwWindowShouldClose(window)) {
  glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
  // ...
}

Since we have enabled depth testing, we can correctly render more complex 3D shapes. Let’s replace our triangle with a cube by specifying the positions of its 36 vertices — a cube has six faces (quads), each of which is formed by two triangles (six vertices). Of the four vertices that define a quad (when connected), two are shared between the triangles that form the quad. Each corner of the cube is also shared by three faces. Depending on how we split the faces in half, the same vertex may appear three to six times in the vertices array.

float vertices[] = {
  -0.5f, -0.5f, -0.5f, // 0
  0.5f, -0.5f, -0.5f, // 1
  0.5f, 0.5f, -0.5f, // 2
  0.5f, 0.5f, -0.5f, // 2
  -0.5f, 0.5f, -0.5f, // 3
  -0.5f, -0.5f, -0.5f, // 0
  -0.5f, -0.5f, 0.5f, // 4
  0.5f, -0.5f, 0.5f, // 5
  0.5f, 0.5f, 0.5f, // 6
  0.5f, 0.5f, 0.5f, // 6
  -0.5f, 0.5f, 0.5f, // 7
  -0.5f, -0.5f, 0.5f, // 4
  -0.5f, 0.5f, 0.5f, // 7
  -0.5f, 0.5f, -0.5f, // 3
  -0.5f, -0.5f, -0.5f, // 0
  -0.5f, -0.5f, -0.5f, // 0
  -0.5f, -0.5f, 0.5f, // 4
  -0.5f, 0.5f, 0.5f, // 7
  0.5f, 0.5f, 0.5f, // 6
  0.5f, 0.5f, -0.5f, // 2
  0.5f, -0.5f, -0.5f, // 1
  0.5f, -0.5f, -0.5f, // 1
  0.5f, -0.5f, 0.5f, // 5
  0.5f, 0.5f, 0.5f, // 6
  -0.5f, -0.5f, -0.5f, // 0
  0.5f, -0.5f, -0.5f, // 1
  0.5f, -0.5f, 0.5f, // 5
  0.5f, -0.5f, 0.5f, // 5
  -0.5f, -0.5f, 0.5f, // 4
  -0.5f, -0.5f, -0.5f, // 0
  -0.5f, 0.5f, -0.5f, // 3
  0.5f, 0.5f, -0.5f, // 2
  0.5f, 0.5f, 0.5f, // 6
  0.5f, 0.5f, 0.5f, // 6
  -0.5f, 0.5f, 0.5f, // 7
  -0.5f, 0.5f, -0.5f // 3
};

Duplicates are fine since they do not inherently harm performance, as long as there is enough GPU memory. In fact, they are often necessary, e.g., when we need to encode per-face attributes such as surface normals. However, in this case, we only have position attributes, and there is a more elegant way of representing this data.

OpenGL provides a variety of targets for a buffer to bind to. One such target is GL_ELEMENT_ARRAY_BUFFER, which indicates that the buffer contains indices. An index buffer is often referred to as an Element Buffer Object (EBO). An EBO is VAO-specific, which means a VAO can only refer to one such buffer. During indexed rendering, the stored indices are used to access the elements in any vertex attribute buffer (VBO) that the bound VAO refers to (via attribute pointers).

A VBO can be unbound before the VAO is, as long as the attribute pointers have already been configured. In contrast, unbinding an EBO while a VAO is bound will disassociate it from that VAO.
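
In code, a safe unbind order could look like this (a small sketch, assuming the vao, vbo, and ebo handles used in the setup code further below):

glBindBuffer(GL_ARRAY_BUFFER, 0);          // safe: the attribute pointers were already recorded in the VAO
glBindVertexArray(0);                      // unbind the VAO first...
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);  // ...so this does not remove the EBO from the VAO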

We can reduce the size of the vertices array from 36 positions to just 8, one for each corner, by introducing an indices array that defines the triangles.

float vertices[] = {
  -0.5f, -0.5f, -0.5f, // 0
  0.5f, -0.5f, -0.5f, // 1
  0.5f, 0.5f, -0.5f, // 2
  -0.5f, 0.5f, -0.5f, // 3
  -0.5f, -0.5f, 0.5f, // 4
  0.5f, -0.5f, 0.5f, // 5
  0.5f, 0.5f, 0.5f, // 6
  -0.5f, 0.5f, 0.5f, // 7
};
unsigned int indices[] = {
  0, 1, 2,
  2, 3, 0,
  4, 5, 6,
  6, 7, 4,
  7, 3, 0,
  0, 4, 7,
  6, 2, 1,
  1, 5, 6,
  0, 1, 5,
  5, 4, 0,
  3, 2, 6,
  6, 7, 3
};

Previously, we had to store $36\cdot 3\cdot 4=432$ bytes of data (a float is 4 bytes on most systems); now, we only need a storage area of $8\cdot 3\cdot 4+12\cdot 3\cdot 4=240$ bytes, which is a $44\%$ reduction. If a byte or short is sufficient to store the indices, and vertex reuse is high in the mesh, the memory savings can be even greater. However, GPUs today have huge amounts of memory, which makes it pointless to chase small optimizations like this. Now, let’s see how we would use the index buffer to draw a cube.

GLuint ebo;
glGenBuffers(1, &ebo);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);

In the render loop, we simply bind the VAO and call glDrawElements, specifying the draw mode, index count, index type, and either a byte offset into the bound EBO or a pointer to index data if no EBO is bound.

glUseProgram(shaderProgram);
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);

MVP

To obtain the final position of a vertex in clip space, we apply the model, view, and projection (MVP) transforms in succession ($PVM\vec{v}$). Our vertex shader now has three uniform mat4 inputs, each of which must be uploaded to the GPU separately. The model matrix should also encode parent transformations if there is a hierarchy.

#version 330 core
layout(location = 0) in vec3 i_pos;
uniform mat4 model;
uniform mat4 view;
uniform mat4 projection;
void main() {
  gl_Position = projection * view * model * vec4(i_pos, 1.0);
}
auto model = glm::mat4(1.0f);
model = glm::rotate(model, glm::radians(-55.0f), glm::vec3(1.0f, 0.0f, 0.0f));
auto view = glm::mat4(1.0f);
view = glm::translate(view, glm::vec3(0.0f, 0.0f, -3.0f));
glm::mat4 projection;
projection = glm::perspective(glm::radians(45.0f), 800.0f / 600.0f, 0.1f, 100.0f);
// ...
glUseProgram(shaderProgram);
auto modelLoc = glGetUniformLocation(shaderProgram, "model");
glUniformMatrix4fv(modelLoc, 1, GL_FALSE, glm::value_ptr(model));
// ...

Debugging and Logging

OpenGL does not display its debug messages on its own; our application has to receive and present them. After enabling debug output with glEnable(GL_DEBUG_OUTPUT) (it is on by default only in a debug context), we can receive these messages by registering a callback function via glDebugMessageCallback and process them within this function. The most common way to handle debug messaging is by using a logger like spdlog to either store the messages or directly output them to a chosen target (sink) as formatted text.

git submodule add https://github.com/gabime/spdlog /external/spdlog
add_subdirectory("${CMAKE_CURRENT_SOURCE_DIR}/external/spdlog")
target_link_libraries("${PROJECT_NAME}" PUBLIC spdlog)

If you use the default logger, which is created implicitly when a call like spdlog::info is made, it’s a colorized console logger that outputs to stdout, which is good enough for our purposes.

#include <spdlog/spdlog.h>
void DebugMessageCallback(unsigned int, unsigned int, unsigned int, unsigned int, int, const char*, const void*);
int main() {
  // ...
  glEnable(GL_DEBUG_OUTPUT); // debug messages are only generated after this capability is enabled
  glDebugMessageCallback(DebugMessageCallback, nullptr);
  // ...
}
void DebugMessageCallback(unsigned int source, unsigned int type, unsigned int id, unsigned int severity, int length, const char* message, const void* userParam) {
  std::string sourceStr, typeStr;
  switch (source) {
  case GL_DEBUG_SOURCE_API:
    sourceStr = "API";
    break;
  case GL_DEBUG_SOURCE_WINDOW_SYSTEM:
    sourceStr = "Window System";
    break;
  case GL_DEBUG_SOURCE_SHADER_COMPILER:
    sourceStr = "Shader Compiler";
    break;
  case GL_DEBUG_SOURCE_THIRD_PARTY:
    sourceStr = "Third Party";
    break;
  case GL_DEBUG_SOURCE_APPLICATION:
    sourceStr = "Application";
    break;
  case GL_DEBUG_SOURCE_OTHER:
    sourceStr = "Other";
    break;
  }
  switch (type) {
  case GL_DEBUG_TYPE_ERROR:
    typeStr = "Error";
    break;
  case GL_DEBUG_TYPE_DEPRECATED_BEHAVIOR:
    typeStr = "Deprecated Behavior";
    break;
  case GL_DEBUG_TYPE_UNDEFINED_BEHAVIOR:
    typeStr = "Undefined Behavior";
    break;
  case GL_DEBUG_TYPE_PORTABILITY:
    typeStr = "Portability";
    break;
  case GL_DEBUG_TYPE_PERFORMANCE:
    typeStr = "Performance";
    break;
  case GL_DEBUG_TYPE_MARKER:
    typeStr = "Marker";
    break;
  case GL_DEBUG_TYPE_PUSH_GROUP:
    typeStr = "Push Group";
    break;
  case GL_DEBUG_TYPE_POP_GROUP:
    typeStr = "Pop Group";
    break;
  case GL_DEBUG_TYPE_OTHER:
    typeStr = "Other";
    break;
  }
  switch (severity) {
  case GL_DEBUG_SEVERITY_HIGH:
    spdlog::error("OpenGL {} {}: {}", sourceStr, typeStr, message);
    break;
  case GL_DEBUG_SEVERITY_MEDIUM:
    spdlog::warn("OpenGL {} {}: {}", sourceStr, typeStr, message);
    break;
  case GL_DEBUG_SEVERITY_LOW:
    spdlog::info("OpenGL {} {}: {}", sourceStr, typeStr, message);
    break;
  default:
    spdlog::debug("OpenGL {} {}: {}", sourceStr, typeStr, message);
  }
}

These debug messages may help you identify common issues such as buffer-target mismatches; however, there are more nuanced issues that do not necessarily generate error messages, but can only be identified visually while the application is running. Tools like RenderDoc allow you to capture frames, step through every draw call, view GPU resources (e.g., textures), inspect pipeline stages, and visually debug meshes and vertex attributes.


To be continued…

Making a Game Engine


I’m writing this to reflect on my progress so far in making my own game engine, Kuki, that is named after my eldest cat. My journey began by following the tutorials at learnopengl.com to learn graphics programming. After some time, I realized that I was just copy-pasting code and running it locally. This is often the case with tutorials—without a clear goal, it’s easy to lose focus and interest. So, I wanted to do something meaningful with the stuff I was learning. The choice was obvious: creating a game engine.

The first thing I needed to do was to decide on an architecture. Luckily, there aren’t many options when it comes to game engine architecture. The most well-known model is the Game Object–Component model, in which data and behavior are tightly coupled. Alternatively, there is the Entity-Component-System (ECS) model, which emphasizes separating data from behavior. In ECS, component data is stored in contiguous memory arrays, and systems can efficiently process many components in a cache-friendly manner. ECS is easy to scale and suitable for performance-critical applications. There is a shift toward ECS in modern game engines, and there are plenty of open-source examples. I found the ECS architecture easy to understand and more straightforward to implement because of the clear separation between data and control paths. So, ECS was the way to go.

I had very little knowledge of C++ when I started the OpenGL tutorials. Even though I had a couple of years’ experience with C#—a language syntactically similar to C++—I struggled in the beginning because of the differences in memory management. When I designed a Transform component, I included a raw pointer to the parent. When it came to resolving world-space positions of the entities in the scene, I had invalid memory access exceptions. I knew that std::vector, which was the underlying data structure for storing components of the same type, would reallocate when new elements were added to it in order to preserve the contiguous nature of the storage. However, I didn’t account for this behavior until it resulted in an error. I had to switch to using parent entity IDs instead of raw pointers to resolve the issue. I avoided using smart pointers because they added overhead, disrupted cache efficiency, and, more importantly, were unnecessary for my design.

Since my main focus was the visual aspects of the engine, I wanted to define the components and systems directly related to rendering first. In Unity, three main components impact how an object is rendered: Transform, MeshFilter, and MeshRenderer. These components define where, what, and how to render, respectively. MeshFilter is a wrapper around the Mesh component, which stores the OpenGL buffer IDs that define a mesh, along with some metadata. MeshRenderer comprises shader and texture IDs, color values, and properties such as roughness or metalness that the shader may require. Components are all simple structs with fixed sizes for efficient storage. I have defined a generic class named ComponentManager<T> to manage component addition, removal, etc. This class stores component instances in a dynamic array and keeps mappings between entity IDs and array indices for efficient look-ups, as sketched below.
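
Here is a minimal sketch of that storage idea (not the engine’s actual code): components live in a contiguous std::vector, two maps translate between entity IDs and array indices, and removal uses swap-and-pop to keep the array dense.

#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

template <typename T>
class ComponentManager {
public:
  void Add(unsigned int entity, const T& component) {
    entityToIndex[entity] = components.size();
    indexToEntity[components.size()] = entity;
    components.push_back(component);
  }
  T* Get(unsigned int entity) {
    auto it = entityToIndex.find(entity);
    return it == entityToIndex.end() ? nullptr : &components[it->second];
  }
  void Remove(unsigned int entity) {
    // swap the removed element with the last one so the array stays contiguous
    auto index = entityToIndex.at(entity);
    auto lastEntity = indexToEntity.at(components.size() - 1);
    std::swap(components[index], components.back());
    components.pop_back();
    entityToIndex[lastEntity] = index;
    indexToEntity[index] = lastEntity;
    entityToIndex.erase(entity);
    indexToEntity.erase(components.size());
  }
private:
  std::vector<T> components;  // contiguous, cache-friendly storage
  std::unordered_map<unsigned int, std::size_t> entityToIndex;
  std::unordered_map<std::size_t, unsigned int> indexToEntity;
};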

In most ECS implementations, entities are simply IDs, and there can be additional metadata such as names or tags associated with those IDs. In my game engine, I have a class called EntityManager to manage entity creation, deletion, renaming, etc. This class creates ComponentManager<T> instances on demand when methods like AddComponent<T> are called for a valid entity. Operations associated with a component type are forwarded to the corresponding manager instance. To store the ComponentManager<T>* instances, I use std::type_index as the key.

template <typename T>
ComponentManager<T>* EntityManager::GetManager() {
  static_assert(std::is_base_of<IComponent, T>::value, "T must inherit from IComponent.");
  auto type = std::type_index(typeid(T));
  auto it = typeToManager.find(type);
  if (it == typeToManager.end()) {
    typeToManager.emplace(type, new ComponentManager<T>());
    nameToType.emplace(ComponentTraits<T>::GetName(), type);
    idToType.emplace(ComponentTraits<T>::GetId(), type);
    typeToMask.emplace(type, ComponentTraits<T>::GetMask());
  }
  return static_cast<ComponentManager<T>*>(typeToManager[type]);
}

EntityManager keeps a counter of type unsigned int to assign an ID to the next entity. At the moment, I’m not concerned about the limits or reuse of these values. I also keep a map between IDs and entity names. To prevent duplicates, I’ve implemented a Trie data structure to keep track of the names. A reference to an std::string must be passed to the Create method for each new entity. If the name already exists, a numeric suffix is appended to the requested name. This Trie is updated after entity deletion or rename operations to enable name reuse.

In addition to these, EntityManager defines methods that facilitate iterating over entities. These helper methods take a callable and execute it on entities that satisfy some specified constraints. One such method, shown below, checks whether an entity possesses the required components and, if so, calls the given function.

template <typename... T, typename F>
void EntityManager::ForEach(F func) {
  for (const auto id : ids)
    if (HasComponents<T...>(id)) {
      auto components = GetComponents<T...>(id);
      std::apply([&](T*... args) { func(id, args...); }, components);
    }
}
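
For instance, a system could visit all entities that carry a particular set of components like this (a hypothetical usage; the lambda parameters follow the component list):

// Hypothetical usage: visit every entity that has both a Transform and a MeshFilter.
entityManager.ForEach<Transform, MeshFilter>([](unsigned int id, Transform* transform, MeshFilter* filter) {
  // e.g., upload transform data as the model matrix, then draw filter->mesh
});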

Entities usually live in a scene; therefore, each scene in my engine has its own EntityManager instance. I also defined a method in my Scene class that returns a pointer to the active Camera component (in case the scene contains more than one). Most games require multiple scenes and the ability to switch between them, which I implemented in my SceneManager class. Together with the InputManager and other systems, these components form an Application that serves as the base for any program utilizing my engine’s functionality.

The first application that I built by extending this Application class was obviously an editor. When working with an engine, you quickly realize that even a simple change such as moving an entity requires you to re-compile the entire application. An editor provides an interface to manipulate object properties at runtime. Like most open source engines, I decided to use Dear ImGui to create a UI for my editor. It’s an immediate mode GUI implementation that is quite easy to use. However, I still didn’t want to deal with the UI code, so I let AI generate it. And, as long as it can be maintained and updated by AI, I don’t need to know the details of the implementation. Even though I had to manually edit the UI code a few times, AI is doing a fantastic job today.

To display and manipulate component properties in the UI, I needed to define an interface called IComponent that every component struct had to extend. This interface lets the editor retrieve an array of Property instances and set new values as the values are edited through the UI. The PropertyType informs the UI how to display that particular property (e.g., as a color wheel).

struct KUKI_ENGINE_API Property {
  using PropertyValue = std::variant<int, float, bool, glm::vec3, glm::vec4...>; // some types are omitted
  std::string name{};
  PropertyValue value{};
  PropertyType type{PropertyType::Number};
  Property(const std::string& = "", const PropertyValue& = 0, PropertyType = PropertyType::Number);
};
struct IComponent {
  virtual ~IComponent() = default;
  virtual const std::string GetName() const = 0;
  virtual std::vector<Property> GetProperties() const = 0;
  virtual void SetProperty(Property) = 0;
};
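
As an illustration, a hypothetical component (not one of the engine’s actual components) could implement the interface like this:

#include <string>
#include <variant>
#include <vector>

struct Health : IComponent {
  float current{100.0f};
  float max{100.0f};
  const std::string GetName() const override { return "Health"; }
  std::vector<Property> GetProperties() const override {
    return {Property("Current", current), Property("Max", max)};
  }
  void SetProperty(Property property) override {
    if (property.name == "Current")
      current = std::get<float>(property.value);
    else if (property.name == "Max")
      max = std::get<float>(property.value);
  }
};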

The KUKI_ENGINE_API is a macro that is needed to export symbols when the engine is built as a shared library. I initially made the engine a shared library, then later turned everything static so that only an executable file is created. When working with shared libraries, I had to deal with some issues. One of them was the scope of globals in these libraries. For context, I’m using CMake as the build tool. The engine is a target linked against statically built C libraries such as GLFW and GLAD, while the editor is another target linked against the engine’s shared library. When the editor program is executed, it becomes a process, and all the required shared libraries (DLLs on Windows) are loaded into the process memory. A library like GLAD defines many functions in its global scope. You may think that these globals are shared between the engine and the editor, but that’s not the case. When the Application class (in the engine library) initializes an OpenGL context in its Init method, the editor’s ImGui initialization code later attempts to access OpenGL functions but finds a different copy of the globals, resulting in an error because no valid context exists. The solution I found was to also build GLAD and GLFW as shared libraries, which guarantees that there is exactly one copy of their globals in the memory. I don’t know how this would work out in a multi-threaded scenario though.

The first system I defined was the RenderingSystem. Initially, I was using the default framebuffer for rendering since that’s how it’s done in the tutorial. But, when I created the editor, I needed to display the scene within an ImGui window. So, I created a custom framebuffer, which also enabled me to apply post-processing in a separate pass. My Shader class looks more or less like the one in the tutorial. The LitShader class, which extends the Shader class, defines additional methods to set material and light properties. Moreover, I implemented instanced rendering by automatically grouping objects by shared meshes to reduce draw calls. For now, I’m using OpenGL buffer IDs to determine if a mesh is referenced by multiple entities in the scene; however, more advanced methods are needed to determine if two meshes with different IDs are nearly identical. Although I allow per-instance properties such as colors, I do not account for texture variations; if two copies of the same mesh reference different textures, the last one processed will determine which texture is used. This means I might need to issue a new draw call for each variation.

void RenderingSystem::DrawScene() {
  std::unordered_map<unsigned int, std::vector<unsigned int>> vaoToEntities;
  std::unordered_map<unsigned int, Mesh> vaoToMesh;
  app.ForEachEntity<MeshFilter>([this, &vaoToMesh, &vaoToEntities](unsigned int id, MeshFilter* filter) {
    auto vao = filter->mesh.vertexArray;
    vaoToMesh[vao] = filter->mesh;
    vaoToEntities[vao].push_back(id);
  });
  for (const auto& [vao, entities] : vaoToEntities)
    DrawEntitiesInstanced(&vaoToMesh[vao], entities);
}

An important part of a game engine is the ability to work with assets, because basic shapes and colors quickly become monotonous. However, I needed specialized classes to load and store assets. For the storage part, I noticed that the task was quite similar to managing entities and their components. An asset needed to be easily convertible to an entity once it was loaded into a scene. So, an AssetManager would simply be an instance of EntityManager, and instantiating an asset would mean copying its components from one manager to another. For the loading part, I struggled a bit while trying to load models using Assimp. First, I was unable to load textures from FBX models; when I switched to glTF, the textures loaded correctly. Second, the loaded model parts had incorrect rotations because I converted quaternion values to Euler angles before storing them in the Transform component. Since Euler angles are susceptible to gimbal lock, switching the storage to quaternions resolved the issue. Finally, load operations were stalling the main thread, and I resolved this by asynchronously loading the textures. However, OpenGL buffers must be created on the main thread, as the OpenGL context is only valid there.
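
For reference, a sketch of the quaternion-based Transform storage could look like this (the field and method names are assumptions, not the engine’s actual definitions); Euler angles are only derived when needed, e.g., for the editor UI.

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/quaternion.hpp>

struct Transform {
  glm::vec3 position{0.0f};
  glm::quat rotation{1.0f, 0.0f, 0.0f, 0.0f};  // identity quaternion (w, x, y, z)
  glm::vec3 scale{1.0f};
  glm::mat4 LocalMatrix() const {
    // compose translation * rotation * scale without ever touching Euler angles
    return glm::translate(glm::mat4(1.0f), position) * glm::mat4_cast(rotation) * glm::scale(glm::mat4(1.0f), scale);
  }
  glm::vec3 EulerDegrees() const { return glm::degrees(glm::eulerAngles(rotation)); }
};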

The tutorial begins with the Blinn-Phong model and then introduces the PBR pipeline. Transitioning to PBR, however, was challenging. I had to load an environment map, generate both an irradiance map and a prefilter map, and implement instanced rendering. Achieving correct functionality required significant effort—including extensive debugging with RenderDoc—but ultimately, the results were stunning.

Features that are not yet fully functional include frustum culling and a scripting system. I’m also dissatisfied with the frame rates in simple scenes. CPU profiling in Visual Studio revealed significant performance improvements when I added dirty flags to Transform components. This change ensured that the entity hierarchy is traversed only when a change necessitates recalculating the world transform. Nonetheless, there is still room for improvement, and I am actively working on further enhancements.

Pathfinding and AI in Unity


Following my experiments with procedural map generation, I wanted to see some action on the maps I generated. So, I decided to add agents to the scene to observe how they navigate through the map. And, in order to make it look more interesting, I wanted the agents to behave intelligently and perform certain tasks.

For navigation, I decided not to use Unity’s AI navigation package because I wanted to implement it myself. I specifically wanted to learn about the A* pathfinding algorithm and integrate it into my project. I found this amazing video series by Sebastian Lague to learn how to implement the algorithm. Since my terrain was created from point samples generated by Poisson Disk Sampling, it wasn’t a grid, so I needed to adapt the code in the tutorial to arbitrary graphs. The transformation seemed straightforward: the most noticeable difference between a grid, which is a special kind of graph, and the graph that represented my procedurally generated terrain was that the nodes in my graph can have an arbitrary number of neighbors. However, it wasn’t going to be that simple, because the game’s poor performance under many simultaneous pathfinding requests would eventually force me to parallelize the work.

Running pathfinding tasks in parallel implied that I had to use Unity’s job system to enable multithreaded code execution. This meant that I had to convert my A* pathfinding implementation to a struct that implemented the IJob interface. In addition to this, I had to get rid of all the managed (≈ garbage collected) objects that the job used and switch to thread-safe types because the Burst compiler doesn’t support managed types.

The first change I made was to convert the managed node class that represented the vertices of my graph to a struct. Since the terrain is generated at the start of the game and it remains the same throughout runtime, the node positions could be made read-only. For pathfinding, I only needed to use the position on the xz-plane since the point samples were initially contained within a 2D disk on this plane and later assigned an elevation (y) value using Perlin Noise. I also needed the 3D positions of the nodes for placing objects at these locations. So, I decided to declare three fields (x, y and z) and define two properties (Pos2 and Pos3) that would return the 2D and 3D positions, respectively. In addition to the coordinates and positions, I also added an index field to facilitate the lookup of nodes. The GraphNode struct ended up looking like this:

public readonly struct GraphNode : IEquatable<GraphNode> {
  public readonly int index;
  public readonly float x;
  public readonly float y;
  public readonly float z;
  public readonly Vector2 Pos2 => new(x, z);
  public readonly Vector3 Pos3 => new(x, y, z);
  public readonly bool Equals (GraphNode other) => index == other.index;
  // ...
}

The A* pathfinding algorithm tracks three values per node: g is the cost from the start node to the current node, h is the heuristic cost from the current node to the end node, and f is the sum of g and h. It begins by exploring the neighbors of the start node and storing them in the open list, sorted by their f values in ascending order (lowest f cost first). Once all of a node’s neighbors have been examined, the node is moved to the closed list. Then, the next node is taken from the open list, and this repeats until the end node is added to the closed list. If the algorithm discovers a shorter path to any node in the open list during its execution, it updates the g cost of that node, which results in an updated f cost. Hence, the open list should be able to efficiently reorder the nodes in the case of an update, which I will discuss later. To be able to trace the path backwards after the end node has been reached, each node must point to the neighbor through which it was reached with the lowest g cost; hence, another field (named parent) is required to store a reference to this neighboring node. Remember that this reference cannot be a pointer to a class instance since we’re not allowed to use managed types.

You may be wondering why I didn’t include the cost fields in the GraphNode struct. Notice that these costs are different for every pathfinding request that has a different start and end point, which means every job needs its own copy of the graph. Remember that the terrain, hence the node positions, is the same for each job. Storing both the positions and the costs in the same struct would result in unnecessary duplication of the coordinate values, which do not change. So, it made more sense to define a separate struct to store the costs and other pathfinding-related data that differ from job to job. The nodes that store the positions could be shared between jobs (more on that later). With the addition of a couple of other implementation-specific fields, the PathNode struct turned out like this:

public struct PathNode : IEquatable<PathNode>, IComparable<PathNode> {
  public readonly int index;
  public int parent;
  public bool isExplored;
  public bool isObstacle;
  public float g;
  public float h;
  public readonly float F => g + h;
  public readonly bool Equals (PathNode other) => index == other.index;
  public readonly int CompareTo (PathNode other) => other.F.CompareTo(value: F); // reverse the order so that lower F means higher priority
  // ...
}

The index value of a PathNode must be equal to the index of the corresponding GraphNode for convenience. The parent field is an integer holding the index of the neighboring node that a candidate path passes through. Even though I wanted to use a set to store the explored nodes (or indices), I opted to add a field called isExplored to indicate whether a node has been explored, which I will explain later. I also added a field called isObstacle to indicate whether the node is an obstacle, because the map could contain trees. But the map was not supposed to change, and neither were the tree positions; so why not include this field in the GraphNode struct? Because tree positions are determined at a later stage, after the nodes are constructed, and I thought I could add dynamic obstacles to the scene in the future. For the float type, the expression a.CompareTo(b) returns a value greater than 0 if a is greater than b, less than 0 if a is smaller, and 0 if they are equal. In my open list implementation, I sort items by priority in descending order (highest priority at the top). So, I implemented the CompareTo method so that it returns a value greater than 0 when a has a lower f cost (higher priority) than b.

Now, it’s time to clear up the things I left unexplained in the preceding paragraphs. They are related either to the native container types that come with the Collections package or to custom containers I created myself. These containers are thread-safe and can be used for data that is shared between a worker thread and the main thread.

Recall that the graph, which represents the terrain, is constant and can be shared between threads. NativeArray is the container I used to store the GraphNode instances that constitute the terrain. This same array (graphNodes) is passed to the constructor of each PathfindingJob instance that is created. For this to work properly, and to receive appropriate warnings/errors when a job attempts to write to this container, it is required to add the ReadOnly attribute before the field declaration in the job definition that references this graph. The array itself is created by the main thread in a MonoBehaviour instance, which is also responsible for disposing of it (by calling its Dispose method) once it’s no longer needed, to prevent memory leaks. This array is created using the Persistent allocator since its lifetime exceeds the four-frame limit of the TempJob allocator. The edges of this graph are stored in a NativeParallelMultiHashMap of type <int, int>, where the key is the index of a node and the values associated with that key are the indices of the node’s neighbors. This structure (graphEdges) is also created and handled the same way graphNodes is.

There can also be containers that are local to the job struct. These are allocated and deallocated by the job itself. Remember that the open list had to efficiently sort and store the PathNode instances. The pathfinding tutorial by Sebastian Lague also touches on this topic, and the data structure used for this purpose is a heap, specifically a max heap, since higher-priority nodes are at the top of the heap. However, the implementation in the tutorial could not be used in a job because jobs can only work with native containers. So, I had to convert that implementation to a NativeHeap by studying the existing examples, and I was successful after some trial and error. The constructor of this custom native container looks like the following.

[NativeContainer]
[NativeContainerSupportsDeallocateOnJobCompletion]
public unsafe struct NativeHeap<T> : IDisposable where T : unmanaged, IEquatable<T>, IComparable<T> {
  public NativeHeap (int initialCapacity, Allocator allocator) {
#if ENABLE_UNITY_COLLECTIONS_CHECKS
    m_Safety = AtomicSafetyHandle.Create();
    AtomicSafetyHandle.SetStaticSafetyId(handle: ref m_Safety, staticSafetyId: s_staticSafetyId);
    AtomicSafetyHandle.SetBumpSecondaryVersionOnScheduleWrite(handle: m_Safety, value: true);
#endif
    capacity = initialCapacity;
    if (capacity <= 0)
      capacity = 16;
    m_Length = 0;
    m_AllocatorLabel = allocator;
    m_Buffer = (T*)UnsafeUtility.MallocTracked(size: capacity * UnsafeUtility.SizeOf<T>(), alignment: UnsafeUtility.AlignOf<T>(), allocator, callstacksToSkip: 1);
  }
  // ...
}

It is possible to add the DeallocateOnJobCompletion attribute before the declaration of the field of type NativeHeap in the job struct so that the job system automatically deallocates the container when the job is finished. For this to work, the custom container implementation must have the NativeContainerSupportsDeallocateOnJobCompletion attribute and comply with the requirements specified in the description page of this attribute. Recall that I had to add a field called isExplored to the PathNode struct because I could not use a NativeHashSet for the closed list. The reason is that this native container does not support the DeallocateOnJobCompletion attribute, and I did not want to manually dispose of it inside the Execute method of the job to keep it clean of code that is not directly related to the pathfinding algorithm itself. The fields of PathfindingJob are as follows:

public struct PathfindingJob : IJob {
  [DeallocateOnJobCompletion] NativeHeap<PathNode> queue;
  [ReadOnly] readonly NativeArray<GraphNode> graphNodes;
  [ReadOnly] readonly NativeParallelMultiHashMap<int, int> graphEdges;
  NativeArray<PathNode> pathNodes;
  NativeList<GraphNode> path;
  readonly PathEndpoints endpoints;
  // ...
}

The PathEndpoints struct is a pair of integers that specify the start and end node indices. The path is a NativeList created outside the job and filled in during its execution. It is not created in the job’s local scope so it can be accessed later, after calling Complete on the JobHandle. The queue is the open list, which is a priority queue implemented as a heap.

Before diving into how the pathfinding requests are handled, it is necessary to describe how they are created. These requests come from agents that display intelligent behaviour such as following other agents and destroying them, which brings us to artificial intelligence, but not in the form of a neural network. I wanted something simple since the behaviour I wanted to observe consisted of a few basic actions. I found behaviour trees more suitable for my case than state machines or goal-oriented action planning. I found this nice tutorial by Mina Pêcheux to learn the basics and see how it’s implemented. My implementation looked more or less the same: I defined a LeafNode class and two composite node classes (SelectorNode and SequenceNode) that extend this class. To facilitate passing data between the nodes of a behaviour tree, I implemented a basic Blackboard class and let all the nodes refer to a common instance. This class defines two dictionaries, one for the information that is being stored and one for the observers that will be notified when certain information is updated. I limited the number of observers for each key to one, which was enough for my purposes.

public sealed class Blackboard {
  readonly Dictionary<string, object> data = new();
  readonly Dictionary<string, Action> subs = new();
  public void Subscribe (string key, Action callback) {
    if (!subs.ContainsKey(key))
      subs[key] = null;
    subs[key] += callback;
  }
  public void Unsubscribe (string key, Action callback) {
    if (subs.ContainsKey(key))
      subs[key] -= callback;
  }
  // ...
}

The root node of an agent’s behaviour tree is a SequenceNode which has four children that define all the actions that an agent can perform. These actions are detecting a target, finding a path to the target, moving towards the target by following the path, and attacking the target. They are evaluated in sequence, and if any one of them does not succeed, the remaining nodes are skipped. These behaviours are all defined in separate classes that extend the LeafNode class. This tree is evaluated in the Update method of the agent’s MonoBehaviour class. The Start and Update methods of this class are given below.

public sealed class AgentBehaviour : MonoBehaviour {
  // ...
  void Start () {
    var detectTarget = new DetectTarget(blackboard, transform, getClosestEnemy: playerEvents.GetClosestEnemy, scanInterval: gameSettings.targetScanInterval);
    var findPath = new FindPath(blackboard, transform, requestPath: playerEvents.RequestPath);
    var followPath = new FollowPath(blackboard, transform, moveSpeed: gameSettings.unitMoveSpeed, attackRange: gameSettings.unitAttackRange);
    var attackTarget = new AttackTarget(blackboard, transform, attackRange: gameSettings.unitAttackRange, initiateAttack: playerEvents.InitiateAttack);
    behaviourTree = new SequenceNode(blackboard, detectTarget, findPath, followPath, attackTarget);
  }
  void Update () {
    behaviourTree.Evaluate();
  }
}

Here, gameSettings and playerEvents are ScriptableObject instances that store the game parameters and gameplay related actions, respectively. Agents need to trigger certain actions to perform some of their tasks that require collaboration from other scripts. For example, they do not know about each other by design; hence, they have to trigger an Action at a fixed interval (unless there is already a valid target) in order to get the Transform component of the enemy that is closest to them. As a result, these delegates are passed to any tree node that needs to trigger them. One such action is RequestPath of type Action<Vector2, Vector2, Action<GraphNode[]>> that is defined in PlayerEvents. Here is the FindPath node where these requests are made:

public sealed class FindPath : LeafNode {
  // ...
  public FindPath (Blackboard blackboard, Transform transform, Action<Vector2, Vector2, Action<GraphNode[]>> requestPath) : base(blackboard) {
    // ...
    blackboard.Subscribe(key: "Destination", callback: OnDestinationUpdated);
  }
  ~FindPath () {
    blackboard.Unsubscribe(key: "Destination", callback: OnDestinationUpdated);
  }
  void OnDestinationUpdated () {
    pathFound = false;
    findPath = true;
    destination = blackboard.GetData<Vector2>(key: "Destination");
    var pos2 = new Vector2(x: transform.position.x, y: transform.position.z);
    requestPath?.Invoke(pos2, destination, OnPathFound);
  }
  void OnPathFound (GraphNode[] path) {
    findPath = false;
    pathFound = true;
    blackboard.SetData(key: "Path", value: path);
  }
  public override NodeState Evaluate () {
    if (pathFound)
      return NodeState.Success;
    if (findPath)
      return NodeState.Busy;
    return NodeState.Failure;
  }
}

Now, let’s see how these requests are handled. The RequestPath invocation requires three arguments: the agent position, the target position, and a delegate to the method that will process the returned path. At the receiving end, there is a queue that stores these requests. This is crucial because attempting to process all the requests as they arrive would cause poor performance when many of them arrive within a single frame. Instead, they are processed in batches such that in each frame at most n requests are removed from the queue and n pathfinding jobs are run. The following code schedules jobs that will be executed in parallel by worker threads on multiple CPU cores. A copy of pathNodes is created for each job since it contains information about the obstacles on the terrain. For the special case of a batch size of one, or a queue with only one item (not included in the code below), the Temp allocator is used instead of TempJob for faster allocation, and the job is executed directly (not scheduled).

var numJobs = Mathf.Min(a: gameSettings.pathfindBatchSize, b: pathRequestQueue.Count);
var jobData = new List<(JobHandle handle, NativeArray<PathNode> pathNodes, NativeList<GraphNode> path, Action<GraphNode[]> handler)>();
for (var i = 0; i < numJobs; i++) {
  var (endpoints, handler) = pathRequestQueue.Dequeue();
  var path = new NativeList<GraphNode>(allocator: Allocator.TempJob);
  var pathNodes = new NativeArray<PathNode>(length: this.pathNodes.Length, allocator: Allocator.TempJob);
  pathNodes.CopyFrom(array: this.pathNodes);
  var job = new PathfindingJob(graphNodes, graphEdges, pathNodes, endpoints, path, allocator: Allocator.TempJob);
  var handle = job.Schedule();
  jobData.Add(item: (handle, pathNodes, path, handler));
}
for (var i = 0; i < jobData.Count; i++) {
  var (handle, pathNodes, path, handler) = jobData[index: i];
  handle.Complete();
  pathNodes.Dispose();
  var pathArr = path.AsArray().Reverse().ToArray();
  path.Dispose();
  handler?.Invoke(obj: pathArr);
}

When the path is returned to the FindPath node that requested it, it is written to the blackboard, which notifies the FollowPath node. Then, the agent is moved towards the position of the next GraphNode in the path at a constant speed, and this is repeated until it reaches the position of the last node in the array. Index 0 in the path array is the start node, which corresponds to the agent’s current position, so the next index should be set to 1 for each new path. If the target is also moving, which is the case here, the path will become invalid as the target gets farther away. There is a simple check to determine whether the path the agent is currently following still leads to the enemy: calculate the distance from the final node in the path to the enemy and compare it to the distance from the agent’s current position to the enemy; if the final node is not closer to the enemy, request a new path. The Evaluate method of FollowPath is given below with some parts omitted; attackRange is used as the epsilon value for distance comparisons.

public override NodeState Evaluate () {
  if (path == null)
    return NodeState.Failure;
  var pos3 = transform.position;
  var pos2 = new Vector2(x: pos3.x, y: pos3.z);
  if (target != null) {
    // check if the target is getting away, request a new path if that's the case
  }
  if (nodeIndex > lastIndex)
    return NodeState.Success;
  moveVector = (path[nodeIndex].Pos3 - pos3).normalized;
  transform.Translate(translation: moveSpeed * Time.deltaTime * moveVector);
  if (Vector2.Distance(a: pos2, b: path[nodeIndex].Pos2) < attackRange)
    nodeIndex++;
  return NodeState.Busy;
}

Battles between agents are simulated by a single script that stores a dictionary whose keys are the Transform components of the agents in the scene and whose values are AgentData instances that contain agent stats such as health points or attack power. When an agent gets close enough to its target, it can initiate an attack by triggering the InitiateAttack action, which accepts the Transform components of the attacker and the defender, respectively. Unlike pathfinding requests, battles are not queued or executed in batches since they are not computationally heavy. Each battle is started via a StartCoroutine call and executed over multiple frames. The attacker gets to hit first, then they take turns. The expression that calculates the damage dealt by the attacker can become very complex as the number of parameters (stats, items, powerups, etc.) increases. If the HP of one side drops to zero, the other side wins. The Fight method that is run as a coroutine is given below.

IEnumerator Fight (TransformPair pair) {
  var attacker = agents[key: pair.A];
  var defender = agents[key: pair.B];
  while (attacker.healthPoints > 0 && defender.healthPoints > 0) {
    var damage = Mathf.RoundToInt(f: attacker.attackPoints * (1f / (defender.defensePoints + 1)) + 1);
    defender.healthPoints -= damage;
    (attacker, defender) = (defender, attacker);
    yield return null;
  }
  battles.Remove(item: pair);
  agents[key: pair.A] = attacker;
  agents[key: pair.B] = defender;
  var loser = attacker.healthPoints > 0 ? pair.B : pair.A;
  // ...
}

The pairings are stored in a HashSet called battles to avoid duplicates. However, representing the fighting pairs as a Tuple would result in (a, b) and (b, a) being treated as different battles, where a and b are Transform instances. To prevent this, I defined a new type called TransformPair that produces the same hash code for the same pair regardless of the order. Once the loser is decided, its entry is removed from the agents dictionary; then it can be destroyed or returned to its pool if object pooling is employed (e.g., if there is a respawn mechanic).

The agents in the scene are rendered via DrawMeshInstanced calls, similar to how I render the trees. I will skip this part since I explained the process in my previous post. To demonstrate the things I covered in this blog post, I spawned 48 agents at random positions and let them perform the actions defined by their behaviour trees. I rendered their active paths via Gizmos for better visualization. To make the scene prettier, I used tree models from the Low Poly Tree Pack package. The following is a recording of the legendary battles that took place on my procedurally generated map.

Agents battling it out until only one remains standing