Building a static site generator in Rust

Matthew J. Berger

This site is generated by bamboo, the static site generator I wrote in Rust and use to build matthewberger.dev/articles. Every page you read here starts as a markdown file with a TOML header on disk, gets turned into HTML by bamboo, and lands in a dist/ directory that gets uploaded to GitHub Pages. There is no server doing work at request time, no database, no framework. The deployed site is a few hundred plain files.

This post is about how that machine is shaped. The pieces, why they ended up that way, and what the awkward parts cost. I will walk through the code in the order data flows through it: read the config, walk the content tree, render markdown, expand shortcodes, run templates, drop files, post-process.

The whole codebase is in the matthewjberger/bamboo repo. Under ten thousand lines of Rust including tests, no unsafe, no proc macros, edition 2024. The library crate is on crates.io as bamboo-ssg and the CLI as bamboo-cli.

#What a static site generator actually is

A static site generator's input is a tree of markdown and configuration. Its output is a directory of HTML, CSS, and assets, laid out so a dumb file server can serve them at the right URLs. "Static" means the work happens at build time. Once dist/ is written, the site has no runtime dependencies. No PHP, no Node process, no database. You hand the directory to GitHub Pages or a CDN bucket and that is the whole deployment.

I wrote my own because the existing ones do not compose the way I want. Zola, Hugo, Jekyll, Eleventy, and mdBook are all fine, but each has a paint-yourself-into-a-corner moment for something I wanted to do (multi-collection sites with their own templates, fingerprinted assets and Sass without a Node toolchain, a search index whose shape I control, an embedded default theme that needs no separate clone step). bamboo started as a small tool to host one site and accreted features until it could host the four I care about.

#Shape: builder, renderer, CLI

bamboo splits cleanly into three pieces.

SiteBuilder reads a source directory and produces an in-memory Site. ThemeEngine consumes that Site and renders to disk. The CLI ties them together with clap for argument parsing and a notify watcher for live reload.

The split matters because it makes the library usable as a library. You can build the Site, mutate it (drop drafts, reorder posts, attach extra metadata), then hand it to a different renderer. The Site type is the IR:

pub struct Site {
    pub config: SiteConfig,
    pub home: Option<Page>,
    pub pages: Vec<Page>,
    pub posts: Vec<Post>,
    pub collections: HashMap<String, Collection>,
    pub data: HashMap<String, Value>,
    pub assets: Vec<Asset>,
}

Every field is pub. The Page/Post/Collection types all flatten a Content substruct in via #[serde(flatten)], so templates see the inner fields directly instead of having to reach through an intermediate field such as post.content.title. That flattening turns the struct nesting into a flat shape in the Tera context, which is what authors actually want when they write {{ post.title }}.

#Reading the content tree

The content walker is walkdir with a filter chain: keep files only, keep .md extension only, skip filenames starting with _ except _index.md, skip the posts/ directory (handled separately), skip any directory holding a _collection.toml. Sort the result lexically for deterministic output. Parse each file in parallel with rayon:

let parsed_pages: Vec<(Page, PathBuf, PathBuf)> = file_entries
    .par_iter()
    .map(|(path, relative)| {
        let page = self.parse_page(path, relative)?;
        Ok((page, path.clone(), relative.clone()))
    })
    .collect::<Result<Vec<_>>>()?;

For this site (a few dozen posts) the parallel parse is barely measurable, but it scales to hundreds of files without effort and the call site stays trivial.
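
The filter chain at the top of this section can be sketched as a pure predicate over relative paths. This is an illustration under assumptions, not the code in the repo: is_page_source and select_sources are hypothetical names, the "files only" check and the _collection.toml directory check are left out because they need filesystem access, and the parallel parse is not shown.

```rust
use std::path::{Path, PathBuf};

/// Decide whether a path (relative to content/) should be parsed as a page.
/// Hypothetical helper; the rules mirror the prose above.
pub fn is_page_source(relative: &Path) -> bool {
    let Some(name) = relative.file_name().and_then(|name| name.to_str()) else {
        return false;
    };
    // Keep markdown files only.
    if relative.extension().and_then(|ext| ext.to_str()) != Some("md") {
        return false;
    }
    // Skip underscore-prefixed files, except section indexes.
    if name.starts_with('_') && name != "_index.md" {
        return false;
    }
    // Posts are handled by a separate pass.
    !relative.starts_with("posts")
}

/// Filter the candidates and sort them so the build order is deterministic.
pub fn select_sources(mut candidates: Vec<PathBuf>) -> Vec<PathBuf> {
    candidates.retain(|path| is_page_source(path));
    candidates.sort();
    candidates
}
```

Keeping the predicate free of I/O makes it easy to unit test; the directory-level rules would sit in the walker itself.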

The same module finds collections (every top-level directory under content/ with a _collection.toml is one), data files (data/**/*.{toml,yaml,json}, nested directories become nested maps), and static assets (static/**/* copied verbatim).

#Frontmatter

A bamboo content file looks like this:

+++
title = "My Post"
date = "2024-01-15"
tags = ["rust"]
+++

Body markdown.

TOML between +++, or YAML between ---. The parser scans line-by-line for the closing delimiter, hands the slice to toml::from_str or serde_yml::from_str, and gets back a HashMap<String, serde_json::Value>:

pub struct Frontmatter {
    #[serde(flatten)]
    pub raw: HashMap<String, Value>,
}

The Value map preserves every field the author wrote, known or not. Templates can reach for arbitrary {{ post.frontmatter.whatever }} keys, which is the only way I have found to let authors invent new conventions without editing the schema in the SSG.
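
The line-by-line delimiter scan described above might look roughly like the sketch below. It only splits the header from the body and reports which format the delimiter implies; the actual deserialization with toml or serde_yml is left out. The names FrontmatterFormat and split_frontmatter are made up for this illustration.

```rust
/// Which parser the header should be handed to (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum FrontmatterFormat {
    Toml,
    Yaml,
}

/// Split a content file into (format, header, body).
/// Returns None when the file has no opening or no closing delimiter.
pub fn split_frontmatter(source: &str) -> Option<(FrontmatterFormat, &str, &str)> {
    let mut lines = source.split_inclusive('\n');
    let first = lines.next()?;
    let (format, delimiter) = match first.trim_end() {
        "+++" => (FrontmatterFormat::Toml, "+++"),
        "---" => (FrontmatterFormat::Yaml, "---"),
        _ => return None,
    };
    let header_start = first.len();
    let mut offset = header_start;
    // Scan for the closing delimiter, tracking the byte offset of each line.
    for line in lines {
        if line.trim_end() == delimiter {
            let header = &source[header_start..offset];
            let body = &source[offset + line.len()..];
            return Some((format, header, body));
        }
        offset += line.len();
    }
    None
}
```

Returning borrowed slices avoids copying the body, which is usually the bulk of the file.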

Dates can also live in the filename. A file named 2024-01-15-hello.md parses out 2024-01-15 as the date and hello as the slug. The function in parsing.rs returns Option so a short filename like about.md falls through harmlessly:

pub fn parse_date_from_filename(filename: &str) -> Option<(String, String)> {
    let name = filename.strip_suffix(".md").unwrap_or(filename);
    let date_part = name.get(..10)?;
    ...
}

The name.get(..10)? is deliberate. str::get returns None when name is shorter than 10 bytes (or when byte 10 does not fall on a character boundary), so the ? returns early instead of panicking the way name[..10] would on out-of-range indexing.
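
For illustration, here is one way the elided body could be completed. This is a guess at the shape, not the function in parsing.rs: parse_dated_filename is a hypothetical name, and the validation rules (digits and hyphens in fixed positions, a non-empty slug after the date) are assumptions of this sketch.

```rust
/// Split a filename like `2024-01-15-hello.md` into (date, slug).
/// Hypothetical stand-in for the real function; validation rules are assumed.
pub fn parse_dated_filename(filename: &str) -> Option<(String, String)> {
    let name = filename.strip_suffix(".md").unwrap_or(filename);
    // None if the name is shorter than 10 bytes or byte 10 splits a character.
    let date_part = name.get(..10)?;
    // Require the exact shape DDDD-DD-DD (digits with hyphens at 4 and 7).
    let shape_ok = date_part
        .bytes()
        .enumerate()
        .all(|(index, byte)| match index {
            4 | 7 => byte == b'-',
            _ => byte.is_ascii_digit(),
        });
    if !shape_ok {
        return None;
    }
    // The slug is whatever follows the hyphen after the date.
    let slug = name.get(10..)?.strip_prefix('-')?;
    if slug.is_empty() {
        return None;
    }
    Some((date_part.to_string(), slug.to_string()))
}
```

The sketch checks only the shape of the date, not whether it is a real calendar date; a stricter version would hand date_part to a date parser.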

#Markdown rendering

pulldown-cmark is the parser. It emits an event stream of Start(Tag::Heading), Text, Code, Start(Tag::CodeBlock), etc. bamboo does not use the convenience html::push_html on the entire stream because two things need custom handling: code blocks and headings.

Code blocks need syntax highlighting. The stream looks like Start(CodeBlock(Fenced(lang))) → many Text events → End(CodeBlock). bamboo accumulates the text into a buffer and hands it to syntect at the end event:

Event::End(TagEnd::CodeBlock) => {
    in_code_block = false;
    let highlighted = if let Some(ref lang) = code_block_lang {
        self.syntax_set
            .find_syntax_by_token(lang)
            .map(|syntax| {
                highlighted_html_for_string(
                    &code_block_content,
                    &self.syntax_set,
                    syntax,
                    theme,
                )
                .unwrap_or_else(|_| escape_html(&code_block_content))
            })
            ...

Headings need anchor links so URLs can point at sections. The renderer slugifies the heading text, dedupes against a HashSet<String> of already-emitted ids so two ## Setup headings produce setup and setup-1, and emits:

<h2 id="setup"><a class="anchor" href="#setup">#</a>Setup</h2>

It also pushes every heading into a Vec<TocEntry> so templates can render a table of contents without re-parsing. The TOC is just (level, id, title) triples.
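
The dedupe step is small enough to sketch in full. The slugify below is a simplified stand-in written for this example (bamboo's own slugify may treat punctuation and Unicode differently), and unique_heading_id is a hypothetical name.

```rust
use std::collections::HashSet;

/// Simplified slugifier: lowercase alphanumerics, everything else collapses
/// into single hyphens. Stand-in for the real one.
fn slugify(text: &str) -> String {
    let mut slug = String::new();
    for character in text.chars() {
        if character.is_alphanumeric() {
            slug.extend(character.to_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    slug.trim_end_matches('-').to_string()
}

/// Return an id that has not been emitted yet and record it in `seen`.
fn unique_heading_id(title: &str, seen: &mut HashSet<String>) -> String {
    let base = slugify(title);
    let mut candidate = base.clone();
    let mut suffix = 0;
    // `insert` returns false when the id is already taken.
    while !seen.insert(candidate.clone()) {
        suffix += 1;
        candidate = format!("{base}-{suffix}");
    }
    candidate
}
```

Calling unique_heading_id twice with "Setup" against the same set yields setup and then setup-1, matching the behavior described above.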

#Shortcodes

Plain markdown is fine for prose, awful for "embed a YouTube video" or "show a notice box." Hugo-style shortcodes fill that gap:

{{< youtube id="dQw4w9WgXcQ" >}}

Two flavors. The inline form has no body; the block form wraps markdown content between an opening and matching closing tag:

inline:  {{< name arg="value" >}}

block:   {{% name arg="value" %}}
         body markdown
         {{% /name %}}

The shortcode processor runs before pulldown-cmark, scans the text for tags, replaces each with its rendered Tera template, and hands the result forward. Block bodies are themselves rendered through markdown first, so:

{{% note type="info" %}}
This is **emphasized**.
{{% /note %}}

emits a note containing <strong>emphasized</strong>.

The processor tracks fenced code blocks (triple-backtick and triple-tilde) so a shortcode tag inside a code block stays literal. It does not track inline single-backtick code spans, so an opening shortcode delimiter quoted as inline code in prose still gets parsed as a real shortcode and triggers a hard error. I noticed this while drafting this post: the first build failed on a one-line mention of the delimiter in flowing text, and the fix was to move every shortcode example into a fenced block. The scanning is hand-rolled because the shortcode pass runs before pulldown-cmark, so I cannot lean on its code-fence detection. Handling inline backticks properly is still on the to-do list.
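
A stripped-down version of that fence tracking, written for illustration only, shows both the protection and the gap. shortcode_lines is a hypothetical function that reports which lines would be treated as containing a shortcode; it ignores the finer CommonMark fence rules (fence length, indentation limits) and does no actual parsing of the tag.

```rust
/// Return the zero-based indices of lines that sit outside fenced code blocks
/// and contain an opening shortcode delimiter. Hypothetical sketch.
pub fn shortcode_lines(text: &str) -> Vec<usize> {
    // Holds the fence marker while we are inside a fenced block.
    let mut open_fence: Option<&str> = None;
    let mut hits = Vec::new();
    for (index, line) in text.lines().enumerate() {
        let trimmed = line.trim_start();
        match open_fence {
            Some(marker) => {
                // Only the same marker closes the block.
                if trimmed.starts_with(marker) {
                    open_fence = None;
                }
            }
            None => {
                if trimmed.starts_with("```") {
                    open_fence = Some("```");
                } else if trimmed.starts_with("~~~") {
                    open_fence = Some("~~~");
                } else if line.contains("{{<") || line.contains("{{%") {
                    // Inline code spans are NOT tracked, so a delimiter quoted
                    // in backticks still counts as a hit.
                    hits.push(index);
                }
            }
        }
    }
    hits
}
```

Feeding it a document with one tag in prose, one inside a fence, and one inside an inline code span reports the first and the third, which is exactly the limitation described above.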

One special shortcode renders no template:

{{< ref "post-name.md" >}}

It resolves a source-path to the URL its content will deploy at. A registry is built during the initial content walk that maps source paths to final URLs, including permalink overrides. Internal links survive renames and custom permalinks through this indirection.
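
The indirection amounts to a lookup table filled during the content walk. The sketch below is an assumption-heavy illustration: RefRegistry and its methods are invented names, and the default URL scheme (strip .md, wrap in slashes) is a placeholder for whatever bamboo actually computes.

```rust
use std::collections::HashMap;

/// Maps source paths to the URLs they deploy at (hypothetical type).
#[derive(Default)]
pub struct RefRegistry {
    urls: HashMap<String, String>,
}

impl RefRegistry {
    /// Record a source file, honoring a permalink override when present.
    pub fn register(&mut self, source_path: &str, permalink: Option<&str>) {
        let url = match permalink {
            Some(custom) => custom.to_string(),
            None => default_url(source_path),
        };
        self.urls.insert(source_path.to_string(), url);
    }

    /// Look up the final URL; None means the ref target does not exist.
    pub fn resolve(&self, source_path: &str) -> Option<&str> {
        self.urls.get(source_path).map(String::as_str)
    }
}

/// Assumed default scheme: `post-name.md` deploys at `/post-name/`.
fn default_url(source_path: &str) -> String {
    let stem = source_path.strip_suffix(".md").unwrap_or(source_path);
    format!("/{stem}/")
}
```

Because links are written against source paths, changing a permalink only changes the registry entry, not every page that links to it.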

#Templates

Tera is the templating engine. Jinja2-style: inheritance, includes, filters, macros. Every page handed to the renderer gets a Tera context containing the full Site plus the page itself plus a few extras (pagination state, related posts, prev/next post for blog entries).

The built-in default theme is embedded into the binary with include_str!:

const DEFAULT_BASE_TEMPLATE: &str = include_str!("../themes/default/templates/base.html");
const DEFAULT_INDEX_TEMPLATE: &str = include_str!("../themes/default/templates/index.html");
const DEFAULT_PAGE_TEMPLATE: &str = include_str!("../themes/default/templates/page.html");
const DEFAULT_POST_TEMPLATE: &str = include_str!("../themes/default/templates/post.html");
...

A fresh Tera::default() gets each one stuffed in with add_raw_template. The upside is cargo install bamboo-cli produces a working SSG that needs no theme clone step. The downside is templates are not user-editable without dropping override files into the site's own templates/ directory. Site-level overrides are registered after the theme, so Tera's later-registration-wins semantics let them shadow the theme without any priority machinery.

Custom themes load from a directory pattern instead. The directory layout mirrors the embedded one: templates/, templates/partials/, templates/shortcodes/, and a static/ for theme assets.

#Custom filters

Four bamboo-specific Tera filters: slugify, reading_time, word_count, toc. They live in theme.rs::register_custom_filters and are plain closures:

tera.register_filter(
    "slugify",
    |value: &tera::Value, _args: &HashMap<String, tera::Value>| {
        let text = value.as_str().unwrap_or("");
        Ok(tera::Value::String(slugify(text)))
    },
);

The toc filter takes the heading list off a post or page and emits a nested <ul> indented by heading level. It is rendered inside templates with {{ post.toc | toc | safe }} (Tera's autoescape makes you opt in with safe for trusted HTML).
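
The nesting logic behind a filter like toc can be sketched with a stack of open heading levels. This is a standalone illustration, not the filter from theme.rs: the TocEntry fields follow the (level, id, title) triples mentioned earlier, render_toc is a hypothetical name, and titles are assumed to be already escaped.

```rust
/// One heading, as a (level, id, title) triple.
pub struct TocEntry {
    pub level: u8,
    pub id: String,
    pub title: String,
}

/// Render a flat heading list as nested <ul> elements.
pub fn render_toc(entries: &[TocEntry]) -> String {
    let mut html = String::new();
    // Levels of the lists that are currently open, outermost first.
    let mut open_levels: Vec<u8> = Vec::new();
    for entry in entries {
        // Close every list that is deeper than this heading.
        while let Some(&top) = open_levels.last() {
            if top > entry.level {
                html.push_str("</li></ul>");
                open_levels.pop();
            } else {
                break;
            }
        }
        match open_levels.last() {
            // Sibling at the same depth: close the previous item.
            Some(&top) if top == entry.level => html.push_str("</li>"),
            // Deeper heading (or first heading): open a new list.
            _ => {
                html.push_str("<ul>");
                open_levels.push(entry.level);
            }
        }
        html.push_str(&format!("<li><a href=\"#{}\">{}</a>", entry.id, entry.title));
    }
    // Close whatever is still open.
    while open_levels.pop().is_some() {
        html.push_str("</li></ul>");
    }
    html
}
```

The output stays well-formed even when heading levels skip or start deep, which matters because authors rarely keep a perfectly regular outline.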

#Parallel rendering

After parsing, the engine renders pages, posts, and collections in parallel with rayon's par_iter. Each task writes into its own slot in dist/, so nothing needs coordination:

site.posts.par_iter().try_for_each(|post| {
    self.render_post(site, post, prev_post, next_post, output_dir)
})?;

Templates are registered up front and never mutated afterward. The renderer captures the Tera instance by reference into each closure. Tera is Send + Sync, so the parallel pass works without locks.

#Asset pipeline

Once HTML lands on disk, an optional pass walks output_dir and runs four kinds of post-processing:

  1. Sass/SCSS compilation (grass, a pure-Rust Sass compiler)
  2. CSS minification (lightningcss)
  3. JS minification (minify-js)
  4. HTML minification (minify-html)

Plus optional content-hash fingerprinting: each .css and .js output gets renamed from style.css to style.a3f8c91d.css, and every HTML and XML file gets walked to rewrite references. Cache headers can be aggressive (Cache-Control: public, max-age=31536000, immutable) after that without breaking deployments.

The order is load-bearing. Sass → minify CSS/JS → fingerprint → minify HTML. Fingerprinting before HTML minification means the rewrite step can still find unminified href="/style.css" attributes. Fingerprinting after would require parsing minified HTML.
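
The rename-and-rewrite step can be sketched with the standard library alone. Two loud assumptions: this sketch uses std's DefaultHasher purely to stay dependency-free (a real pipeline would more likely use a cryptographic or otherwise stable content hash), and the rewrite is a plain string replace on quoted root-relative references. fingerprinted_name and rewrite_references are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Turn `style.css` into `style.<8 hex chars>.css` based on the contents.
/// DefaultHasher is a stand-in for a proper content hash.
pub fn fingerprinted_name(file_name: &str, contents: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    let digest = format!("{:016x}", hasher.finish());
    let short = &digest[..8];
    match file_name.rsplit_once('.') {
        Some((stem, extension)) => format!("{stem}.{short}.{extension}"),
        None => format!("{file_name}.{short}"),
    }
}

/// Rewrite quoted root-relative references to the fingerprinted name.
/// Relies on attributes still being quoted, i.e. HTML not yet minified.
pub fn rewrite_references(html: &str, original: &str, fingerprinted: &str) -> String {
    html.replace(
        &format!("\"/{original}\""),
        &format!("\"/{fingerprinted}\""),
    )
}
```

The string replace only works because it runs before HTML minification, which is the ordering constraint described above.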

#Responsive images

Opt-in. If [images] is set in bamboo.toml, the engine walks the output for images, resizes each to a configured set of widths (320, 640, 1024, 1920 by default), and writes each variant as both WebP and JPEG. Then it walks every HTML file and rewrites:

<img src="/photo.jpg">

into:

<picture>
  <source type="image/webp" srcset="/photo-320w.webp 320w, /photo-640w.webp 640w, ...">
  <source type="image/jpeg" srcset="/photo-320w.jpg 320w, /photo-640w.jpg 640w, ...">
  <img src="/photo.jpg">
</picture>

so the browser picks an appropriate resolution. The image crate does the decode and resize. The webp crate handles WebP encoding. The JPEG encoder comes from image::codecs::jpeg.

The variants that the pipeline already generated have a -{width}w.{ext} suffix, which is how the scanner avoids regenerating them on subsequent builds: any file whose stem ends in -320w / -640w / etc. is skipped as already-derived.
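
The suffix check could look something like this. It is a sketch with invented names (is_derived_variant, variant_name), and it treats any -{digits}w stem as derived, whereas the real scanner may only match the configured widths.

```rust
use std::path::Path;

/// Build the file name for one resized variant, e.g. `photo-640w.jpg`.
pub fn variant_name(stem: &str, width: u32, extension: &str) -> String {
    format!("{stem}-{width}w.{extension}")
}

/// True when the file stem ends in `-<digits>w`, i.e. looks already derived.
pub fn is_derived_variant(path: &Path) -> bool {
    let Some(stem) = path.file_stem().and_then(|stem| stem.to_str()) else {
        return false;
    };
    // Take whatever follows the last hyphen in the stem.
    let Some((_, tail)) = stem.rsplit_once('-') else {
        return false;
    };
    match tail.strip_suffix('w') {
        Some(digits) => !digits.is_empty() && digits.bytes().all(|byte| byte.is_ascii_digit()),
        None => false,
    }
}
```

One consequence of matching on the name alone: a hand-authored original called banner-1200w.jpg would be skipped as if it were a generated variant.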

The feeds, the sitemap, and the search index are each generated by their own module, which takes the in-memory Site and writes a single file:

  • feeds::generate_rss → rss.xml
  • feeds::generate_atom → atom.xml
  • sitemap::generate_sitemap → sitemap.xml
  • search::generate_search_index → search-index.json

The XML files are produced with format! and a tiny xml::escape helper for the five special characters. For five entities and a handful of fixed templates this is fine. For anything with attributes-in-elements or namespaces I would reach for a real serializer.
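
An escape helper for the five special characters is only a few lines. The version below is a sketch of the technique rather than bamboo's xml::escape; escape_xml and sitemap_entry are names chosen for this example.

```rust
/// Escape the five characters that XML treats specially.
pub fn escape_xml(input: &str) -> String {
    let mut escaped = String::with_capacity(input.len());
    for character in input.chars() {
        match character {
            '&' => escaped.push_str("&amp;"),
            '<' => escaped.push_str("&lt;"),
            '>' => escaped.push_str("&gt;"),
            '"' => escaped.push_str("&quot;"),
            '\'' => escaped.push_str("&apos;"),
            other => escaped.push(other),
        }
    }
    escaped
}

/// Example of a fixed template filled in with format!.
pub fn sitemap_entry(location: &str) -> String {
    format!("<url><loc>{}</loc></url>", escape_xml(location))
}
```

Walking the input once avoids the double-escaping bug that chained replace calls can introduce when the ampersand is not handled first.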

The search index is a JSON array of { title, url, tags, date, excerpt, content } objects, where content is the rendered HTML run through strip_html_tags and truncated to 5000 characters per entry. That keeps the file small enough for the client-side Fuse.js search to stay responsive even on sites with hundreds of posts. The default theme's search page fetches the file in the browser when the page loads and feeds it to Fuse.js. There is no server-side search component because there is no server.
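
The strip-and-truncate step for the content field might be sketched as follows. Both functions are stand-ins written for this example (strip_tags is not the strip_html_tags in the repo), the tag stripper is deliberately naive, and truncating by characters rather than bytes is an assumption.

```rust
/// Naive tag stripper: drops everything between `<` and the next `>`.
/// Does not decode entities or special-case script/style content.
pub fn strip_tags(html: &str) -> String {
    let mut text = String::with_capacity(html.len());
    let mut inside_tag = false;
    for character in html.chars() {
        match character {
            '<' => inside_tag = true,
            '>' if inside_tag => inside_tag = false,
            _ if !inside_tag => text.push(character),
            _ => {}
        }
    }
    text
}

/// Keep at most `limit` characters, cutting on a character boundary.
pub fn truncate_chars(text: &str, limit: usize) -> &str {
    match text.char_indices().nth(limit) {
        Some((byte_index, _)) => &text[..byte_index],
        None => text,
    }
}
```

Cutting via char_indices avoids the panic that slicing at a fixed byte offset would cause in the middle of a multi-byte character.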

After the build, a link check scans every HTML file under output_dir for href="..." attributes with a plain byte scan (the templates are trusted, so a regex is overkill). Anything that looks like an internal link gets resolved against the output tree:

  • /about/ checks output_dir/about/index.html
  • /style.css checks output_dir/style.css
  • https://example.com/about/ (when the base URL is https://example.com) gets normalized to /about/ and checked the same way

External links and #fragment-only links are skipped. Broken links print a warning. There is a link_check_ignore config field for prefixes that share the deployment domain with sibling projects on the same host that bamboo cannot see into.
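
The resolution rules in the list above translate fairly directly into a path-mapping function. This is a sketch under assumptions: link_target is a hypothetical name, relative links such as "other.html" are simply skipped here, and the link_check_ignore prefixes are not modeled.

```rust
use std::path::{Path, PathBuf};

/// Map an href to the file that must exist under `output_dir`.
/// Returns None for links this sketch does not check (external, fragment-only).
pub fn link_target(href: &str, base_url: &str, output_dir: &Path) -> Option<PathBuf> {
    // Normalize absolute URLs on the site's own domain to root-relative form.
    let base = base_url.trim_end_matches('/');
    let local = match href.strip_prefix(base) {
        Some(rest) if rest.starts_with('/') => rest,
        _ => href,
    };
    // Anything that is not root-relative by now is external or a bare fragment.
    if !local.starts_with('/') {
        return None;
    }
    // Ignore any fragment or query string when resolving the file.
    let path = local
        .split(|character: char| character == '#' || character == '?')
        .next()
        .unwrap_or(local);
    let mut target = output_dir.join(path.trim_start_matches('/'));
    // Directory-style URLs are served from their index.html.
    if path.ends_with('/') {
        target.push("index.html");
    }
    Some(target)
}
```

The caller would then test each returned path with Path::exists and print a warning for the misses.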

#Incremental builds for bamboo serve

The dev server is started with bamboo serve --open. It watches the project tree with notify, debounces filesystem events by 300ms, then rebuilds. The rebuild is incremental.

Every tracked file (content/, data/, static/, templates/, bamboo.toml) gets SHA-256 hashed at the start of each build. The hashes are compared against the cache at .bamboo-cache/build-state.json. The diff produces a ChangeClassification:

pub enum ChangeClassification {
    Full,
    Targeted { changed_files: Vec<PathBuf> },
}
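
The comparison that feeds this classification is a diff of two path-to-hash maps. The sketch below shows only that diff; computing the SHA-256 digests and reading build-state.json are out of scope, and HashDiff and diff_hashes are names invented for the example, with digests assumed to be stored as hex strings.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Result of comparing the cached hashes with the current ones (hypothetical).
#[derive(Debug, Default, PartialEq)]
pub struct HashDiff {
    /// Files that are new or whose digest differs from the cache.
    pub changed: Vec<PathBuf>,
    /// Files present in the cache but gone from disk.
    pub deleted: Vec<PathBuf>,
}

/// Compare cached digests against freshly computed ones.
pub fn diff_hashes(
    previous: &HashMap<PathBuf, String>,
    current: &HashMap<PathBuf, String>,
) -> HashDiff {
    let mut diff = HashDiff::default();
    for (path, hash) in current {
        if previous.get(path) != Some(hash) {
            diff.changed.push(path.clone());
        }
    }
    for path in previous.keys() {
        if !current.contains_key(path) {
            diff.deleted.push(path.clone());
        }
    }
    // HashMap iteration order is arbitrary, so sort for stable output.
    diff.changed.sort();
    diff.deleted.sort();
    diff
}
```

A classifier would then inspect changed and deleted to choose between the two variants.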

A change in bamboo.toml, any templates/*, or any deletion forces a full rebuild. A change to content/posts/2024-01-15-hello.md produces a Targeted classification, which expands to a HashSet<RenderTarget>:

pub enum RenderTarget {
    Page(String),
    Post(String),
    Collection(String),
    Pagination,
    AllTaxonomies,
    Feeds,
    Sitemap,
    SearchIndex,
    All,
}

A post edit invalidates pagination, feeds, sitemap, search, taxonomies, and the index page (the index lists recent posts). A page edit only invalidates sitemap and search. The render functions accept the target set as Option<&HashSet<RenderTarget>> and consult it before doing the work, so an incremental pass writes only the few files that actually changed.
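
The expansion from a changed file to a target set, and the check the render functions perform, could be sketched like this. The enum is repeated with the derives a HashSet needs; everything else is assumed for illustration: targets_for and should_render are hypothetical names, targets are keyed by file stem, the index page is represented as Page("index"), and collections are ignored.

```rust
use std::collections::HashSet;
use std::path::Path;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum RenderTarget {
    Page(String),
    Post(String),
    Collection(String),
    Pagination,
    AllTaxonomies,
    Feeds,
    Sitemap,
    SearchIndex,
    All,
}

/// Expand one changed file into the outputs it invalidates.
pub fn targets_for(changed: &Path) -> HashSet<RenderTarget> {
    let mut targets = HashSet::new();
    // Anything outside content/ is treated as "render everything" here.
    let Ok(relative) = changed.strip_prefix("content") else {
        targets.insert(RenderTarget::All);
        return targets;
    };
    let key = relative
        .file_stem()
        .and_then(|stem| stem.to_str())
        .unwrap_or_default()
        .to_string();
    if relative.starts_with("posts") {
        // A post touches every listing that mentions posts.
        targets.extend([
            RenderTarget::Post(key),
            RenderTarget::Pagination,
            RenderTarget::AllTaxonomies,
            RenderTarget::Feeds,
            RenderTarget::Sitemap,
            RenderTarget::SearchIndex,
            RenderTarget::Page("index".to_string()),
        ]);
    } else {
        targets.extend([
            RenderTarget::Page(key),
            RenderTarget::Sitemap,
            RenderTarget::SearchIndex,
        ]);
    }
    targets
}

/// None means a full build; otherwise render only what the set asks for.
pub fn should_render(targets: Option<&HashSet<RenderTarget>>, wanted: &RenderTarget) -> bool {
    match targets {
        None => true,
        Some(set) => set.contains(&RenderTarget::All) || set.contains(wanted),
    }
}
```

Passing None for a full build keeps the render functions identical in both modes; only the guard at the top differs.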

This is finicky. The win is that editing a post and saving rebuilds in ~50ms instead of ~500ms on a site this size. With tower-livereload pushing a WebSocket message after each rebuild, the browser refreshes before I have tabbed back to it.

The error path is interesting. If a build fails mid-rebuild, the server stashes the error string in an Arc<Mutex<Option<String>>> and an Axum middleware intercepts the next page request, returning a styled error overlay with the message. As soon as the next successful build completes, the overlay clears and live reload fires. Editing markdown with a misformatted frontmatter no longer produces a 404 from the dev server; it produces a readable error screen.
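
The shared error slot is the interesting part, and it can be shown without Axum. In this sketch BuildStatus, record, and overlay are invented names; the rebuild thread would call record after each build and the middleware would call overlay before serving a page. The HTML is a placeholder for the styled overlay.

```rust
use std::sync::{Arc, Mutex};

/// Shared between the rebuild thread and the request handlers (hypothetical).
#[derive(Clone, Default)]
pub struct BuildStatus {
    last_error: Arc<Mutex<Option<String>>>,
}

impl BuildStatus {
    /// Store the error of a failed build, or clear it after a successful one.
    pub fn record(&self, result: Result<(), String>) {
        let mut slot = self
            .last_error
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        *slot = result.err();
    }

    /// Return overlay HTML while the last build is failing, None otherwise.
    pub fn overlay(&self) -> Option<String> {
        let slot = self
            .last_error
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        (*slot).as_ref().map(|message| {
            // Escape the message so error text cannot inject markup.
            let escaped = message
                .replace('&', "&amp;")
                .replace('<', "&lt;")
                .replace('>', "&gt;");
            format!("<pre class=\"build-error\">{escaped}</pre>")
        })
    }
}
```

Cloning BuildStatus clones the Arc, so every clone observes the same slot; that is what lets a handler see an error set on another thread.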

#What I left out, and why

A real plugin system. bamboo's "plugin" is "drop a Tera template into templates/" plus "edit bamboo.toml." That covers everything I want from a personal blog. Zola has the same answer.

A proc-macro layer for typed frontmatter. The frontmatter is a HashMap<String, Value>. Templates speak through the map. A typed struct per content type would be tidier in pure-Rust terms, but every site has a different schema, and I do not want to compile a per-site bamboo binary. The flat map wins.

Configurable markdown extensions. pulldown-cmark options are hardcoded: tables, footnotes, strikethrough, task lists, heading attributes. I have never wanted any of these off.

A theme registry. Themes are directories. The built-in is embedded. I have not built infrastructure to publish or install third-party themes because nobody has asked.

#Dependencies

A few non-obvious choices:

  • pulldown-cmark: event-stream API, fast. The custom code block highlighting and TOC extraction are straightforward against the event stream and awkward against any "give me HTML" wrapper.
  • syntect: syntax highlighter, the same one bat uses. The default theme set covers enough languages.
  • Tera: Jinja2-style templating with filters and macros. Performance is fine for site sizes that fit in memory, which is "every site I care about."
  • grass: pure-Rust Sass compiler. The reason there is no Node.js anywhere in this toolchain.
  • rayon: parallel iterators. Used in content parsing, page/post/collection rendering, taxonomy term rendering, image resizing, and asset minification. Each loop writes to its own file so there is nothing to coordinate.
  • walkdir: recursive directory walks. Filter-then-collect-then-parallel is the pattern everywhere.
  • lightningcss / minify-html / minify-js: the three minifiers. lightningcss in particular is faster and produces tighter output than the JS-ecosystem alternatives.
  • axum + tower-livereload + notify: the live-reload dev server stack. notify watches the disk, the rebuild thread fires through an mpsc::channel, a tokio::sync::broadcast fans out reload signals to connected WebSocket clients.

#Closing

The whole repo is at matthewjberger/bamboo. The source for this site is at matthewjberger/articles. If you want a ready-to-deploy starter with GitHub Pages CI, matthewjberger/bamboo-template is one gh repo create --template away.
