A Precious Side Project — This Website

Published Jul 10, 2023 - 57 min read - Text Only

Table of contents

Previously
Early Friction
Today, with Intermediate Representation
- Generating an Intermediate Representation
- Using the Intermediate Representation
Future
- Small shout out
- Camping

Hooray! This website is now no longer fully bound to an unmaintained static site generator! So, what's new, and what's changed? There's a long story on how it got to where it is. Bear with me for a moment.

Previously

I used and still use a fork of Mendoza, a documentation oriented static site generator for the Janet programming language. I saw the potential for differentiating my content with its powerful markdown-esque macro markup.

For example, adding dialog with stickers on the side.

In the Mendoza markdown, it looks like this:

## Previously

I used ...

@sticker-left[cendyne/hold-this]{
  For example, adding dialog with stickers on the side.
}

Which calls these Janet functions:

(defn sticker [name]
  (def sticker-parts (string/split "/" name))
  (def character (get sticker-parts 0))
  (def sticker-name (get sticker-parts 1))

  (def sticker-image {:tag "img"
    "src" (string "https://s.cdyn.dev/S/128/" character "/" sticker-name)
    "alt" sticker-name
    "class" "sticker"})
  {:tag "div" "class" "sticker-container" :content sticker-image})

(defn sticker-left [name content]
  {:tag "div"
  "class" "sticker-left"
  :content [
    (sticker name)
    {:tag "div" "class" "im-message" :content {:tag "div" "class" "im-message-left" :content content}}
  ]
})

The final expression, which is the return value, is a dictionary that represents an HTML element. This dictionary can have mixed types for keys and values. Any key that is strictly a string type will become an HTML attribute, while keywords like :tag instruct the render function which HTML tag to use.

Early Friction

Mendoza enabled my own consistent site-wide style. When I needed to rewrite the HTML for different CSS styling, I did not have to rewrite or edit any of the prior documents.

This level of customization in combination with my writing style — using lots of stickers and other media along the way — came at a significant cost.

Unmaintained since adoption

Since I started writing in 2021, Mendoza has not had much attention, besides making it still function with the latest compiler. Sure, there are no regressions and it does what it is intended for: it builds Janet's documentation website. Beyond that, it shows few signs of polish and comes with plenty of rough edges.

I think I'm the only one out there using Mendoza to write content nearly every month.

In my first year of publication, 2021, the first significant friction was how it handled static content. Every time I saved an edit, it would copy a few hundred files from the static folder into the build folder. Some of that was self-inflicted. For every sticker and image on this site, I had at least a jpeg, webm, avif, and jpeg-xl format encoded. At first I used imagemagick. Later, I used squoosh instead to do reliably convert images.

Back then, I was less familiar with Mendoza's internals. In 2022, I did end up forking it. I'll get to that. See section "A final experiment."

The second point of friction was building on more than one computer. It turns out running different versions of Janet on different architectures with other system dependencies like imagemagick with the right environment for all this at once is painful to synchronize. I'll also cover how I solved this. See section "Dev Containers"

The third point of friction is if there's a markdown parsing error, then the entire site build stops and errors. There is no graceful degradation. Common issues include quotes around curly braces requiring escapes, and parentheses requiring escapes after macros. Also, the fact that Mendoza "markdown" only extends as far as headings. Other styling, such as italics and bold, require macros. Oh! And if I start a paragraph with a styling macro, like italics, the rest of the paragraph never ends up in a paragraph HTML tag. At times I have to manually say: yes, I am intending to put a paragraph here. In short, there are several problems with the format I use to author.

Parsing content content/posts/2023-07-10-a-precious-side-project.mdz as mendoza markup
error: parser has unchecked error, cannot consume
  in parser/eof [src/core/parse.c] on line 935
  in capture-value [/usr/local/lib/janet/mendoza/markup.janet] on line 26, column 3
  in peg/match [src/core/peg.c] on line 1694
  in markup [/usr/local/lib/janet/mendoza/markup.janet] on line 150, column 16
  in <anonymous> [/usr/local/lib/janet/mendoza/markup.janet] on line 176, column 51
  in <anonymous> [/usr/local/lib/janet/mendoza/markup.janet] on line 175, column 41
  in require-1 [boot.janet] (tailcall) on line 2963, column 18
  in read-pages [/usr/local/lib/janet/mendoza/init.janet] on line 87, column 25
  in read-pages [/usr/local/lib/janet/mendoza/init.janet] on line 84, column 20
  in read-pages [/usr/local/lib/janet/mendoza/init.janet] on line 84, column 20
  in load-pages [/usr/local/lib/janet/mendoza/init.janet] on line 91, column 3
  in build [/usr/local/lib/janet/mendoza/init.janet] (tailcall) on line 142, column 14

Not even an input file line number!? Nope! Sometimes it will say "byte 1234," as if I read files in terms of bytes.

When I write for an hour or so at a time, I usually come back to some syntax error. As a result, I have to bisect my content by cutting out and copying back in portions of the post at a time to see where the issue is.

Once I spent 2 hours tracking down an issue because of a mis-interpreted open left curly brace several paragraphs before. It seemed to gracefully treat it as a character literal at first, and then later saw it differently, depending on content later in the article. Experiences like that reinforce my hope to pivot my authorship to another format.

What about the macros you keep mentioning?

Remember the sticker code above? Basic styling like italics is done with @em{...}, and bold with @strong{...}. These are called macros, as they are being called to rewrite the document's objects.

Mitigations

Along the way, I've gotten enough patches in place to make authoring at this scale tolerable.

Dev Containers

Around the time I needed to switch my authoring computer, I looked for alternatives. At first, I tried to set up a remote VSCode session in a FreeBSD jail on my NAS, since it seemed like a stable place to do it. However, FreeBSD is not really supported.

Later, a friend mentioned that VSCode Dev Containers are a thing. I looked into it, tried it out, and whoa! My Docker at work skills came very handy and I have been a fan of using dev containers since.

Though, I do wish Docker were less resource intensive on Mac.

I still use VSCode Dev Containers to author with Mendoza and likely will to the end of this year.

Extracting stickers

One of my first projects on Cloudflare workers was my stickers service. After all, my sticker collection was growing past one hundred source images and having so many alternative forms slowed the static build process down to half a minute.

I spent a month or two learning Rust and practicing with an image library there to resize things on the fly. It was unfulfilling work and ultimately in web assembly, did not have the speed I desired.

Then, I found out that Cloudflare has image resizing and image encoding for workers. It is unfortunately paid and only available to the professional plan, which is per domain, rather than per user. This functionality is also not functional on local. Which makes testing a pain.

My sticker service now derives scaled images in a requested format on the fly and caches them in Cloudflare KV. I upload the stickers with Insomnia (a program much like postman) and it handles the rest - including serving with content negotiation.

An upload form in an application called Insomnia. A url is set and several parameters like character, artist, name, and file are set.

Also, before I had to look in the folder to see which stickers I wanted to use. Now, I have a "sound board" where I can click on any sticker I want and it sets my clipboard to paste the embed code for that sticker.

Several cendyne stickers in a grid, each one is labeled and is clickable.

With this, my static build times cut to a fifth! It no longer copies hundreds of files every time I update a markdown file!

Instead of six hundred static files, it is closer to a hundred and twenty files each time after removing stickers.

Extracting media

As of August 2022, my articles include Tweets in a privacy respecting way. Previously, Tweets were mere screenshots. At no point in the past or the future were there cross-origin requests to Twitter, YouTube, or others without your consent as a reader.

A problem. Remember how the build time shrank after extracting all the stickers? In a few months of writing, Tweets and other media would undo that performance gain.

As of this article, there are 191 Tweets embedded on this website.

To prevent a repeat of degraded authorship-experience (like user-experience, but writing!), I opted for a similar but separate solution for media content. After all, photos and videos are far larger in size than 512x512 .pngs.

Isn't that premature optimization? Aren't software engineers not supposed to do that?

This is not premature. This work prevents a performance regression! The same problem persists: copying static files for every document edit slows the editing experience. Extracting the media would remove all remaining static content that grows over time.

While the edit and refresh delay was reduced by extracting stickers, it was still too long.

Not only did I want to serve more images directly, I wanted to serve more videos too. This compounded with my desire to add socially sourced content. Tweets, Toots, and YouTube videos come with several files: profile pictures, several attached images – even custom emojis, video previews (also called poster), and videos or "gifs" as .mp4 and .webm files.

I would not add social media embeds until I could handle the big file problem: videos.

And I really wanted to have videos and "gifs." Web-scaled images are within KV's limits. Videos go beyond KV's limits (25 megabytes). I could not reuse the KV approach for videos, and I would rather use the same storage backend for arbitrary binary content.

Unlike the Cloudflare workers image API used for the stickers service, the closest video equivalent has a high price tag.

A $100+ total hosting bill for my blog? No way.

I needed a storage solution that supports more than a few megabytes of binary content with a low cost to store and to serve.

Cloudflare had just released their Amazon Simple Storage Service (S3) competitor: Cloudflare R2 Object Storage.

What does R2 stand for?

Cloudflare's R2 marketing material says really requestable, repositioning records, ridiculously reliable, radically reprogrammable, ... They're just making up stuff at this point. It's just a product label.

Now, videos and images comfortably sit in R2 and I do not have to worry about Tweets, Toots, or YouTube content being too large for KV! And, in fact, R2 supports content range requests which is essential to serving videos on the web.

However, R2 is not the only acronym service involved. Before content goes into R2, an immutable key is generated by HMAC-tagging (archived) the content with a namespace key.

I roll my own cryptography and take the risks. In this case, I desired a PRF to isolate content between logical websites or namespaces, which will be on the same resource: an R2 bucket. The same concept is employed in UUIDv3 and UUIDv5 to prevent collisions with duplicate content across different namespaces.

Workers can only use 128MB of RAM. How do you get around that? What if it is a really big video?

Honestly, I have not hit that limit yet.

SubtleCrypto does not have an iterative digest or sign capability. While considered low-level and decorated with warnings, SubtleCrypto is fairly limited. It only exposes a bare minimum of low-level cryptographic functionality to JavaScript execution environments.

One possible way forward is to create an immutable key derived from the cryptographic digests of each chunk while writing the object to R2. Then, each digest would be composed into an overall digest like a merkle tree.

Unfortunately, R2 does not have a move or rename function, so the content would have to be read and written back under the new key, and then the old object deleted.

That sounds really complicated. Is there not another way?

Large video streaming method and rant

There is a relatively simple solution which requires no changes to this backend! Use HTTP Live Streaming (HLS) for large videos! HLS, by design, splits video content into chunks which can be accessed at random by clients.

HLS can be used for more than just live streaming! In addition to .mp4 videos, Twitter offers .m3u8 for video content to its clients. Essentially, an .m3u8 file is like a playlist that says which video chunk is for which time among several video quality streams. The client may switch between video quality streams at will to account for changing bandwidth conditions between chunks.

HLS, simple? It seems so complicated that entire companies exist just to serve HLS content. Try searching for it and the entire first screen is just sponsored results!

Search results that show companies like bunny.net, magine pro, magewell, and lightcast. Each one shows 'sponsored' above.

There is a lot of money in video production and none of that big money is used to democratize the technology. They monopolize HLS and Dynamic Adaptive Streaming over HTTP (DASH) by gate keeping with patents, paywalled standards, and crushing / merging with competition.

DASH is described in ISO/IEC 23009-1:2022. While the ISO website visibly suggests that this standard costs money and will happily charge you, DASH is available in the Publicly Available Standards catalog under 23009-1. DASH is somehow documented in over three hundred pages, and involves way too much XML...

Despite that industry's efforts to obscure the technology that your web browser and phone supports, HLS is openly described in RFC8216. The IETF RFC presents a simpler and wider supported defacto standard to sharing large video content than DASH.

Resuming from the tangent above... I mentioned that the arbitrary binary content is stored on R2.

Some internal details, you can skip

For example, take the soundboard image above. The original source image is stored in R2 at the path KWCA/PYB9wAPGQgmRCY0iCa6lslvnQhZJZavrUxeUliwdw38. Don't worry, you do not have to make much sense of it. I'll simplify it. The path looks like: <site id>/<HMAC(site key, bytes)>. This is referenced by the entity with an id jx3EsTkU, which happens to involve another HMAC. And finally, the URL you called above references the entity identified by sZnpIRps. Guess what? Another HMAC is involved! And then any runtime specific parameters, such as resize to 645 wide, are canonicalized and HMAC'd again to find derived content.

So in short, sZnpIRps, the content URL you hit above, is found and combined with any runtime parameters to find an existing derived image. If one is not found, then it finds the file in the source entity, identified by jx3EsTkU, and uses the image API to transform the original asset stored on R2 at KWCA/PYB9wAPGQgmRCY0iCa6lslvnQhZJZavrUxeUliwdw38. It then saves the result back into R2 at some other path like KWCA/<another base64url HMAC tag here>, and serves that result for all similar requests in the future.

A diagram where a url id points to a content negotiated entity. That points to a source entity and a derived entity. Both source and derived entities point to independent items in R2.

What's with all these HMACs?

Given I am uploading social content, I do not want to store duplicates all over the place. I am using deterministic processes to ensure that when a duplicate is uploaded, it is idepmotent and so no duplicate storage is incurred within one namespace.

Once my worker was live, I could comfortably offload all my media to it and rely on upload scripts to handle it from there.

And indeed, my build times changed to under ten seconds! I was only producing 80 HTML files or so.

What took it ten seconds? All those embeds for Tweets were stored as JSON and referenced during the build process. Each file, one at a time, blocked the HTML rendering process.

Remember how I want to preserve and respect the privacy of my readers? Well it came as a cost to the build process. There were over a hundred JSON files that functioned like a database for all the remote social content I had collected and where their media was stored.

{
  "name": "Jake Williams",
  "username": "MalwareJake",
  "timestamp": 1687005507,
  "text": "Totally normal, definitely not a bubble, investment cycle.\nhttps://fortune.com/2023/06/14/mistral-ai-startup-record-113-million-seed-round-arthur-mensch/",
  "photos": [
    {
      "url": "https://c.cdyn.dev/x6zxHIcY",
      "width": 1080,
      "height": 1016,
      "blurhash": "LEQmCrD%_3xu9Fs:%MRj~qoLM{xu"
    }
  ],
  "videos": [],
  "icon": "https://c.cdyn.dev/vscF0lY1",
  "iso8601": "2023-06-17T07:38:27-0500",
  "banner": "https://c.cdyn.dev/h_3_s8On",
  "banner_blurhash": "EGF~,QIA9ERh-=ofISt7_4WC9Eof",
  "color": "#10A3FF"
}

Which becomes...

Jake Williams

@MalwareJake@twitter.com

Totally normal, definitely not a bubble, investment cycle.
fortune.com/2023/06/14/mistral-ai-startup-record-113-million-seed-round-arthur-mensch/

Saturday, Jun 17, 2023

I needed something I could use to self service and find the Tweets, Toots, Youtube videos, and so on that I gather for later embedding. It turns out, KV is terrible if you want to paginate content! Following the same pattern of using yet another Cloudflare technology, I tried out the D1 open alpha. It is just distributed-ish SQLite, and SQLite is great at paginating results.

There is a database size limit. I am nowhere near that small 100MB ceiling.

A screenshot of a list of embedded Tweets. Extra debug information shows like as HTML, as JSON, etc. on each Tweet. It appears very similar to how it would be when embedded on this site.

Anyway, for local development, Mendoza now creates an iframe to my social archive and some JavaScript will automatically resize the content vertically so it appears seamless while editing. Then, during the build process on GitHub, I'd find all iframes and inline each iframe's content. By the time it gets to you, dear reader, there are no iframes and it is seamlessly integrated with the article.

This social archive service is just a pleasant frontend for that data and it converts it into formats I desire, from HTML, to simplified HTML for RSS content, and now my document IR. Yes, I'll get to that soon! See section "Today, with Intermediate Representation."

Vendor Lock-in

Have you not stepped into a vendor lock-in relationship? Isn't that risky?

Somewhat! I am using the technologies that are available and can handle zero to 🍊 front page without downtime or additional cost.

Much of the structured information is still around in Git, it is just not included in the build process. However, the media and all the derivations are only on R2 now.

If I got kicked off Cloudflare, which is really hard to do (archived,) I would have a bad time. For more on Cloudflare removing customers, see How Cloudflare got Kiwi Farms wrong (archived,) and Cloudflare explains why Kiwi Farms was its most dangerous customer ever (archived.)

Okay, but why Cloudflare this and that? Why not AWS, GCP, Azure, Akamai, Bunny, you name it?

First, because I am already familiar with Cloudflare and AWS. And AWS is 💰💰💰 expensive! I absolutely would not trust my credit card to AWS in the face of abuse or sudden spikes in traffic.

Second, the moment I touched Cloudflare Workers, I saw that a model of handling HTTP requests that I could fully keep in my head at all times. It meant that for the first time, I could really take control, diagnose, and fix issues without tons of abstraction like Tomcat and Spring.

And third, the cycle time on developing with Cloudflare Workers is so insanely fast that I feel like I'm working with PHP again, just without all the cruft. Also, a shout out to hono 炎 🔥. It has just the right amount of trade off for complexity, size, and composability for my interests. I can finally have fun again with Cloudflare workers.

Gergely Orosz

@GergelyOrosz@twitter.com

Google Domains had been the world's 3rd most popular registrar, by monthly domain registrations 2021 to now.

(#1 is, of course, GoDaddy by a huge margin. #2 is NameCheap: both companies' bread-and-butter is domains)

Congrats to the Google Domains team for pulling this off.

Wednesday, Jun 21, 2023

thats-not-how-this-works-thats-not-how-any-of-this-works

Also, who in their right mind would trust Google Cloud Platform at this point? They just sold their registry (archived,) the third most used registry in the world, as if it were a failure! Google can't help but shut down the things that work, the things that make businesses successful. Something is terribly wrong with their long term commitment. Something is terribly wrong with their long term culture and leadership.

No amount of "We can give you credits to move onto GCP" emails, and yes I do get those, will convince me to risk my employer's future on Google.

Mitigation results

Now, finally, finally, my site builds in under a second.

The less time I spend waiting on it, the faster I can judge my writing in its near published format and fix any site breaking syntax errors, and get back to writing more. After all, the harder it is to write, the less I will write.

And, I have interesting things to say, surely.

Dynamic content

I want my content to go live at a specific time without having to run a command on my machine beforehand.

A simulated screenshot showing the text wrangler deploy dash dash env production.

I would also like my content to be viewable before its publish time for proofreading. This is impossible to do with a purely static website.

You want a lot of things.

This article details how I got each and every one with my own efforts!

Pivoting the origin

At first I considered trying OpenResty again. Though, the last time I wrote something for it, I did not find it fun to maintain. It is reliable though, if you want to do something for yourself.

What did you write in OpenResty?

A friend runs a Patreon and was concerned about the the rampant scraping that occurred. While I was at a tech conference, I banged out something that uses YAML and HTTP Basic Authentication for per-month limited-time access credentials. They upload their assets and touch up the YAML file over time and it keeps the content linked from Patreon posts password protected. The password is delivered out of bounds to the subscriber. It is really clunky.

Instead, I opted for Workers Sites, which is separate from Workers Pages. Sites generates a manifest file that is statically imported into the build process. The manifest specifies which pathnames go to which immutable files in KV. During deployment, it also compares the new manifest with KV and removes unlinked files. I say immutable in that the files are appended with a truncated hash. Instead of changing files in place, files are either created anew or garbage collected.

Here's a sample from a silly site cendyne.gay:

{
  "main.css": "main.6da346e39b.css",
  "pride-background.png": "pride-background.174ca9ae29.png",
  "sign.png": "sign.de9888ffc3.png"
}

The manifest is a dictionary where the keys are the pathnames (without the leading /) and the values are the keys in KV where the content is stored.

wrangler kv:key --namespace-id db7b05b5778b494d9e356aaa21facefd list
[
  {
    "name": "main.6da346e39b.css"
  },
  {
    "name": "pride-background.174ca9ae29.png"
  },
  {
    "name": "sign.de9888ffc3.png"
  }
]

wrangler kv:key --namespace-id db7b05b5778b494d9e356aaa21facefd get "main.6da346e39b.css"
html, body, div {
  margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
...

As you can see, those files exist as values in KV.

This origin change is not necessary to do what comes next.

Rewriting content

Here's where things get interesting. I noticed that Cloudflare also has a neat streaming HTML transformer library.

By having additional attributes, such as data-publish-date="2022-09-11T16:14:04Z", I could choose to delete the element entirely or alter it in some way. I altered my Posts page index to have the publish date embedded on each list item.

Then, at request time, the worker would transform the HTML with all the posts to show only the published posts, by comparing to the current time and deleting elements that are in the future. This also meant that future publications could be uploaded as is and be otherwise unlinked to the public eye!

With HTML rewriters, I could progressively enhance a static website using metadata packed in data-* attributes.

This change makes the site a hybrid of static and dynamic, where the only database is the origin where the static HTML is stored. No "Hug of Death" will take down this site now! There is no single point of failure to crumple under the weight of 🍊's front page.

Now, if only I had something reliable for people to discover my content, like Really Simple Syndication (RSS)...

Adding Really Simple Syndication

The solution I used requires a bit of adjacent knowledge. Here's how Mendoza allowed me to create an index page.

Accessing the site map

There was a trick to making the Posts page index. The markdown for the post page is empty! Instead, the list of links is created in the template, which is processed after all the pages are parsed and structured in memory.

<div class="content">
  {{ content }}
  <ul>
    {{ (seq [post :in (posts)] (render-toc post)) }}
  </ul>
</div>

There are two functions to make note of above, a simplified version is included below.

The posts function, which searches the sitemap for all posts
The render-toc function, which generates almost-HTML for every post found

(defn posts []
  (sort
    (((dyn :sitemap) "posts") :pages)
    (fn [a b] (> (a :pub-date) (b :pub-date)))
    ))

(defn render-toc [node]
  {:tag "li"
   "data-publish-date" (iso8601 (node :pub-date))
   :content [{
    :tag "span"
    :content [
      (node :date)
      " "
      {:tag "a" "href" (node :url) :content [(node :title)]}
      ]
    }]
  })

The template processing phase has access to the entire website, where each entry is keyed by its destination file path and the value is a "document" with a title and any other metadata added in the Janet header. Literally a "site map."

The Janet header above the content acts a lot like the YAML Header you'll see in markdown files. However, because it is Janet, I can execute any code I want: like importing more macros or setting up per-page state.

In the same dictionary, labeled node above, is the document's content, accessed with the key :content, which value is a list of almost-HTML dictionaries.

Transforming the site map to an RSS feed

Remember how the sticker HTML is made? The return value was a map or dictionary with a :tag key and several string attributes, as well as a :content key. The same structures are inside the document's :content!

Which means that, like the Posts index page, I can access the site map and process the almost-HTML content of each page.

However, I had to do some tweaks. For example, in RSS you do not want to use relative URLs (archived.) To solve this, I had to write a visitor which would descend into every node or element or dictionary observed and find matching links, images, and sources for images and videos and rewrite the URL before being rendered as HTML in the RSS.

Several of my HTML structures were made for a browser-centric experience which permit styling and scripts. HTML that goes into RSS is limited - more limited than what you can send over email.

And so, I figured out that I could set other keyword keys on my output almost-HTML and my RSS transformer would detect and rewrite the content it was tied to.

(defn sticker-left [name content]
  (def sticker-parts (string/split "/" name))
  (def character (get sticker-parts 0))
  (def sticker-name (get sticker-parts 1))
  {:tag "div"
   "class" "sticker-left"
   :meta-node "sticker"
   :meta-data {
     :character character
     :name sticker-name
     :content content
   }
   :content [
     (sticker name)
     {:tag "div" "class" "im-message" :content {:tag "div" "class" "im-message-left" :content content}}
   ]
})

Observe the :meta-node and :meta-data values.

(defn- post-process-sticker-message [node]
  (def content (get-in node [:meta-data :content]))
  (def sticker-parts (string/split "/" sticker-name))
  (def character (get sticker-parts 0))
  (def sticker-name (get sticker-parts 1))
  {:tag "table" "border" "0" :content [
    {:tag "tr" :content [
      {:tag "td" "width" "128" :content {:tag "img"
        "src" (string "https://s.cdyn.dev/S/128/" character "/" sticker-name)
        "alt" sticker-name
        :no-close true}}
      {:tag "td" :content (post-process content)}
    ]}
  ]})
(varfn post-process [node]
  (cond
    (bytes? node) node
    (indexed? node) (map post-process node)
    (dictionary? node) (if (node :tag)
      (do
        (cond
          (= :sticker-left (node :meta-node)) (post-process-sticker-message node)
          (= :sticker-right (node :meta-node)) (post-process-sticker-message node)
          true (post-process-tag node)
        ))
      node)
    true node))

And now observe that this visitor descends and recognizes any almost-HTML dictionaries which look like stickers. Upon matching, it then rewrites the content into a table, rather than relying on external CSS.

In rewriting my content for an RSS experience, I began to think about how I could eventually export my blog content from the almost-HTML to something useful outside the blog.

In October, I created an issue to track my desire to create a path out of this unmaintained static site generator.

A final experiment

Many months passed and eventually I had the energy and focus to transition my content towards something I could build from.

A GitHub issue, where the first entry is dated October first, 2022. It has a list of check boxes that describe the parts of the project to do.

I would not rewrite all of my content in another markup, markdown, whatever format by hand. I needed to prove that I could structurally transform and interpret the existing content.

In good engineering practice: I first performed a feasibility test. Rather than convert to another form of HTML, I reduced the scope to emitting a plain-text file. It took a few evenings to make. The results were flimsy, yet functional.

A text version of the post DEF CON 30, beginning with 'What brought me to DEF CON'. There is creative use of white space to differentiate sections of the article.

This experiment revealed all the things I needed to annotate with metadata. From quotes and embedded content to a home-grown shared glossary I added a year back.

For example, I referenced PRF above and reference IR below. I can also optionally include a glossary section in the article with a definition I wrote elsewhere.

Pseudo Random Function PRF: A Pseudo Random Function or PRF is a keyed function that produces uniformly random data. It takes an input and reliably produces the same output with a fixed length. It sounds a lot like a hash function, and often is made with a hash function with additional mechanics around it.
Intermediate Representation IR: A structure or union of structures that represent some source code. It can be examined and manipulated with further processing. Compilers use Intermediate Representations (IRs) to optimize code. Some IR structures are only available to the compiler as processing infers additional contextual information.

I had to extend my fork of Mendoza to add more functionality after it rendered content to an HTML file, such that it would save a .txt version too. After all, it was only designed to generate Janet documentation. There was no concept of multiple artifacts per source file.

A good experiment brings greater knowledge on the way forward. This delivered on that. A great experiment provides something useful that enables future work. This also delivered on that.

Today, with Intermediate Representation

Following the .txt output above, my next goal was to make a JSON output.

Why JSON? What would you have me do, use XML? No!

Ultimately I need a serializable intermediate representation (IR).

JSON is a fantastic format to serialize. It is easy to define schemas, either through JSON Schema or TypeScript type declarations. It also crosses the language barrier!

In contrast, extensible data notation (EDN) and its Janet equivalent Janet data notation (JDN) is neat and accessible from Janet. However, unless I want to write a .jdn parser in TypeScript, I'd be locked to processing inside the Janet language. That would not move me forward, when most of my new code is in TypeScript.

After all this Janet stuff, why TypeScript?

Whoa, I don't only write in TypeScript. Its just the most convenient language to write web workers in. The Rust 🦀 equivalent is not ready and is not fun.

Honestly, I disliked TypeScript at first. I wanted Haskell type (archived) rigidity at first. So, I tried Elm and backed away after the complexity of just crossing into JavaScript was too high. Good thing too... Elm is practically dead now (archived.)

Once I realized TypeScript is really just annotated JavaScript, suddenly I could write with ease. TypeScript is just JavaScript with a companion language pre-processor that warns about type violations. It also happens to remind you to check for runtime signals too and is aware of your checks, such as typeof objectHere == "string".

Generating an Intermediate Representation

With JSON as the IR, I had a target to export meta-data and content to. In general, any dictionary with a :meta-node key will be directly emitted as an IR structure, with it's child content also transformed. That leaves any remaining HTML, like blockquote, hr, h2, and so on. These are also transformed into IR nodes.

A select few HTML tags are supported, the rest emit a warning and it is ignored.

At this time, there is full coverage of blog content into document-ir.

Now that a JSON friendly IR is ready, let's write it out to a file!

{
  "type": "document",
  "content": [
    // ...
    {
      "content": [
        {
          "text": "\u000A",
          "type": "text"
        },
        {
          "id": "1670048257873960964",
          "type": "tweet"
        },
        {
          "text": "\u000A",
          "type": "text"
        }
      ],
      "type": "array"
    },
    // ...
    {
      "type": "paragraph",
      "content": [
        {
          "text": "Now that a JSON friendly ",
          "type": "text"
        },
        {
          "content": [
            {
              "text": "IR",
              "type": "text"
            }
          ],
          "definition": {
            "abbreviation": [
              {
                "text": "IR",
                "type": "text"
              }
            ],
            "key": "definition-ir"
          },
          "type": "definition-reference"
        },
        {
          "text": " is ready, let's write it out to a file!\u000A",
          "type": "text"
        }
      ]
    },
    {
      "type": "text",
      "text": "\u000A\u000A"
    },
    // ...
  ]
}

Huh, what is with that \u000A? Oh... that is a newline (\n). Well, I guess that is something to optimize out.

This odd and unnecessary whitespace is an artifact of the authoring document. Every sentence I write is its own line, and so every line ends up with its own "type": "text" node ending in "\000A".

There is also a small "type": "tweet" node in there. Where is the Tweet's content?

Indeed, the authoring document lacks the Tweet content by design. It will have to be replaced in post-processing with content supplied by the social archive.

Remember how the social archive was used to replace iframes in the HTML before going to production? Similarly, social document-ir nodes will be replaced before going to production.

To solve the whitespace issue, as well as that unnecessary "type": "array" node, I wrote WhitespaceTransformer.ts and ArrayCollapseTransformer.ts. As for the Tweet, that was yet another transformer which does two passes. The first finds all social content to fetch, it fetches all the content in bulk from the social archive, and then the second pass replaces all social nodes that matched with the corresponding document-ir which the social archive produces.

After post processing, it now looks like:

{
  "type": "document",
  "content": [
    // ...
    {
      "type": "card",
      "header": {
        "type": "card-header",
        "title": [
          {
            "type": "text",
            "text": "Gergely Orosz"
          }
        ],
        "username": "GergelyOrosz",
        "usernameDomain": "twitter.com",
        "url": "https://twitter.com/GergelyOrosz",
        "imageUrl": "https://c.cdyn.dev/Wg-ozM6I"
      },
      "attribution": {
        "type": "card-attribution",
        "date": "2023-06-21T15:57:57.000Z",
        "url": "https://twitter.com/GergelyOrosz/status/1671548015948046343"
      },
      "content": {
        "type": "card-content",
        "content": [
          {
            "type": "text",
            "text": "Google Domains had been the world's 3rd most popular registrar, by monthly domain registrations 2021 to now. ("
          },
          {
            "type": "link",
            "content": [
              {
                "type": "text",
                "text": "#1"
              }
            ],
            "url": "https://twitter.com/hashtag/1"
          },
          {
            "type": "text",
            "text": " is, of course, GoDaddy by a huge margin. "
          },
          {
            "type": "link",
            "content": [
              {
                "type": "text",
                "text": "#2"
              }
            ],
            "url": "https://twitter.com/hashtag/2"
          },
          {
            "type": "text",
            "text": " is NameCheap: both companies' bread-and-butter is domains) Congrats to the Google Domains team for pulling this off."
          }
        ]
      },
      "original": {
        "type": "tweet",
        "id": "1671548015948046343"
      }
    },
    // ...
    {
      "type": "paragraph",
      "content": [
        {
          "type": "text",
          "text": "Now that a JSON friendly "
        },
        {
          "type": "definition-reference",
          "definition": {
            "abbreviation": [
              {
                "type": "text",
                "text": "IR"
              }
            ],
            "key": "definition-ir"
          },
          "content": [
            {
              "type": "text",
              "text": "IR"
            }
          ]
        },
        {
          "type": "text",
          "text": " is ready, let's write it out to a file!"
        }
      ]
    },
    // ...
  ]
}

This post processing occurs in GitHub Actions after the site builds with Mendoza. Not only will this document-ir be useful to generate HTML, it will be quite usable to make RSS and plain text content too!

Using the Intermediate Representation

While you can view the JSON representation for any post I have now, for example: A Precious Side Project — This Website (json), this is only an interesting tech demonstration rather than something of value to you as a reader.

What really brings value to you, dear reader, is providing an artifact tailored to your experience. Be it with a modern browser, RSS, or even Internet Explorer 6.

Internet Explorer 6 in a virtualized windows 95 environment visiting a page on cendyne dot dev. It certainly lacks the styling of the modern day, though it is faithful to the content presented.

While I can support silly uses cases like Internet Explorer 6 in 2023 now, there were several things I needed to deliver before I could swap the HTML re-written content with Document IR generated content. Namely:

Emit near equivalent HTML for fully featured HTML pages.
Replace the RSS Feed.
Replace index pages like Posts and Topics.
Replace the sitemap.xml.
Reimplement the extra features like HTTP Referer messaging, and more.

A new site map

Before I can do anything more, I really need my own manifest for which document JSON files map to which paths. There is a manifest for workers sites. However, it does not identify which files are specifically document-ir files.

As part of my document-ir post processing action, I now generate an enhanced sitemap JSON file.

It looks something like this:

{
  "version": "b1792837db9611dd5cda46cec1bc858c86204808",
  "branch": "main",
  "collectiveDigest": "5Xb4BXzPzgzwf7Xum2jp2xVpM2ZCFTxYwvUXroDsRKc",
  "date": "2023-07-04T06:22:24.774Z",
  "documents": {
    "/index.html": {
      "source": "/index.json",
      "title": "Hello, I'm Cendyne",
      "url": "/index.html",
      "contentDigest": "suImAKzRXfuFxGNs97ySnMRNiUVXDUbsvhbQJrE-ks0",
      "description": "I write about security, software architecture, management, and applied cryptography."
    },
    "/posts/2021-04-11-website.html": {
      "source": "/posts/2021-04-11-website.json",
      "title": "A New Year, A New Website",
      "url": "/posts/2021-04-11-website.html",
      "pub-date": 1618174005,
      "pubDate": "2021-04-11T20:46:45.000Z",
      "lastMod": "2021-04-11T20:46:45.000Z",
      "contentDigest": "eVQ7_KoeojURovRlhvgo6Luj8WlvGA2LHVZa7-jDSe0",
      "date": "2021-04-11",
      "description": "The new website for the year, a reflection on 2020 and now 2021",
      "guid": "eb24d936-9b85-4f02-baa2-24147e3d242a",
      "image": "https://c.cdyn.dev/pr45JDDH"
    },
    // ...
  }
}

In a similar manner to how wrangler injects a JSON file at deploy time, I generate this file before deployment and import it. Additional information is embedded, such as the Git commit hash, Git branch, and build date. Every page also tracks its own canonicalized JSON digest for tracking over time to see if it has been modified.

The digests are synced against another service which tracks when a digest was first seen. The dates returned are then merged into the enhanced sitemap for the last modification date. Those dates are important for generating an accurate sitemap.xml for Google, Bing, and other search engines.

The enhanced sitemap is then used to form the response when a request GETs /sitemap.xml. Of course, content that has not yet been published will be omitted from the response.

Near HTML Equivalent pages

When you visit a .html link, my worker now checks the enhanced sitemap for an entry. The source is then used to pull the document-ir from the Workers Sites KV binding.

Some further post processing happens and I sprinkle in the HTTP Referer message into the final document ready to be rendered. Next, it goes into a JSX tag which recursively renders document-ir nodes into HTML.

Any ephemeral global state added by the JSX nodes is then collected and reflected on the response, such as any Content Security Policy headers that need to be accounted for.

Lastly, the response is sent to you to read, learn, and enjoy.

Index pages

Posts and Topics is actually a breeze at this point. I can read the enhanced sitemap and emit a document-ir document! Rather than post process the HTML like before, the generated document has posts pre-filtered and it goes through the same rendering process as any normal article I have!

RSS Feed

RSS is tricky! Not only do links have to be rewritten to absolute URLs. I also cannot faithfully deliver the styling that makes my online brand possible. Many node types can pass through just fine, like paragraphs, italic, blockquotes, and such. Specialty nodes like stickers, cards, and highly technical sections need an alternate output, as they

While certainly not so clean, the way I got around this was by setting some ephemeral global state which sets a lite flag. When the <Sticker ...>...</Sticker> JSX node is executed, it checks the lite flag and provides an alternative HTML Table based output.

This lite flag also benefits silly use cases like displaying content to Internet Explorer 6.

Then, the newest four posts are fully embedded with this rendered HTML, while the remaining next six are stubs that link to this website.

A regression: highlighted code

One concern I had going in was how document-ir did not support highlighted code. In a way, that is by design! The IR should be as simple as possible for processing. Enhancements can always be added later.

And indeed, after I went live, the next task was to create a highlighter.

This opened up more options, actually! See, I only have a few built in languages in Mendoza to work with. The rest are custom that I brought in. And, Mendoza's language parsers often crash the build, which is a pain to deal with.

There's a popular JavaScript project highlight.js which I now use to highlight my code at the edge. The good thing is: I can add my own languages to this too! And it really is not too complicated.

However, instead of directly embedding this functionality into my blog source, it is extracted to another service. A Service Binding is used to couple my cendyne.dev worker with a worker dedicated to only syntax highlighting.

However, service bindings are also asynchronous. I cannot invoke a service binding while rendering JSX. Therefore, I used a two pass approach, much like the social media embedding described above. The first pass collects all formatted code with a language tag, then asynchronously requests and caches the highlighted results. The second pass renders synchronously and uses the cache to inline HTML when a successful response exists.

A surprise benefit: Search Engine Optimization warnings

Now that document-ir has proved functional in GitHub Actions and at runtime, I went for one more useful thing, which I could not easily do before.

An interface from GitHub showing dependent build steps. The first is to build the website and the other two are SEO checks and publish

A wholistic GitHub Check Run is recorded as part of a build job "SEOChecks" that reads the same json files which go to staging and production.

A GitHub UI where check runs are listed. One of them is custom and shows the Cendyne character next to 'SEO Checks', it shows action is required..

I reviewed warnings and recommendations across several search engine portals. Among them were the usual: title is too short / too long, meta description is too short / too long, missing image alt attributes, too many h1 elements, and so on.

A GitHub UI with a report. Emoji are next to each page. Pages that fail have reasons listed.

The techniques I employed to record a check run, and comment on a pull request are worthy of another article. Which is now released!

Are you interested in Deno, Bun, or TypeScript with Node? Check out Sharing code between Deno and Node where Bun and ts-node failed. What about authenticating with GitHub or publishing to NPM? Check out Custom GitHub Actions, Check Runs, GitHub Applications, NPM Publishing.

The technology used to do all this post processing, checks, and so on is quite a topic to cover. As with any JavaScript endeavor, there were several road bumps worthy of discussion.

Future

You might be thinking: Wow! That was a lot of effort just to send HTML back to me that looks unchanged. What did that accomplish?

Switching to output and render document-ir unblocks the next long term project: migrating to another authoring technology. I have pushed Mendoza as far as it will go with comfort. It is time I write in something else.

If I can transform document-ir into another authoring markup and back to document-ir without any changes, then I will have a way to port all of my existing content without loss of content. Ultimately, I want to enjoy writing and not need to solve obscure problems to deliver exactly what I want to you, the reader.

Plus, I could do neat things now like dynamically include other content like my Toots on the fediverse on my website, with document-ir in between.

Maybe I'll do that some time.

Thank you for reading. This is the longest side project I have maintained yet. Despite all the grumbling I have for the issues I have overcome, this software does make me happy.

Small shout out

Check out Xe Iaso's My Blog is Hilariously Overengineered to the Point People Think it's a Static Site. Similarly, my site was static, and now it is nearing the same level of complexity as Xe's. Even Xe has a Content Delivery Network: XeDN.

My sticker commentary is in part inspired by Xe Iaso's writing style.

Camping

Also I wrote most of this while camping! One morning I woke up with a frog on my tent.

The camping diet has been mostly sausage and egg for a few days. I think I'm ready for something else.

Several crispy sausages are stacked in a small trapezoid on a paper plate.