Lazyload inserted images in WordPress posts (without regex)

A look into a simple DOM parsing task in PHP, something more commonly done in JavaScript.

Most lazy-load solutions put the image URL in a data-src attribute instead of the normal src attribute. When the browser sees data-src on an image, it does nothing, because the attribute has no built-in meaning. When the image needs to be displayed (when it enters the viewport), the lazy-load script copies the data-src value into the real src, revealing the image to the viewer.

So, we want to make normal images inserted into a post body look like this:

<img data-src="img.jpg" src="placeholderimg.gif">

The “Multilingual Plane” and the regex dilemma

First off, you might be thinking: sounds like a case for regex. But careful now: the most upvoted answer on all of Stack Overflow is to a question asking just that. 4428 upvotes. Here’s an excerpt:

[…] Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. […]

And it has great comments also:

Chuck Norris can parse HTML with regex.

The answer also scored close to 1000 upvotes and 200 comments on Reddit.

From this scientific study, we can safely conclude that Regex is out of the question (unless you’re Chuck Norris).
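
Joking aside, here is a small sketch of why a quick regex tends to be brittle on real markup. The pattern and the sample tags are made up for illustration (they are not from any of the linked threads), and `cm_count_naive_matches` is a hypothetical helper name:

```php
<?php
// A naive pattern: assumes a double-quoted src that directly follows "<img ".
$naive = '/<img src="([^"]*)"/';

// Sample markup, invented for this sketch. All four are valid image tags.
$samples = [
    '<img src="a.jpg" class="size-full">',   // the shape the pattern expects
    "<img src='b.jpg'>",                      // single quotes
    '<img class="size-full" src="c.jpg">',   // src is not the first attribute
    '<IMG SRC="d.jpg">',                      // uppercase tag and attribute
];

// Count how many of the samples the pattern manages to match.
function cm_count_naive_matches(array $samples, string $pattern): int
{
    $hits = 0;
    foreach ($samples as $html) {
        // preg_match() returns 1 on a match and 0 otherwise
        $hits += preg_match($pattern, $html);
    }
    return $hits;
}

echo cm_count_naive_matches($samples, $naive), ' of ', count($samples), " images matched\n";
```

Only the first sample matches, so this prints "1 of 4 images matched". Each miss can be patched with a more elaborate pattern, but the list of special cases keeps growing, which is the point the answer makes in its own way.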

What then?

Some sort of DOM parser is ideal for this kind of manipulation. Here’s another great SO thread on just that; there are plenty of options. This article looks into using the Simple HTML DOM Parser. It’s much like jQuery, but for PHP (a weird clash of worlds).

Simple HTML DOM Parser

Editing DOM elements:

// First include the library in your template
// (adjust the path to wherever you saved simple_html_dom.php)
require_once 'simple_html_dom.php';

// Create the DOM object from a string (file_get_html() does the same for a file or URL)
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Find the second div in the DOM and give it a class of 'bar'.
// If you pass an index to find(), it returns a single element instead of an array
$html->find('div', 1)->class = 'bar';

// Find the first div with an id of hello
// and change its text to 'foo'
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;

This outputs:

<div id="hello">foo</div><div id="world" class="bar">World</div>

If you don’t pass an index as find’s second parameter, as in $html->find('div'), it returns an array of all the matching DOM elements. The array can be walked with a foreach loop:

$html = str_get_html($content);
// Grab all the img tags and loop through them
foreach ($html->find('img') as $element) {
    // Add 'lazy' class to the image
    $element->class = 'lazy ' . $element->class;
}
echo $html;

The lazy load example

Here’s the aforementioned lazy load trick. Note that this is very small-scale parsing, though, and the “unholy child” probably does not weep the blood of virgins if regex is used here.

A normal image might look something like this:

<img src="" class="size-full">

We’d like it to look like this:

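<img src="data:image/gif;base64,R0lGODlhAQABAIAAAMLCwgAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==" data-src="" class="lazy size-full">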

Let’s look at the constituent parts of the element:

  • src: the placeholder image. We’re using a 1px GIF here, as a Base64-encoded data URI to save one HTTP request. Check here for more 1px data URIs.
  • data-src: the path to the real image.
  • class: the lazy class, good to have but by no means mandatory.
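
To see what that data URI actually carries, here is a minimal sketch that splits it apart and peeks at the GIF header. The helper name `cm_inspect_placeholder` is made up for this example; the URI is the placeholder used in this post.

```php
<?php
// Hypothetical helper for illustration: split a Base64 data URI and read the GIF header.
function cm_inspect_placeholder(string $dataUri): array
{
    // A data URI looks like "data:<mime>;base64,<payload>"
    [$meta, $payload] = explode(',', $dataUri, 2);
    $bytes = base64_decode($payload, true);

    // GIF header: a 6-byte signature, then width and height
    // as 16-bit little-endian integers
    $size = unpack('vwidth/vheight', substr($bytes, 6, 4));

    return [
        'meta'      => $meta,
        'signature' => substr($bytes, 0, 6),
        'width'     => $size['width'],
        'height'    => $size['height'],
    ];
}

$placeholder = 'data:image/gif;base64,R0lGODlhAQABAIAAAMLCwgAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==';
print_r(cm_inspect_placeholder($placeholder));
```

The result shows a GIF89a signature and a 1×1 image: the whole picture travels inside the attribute, so the browser has nothing to fetch.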

Here’s a function that takes the post content and works its modifying magic on it:

function cm_add_image_placeholders($content) {
    $html = str_get_html($content, '', '', '', false);
    // str_get_html() returns false for empty (or very large) input; leave such content untouched
    if (!$html) {
        return $content;
    }
    $placeholder = 'data:image/gif;base64,R0lGODlhAQABAIAAAMLCwgAAACH5BAAAAAAALAAAAAABAAEAAAICRAEAOw==';
    foreach ($html->find('img') as $element) {
        // Element class, prepend lazy to it
        $element->class = 'lazy ' . $element->class;
        // `data-src` attribute, note the brace syntax because of the hyphen
        $element->{'data-src'} = $element->src;
        // Placeholder image to the src
        $element->src = $placeholder;
    }
    // Return the modified markup as a string
    return $html->save();
}
// This WP-specific filter applies the changes to a post
add_filter('the_content', 'cm_add_image_placeholders', 99);

See, no regex!

Notice the str_get_html($content, '', '', '', false) call: the fifth parameter is set to false because otherwise the parser strips line breaks out of the content. Perplexing that true is the default…

See the Simple HTML DOM Parser docs for more examples.


There are also native PHP methods for traversing and parsing the DOM, and they look surprisingly like JavaScript:

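$dom = new DOMDocument;
// Load some markup first (this sample XML is made up for illustration)
$dom->loadXML('<books><book>Foo</book><book>Bar</book></books>');
$books = $dom->getElementsByTagName('book');
foreach ($books as $book) {
    echo $book->nodeValue, PHP_EOL;
}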
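
For comparison, here is a rough sketch of the same src/data-src swap done with the native DOM extension only. This is not the approach used above, just one way it could look: the function name `cm_add_image_placeholders_native` and the wrapper div are assumptions of this sketch, and the output is re-serialized by libxml, so whitespace and attribute escaping may differ from the input.

```php
<?php
// Hypothetical native variant (not the function from this post):
// swaps src/data-src using PHP's built-in DOMDocument.
function cm_add_image_placeholders_native(string $content, string $placeholder): string
{
    if (trim($content) === '') {
        return $content;
    }

    $dom = new DOMDocument();
    // Collect parser warnings (e.g. about HTML5 tags) instead of printing them
    $previous = libxml_use_internal_errors(true);
    // The XML declaration hints UTF-8 to the parser; the wrapper div
    // gives a single known parent to read the result back from
    $dom->loadHTML('<?xml encoding="utf-8" ?><div>' . $content . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src === '') {
            continue; // nothing to lazy-load
        }
        $img->setAttribute('data-src', $src);
        $img->setAttribute('src', $placeholder);
        $img->setAttribute('class', trim('lazy ' . $img->getAttribute('class')));
    }

    // The wrapper is the first div in document order; serialize only its children
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo cm_add_image_placeholders_native(
    '<p>Hi</p><img src="a.jpg" class="size-full">',
    'placeholderimg.gif'
);
```

In WordPress this could be hooked to the_content through a small closure that passes in the placeholder. Whether it is worth trading the jQuery-like selectors for one less dependency is a matter of taste.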
