Nice, Clean HTML output at any length

September 17, 2011

I have written previously about the rather spiffy SimplePie. For the record, I really like it so far.

But there’s one thing I want to do with feeds that SimplePie doesn’t support: Trimming output. It’s easy enough if you plan on stripping all markup from the output, but it gets trickier if you want to keep it. This feature is planned for the 2.0 release, which may or may not have to do with the “approximately 100 Jillion-Kabillion-Bazillion support questions” on the subject. (Now that’s a user base!)

Needing a solution that would work in the interim, I searched the Internet for a while and found an interesting function to truncate text and keep the HTML by Jonas Raoni Soares Silva. It worked well given good input, but it inserted extra end tags when the input contained improperly nested tags or missing end tags inside correctly paired tags.

Figuring that the last thing I need is a bunch of rogue </div> end tags breaking my layout, I decided to see if I could come up with something on my own. Here’s what I came up with:

function trim_html($string, $length = null, $suffix = '&hellip;'){

	// Nothing to do for empty input (loadHTML rejects an empty string).
	if ($string === ''){
		return '';
	} // endif
	// Trim only if a number was given for $length and the string is actually longer than that.
	if (is_numeric($length) && strlen($string) > $length){
		$string = substr($string, 0, (int) $length); // Keep only the first $length bytes of $string.
		$string = preg_replace('/<[^>]*$/', '', $string); // Drop a tag that was cut off mid-way.
		$string .= $suffix; // Since we trimmed, add the ellipsis or other specified suffix.
	} // endif
	// Next, create a DOM document from the trimmed string.
	// The DOMDocument will correct errors such as unclosed tags when loading the HTML.
	$dom = new DOMDocument();
	@$dom->loadHTML($string); // This can produce lots of warnings, so suppress them.
	$string = $dom->saveHTML();
	// Remove the extra HTML (doctype, <html>, <body>) added by saveHTML.
	$string = preg_replace('/^.*<body[^>]*>/is', '', $string);
	$string = preg_replace('/<\/body[^>]*>.*$/is', '', $string);
	return $string; // Hand the cleaned-up markup back to the caller.
} // end trim_html()

Of course, there are some issues. (Nothing’s ever easy.) For one thing, the DOMDocument::saveHTML method in recent versions of PHP (5.3.6 and up) lets you pass a DOMNode and get back only “a subset of the document.” I would have liked to try this, as the regular expressions seem hackish to me for some reason. However, my test environment is running 5.3.5, so I can’t test this.
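
For readers on PHP 5.3.6 or later, the node-based approach might look roughly like the following. This is only a sketch under that version assumption: body_inner_html is a hypothetical helper name, not something from SimplePie or from the function above, and it serializes each child of <body> with saveHTML($node) instead of stripping the wrapper with regular expressions.

```php
<?php
// Hypothetical helper: returns the repaired markup found inside <body>.
// Assumes PHP 5.3.6+, where DOMDocument::saveHTML() accepts a DOMNode.
function body_inner_html($html) {
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings caused by malformed markup
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // the parser found nothing to put in a body
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child); // serialize one child node at a time
    }
    return $out;
}

// Example input with a missing </div> and badly nested <b>/<i> tags.
echo body_inner_html('<div><b>bold <i>nested</b> text');
```

The trade-off is a loop over the child nodes in exchange for dropping both regular expressions, which also means the result no longer depends on what the doctype and <html> wrapper happen to look like.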

Another issue is that, since a text node can’t be a child of <body>, loadHTML wraps orphaned text nodes in paragraphs. So trim_html('test') would return <p>test</p>. I suppose I could add code to wrap input in a <div> and then tweak the regular expressions to strip those out, should the need arise. For now, the extra paragraphs don’t bother me, so I’ll leave them alone.
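
The <div> idea could be sketched along these lines. The name repair_fragment is hypothetical, and the approach assumes the input does not contain a stray </div> that would close the wrapper early; if it does, the wrapper pattern may not match and the result should be checked by hand.

```php
<?php
// Hypothetical sketch: wrap the input in a <div> so bare text is not
// wrapped in <p>, then strip the <body> wrapper and the <div> again.
function repair_fragment($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML('<div>' . $html . '</div>'); // suppress parser warnings
    $out = $dom->saveHTML();
    // Same two patterns as before: drop everything outside <body>.
    $out = preg_replace('/^.*<body[^>]*>/is', '', $out);
    $out = preg_replace('/<\/body[^>]*>.*$/is', '', $out);
    $out = trim($out);
    // Remove the wrapper only if it still encloses the whole result.
    if (preg_match('/^<div>(.*)<\/div>$/is', $out, $m)) {
        $out = $m[1];
    }
    return $out;
}

echo repair_fragment('test');
```

Because the wrapper is a block-level element, the parser has no reason to add a paragraph around plain text, while unclosed inline tags inside the wrapper should still get their end tags.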

Here’s hoping that this is of some use to someone out there. If anyone has any suggestions for improvements, I’d love to hear them.

Categories: Programming