This is the approach I use to modify HTML with regular expressions in PHP.
HTML Tag Regex
/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/
The above RegExp can be broken down as:
- /: The regex opening delimiter.
- \<: Matches the < character that opens a tag.
- (?<tag>[a-z][a-z0-9\-]*): Matches a valid HTML tag name and captures it as tag. The name must start with a letter from a to z and may continue with letters from a to z, digits from 0 to 9, and the - character. The pattern has no i flag, so only lowercase tag names match.
- (\s+([\s\S]*?))?: Matches all of the tag's attributes, including the whitespace between them, but only if any are present.
- \/?: Matches the optional / character of a self-closing tag.
- \>: Matches the > character that closes the opening tag.
- (([\s\S]*?)\<\/(?P=tag)\>)?: Matches the content (text or HTML) inside the tag together with the closing tag, but only if the tag is not self-closing.
- /: The regex closing delimiter.
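To see which capture group holds what, here is a minimal sketch that runs the same pattern over a small sample. The sample markup and the $rows variable are made up for illustration; only the pattern comes from this post.

```php
<?php
// Same tag pattern as above.
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';

// Made-up sample: one element with content, one self-closing element.
$html = '<p class="intro">Hello</p><br/>';

preg_match_all( $tags_regexp, $html, $matches, PREG_SET_ORDER );

$rows = array();
foreach ( $matches as $match ) {
    $rows[] = array(
        'tag'        => $match['tag'],
        // Group 3 holds the raw attribute string, group 5 the inner content.
        // With PREG_SET_ORDER, trailing groups that did not take part in the
        // match are left out of the array, hence the ?? fallbacks.
        'attributes' => $match[3] ?? '',
        'content'    => $match[5] ?? '',
    );
}

print_r( $rows );
// $rows[0]: tag "p",  attributes 'class="intro"', content "Hello"
// $rows[1]: tag "br", attributes "",              content ""
```

Keep in mind that the content is matched lazily up to the first closing tag with the same name, so nested elements of the same type (a div inside a div, for example) will not be paired correctly.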
HTML Tag Attributes Regex
/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/
The above RegExp can be broken down as:
- /: The regex opening delimiter.
- ([\w\-]+): Matches the attribute's key/name.
- (\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?: Matches the attribute's value, which can be anything wrapped in single quotes (') or double quotes ("). The value can also be naked (not wrapped in quotes), in which case it may only contain letters from a to z or A to Z, digits from 0 to 9, _ (underscore), and - (hyphen). The whole group is optional, so it matches nothing for boolean attributes.
- /: The regex closing delimiter.
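The three value styles described above (quoted, naked, and boolean) can be checked with a short sketch. The attribute string and the $found and $parsed variables are assumptions made for this illustration.

```php
<?php
// Same attribute pattern as above.
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

// Made-up sample: double-quoted, single-quoted, naked and boolean attributes.
$attributes = 'src="a.png" alt=\'A photo\' width=200 hidden';

// Default flag (PREG_PATTERN_ORDER): $found[1] lists the names and
// $found[4] the values, with an empty string where no value was matched.
preg_match_all( $atts_regexp, $attributes, $found );

$parsed = array_combine( $found[1], $found[4] );

print_r( $parsed );
// src => a.png, alt => A photo, width => 200, hidden => (empty string)
```

Because of the branch reset group (?|...), the value always lands in group 4 whether or not it was quoted, which is why a single array_combine() call is enough.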
Example Usage
<?php
// HTML elements
$content = <<<EOL
<p>Text paragraph.</p>
<img src="http://example.com/image-200x320.png" width="200" height="320">
EOL;
// Tags matching RegExp
$tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';
// Attributes matching RegExp
$atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';
// Match all the valid elements in the HTML
preg_match_all( $tags_regexp, $content, $matches, PREG_SET_ORDER );
// Loop through and make the necessary changes
foreach ( $matches as $match ) {
    // We are going to modify only image tags
    if ( 'img' !== $match[ 'tag' ] ) continue;
    // Match all the attributes
    preg_match_all( $atts_regexp, $match[2], $atts_match );
    // Combine the keys and the values
    $atts_match = array_combine( $atts_match[1], $atts_match[4] );
    // Build back a HTML valid attributes
    $atts = '';
    foreach ( $atts_match as $name => $value ) {
        $atts .= sprintf( ' %s="%s"', $name, $value );
    }
    // Replacement for the tag
    $amp = sprintf( '<amp-img%s></amp-img>', $atts );
    // Replace the complete tag match with the new replacement
    $content = str_replace( $match[0], $amp, $content );
}
// The AMPifyed HTML
/**
 * <p>Text paragraph.</p>
 * <amp-img src="http://example.com/image-200x320.png" width="200" height="320"></amp-img>
 */
echo $content;
The above could also be improved to make a complete HTML-to-AMP converter for simple pages.
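One possible direction is sketched below, under stated assumptions: the ampify() function and the $amp_tags mapping are hypothetical names introduced for this example, and the mapping is illustrative rather than complete. The sketch swaps str_replace() for preg_replace_callback() and recurses into the captured content, because a single preg_match_all() pass never reports an element that sits inside another matched element (an img inside a p, for instance).

```php
<?php
// Hypothetical helper, not part of the example above.
function ampify( string $html ): string {
    $tags_regexp = '/\<(?<tag>[a-z][a-z0-9\-]*)(\s+([\s\S]*?))?\/?\>(([\s\S]*?)\<\/(?P=tag)\>)?/';
    $atts_regexp = '/([\w\-]+)(\s*\=\s*(?|(?<quot>[\'"])([\s\S]*?)(?P=quot)|(?<quot>)([\w\-]+)))?/';

    // Illustrative mapping (an assumption); extend it as needed.
    $amp_tags = array( 'img' => 'amp-img', 'video' => 'amp-video', 'iframe' => 'amp-iframe' );

    $result = preg_replace_callback(
        $tags_regexp,
        function ( array $match ) use ( $atts_regexp, $amp_tags ) {
            $tag     = $match['tag'];
            $closing = $match[4] ?? '';           // inner content plus closing tag, if any
            $inner   = ampify( $match[5] ?? '' ); // convert nested elements first

            if ( ! isset( $amp_tags[ $tag ] ) ) {
                // Not a tag we convert: keep its opening tag verbatim.
                $opening = substr( $match[0], 0, strlen( $match[0] ) - strlen( $closing ) );
                return '' === $closing ? $opening : $opening . $inner . '</' . $tag . '>';
            }

            // Rebuild the attributes with consistent double quotes.
            preg_match_all( $atts_regexp, $match[3] ?? '', $found );
            $atts = '';
            foreach ( array_combine( $found[1], $found[4] ) as $name => $value ) {
                $atts .= sprintf( ' %s="%s"', $name, $value );
            }

            return sprintf( '<%1$s%2$s>%3$s</%1$s>', $amp_tags[ $tag ], $atts, $inner );
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo ampify( '<p>Photo: <img src="a.png" width="200" height="320"></p>' );
// <p>Photo: <amp-img src="a.png" width="200" height="320"></amp-img></p>
```

This remains a sketch for simple pages: nested elements of the same type are still not paired correctly by the tag pattern, and a valid AMP page needs more than renamed tags.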