Brandon Rozek

Originally published at brandonrozek.com

Capturing Quoted Strings in Sed

Disclaimer: This post assumes some knowledge of regular expressions.

Recently I was trying to capture an HTML attribute in sed. For example, let’s say I want to extract the href attribute in the following example:

<a href="https://brandonrozek.com" rel="me"></a>


Advice you commonly see on the Internet is to use a capture group for anything between the quotes of the href.

In regular expression land, we can represent any sequence of characters as .* and define a capture group around some regular expression X as \(X\). The captured text is then available in the replacement as \1.

sed "s/.*href=\"\(.*\)\".*/\1/g"


What does this look like for our input?

echo \<a href=\"https://brandonrozek.com\" rel=\"me\"\>\</a\> |\
sed "s/.*href=\"\(.*\)\".*/\1/g"


https://brandonrozek.com" rel="me


Because .* is greedy, the capture runs all the way to the last " on the line instead of stopping at the quote that closes the href! What we want is not to match any character at all, but to match any character that is not a quotation mark: [^\"]*

sed "s/.*href=\"\([^\"]*\)\".*/\1/g"


This then works for our example:

echo \<a href=\"https://brandonrozek.com\" rel=\"me\"\>\</a\> |\
sed "s/.*href=\"\([^\"]*\)\".*/\1/g"


https://brandonrozek.com
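To see the two patterns side by side, here is a minimal sketch that runs both against the same input. The variable names (line, greedy, bounded) are my own for illustration, and I use single quotes around the sed script so the inner double quotes don't need escaping.

```shell
# Same example input as above, single-quoted so no escaping is needed.
line='<a href="https://brandonrozek.com" rel="me"></a>'

# Greedy: \(.*\) keeps going until the last quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/.*href="\(.*\)".*/\1/')

# Negated class: \([^"]*\) stops at the first quote after href=".
bounded=$(printf '%s\n' "$line" | sed 's/.*href="\([^"]*\)".*/\1/')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The first line printed should contain the trailing rel attribute, while the second should contain only the URL.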


Within a bash script, we can make this a little more readable by using multiple variables.

QUOTED_STR="\"\([^\"]*\)\""
BEFORE_TEXT=".*href=$QUOTED_STR.*"
AFTER_TEXT="\1"
REPLACE_EXPR="s/$BEFORE_TEXT/$AFTER_TEXT/g"

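# Inside double quotes only the inner quotes need escaping;
# \< and \> would keep their backslashes literally.
INPUT="<a href=\"https://brandonrozek.com\" rel=\"me\"></a>"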

echo "$INPUT" | sed "$REPLACE_EXPR"
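If you want to reuse this for other attributes, one possible sketch is to wrap the expression in a small function. Note that extract_attr is a hypothetical helper name of my own, and it assumes the attribute name contains no regular expression metacharacters and that the value is double-quoted. I also swap the g flag for sed -n with the p flag so that lines without the attribute print nothing.

```shell
# extract_attr is a hypothetical helper, not something sed provides.
# Usage: extract_attr ATTRIBUTE_NAME < input
extract_attr() {
    attr="$1"
    # -n suppresses the default output; the p flag prints a line only
    # when the substitution matched, so non-matching lines are dropped.
    sed -n "s/.*${attr}=\"\([^\"]*\)\".*/\1/p"
}

html='<a href="https://brandonrozek.com" rel="me"></a>'
printf '%s\n' "$html" | extract_attr href
printf '%s\n' "$html" | extract_attr rel
```

Since the leading .* is greedy, a line with several occurrences of the same attribute would yield the last one, and a name like href would also match inside data-href. For anything beyond simple one-off extraction, a real HTML parser is likely the safer choice.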

