Retrieving OuterHTML without InnerHTML in C#

Bruce Axtens

Recently I asked "How do I retrieve OuterHtml without the InnerHtml?" and this is the solution I came up with. It's C#, but it should translate well enough to most other languages.

I'm doing this because the project I'm working on involves checking third-party websites for back-links to our clients' websites. The information in the back-link may be wrong enough for us to disavow it.

The information I generate from this and other code in the project goes into an XML file and eventually into SQL Server. The HTML that contains the various identifying strings needs to be kept to a minimum. Why take the whole <table> if the identifier falls in the src of an <img> in the 52nd <tr>, 7th <td>?

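Here's the code:

// Assumes: using System.Linq; using HtmlAgilityPack;
private string OuterMinusInner(HtmlNode root)
{
    if (root == null)
        return string.Empty;

    // Snapshot the non-text children with ToList() first, so that
    // RemoveChild() doesn't modify the collection being enumerated.
    foreach (var nodeFromList in
        (from node
         in root.ChildNodes
         where node.NodeType != HtmlNodeType.Text
         select node).ToList())
    {
        root.RemoveChild(nodeFromList);
    }

    // Only root's own tag and its direct text nodes remain.
    return root.OuterHtml;
}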

The method signature defines a single parameter root as an HtmlNode. The method will return a string.

Next, the method tests for root being null and if it is, the method returns an empty string to the caller.

Next comes some LINQ code. I'm fairly new to LINQ. I've known about it for years, but only really got into it after working through some of the tasks on the C# track at Exercism.

The LINQ query "from node in root.ChildNodes where node.NodeType != HtmlNodeType.Text select node" gets all of the child nodes of root whose NodeType is anything other than Text (viz. Element, Document or Comment).

The results of the query are committed to a List (of HtmlNode) using .ToList(). This is important. LINQ queries are evaluated lazily, so without .ToList() the query would still be enumerating root.ChildNodes while the subsequent .RemoveChild() changes that same collection, and the code would fail at run-time (typically with an InvalidOperationException complaining that the collection was modified).
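The same snapshot-before-removing idea can be seen without any HTML at all. The following is a minimal sketch using a plain List<int> as a stand-in for root.ChildNodes; the class and method names are made up for illustration and are not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Removes items while the lazy query is still walking the list.
    // Returns true if that raised InvalidOperationException.
    public static bool RemoveWhileEnumeratingThrows()
    {
        var numbers = new List<int> { 1, 2, 3, 4, 5 };
        try
        {
            // Where() is lazy: it enumerates `numbers` on demand,
            // so Remove() changes the list underneath it.
            foreach (var n in numbers.Where(x => x % 2 != 0))
            {
                numbers.Remove(n);
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Same removal, but ToList() copies the matches first,
    // so the loop enumerates the copy and not `numbers`.
    public static List<int> RemoveFromSnapshot()
    {
        var numbers = new List<int> { 1, 2, 3, 4, 5 };
        foreach (var n in numbers.Where(x => x % 2 != 0).ToList())
        {
            numbers.Remove(n);
        }
        return numbers;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveWhileEnumeratingThrows());
        Console.WriteLine(string.Join(",", RemoveFromSnapshot()));
    }
}
```

The first method fails on the second iteration because the list changed after the enumerator was created; the second succeeds because the loop only ever reads the snapshot.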

The foreach takes each HtmlNode from the List returned by .ToList(), puts it into nodeFromList, and removes that node from root with root.RemoveChild(nodeFromList).

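When all the non-Text nodes have been removed from root, the method ends, returning the OuterHtml of root. Note that this modifies root itself, so the node that was passed in loses its non-Text children in the parsed document as well.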
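If the parsed document is needed again afterwards, one option is to do the removal on a copy of the node. The following is a rough sketch of that variation, not the code used in the project. It assumes the HtmlNode type comes from the HtmlAgilityPack NuGet package, and the class name, method name and sample markup are invented for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package

public static class OuterMinusInnerDemo
{
    // Hypothetical variant: same filtering as OuterMinusInner,
    // but applied to a deep clone so the original node is untouched.
    public static string OuterMinusInnerCopy(HtmlNode root)
    {
        if (root == null)
            return string.Empty;

        var copy = root.CloneNode(true); // deep clone of root and its children

        foreach (var child in copy.ChildNodes
                     .Where(n => n.NodeType != HtmlNodeType.Text)
                     .ToList())
        {
            copy.RemoveChild(child);
        }

        return copy.OuterHtml;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"box\">one<span>a</span>two<span>b</span></div>");

        var div = doc.DocumentNode.SelectSingleNode("//div");

        // The div's own tag plus its direct text, without the spans.
        Console.WriteLine(OuterMinusInnerCopy(div));

        // The original node still has all of its children.
        Console.WriteLine(div.ChildNodes.Count);
    }
}
```

The trade-off is an extra clone per call, which may matter when scanning many large pages; when the document is thrown away after extraction, removing the children in place is simpler.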

Example:
This

<ul class="menu medium-horizontal vertical accordion-menu" id="menu-header-1" role="menu" aria-multiselectable="true" data-responsive-menu="accordion medium-dropdown" data-close-on-click-inside="false" data-accordion-menu="ljy0ut-accordion-menu"><li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-9905" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-9910" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-6785" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-7202" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page current_page_parent menu-item-11332" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-10938" role="menuitem"></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-31" role="menuitem"></li>
</ul>

becomes this

<ul class="menu medium-horizontal vertical accordion-menu" id="menu-header-1" role="menu" aria-multiselectable="true" data-responsive-menu="accordion medium-dropdown" data-close-on-click-inside="false" data-accordion-menu="ljy0ut-accordion-menu">







</ul>

Posted on Oct 31 '19 by:

Bruce Axtens

@bugmagnet

Programmed Canon Canola calculators in 1977. Assorted platforms and languages ever since. Assisting with HOPL.info. I am NOT looking for work -- I've got more than enough to do.
