DEV Community

t3mplar

Dealing with huge XML-like files containing illegal characters

Sometimes you have to deal with files that look a lot like XML but are not, because they contain many illegal characters. Such files are usually produced by simply concatenating strings, without validating the result against any schema. When these files are comparatively small, it's not a big deal to use a regex to replace all those characters, but for really big files that's not very convenient, because the whole file has to be held in memory as a single string.
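For a file small enough to fit in memory, the regex approach can be sketched roughly as follows. This is only an illustration, not code from a real project: the sample document and the `item` element name are made up, and the pattern is the one used in the streaming method further down, which only covers hexadecimal character references.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical sample: the &#x1; reference is not a legal XML 1.0 character.
var raw = "<root><item>a&#x1;b</item><item>ok</item></root>";

// Same pattern as in the streaming method of this post; hexadecimal references only.
var illegalCharRefs = new Regex(
    @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);",
    RegexOptions.IgnoreCase);

// For a real file, raw would come from File.ReadAllText(path), which loads everything at once.
var cleaned = illegalCharRefs.Replace(raw, string.Empty);

// The cleaned string is now well-formed and can be parsed as a whole.
var doc = XDocument.Parse(cleaned);
foreach (var item in doc.Descendants("item"))
{
    Console.WriteLine(item.Value);
}
```

The whole document ends up in memory twice here (once as the raw string, once as the cleaned copy), which is why this does not scale to very large files.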

The idea is to read the file element by element. If the next fragment is not valid XML, I strip the illegal characters from it and then use the resulting well-formed XML for my needs.

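```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hexadecimal character references that are illegal (or discouraged) in XML 1.0.
private static readonly Regex IllegalCharRefs = new Regex(
    @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

public IEnumerable<XElement> GetElement(string filePath, string elementName)
{
    // CheckCharacters = false lets the reader pass illegal character references through.
    using var reader = XmlReader.Create(filePath, new XmlReaderSettings { CheckCharacters = false });
    reader.MoveToContent();

    while (!reader.EOF)
    {
        if (reader.NodeType != XmlNodeType.Element || reader.Name != elementName)
        {
            reader.Read();
            continue;
        }

        // ReadOuterXml already moves the reader past the element's end tag,
        // so calling Read() again here would skip an adjacent element.
        var str = reader.ReadOuterXml();
        XElement element;

        try
        {
            // Fast path: the fragment is already well-formed.
            element = XElement.Parse(str);
        }
        catch (XmlException)
        {
            // Strip the illegal character references and parse again.
            element = XElement.Parse(IllegalCharRefs.Replace(str, string.Empty));
        }

        yield return element;
    }
}
```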
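To show how the method might be called, here is a small self-contained sketch. The temporary file, its contents, and the `item` element name are assumptions made up for the example. The method is repeated in a condensed variant so the snippet compiles on its own; this variant only calls `Read()` when no element was consumed, because `ReadOuterXml()` already moves the reader past the element it returns.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical sample file: two adjacent <item> elements, the first with an illegal reference.
var path = Path.GetTempFileName();
File.WriteAllText(path, "<root><item>a&#x1;b</item><item>ok</item></root>");

try
{
    // Elements are produced lazily, one at a time, while the file is being read.
    foreach (var element in GetElement(path, "item"))
    {
        Console.WriteLine(element.Value);
    }
}
finally
{
    File.Delete(path);
}

// Condensed variant of the method from this post, repeated so the sketch is runnable.
static IEnumerable<XElement> GetElement(string filePath, string elementName)
{
    var illegalCharRefs = new Regex(
        @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);",
        RegexOptions.IgnoreCase);

    using var reader = XmlReader.Create(filePath, new XmlReaderSettings { CheckCharacters = false });
    reader.MoveToContent();

    while (!reader.EOF)
    {
        if (reader.NodeType != XmlNodeType.Element || reader.Name != elementName)
        {
            reader.Read();
            continue;
        }

        // ReadOuterXml leaves the reader on the node after the element.
        var str = reader.ReadOuterXml();
        XElement element;
        try
        {
            element = XElement.Parse(str);
        }
        catch (XmlException)
        {
            element = XElement.Parse(illegalCharRefs.Replace(str, string.Empty));
        }

        yield return element;
    }
}
```

Because the method is an iterator, only one element's worth of XML is materialized at a time; calling `ToList()` on the result would give up that benefit.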

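Important note:
To skip the check for invalid characters, pass XmlReaderSettings { CheckCharacters = false } to XmlReader.Create. The reader then omits the check, which gives me the chance to clean up the input string myself. According to the .NET documentation, this setting turns off checking for character entity references such as &#x1;, which is the form the regex above targets.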
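The effect of the setting can be seen in isolation with a sketch like the one below. The `TryReadAll` helper and the sample string are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical sample containing a character reference that XML 1.0 does not allow.
var xml = "<root><item>a&#x1;b</item></root>";

Console.WriteLine(TryReadAll(xml, checkCharacters: true));
Console.WriteLine(TryReadAll(xml, checkCharacters: false));

// Reads the whole document and reports whether the reader accepted it.
static bool TryReadAll(string xml, bool checkCharacters)
{
    var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
    using var reader = XmlReader.Create(new StringReader(xml), settings);
    try
    {
        while (reader.Read())
        {
            // Nothing to do: we only care whether reading throws.
        }
        return true;
    }
    catch (XmlException)
    {
        return false;
    }
}
```

With the default setting the reader should reject the document with an XmlException, while with CheckCharacters = false it should read it to the end, leaving the cleanup to the calling code.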
