Extract Content From ODF files using C#

#odf #csharp #libreoffice #xml

LibreOffice is a popular open-source office suite for creating and editing documents, presentations, spreadsheets, and more. It saves files in the OpenDocument Format (ODF), using extensions such as .odt for text documents and .ods for spreadsheets.

Our requirement is to open these ODF files and extract content from them. We can't read that content directly as text with the File class of the System.IO namespace, because an ODF file is a ZIP package rather than plain text. So what should we do?

Right! If we can get the file's data as XML, we can extract the content from that XML. Let's take a text document from LibreOffice Writer, which has the extension ".odt", as an example.
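To see why the XML is within reach, it helps to look inside the file: an ODF document is a ZIP package. Here is a minimal sketch that lists the entries of such a package with System.IO.Compression. The class and method names (OdfPackageInspector, ListEntries) are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class OdfPackageInspector
{
    // Returns the full names of all entries in an ODF package (a ZIP archive).
    public static List<string> ListEntries(Stream odfStream)
    {
        var names = new List<string>();

        // leaveOpen: true keeps the caller's stream usable after we are done.
        using (var archive = new ZipArchive(odfStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }

        return names;
    }
}
```

For a typical Writer document, the list should include entries such as mimetype, content.xml, styles.xml, meta.xml and META-INF/manifest.xml.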

Step - 1: Open the ODF file as a ZIP archive to get its data in XML format.

Step - 2: Pick out the .xml entries and read their XML data. The "content.xml" file contains the content/text of the document.

Step - 3: Extract the text content from the XML data.
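Since Step 2 notes that "content.xml" holds the text, that step can be narrowed to a single entry lookup. The following is a rough sketch under that assumption. OdfContentReader and ReadContentXml are hypothetical names, and returning an empty string for a missing entry is just one possible choice.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class OdfContentReader
{
    // Returns the raw XML of content.xml, or an empty string if the entry is missing.
    public static string ReadContentXml(Stream odfStream)
    {
        using (var archive = new ZipArchive(odfStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // GetEntry returns null when no entry has that exact name.
            var entry = archive.GetEntry("content.xml");
            if (entry == null)
            {
                return string.Empty;
            }

            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Reading only this entry skips styles.xml, meta.xml and the manifest, so text from those files (for example headers and footers defined in styles.xml) would not appear in the result.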

=> Source Code :

using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

namespace Content
{
    public class ODF
    {
        public string ReadText(string filePath)
        {
            string textContent = "";

            // An ODF file is a ZIP archive, so open it read-only as one.
            using (ZipArchive zipArchive = ZipFile.OpenRead(filePath))
            {
                foreach (var entry in zipArchive.Entries)
                {
                    // Only the XML entries are parsed; "content.xml" holds the document body.
                    if (entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                    {
                        using StreamReader reader = new StreamReader(entry.Open());
                        string xmlContent = reader.ReadToEnd();
                        textContent += ExtractTextFromXml(xmlContent);
                    }
                }
            }

            return textContent; // output: the extracted text content
        }

        public string ExtractTextFromXml(string xmlContent)
        {
            string textContent = "";

            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(xmlContent);

            // Register the namespaces required for the type of document being read.
            XmlNamespaceManager nsManager = new XmlNamespaceManager(xmlDoc.NameTable);
            nsManager.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0"); // used by text documents (.odt)
            nsManager.AddNamespace("office", "urn:oasis:names:tc:opendocument:xmlns:office:1.0"); // common to all ODF files

            foreach (XmlNode node in xmlDoc.SelectNodes("//text:p | //text:h", nsManager))
            {
                textContent += node.InnerText + Environment.NewLine;
            }

            return textContent;
        }
    }
}
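To make the XPath step more concrete, here is a small self-contained sketch that runs the same expression, "//text:p | //text:h", against a hand-written fragment. The fragment is an assumption made up for illustration and is far simpler than a real content.xml. XPathDemo and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical demo class, for illustration only.
public static class XPathDemo
{
    // A made-up fragment: one heading and one paragraph with a nested span.
    public const string SampleXml =
        "<office:text xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\" " +
        "xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">" +
        "<text:h>Title</text:h>" +
        "<text:p>Hello <text:span>world</text:span></text:p>" +
        "</office:text>";

    // Returns the InnerText of every text:p and text:h element.
    public static List<string> SelectHeadingsAndParagraphs(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0");

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//text:p | //text:h", ns))
        {
            // InnerText concatenates the text of the node and all its descendants.
            result.Add(node.InnerText);
        }

        return result;
    }
}
```

Because InnerText includes descendant text, formatting spans inside a paragraph are flattened into the paragraph's text rather than lost.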

=> The source code above demonstrates how to extract text content from an .odt document file. If you would like to extract content from other ODF files, such as spreadsheets (.ods) or presentations (.odp), you can find the necessary code in the repository mentioned below.
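As a rough idea of what the spreadsheet case can look like, here is a minimal sketch that is not taken from the repository. It assumes the content.xml of an .ods file has already been read into a string, and it walks the table rows and cells using the ODF "table" namespace. OdsTextSketch and ExtractRows are hypothetical names. The sketch also ignores repeated-cell attributes such as table:number-columns-repeated, so it is a starting point rather than a complete reader.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class OdsTextSketch
{
    // Returns one line per table row, with cell texts separated by tabs.
    public static string ExtractRows(string contentXml)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(contentXml);

        var ns = new XmlNamespaceManager(xmlDoc.NameTable);
        ns.AddNamespace("table", "urn:oasis:names:tc:opendocument:xmlns:table:1.0");
        ns.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0");

        var lines = new List<string>();
        foreach (XmlNode row in xmlDoc.SelectNodes("//table:table-row", ns))
        {
            var cells = new List<string>();
            foreach (XmlNode cell in row.SelectNodes("table:table-cell", ns))
            {
                // A cell can hold several paragraphs; join them with a space.
                var paragraphs = new List<string>();
                foreach (XmlNode p in cell.SelectNodes("text:p", ns))
                {
                    paragraphs.Add(p.InnerText);
                }
                cells.Add(string.Join(" ", paragraphs));
            }
            lines.Add(string.Join("\t", cells));
        }

        return string.Join("\n", lines);
    }
}
```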

=> GitHub Repository

Thank you for joining me on this journey of discovery and learning. If you found this blog post valuable and would like to connect further, I'd love to connect with you on LinkedIn. You can find me at LinkedIn

If you have thoughts, questions, or experiences related to this topic, please drop a comment below.
