Jay Malli

Extract Content From ODF files using C#

LibreOffice is a popular open-source office suite for creating and editing documents, presentations, spreadsheets, and more. It saves files in the OpenDocument Format (ODF), using extensions such as .odt for text documents and .ods for spreadsheets.

Our requirement is to open ODF files and extract their content. We can't do that by reading the file as plain text with the File class of the System.IO namespace, so what should we do?

ODF documents are stored in an XML-based format, packaged inside a ZIP archive.

That's right! If we can get the file's data as XML, we can extract the content from that XML. Let's take a LibreOffice Writer document, which has the extension ".odt", as an example.

Step - 1 : Get the ODF file data in XML format

[Image: code to get the XML data from the file, with its output]
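As a rough sketch of this step: an ODF file is a ZIP archive, so we can open it with ZipFile and look at what is inside. The class name OdfPackage and the method name ListEntries are made up for this illustration and are not part of the original code.

```csharp
using System.Collections.Generic;
using System.IO.Compression;

// Hypothetical helper for illustration only.
public static class OdfPackage
{
    // Returns the names of all entries stored inside the ODF package.
    public static List<string> ListEntries(string filePath)
    {
        var names = new List<string>();

        // An ODF file is a ZIP archive, so ZipFile can open it directly.
        using (ZipArchive archive = ZipFile.OpenRead(filePath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }

        return names;
    }
}
```

For a Writer document the list typically includes entries such as content.xml, styles.xml, meta.xml and META-INF/manifest.xml.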

Step - 2 : Pick out the .xml files and read their XML data. The "content.xml" file contains the content/text of the document.

[Image: code to pick out the content.xml file, with its output]
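If only the document body is needed, one option is to read "content.xml" directly instead of looping over every entry. This is a minimal sketch under that assumption; OdfContent and ReadContentXml are illustrative names, not from the original code.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration only.
public static class OdfContent
{
    // Returns the raw XML stored in the package's content.xml entry.
    public static string ReadContentXml(string filePath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(filePath))
        {
            // GetEntry returns null when the archive has no such entry.
            ZipArchiveEntry entry = archive.GetEntry("content.xml");
            if (entry == null)
            {
                throw new InvalidDataException("content.xml not found; is this an ODF file?");
            }

            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Throwing when the entry is missing is a design choice for this sketch; returning an empty string would work just as well if you prefer to skip invalid files silently.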

Step - 3 : Now extract the text content from the XML data.

[Image: code to extract text from the XML data, with its output]
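To make this step concrete, here is a small sketch that pulls the paragraphs (text:p) and headings (text:h) out of an XML string. The names OdfText and Extract are hypothetical, and a real content.xml is far larger than anything you would type by hand.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only.
public static class OdfText
{
    private const string TextNs = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Returns the text of every text:p and text:h element, one per line.
    public static string Extract(string xmlContent)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(xmlContent);

        // The "text" prefix must be registered before it can be used in XPath.
        var nsManager = new XmlNamespaceManager(xmlDoc.NameTable);
        nsManager.AddNamespace("text", TextNs);

        var nodes = xmlDoc.SelectNodes("//text:p | //text:h", nsManager);
        if (nodes == null)
        {
            return string.Empty;
        }

        var result = new StringBuilder();
        foreach (XmlNode node in nodes)
        {
            // InnerText also includes text from nested elements such as text:span.
            result.AppendLine(node.InnerText);
        }

        return result.ToString();
    }
}
```

You would call it with the XML from the previous step, for example `string text = OdfText.Extract(xml);`.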

=> Source Code :

using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

namespace Content
{
    public class ODF
    {
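        // Opens the ODF file and returns the text found in its XML parts.
        public string ReadText(string filePath)
        {
            string textContent = "";

            // An ODF file is a ZIP archive, so open it as one.
            using (ZipArchive zipArchive = ZipFile.OpenRead(filePath))
            {
                foreach (var entry in zipArchive.Entries)
                {
                    // "content.xml" holds the document body; this loop also
                    // visits the other .xml parts (styles, metadata, manifest).
                    if (entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                    {
                        using StreamReader reader = new StreamReader(entry.Open());
                        string xmlContent = reader.ReadToEnd();
                        textContent += ExtractTextFromXml(xmlContent);
                    }
                }
            }

            return textContent; // output: text content
        }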

        public string ExtractTextFromXml(string xmlContent)
        {
            string textContent = "";

            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(xmlContent);

            // Register the namespaces required for the different types of documents
            XmlNamespaceManager nsManager = new XmlNamespaceManager(xmlDoc.NameTable);
            nsManager.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0"); // for document files with the extension .odt
            nsManager.AddNamespace("office", "urn:oasis:names:tc:opendocument:xmlns:office:1.0"); // common to all ODF files

            foreach (XmlNode node in xmlDoc.SelectNodes("//text:p | //text:h", nsManager))
            {
                textContent += node.InnerText + Environment.NewLine;
            }

            return textContent;
        }
    }
}

=> The example above demonstrates how to extract text content from a .odt document file. If you would like to extract content from other ODF files, such as spreadsheets (.ods) or presentations (.odp), you can find the necessary code in the repository mentioned below.
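The repository's code is not reproduced here. Just to give an idea of the spreadsheet case, this is a rough sketch that assumes you already have the content.xml of an .ods file as a string. The names OdsText and ExtractRows are hypothetical, and the sketch ignores repeated or merged cells.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only; not taken from the repository.
public static class OdsText
{
    // Returns one tab-separated line of cell text per table row.
    public static List<string> ExtractRows(string contentXml)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(contentXml);

        var ns = new XmlNamespaceManager(xmlDoc.NameTable);
        ns.AddNamespace("table", "urn:oasis:names:tc:opendocument:xmlns:table:1.0");
        ns.AddNamespace("text", "urn:oasis:names:tc:opendocument:xmlns:text:1.0");

        var rows = new List<string>();
        foreach (XmlNode row in xmlDoc.SelectNodes("//table:table-row", ns))
        {
            var cells = new List<string>();
            foreach (XmlNode cell in row.SelectNodes("table:table-cell", ns))
            {
                // A cell's visible text is stored in text:p children.
                var paragraphs = new List<string>();
                foreach (XmlNode p in cell.SelectNodes("text:p", ns))
                {
                    paragraphs.Add(p.InnerText);
                }
                cells.Add(string.Join(" ", paragraphs));
            }
            rows.Add(string.Join("\t", cells));
        }

        return rows;
    }
}
```

Each string in the returned list is one row, so writing the list out line by line gives a simple tab-separated dump of the sheet.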

=> GitHub Repository

Thank you for joining me on this journey of discovery and learning. If you found this blog post valuable and would like to stay in touch, you can find me on LinkedIn.

If you have thoughts, questions, or experiences related to this topic, please drop a comment below.
