Serge Artishev

Posted on Jun 30

Building a DOCX to Markdown Converter with Node.js

#markdown #converter #node #tutorial

Welcome to a step-by-step guide on building a DOCX to Markdown converter using Node.js. This project is a great way to learn about file manipulation, command-line interfaces, and converting document formats. By the end of this post, you'll have a tool that not only converts DOCX files to Markdown but also extracts images and formats tables. Let's dive in!

Introduction
Setting Up the Project
Basic DOCX to HTML Conversion
Converting HTML to Markdown
Extracting Images
Formatting Tables
Conclusion

Introduction

Markdown is a lightweight markup language with plain text formatting syntax. It's widely used for documentation due to its simplicity and readability. However, many documents are created in DOCX format, especially in corporate environments. Converting these documents to Markdown can be tedious if done manually. This is where our converter comes in handy.

Setting Up the Project

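First, let's create a new directory for our project and initialize it with npm. The scripts in this guide use ES module import syntax, so we also set "type": "module" in package.json.

mkdir docx-to-md-converter
cd docx-to-md-converter
npm init -y
npm pkg set type=module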

Next, we'll install the necessary dependencies. We'll use mammoth for converting DOCX to HTML, turndown for converting HTML to Markdown, commander for building the CLI, and uuid for unique image names.

npm install mammoth turndown commander uuid

Create a new file named index.js in your project directory. This will be the main file for our converter.

touch index.js

Basic DOCX to HTML Conversion

Let's start by writing a simple script to convert DOCX files to HTML. We'll use the mammoth library for this.

Open index.js and add the following code:

#!/usr/bin/env node

import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import { program } from 'commander';

program
  .version('1.0.0')
  .description('Convert DOCX to HTML')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output HTML file (default: same as input with .html extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToHtml(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

async function convertDocxToHtml(inputFile, outputFile) {
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.html`);
  }

  const result = await mammoth.convertToHtml({ path: inputFile });
  await fs.promises.writeFile(outputFile, result.value);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}

This script uses commander to parse command-line arguments, mammoth to convert DOCX to HTML, and fs to write the output to a file. The first line, #!/usr/bin/env node, is a shebang that lets the shell run the file directly with Node.js.

Make sure the script has execute permissions:

chmod +x index.js

Now you can run the script to convert a DOCX file to HTML:

node index.js example.docx example.html
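
The script above only uses result.value. Mammoth's result object also carries a messages array with warnings raised during conversion, for example about paragraph styles it did not recognise. The following is a minimal sketch of how those could be surfaced. The helper name formatMammothMessages is hypothetical, and the { type, message } shape follows the mammoth README, so check it against the version you install.

```javascript
// Hypothetical helper: turns mammoth's result.messages into printable lines.
// Assumes each entry looks like { type: 'warning' | 'error', message: '...' }.
function formatMammothMessages(messages = []) {
  return messages.map(({ type, message }) => `[${type}] ${message}`);
}

// Example with a hand-written message list (not real mammoth output):
const sample = [
  { type: 'warning', message: "Unrecognised paragraph style: 'Intro'" },
];
formatMammothMessages(sample).forEach((line) => console.warn(line));
```

Inside convertDocxToHtml, this could be called as formatMammothMessages(result.messages) right after mammoth.convertToHtml resolves.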

Converting HTML to Markdown

Next, we'll add the functionality to convert HTML to Markdown using turndown.

We already installed turndown during setup. If you skipped that step, install it now:

npm install turndown

Update index.js to include the HTML to Markdown conversion:

#!/usr/bin/env node

import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import TurndownService from 'turndown';
import { program } from 'commander';

program
  .version('1.0.0')
  .description('Convert DOCX to Markdown')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output Markdown file (default: same as input with .md extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToMarkdown(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

async function convertDocxToMarkdown(inputFile, outputFile) {
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.md`);
  }

  const result = await mammoth.convertToHtml({ path: inputFile });
  const turndownService = new TurndownService();
  const markdown = turndownService.turndown(result.value);

  await fs.promises.writeFile(outputFile, markdown);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}
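
The code above constructs TurndownService with its defaults, which produce underlined (setext) headings for the first two heading levels and indented code blocks. If you prefer a different flavour of Markdown, the constructor accepts an options object. The sketch below shows one possible set of preferences. The option names come from the turndown README, while the chosen values are only an example.

```javascript
// Options accepted by the TurndownService constructor (see the turndown README).
// The values here are one possible preference, not a requirement.
const turndownOptions = {
  headingStyle: 'atx',      // "# Heading" instead of underlined (setext) headings
  codeBlockStyle: 'fenced', // fenced code blocks instead of indented ones
  bulletListMarker: '-',    // "-" instead of the default "*"
};

// In index.js this would replace `new TurndownService()`:
// const turndownService = new TurndownService(turndownOptions);
console.log(turndownOptions);
```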

Now you can convert DOCX files to Markdown:

node index.js example.docx example.md
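
When no output file is given, the script derives one with path.basename(inputFile, '.docx'). That suffix match is case-sensitive, so an input named Report.DOCX would produce Report.DOCX.md. A minimal sketch of a more forgiving alternative follows. The helper name defaultOutputPath is hypothetical and is not part of the original script.

```javascript
// Hypothetical helper: swaps a trailing .docx (any letter case) for a new extension.
// If the input has no .docx suffix, the new extension is simply appended.
function defaultOutputPath(inputFile, newExtension) {
  return inputFile.replace(/\.docx$/i, '') + newExtension;
}

console.log(defaultOutputPath('reports/Q3 Summary.DOCX', '.md')); // reports/Q3 Summary.md
console.log(defaultOutputPath('notes.docx', '.md'));              // notes.md
```

Because the helper only rewrites the end of the string, the directory part of the path is left untouched and no separate path.dirname call is needed.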

Extracting Images

DOCX files often contain images that we need to handle. We'll extract these images and save them to a folder, updating the image links in the Markdown file.

Update index.js to include image extraction:

#!/usr/bin/env node

import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import TurndownService from 'turndown';
import { program } from 'commander';
import { v4 as uuidv4 } from 'uuid';

program
  .version('1.0.0')
  .description('Convert DOCX to Markdown with image extraction')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output Markdown file (default: same as input with .md extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToMarkdown(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

async function convertDocxToMarkdown(inputFile, outputFile) {
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.md`);
  }

  const imageDir = path.join(path.dirname(outputFile), 'images');
  if (!fs.existsSync(imageDir)) {
    fs.mkdirSync(imageDir, { recursive: true });
  }

  const result = await mammoth.convertToHtml({ path: inputFile }, {
    convertImage: mammoth.images.imgElement(async (image) => {
      const buffer = await image.read();
      const extension = image.contentType.split('/')[1];
      const imageName = `image-${uuidv4()}.${extension}`;
      const imagePath = path.join(imageDir, imageName);
      await fs.promises.writeFile(imagePath, buffer);
      return { src: `images/${imageName}` };
    })
  });

  const turndownService = new TurndownService();
  const markdown = turndownService.turndown(result.value);

  await fs.promises.writeFile(outputFile, markdown);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}

Now, images will be extracted and saved in an images folder, and the Markdown file will contain the correct links to these images.
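
One detail worth noting: the file extension is taken from image.contentType.split('/')[1]. That works well for image/png, but it yields jpeg for JPEG files, svg+xml for SVG, and x-emf for the EMF images that Word documents sometimes embed. Below is a hedged sketch of a small lookup that could replace that line. The helper name extensionForContentType and the contents of the table are illustrative assumptions, not an exhaustive list.

```javascript
// Hypothetical helper: picks a file extension for an image content type.
// The lookup table is a small illustrative set, not an exhaustive list.
const EXTENSION_BY_CONTENT_TYPE = new Map([
  ['image/jpeg', 'jpg'],
  ['image/svg+xml', 'svg'],
  ['image/x-emf', 'emf'],
  ['image/x-wmf', 'wmf'],
]);

function extensionForContentType(contentType) {
  const normalized = String(contentType).trim().toLowerCase();
  const known = EXTENSION_BY_CONTENT_TYPE.get(normalized);
  if (known) {
    return known;
  }
  // Fall back to the subtype, keeping only characters safe in a file name.
  const subtype = normalized.split('/')[1] || 'bin';
  return subtype.replace(/[^a-z0-9]/g, '') || 'bin';
}

console.log(extensionForContentType('image/png'));
console.log(extensionForContentType('image/svg+xml'));
```

In the convertImage callback, the extension line could then read const extension = extensionForContentType(image.contentType);.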

Formatting Tables

The final feature we'll add is table formatting. DOCX files often contain tables that need to be correctly formatted in Markdown.

Update index.js to include table formatting:

#!/usr/bin/env node

import * as fs from 'fs';
import * as path from 'path';
import * as mammoth from 'mammoth';
import TurndownService from 'turndown';
import { program } from 'commander';
import { v4 as uuidv4 } from 'uuid';

program
  .version('1.0.0')
  .description('Convert DOCX to Markdown with image extraction and table formatting')
  .argument('<input>', 'Input DOCX file')
  .argument('[output]', 'Output Markdown file (default: same as input with .md extension)')
  .action(async (input, output) => {
    try {
      await convertDocxToMarkdown(input, output);
    } catch (error) {
      console.error('Error:', error);
      process.exit(1);
    }
  });

program.parse(process.argv);

function createMarkdownTable(table) {
  const rows = Array.from(table.rows);
  if (rows.length === 0) return '';

  const headers = Array.from(rows[0].cells).map(cell => cell.textContent?.trim() || '');
  const markdownRows = rows.slice(1).map(row => 
    Array.from(row.cells).map(cell => cell.textContent?.trim() || '')
  );

  let markdown = '| ' + headers.join(' | ') + ' |\n';
  markdown += '| ' + headers.map(() => '---').join(' | ') + ' |\n';
  markdownRows.forEach(row => {
    markdown += '| ' + row.join(' | ') + ' |\n';
  });

  return markdown;
}

async function convertDocxToMarkdown(inputFile, outputFile) {
  if (!outputFile) {
    outputFile = path.join(path.dirname(inputFile), `${path.basename(inputFile, '.docx')}.md`);
  }

  const imageDir = path.join(path.dirname(outputFile), 'images');
  if (!fs.existsSync(imageDir)) {
    fs.mkdirSync(imageDir, { recursive: true });
  }

  const result = await mammoth.convertToHtml({ path: inputFile }, {
    convertImage: mammoth.images.imgElement(async (image) => {
      const buffer = await image.read();
      const extension = image.contentType.split('/')[1];
      const imageName = `image-${uuidv4()}.${extension}`;
      const imagePath = path.join(imageDir, imageName);
      await fs.promises.writeFile(imagePath, buffer);
      return { src: `images/${imageName}` };
    })
  });

  let html = result.value;

  const turndownService = new TurndownService();

  turndownService.addRule('table', {
    filter: 'table',
    replacement: function(content, node) {
      return '\n\n' + createMarkdownTable(node) + '\n\n';
    }
  });

  const markdown = turndownService.turndown(html);

  await fs.promises.writeFile(outputFile, markdown);
  console.log(`Conversion complete. Output saved to ${outputFile}`);
}
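
createMarkdownTable places each cell's text directly between pipe characters. If a cell itself contains a pipe or a line break, the resulting Markdown row will be split in the wrong place. The following is a minimal sketch of a cell sanitiser that could be applied before joining. The helper name escapeTableCell is hypothetical, and collapsing line breaks to spaces is one possible choice among several.

```javascript
// Hypothetical helper: makes cell text safe to place between Markdown table pipes.
function escapeTableCell(text) {
  return String(text ?? '')
    .replace(/\s+/g, ' ')   // a newline inside a cell would end the table row
    .trim()
    .replace(/\|/g, '\\|'); // a bare pipe would start a new column
}

console.log(escapeTableCell('  yes | no\nmaybe  '));
```

In createMarkdownTable, escapeTableCell(cell.textContent) could stand in for cell.textContent?.trim() || '' in both the header and the body rows.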

Conclusion

In this blog post, we built a DOCX to Markdown converter step by step, adding features like image extraction and table formatting. This tool demonstrates the power and flexibility of Node.js for handling file manipulations and conversions.

The source code for this project is available on GitHub, where you can find the latest updates, contribute to the project, and explore further enhancements.

Thank you for following along with this guide. Happy coding!

Top comments (1)

Darren Cooper • Oct 10

Thanks for this - I have been thinking about this very task. Incredibly useful
