Hrishikesh Terdalkar

Devanagari Transliteration Pipeline for LaTeX

Devanagari is the fourth most widely adopted writing system in the world, primarily used in the Indian subcontinent. The script is used for more than 120 languages, notable among them Sanskrit, Hindi, Marathi, Pali and Nepali, along with several varieties of these languages.

Devanagari text can be transliterated in various standard schemes. Several input systems are based on these transliteration schemes and let users type the text easily. More often than not, a user has a preferred scheme for typing the input. At times, though, the text needs to be rendered in a different scheme in the PDF document.

In my case, I prefer using ibus-m17n to type text in Devanagari. While writing articles that contain Devanagari text, I also needed to render that text as IAST in the final PDF.
One could always learn to type in another input scheme, but that can get tedious. Transliterating each word with an online tool such as Aksharamukha is tedious too. So, I was looking for a way to type in Devanagari and have it rendered in IAST after PDF compilation. The solution I came up with consists of a small set of LaTeX commands that add custom syntax, and a Python transliteration script (based on the indic-transliteration package) that serves as a middle layer: it processes the LaTeX file and creates a new LaTeX file with the proper transliteration.

LaTeX Compilation System with Transliteration Support

There are two primary components to the system,

  1. LaTeX Syntax
  2. Transliteration Script

LaTeX Syntax

XeTeX (xelatex) and LuaTeX (lualatex) have good Unicode support and can be used to write Devanagari text. This example uses the setup with XeTeX.

We first add the required packages in the preamble of the LaTeX (.tex) file.

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% Devanagari Related Packages
\usepackage{fontspec, xunicode, xltxtra}

Using fontspec, we can define environments for font families, to write text in specific scripts. To write Devanagari text, one needs to have a Devanagari font available. (It is assumed here that one may need to write both in Devanagari as well as other transliteration schemes.)

For more on Devanagari fonts, see the Devanagari Fonts section of this document. This section assumes that the Sanskrit 2003 font is installed on the system.

To define the environments as mentioned earlier, we add the following lines in the preamble.

% Define Fonts
\newfontfamily\textskt[Script=Devanagari]{Sanskrit 2003}
\newfontfamily\textiast[Script=Latin]{Sanskrit 2003}

% Commands for Devanagari Transliterations
\newcommand{\skt}[1]{{\textskt{#1}}}
\newcommand{\iast}[1]{{\textiast{#1}}}
\newcommand{\Iast}[1]{{\textiast{#1}}}
\newcommand{\IAST}[1]{{\textiast{#1}}}

This provides us with four commands. \skt{} renders Devanagari text. \iast{}, \Iast{} and \IAST{} render Devanagari text in IAST in lower case, title case and upper case respectively. Note that from the perspective of the LaTeX engine, the commands \iast{}, \Iast{} and \IAST{} are identical. They differ only in name, which tells the Python script which transliteration and case modification to apply.
Note further that we can define new font families and new commands for any other valid scheme as required, which can give us additional commands such as \velthuis{}, \hk{} and so on.

Minimal Example

Equipped with these commands, and some Devanagari text, we have a minimal example as follows, stored in the file minimal.tex,

\documentclass[10pt]{article}

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% Devanagari Related Packages
\usepackage{fontspec, xunicode, xltxtra}

% Define Fonts
\newfontfamily\textskt[Script=Devanagari]{Sanskrit 2003}
\newfontfamily\textiast[Script=Latin]{Sanskrit 2003}

% Commands for Devanagari Transliterations
\newcommand{\skt}[1]{{\textskt{#1}}}
\newcommand{\iast}[1]{{\textiast{#1}}}
\newcommand{\Iast}[1]{{\textiast{#1}}}
\newcommand{\IAST}[1]{{\textiast{#1}}}

\title{Transliteration of Devanagari Text}
\author{Hrishikesh Terdalkar}

\begin{document}

\maketitle

\skt{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}

\iast{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}

\Iast{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}

\IAST{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}

\end{document}

Transliteration Script

The Python script performs the transliteration and some clean-up on the LaTeX file.

python3 finalize.py minimal.tex final.tex

This results in the content being transformed in the following way,

% ...

\skt{को न्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्।}

\iast{ko nvasmin sāmprataṃ loke guṇavān kaśca vīryavān|}

\Iast{Ko Nvasmin Sāmprataṃ Loke Guṇavān Kaśca Vīryavān|}

\IAST{KO NVASMIN SĀMPRATAṂ LOKE GUṆAVĀN KAŚCA VĪRYAVĀN|}

% ...

We can now proceed to compile the final.tex file.

xelatex final

This produces the final PDF.

Anatomy of the Transliteration Script

At the core of the transliteration script, there is a function transliterate_between.

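import re
from typing import Callable

from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate


def transliterate_between(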
    text: str,
    from_scheme: str,
    to_scheme: str,
    start_pattern: str,
    end_pattern: str,
    post_hook: Callable[[str], str] = lambda x: x,
) -> str:
    """Transliterate the text appearing between two patterns

    Only the text appearing between patterns `start_pattern` and `end_pattern`
    is transliterated.
    `start_pattern` and `end_pattern` can appear multiple times in the full
    text, and for every occurrence, the text between them is transliterated.

    `from_scheme` and `to_scheme` should be compatible with scheme names from
    `indic-transliteration`

    Parameters
    ----------
    text : str
        Full text
    from_scheme : str
        Input transliteration scheme
    to_scheme : str
        Output transliteration scheme
    start_pattern : str
        Literal text of the start tag (escaped before matching)
    end_pattern : str
        Literal text of the end tag (escaped before matching)
    post_hook : Callable[[str], str], optional
        Function to be applied on the text within tags after transliteration
        The default is `lambda x: x`.

    Returns
    -------
    str
        Text after replacements
    """

    if from_scheme == to_scheme:
        return text

    def transliterate_match(matchobj):
        target = matchobj.group(1)
        replacement = transliterate(target, from_scheme, to_scheme)
        replacement = post_hook(replacement)
        return f"{start_pattern}{replacement}{end_pattern}"

    pattern = "%s(.*?)%s" % (re.escape(start_pattern), re.escape(end_pattern))
    return re.sub(pattern, transliterate_match, text, flags=re.DOTALL)

We can provide the start and end patterns as \iast{ and } respectively, to transliterate the text enclosed in these tags.
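To see the matching behaviour in isolation, here is a minimal sketch that uses only the standard library. The helper replace_between and the stand-in transformation str.upper are hypothetical names chosen for illustration and are not part of the script; only the pattern construction mirrors transliterate_between.

```python
import re
from typing import Callable


def replace_between(
    text: str, start: str, end: str, transform: Callable[[str], str]
) -> str:
    """Apply `transform` to every span enclosed by the literal start/end tags."""
    # Both tags are escaped, so they are matched as plain text, not as regexes.
    # The non-greedy group stops at the first occurrence of the end tag.
    pattern = re.escape(start) + "(.*?)" + re.escape(end)
    return re.sub(
        pattern,
        lambda m: f"{start}{transform(m.group(1))}{end}",
        text,
        flags=re.DOTALL,  # let a tag span several lines
    )


source = r"\skt{abc} and \iast{def} and \iast{g\textbf{h}i}"
# str.upper stands in for the real transliteration step
result = replace_between(source, r"\iast{", "}", str.upper)
print(result)
```

The \skt{} span is left untouched and the first \iast{} span is transformed as a whole. In the last span, the match ends at the first closing brace, so the nested command cuts it short. With this approach it seems safest to keep only plain text inside the scheme tags.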

Using this function, we can write a generic function to work with any transliteration scheme.

def latex_transliteration(
    input_text: str,
    from_scheme: str,
    to_scheme: str
) -> str:
    """Transliterate parts of the LaTeX input enclosed in scheme tags

    A scheme tag is of the form `\\to_scheme_lowercase{}` and is used
    when the desired output is in `to_scheme`.

    i.e.,
    - Text for IAST scheme is enclosed in \\iast{} tags
    - Text for Velthuis scheme is enclosed in \\velthuis{} tags
    - ...

    Parameters
    ----------
    input_text : str
        Input text
    from_scheme : str
        Transliteration scheme of the text written within the input tags
    to_scheme : str
        Transliteration scheme to which the text within tags should be
        transliterated

    Returns
    -------
    str
        Text after replacement of text within the scheme tags
    """
    # The start tag includes the opening brace, e.g. "\iast{"
    start_tag_pattern = f"\\{to_scheme.lower()}{{"
    end_tag_pattern = "}"
    return transliterate_between(
        input_text,
        from_scheme=from_scheme,
        to_scheme=to_scheme,
        start_pattern=start_tag_pattern,
        end_pattern=end_tag_pattern
    )

Note: The names of schemes (and therefore the corresponding LaTeX commands) have to conform to the names of schemes used
by the indic-transliteration package.
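As a rough illustration of this naming constraint, the sketch below lists which scheme commands a LaTeX source actually uses, so that the generic function could be called once per scheme. The function schemes_used and the set KNOWN_SCHEMES are hypothetical. The set is an assumed subset of scheme names and should be checked against the names the indic-transliteration package really defines.

```python
import re

# Assumption: an illustrative subset of scheme names, not the package's full list
KNOWN_SCHEMES = {"iast", "hk", "velthuis", "wx", "itrans"}


def schemes_used(latex_source: str) -> set:
    """Return the known scheme names that occur as \\name{ commands."""
    # LaTeX command names consist of letters only
    commands = re.findall(r"\\([A-Za-z]+)\{", latex_source)
    # Lower-casing folds \Iast{} and \IAST{} into the same scheme as \iast{}
    return {name.lower() for name in commands} & KNOWN_SCHEMES


sample = r"\title{x} \iast{a} \IAST{b} \hk{c} \skt{d}"
print(sorted(schemes_used(sample)))
```

Since LaTeX command names cannot contain digits, a scheme whose name includes one (such as SLP1) would presumably need a command name that differs from the scheme name, plus a small mapping between the two.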

IAST is a case-insensitive transliteration scheme, so we are free to capitalize certain words (e.g. proper nouns) as we like. The post_hook argument provides this functionality. Using it, we can create a function that handles the three variants of IAST mentioned previously, namely, \iast{} (lower), \Iast{} (title) and \IAST{} (upper).

def devanagari_to_iast(input_text: str) -> str:
    """Transliterate parts of the input enclosed in
    \\iast{}, \\Iast{} or \\IAST{} tags from Devanagari to IAST

    Text in \\Iast{} tags also undergoes a `.title()` post-hook.
    Text in \\IAST{} tags also undergoes a `.upper()` post-hook.

    Parameters
    ----------
    input_text : str
        Input text

    Returns
    -------
    str
        Text after replacement of text within the IAST tags
    """
    intermediate_text = transliterate_between(
        input_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\iast{",
        end_pattern="}"
    )
    intermediate_text = transliterate_between(
        intermediate_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\Iast{",
        end_pattern="}",
        post_hook=lambda x: x.title()
    )
    final_text = transliterate_between(
        intermediate_text,
        from_scheme=sanscript.DEVANAGARI,
        to_scheme=sanscript.IAST,
        start_pattern="\\IAST{",
        end_pattern="}",
        post_hook=lambda x: x.upper()
    )

    return final_text
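One thing to keep in mind with the title-case hook is that Python's str.title() starts a new word after every non-letter character, including an apostrophe. Assuming the avagraha is rendered as an apostrophe in the IAST output (as is conventional, but worth verifying for your setup), a word such as so'ham would get a capital letter in the middle. The following is a minimal sketch of an alternative hook; title_words is a hypothetical name and is not part of the script.

```python
import re


def title_words(text: str) -> str:
    """Upper-case only the first character of each whitespace-separated word."""
    # (?<!\S)\S matches a non-space character that starts a word,
    # i.e. one that is not preceded by another non-space character.
    return re.sub(r"(?<!\S)\S", lambda m: m.group().upper(), text)


print("so'ham asmi".title())       # built-in behaviour
print(title_words("so'ham asmi"))  # alternative hook
```

Such a function could be passed as post_hook=title_words in place of the lambda that calls .title().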

Finally, there are other utility functions to remove comments and clean up excess whitespace.
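These utilities are not shown in this post, so the following is only a sketch of how such helpers might look, under the assumption that the LaTeX source contains no verbatim-like environments where % is literal. The names remove_comments and clean_whitespace are hypothetical.

```python
import re


def remove_comments(text: str) -> str:
    """Remove LaTeX comments without changing how the text is typeset."""
    # A line holding nothing but a comment is dropped together with its
    # newline, so that it does not turn into a paragraph-breaking blank line.
    text = re.sub(r"^[ \t]*%.*\n?", "", text, flags=re.MULTILINE)
    # For a trailing comment, keep a bare % so that the end of line is still
    # suppressed as before. An escaped \% is a literal percent sign.
    return re.sub(r"(?<!\\)%.*", "%", text)


def clean_whitespace(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines into one."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


sample = "a 50\\% rise % note\n% full line\n\n\n\nnext   \n"
cleaned = clean_whitespace(remove_comments(sample))
print(cleaned)
```

Removing comments before transliteration also means that a commented-out \iast{} tag is never touched by the script.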

Extras

Additionally, we may want some more structure to our setup, such as,

  • Separation of content into multiple files
\input{sections/section_devanagari.tex}
\input{sections/section_iast_lower.tex}
\input{sections/section_iast_title.tex}
\input{sections/section_iast_upper.tex}
  • Bibliography
\bibliographystyle{acm}
\bibliography{papers}

Final LaTeX Preparation

We may have used the scheme tags across multiple sections. One option is to apply the transliteration script on every section file, to create a new set of section files and use those to compile the final LaTeX file.

A simpler solution is available in the form of latexpand which resolves the \input{} commands to actually include the content and create a single consolidated LaTeX file.

latexpand main.tex > single.tex
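To make clear what this consolidation step does, here is a rough Python sketch that resolves \input{} commands recursively. It is a simplified stand-in written for illustration and not how latexpand itself is implemented; latexpand handles more cases. The name expand_inputs is hypothetical, and the sketch assumes that all paths are relative to the directory of the main file.

```python
import re
from pathlib import Path
from typing import Optional

# Matches \input{some/file}; the group captures the path inside the braces
INPUT_RE = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(path: Path, root: Optional[Path] = None) -> str:
    """Return the text of `path` with each \\input{...} replaced by file content."""
    root = root if root is not None else path.parent
    text = path.read_text(encoding="utf-8")

    def include(match: "re.Match[str]") -> str:
        target = root / match.group(1).strip()
        if target.suffix != ".tex":
            # \input{sections/intro} refers to sections/intro.tex
            target = target.with_name(target.name + ".tex")
        # Recurse, so that included files may contain \input{} themselves
        return expand_inputs(target, root)

    return INPUT_RE.sub(include, text)
```

Note that this sketch would also expand an \input{} that sits inside a comment, which is one reason to prefer the dedicated tool.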

Now, we can run the Python script on this consolidated file to resolve the transliteration tags.

python3 finalize.py single.tex final.tex

Compilation

When working with BibTeX, we often need to compile multiple times to get the references rendered correctly in the PDF. Usually, this requires,

xelatex final
bibtex final
xelatex final
xelatex final

Alternatively, we can use latexmk which takes care of the tedious compilation routines and reduces our job to a single command,

latexmk -pdflatex='xelatex %O %S' -pdf -ps- -dvi- final.tex

Another benefit of latexmk is that we can clean up the numerous files generated by the LaTeX engine with a one-liner as well,

latexmk -c

Makefile

Finally, we can place all of the console commands together in a Makefile.

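# None of these targets produce a file of the same name
.PHONY: all clear clean

# Note: recipe lines must be indented with a tab character
all: main.tex sections/*.tex papers.bib
	latexpand main.tex > single.tex
	python3 finalize.py single.tex final.tex
	latexmk -pdflatex='xelatex %O %S' -pdf -ps- -dvi- final.tex

# Remove all generated files, including the PDF
clear:
	latexmk -C
	rm -f single.tex final.tex

# Remove auxiliary files only
clean:
	latexmk -c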

Now we can focus on writing content in the .tex files and, once we are done, simply run the command,

make

Requirements

We have made use of a number of external tools, which need to be set up before the described solution can be used.

Minimal Requirements

The minimal example mentioned earlier requires only three things,

Extra Requirements

The extras have some more dependencies.

  • BibTeX (optional) (bibliography support)
  • latexpand (optional) (resolve \input{})
  • latexmk (optional) (simpler TeX compilation)

Devanagari Fonts

Nowadays, there are several good Devanagari fonts available. Google Fonts also provides a wide variety of Devanagari fonts.

Two of my personal favourites are,

Code

The source code for the entire setup is available at hrishikeshrt/devanagari-transliteration-latex.
