DEV Community

Martin Kordas

How to properly store and print plaintext strings

Most of the time, users of web applications enter dynamic textual data as plaintext. These texts are usually stored in a database and then printed in various ways. Apart from dynamic user data, applications also use static plaintext for user interface texts and static item lists. (Static plaintext can be stored in a database or directly in program data.)

I decided to share some common tips on how to deal with these plain strings appropriately. For brevity, the article only lists what should be done with plaintext strings; it does not focus on how to implement it. In most programming languages, an existing library function can be found to accomplish each of the described tasks.

Storing plaintext

Before storing a string received as user input, you should perform these operations.

  1. Normalize whitespace: Normalization gives your strings a uniform format and strips unnecessary or unintended characters. You will probably implement this feature using regular expressions.
    1. Strip any whitespace from the beginning and end of the string.
    2. Convert whitespace characters other than the space character (e.g. tab) into a space character (do not convert line break characters).
    3. Merge any adjacent whitespace characters into a single space character.
    4. Normalize line breaks: Line breaks are represented by a carriage return (CR) character, a line feed (LF) character, or the CRLF sequence. Pick one convention and convert everything to it: either the LF character (Unix convention) or the CRLF sequence (Windows convention).
  2. Strip non-printable characters: Unicode inherits the non-printable control characters of the original ASCII character set (e.g. BEL, EOT). Strip those characters to avoid confusion when debugging strings that contain "invisible" characters. To filter the non-printable characters out, you would usually use a regular expression with a specific range of character values.
  3. Strip unsupported Unicode characters: Although UTF-8 encoding is the de facto standard these days, some unusual Unicode characters may not be supported by your systems. For example, MySQL's utf8 charset is capable of storing only characters of the Unicode Basic Multilingual Plane. Other characters (e.g. emojis) will not work, and you should ideally strip them before storing the string.
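
As a rough illustration of the three steps above, here is a minimal Python sketch. The function names are made up for this example, LF is assumed as the stored line break convention, and the last step assumes a storage layer that is limited to the Basic Multilingual Plane. Control characters are stripped first so that removing them cannot leave doubled spaces behind.

```python
import re

# C0/C1 control characters except tab, LF and CR (those are handled as whitespace).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")
# Everything outside the Basic Multilingual Plane (e.g. most emojis).
_NON_BMP = re.compile(r"[\U00010000-\U0010FFFF]")


def strip_control_chars(text: str) -> str:
    return _CONTROL_CHARS.sub("", text)


def strip_non_bmp(text: str) -> str:
    return _NON_BMP.sub("", text)


def normalize_whitespace(text: str) -> str:
    # Unify line breaks: CRLF and lone CR both become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace every run of non-newline whitespace with a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    # Drop spaces that ended up directly next to a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Strip leading and trailing whitespace (including line breaks).
    return text.strip()


def clean_input(text: str) -> str:
    """Hypothetical helper chaining all three steps before storing."""
    return normalize_whitespace(strip_non_bmp(strip_control_chars(text)))
```

If your database supports the full Unicode range, you would simply leave `strip_non_bmp` out of the chain.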

Printing plaintext

Printing plaintext into plaintext

Web applications usually use HTML as their output format, but occasionally you would use plaintext instead. This includes exporting into text files or sending plaintext e-mails. Printing plaintext into plaintext is naturally easier than printing it into HTML.

  1. Normalize line breaks: Thanks to line break normalization before storing, you already have line breaks in a uniform format. However, you should now convert them depending on the system where the text will be used. Ideally you would use the LF line break style for Unix systems and the CRLF sequence for Windows systems. You just have to discover what type the target system is.
    • When sending a file to a client, you could determine the client's operating system from the User-Agent header.
    • Alternatively, you could simply always use the CRLF sequence, which will work on both Unix and Windows systems most of the time.
    • If you are printing into a file stored locally on your server, you should always use the line break style specific to your server's operating system.
  2. Wrap lines: It can be very annoying for a user to read excessively long lines. Check that each line in your text does not exceed a suitable character limit (e.g. 80 characters) and insert a line break where needed.
  3. Convert encoding: Although we usually use UTF-8 on the web and in database storage, your clients may prefer a legacy character set for exports (e.g. ISO-8859-1).
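
The three steps above could be combined roughly as follows. This is only a sketch: the function name and its default parameters are my own choice for illustration, and it assumes the stored text already uses LF line breaks. Characters that the target encoding cannot represent are replaced with a question mark.

```python
import textwrap


def to_plaintext_output(text: str, width: int = 80, newline: str = "\r\n",
                        encoding: str = "iso-8859-1") -> bytes:
    wrapped = []
    # Wrap each stored line separately so that existing line breaks survive.
    for line in text.split("\n"):
        # textwrap.wrap() returns an empty list for an empty line; keep the line.
        wrapped.extend(textwrap.wrap(line, width=width) or [""])
    # Join with the line break style of the target system, then convert encoding.
    return newline.join(wrapped).encode(encoding, errors="replace")
```

For a file stored locally on the server, you could pass `newline=os.linesep` (after `import os`) instead of the CRLF default.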

Printing plaintext into HTML

Although it cannot express any formatting, storing plaintext in the database has several advantages over storing HTML itself:

  • plaintext is smaller in size
  • plaintext can be easily read by a database admin or by a developer when debugging an app
  • plaintext is suitable for querying (searching based on user input, including fulltext search)
  • plaintext represents a "neutral", generic form of textual data (whereas text stored as HTML can be used only inside an HTML document)

If you need to print a plaintext string into HTML, you should try to programmatically add all the formatting features that plaintext was not able to store. I list some of these below. (Only points marked with '!' are necessary for proper and secure printing of texts into HTML; the others are optional.)

  1. Convert URLs into <a> tags: Do this only if the text is supposed to contain URLs.
  2. Apply formatting expressed in a given markup language: Simple markup languages like Markdown make it possible for users to specify formatting directly in the plaintext. Before printing, the plaintext is transformed into HTML using a library function. However, your users need to be trained to use the specific markup language.
  3. (!) Convert HTML special characters into corresponding entities: This is an absolute necessity for security reasons, as it protects your website against HTML injection. For example, the special character < would be encoded as the entity &lt;, > as &gt;, & as &amp; and " as &quot;.
  4. Convert all applicable characters into corresponding HTML entities: This makes some unusual Unicode characters easier to identify in the source code. For example, the © character would be encoded as the entity &copy;.
  5. Transform characters based on your language's grammar rules: Users of your application typically only use basic characters when filling in input fields and do not care much about typography. Hence, your program should intelligently deduce the transformations needed for the text to be typographically and grammatically correct. Typically you would replace a character located at an appropriate position with an HTML entity.

    • &nbsp;: Put a non-breaking space instead of a simple space wherever a line break is inappropriate. In English for example, you would put a non-breaking space between a number and its unit (e.g. 10 kg).
    • &ndash; or &mdash;: The basic hyphen character (-) should be replaced with wider dash characters in many cases. In English for example, you would put an en dash (–) in numerical ranges (e.g. 1939–1945).
    • &hellip;: The horizontal ellipsis character (…) should replace three dots at the end of a sentence.
    • &bdquo; and &ldquo;: Users usually type the basic double quote character (") when filling in forms. However, some languages may require different opening and closing quotes (for example „ for opening and “ for closing quotes).
  6. (!) Convert line break characters to <br /> tags: Otherwise you will see no actual line breaks on the website.

  7. Wrap lines: See the section Printing plaintext into plaintext above. Wrapping long HTML lines is useful when you read the raw HTML source code as a developer, but it has no impact on user experience.
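
To give an idea of the typographic transformations from point 5, here is a small Python sketch covering the non-breaking space, the en dash and the ellipsis. The function name, the list of units and the exact patterns are assumptions made for this example; real rules depend on your language and will need more cases. Because it inserts entities, it expects text whose HTML special characters have already been escaped.

```python
import re

# Number followed by a space and one of a few (hypothetical) unit abbreviations.
_NUMBER_UNIT = re.compile(r"(\d) (?=(?:kg|g|km|m|cm|mm|ml|l)\b)")
# Two numbers joined by a hyphen, but not part of a longer chain like 2020-01-15.
_NUMBER_RANGE = re.compile(r"(?<![\d-])(\d+)-(\d+)(?![\d-])")
# Exactly three dots in a row.
_THREE_DOTS = re.compile(r"(?<!\.)\.{3}(?!\.)")


def add_typographic_entities(escaped_text: str) -> str:
    """Insert typographic entities into already HTML-escaped text."""
    result = _NUMBER_UNIT.sub(r"\1&nbsp;", escaped_text)
    result = _NUMBER_RANGE.sub(r"\1&ndash;\2", result)
    result = _THREE_DOTS.sub("&hellip;", result)
    return result
```

The range pattern is deliberately conservative, so it leaves hyphenated chains such as ISO dates untouched.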

You should ideally incorporate several of these operations into one function call which you will use as a standard way for printing plaintext into HTML.
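
Such a function might look like the following Python sketch, which covers the two mandatory steps (escaping and line breaks) plus URL conversion. The function name and the URL pattern are my own simplifications, and the text is assumed to use LF line breaks. URLs are located in the raw text, and every piece is escaped separately, so nothing the user typed reaches the page unescaped. Only http and https URLs are turned into links.

```python
import html
import re

# Simplified URL pattern: http(s) only, trailing punctuation is left outside the link.
_URL = re.compile(r"https?://[^\s<>\"']+[^\s<>\"'.,;:!?)]")


def plaintext_to_html(text: str) -> str:
    parts = []
    pos = 0
    for match in _URL.finditer(text):
        # Escape the plain text between the previous URL and this one.
        parts.append(html.escape(text[pos:match.start()]))
        # Escape the URL itself before using it as attribute value and link text.
        url = html.escape(match.group(0))
        parts.append(f'<a href="{url}">{url}</a>')
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    # Make line breaks visible in the browser.
    return "".join(parts).replace("\n", "<br />\n")
```

For example, `plaintext_to_html("<script>")` returns `&lt;script&gt;`, so injected markup is displayed as text instead of being interpreted.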

As an alternative approach, you could store both the plaintext and the HTML version of the text. This solution saves you from dynamically transforming the plaintext every time you need to print it into HTML, but it reduces maintainability and increases storage needs.

Conclusion

I consider the thoughts above to be a structured summary of how to deal with plaintext as a programmer. It is not easy for a beginner to think through all the possible eventualities, so I hope this is helpful.

If anything is not clear, please let me know in the comments. I can also provide an example implementation of some of the described string handling operations.
