DEV Community

Let's talk about Copy.. & Paste

Bryan Ollendyke
@elmsln @haxcamp @btopro #HAXTheWeb #drupal #webcomponents #edtech ✻ Full stack unicorn Adjunct professor teaching about webdev, ethics, and everything in between
Updated on ・7 min read

There are few problems in building a JS application more complicated than handling copy and paste correctly and safely. It's such a simple operation: open a page, tab, or offline document, highlight a selection, copy, switch to the tab that holds your app, move the cursor to an editable area, paste. While seemingly mindless, this "simple" operation is the cause of much stress.

Issues you need to worry about:

  • XSS: Need to ensure JavaScript isn't pasted
  • Broken HTML: Gotta correct it or the browser will "fix" it however it sees fit
  • Attribute mess: Many apps still bring along endless style="" blocks, data-my-weird-thing-in-this-app attributes and class names which could possibly conflict with the app you're building
  • Word hell: Oh those dreaded Word attributes and invalid entities flying around. Sure it's gotten better with Office 365 being web native, but they still exist
  • Empty things: Needless divs, spans, sections, p, b and the like hanging around to mess up your previously pristine DOM

HAXTheWeb wants to be the best writing and creative experience on the internet, and be that in a platform agnostic way that empowers developers to make new assets for content authors by simply making new web components.

How we handle copy / paste

There are many solutions, even many within our solution, but we've made a single utils package that you can import to get a large portion of the way there. Yes, this is OUR sanitization methodology, so maybe yours differs (and that's fine: fork, copy and paste, etc. to make your own).

@lrnwebcomponents/utils

yarn add @lrnwebcomponents/utils
npm install @lrnwebcomponents/utils

You can see the exact code of this package in our monorepo. Now I'm going to step through each chunk and explain what it's doing. A follow-up post will cover the actual eventListener logic we use to wire this up to the clipboard (that gets implementation specific though).

SHOW ME THE CODE

I'll go over implementation in the next post but to simplify, here's how we're getting the paste data:

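let pasteContent = "";
// jQuery-style wrappers expose the native event as originalEvent
const pasteEvent = e.originalEvent || e;
if (pasteEvent.clipboardData) {
  pasteContent = pasteEvent.clipboardData.getData("text/html");
} else if (window.clipboardData) {
  // legacy Internet Explorer fallback, plain text only
  pasteContent = window.clipboardData.getData("Text");
}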

This gets us cross-browser paste data, either as rich HTML or as just the textual content of what's been copied. The rich HTML is going to give us the attributes from the other application, but it also causes us the most headaches in cleanup!
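To poke at that branch without a browser, it can be wrapped in a small function. This is only a sketch: the name `getPasteContent`, the `win` parameter standing in for `window`, and the fallback to `text/plain` when no HTML flavor is offered are all my assumptions here, not something the package ships.

```javascript
// Hypothetical helper (not part of @lrnwebcomponents/utils): returns the
// pasted content from a paste event, preferring rich HTML when available.
function getPasteContent(e, win) {
  // jQuery-style wrappers expose the native event as `originalEvent`
  const nativeEvent = (e && e.originalEvent) || e;
  if (nativeEvent && nativeEvent.clipboardData) {
    const html = nativeEvent.clipboardData.getData("text/html");
    // assumption: fall back to plain text when no HTML flavor was offered
    return html || nativeEvent.clipboardData.getData("text/plain");
  }
  // legacy Internet Explorer exposed the clipboard on window
  if (win && win.clipboardData) {
    return win.clipboardData.getData("Text");
  }
  return "";
}

// Stand-in for a browser paste event, for illustration only
const fakeEvent = {
  clipboardData: {
    getData: (type) => (type === "text/html" ? "<p>Hi</p>" : "Hi"),
  },
};
console.log(getPasteContent(fakeEvent, {})); // <p>Hi</p>
```

In a real page you would call something like this from inside a paste listener and pass `window` as the second argument.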

Step 1: remove line breaks / Mso classes

function stripMSWord(input) {
  // 1.  remove line breaks / Mso classes right off the bat
  var output = input
    .split("\n\r")
    .join("\n")
    .split("\r")
    .join("\n")
    .split("\n\n")
    .join("\n")
    .split("\n\n")
    .join("\n")
    .split("\n\n")
    .join("\n")
    .split("\n")
    .join(" ")
    .replace(/( class=(")?Mso[a-zA-Z]+(")?)/g, "");

This series of weird split / join statements ensures we don't import a ridiculously highly spaced block of content. \r is a carriage return, \n is a line feed. To cover every combination (across browser / app / OS) we split the string into an array on each line-ending sequence and rejoin it with a single \n, repeat that for doubled line feeds to collapse blank lines, and finally turn every remaining \n into a space.

This may seem weird, but it strips the huge runs of whitespace some apps send while still leaving us something remotely readable. Since we're RegEx replacing material later on, it simplifies some of those expressions too.

Lastly we seek and destroy the Mso-looking classes which MS puts all over the place. Later on we kill all classes, but I figured I'd leave this here in case people wanted to just target Office.
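Here is the same idea on a tiny Word-style fragment. Note this sketch is not the package code: the helper name `collapseLineBreaks` is made up, and it swaps the repeated split / join passes for a single `/\n+/g` replace, which collapses a run of line feeds of any length into one space.

```javascript
// Hypothetical helper mirroring step 1; the name is illustrative only.
function collapseLineBreaks(input) {
  return input
    .split("\r\n").join("\n") // Windows line endings
    .split("\r").join("\n")   // bare carriage returns
    .replace(/\n+/g, " ")     // any run of line feeds becomes one space
    .replace(/( class=(")?Mso[a-zA-Z]+(")?)/g, ""); // same Mso expression as above
}

const pasted = '<p class="MsoNormal">First line\r\n\r\nSecond line</p>';
console.log(collapseLineBreaks(pasted));
// <p>First line Second line</p>
```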

Step 2: Remove HTML comments

  // 2. strip Word generated HTML comments
  output = output.replace(/<\!--(\s|.)*?-->/gim, "");
  output = output.replace(/<\!(\s|.)*?>/gim, "");

This weird looking regex targets all comments within the HTML output and removes them entirely; the second pass catches leftover <!...> declarations such as a doctype. Some apps put these in, and I personally don't need them sending comments across (Word being a big offender here).
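To see both expressions at work, here they are applied to a small fragment. The wrapper name `stripComments` and the sample string are just for illustration.

```javascript
// Same two expressions as step 2, wrapped in a hypothetical helper
function stripComments(html) {
  return html
    .replace(/<\!--(\s|.)*?-->/gim, "") // <!-- ... --> blocks, conditional ones included
    .replace(/<\!(\s|.)*?>/gim, "");    // anything else that starts with <!
}

const sample =
  "<!DOCTYPE html><!--[if gte mso 9]><xml>junk</xml><![endif]--><p>Kept</p>";
console.log(stripComments(sample)); // <p>Kept</p>
```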

  // 3. remove tags leave content if any
  // \?xml: matches the literal "?" in Word's <?xml:namespace ...> tags
  output = output.replace(
    /<(\/)*(meta|link|html|head|body|span|font|br|\?xml:|xml|st1:|o:|w:|m:|v:)(\s|.)*?>/gim,
    ""
  );

Another strange one. This accounts for cases where you'll get sent something like <p>The problem is <font>some actual content</font> can be found in there too.</p>. The match targets the opening and closing tags of things like head, body, span, font, br (yeah, some stuff even screws those up) and the Office namespaces, then kills the tag wrapper while leaving the inner content. Running our regex against our example we get <p>The problem is some actual content can be found in there too.</p>
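A quick way to convince yourself the inner content survives is to run a cut-down version of the expression. The tag list here is shortened for readability and the helper name `unwrapTags` is hypothetical.

```javascript
// Trimmed-down version of the step 3 expression (shorter tag list, illustration only)
function unwrapTags(html) {
  // removes opening and closing tags whose name starts with span, font or o:
  return html.replace(/<(\/)*(span|font|o:)(\s|.)*?>/gim, "");
}

const sample =
  '<p>The problem is <font color="red">some <span>actual</span> content</font><o:p></o:p></p>';
console.log(unwrapTags(sample));
// <p>The problem is some actual content</p>
```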

  // 4. Remove everything in between and including tags '<style(.)style(.)>'
  var badTags = ["style", "script", "applet", "embed", "noframes", "noscript"];
  for (var i in badTags) {
    let tagStripper = new RegExp(
      // the pattern is built from a string, so \s needs a double backslash
      "<" + badTags[i] + "(\\s|.)*?" + badTags[i] + "(.*?)>",
      "gim"
    );
    output = output.replace(tagStripper, "");
  }

This is part of our XSS strategy. If you try pasting styles, JavaScript or older tags like applet / embed, we kill them entirely without replacement. The whole thing is gone.
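As a sketch of what "gone" means here, the loop can be pulled into its own function and run on a fragment. The name `stripBadTags` is made up, and I've written the pattern with an escaped `\\s` since it is built from a string.

```javascript
// Hypothetical standalone version of step 4
function stripBadTags(html) {
  const badTags = ["style", "script", "applet", "embed", "noframes", "noscript"];
  let output = html;
  for (const tag of badTags) {
    // matches from "<tag" through the next "tag...>" so the contents go too
    const tagStripper = new RegExp(
      "<" + tag + "(\\s|.)*?" + tag + "(.*?)>",
      "gim"
    );
    output = output.replace(tagStripper, "");
  }
  return output;
}

const sample =
  '<p>Safe</p><script>alert("xss")</script><style>p { color: red; }</style>';
console.log(stripBadTags(sample)); // <p>Safe</p>
```

Keep in mind this step only removes whole elements; an inline handler such as onerror on an otherwise harmless tag is not touched by it.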

  // 5. remove attributes ' style="..."', align, start
  // accept double or single quoted values (align may also be unquoted)
  output = output.replace(/ style=("[^"]*"|'[^']*')/gim, "");
  output = output.replace(/ face="(\s|.)*?"/gim, "");
  output = output.replace(/ align=("[^"]*"|'[^']*'|[^\s>]+)/gim, "");
  output = output.replace(/ start=("[^"]*"|'[^']*')/gim, "");
  // ID's wont apply meaningfully on a paste
  output = output.replace(/ id="(\s|.)*?"/gim, "");
  // Google Docs ones
  output = output.replace(/ dir="(\s|.)*?"/gim, "");
  output = output.replace(/ role="(\s|.)*?"/gim, "");

This is pretty aggressive, but I don't want style, face (from the deprecated <font> tag, a Word thing), align (Word), start (also Word), id (could conflict with ids already on the page, from an accessibility / style perspective), dir (needless dir="ltr" from Google Docs), or role (again, the local app being pasted into is in charge of this).
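Here are the face, id, dir and role expressions from this step on a Google Docs-style heading. The wrapper name `stripAttributes` is hypothetical; the point is that href and other attributes not on the list are left alone.

```javascript
// Four of the step 5 expressions, wrapped in a hypothetical helper
function stripAttributes(html) {
  return html
    .replace(/ face="(\s|.)*?"/gim, "")
    .replace(/ id="(\s|.)*?"/gim, "")
    .replace(/ dir="(\s|.)*?"/gim, "")
    .replace(/ role="(\s|.)*?"/gim, "");
}

const sample =
  '<h2 id="intro" dir="ltr" role="heading">Title</h2><a href="https://example.com">link</a>';
console.log(stripAttributes(sample));
// <h2>Title</h2><a href="https://example.com">link</a>
```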

  // 6. some HAX specific things in case this was moving content around
  // these are universally true tho so fine to have here
  output = output.replace(/ contenteditable="(\s|.)*?"/gim, "");
  // some Medium, Box, GitHub and other paste stuff as well as general paste clean up for classes
  // in multiple html primitives
  output = output.replace(/ data-(\s|.)*?"(\s|.)*?"/gim, "");
  output = output.replace(/ class="(\s|.)*?"/gim, "");

We use this in the context of HAX so there are some additional ones added here. These SHOULD be useful to most people but I blocked them off anyway for clarity. This wipes contenteditable, anything matching data-{WHATEVER} and all classes. That last one might be a deal breaker for you, but I don't need Medium (as an example) adding in <p class="as er df sd fg ds we ds cx sd yt fg as xc sd qf ds qw">Thing</p> to all of my pasted content indefinitely (and yes that's what their classes look like).
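The three expressions from this step, run against a Medium-style paragraph. The helper name `stripAppAttributes` and the sample attribute values are invented for the example.

```javascript
// The step 6 expressions, wrapped in a hypothetical helper
function stripAppAttributes(html) {
  return html
    .replace(/ contenteditable="(\s|.)*?"/gim, "")
    .replace(/ data-(\s|.)*?"(\s|.)*?"/gim, "") // data-anything="value"
    .replace(/ class="(\s|.)*?"/gim, "");
}

const sample =
  '<p class="as er df" data-selectable-paragraph="" contenteditable="true">Thing</p>';
console.log(stripAppAttributes(sample)); // <p>Thing</p>
```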

  // 7. clean out empty paragraphs and endlines that cause weird spacing
  output = output.replace(/&nbsp;/gm, " ");
  // start of double, do it twice for nesting
  output = output.replace(/<section>/gm, "<p>");
  output = output.replace(/<\/section>/gm, "</p>");
  output = output.replace(/<p><p>/gm, "<p>");
  output = output.replace(/<p><p>/gm, "<p>");
  // double, do it twice for nesting
  output = output.replace(/<\/p><\/p>/gm, "</p>");
  output = output.replace(/<\/p><\/p>/gm, "</p>");
  // normalize BR's; common from GoogleDocs
  output = output.replace(/<br \/>/gm, "<br/>");
  // match the normalized form, since "<br />" no longer exists at this point
  output = output.replace(/<p><br\/><b>/gm, "<p><b>");
  output = output.replace(/<\/p><br\/><\/b>/gm, "</p></b>");
  // some other things we know not to allow to wrap
  output = output.replace(/<b><p>/gm, "<p>");
  output = output.replace(/<\/p><\/b>/gm, "</p>");
  // drop list wrappers
  output = output.replace(/<li><p>/gm, "<li>");
  output = output.replace(/<\/p><\/li>/gm, "</li>");
  // bold wraps as an outer tag like p can, and on lists
  output = output.replace(/<b><ul>/gm, "<ul>");
  output = output.replace(/<\/ul><\/b>/gm, "</ul>");
  output = output.replace(/<b><ol>/gm, "<ol>");
  output = output.replace(/<\/ol><\/b>/gm, "</ol>");
  // try ax'ing extra spans
  output = output.replace(/<span><p>/gm, "<p>");
  output = output.replace(/<\/p><\/span>/gm, "</p>");
  // empty with lots of space
  output = output.replace(/<p>(\s*)<\/p>/gm, " ");
  // empty p / more or less empty
  output = output.replace(/<p><\/p>/gm, "");
  output = output.replace(/<p>&nbsp;<\/p>/gm, " ");
  // br somehow getting through here
  output = output.replace(/<p><br\/><\/p>/gm, "");
  output = output.replace(/<p><br><\/p>/gm, "");
  // whitespace in reverse of the top case now that we've cleaned it up
  output = output.replace(/<\/p>(\s*)<p>/gm, "</p><p>");

This one handles a series of bizarre inconsistencies I noticed when pasting from Google Docs and a few others. SEVERAL applications were creating <p><p></p></p> or worse, wrapping the entire copied selection in a <b> tag (the purely presentational cousin of <strong>).

This block cleans up random spans, ol in b, ul in b, p in li, p in p, b in p, span in p, p in span, and other mostly meaningless things like multiple empty breaks in a row. This is a pretty aggressive attempt at getting PURE HTML structures to come across, and correct them while still doing it all via Regex.
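A handful of these replacements are enough to show the effect. This sketch uses only six of the expressions from the block, in the same relative order, under a made-up name `flattenNesting`.

```javascript
// Subset of the step 7 expressions, wrapped in a hypothetical helper
function flattenNesting(html) {
  return html
    .replace(/<b><p>/gm, "<p>")               // selection wrapped in <b>
    .replace(/<\/p><\/b>/gm, "</p>")
    .replace(/<li><p>/gm, "<li>")             // paragraphs inside list items
    .replace(/<\/p><\/li>/gm, "</li>")
    .replace(/<p>(\s*)<\/p>/gm, " ")          // empty paragraphs
    .replace(/<\/p>(\s*)<p>/gm, "</p><p>");   // whitespace between paragraphs
}

const sample =
  "<b><p>Intro</p> <p>   </p> <p>Outro</p></b><ul><li><p>Item</p></li></ul>";
console.log(flattenNesting(sample));
// <p>Intro</p><p>Outro</p><ul><li>Item</li></ul>
```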

  // wow do I hate contenteditable and the dom....
  // bold and italic are treated as if they are block elements in a paste scenario
  // 8. check for empty bad tags
  for (var i in badTags) {
    let emptyTagRemove = new RegExp(
      "<" + badTags[i] + "></" + badTags[i] + ">",
      "gi"
    );
    output = output.replace(emptyTagRemove, "");
  }
  output = output.trim();
  return output;
}

Lastly, I do one last check to blow away empty bad tags (they should be gone by now, but just checking) and then I do a trim (removing whitespace from both ends of the string).

And like that, we've got cleaned up content! I recently improved the way HAX handles paste operations, some of which revolves around this script. There's always room for improvement and some of our filters are VERY aggressive so you might want to just fork / copy certain pieces for your own use.

If you have any suggestions on how we can make some of these faster or with less code, I'm happy to hear them. I wanted it to be dependency free (yes, I know there are a lot of purifiers out there).

In the next post I'm going to explain some of the logic in how we actually handle pasted data in HAX. HAX jumps through a lot of hoops to try and ensure that your pasted content is logical, block form, well written, and without random additional attributes / spacing all over the place (and secure, obviously!).

This gets VERY in the weeds of HAX but I'll write it up in the event anyone wants to build their own WYSIWYG editor out there :).
