Stupid XML/HTML Tricks

Starring Lorax, Loofah, and McBean.

With special guest star: Nokogiri

Presented by Mike Dalessio at nyc.rb on 2010-03-09

Mike Dalessio

Pharos and Benchmark are both hiring!

I am the Lorax

I speak for the trees.

Lorax Speaks for the Trees

Lorax is a library based on Nokogiri to provide diffs and deltas for XML/HTML documents.

The Lorax has CS Chops

  • Based on Gregory Cobena's master's thesis
  • Generates deltas in better than O(n * log(n)) time
  • A node's signature is a SHA1 of itself and its children
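
A rough sketch of the signature idea in the last bullet, using only Ruby's standard library. The `Node` struct and `signature` method are made up for illustration and are not Lorax's actual classes. Because each hash folds in the children's hashes, two subtrees with equal signatures can be treated as identical without walking them:

```ruby
require "digest/sha1"

# Toy tree node; Lorax itself works on Nokogiri nodes, this is only a stand-in.
Node = Struct.new(:name, :text, :children)

# Merkle-style signature: hash the node's own data plus its children's signatures.
def signature(node)
  child_signatures = node.children.map { |child| signature(child) }
  Digest::SHA1.hexdigest([node.name, node.text, *child_signatures].join("\0"))
end

old_tree = Node.new("div", "", [Node.new("p", "ohai", []), Node.new("p", "kthxbye", [])])
new_tree = Node.new("div", "", [Node.new("p", "ohai", []), Node.new("p", "kthxbai", [])])

# The first <p> is unchanged, so its signature matches and a diff can skip it.
puts signature(old_tree.children[0]) == signature(new_tree.children[0])  # => true
# The second <p> changed, so the change bubbles up to the parent's signature.
puts signature(old_tree) == signature(new_tree)                          # => false
```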

Lorax Demo

Lorax Needs

  • rspec matchers and test/unit assertions
  • better / cleaner application of deltas

Loofah

Fast and Powerful HTML Sanitization

(the library formerly known as Dryopteris)

Sanitization?

Sanitary: [san-i-ter-ee] adj.

Free from elements, such as filth or pathogens, that endanger health; hygienic.

No, really.

<div>
  ohai!
  <script>
    alert('5cr1pt |<1dd13 wuz here');
  </script>
  kthxbye.
</div>

Strategy #1: escape!

<div>
  ohai!
  &lt;script&gt;
    alert('5cr1pt |&lt;1dd13 wuz here');
  &lt;/script&gt;
  kthxbye.
</div>

Strategy #2: strip

<div>
  ohai!
  alert('5cr1pt |&lt;1dd13 wuz here');
  kthxbye.
</div>

Strategy #3: prune

<div>
  ohai!
  kthxbye.
</div>

Sanitization and You

There is no one-size-fits-all sanitization.

Aspect #1: Markup Fixer-Uppery

Most people expect that the markup will be valid after sanitization.

Regexes don't help with this. Don't use them.

Loofah uses Nokogiri and libxml2, so you're guaranteed valid markup.

Aspect #2: Whitelist

A whitelist of allowed tags and attributes is a must. It's a security best practice.

Enforcing a whitelist with regexes when you don't necessarily have valid markup is hard and wrong.

Enforcing a whitelist when you have a valid DOM tree is easy.

Aspect #3: Output

How do we render the output?

  • Plain text?
  • HTML?
  • HTML without styles or attributes?
  • All of the above?

Most packages assume one output format and give you the string.

But what if I want to do some post-processing?

  • convert <div> to <span>
  • only take a subtree of the sanitized HTML

The Players

Rails's HTML::Sanitizer

fragment = "<script>
              alert('5cr1pt |<1dd13 wuz here');
            </script>"
HTML::Sanitizer.new.sanitize(fragment)

becomes

<script>alert('5cr1pt |<1dd13 here wuz></script>

  • Valid markup transformed into invalid markup!
  • <script> left in!
  • Words magically rearranged!
  • WTF!

If you are using Rails's built-in sanitizer, you may want to think about a course in refrigerator repair.

Sanitize gem

Good:

  • Uses configurable whitelists.
  • Uses Nokogiri.

Bad:

  • Stripping unsafe tags is the only option.

HTML5lib

Good:

  • Innarnet Experts put together a best-of-breed whitelist and process.
  • Great test coverage.

Bad:

  • Uses regexes.
  • Escaping unsafe tags is the only option.

Worse:

  • Uses REXML.

Loofah

Good:

  • Uses Nokogiri and libxml2.
  • Allows escaping, pruning and stripping.
  • Supports multiple output formats.
  • Presents a Nokogiri document, so you can munge as you see fit.
  • Uses HTML5lib's whitelist and test suite.

Bad:

  • Hard to micromanage the whitelist. But why would you want to do this?

Benchmarks (1)

On large doc (98k):

N = 100
                       real
Loofah          ( 15.601635)
ActionView      ( 20.289561)
Sanitize        ( 27.340555)
HTML5lib        (114.587728)

Benchmarks (2)

On small fragment (1k):

N = 1000
                       real
Loofah          (  4.459879)
ActionView      (  5.277416)
Sanitize        (  5.223475)
HTML5lib        ( 34.500048)

Hint: as fragments get smaller, libxml2's speed advantage shrinks: Loofah takes about 77% of ActionView's time on the 98k document, but about 85% on the 1k fragment.
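
For anyone who wants comparable numbers, here is a sketch of a timing harness built on Ruby's standard `Benchmark` module. The `time_sanitizers` helper is hypothetical, and the `CGI.escapeHTML` lambda is only a stand-in so the sketch runs without extra gems. It is not one of the libraries measured above, and this is not the script that produced those tables:

```ruby
require "benchmark"
require "cgi"

# Hypothetical harness: run each callable `iterations` times on the same input
# and collect wall-clock ("real") seconds per sanitizer.
def time_sanitizers(sanitizers, input, iterations)
  sanitizers.transform_values do |sanitizer|
    Benchmark.realtime { iterations.times { sanitizer.call(input) } }
  end
end

fragment = "<div>ohai! <script>alert(1)</script> kthxbye.</div>" * 20

results = time_sanitizers(
  { "escape-everything (stand-in)" => ->(html) { CGI.escapeHTML(html) } },
  fragment,
  1000
)

results.each { |name, seconds| printf("%-30s (%10.6f)\n", name, seconds) }
```

To compare real libraries, swap in one lambda per sanitizer and run the harness once on a large document and once on a small fragment.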

Codes

doc = Loofah.document(unsafe_html)
doc.is_a? Nokogiri::HTML::Document # => true

doc = Loofah.fragment(unsafe_html)
doc.is_a? Nokogiri::HTML::DocumentFragment # => true

Loofah adds a scrub! method which modifies the document in place:


doc.scrub!(:strip)       # replaces unsafe tags with their inner text
doc.scrub!(:prune)       # removes  unsafe tags and their children
doc.scrub!(:whitewash)   # removes  unsafe/namespaced tags and their children,
                         # and strips all attributes (good for MS Word HTML)
doc.scrub!(:escape)      # escapes  unsafe tags, like this: &lt;script&gt;

unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"

doc = Loofah.fragment(unsafe_html).scrub!(:prune)

doc.to_s    # => "ohai! <div>div is safe</div> "
doc.text    # => "ohai! div is safe "

# this ...
Loofah.scrub_fragment(dangerous_html, :prune)

# is shorthand for this
Loofah.fragment(dangerous_html).scrub!(:prune)

Microsoft-y Markup looks like this ...


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="ProgId" content="Word.Document">
<meta name="Generator" content="Microsoft Word 11">
<meta name="Originator" content="Microsoft Word 11">
<link rel="File-List" href="file:///C:%5CDOCUME%7E1%5CNICOLE%7E1%5CLOCALS%7E1%5CTemp%5Cmsohtml1%5C01%5Cclip_filelist.xml">
<!--[if gte mso 9]>
<xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" LatentStyleCount="156">
</w:LatentStyles>
</xml><![endif]--><style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style><!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";
mso-ansi-language:#0400;
mso-fareast-language:#0400;
mso-bidi-language:#0400;}
</style>
<![endif]-->

<p class="MsoNormal">Much <b style="">Simpler<o:p></o:p></b></p>

Whitewashed Microsoft-y markup looks like ...


<p>Much <b>Simpler</b></p>

Other Loofah Niceties

  • Two ActiveRecord plugins (opt-in and opt-out)
  • ActionView helper replacements
  • Easy to build custom scrubbers for transformations

McBean

HTML to Markdown

Sneetches

(an exercise in using Loofah for questionable purposes)

The Claim

Loofah makes it easy to transform XML/HTML documents.

McBean: The Vision

Eventually, McBean should allow HTML, Markdown, Textile, RTF, etc. to be interchangeable formats.

McBean: The Code

Wrote a 60-line Loofah::Scrubber class to convert HTML to Markdown in an afternoon.

http://github.com/flavorjones/mcbean/blob/master/lib/mcbean/markdown.rb

McBean: The Demo

Chiggity check out the live demo:

http://mcbean.heroku.com/

Thank you!

Hiring!

(Part 1)

Pharos is looking for a 3-month contractor.

  • Pushing realtime data to a rich web app.
  • Real users. Real money.
  • If you want to work part-time, we're OK with that.

Email mike@pharos-ei.com for information!

Hiring!

(Part 2)

Benchmark Solutions is hiring agile developers.

  • Agile Team in Finance: The Great White Whale
  • Pushing realtime data to a rich web app
  • Interesting architecture and functionality.
  • Real users. Real money. Real backing.
  • Free lunch, 30" cinema displays and an 8-core Mac Pro.

Email mike.dalessio@benchmarksolutions.com for information!