Stupid XML/HTML Tricks

Starring Lorax, Loofah, and McBean.

With special guest star: Nokogiri

Presented by Mike Dalessio at nyc.rb on 2010-03-09

Mike Dalessio

http://mike.daless.io/
@flavorjones
Co-author of Nokogiri.
Formerly at Pivotal.
Side project is Pharos.
Now at Benchmark Solutions.

Pharos and Benchmark are both hiring!

I am the Lorax

I speak for the trees.

Lorax Speaks for the Trees

Lorax is a library based on Nokogiri to provide diffs and deltas for XML/HTML documents.

The Lorax has CS Chops

Based on Gregory Cobena's master's thesis
Generates deltas in better than O(n * log(n)) time
A node's signature is a SHA1 of itself and its children
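
To make the signature idea concrete, here is a minimal sketch using only Ruby's standard library. The `Node` struct and its `signature` method are hypothetical names for illustration; this is not Lorax's actual implementation, which works on Nokogiri nodes.

```ruby
require "digest/sha1"

# Hypothetical stand-in for an XML/HTML node: a name, optional text, children.
Node = Struct.new(:name, :text, :children) do
  # Signature = SHA1 over this node's own data plus its children's signatures,
  # so two equal subtrees always hash to the same value.
  def signature
    child_sigs = (children || []).map(&:signature).join
    Digest::SHA1.hexdigest("#{name}|#{text}|#{child_sigs}")
  end
end

a = Node.new("div", nil, [Node.new("p", "ohai", []), Node.new("p", "kthxbye", [])])
b = Node.new("div", nil, [Node.new("p", "ohai", []), Node.new("p", "kthxbye", [])])
c = Node.new("div", nil, [Node.new("p", "ohai", []), Node.new("p", "changed", [])])

puts a.signature == b.signature                         # => true  (identical trees)
puts a.signature == c.signature                         # => false (a descendant changed)
puts a.children[0].signature == c.children[0].signature # => true  (untouched subtree)
```

Because a changed descendant changes every ancestor's signature, matching signatures let a diff skip whole unchanged subtrees without walking them.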

Lorax Demo

Lorax Needs

rspec matchers and test/unit assertions
better / cleaner application of deltas

Loofah

Fast and Powerful HTML Sanitization

(the library formerly known as Dryopteris)

Sanitization?

Sanitary: [san-i-ter-ee] adj.

Free from elements, such as filth or pathogens, that endanger health; hygienic.

No, really.

<div>
  ohai!
  <script>
    alert('5cr1pt |<1dd13 wuz here');
  </script>
  kthxbye.
</div>

Strategy #1: escape!

<div>
  ohai!
  &lt;script&gt;
    alert('5cr1pt |&lt;1dd13 wuz here');
  &lt;/script&gt;
  kthxbye.
</div>

Strategy #2: strip

<div>
  ohai!
  alert('5cr1pt |&lt;1dd13 wuz here');
  kthxbye.
</div>

Strategy #3: prune

<div>
  ohai!
  kthxbye.
</div>

Sanitization and You

There is no one-size-fits-all sanitization.

Aspect #1: Markup Fixer-Uppery

Most people expect that the markup will be valid after sanitization.

Aspect #1: Markup Fixer-Uppery

Regexes don't help with this. Don't use them.

Aspect #1: Markup Fixer-Uppery

Loofah uses Nokogiri and libxml2, so you're guaranteed valid markup.

Aspect #2: Whitelist

A whitelist of allowed tags and attributes is a must. It's a security best practice.

Aspect #2: Whitelist

Enforcing a whitelist with regexes when you don't necessarily have valid markup is hard and wrong.
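
Here is a small stdlib-only sketch of why this goes wrong. The `naive_sanitize` helper and its pattern are hypothetical, written only to show the failure mode; they are not taken from any of the libraries discussed here.

```ruby
# A naive regex "sanitizer" that deletes <script>...</script> pairs.
NAIVE_SCRIPT_RE = %r{<script.*?>.*?</script>}mi

def naive_sanitize(html)
  html.gsub(NAIVE_SCRIPT_RE, "")
end

# The simple case looks fine.
puts naive_sanitize("<div>ohai!<script>alert(1)</script></div>")
# => <div>ohai!</div>

# But a decoy pair nested inside a split tag defeats it: deleting the inner
# pair stitches the outer halves back into a working <script> element.
sneaky = "<scr<script></script>ipt>alert(1)</script>"
puts naive_sanitize(sneaky)
# => <script>alert(1)</script>
```

The regex only ever sees characters, never structure, so every patch to the pattern invites another trick like this one.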

Aspect #2: Whitelist

Enforcing a whitelist when you have a valid DOM tree is easy.

Aspect #3: Output

How do we render the output?

Plain text?
HTML?
HTML without styles or attributes?
All of the above?

Aspect #3: Output

Most packages assume one output format and give you the string.

Aspect #3: Output

But what if I want to do some post-processing?

convert <div> to <span>
only take a subtree of the sanitized HTML

The Players

Rails's `HTML::Sanitizer`

fragment = "<script>
              alert('5cr1pt |<1dd13 wuz here');
            </script>"
HTML::Sanitizer.new.sanitize(fragment)

becomes

<script>alert('5cr1pt |<1dd13 here wuz></script>

Rails's `HTML::Sanitizer`

<script>alert('5cr1pt |<1dd13 here wuz></script>

Valid markup transformed into invalid markup!
<script> left in!
Words magically rearranged!
WTF!

Rails's `HTML::Sanitizer`

If you are using Rails's built-in sanitizer, you may want to think about a course in refrigerator repair.

`Sanitize` gem

Good:

Uses configurable whitelists.
Uses Nokogiri.

Bad:

Stripping unsafe tags is the only option.

`HTML5lib`

Good:

Innarnet Experts put together a best-of-breed whitelist and process.
Great test coverage.

Bad:

Uses regexes.
Escaping unsafe tags is the only option.

Worse:

Uses REXML.

Loofah

Good:

Uses Nokogiri and libxml2.
Allows escaping, pruning and stripping.
Supports multiple output formats.
Presents a Nokogiri document, so you can munge as you see fit.
Uses HTML5lib's whitelist and test suite.

Bad:

Hard to micromanage the whitelist. But why would you want to do this?

Benchmarks (1)

On large doc (98k):

N = 100
                   real (s)
Loofah          ( 15.601635)
ActionView      ( 20.289561)
Sanitize        ( 27.340555)
HTML5lib        (114.587728)

Benchmarks (2)

On small fragment (1k):

N = 1000
                   real (s)
Loofah          (  4.459879)
ActionView      (  5.277416)
Sanitize        (  5.223475)
HTML5lib        ( 34.500048)

Hint: as fragments get smaller, libxml2's performance advantage over the alternatives shrinks.

Codes

doc = Loofah.document(unsafe_html)
doc.is_a? Nokogiri::HTML::Document # => true

Codes

doc = Loofah.fragment(unsafe_html)
doc.is_a? Nokogiri::HTML::DocumentFragment # => true

Codes

Loofah adds a scrub! method which modifies the document in place:

doc.scrub!(:strip)       # replaces unsafe tags with their inner text
doc.scrub!(:prune)       # removes  unsafe tags and their children
doc.scrub!(:whitewash)   # removes  unsafe/namespaced tags and their children,
                         # and strips all attributes (good for MS Word HTML)
doc.scrub!(:escape)      # escapes  unsafe tags, like this: &lt;script&gt;

Codes

unsafe_html = "ohai! <div>div is safe</div> " \
              "<script>but script is not</script>"

doc = Loofah.fragment(unsafe_html).scrub!(:prune)

doc.to_s    # => "ohai! <div>div is safe</div> "
doc.text    # => "ohai! div is safe "

Codes

# this ...
Loofah.scrub_fragment(dangerous_html, :prune)

# is shorthand for this
Loofah.fragment(dangerous_html).scrub!(:prune)

Codes

Microsoft-y Markup looks like this ...

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="ProgId" content="Word.Document">
<meta name="Generator" content="Microsoft Word 11">
<meta name="Originator" content="Microsoft Word 11">
<link rel="File-List" href="file:///C:%5CDOCUME%7E1%5CNICOLE%7E1%5CLOCALS%7E1%5CTemp%5Cmsohtml1%5C01%5Cclip_filelist.xml">
<!--[if gte mso 9]>
<xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" LatentStyleCount="156">
</w:LatentStyles>
</xml><![endif]--><style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style><!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";
mso-ansi-language:#0400;
mso-fareast-language:#0400;
mso-bidi-language:#0400;}
</style>
<![endif]-->

<p class="MsoNormal">Much <b style="">Simpler<o:p></o:p></b></p>

Codes

Whitewashed Microsoft-y markup looks like ...

<p>Much <b>Simpler</b></p>

Other Loofah Niceties

Two ActiveRecord plugins (opt-in and opt-out)
ActionView helper replacements
Easy to build custom scrubbers for transformations

McBean

HTML to Markdown

Sneetches

(an exercise in using Loofah for questionable purposes)

The Claim

Loofah makes it easy to transform XML/HTML documents.

McBean: The Vision

Eventually, McBean should allow HTML, Markdown, Textile, RTF, etc. to be interchangeable formats.

McBean: The Code

Wrote a 60-line Loofah::Scrubber class to convert HTML to Markdown in an afternoon.

http://github.com/flavorjones/mcbean/blob/master/lib/mcbean/markdown.rb

McBean: The Demo

Chiggity check out the live demo:

http://mcbean.heroku.com/

Thank you!

Lorax: gem install lorax
Loofah: gem install loofah
McBean: gem install mcbean
Follow me on the twitters: @flavorjones

Hiring!

(Part 1)

Pharos is looking for a 3-month contractor.

Pushing realtime data to a rich web app.
Real users. Real money.
If you want to work part-time, we're OK with that.

Email mike@pharos-ei.com for information!

Hiring!

(Part 2)

Benchmark Solutions is hiring agile developers.

Agile Team in Finance: The Great White Whale
Pushing realtime data to a rich web app
Interesting architecture and functionality.
Real users. Real money. Real backing.
Free lunch, 30" cinema displays and an 8-core Mac Pro.

Email mike.dalessio@benchmarksolutions.com for information!
