Stupid XML/HTML Tricks
Starring Lorax, Loofah, and McBean.
With special guest star: Nokogiri
Presented by Mike Dalessio at nyc.rb on 2010-03-09
Pharos and Benchmark are both hiring!
I speak for the trees.
Lorax is a library based on Nokogiri to provide diffs and deltas for XML/HTML documents.
Fast and Powerful HTML Sanitization
(the library formerly known as Dryopteris)
Sanitary: [san-i-ter-ee] adj.
Free from elements, such as filth or pathogens, that endanger health; hygienic.
<div>
ohai!
<script>
alert('5cr1pt |<1dd13 wuz here');
</script>
kthxbye.
</div>
<div>
ohai!
alert("5cr1pt |<1dd13 wuz here");
kthxbye.
</div>
<div>
ohai!
kthxbye.
</div>
There is no one-size-fits-all sanitization.
Most people expect that the markup will be valid after sanitization.
Regexes don't help with this. Don't use them.
Loofah uses Nokogiri and libxml2, so you're guaranteed valid markup.
A whitelist of allowed tags and attributes is a must; it's a security best practice.
Enforcing a whitelist with regexes when you don't necessarily have valid markup is hard and wrong.
Enforcing a whitelist when you have a valid DOM tree is easy.
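To make the regex point concrete, here is a minimal sketch in plain Ruby of a naive filter being defeated. The naive_strip_scripts helper and the payload are hypothetical, made up for illustration; they are not taken from any of the libraries discussed here.

```ruby
# Hypothetical naive filter: delete anything that looks like a <script> tag.
def naive_strip_scripts(html)
  html.gsub(/<\/?script>/i, "")
end

# An attacker nests one tag inside another, so deleting the inner
# tags splices the outer pieces back into a working <script> element.
payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"

result = naive_strip_scripts(payload)
puts result # prints: <script>alert(1)</script>
```

A filter that walks a parsed DOM tree never sees this problem, because it makes decisions about whole nodes rather than about substrings.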
How do we render the output?
Most packages assume one output format and give you the string.
But what if I want to do some post-processing, such as changing <div> to <span>?
HTML::Sanitizer
fragment = "<script>
alert('5cr1pt |<1dd13 wuz here');
</script>"
HTML::Sanitizer.new.sanitize(fragment);
becomes
<script>alert('5cr1pt |<1dd13 here wuz></script>
The <script> tag is left in!
If you are using Rails's built-in sanitizer, you may want to think about a course in refrigerator repair.
Sanitize gem
Good:
Bad:
HTML5lib
Good:
Bad:
Worse:
Good: HTML5lib's whitelist and test suite.
Bad:
On large doc (98k), N = 100:
              real
Loofah        15.601635
ActionView    20.289561
Sanitize      27.340555
HTML5lib     114.587728
On small fragment (1k), N = 1000:
              real
Loofah         4.459879
ActionView     5.277416
Sanitize       5.223475
HTML5lib      34.500048
Hint: as fragments get smaller, libxml2's performance advantage over the alternatives shrinks.
doc = Loofah.document(unsafe_html)
doc.is_a? Nokogiri::HTML::Document # => true
doc = Loofah.fragment(unsafe_html)
doc.is_a? Nokogiri::HTML::DocumentFragment # => true
Loofah adds a scrub! method which modifies the document in place:
doc.scrub!(:strip) # replaces unsafe tags with their inner text
doc.scrub!(:prune) # removes unsafe tags and their children
doc.scrub!(:whitewash) # removes unsafe/namespaced tags and their children,
# and strips all attributes (good for MS Word HTML)
doc.scrub!(:escape) # escapes unsafe tags, like this: &lt;script&gt;
unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
doc = Loofah.fragment(unsafe_html).scrub!(:prune)
doc.to_s # => "ohai! <div>div is safe</div> "
doc.text # => "ohai! div is safe "
# this ...
Loofah.scrub_fragment(dangerous_html, :prune)
# is shorthand for this
Loofah.fragment(dangerous_html).scrub!(:prune)
Microsoft-y Markup looks like this ...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="ProgId" content="Word.Document"> <meta name="Generator" content="Microsoft Word 11"> <meta name="Originator" content="Microsoft Word 11"> <link rel="File-List" href="file:///C:%5CDOCUME%7E1%5CNICOLE%7E1%5CLOCALS%7E1%5CTemp%5Cmsohtml1%5C01%5Cclip_filelist.xml"> <!--[if gte mso 9]> <xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:Compatibility> <w:BreakWrappedTables/> <w:SnapToGridInCell/> <w:WrapTextWithPunct/> <w:UseAsianBreakRules/> <w:DontGrowAutofit/> </w:Compatibility> <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel> </w:WordDocument> </xml><![endif]--><!--[if gte mso 9]><xml> <w:LatentStyles DefLockedState="false" LatentStyleCount="156"> </w:LatentStyles> </xml><![endif]--><style> <!-- /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-parent:""; margin:0in; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in; mso-header-margin:.5in; mso-footer-margin:.5in; mso-paper-source:0;} div.Section1 {page:Section1;} --> </style><!--[if gte mso 10]> <style> /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman"; mso-ansi-language:#0400; mso-fareast-language:#0400; mso-bidi-language:#0400;} </style> <![endif]--> <p class="MsoNormal">Much <b style="">Simpler<o:p></o:p></b></p>
Whitewashed Microsoft-y markup looks like ...
<p>Much <b>Simpler</b></p>
HTML to Markdown
(an exercise in using Loofah for questionable purposes)
Loofah makes it easy to transform XML/HTML documents.
Eventually, McBean should allow HTML, Markdown, Textile, RTF, etc. to be interchangeable formats.
Wrote a 60-line Loofah::Scrubber class to convert HTML to Markdown in an afternoon.
http://github.com/flavorjones/mcbean/blob/master/lib/mcbean/markdown.rb
Pharos is looking for a 3-month contractor.
Email mike@pharos-ei.com for information!
Benchmark Solutions is hiring agile developers.
Email mike.dalessio@benchmarksolutions.com for information!