Nokogiri:

History of a Gem

A valid and well-formed talk on an open-source success

Mike Dalessio / @flavorjones

mike.daless.io / blog.flavorjon.es

GoGaRuCo 2013

Thanks to Josh Susser and the organizers. Josh first invited me to speak about Nokogiri in 2008 at icanhasruby, and now he's repeated that mistake, and for that he has my undying gratitude. Nokogiri is like my child, and my personal story is wrapped up in Nokogiri's story.

story

FAQ

  1. What is Nokogirl?
  2.  
  3.  
  4.  
  5.  
$ gem install nokogirl
Building native extensions.  This could take a while...

It's actually spelled nokogiri, not nokogirl

Successfully installed nokogiri-1.6.0
Successfully installed nokogirl-1.0
2 gems installed
Magnus Holm, judofyr

FAQ

  1. What is Nokogirl?
  2. What is Nokogiri?
  3.  
  4.  
  5.  

A Ruby API for XML/HTML parsing and manipulation.

Worked on it primarily with Aaron Patterson. Fixes broken markup. SAX parser, pull parser, XSLT, schema validation.

FAQ

  1. What is Nokogirl?
  2. What is Nokogiri?
  3. OMG, isn't that boring?
  4.  
  5.  
I have a confession to make.

I hate XML.

But I ❤ making painful things not-painful.

And that's the itch I like to scratch. It's rewarding.

Nokogiri has been downloaded 14.8 million times.


(Rails has 27.6 million downloads, Formtastic has 1.7 million.)

Look, it's not a contest. But it might be interesting to put these numbers in context with some recording industry numbers ...

If Rails is

kiss

Been around forever. Used to be edgy, now what your parents listen to. Highly-produced. Glam. Either hate it or love it.

And Formtastic is

eileen

Everybody loves catchy gems with a hook.

Then Nokogiri is

kelly

Broadly appealing, adult contemporary, inoffensive background music.

So yes, Nokogiri does one boring thing.

But it does it well, and people seem to like it.

And that makes me happy.

FAQ

  1. What is Nokogirl?
  2. What is Nokogiri?
  3. OMG, isn't that boring?
  4. What does "Nokogiri" mean?
  5.  

nokogiri

It's wordplay. If you have a dense forest of XML trees, you need a saw.

punny

I don't know if you know this, but @tenderlove likes wordplay. And the fifth most commonly-asked question ...

FAQ

  1. What is Nokogirl?
  2. What is Nokogiri?
  3. OMG, isn't that boring?
  4. What does "Nokogiri" mean?
  5. I can't install Nokogiri on Mac OSX 10.8.4 running MacPorts.
That's not a question! But we'll cover it later, anyway.

“There's an old saying about those who forget history. I don't remember it, but it's good.”

Stephen Colbert

Why is history interesting or even relevant? History is the testing ground of ideas. Needs fulfilled and unfulfilled.

2006

serial-killer

Ruby HTML/XML Parsers in 2006

uspg

Scraping HTML with Mechanize.

Mechanize used Hpricot at this time. More importantly, though ...

uspg

Scraping super-secret HTML with Mechanize.

Problem: Needed support for client-side certificates.

Solution: Emailed a patch to @tenderlove

My First ✌Pull request✌

pull

@tenderlove was open and kind and responsive.

I sent more patches.

I got commit privileges.

I kept contributing.

Time passed.

2008

pharos

Scraping HTML with Mechanize.

pharos

Scraping broken HTML with Mechanize.

I was running into Hpricot bugs, both with markup correction and things like CSS off-by-one errors and crashing when the HTML was exactly 16kb long.

(@flavorjones, in full GitHub-stalker mode)


You're working on an XML wrapper, right? Do you have any thoughts on how useful libxml2 would be with malformed HTML? For me, this was actually the killer feature of Hpricot -- it manages to un-mangle HTML really well ... There's a lot of malformed HTML out there.

(@tenderlove, being indulgent)


Awesome! Yes. Libxml actually handles broken HTML better than hpricot does. I have test cases for which libxml will handle broken html better than hpricot.

(@tenderlove)


I've submitted those test cases as bugs for hpricot as well. I would have patched hpricot, but it is too hard for me to read!

This should give hope to anyone who has ever had trouble figuring out where to start on a new codebase.

(@flavorjones)


How can I help out?

The five most significant words you can use with an open-source maintainer.

(@tenderlove)


I've started a project called 'nokogiri' which is my libxml wrapper. There is no C, it uses DL exclusively.

Dynamic Language binding.

Call C libraries without writing C code.

# lib/nokogiri/dl/xml.rb
module Nokogiri
  module DL
    module XML
      extend ::DL::Importer

      dlload('libxml2.so') rescue dlload('libxml2.dylib')

      extern "void * xmlReadMemory (const char *, int, const char *,
                       const char *, int)"

      ...
    end
  end
end

# lib/nokogiri/xml.rb
module Nokogiri
  module XML
    class << self
      def parse(string, url = nil, encoding = nil, options = 1)
        Document.wrap(NokogiriLib::XML.xmlReadMemory(
                        string, string.length,
                        (url || 0),
                        (encoding || 0),
                        options
                      ))
      end
    end
  end
end
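
DL has since been removed from Ruby's standard library in favor of Fiddle, so the code above won't load on a current Ruby. As a rough sketch of the same idea (declare a C prototype in Ruby, then call it), here is a Fiddle binding for libc's `strlen`. The `LibC` module name is made up for illustration, and binding `strlen` instead of `xmlReadMemory` is an assumption made only so the example doesn't need libxml2 installed.

```ruby
require "fiddle"
require "fiddle/import"

# Hypothetical module name; it binds a single libc function from its C prototype.
module LibC
  extend Fiddle::Importer

  # Fiddle::Handle::DEFAULT looks up symbols already loaded into this process,
  # which normally includes the C standard library.
  dlload Fiddle::Handle::DEFAULT

  # The prototype string is parsed here, once, when the module body runs.
  extern "size_t strlen(const char *)"
end

# Ruby strings are passed to the C function as char pointers.
length = LibC.strlen("nokogiri")
puts length
```

No C was compiled here, which is exactly the appeal. Every call still has to convert its arguments and cross the binding layer, though.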

DL was slow. Really slow.

We killed it and started writing a C extension to call libxml2 directly.

DL re-parsed the function declaration every time you called it. We could have worked on this, but the code was pretty hairy and experimental, and didn't have great tests.

static VALUE read_memory( VALUE klass,
                          VALUE string,
                          VALUE url,
                          VALUE encoding,
                          VALUE options )
{
  const char * c_buffer = StringValuePtr(string);
  const char * c_url    = NIL_P(url)      ? NULL : StringValuePtr(url);
  const char * c_enc    = NIL_P(encoding) ? NULL : StringValuePtr(encoding);
  int len               = (int)RSTRING_LEN(string);
  xmlDocPtr doc;

  ...

  doc = xmlReadMemory(c_buffer, len, c_url, c_enc, (int)NUM2INT(options));

  ...
}

The toughest problem we encountered writing Nokogiri:

Discovering and debugging libxml2's memory management.

mem1

The Document struct, when freed, recursively frees everything under it.

mem1

There is zero or one Ruby object for each C struct.

mem1

The Document struct is freed when the Document object is GCed. We structure Ruby object references so that Nodes aren't GCed so long as I have a reference to the Document.

mem1

We have to carefully construct object references to make sure we never free a C struct pointed to by a Ruby object.

mem1

Further complicating things: Documents have a Dictionary to minimize the number of strings that have to be allocated.

Attributes and Namespaces have strings in the original document's dictionary.

What happens when you move a Node to another Document, then GC the first Doc?

Fun fact: libxml2 merges string nodes if they're next to each other.

What happens to the Ruby object pointing to a non-existent C object?

Actual comment explaining some insane memory-management logic:

/*
 *  if the reparentee is a text node, there's a very good chance it
 *  will be merged with an adjacent text node after being reparented,
 *  and in that case libxml will free the underlying C struct.
 *
 *  since we clearly have a ruby object which references the underlying
 *  memory, we can't let the C struct get freed. let's pickle the original
 *  reparentee by rooting it; and then we'll reparent a duplicate of the
 *  node that we don't care about preserving.
 *
 *  alternatively, if the reparentee is from a different document than the
 *  pivot node, libxml2 is going to get confused about which document's
 *  "dictionary" the node's strings belong to (this is an otherwise
 *  uninteresting libxml2 implementation detail). as a result, we cannot
 *  reparent the actual reparentee, so we reparent a duplicate.
 */
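
The ownership rule behind that comment can be modeled in plain Ruby. This is a toy sketch, not Nokogiri or libxml2 code: `ToyDocument`, `ToyNode`, `free` and `dup_into` are hypothetical names, and "freeing" is simulated with a flag. It shows why moving a duplicate into the second document is safer than moving the original.

```ruby
# Toy model only: these classes stand in for libxml2 structs and are not
# part of any real API.
class ToyNode
  attr_reader :name, :document
  attr_accessor :freed

  def initialize(name, document)
    @name = name
    @document = document # strong reference keeps the owning document reachable
    @freed = false
  end

  # Create a copy owned by another document instead of moving this node.
  def dup_into(other_document)
    other_document.create_node(name)
  end
end

class ToyDocument
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  def create_node(name)
    node = ToyNode.new(name, self)
    @nodes << node
    node
  end

  # Mimics the recursive free: everything the document owns goes with it.
  def free
    @nodes.each { |node| node.freed = true }
    @nodes.clear
  end
end

doc_a = ToyDocument.new
doc_b = ToyDocument.new

original = doc_a.create_node("p")
copy = original.dup_into(doc_b)

doc_a.free

puts "original freed? #{original.freed}"
puts "copy freed?     #{copy.freed}"
```

In the toy, freeing the first document invalidates only the nodes it owns. The duplicate lives and dies with the second document.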

Different versions of libxml2 have different fun bugs in how they merge (or don't merge) nodes.

PAIN.

Writing and Debugging C Extensions

In conclusion, learn these tools!

API Design

We boldly stole the best XML API we could find ...

API Design by Theft

Hpricot's API.


why

The first few versions of Nokogiri had an Hpricot-compatible API module.

@tenderlove liked calling this layer "bug-compatible", and we eventually killed it.

First official release

on November 17, 2008.

(DST weekend.)

Protip: don't release software on DST weekend if you have a job where energy trading is 24/7.

Community response

Mixed.

Early adopters:

But

People liked to argue a lot about XML library benchmarks in 2008.

And people loved Hpricot.

Lots of people had opinions on benchmarks:

Here are some BS benchmarks:

For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000       1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0015 seconds for Nokogiri
it took an average of 0.0064 seconds for Hpricot

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10         0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.0322 seconds for Nokogiri
it took an average of 0.3503 seconds for Hpricot
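
The "average" lines are just the wall-clock ("real") column divided by the iteration count. A small sketch that reruns that arithmetic on the numbers reported above; the `RUNS` constant and `average_seconds` helper are names invented for this example.

```ruby
# Wall-clock ("real") totals copied from the tables above.
RUNS = {
  small: { iterations: 1000, nokogiri: 1.537546, hpricot: 6.401207 },
  large: { iterations: 10,   nokogiri: 0.322290, hpricot: 3.502819 }
}

# Hypothetical helper: average seconds per parse, rounded to 4 decimal places.
def average_seconds(total_real, iterations)
  (total_real / iterations).round(4)
end

RUNS.each do |label, run|
  nokogiri_avg = average_seconds(run[:nokogiri], run[:iterations])
  hpricot_avg  = average_seconds(run[:hpricot], run[:iterations])
  ratio        = run[:hpricot] / run[:nokogiri]

  puts format("%-5s nokogiri %.4f s, hpricot %.4f s, hpricot/nokogiri %.1fx",
              label, nokogiri_avg, hpricot_avg, ratio)
end
```

The ratio column is the part people argued about. It grows with document size in these two runs, which is about all two data points can tell you.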

In retrospect, largely pointless, other than driving us to fix bottlenecks more quickly.

Much like publicly debating which movie star is hotter drives the celebrity to get plastic surgery.

<tangent class="personal">

Late 2008, I was introduced to Pivotal Labs by a fan of Nokogiri.

nakajima

I was hired largely on the basis of my open-source work.

high five

Pause for consideration.

"I can't imagine hiring someone that I didn't know through open source."

DHH, 2005

"Open source is your farm system. Use it!"

Bryan Cantrill at Joyent, 2013

Pivotal regularly receives applications without a link to a GitHub profile, or references to public or open-source work.

</tangent>

Then, in August 2009, this tweet:

why-tweet

why

why-tweet

This is also a theme running through his recent manifesto. I love _why, but I just don't agree with this sentiment. If we're not constantly looking for ways to improve, why are we in this business? We miss you, _why.

The Dawn of JRuby

Show of hands, please.

2009 Fact:

JRuby didn't fully support the C extension API.

This was a problem for JRubyists who wanted to use Nokogiri.

I had a dream

One codebase that ran on MRI, Rubinius and JRuby.

My FFI Phase

sunflower-phase

What's FFI?

Foreign Function Interface

Ruby calling native C code directly.

A better implementation of what DL was meant to do.

attach_function :xmlReadMemory,
                [:string, :int, :string, :string, :int],
                :pointer

...

module Nokogiri
  module XML
    class Document < Node
      def self.read_memory(string, url, encoding, options)
        wrap_with_error_handling do
          LibXML.xmlReadMemory(string, string.length, url, encoding, options)
        end
      end
    end
  end
end

pony

Shout out to Wayne Meissner (@wmeissner).

rewrite

I spent January to May 2009 doing this.

Meaningful Statistic

It took 3,049 lines of Ruby/FFI code

to reproduce 4,150 lines of C code.

That's not great for a language that's way more powerful than C. And it was painful ...

PAIN:

Writing C in Ruby

ffi-1

PAIN:

"Segfault-driven development"

Development was handicapped by lack of compile-time checking.

PAIN:

Portable string handling is hard.

JVM GC edge cases.

PAIN:

FFI code not any clearer than C.

Requires you to think in C and translate to Ruby.

I DO NOT WANT

ffi-2

PAIN:

Big performance penalty.

Though FFI is reportedly much faster these days, serializing data and calling through the FFI stack is always going to be slower than a native function call.

ffi-chart-1

PAIN:

"Harassment-driven development"

Let me tell you a story about RubyConf 2009.

FFI Lessons

  1. If you care about performance, you need to write native extensions
  2. If you care about multi-platform, you need to support at least two codebases
  3. If you don't care about either, then FFI is for you!

Worth noting that you may not care about performance unless you're doing millions of function calls. As a counter-example, Martin Bosslet is using FFI in Krypt and loving it, because cryptographic functions are generally called once and are CPU-intensive once in native-land.

All these choices suck.

choices

Outcomes

  1. Killed the FFI port.
  2. Unpublished blog post.
  3. :`(

whoa

"Portability is for people who cannot write new programs."

Linus Torvalds

I wasn't willing or able to write a "new program" for JRuby -- I wanted to use the libxml2 port everywhere.

Enter Sergio

Sergio Arbeo (@serabe).

Spanish college student.

Spiked on a pure-Java port over GSOC 2009.

It (mostly) worked.

believe

Ruby Bounty

A Ruby Bounty was created to finish up a pure-Java port, using Xerces and NekoHTML.


$625 was the largest Bounty ever at the time.

Roger Pack started the Ruby Bounty program.

The Money Men

Ruby Bounty contributors:

Awesome people took it across the finish line

Ruby Bounty winners:

Meaningless Statistic

$ sloccount ext lib

Totals grouped by language
(dominant language first):

java:          9206 (50.72%)
ansic:         4758 (26.22%)
ruby:          3932 (21.67%)
yacc:           253 (1.39%)

Meaningless Statistic

zomg-java

Meaningful Statistic

ffi-chart-2

For a flavor of the zeitgeist, check out this awesome ruby-talk thread


Three general opinions expressed:

  1. Do it in pure Ruby and fix the implementations' performance problems! ✘ See REXML!
  2. Do it with FFI! ✘ Threw it away!
  3. Do it native! ✔ ← THIS.
Even in 2009, there was still confusion and debate over how we should support JRuby.

Installation

Remember our FAQ? ...

FAQ

  1. What is Nokogirl?
  2. What is Nokogiri?
  3. OMG, isn't that boring?
  4. What does "Nokogiri" mean?
  5. I can't install Nokogiri on Mac OSX 10.8.4 running MacPorts.

FAQ

  1. What is Nokogirl?
  2. What is Nokogiri?
  3. OMG, isn't that boring?
  4. What does "Nokogiri" mean?
  5. I can't install Nokogiri on X running Y.

Hard for different reasons on different platforms.


But, basically: dependencies.

Let's do this in chronological order:

  1. Windows
  2. JRuby
  3. Everybody else

Windows Problems

Nobody has a build toolchain.

Nobody has libxml2 installed on their system.

Installation solution

Cross-compile and package DLLs with the gem:

"Fat Binary" Gems

$ ls -l gems
total 21652
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mingw32.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mswin32-60.gem
-rw-r--r-- 1 miked  221184 Mar 11 17:31 nokogiri-1.5.7.gem
10MB versus 220KB

Fat because we have to compile against multiple rubies:

Going to get even fatter once we finish support for 64-bit Windows.
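
One common way a fat gem picks the right binary at load time is to keep one compiled extension per Ruby minor series and choose a directory from `RUBY_VERSION`. The sketch below assumes that layout. The `versioned_extension_path` helper and the directory naming are illustrative, and whether Nokogiri's own loader looks exactly like this is an assumption.

```ruby
# Hypothetical helper: maps a Ruby version string to the path of the extension
# compiled for that minor series, e.g. "1.9.3" -> "nokogiri/1.9/nokogiri".
def versioned_extension_path(ruby_version, gem_name = "nokogiri")
  minor_series = ruby_version[/\A\d+\.\d+/]
  raise ArgumentError, "unrecognized Ruby version: #{ruby_version.inspect}" unless minor_series

  "#{gem_name}/#{minor_series}/#{gem_name}"
end

# A fat gem's entry point could try the versioned path first and fall back to
# the single extension that a source gem compiles at install time:
#
#   begin
#     require versioned_extension_path(RUBY_VERSION)
#   rescue LoadError
#     require "nokogiri/nokogiri"
#   end

puts versioned_extension_path("1.8.7")
puts versioned_extension_path("1.9.3")
puts versioned_extension_path(RUBY_VERSION)
```

Every supported Ruby series adds another copy of the compiled extension to the package, which is where the extra megabytes come from.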

Windows Support

Luis Lavena (@luislavena) supports the Windows build toolchain basically single-handedly.

He rules!

Show of hands

Any Ruby Windows developers out there?

We could use some Windows peeps to help support the platform.


Tweet me, maybe: @flavorjones

JRuby Problems

Nokogiri's JRuby port uses specific libraries:

These may not be installed on the target system.

Installation solution

Build and package jar files!

$ ls lib/*jar
lib/isorelax.jar
lib/jing.jar
lib/nekodtd.jar
lib/nekohtml.jar
lib/xercesImpl.jar

The JRuby gem is also a "Fat Binary"

$ ls -l gems
total 21652
-rw-r--r-- 1 miked 2204160 Mar 11 17:31 nokogiri-1.5.7-java.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mingw32.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mswin32-60.gem
-rw-r--r-- 1 miked  221184 Mar 11 17:31 nokogiri-1.5.7.gem
Though still not as fat as the Windows gems.

MRI Problems

Unlike most C extensions, we have unwieldy external dependencies:

LAME.

PAIN:

FOR YOU

installation

PAIN:

FOR ME

int is_2_6_16(void)
{
  return (strcmp(xmlParserVersion, "20616") <= 0) ? 1 : 0 ;
}

if (   reparentee->type == XML_TEXT_NODE
    && pivot->type      == XML_TEXT_NODE
    && is_2_6_16()
   ) {
  /* work around a string-handling bug in libxml 2.6.16.
     we'd rather leak than segfault. */
  pivot->content = xmlStrdup(pivot->content);
}

libxml 2.9.x

Currently the default for homebrew.

XPath query bug breaks CSS queries.


(This was very recently fixed.)

This put me over the edge. I was not going to write a new CSS query parser implementation to work around a bug in libxml2.

"Wait."

"Didn't you just say that mini_portile will compile autoconf projects and bind to them at gem installation time?"

Compilation at gem install-time is totally not what mini_portile was meant to be used for.

Nokogiri 1.6.0

inside

Packages libxml2 and libxslt inside the gem.

Good For You

Installation Just Works™.

(You can still use your system libraries if you really want to.)

Good For Me

I can lock Nokogiri's logic to a specific version of libxml2, lowering support and testing burden.

I may or may not actually do this.

"Fat Source"

mugatu

Naming Things is Easy!

The Future

Nokogiri 2.0 roadmap up at github.com/sparklemotion/nokogiri

2.0 Roadmap Highlights

Improving APIs:

2.0 Roadmap Highlights

Addressing architectural issues:

Call To Action

Call to Action, Part I

If you're a Windows MRI user,

Get involved.

Demonstrate to gem maintainers that Windows MRI is worth supporting.

Call to Action, Part II

If you aren't already,

Contribute to open-source.

Start small. Iterate. Get lots of small wins. Make friends. Get a better job.

Nokogiri does one boring thing, but does it well.

You are annoyed by something small and boring somewhere. Go fix it.

I wouldn't have my job, wouldn't be talking at conferences, and wouldn't be part of the Ruby community, if I hadn't started scratching my own itch.

Scratch your itch. Make the world a better place.

And get to hang out with @tenderlove.