Nokogiri

History, Present and Future

Presented at GORUCO 2013

by Mike Dalessio (@flavorjones)

(a valid and well-formed presentation in 10 minutes)


permalink: bit.ly/nokogiri-goruco-2013

pivotal-labs

me

Nokogiri

As of this morning, Nokogiri has been downloaded 11,755,224 times.


(For comparison, Rails is at around 24 million downloads.)

Nokogiri == kelly

Rails == kiss

How did that happen?

I shall tell you.

But first ...

Etymology

What does it mean?

Nokogiri saw


It's a saw. You know, for cutting through trees. Of XML.

Origins

The year was 2008.

2008

  • Rails 2.1 and Ruby 1.8.7 are released.
  • 'Slumdog Millionaire' and 'Wall-E' in theaters.
  • Oil hits $100/barrel for the first time.
  • Stock markets crash. (Thanks, subprime lenders.)
  • LHC goes online.
  • Barack Obama is elected president.
  • NY Gov. Eliot Spitzer resigns in disgrace.

On September 9th, I had an email conversation with Aaron Patterson (@tenderlove).


tenderlove

chat 1

chat 2

How did DL work out?

DL was slow. Really slow.

(But more on dynamic bindings later.)

Origins

We quickly moved to writing a C extension to bind to libxml2.

Origins

The toughest problem writing a C extension?

Debugging libxml2's memory management model.

Debugging Memory Issues and C extensions

Learn these tools!

  • valgrind
  • perftools.rb (Thanks to Aman Gupta!)

API Design

We boldly stole the best XML API we could find ...

API Design by Theft

Hpricot's API.


why

First official release on November 17, 2008.

Meaningless Statistic

By January 2009, we had:

  • 1,510 lines of Ruby code
  • 1,760 lines of C code

Community acceptance

Here are some BS performance stats from the time:

bs-stats

People in the past argued a lot about XML library benchmarks.

Then, in August 2009, this:

why-tweet

why

Sigh.

The Dawn of JRuby

JRuby didn't fully support C extensions.

This was a problem for JRubyists who wanted to use Nokogiri.

I had a dream.

One codebase that ran on MRI, Rubinius and JRuby.

My FFI Phase

What's FFI?

Foreign Function Interface

Ruby calling native C code directly.
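
To make "Ruby calling native C code directly" concrete, here is a minimal sketch. It uses Fiddle, the dynamic-binding library that ships with Ruby (the successor to DL), rather than the ffi gem, purely so that it runs without extra dependencies. The constant and method names are illustrative assumptions, and `Fiddle.dlopen(nil)` finding libc is an assumption that holds on typical Linux and macOS builds but not on Windows.

```ruby
require "fiddle"

# Handle to the symbols already loaded into this process. On typical
# Linux and macOS Rubies that includes libc (assumption; Windows differs).
LIBC = Fiddle.dlopen(nil)

# Describe the C signature by hand: size_t strlen(const char *s)
STRLEN = Fiddle::Function.new(
  LIBC["strlen"],
  [Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_SIZE_T
)

# Hypothetical wrapper, named for this example only.
def c_strlen(string)
  STRLEN.call(string)
end

puts c_strlen("nokogiri")  # prints 8
```

Nothing checks the declared signature against the real C function, so a wrong type here shows up as a crash at run time rather than an error at build time.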

I had a dream.

One codebase that bound to libxml2 on any platform.

Ruby FFI

A cross-platform Ruby API for accessing native C code.

Ruby FFI

Shout out to Wayne Meissner (@wmeissner).

FFI is basically magic.

My FFI Phase

I spent most of January -- May 2009 rewriting all the C code in Ruby.

My FFI Phase

It was painful.

Meaningful Statistic

It took 3,049 lines of Ruby code

to reproduce 4,150 lines of C code.

PAIN: Writing C in Ruby

ffi-1

FFI port

Good: It worked!


(Golf clap.)

FFI port

Bad: "Segfault-driven development"


Handicapped by lack of compile-time checking.

FFI port

Bad: Portable string handling is hard.


JVM GC edge cases.

FFI port

Bad: FFI code not any clearer than C


Requires you to think in C and translate to Ruby.

ffi-2


DO NOT WANT.

FFI port

Really Bad: Huge performance penalty.

ffi-chart-1

FFI port

And worst of all ...

Let me tell you a story about RubyConf 2009.

All these choices suck.

choices

Outcome

Killed the FFI port.

Unpublished blog post.


:(

FFI Lessons

  • If you care about performance, you need to write native extensions
  • If you care about multi-platform, you need to support at least two codebases
  • If you don't care about either, then FFI is for you!

(Pause for a sip of beverage.)

"Portability is for people who cannot write new programs."

-- Linus Torvalds

Enter Sergio Arbeo

Sergio Arbeo (@serabe)

College student.

Spiked on a pure-Java port over the summer.

It (mostly) worked.

believe

Meaningless Statistic

$ sloccount ext lib

Totals grouped by language
(dominant language first):

java:          9206 (50.72%)
ansic:         4758 (26.22%)
ruby:          3932 (21.67%)
yacc:           253 (1.39%)

March 2012

zomg-java

Meaningful Statistic

ffi-chart-2

Ruby Bounty contributors

  • Aaron Patterson (@tenderlove)
  • Mike Dalessio (@flavorjones)
  • Darrin Eden (@dje)
  • Charles Nutter (@headius)
  • Tony Arcieri (@tarcieri)
  • Sergio Arbeo (@serabe)
  • Roger Pack (@rdp)

Ruby Bounty winners

  • Pat Mahoney (@pmahoney)
  • Yoko Harada (@yokolet)
  • Charles Nutter (@headius)
  • Sergio Arbeo (@serabe)

(Pause for a sip of beverage.)

The Present

Installation

Installation Problems and Solutions

  • Windows
  • JRuby
  • everybody else

Installation on Windows

It's complicated.

Instant poll: Any Ruby Windows developers out there?

Nobody has a build toolchain.

Nobody has libxml2 installed on their system.

Solution

Cross-compile and package the DLLs with the gem:

  • libxml2
  • libxslt
  • libiconv
  • zlib

"Fat Binary" gems

$ ls -l gems
total 21652
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mingw32.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mswin32-60.gem
-rw-r--r-- 1 miked  221184 Mar 11 17:31 nokogiri-1.5.7.gem

Fat because we have to compile against multiple rubies:

  • Ruby 1.8.7
  • Ruby 1.9.3
  • Ruby 2.0 (as of nokogiri 1.5.7)
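
A fat gem has to pick the right precompiled extension when it is loaded. The following is a rough sketch of that selection, assuming one directory per Ruby minor version inside the gem. The directory layout and both helper names are assumptions for illustration, not a copy of Nokogiri's loader.

```ruby
# Hypothetical helper: map a Ruby version to the path of the matching
# precompiled extension inside a fat gem.
def versioned_extension_path(gem_name, ruby_version = RUBY_VERSION)
  major_minor = ruby_version[/\A\d+\.\d+/]   # "1.9.3" -> "1.9"
  "#{gem_name}/#{major_minor}/#{gem_name}"
end

# Try the version-specific binary first, then fall back to an extension
# compiled at install time (the plain source gem case).
def require_extension(gem_name)
  require versioned_extension_path(gem_name)
rescue LoadError
  require "#{gem_name}/#{gem_name}"
end

puts versioned_extension_path("nokogiri", "1.9.3")  # prints nokogiri/1.9/nokogiri
```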

Windows

Luis Lavena (@luislavena) supports the Windows build toolchain basically single-handedly.

  • rake-compiler
  • mini_portile
  • infinite patience

He rules!

Installation on JRuby

Nokogiri's JRuby port uses specific libraries:

  • isorelax
  • jing
  • nekodtd and nekohtml
  • xerces

These may not be installed on the target system.

Solution

Build and package jar files!

$ ls lib/*jar
lib/isorelax.jar
lib/jing.jar
lib/nekodtd.jar
lib/nekohtml.jar
lib/xercesImpl.jar
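
Packaged jars still have to be put on the classpath when the gem loads. Here is a minimal sketch of one way to do that, relying on the fact that JRuby accepts a jar path in `require`. Both helper names are hypothetical, and the `engine` parameter exists only so that the sketch does nothing on other Rubies.

```ruby
# Hypothetical helper: list the jar files bundled in a gem's lib directory.
def bundled_jars(lib_dir)
  Dir.glob(File.join(lib_dir, "*.jar")).sort
end

# On JRuby, requiring a jar adds it to the classpath. On any other
# engine this returns an empty list and loads nothing.
def load_bundled_jars(lib_dir, engine = RUBY_ENGINE)
  return [] unless engine == "jruby"
  bundled_jars(lib_dir).each { |jar| require jar }
end
```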

The JRuby gem is also a "Fat Binary"

$ ls -l gems
total 21652
-rw-r--r-- 1 miked 2204160 Mar 11 17:31 nokogiri-1.5.7-java.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mingw32.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mswin32-60.gem
-rw-r--r-- 1 miked  221184 Mar 11 17:31 nokogiri-1.5.7.gem

Installation on MRI(-ish)

Like any ordinary C extension, we compile on installation.

Installation on MRI(-ish)

Unlike most C extensions, we have unwieldy external dependencies:

  • libxml2
  • libxslt

This is kind of lame.

External system dependencies

This is lame because

  • everyone has different libxml2 versions installed
  • ... in different places
  • ... with different (possibly buggy) behavior

PAIN FOR YOU

installation

PAIN FOR ME

int is_2_6_16(void)
{
  return (strcmp(xmlParserVersion, "20616") <= 0) ? 1 : 0 ;
}

if (   reparentee->type == XML_TEXT_NODE
    && pivot->type      == XML_TEXT_NODE
    && is_2_6_16()
   ) {
  /* work around a string-handling bug in libxml 2.6.16.
     we'd rather leak than segfault. */
  pivot->content = xmlStrdup(pivot->content);
}
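
The "20616" in that check is libxml2's numeric version encoding: major * 10000 + minor * 100 + micro. A small Ruby sketch of decoding and comparing such strings follows. Both helper names are hypothetical, and the sketch compares numerically where the C code above compares strings.

```ruby
# Hypothetical helper: decode a libxml2 version string such as "20616"
# into [major, minor, micro]. String#to_i ignores any trailing suffix.
def decode_libxml_version(version_string)
  number = version_string.to_i
  [number / 10_000, (number / 100) % 100, number % 100]
end

# True when the given version is at or below major.minor.micro.
def at_most?(version_string, major, minor, micro)
  (decode_libxml_version(version_string) <=> [major, minor, micro]) <= 0
end

p decode_libxml_version("20616")    # prints [2, 6, 16]
p at_most?("20900", 2, 6, 16)       # prints false
```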

Have I told you about libxml 2.9.0 yet?

"Wait."

"Didn't you just say there's a toolchain for compiling autoconf projects and binding to them at gem installation time?"

Dude/Dudette!

You've been paying attention!

Thank you!

I'm Happy to Announce

(drumroll)

Nokogiri 1.6.0

Packages libxml2 and libxslt inside the gem.

Nokogiri 1.6.0

Installation will Just Work™.

You can still use your system libraries if you really want to.
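
A sketch of how an install script might decide between the packaged libraries and the system ones. The environment variable name is recalled from Nokogiri's installation notes of this era and should be treated as an assumption to verify. The helper names and return values are purely illustrative.

```ruby
# Assumption: opting out of the packaged libraries is signalled with the
# NOKOGIRI_USE_SYSTEM_LIBRARIES environment variable at install time.
def use_system_libraries?(env = ENV)
  value = env["NOKOGIRI_USE_SYSTEM_LIBRARIES"]
  !(value.nil? || value.empty?)
end

# Hypothetical decision point for an extconf-style script.
def library_source(env = ENV)
  use_system_libraries?(env) ? :system : :packaged
end

puts library_source({})                                        # prints packaged
puts library_source("NOKOGIRI_USE_SYSTEM_LIBRARIES" => "1")    # prints system
```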

Nokogiri 1.6.0

(Pending me running "gem push" in a few minutes.)

Naming Things is Easy!

I hereby invent the "Fat Source" gem.


mugatu

The Future

Nokogiri 2.0 roadmap up at github.com/sparklemotion/nokogiri

Highlights

  • better serialization / pretty-printing
  • faster SAX parsing (see the fairy-wing throwdown)
  • better Node attributes API
  • CSS query parsing (pseudo-selectors and jQuery compat)
  • Fragment boogs
  • better custom XPath handler API
  • general encodings improvement
  • Reader

Let's talk afterwards. (Preferably, on the boat.)

me