Nokogiri
History, Present and Future
Presented at GORUCO 2013
by Mike Dalessio (@flavorjones)
(a valid and well-formed presentation in 10 minutes)
permalink: bit.ly/nokogiri-goruco-2013
As of this morning, Nokogiri has been downloaded 11,755,224 times.
(For comparison, Rails is at around 24 million downloads.)
Nokogiri ==
Rails ==
I shall tell you.
But first ...
What does it mean?
It's a saw. You know, for cutting through trees. Of XML.
The year was 2008.
On September 9th, I had an email conversation with Aaron Patterson (@tenderlove).
(But more on dynamic bindings later.)
We quickly moved to writing a C extension to bind to libxml2.
The toughest problem writing a C extension?
Learn these tools!
valgrind
perftools.rb
(Thanks to Aman Gupta!)
We boldly stole the best XML API we could find ...
Hpricot's API.
First official release on November 17, 2008.
By January 2009, we had:
Here are some BS performance stats from the time:
People in the past argued a lot about XML library benchmarks.
Then, in August 2009, this:
Sigh.
JRuby didn't fully support C extensions.
This was a problem for JRubyists who wanted to use Nokogiri.
One codebase that ran on MRI, Rubinius and JRuby.
Foreign Function Interface
Ruby calling native C code directly.
One codebase that bound to libxml2 on any platform.
A cross-platform Ruby API for accessing native C code.
Shout out to Wayne Meissner (@wmeissner).
FFI is basically magic.
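To make "Ruby calling native C code directly" concrete, here is a minimal sketch. It uses Fiddle from Ruby's standard library rather than the ffi gem this talk is about (a substitution made so the example runs without extra gems), and it binds libc's strlen instead of anything from libxml2. It assumes a Unix-like system where Fiddle.dlopen(nil) exposes libc symbols; the helper name c_strlen is hypothetical.

```ruby
require "fiddle"

# Open the current process image; on Linux and macOS this exposes libc symbols.
# (Assumption: a Unix-like platform. This is not Nokogiri's code.)
libc = Fiddle.dlopen(nil)

# Describe the C signature by hand: size_t strlen(const char *s);
STRLEN = Fiddle::Function.new(
  libc["strlen"],
  [Fiddle::TYPE_VOIDP],
  Fiddle::TYPE_SIZE_T
)

# Hypothetical helper name, for illustration only.
def c_strlen(string)
  STRLEN.call(string)
end

puts c_strlen("nokogiri")
```

No compiler is involved here: the argument and return types are declared in Ruby, so a wrong signature is only discovered at run time.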
I spent most of January -- May 2009 rewriting all the C code in Ruby.
It was painful.
It took 3,049 lines of Ruby code
to reproduce 4,150 lines of C code.
Good: It worked!
(Golf clap.)
Bad: "Segfault-driven development"
Handicapped by lack of compile-time checking.
Bad: Portable string handling is hard.
JVM GC edge cases.
Bad: FFI code not any clearer than C
Requires you to think in C and translate to Ruby.
DO NOT WANT.
Really Bad: Huge performance penalty.
And worst of all ...
Let me tell you a story about RubyConf 2009.
Killed the FFI port.
Unpublished blog post.
:(
(Pause for a sip of beverage.)
"Portability is for people who cannot write new programs."
-- Linus Torvalds
College student.
Spiked on a pure-Java port over the summer.
It (mostly) worked.
$ sloccount ext lib
Totals grouped by language
(dominant language first):
java: 9206 (50.72%)
ansic: 4758 (26.22%)
ruby: 3932 (21.67%)
yacc: 253 (1.39%)
March 2012
(Pause for a sip of beverage.)
Instant poll: Any Ruby Windows developers out there?
Nobody has a build toolchain.
Nobody has libxml2 installed on their system.
Cross-compile and package the DLLs with the gem:
"Fat Binary" gems
$ ls -l gems
total 21652
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mingw32.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mswin32-60.gem
-rw-r--r-- 1 miked 221184 Mar 11 17:31 nokogiri-1.5.7.gem
Fat because we have to compile against multiple rubies:
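One way a fat binary gem can cope with several rubies is to ship one compiled extension per Ruby minor version and pick the right one at load time. The sketch below illustrates that idea under stated assumptions; it is not Nokogiri's actual loader, and the helper names and directory layout are hypothetical.

```ruby
# Hypothetical helper: map a Ruby version string to the require path of a
# precompiled extension stored in a per-version directory inside the gem,
# e.g. lib/nokogiri/1.9/nokogiri.so (this layout is an assumption).
def fat_binary_require_path(ruby_version = RUBY_VERSION, name = "nokogiri")
  minor = ruby_version[/\A\d+\.\d+/] # "1.9.3" -> "1.9"
  raise ArgumentError, "unrecognized Ruby version: #{ruby_version.inspect}" unless minor
  "#{name}/#{minor}/#{name}"
end

# Try the version-specific binary first, then fall back to a single
# extension compiled at install time.
def require_extension(name = "nokogiri")
  require fat_binary_require_path(RUBY_VERSION, name)
rescue LoadError
  require "#{name}/#{name}"
end

puts fat_binary_require_path("1.8.7")
puts fat_binary_require_path("1.9.3")
```

Each supported Ruby adds another copy of the compiled extension to the package, which is one reason the platform gems in the listing above are so much larger than the source gem.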
Luis Lavena (@luislavena) supports the Windows build toolchain basically single-handedly.
rake-compiler
mini_portile
He rules!
Nokogiri's JRuby port uses specific libraries:
isorelax
jing
nekodtd
nekohtml
and xerces
These may not be installed on the target system.
Build and package jar files!
$ ls lib/*jar
lib/isorelax.jar
lib/jing.jar
lib/nekodtd.jar
lib/nekohtml.jar
lib/xercesImpl.jar
The JRuby gem is also a "Fat Binary"
$ ls -l gems
total 21652
-rw-r--r-- 1 miked 2204160 Mar 11 17:31 nokogiri-1.5.7-java.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mingw32.gem
-rw-r--r-- 1 miked 9870336 Mar 11 17:31 nokogiri-1.5.7-x86-mswin32-60.gem
-rw-r--r-- 1 miked 221184 Mar 11 17:31 nokogiri-1.5.7.gem
Like any ordinary C extension, we compile on installation.
Unlike most C extensions, we have unwieldy external dependencies:
This is kind of lame.
This is lame because
int is_2_6_16(void)
{
  return (strcmp(xmlParserVersion, "20616") <= 0) ? 1 : 0 ;
}
if ( reparentee->type == XML_TEXT_NODE
     && pivot->type == XML_TEXT_NODE
     && is_2_6_16()
   ) {
  /* work around a string-handling bug in libxml 2.6.16.
     we'd rather leak than segfault. */
  pivot->content = xmlStrdup(pivot->content);
}
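The check above compares libxml2's version string directly. As an aside, here is a small sketch of the same gate in Ruby. It assumes libxml2's convention of encoding the version as major * 10000 + minor * 100 + micro, so "20616" means 2.6.16. The helper names are hypothetical and are not part of Nokogiri.

```ruby
require "rubygems" # provides Gem::Version (normally loaded already)

# Decode a libxml2-style version string such as "20616" into "2.6.16".
# Assumes the encoding major * 10000 + minor * 100 + micro; any non-digit
# suffix after the leading digits is ignored.
def decode_libxml_version(encoded)
  number = Integer(encoded[/\A\d+/], 10)
  major, rest = number.divmod(10_000)
  minor, micro = rest.divmod(100)
  "#{major}.#{minor}.#{micro}"
end

# Hypothetical gate mirroring is_2_6_16(): true for 2.6.16 and anything older.
def needs_text_node_workaround?(encoded)
  Gem::Version.new(decode_libxml_version(encoded)) <= Gem::Version.new("2.6.16")
end

puts decode_libxml_version("20616")
puts needs_text_node_workaround?("20616")
puts needs_text_node_workaround?("20900")
```

Every system library version a gem has to tolerate adds another branch like this, which is the maintenance cost the slide is complaining about.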
Have I told you about libxml 2.9.0 yet?
"Wait."
"Didn't you just say there's a toolchain for compiling autoconf projects and binding to them at gem installation time?"
Dude/Dudette!
You've been paying attention!
Thank you!
(drumroll)
Packages libxml2 and libxslt inside the gem.
Installation will Just Work™.
You can still use your system libraries if you really want to.
(Pending me running "gem push" in a few minutes.)
I hereby invent the "Fat Source" gem.
Nokogiri 2.0 roadmap up at github.com/sparklemotion/nokogiri
Node attributes API
Fragment boogs
Reader
Let's talk afterwards. (Preferably, on the boat.)