Rodrigo Rosenfeld Rosas

How Nokogiri and JRuby saved my week

Sun, 04 Mar 2012 12:30:00 +0000

I'd like to share some experiences I had this week trying to parse some HTML with Groovy.

Then I'll explain how the job was done better with JRuby, and finished much faster too.

This week I had to extract some references from some HTML documents and store them to the database.

This is what I wanted to implement, written as a MiniTest spec in Ruby:

# encoding: utf-8
require 'minitest/autorun'
require_relative '../lib/references_extractor'

describe ReferencesExtractor do
  def example
    %Q{
      <div cid=1>
        <empty cid=11>
        </empty>
        some text
        <div cid=12>
          <div cid=121>
            <empty /><another></another>
            <p cid=1211>First paragraph.</p>
            <p cid=1212>Second paragraph.</p>
          </div>
          <p cid=122>Another pa<b>ra</b>graph.</p>
        </div>
      </div>
    }
  end

  it "extract references from example" do
    return
    extractor = ReferencesExtractor.new example
    {
      ['1'] => {'1' => "some text First paragraph. Second paragraph. Another paragraph."},
      ['1211', '1212', '11'] => {'121' => "First paragraph. Second paragraph."},
      ['1211', '1212', '122'] => {'12' => "First paragraph. Second paragraph. Another paragraph."},
      ['12', '1212']          => {'12' => "First paragraph. Second paragraph. Another paragraph."},
      ['1212', '122'] => {'1212' => "Second paragraph.", '122' => "Another paragraph."},
    }.each {|cids, expected| extractor.get_references_texts(cids).must_equal(expected) }
  end
end

I had a similar test written using JUnit, with a small change to make it easier to implement, which I'll discuss later in this article. Let me first explain the situation better.

Don't ask me what "cid" means, as I wasn't the one who named this attribute. I guess it is "c..." id, although I have no clue what the "c..." stands for. It was already called this when I started working on this project, and I'm now its sole developer after many other developers worked on it before me.

Part of the application I maintain has to deal with documents obtained from EDGAR filings. Each HTML tag is processed so that it gets a sequential, unique number in its "cid" attribute. Someone can then review the documents and highlight certain parts by clicking on elements in the page. So the database holds a reference to a document plus a cid list, like "1000,1029,1030", with all the elements that should be highlighted. This list is stored exactly this way, as a string in a database column.

But some weeks ago I was asked to export the contents of some highlighted references to an Excel spreadsheet, and this is somewhat more complex than it looks. With jQuery, it would be equivalent to "$('[cid=12]').text()".

For performance reasons in the search interface, I had to import all references from over 3,000 documents into the database. For new references, I'll do the processing with jQuery and send the text already formatted to the server, but I still needed to do the initial import, and batch processing on the client side would be painfully slow for this case.

But getting the correct output on the server side is not that simple, even though no CSS is involved in those documents, which makes them simpler to deal with. For example, "<div>some t<div>ex</div>t</div>" should be stored as "some t ex t", while "<div>some t<span>ex</span>t</div>" should be stored as "some text". Since this requires a deeper understanding of HTML semantics, I decided to simplify it in the Groovy version: treat all elements as block-level and parse the fixed-up HTML as XML.
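To make the block versus inline rule concrete, here is a minimal sketch in Ruby. It is not the code used for the import: it assumes well-formed input parsed with REXML, and both the `INLINE_TAGS` list and the helper names are hypothetical, picked only for illustration (real HTML has many more inline elements).

```ruby
require 'rexml/document'

# Assumption: a hand-picked list of inline tags; everything else counts as block-level.
INLINE_TAGS = %w[span b i em strong a].freeze

# Collect text recursively, padding block-level elements with spaces
# so that their content is separated from the surrounding text.
def visible_text(node)
  case node
  when REXML::Text
    node.value
  when REXML::Element
    inner = node.children.map { |child| visible_text(child) }.join
    INLINE_TAGS.include?(node.name) ? inner : " #{inner} "
  else
    '' # comments, processing instructions, etc.
  end
end

# Parse the markup, extract the text and collapse runs of white space.
def normalized_text(xml)
  root = REXML::Document.new(xml).root
  visible_text(root).gsub(/\s+/, ' ').strip
end

puts normalized_text('<div>some t<div>ex</div>t</div>')   # some t ex t
puts normalized_text('<div>some t<span>ex</span>t</div>') # some text
```

Treating every element as block-level, as in the Groovy simplification, would amount to leaving `INLINE_TAGS` empty.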

The Groovy solution

Doing that in Groovy took me a full week, especially due to the lack of documentation for Groovy's XmlParser and XmlSlurper classes.

First, I had no clue which one to choose. As they have similar interfaces, I decided to start with XmlParser and switch to XmlSlurper once it was finished, to compare their performance.

I couldn't find any method for searching by XPath or CSS expression. When you write "new XmlParser().parseText(xmlContent)", you get a Node.

XmlParser is not an HTML parser, so the XML content must be well formed; for real-world HTML you need a library like NekoHTML or TagSoup. You would then use it like "new XmlParser(new Parser()).parseText(xmlContent)". That's OK, but if you just want to play with it and don't know Groovy well enough to deal with Gradle and Maven dependencies, just use valid XML as an example.

Since I couldn't find a search-like method on Node, I had to look for the node matching '[cid=12]' with something like this:

xmlContent = '<div cid="12"> some text <span cid="13"> as an example </span>.</div>'
root = new XmlParser().parseText(xmlContent)
node = root.depthFirst().find { it.@cid == '12' }

Calling "node.text()" yields 'some text.' and calling "node.children()" yields ['some text', spanNode, '.'], which means it strips white space, so it was of no use to me.

So I tried XmlSlurper. In this case, node.text() yields ' some text as an example .'. Great for this example, but when applied to the node with cid 12 in the MiniTest example above, it would yield 'First paragraph.Second paragraph.Another paragraph.', dropping the white space between elements, so I couldn't use this either.

But after a lot of searching, I figured out that there was a class that converts a node back to XML including all the original white space, so it had to be possible. Then I tried to build the text myself.

"node.children()" returned [spanNodeChildInstance], ignoring the text nodes, so I was out of luck and had to dig into the source code. Finally, after some hours of digging, I found what I was looking for: "node[0].children()" returns [' some text ', spanNode, '.'].

It took a while to get this working, and I wasn't finished yet. I still had to navigate the XML tree to get the final processed text. Look at the MiniTest example again and you'll see that the cid list [1211, 1212, 122] has to be treated as equivalent to the node with cid 12.

So one of the features I needed was to look up the first ancestor node having a cid, so that I could check whether it was a candidate node. It turned out not to be that simple, since while traversing the parents I might not find any with a cid. So how could I check that I had reached the root node?

With XmlSlurper, when you call rootNode.parent() you get rootNode back. So I tried something like this:

parent = node.parent()
while (!parent.@cid && parent != parent.parent()) parent = parent.parent()

But the problem is that the comparison is done on the string value, so I have no real way to tell whether I've reached the root. My solution was to check for "node.name() != 'html'" instead. This is really bad API design. Maybe root.parent() could return null. Also, I should be able to compare nodes themselves instead of their text.
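For comparison, this is a rough sketch of the same upward walk in Ruby with REXML from the standard distribution, where the walk has a clear stopping point: the parent of the root element is the document, and the document's parent is nil. The method name `nearest_cid_ancestor` and the sample markup are made up for illustration; this is not the Groovy code from the project.

```ruby
require 'rexml/document'

# Walk up from +element+ and return the nearest ancestor that carries a
# cid attribute, or nil when the walk reaches the document itself.
def nearest_cid_ancestor(element)
  node = element.parent
  until node.nil? || node.is_a?(REXML::Document)
    return node if node.attributes['cid']
    node = node.parent
  end
  nil
end

doc = REXML::Document.new('<html><div cid="1"><p><b cid="3">x</b></p></div></html>')
bold = REXML::XPath.first(doc, '//b')
div = nearest_cid_ancestor(bold)

puts div.attributes['cid']              # 1
puts nearest_cid_ancestor(div).inspect  # nil
```

Returning nil at the top is the behaviour suggested above for root.parent(): the loop condition stays a plain check and never depends on the tag name.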

After several days, at the end of last Thursday, I got a "working" version of a similar JUnit test passing with an implementation in Groovy. But as I wasn't really using an HTML parser but an XML one, I couldn't process white space correctly for inline elements.

Nokogiri

Then, on Friday morning, I was curious about how I could parse HTML with Ruby, as I had never done it before. I got my first smile that morning when I read this in Aaron Patterson's documentation for Nokogiri:

XML is like violence - if it doesn’t solve your problems, you are not using enough of it.

The smile got even bigger when I tried this:

require 'nokogiri'
Nokogiri::HTML('<div>Some <span>Te<b>x</b>t</span>.').text == 'Some Text.' # true

The smile shrank a bit when I realized that I would get the same result if I replaced the inline "b" element with a block-level "div". But that was OK; it was already good enough.

Besides the "text" method being more useful than the one in XmlSlurper (new-lines are treated differently), navigating the XML tree is also much easier with Nokogiri. I still couldn't find a good way of finding out whether some node was the root, as calling "root.parent" would raise an exception. Fortunately, as Nokogiri supports XPath, I didn't need to do this manual traversal, so it wasn't an issue for my specific needs.

But there was a remaining issue. It performed very badly compared to the Groovy version, about 4 times slower. Looking at my CPU usage statistics, it was obvious that it wasn't using all my CPU power the way the Groovy version did. No matter how many threads I used with CRuby, no core would go over 20% of its available capacity.

JRuby to the rescue

It is a shame that Java actually has a better API than Ruby for dealing with a pool of threads. It is called the Executors framework. As I couldn't find something like this in the Ruby standard library, I tried a Ruby gem called Concur.

I didn't investigate whether the performance issues were caused by Concur's implementation or by CRuby itself, but I decided to give JRuby or Rubinius a try. As I already had JRuby available, I tried it first, and as the results were about the same as the Groovy version, I didn't bother to check Rubinius.

With JRuby I could use the Java Executors framework just like in Groovy, and I could see all my 6 cores above 90% the whole time my 10 threads were importing the 3,000+ documents. Unfortunately, my actual servers are much slower than my computer: the import took more than 4 hours on the staging server, versus about an hour and a half on my computer. The CRuby version would probably take more than 4 hours on my computer, which means it could take almost a full day on the staging and production servers.
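The fixed pool of workers described above can be approximated in plain Ruby with nothing but Thread and Queue. The following is only a sketch under a few assumptions: `process_in_pool` is a hypothetical helper (it is neither Concur nor the Executors API), jobs are independent of each other, and Queue#close requires Ruby 2.3 or newer. On CRuby the global interpreter lock still keeps CPU-bound Ruby code from running on several cores at once, so a pool like this mostly pays off on JRuby or for I/O-bound work.

```ruby
# Run the given block over all items using a fixed number of worker threads.
# Returns the results in the same order as the input items.
def process_in_pool(items, pool_size, &work)
  jobs = Queue.new
  items.each_with_index { |item, index| jobs << [index, item] }
  jobs.close # once the queue is drained, pop returns nil instead of blocking

  workers = Array.new(pool_size) do
    Thread.new do
      done = [] # each worker keeps its own results, so no locking is needed
      while (job = jobs.pop)
        index, item = job
        done << [index, work.call(item)]
      end
      done
    end
  end

  # Thread#value joins the thread and returns the block's result.
  workers.flat_map(&:value).sort_by(&:first).map(&:last)
end

lengths = process_in_pool(%w[a bb ccc dddd], 2) { |text| text.length }
puts lengths.inspect # [1, 2, 3, 4]
```

In the import scenario, the items would be the documents and the block would extract and store the references for one document.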

Conclusion

I should explain that I didn't try Ruby first because with Groovy I could take advantage of my models already being mapped by the Grails application, so I wouldn't have to deal with database set-up and could keep all my code in a single language. Of course, if I had known beforehand how painful coding this in Groovy would be, I would have done it in Ruby from the beginning. And the Ruby version turned out a bit better than my previous Groovy attempt with regard to some corner cases, including new-line processing.

I'm very grateful to Aaron "tenderlove" Patterson and Charles Nutter for their awesome work on Ruby, Nokogiri and JRuby. Thanks to them I could get my work done very fast and in an elegant way, saving my week after the frustration with Groovy.
