Friday, November 10, 2006

Scraping Japanese web pages with Ruby and Mechanize

With WWW::Mechanize it is possible to scrape all kinds of web pages, no matter if they require sessions, complicated forms, or have difficult/bad HTML syntax. So why the emphasis on "Japanese" in the title, you ask? Do we need a Japanese version of Mechanize to scrape Japanese characters? Well, the answer to the second question is NO, but we still need to be careful when scraping Japanese HTML pages.

If you have to scrape several Japanese HTML pages from different places, you will find that three encodings are commonly used for Japanese web text, namely EUC-JP, Shift-JIS and UTF-8. At first I was using the content-type of the HTML header to determine the page encoding and then converting the page to Shift-JIS, which is the encoding used by the MSSQL database where I store the scraped data.
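As a sketch, that first approach boils down to pulling the charset parameter out of the Content-Type header. The helper name and the regular expression here are my own illustration, not the exact code I used back then:

```ruby
# Hypothetical helper: extract the charset parameter from a
# Content-Type header value, or return nil if it is missing.
def charset_from_content_type(content_type)
  # "text/html; charset=EUC-JP" yields "EUC-JP"
  content_type[/charset=([\w-]+)/i, 1] if content_type
end

puts charset_from_content_type("text/html; charset=EUC-JP")
puts charset_from_content_type("text/html").inspect
```

As the next paragraph shows, the weakness of this approach is that it trusts the header to be present and correct.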

Unfortunately, the more pages I scraped, the clearer it became that the validity of the content-type field depends on how serious the page's developer is. Some pages were missing the charset parameter in the content-type, and others even had it wrong (charset=sjis in a UTF-8 encoded page). Suddenly I was testing every page to find out its encoding before scraping it, and had three Ruby scripts, one for each encoding.

Later I learned that Ruby can actually distinguish between the three encodings automatically by inspecting the text itself (no wonder, since Ruby was developed in Japan). This capability, combined with the flexibility of Mechanize, allowed me to write a single script to extract any Japanese page without worrying about encodings : ).

The magic is done by Kconv (Ruby Kanji Converter), which has methods to convert any string to a specific encoding. Kconv guesses the current encoding of the string and then takes the necessary steps to convert it to the desired one.
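A minimal sketch of Kconv on its own, before wiring it into Mechanize (the sample string is my own; Kconv.guess reports its best guess of the source encoding):

```ruby
require 'kconv'

utf8_text = "日本語"                 # a UTF-8 encoded sample string
sjis_text = Kconv.tosjis(utf8_text)  # guess the source encoding, emit Shift-JIS
euc_text  = Kconv.toeuc(utf8_text)   # same, but emit EUC-JP

# Kconv can also report which encoding it thinks a string is in
puts Kconv.guess(sjis_text)
puts Kconv.guess(euc_text)
```

Note that the tosjis/toeuc/toutf8 methods never ask what the input encoding is; the guessing is what lets one script handle all three encodings.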

To integrate Kconv with Mechanize we take advantage of the pluggable parsers:


require 'rubygems'
require 'mechanize'
require 'kconv'

# Create a pluggable parser
class SjisParser < WWW::Mechanize::Page
  def initialize(uri = nil, response = nil, body = nil, code = nil)
    body = Kconv.tosjis(body) # Magic Line
    super(uri, response, body, code)
  end
end

# Create the WWW::Mechanize object
agent = WWW::Mechanize.new

# Register our parser
agent.pluggable_parser.html = SjisParser

# Load the UTF or EUCJP encoded page
page = agent.get("http://utf-or-eucjp/encoded/page.html")

# Print the result
puts page.body


There is nothing new in this code with respect to Mechanize, but if you do not understand what it does, you should read the GUIDE file that comes with Mechanize. The important line here is the one marked Magic Line. This simple line of code uses Kconv to convert whatever it receives (the HTML page) to Shift-JIS encoding. The encoded HTML page is then used to create a Page object, as Mechanize would normally do. No matter if the page is in UTF-8, EUC-JP or Shift-JIS encoding, the result will be a Page object with the body in Shift-JIS encoding. Pretty cool, isn't it?

The code above works well for Mechanize 0.5.1 and 0.6.0, which support pluggable parsers, but Mechanize 0.4.7 has no pluggable parsers. In Mechanize 0.4.7 we have a less flexible "body_filter" that can be used to encode the HTML pages before they are parsed by Mechanize.


require 'rubygems'
require_gem 'mechanize', '=0.4.7' # Make sure we are using version 0.4.7
require 'mechanize'
require 'kconv'

# Create the WWW::Mechanize object
agent = WWW::Mechanize.new

page = agent.get("http://utf-or-eucjp/encoded/page.html")
page.body_filter = lambda { |body| Kconv.tosjis(body) } # Magic Line

# Print the links text
page.links.each { |link| puts link.text }


Again the magic is marked by the Magic Line. From the Mechanize documentation:

The body filter sends the page body to the code block, and parses what the code block returns. The filter on WWW::Page#body_filter is a "per-page" filter, meaning that it is only applied to one page object.

In short, we convert the HTML to Shift-JIS before it is parsed by Mechanize.

Warning!! There is a small difference in how body_filter works compared to the pluggable parser presented above. The body_filter will encode the HTML text to Shift-JIS and pass the result to the parser for further processing. The body_filter WILL NOT modify the HTML text itself! If you print the HTML text using "puts page.body" you will get it in the original encoding. Only the parsed HTML will be in Shift-JIS, and that is why I only print the links' text in the example above.

In the case of the pluggable parser we convert the HTML text and pass the encoded version to the Page class's initialize method. This way the Page object is created with the Shift-JIS encoded version from the start.

If you need to access the encoded HTML text, it is possible to exploit the openness of Ruby classes and modify the Page class to do it. For this we simply redefine the "body" method of the Page class as follows:


require 'rubygems'
require_gem 'mechanize', '=0.4.7' # Make sure we are using version 0.4.7
require 'mechanize'
require 'kconv'

# Redefine the body method of the Page class
module WWW
  class Page
    def body
      Kconv.tosjis(@body)
    end
  end
end

# Create the WWW::Mechanize object
agent = WWW::Mechanize.new

page = agent.get("http://utf-or-eucjp/encoded/page.html")
page.body_filter = lambda { |body| Kconv.tosjis(body) } # Magic Line

# Print the Shift-JIS encoded body
puts page.body


There is no limit to the flexibility that Ruby offers us. With the redefinition of the Page class in the above code we are modifying the way Mechanize works without needing to change anything in the original Mechanize source. Now we can use the "body" method to obtain the Shift-JIS encoded version of the HTML text. Moreover, we could go rampant and add methods to get the body in different encodings:


require 'rubygems'
require_gem 'mechanize', '=0.4.7' # Make sure we are using version 0.4.7
require 'mechanize'
require 'kconv'

# Redefine the Page class
module WWW
  class Page
    # Return the body Shift-JIS encoded
    def sjis_body
      Kconv.tosjis(@body)
    end

    # Return the body EUC-JP encoded
    def euc_body
      Kconv.toeuc(@body)
    end

    # Return the body UTF-8 encoded
    def utf8_body
      Kconv.toutf8(@body)
    end
  end
end



We see here that Ruby and everything Ruby-related is powerful and flexible without being too complicated. I am still learning, and the more I learn the more I stick with it. If only there were MIDP/CLDC implementations in Ruby... Have fun scraping Japanese pages!
