Parsing HTML and XML Documents
I want to parse an HTML page (http://xmlfeed.jobcentral.com/) and extract the hyperlinks on it.

I want to store all the hyperlinks found on the page in a MySQL database.

Once this is done, each link points to a different XML file. I also want the data in those XML files to be stored in the database.

Can we parse the XML files there directly, or should we parse the hyperlinks first and then go on to the XML parsing?

Which method is more efficient?

Can we directly parse the XML files present on the website?

Hpricot rules and should work perfectly for the parsing: http://code.whytheluckystiff.net/hpricot/
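For the link-gathering step, here is a minimal sketch of the idea. It deliberately does not use Hpricot: to stay runnable without any gems it falls back on a regex scan plus URI.join from the standard library, which is a much cruder stand-in for a real HTML parser. The helper name extract_links and the sample markup are made up for illustration and are not taken from the actual page.

```ruby
require 'uri'

# Hypothetical helper: returns the absolute URL of every <a href="..."> in html.
# The regex is only a rough stand-in for a real HTML parser such as Hpricot.
def extract_links(html, base_url)
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten
  # Resolve relative links against the page URL and drop repeats.
  hrefs.map { |href| URI.join(base_url, href).to_s }.uniq
end

# Made-up sample markup (assumption), not the real page content.
SAMPLE_HTML = <<~HTML
  <html><body>
    <a href="feeds/jobs1.xml">Feed 1</a>
    <a class="x" href='/feeds/jobs2.xml'>Feed 2</a>
    <a href="feeds/jobs1.xml">Feed 1 again</a>
  </body></html>
HTML

puts extract_links(SAMPLE_HTML, 'http://xmlfeed.jobcentral.com/')
```

Each URL returned this way could then be saved as one row in the MySQL table of links.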

As for the other part, are you going to grab all their listings and store them in your database? If it's specific to just this site and that XML, I would just make a Job model and code up a rake task that parses the XML and stores it in the Job model. They even have a guid attribute, so you can easily avoid duplicate jobs.
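The guid-based duplicate check might look roughly like the sketch below. It rests on several assumptions: the feed is taken to be RSS-like, with item elements that carry guid and title as child elements (the real layout may differ), the sample XML is invented, and a plain Hash stands in for the Job model and its table so the snippet runs outside Rails. REXML is used only because it ships with Ruby; import_jobs is a hypothetical name.

```ruby
require 'rexml/document'

# Made-up feed (assumption); the real element names may differ.
SAMPLE_FEED = <<~XML
  <rss><channel>
    <item><guid>job-1</guid><title>Ruby Developer</title></item>
    <item><guid>job-2</guid><title>DBA</title></item>
    <item><guid>job-1</guid><title>Ruby Developer</title></item>
  </channel></rss>
XML

# Hypothetical importer: store is a Hash keyed by guid, standing in for
# the jobs table. Returns the number of jobs that were newly added.
def import_jobs(xml, store)
  doc = REXML::Document.new(xml)
  added = 0
  doc.elements.each('//item') do |item|
    guid = item.elements['guid']&.text
    # Skip items without a guid and items that are already stored.
    next if guid.nil? || store.key?(guid)
    store[guid] = { title: item.elements['title']&.text }
    added += 1
  end
  added
end

store = {}
puts import_jobs(SAMPLE_FEED, store) # first run over the feed
puts import_jobs(SAMPLE_FEED, store) # rerun: every guid is already known
```

In a Rails app the Hash would presumably give way to a lookup on the Job model by guid, and the method body would live inside the rake task.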

Hi Piyush, there are many ways to parse HTML and XML data. I have worked a lot in this field and can definitely tell you that the following are the best options:

1. Use Rubyful Soup (it's a gem). You can get more info at
2. Use Hpricot and Mechanize.
3. For feeds, use feed_tools to read the feeds and rubyful_soup to parse the data, or Hpricot.

In case of any difficulty, mail me at saurabh[at]railsworkways[dot]com

