HTML Scraping with Java
Tech-Today



One rainy Sunday afternoon, since I couldn't go out anywhere, I decided to create an organized Excel file (for now) listing the birds commonly found in the Philippines. I found a good site to start with, http://www.birding2asia.com/tours/reports/PhilFeb2010_list.html, but I didn't want to copy each detail into an Excel document by hand because that would take time. So I searched the internet for HTML scraping tools. I've used Html Agility Pack for .NET before, and I'd use it again if I were working with .NET, but I wanted to do it in Java today.

Here's a list of the most used HTML scrapers for different programming languages: http://stackoverflow.com/questions/2861/options-for-html-scraping

I chose jsoup for Java since it's the simplest to implement and has minimal dependencies compared to HtmlUnit and the rest.
Here's my implementation, which reads a saved copy of the page (input.html) and writes one comma-separated line per bird to out.txt:
package org.ipiel.ipielHtmlParser;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * @author Edward P. Legaspi
 * @since Jul 29, 2012
 **/
public class JsoupParserImpl {
    public static void main(String[] args) {
        try {
            new JsoupParserImpl();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public JsoupParserImpl() throws IOException {
        // input.html is a saved copy of the bird list page
        File input = new File("input.html");
        Document doc = Jsoup.parse(input, "UTF-8");

        // The loop below assumes three MsoNormal paragraphs per bird:
        // names, location, then a spacer
        Elements paragraphs = doc.select("p[class=MsoNormal]");
        Iterator<Element> ite = paragraphs.iterator();

        // try-with-resources closes the writer even if parsing fails midway
        try (PrintWriter pw = new PrintWriter(new FileOutputStream("out.txt"))) {
            while (ite.hasNext()) {
                Element bird = ite.next(); // common name + scientific name
                Element birdName = bird.select("span[class=comname]").first();
                Element sciName = bird.select("span[class=sciname]").first();
                // first() returns null when the bird has no endemic marker
                Element endemic = bird.select("span[class=endemic]").first();

                if (!ite.hasNext()) {
                    break; // incomplete trailing record
                }
                Element location = ite.next(); // where found

                String out = birdName.text().trim() + "," + sciName.text() + ","
                        + ((endemic != null) ? endemic.text() : "") + "," + location.text();

                System.out.println(out);
                pw.write(out);
                pw.write("\n");

                if (ite.hasNext()) {
                    ite.next(); // spacer
                }
            }
        }
    }
}
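
One caveat with the comma-separated output: if any field (most likely the location text) contains a comma itself, the columns will shift when the file is opened in Excel. Here's a minimal sketch of one way to guard against that, using the usual CSV convention of wrapping such fields in double quotes and doubling any embedded quotes. The class name CsvLine, its quote and join helpers, and the sample values are all made up for illustration; they are not part of jsoup or of the implementation above.

```java
// Hypothetical helper for illustration; uses only the Java standard library.
class CsvLine {

    /** Wraps a field in double quotes if it contains a comma, a quote, or a line break. */
    public static String quote(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are escaped by doubling them
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the fields into one CSV line, quoting each field only where needed. */
    public static String join(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from the bird list page
        System.out.println(join("Sample Bird", "Avis exemplum", "", "Luzon, Mindanao"));
        // prints: Sample Bird,Avis exemplum,,"Luzon, Mindanao"
    }
}
```

With a helper like this, the line that builds the output string could call CsvLine.join(...) with the four values instead of concatenating them with "," by hand.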



