Extracting Data From Web Pages

I get asked this question frequently, so I decided to write a blog entry about it.

Sometimes, programmers want to access the data within an HTML page and use it programmatically. A typical example is extracting currency rates from financial pages. See the example below:

http://www.tcmb.gov.tr/kurlar/200909/09092009.html

Technically, it is possible to extract data from this webpage. After all, the page is just one long text string. You can display the source of this sample webpage using the "view source" feature of your web browser and see it for yourself.

It is possible to write a program that scans this text for particular patterns and extracts the currency data. This technique is often called “spidering” (also known as screen scraping).
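To make the idea concrete, here is a minimal sketch in Python. The HTML fragment, the `ROW_PATTERN` expression and the `extract_rates` function are all made up for illustration; I am assuming a simple table layout, and the real page's markup is almost certainly different.

```python
import re

# Hypothetical HTML fragment (an assumption, not the real page's markup).
SAMPLE_HTML = """
<table>
  <tr><td>USD</td><td>1.4866</td></tr>
  <tr><td>EUR</td><td>2.1634</td></tr>
</table>
"""

# The pattern is tied to the exact markup above:
# a cell with a 3-letter currency code, followed by a cell with the rate.
ROW_PATTERN = re.compile(r"<tr><td>([A-Z]{3})</td><td>([0-9.]+)</td></tr>")


def extract_rates(html):
    """Return {currency code: rate} for every row that matches ROW_PATTERN."""
    return {code: float(rate) for code, rate in ROW_PATTERN.findall(html)}


rates = extract_rates(SAMPLE_HTML)
print(rates)  # {'USD': 1.4866, 'EUR': 2.1634}
```

Notice how much the pattern knows about the layout: the tag names, their order, and even the absence of attributes.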

However, using a spider program in a production environment carries a great risk. HTML pages are intended to be viewed by humans. If the visual design of the HTML page changes, the patterns in the underlying source text change too. Your spider program can then no longer find its predefined patterns, and therefore it can't extract the data. In such a scenario, you would have to re-code your spider program for the new format.
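The following Python sketch illustrates this failure mode. Both HTML snippets are hypothetical: `OLD_HTML` stands for the layout the spider was written against, and `NEW_HTML` for the same data after an imagined redesign that adds a CSS class and a `<span>`. The best the spider can do is notice that nothing matched and fail loudly.

```python
import re

# Pattern written against the old (hypothetical) layout.
ROW_PATTERN = re.compile(r"<td>([A-Z]{3})</td><td>([0-9.]+)</td>")

OLD_HTML = "<tr><td>USD</td><td>1.4866</td></tr>"
# Same data after an imagined redesign: a class attribute and a <span> were added.
NEW_HTML = '<tr><td class="code">USD</td><td><span>1.4866</span></td></tr>'


def extract_rates(html):
    """Return {currency code: rate}; raise if the pattern matches nothing."""
    rates = {code: float(rate) for code, rate in ROW_PATTERN.findall(html)}
    if not rates:
        # An empty result most likely means the page layout has changed.
        raise ValueError("no rates found; the page layout may have changed")
    return rates


old_rates = extract_rates(OLD_HTML)  # works against the old layout

try:
    extract_rates(NEW_HTML)
    redesign_broke_spider = False
except ValueError:
    redesign_broke_spider = True  # the same data is there, but the pattern is gone
```

The data itself did not change at all, yet the spider is broken until someone rewrites the pattern.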

Solution? Don’t rely on spidering programs in a production environment unless you are 100% sure that the visual design of the target HTML page will not change.

The right way to get such data is to find a web service that provides it in XML format. Example:

http://www.tcmb.gov.tr/kurlar/200909/29092009.xml

This is the same kind of dataset in XML format, intended to be read by programs. Parsing this XML (instead of spidering the HTML) is much safer, because the structure describes the data rather than its visual layout, and it can be used in a production environment.
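As a rough sketch, parsing such a feed with Python's standard `xml.etree.ElementTree` module could look like this. The XML snippet and its element and attribute names (`rates`, `currency`, `code`, `buying`) are assumptions for illustration only; check the actual feed for the real names.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element and attribute names are assumptions,
# not the actual schema of the feed.
SAMPLE_XML = """<rates date="2009-09-29">
  <currency code="USD"><buying>1.4766</buying></currency>
  <currency code="EUR"><buying>2.1634</buying></currency>
</rates>
"""


def parse_rates(xml_text):
    """Return {currency code: buying rate} from the XML text."""
    root = ET.fromstring(xml_text)
    rates = {}
    for currency in root.findall("currency"):
        # Values are looked up by name, not by their position in the text.
        rates[currency.get("code")] = float(currency.findtext("buying"))
    return rates


xml_rates = parse_rates(SAMPLE_XML)
print(xml_rates)  # {'USD': 1.4766, 'EUR': 2.1634}
```

In a real program, the XML text would be downloaded from the service URL first (for example with `urllib.request.urlopen`) and then passed to `parse_rates`. Because the values are addressed by element name, changes in whitespace or in the order of unrelated elements do not break the code.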
