I get asked this question often, so I decided to write a blog entry about it.
Sometimes, programmers want to access the data within an HTML page and use it programmatically. A typical example is extracting currency values from financial pages. See the examples below:
Technically, it is possible to extract data from this webpage. At the end of the day, the page is just one long text string. You can display the source of this sample webpage using the "view source" feature of your web browser and see it for yourself.
It is possible to write a program that scans this text for particular patterns and extracts the currency data. This technique is often called “spidering” (you may also see it called screen scraping).
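To make the idea concrete, here is a minimal sketch of such a pattern scan in Python. The HTML snippet, the class names, and the rate values are all invented for illustration; a real page would need its own pattern.

```python
import re

# Hypothetical page source: the markup and the rate values are invented.
SAMPLE_HTML = """
<html><body>
<table>
<tr><td class="currency">USD</td><td class="rate">1.3050</td></tr>
<tr><td class="currency">EUR</td><td class="rate">1.7420</td></tr>
</table>
</body></html>
"""

# The pattern is tied to the exact markup above (tag names, class names, order).
RATE_PATTERN = re.compile(
    r'<td class="currency">([A-Z]{3})</td><td class="rate">([0-9.]+)</td>'
)

def extract_rates(html):
    """Return {currency code: rate} for every table row matching RATE_PATTERN."""
    return {code: float(rate) for code, rate in RATE_PATTERN.findall(html)}

print(extract_rates(SAMPLE_HTML))
```

Note how much of the page's layout is baked into the pattern: the program only works as long as the markup stays exactly like this.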
However, using a spider program in a production environment comes with a great risk. HTML pages are intended to be viewed by humans. If the visual design of the HTML page changes, the patterns in the underlying source text change too. Your spider program can then no longer find its predefined patterns, and therefore it can't extract the data. In such a scenario, you would have to re-code your spider program for the new format.
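The failure mode is easy to reproduce. In the sketch below, both HTML snippets are invented: they carry the same rate, but the second one imitates a redesign that swaps the table for `div` and `span` elements. The pattern, written against the old markup, silently finds nothing.

```python
import re

# Hypothetical pattern written against the page's original table-based markup.
RATE_PATTERN = re.compile(r'<td class="rate">([0-9.]+)</td>')

# Invented snippets: the same data before and after a visual redesign.
OLD_HTML = '<tr><td class="currency">USD</td><td class="rate">1.3050</td></tr>'
NEW_HTML = '<div class="row"><span>USD</span><span class="value">1.3050</span></div>'

def find_rates(html):
    """Return every rate the pattern can find in the given page source."""
    return [float(match) for match in RATE_PATTERN.findall(html)]

print(find_rates(OLD_HTML))  # the rate is found in the old layout
print(find_rates(NEW_HTML))  # empty list: the data is there, the pattern is not
```

Nothing crashes here, which is part of the danger: unless the program checks for an empty result, it just quietly stops delivering data.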
Solution? Don’t rely on spidering programs in a production environment unless you are 100% sure that the visual design of the target HTML page will not change.
The right way to implement such a solution is to find a Web service which provides the data in XML format. Example:
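This is the same sample dataset in XML format, which is intended to be read by programs. Parsing this XML (instead of spidering the HTML) is much safer, because the XML structure describes the data itself rather than how it looks on screen, so a visual redesign of the website doesn't affect it. This approach can be used in a production environment.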
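As a rough sketch of the XML approach, the Python snippet below parses a small currency document with the standard library's `xml.etree.ElementTree`. The element and attribute names (`rates`, `rate`, `currency`) and the values are assumptions made up for this example; a real web service defines its own format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload: element names, attributes, and values are invented.
SAMPLE_XML = """<rates>
  <rate currency="USD">1.3050</rate>
  <rate currency="EUR">1.7420</rate>
</rates>"""

def parse_rates(xml_text):
    """Return {currency code: rate} from a <rates> document."""
    root = ET.fromstring(xml_text)
    # Look elements up by name, not by their position in a visual layout.
    return {el.get("currency"): float(el.text) for el in root.iter("rate")}

print(parse_rates(SAMPLE_XML))
```

In a real program, the XML text would come from the web service (for example via `urllib.request.urlopen`) instead of a hard-coded string. The parsing code stays the same as long as the service keeps its XML format.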