Web Scraping

From P2P Foundation

Web scraping, together with open APIs, allows sites to reuse the information stored in other web pages. It turns unstructured information into structured information and opens up the information stored in websites to new uses.

Read the explanation at http://www.readwriteweb.com/archives/web_30_when_web_sites_become_web_services.php


Definition

"Web Scraping is essentially reverse engineering of HTML pages. It can also be thought of as parsing out chunks of information from a page. Web pages are coded in HTML, which uses a tree-like structure to represent the information. The actual data is mingled with layout and rendering information and is not readily available to a computer. Scrapers are the programs that "know" how to get the data back from a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is." (http://www.readwriteweb.com/archives/web_30_when_web_sites_become_web_services.php)
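As a rough illustration of the definition above, here is a minimal sketch of a scraper written with Python's standard-library html.parser module. The sample page, its class names ("item", "name", "price") and the names ItemScraper and scrape_items are all hypothetical, invented for this example. The point is only to show a program that "knows" one particular markup and pulls the data out of the surrounding layout; a real scraper would have to be written against the actual structure of its target page.

```python
from html.parser import HTMLParser

# Hypothetical page: this markup and its class names are invented for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="nav">Home | About</div>
  <ul>
    <li class="item"><span class="name">Widget</span> <span class="price">9.99</span></li>
    <li class="item"><span class="name">Gadget</span> <span class="price">24.50</span></li>
  </ul>
</body></html>
"""


class ItemScraper(HTMLParser):
    """Collects name/price pairs from <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per scraped item
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "item":
            # A new item starts: open a fresh record.
            self.items.append({})
        elif tag == "span" and cls in ("name", "price") and self.items:
            # The text that follows belongs to this field.
            self._field = cls
            self.items[-1][cls] = ""

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            # Field finished: trim whitespace and stop collecting text.
            record = self.items[-1]
            record[self._field] = record[self._field].strip()
            self._field = None

    def handle_data(self, data):
        # Keep text only while inside a field of interest;
        # layout text such as the navigation bar is ignored.
        if self._field is not None:
            self.items[-1][self._field] += data


def scrape_items(html):
    """Return the structured records found in the given HTML string."""
    parser = ItemScraper()
    parser.feed(html)
    parser.close()
    return parser.items


records = scrape_items(SAMPLE_HTML)
```

Here records holds one dictionary per list item, with the navigation text left out. Because the scraper depends on the details of the markup, a change to the page's tags or class names would break it, which is the usual trade-off of scraping compared with using an API.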


Examples

Yahoo Pipes: focuses on remixing RSS feeds

Teqlo: focuses on letting people create mashups and widgets from web services and RSS

Dapper: a generic scraping service for any web site