Online ISSN:1349-8606
Progress in Informatics  
No.6 March 2009  
Page 27-39
doi:10.2201/NiiPi.2009.6.4
Building web page collections efficiently exploiting local surrounding pages
Yuxin WANG1 and Keizo OYAMA2
1Information Technology Center, University of Tokyo
2National Institute of Informatics
2The Graduate School for Advanced Studies (SOKENDAI)
(Received: September 17, 2008)
(Revised: January 15, 2009)
(Accepted: January 18, 2009)
Abstract:
This paper describes a method that exploits local surrounding pages to build a high-quality web page collection at a reduced manual assessment cost. The effectiveness of the method is shown through experiments that use researchers' homepages as an example target category. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce the concept of a logical page group structure, which is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and we use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering potential researchers' homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure are used for merging the keywords from the surrounding pages. Although many noise pages are included when keywords in the surrounding pages are used without considering the page group structure, the experimental results show that our method keeps the increase in noise pages to an allowable level and gathers a significant number of positive pages that a single-page-based method could not gather. For the second process, we propose composing a three-grade classifier from two base classifiers, one precision-assured and one recall-assured. It classifies the input pages into assured positive, assured negative, and uncertain pages, where the uncertain pages need manual assessment, so that the collection quality required by an application can be assured. Each base classifier is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages, and the EC classifies the entry pages using information from both the entry page and the likely component pages. Experiments show an evident improvement in the performance of the base classifiers when the SC is introduced. The reduction in the number of uncertain pages is then evaluated, and the effectiveness of the proposed method is shown.
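The three-grade composition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword list, the thresholds, and the function names (keyword_score, precision_assured, recall_assured, three_grade) are hypothetical stand-ins chosen only to show how a precision-assured and a recall-assured base classifier might be combined into three output grades.

```python
from typing import Callable

# Hypothetical toy keyword list; the paper's property-based lists are not reproduced here.
KEYWORDS = ("professor", "publications", "research", "laboratory")


def keyword_score(text: str) -> int:
    """Count how many toy keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in KEYWORDS if kw in lowered)


def precision_assured(text: str) -> bool:
    """Assumed strict base classifier: accepts only strongly matching pages."""
    return keyword_score(text) >= 3


def recall_assured(text: str) -> bool:
    """Assumed lenient base classifier: rejects only pages with no match at all."""
    return keyword_score(text) >= 1


def three_grade(
    text: str,
    strict: Callable[[str], bool] = precision_assured,
    lenient: Callable[[str], bool] = recall_assured,
) -> str:
    """Combine the two base classifiers into three grades."""
    if strict(text):
        return "assured positive"   # accepted by the precision-assured classifier
    if not lenient(text):
        return "assured negative"   # rejected even by the recall-assured classifier
    return "uncertain"              # left for manual assessment


if __name__ == "__main__":
    for page in (
        "Professor of informatics. Research interests and publications.",
        "Our research group news",
        "Online shop for garden tools",
    ):
        print(three_grade(page), "<-", page)
```

In this sketch, only pages that fall between the two base classifiers are routed to manual assessment, which is the source of the cost reduction the abstract refers to. Tightening or loosening the two assumed thresholds trades the size of the uncertain set against the assured precision and recall.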
Keywords:
Web page collections, page group model, logical page group structure, three-grade classifier, quality assurance, precision and recall
