Abstract
Semantic annotation of Web objects is a key problem in Web information extraction. The Web contains an abundance of useful semi-structured information about real-world objects, and empirical study shows that Web information about objects of the same type exhibits strong two-dimensional sequence characteristics and correlative characteristics across different Web sites. Conditional Random Fields (CRFs) are the state-of-the-art approach for exploiting sequence characteristics to improve labeling. However, because of the correlative characteristics between Web object elements, previous CRFs are limited for semantic annotation of Web objects and cannot handle the long-distance dependencies between Web object elements efficiently. To better incorporate long-distance dependencies, this paper first describes them by correlative edges, which are built from structured information and the characteristics of records in external databases. It then presents two-dimensional Correlative-Chain Conditional Random Fields (2DCC-CRFs) for semantic annotation of Web objects. This approach extends a classic model, two-dimensional Conditional Random Fields (2DCRFs), by adding correlative edges. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach can significantly improve the semantic annotation accuracy of Web objects.
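The graph structure described in the abstract can be illustrated with a minimal sketch: a 2D lattice of neighbor edges, as in 2DCRFs, plus extra edges between distant elements. The sketch covers only the graph construction, not training or inference. The function names, the sample grid, and the matching rule (two cells are linked when both match the same attribute in an external reference table) are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def lattice_edges(n_rows, n_cols):
    """Edges between horizontally and vertically adjacent cells of a 2D grid."""
    edges = set()
    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols:
                edges.add(((r, c), (r, c + 1)))  # horizontal neighbor
            if r + 1 < n_rows:
                edges.add(((r, c), (r + 1, c)))  # vertical neighbor
    return edges


def correlative_edges(grid, reference_values):
    """Link non-adjacent cells whose text matches the same reference attribute.

    reference_values is a hypothetical stand-in for an external database:
    it maps an attribute name to the set of lowercase values known for it.
    """
    cells_by_attr = {}
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            for attr, values in reference_values.items():
                if text.lower() in values:
                    cells_by_attr.setdefault(attr, []).append((r, c))

    edges = set()
    for cells in cells_by_attr.values():
        for a, b in combinations(cells, 2):
            # Adjacent cells are already covered by the lattice edges.
            is_adjacent = abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
            if not is_adjacent:
                edges.add((a, b))
    return edges


# Illustrative data only: two records whose elements appear in different orders.
grid = [
    ["Canon", "PowerShot A570", "$199"],
    ["$149", "Nikon", "Coolpix L11"],
]
reference_values = {"brand": {"canon", "nikon"}}

neighbor = lattice_edges(len(grid), len(grid[0]))
correlative = correlative_edges(grid, reference_values)
```

Here the two brand cells sit in different columns, so no lattice edge connects them; the correlative edge is what lets a model relate their labels directly.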
Additional information
Supported by the National Natural Science Foundation of China under Grant No. 90818001 and the Natural Science Foundation of Shandong Province of China under Grant No. Y2007G24.
Cite this article
Ding, YH., Li, QZ., Dong, YQ. et al. 2D Correlative-Chain Conditional Random Fields for Semantic Annotation of Web Objects. J. Comput. Sci. Technol. 25, 761–770 (2010). https://doi.org/10.1007/s11390-010-9363-8
DOI: https://doi.org/10.1007/s11390-010-9363-8