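Zhang Chengzhi http://blog.sciencenet.cn/u/timy Unmoved by favor or disgrace, idly watching the flowers bloom and fall in the courtyard; indifferent to staying or leaving, casually following the clouds as they gather and scatter across the sky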


Automatic Extraction of Semantically Implicit Subject Terms Based on Citation-KNN

8709 views 2008-8-7 19:51 | Category: Text Mining | Automatic indexing, subject analysis, keyword extraction, implicit semantic subject terms, Citation-KNN

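Zhang Chengzhi1,2  Liu Yao1  Wang Huilin1

1. Institute of Scientific & Technical Information of China, Beijing 100038
2. Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094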

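  Existing keyword extraction techniques only extract words that appear in the body text and cannot extract subjects that are semantically implicit. Extracting such implicit subjects is one of the hard problems in automatic keyword extraction. KNN, a classic machine learning method, performs well in many fields. Building on the KNN algorithm, this paper proposes a Citation-KNN-based method for automatically extracting semantically implicit subject terms. Experimental results show that the method is effective for this task.

Keywords: keyword extraction; implicit semantic subject terms; Citation-KNN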

Automatic Implicit Semantic Subject Extraction Based on Citation-KNN

 Zhang Cheng-Zhi1, 2, Liu Yao1, Wang Huilin1

1. Institute of Scientific & Technical Information of China, Beijing 100038, China

2. Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094, China

Abstract: Current keyword extraction methods can only extract words that appear in an article; they cannot extract the implicit semantic subject (ISS). Extracting the implicit subject of an article is a difficult part of automatic keyword extraction. KNN is a classic machine learning method that is widely used in many fields. In this paper, we propose an automatic ISS extraction method based on Citation-KNN, a variant of the KNN method. Experimental results show that the proposed method not only improves the precision and recall of keyword extraction but also extracts implicit subjects effectively.

Keywords: Automatic Keyword Extraction; Implicit Semantic Subject; Citation-KNN

   

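Note: Citation-KNN was first proposed by Jun Wang and Jean-Daniel Zucker to solve the multiple-instance learning problem (Wang & Zucker 2000). Citation-KNN is an improvement on the traditional KNN algorithm; its main idea borrows the notion of citing and being cited from bibliometrics. As shown in Figure 1, when deciding the class of a test sample x'i, the method considers not only the classes of its K nearest training samples (the test sample's "references") but also the classes of those training samples that count x'i among their own K nearest neighbors (the test sample's "citers").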
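The citing/cited idea in the note above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: samples are plain numeric vectors compared by Euclidean distance (Wang & Zucker's original formulation operates on bags of instances), the same K is used for references and citers, all votes carry equal weight, and the names `citation_knn_predict`, `train_x`, `train_y` and `query` are hypothetical.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def citation_knn_predict(train_x, train_y, query, k=2):
    """Classify `query` by voting over its references and its citers.

    Assumption: one vote per reference and per citer, majority wins.
    Returns (label, reference indices, citer indices).
    """
    n = len(train_x)
    d_query = [euclidean(x, query) for x in train_x]

    # References: the k training samples nearest to the query.
    references = sorted(range(n), key=lambda i: d_query[i])[:k]

    # Citers: training samples that would place the query among their own
    # k nearest neighbours (candidates = other training samples + query).
    citers = []
    for i in range(n):
        closer = sum(
            1
            for j in range(n)
            if j != i and euclidean(train_x[i], train_x[j]) < d_query[i]
        )
        if closer < k:
            citers.append(i)

    votes = Counter(train_y[i] for i in references + citers)
    return votes.most_common(1)[0][0], references, citers


# Toy one-dimensional data (hypothetical, for illustration only).
train_x = [(0.0,), (1.0,), (5.0,), (6.0,), (7.0,)]
train_y = ["a", "a", "b", "b", "b"]
label, references, citers = citation_knn_predict(train_x, train_y, (2.0,), k=2)
print(label, references, citers)  # a [1, 0] [0, 1]
```

In this toy run the two nearest training samples serve as references, and the same two samples also act as citers because the query falls inside their own 2-nearest neighborhoods; the three samples on the right do not cite the query, so they contribute no votes.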

  

    

References:

Anjewierden A, Kabel S. 2001. Automatic Indexing of Documents with Ontologies. In: Proceedings of the 13th Belgian/Dutch Conference on Artificial Intelligence (BNAIC-01), Amsterdam, Netherlands. 23~30.

Baeza-Yates R, Ribeiro-Neto B. 1999. Modern Information Retrieval. New York: Association for Computing Machinery (ACM) Press, 27~30.

Chien LF. 1997. PAT-tree-based Keyword Extraction for Chinese Information Retrieval. In: Proceedings of the ACM SIGIR International Conference on Information Retrieval, Philadelphia, USA: ACM Press, 50~59.

Cover TM, Hart PE. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13: 21~27.

Edmundson H P, Oswald V A. 1959. Automatic Indexing and Abstracting of the Contents of Documents. Planning Research Corp, Document PRC R-126, ASTIA AD No. 231606, Los Angeles. 1~142.

Edmundson H P. 1969. New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 16(2): 264~285.

Ercan G, Cicekli I. 2007. Using Lexical Chains for Keyword Extraction. Information Processing and Management, 43(6): 1705~1714.

Frank E, Paynter GW, Witten IH, et al. 1999. Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), California: Morgan Kaufmann, 668~673.

Hulth A. 2003. Improved Automatic Keyword Extraction Given More Linguistic Knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan, 216~223.

Earl L L. 1970. Experiments in Automatic Extracting and Indexing. Information Storage and Retrieval, 6: 313~334.

Luhn H P. 1957. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1(4): 309~317.

Luhn H P. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2): 159~165.

Salton G, Wong A, Yang C S. 1975. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11): 613~620.

Tan P, Steinbach M, Kumar V. 2006. Introduction to Data Mining. Boston: Addison-Wesley, 225.

Tomokiyo T, Hurst M. 2003. A Language Model Approach to Keyphrase Extraction. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition & Treatment, Sapporo, Japan, 33~40.

Turney P D. 1999. Learning to Extract Keyphrases from Text. NRC Technical Report ERB-1057, National Research Council, Canada. 1~43.

Turney PD. 1997. Extraction of Keyphrases from Text: Evaluation of Four Algorithms. Technical Report ERB-1051, National Research Council, Institute for Information Technology.

Turney PD. 2000. Learning algorithms for keyphrase extraction. Information Retrieval, 2: 303~336.

Wang J, Zucker J D. 2000. Solving the Multiple-instance Problem: A Lazy Learning Approach. In: Proceedings of 17th International Conference on Machine Learning (ICML2000). San Francisco: Morgan Kaufmann Publishers, 1119~1125.

Yang Y, Liu X. 1999. A Re-examination of Text Categorization Methods. In: Proceedings of 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, CA, USA, 42~49.

Zhang CZ, Su XN, Zhou DM. 2008. Document Clustering Using Sample Weighting. In: He YX, Xiao GZ, Sun MS eds. Recent Advance of Chinese Computing Technologies. Singapore: Chinese and Oriental Languages Information Processing Society, 3: 260~265.

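Li SJ, Wang HF, Yu SW, Xin CS. 2004. Research on the Application of the Maximum Entropy Model to Automatic Keyword Indexing. Chinese Journal of Computers, 27(9): 1192~1197. (in Chinese)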

    

Full text link: www.sciencenet.cn/upload/blog/file/2008/11/2008112485938151997.doc

    

How to cite: Zhang Chengzhi, Liu Yao, Wang Huilin. Automatic Implicit Semantic Subject Extraction Based on Citation-KNN [C]. In: Proceedings of 9th Chinese Lexical Semantics Workshop (CLSW2008), Singapore, COLIPS Publication, 2007: 371-379.

 

Related papers

A Review and Outlook of Automatic Indexing Research (PDF)

Automatic Keyword Extraction from Documents Using Conditional Random Fields (PPT)


                           

    

  

    




https://blog.sciencenet.cn/blog-36782-34528.html
