ARTICLE
Year: 2012 | Volume: 58 | Issue: 2 | Page: 155-165
Personalized Document Summarization Using Pseudo Relevance Feedback and Semantic Feature
Sun Park1, Byung Rae Cha2, JangWoo Kwon3
1 Institute Research of Information Science and Engineering Research, Mokpo National University, South Korea 2 SCENT Center, GIST, South Korea 3 Department of Computer & Information Engineering, INHA University, South Korea
Date of Web Publication: 16-May-2012
Correspondence Address: Sun Park Institute Research of Information Science and Engineering Research, Mokpo National University South Korea
 DOI: 10.4103/0377-2063.96182
Abstract
This paper proposes a new automatic personalized document summarization method that uses pseudo relevance feedback (PRF) and semantic features to extract meaningful sentences from documents retrieved from the Internet. The proposed method uses generic summarization based on the semantic features of non-negative matrix factorization (NMF) to extract sentences that reflect the major topics of the retrieved documents. In addition, the method uses query-based summarization based on PRF and semantic features to reduce the semantic gap between the low-level summarization of search results and the user's high-level perception. Combining generic and query-based summarization improves the quality of personalized document summarization because the sentences most relevant to the given query are extracted efficiently. The experimental results demonstrate that the proposed method achieves better document summarization performance than the other methods compared.
Keywords: Non-negative matrix factorization, Personalized document summarization, Pseudo relevance feedback, Semantic features
How to cite this article: Park S, Cha BR, Kwon J. Personalized Document Summarization Using Pseudo Relevance Feedback and Semantic Feature. IETE J Res 2012;58:155-65
1. Introduction
Due to the increase in the amount of accessible text documents on the Internet, the necessity for automatic document summarization has also increased. Automatic document summarization is the process of reducing the document size while maintaining its basic outline. Therefore, this process should distill the most important information from the document. The document summarization method can be either generic document summarization or query-based document summarization. A generic document summarization distills an overall sense of the documents' contents, whereas a query-based document summarization only distills the contents of the document relevant to the user's query [1] .
As personalized information on websites such as Facebook and personal blogs grows in importance, the need for personalized document summarization that concentrates on the user's interest also increases. Personalized summarization preserves the specific information that is relevant to a given user profile rather than the information that summarizes the entire content of a news item. If the summary is personalized according to the user's interests, the user can save time, not only in deciding whether the document is interesting, but also in finding the information without having to read the full text [2],[3],[4]. However, automatic personalized document summarization methods perform poorly because there is a semantic gap between the summary the user wants, which is expressed as high-level requirements, and the summary a machine produces from low-level features. To overcome this problem, recent studies have used user feedback (UF) [1, 5, 6] and semantic features [7],[8],[9] to reduce the semantic gap between low-level features and high-level concepts.
UF reformulates the query so that the new query moves toward relevant sentences and away from non-relevant ones [10]. Generally, UF can be either relevance feedback (RF) or pseudo relevance feedback (PRF). RF is a well-known approach in information retrieval (IR). However, it requires the intervention of a user to judge the relevance of documents, and the bias introduced by query expansion can exacerbate the semantic gap problem. PRF can provide an automatic relevance judgment on sentences without a user's intervention, but it may produce a biased query during the query expansion process [1, 5, 6, 10]. UF (RF or PRF) from the IR field can be applied to document summarization. However, there is an important difference between the two fields: in IR, UF expands the query using whole documents, whereas in document summarization it uses sentences from the document set. A sentence provides less information for query reformulation than an entire document, which is the unit used in ordinary IR [1, 5, 10]. Semantic feature methods can easily extract meaningful sentences because the semantic features represent the internal structure of a document set and deliver a good semantic interpretation [11,12]. However, such methods cannot always decompose a data set into semantic features successfully, since data objects can be viewed from extremely different viewpoints or as highly articulated objects [11].
In order to solve the problems posed by the above limitations of personalized document summarization, in this paper, we propose a new method that uses PRF and the semantic features of non-negative matrix factorization (NMF) to summarize personalized documents with regard to a given query. In the method proposed, first, the candidate sentence set for summarizing personalized documents is extracted using the generic and query-based summarization methods. The generic method based on semantic features constructs the candidate sentence set by selecting important sentences covering the major topics of the search results. The query-based method using PRF and semantic features organizes the candidate sentence set from the extracted significant sentences, which are highly relevant to a given query and which minimize the semantic gap between the user's interest and the extracted sentences. Finally, the personalized summarization method can improve the quality of the result of summarizing documents through a combination of generic and query-based relevance scores, which reflect user interest and the inherent structure of the retrieval results.
In the present study, we decided to somewhat modify the method of our previous works [7],[8],[9] , for the following reason. Our previous works have certain advantages in identifying meaningful sentences for personalized summaries to cover the major topics of search results; however, the method is restricted to the composition of the retrieval documents. Thus, the proposed method promises to combine the advantages of the PRF with the semantic features of the NMF for personalized document summarization.
The rest of the paper is organized as follows: In Section 2, we describe the document summarization methods and the NMF algorithm in detail. In Section 3, the personalized summarization method is introduced. Section 4 gives the evaluation and experimental results. Finally, we conclude in Section 5.
2. Related Works
2.1 Document Summarization
Generally, document summarization methods use either generic summaries or query-based summaries. Besides this, the methods are divided into single-document summarization or multi-document summarization according to the target of the summary method. Multi-document summarization is a process undertaken to produce a single summary from a set of related documents, whereas single-document summarization is a process done to summarize one document [1] .
Recently, the incorporation of user knowledge in the document summarization process has been used to increase the efficiency of document summarization methods. Several studies have been carried out using user knowledge. Han et al. proposed a text summarization method using relevance feedback with query splitting. Their method alleviates the feedback problem in a biased query during a query expansion process by splitting the initial query into several pieces. However, if there is insufficient information for query splitting, their method may produce poor document summaries [5] . Diaz and Gervas proposed an item summarization method for the personalization of news delivery systems. The method uses three phrase-selection heuristics that build summaries using two generic summarizations and one personalized summarization depending on RF from news items. Their method allows the user to decide on the relevance of the received news items without inspecting the full text document. However, their method requires user profiles for relevance feedback [2],[3] . To enhance generic-personalized summaries, Diaz and Gervas proposed an automatic personalized summarization using a combination of generic and personalized methods. Their generic summarization methods combine the position method with the thematic word method. Their personalized method selects those sentences of a document that are most relevant to a given user model. This method leads to quite a simple summarization technique providing good results in terms of indicative summarization. However, the method would be severely domain-dependent and might not work as well for different domains [4] . Kumar et al. generated personalized summaries using generic and user-specific methods based on probability. This method extracts the top ranking sentences by means of the generic sentence scoring and the user-specific sentence scoring. 
The method is not restricted to a user's web-based profile creation for user-specific summarization since the user profiles are driven by publicly available web documents [13] . However, their method restricts us to an extraction of sentences that relate well to the user's intention. To generate a web snippet, Ko et al. proposed a web snippet generation method from web pages using PRF and a query-biased summarization. Good-quality snippets are derived in this method, which uses a query expansion based on the probability model. However, this method does not reflect the internal structure of web pages for meaningful document excerpts since it only uses a statistical query expansion [6] . Li and Chen extracted personalized text snippets using statistical language models. Their study resorts to two methods for snippet extraction and utilizes the probability sequence analysis and the hidden Markov model [14] . In our previous works [7],[8],[9] , we proposed three personalized document summarization methods that use sentence ranking depending on the semantic features of the NMF [7] , the NMF, and the RF [8] , and generic and query-based sentence extraction based on PRF and semantic features [15] . However, these methods are influenced by the organization of the search results.
2.2 Non-negative Matrix Factorization
This section reviews the theory and algorithm of the NMF and uses examples to explain the advantage of representing sentences by means of semantic features. The semantic features of NMF are applied in our proposed method for extracting important sentences.
The NMF method represents an individual object as a non-negative linear combination of the part information extracted from a large volume of objects. The NMF is used to decompose a given m×n matrix A into a non-negative semantic feature matrix W and a non-negative semantic variable matrix H as shown in (1) [11]:
$$A \approx WH \qquad (1)$$
where m and n are the numbers of rows and columns of A, respectively. W is an m×r non-negative matrix and H is an r×n non-negative matrix, where r is the number of semantic features. The sizes of W and H are therefore determined by r. Usually, r is chosen to be smaller than m or n, so that the total size of W and H is smaller than that of the original matrix A. W is named the semantic feature matrix and H the semantic variable matrix by Lee and Seung [11].
The NMF algorithm can be described as follows. The original matrix A is factorized by means of an objective function and update rules. The objective function, proposed by Lee and Seung [11,12], minimizes the Euclidean distance between each column of A and its approximation WH. In this paper, we define the matrix notation as follows: Let X *j be the j'th column vector of matrix X, let X i* be the i'th row vector, and let X ij be the element of the i'th row and the j'th column. As an objective function, the Frobenius norm is used [12]:
$$\lVert A - WH \rVert_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \Bigl( A_{ij} - \sum_{l=1}^{r} W_{il} H_{lj} \Bigr)^2 \qquad (2)$$
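W and H are updated repeatedly until the objective function converges within the predefined tolerance. The update rules are as follows:
$$H_{\alpha\mu} \leftarrow H_{\alpha\mu} \frac{(W^{T}A)_{\alpha\mu}}{(W^{T}WH)_{\alpha\mu}}, \qquad W_{i\alpha} \leftarrow W_{i\alpha} \frac{(AH^{T})_{i\alpha}}{(WHH^{T})_{i\alpha}} \qquad (3)$$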
We illustrate the NMF algorithm by showing how the original matrix A is decomposed into the semantic feature matrix W and the semantic variable matrix H. Example 1 applies Equations (2) and (3) as follows:
Example (1): Let r be 2, the number of repetitions 50, and the tolerance 0.001. When the initial elements of the W and H matrices are 0.5, the non-negative matrix A is decomposed into two non-negative matrices, W and H, as shown in [Figure 1]a. [Figure 1]b shows an example of sentence representation using the semantic features of NMF. The column vector A *1, corresponding to the first sentence, is represented as a linear combination of the semantic feature vectors W *r and the semantic variable column vector H *1.
Figure 1: An example of sentence representation using semantic features and the results of the NMF algorithm. (a) The result of the NMF decomposition. (b) Representing sentence by semantic features and semantic variables.
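The multiplicative update procedure described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`nmf`, `frobenius_error`, `matmul`, `transpose`), the small constant `eps` that guards against division by zero, and the seeded random initialization are assumptions of this sketch. The defaults for the number of repetitions (50) and the tolerance (0.001) follow Example 1.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH (the objective being minimized)."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, r, max_iter=50, tol=0.001, seed=0, eps=1e-9):
    """Factorize a non-negative m x n matrix A into W (m x r) and H (r x n)."""
    m, n = len(A), len(A[0])
    rng = random.Random(seed)  # assumption: seeded random positive start values
    W = [[rng.uniform(0.1, 1.0) for _ in range(r)] for _ in range(m)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(r)]
    prev = frobenius_error(A, W, H)
    for _ in range(max_iter):
        # H <- H * (W^T A) / (W^T W H), element by element
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(r)]
        # W <- W * (A H^T) / (W H H^T), element by element
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(m)]
        err = frobenius_error(A, W, H)
        if abs(prev - err) < tol:  # stop once the objective has settled
            break
        prev = err
    return W, H
```

The sketch starts from random positive values rather than a constant because, under these update rules, identical initial entries keep every column of W (and every row of H) identical to the others, so the r semantic features would never differ.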
Example 2 describes the advantage of representing sentences using the semantic features of NMF from real data.
Example (2) [Table 1] shows a part of the extracted sentences from a document related to the topic "Describe the activities of Morris Dees and the Southern Poverty Law Center." with the document title "NYT19990304.0376" found in DUC 2007 [15,16]. We generate the term-by-sentence matrix A by preprocessing the set of sentences found in [Table 1]. The matrix A is composed of 637 terms and 57 sentences. [Table 2] illustrates the cases in which we applied the NMF to matrix A.
Table 1: A sampling of the extracted sentences related to the topic in DUC 2007
Table 2: The representation of a sentence by means of semantic features of the NMF
[Table 2] illustrates ten semantic feature vectors (i.e., r = 10), W *1, …, W *10, obtained from the NMF decomposition of matrix A, the weight values, H 1,10, …, H 10,10, of the semantic feature vectors with respect to the sentence S10, the original sentence vector, and the sentence vector calculated from the weight values and the semantic feature vectors.
There are no negative values and many zero values in [Table 2]. That is, the semantic feature vectors obtained by using the NMF are sparse, so that the NMF can obtain semantic features that have a small semantic range. This indicates that a method that uses NMF has a better power to identify document topics than do methods that use decomposition approaches, such as principal component analysis and vector quantization. It should be noted that the semantic feature vectors found in [Table 2] intuitively make more sense because the NMF represents a sentence as a linear combination of a few intuitive semantic feature vectors having non-negative values [11] .
3. The Proposed Personalized Document Summarization Method
In this paper, we propose an automatic personalized document summarization using PRF and semantic features of the NMF. The proposed method consists of preprocessing, sentence extraction, and personalized summarization. We give a full explanation of these three phases in [Figure 2].
3.1 The Preprocessing Phase
In the preprocessing phase, the given search results are first decomposed into individual sentences. For the English language, the stop-words are then removed using Rijsbergen's stop-words list [10,17], and word stemming is performed by Porter's stemming algorithm [10,17]. Next, a weighted term-frequency vector is constructed for each sentence in the search results using Equation (4) [10,17]. Let A be an m×n matrix, where m is the number of terms and n is the number of sentences in the whole of the search results. Let element A ij be the weighted term-frequency of term i in sentence j:
$$A_{ij} = L_{ij} \cdot G(i) \qquad (4)$$
where L ij is the local weight (term frequency) for term i in sentence j, and G(i) is the global weight (inverse document frequency) for term i in the whole of the search results [10],[17]. That is:
$$G(i) = \log\left(\frac{N}{N(i)}\right) \qquad (5)$$
where, N is the total number of sentences in the whole of the search results, and N(i) is the number of sentences that contain term i.
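As a minimal sketch of this weighting, the following Python builds the term-by-sentence matrix from sentences that have already been tokenized, stop-word filtered, and stemmed. It assumes the common form in which the local weight is the raw term frequency and the global weight is log(N / N(i)) with the natural logarithm; the base of the logarithm and the function name `weighted_term_matrix` are assumptions of this sketch.

```python
import math


def weighted_term_matrix(sentences):
    """Return (vocab, A) where A[i][j] = L_ij * G(i) for term i in sentence j.

    `sentences` is a list of token lists (already preprocessed).
    """
    vocab = sorted({term for sentence in sentences for term in sentence})
    total = len(sentences)  # N: total number of sentences
    matrix = []
    for term in vocab:
        # N(i): number of sentences that contain the term
        containing = sum(1 for sentence in sentences if term in sentence)
        global_weight = math.log(total / containing)  # assumed natural log
        # Local weight L_ij is the raw term frequency in sentence j.
        matrix.append([sentence.count(term) * global_weight for sentence in sentences])
    return vocab, matrix
```

Under this form, a term that occurs in every sentence receives a global weight of zero, so it contributes nothing to any sentence vector.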
For the Korean language, stop-word removal and word stemming use the Hangul Analysis Module (HAM), an existing Korean language analysis tool. The HAM is shareware written in the C language. It supports automatic indexing, spell checking, construction analysis, compound noun decomposition, and automatic word spacing. It is based on a morpheme analyzer and serves as the kernel library for Korean language analysis [18].
3.2 The Sentence Extraction Phase
In this section, we extract the important sentences for personalized summarization using the generic and query-based summarization methods. Sentence extraction consists of the generic summarization, the query expansion, and the query-based summarization.
3.2.1 The Generic Summarization
The generic summarization extracts significant sentences including an overall sense of the search results and constructs the candidate sentence set for personalized document summarization. We modify our previous generic summarization method using the semantic variable of NMF [16] . This section describes how to extract the candidate sentences from the search results by means of the generic method. In the generic summarization shown in [Figure 2]a, first, the preprocessing and the NMF are performed. Second, the generic method extracts the important sentences that construct the candidate sentence set by the generic relevance weight based on the semantic variable matrix of the NMF. We define the generic relevance weight GRweight() as Equation (6). Finally, the generic relevance score with respect to the extracted sentence is calculated by using Equation (7) for the personalized summarization.

The generic relevance weight denotes how much the sentence reflects significant topics in the search results, which are represented as semantic variables with respect to sentences.
The generic relevance score grs is defined as follows:

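where grs i is the relevance score of the i'th sentence. The generic relevance score for the personalized summarization denotes how much the extracted sentence reflects user interest, which is represented as the generic relevance weight and the similarity to the original query. The similarity is the cosine similarity between the query and a sentence vector:
$$sim(Q, S) = \frac{\sum_{i=1}^{n} q_i s_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} s_i^2}} \qquad (8)$$
where Q = (q1, q2, …, qn) is the query term vector and S = (s1, s2, …, sn) is the weighted term-frequency vector of a sentence. qi denotes the i'th term frequency of the query, while n denotes the number of terms.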
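The cosine similarity between a query vector and a sentence vector can be sketched as follows. The function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions of this sketch; both vectors are expected to be indexed by the same terms.

```python
import math


def cosine_similarity(query, sentence):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(q * s for q, s in zip(query, sentence))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_s = math.sqrt(sum(s * s for s in sentence))
    if norm_q == 0 or norm_s == 0:
        return 0.0  # assumption: an empty vector is treated as unrelated
    return dot / (norm_q * norm_s)
```

Because all term weights are non-negative, the value lies between 0 (no shared terms) and 1 (same direction).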
The proposed algorithm for generic document summarization is given in Algorithm 1.

In line 1, the preprocessing is performed. In line 2, the matrices W and H are factorized from the matrix A by NMF. In lines 3 to 9, the generic summarization uses the generic relevance weight. In line 4, generic relevance weight grw is calculated. In line 5, the top k sentences are extracted by the highest generic relevance weight values. In line 6, the candidate sentence set is constructed from the extracted sentences. In line 7, the generic relevance score grs is calculated for the personalized summarization.
Example (3) We illustrate the example of generic summarization using the proposed generic summarization algorithm as follows [16] : [Table 3] shows five sentences and one query. [Figure 3] shows the decomposing of matrix A in [Table 3] into a semantic feature matrix W and a semantic variable matrix H using NMF. [Figure 4] illustrates the sentence extraction process from the set of sentences found in [Table 3]. We calculate the GRweight() and then extract the sentence S2 corresponding to the semantic variable column vector H *2 having the largest GRweight() value (0.3216).
Figure 4: The generic document summarization using GR weight and semantic variable matrix H.
3.2.2 The Pseudo Relevance Feedback
Summarization methods that depend on NMF sometimes fail to summarize well, for example when the document set consists of very different sentences or of extremely similar sentences [11]. To overcome this dependence on the composition of the document set, a query reformulation method is used. This section explains how the query is expanded by the PRF for the query-based summarization method. Because the expansion uses the similarities between the query and the important sentences, the extracted sentences reflect the user's purpose as expressed by the query.
The PRF phase shown in [Figure 2]b is described as follows [1],[9]: In the first step, for a given initial query, the relevant sentences are extracted according to the cosine similarity between the initial query and each sentence vector in the documents, using Equation (8). The top k-ranked sentences, that is, those with the highest similarity values, are selected. In the second step, the initial query is expanded by using each selected relevant sentence. The following query expansion method is used:
Query Point Movement:

where Q new is the new expanded query vector of the current query Q, S t is the t'th sentence in the relevant sentences, and w t is the weight calculated by using the cosine similarity between the current query Q and S t.
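A minimal sketch of positive-only query point movement is given below. Since the expansion formula itself is not reproduced in this text, the sketch assumes the simple additive form in which each of the top k sentences is added to the query, scaled by its cosine similarity to the current query. That form, the function names, and the default k are assumptions, not the authors' exact definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def expand_query(query, sentence_vectors, k=3):
    """Move the query toward its k most similar sentences (assumed additive form)."""
    # Step 1: rank sentences by similarity to the initial query, keep the top k.
    ranked = sorted(sentence_vectors, key=lambda s: cosine(query, s), reverse=True)
    relevant = ranked[:k]
    # Step 2: add each relevant sentence, weighted by its similarity w_t.
    new_query = list(query)
    for sentence in relevant:
        weight = cosine(query, sentence)
        new_query = [q + weight * x for q, x in zip(new_query, sentence)]
    return new_query
```

Only positive movement is sketched because pseudo relevance feedback has no judgment of non-relevant sentences to move away from.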
Example (4) We illustrate the example of PRF using query point movement as follows [9] : [Table 4] shows the cosine similarity values between the query and the sentences from [Table 3]. [Figure 5] shows the result of the query expansion obtained by Equation (9) from [Table 3] and [Table 4]. We can see that the PRF uses only a positive query expansion with the sentences (S2, S3, S4) in [Figure 5] by query point movement. This is because the PRF cannot judge non-relevant sentences.
Figure 5: The result of the query expansion using the query point movement from Table 5 and Figure 3.
Table 4: Cosine similarity values between the query and the sentences from Table 3
3.2.3 The Query-based Summarization
The proposed query-based summarization method extracts meaningful sentences for personalized summarization. The method uses the similarity between the query expanded by PRF and the semantic features that represent the internal structure of the search results. It thereby minimizes the semantic gap between the user's intention for summarization and the results of the summarization methods. We have modified our previous query-based summarization method, which uses the semantic features obtained by NMF [15],[19].
This section shows how to select candidate sentences from the search results using the query-based and PRF methods. The query-based summarization shown in [Figure 2]c is described as follows: In the first step, the sentences most relevant to the expanded query are extracted by the cosine similarity of Equation (8). The candidate sentence set is constructed from the extracted sentences. In the second step, the query relevance score qrs, with respect to the selected sentence, is calculated by using Equation (10) for the personalized summarization phase.

The query relevance score for the personalized summarization is defined as how much the selected sentence reflects the user interest in the expanded query, which is represented as the semantic features.
The proposed algorithm for query-based summarization is given in Algorithm 2.

In line 1, the preprocessing phase is performed. In line 2, the matrices W and H are factorized from the matrix A by NMF. In lines 3 to 9, the query-based summarization uses PRF and semantic features. In line 4, a column vector W *p is selected by the largest similarity to the expanded query Q new. The fact that the similarity between W *p and the query is largest means that the p'th semantic feature vector W *p is the feature most relevant to the expanded query. In line 5, the sentence corresponding to the largest index value of the row vector H p* is extracted. This process selects the sentence that has the largest weight for the most relevant semantic feature. In line 6, the candidate sentence set is constructed with the extracted sentence. In line 7, the query relevance score is calculated for the personalized summarization.
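The selection described for lines 4 and 5 can be sketched as follows: pick the semantic feature vector most similar to the expanded query, then pick the sentence with the largest weight in the matching row of H. The function names are hypothetical, and W and H are assumed to be lists of rows with W of size m×r and H of size r×n.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_sentence(W, H, expanded_query):
    """Return (p, j): index of the most query-like feature and of its top sentence."""
    features = list(zip(*W))  # columns of W are the semantic feature vectors W_*p
    p = max(range(len(features)), key=lambda idx: cosine(features[idx], expanded_query))
    # Row p of H holds the weight of feature p in every sentence.
    j = max(range(len(H[p])), key=lambda idx: H[p][idx])
    return p, j
```

For example, with two features and three sentences, a query that overlaps mostly with the second feature selects the sentence in which that feature has the largest weight.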
Example (5) We illustrate sentence extraction in connection with steps 4 to 6 of the query-based document summarization algorithm [15],[19].
[Figure 6] illustrates the sentence extraction process from the set of sentences in [Table 3]. In [Figure 6]a, the similarity values are calculated between the expanded query [Figure 5] and the semantic feature vectors [Figure 3]. The semantic feature vector W *3 having the largest similarity value (0.8472) is selected. In [Figure 6]b, the semantic variable row vector H 3*, corresponding to the semantic feature vector W *3, is selected. In [Figure 6]c, the sentence S3 is extracted because it corresponds to the largest value (0.5729) in the semantic variable vector H 3*.
Figure 6: The sentence extracted using the similarity between the expanded query and the semantic feature vectors.
3.3 The Personalized Summarization Phase
The personalized summarization method extracts the top k-ranked sentences from the candidate sentence set using the relevance score of generic and query-based summarization. The extracted sentence improves the quality of the summarization result since it reflects the internal structure of the search results by semantic features and the user's purpose in summarizing the document.
This phase can be described as follows: In the first step, as shown in [Figure 2]d, the generic relevance scores of Equation (7) and the query-based relevance scores of Equation (10) are normalized. Then, the ranking scores of the candidate sentences are calculated by using Equation (11).

where, rs i is the ranking score of the i'th sentence.
In the second step, as shown in [Figure 2]e, the top k-ranked sentences, in connection with the ranking scores, are extracted from the candidate sentence set after eliminating redundancy.
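The two steps of this phase can be sketched under explicit assumptions. The text does not reproduce the normalization or the ranking formula, so this sketch assumes min-max normalization of each score and a ranking score equal to the sum of the two normalized scores. Redundancy elimination is reduced to keeping each sentence id once. The function names and the input layout are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; all-equal inputs map to 0.0 (assumed convention)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def rank_candidates(candidates, k):
    """Rank (sentence_id, grs, qrs) triples and return the top k as (id, score).

    Assumption: ranking score = normalized grs + normalized qrs.
    """
    ids = [c[0] for c in candidates]
    generic = min_max([c[1] for c in candidates])
    query_based = min_max([c[2] for c in candidates])
    scored = sorted(
        zip(ids, (g + q for g, q in zip(generic, query_based))),
        key=lambda pair: pair[1],
        reverse=True,
    )
    seen, top = set(), []
    for sentence_id, score in scored:
        if sentence_id in seen:  # simplistic redundancy check: same sentence id
            continue
        seen.add(sentence_id)
        top.append((sentence_id, score))
        if len(top) == k:
            break
    return top
```

A sentence that scores well under both the generic and the query-based view ranks above one that is strong in only one of them.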
4. The Experimental Results
For our experimental data, we used evaluation data of DUC 2006 and real data of Yahoo-Korea News.
The Document Understanding Conference (DUC) is the international conference for performance evaluation in the area of document summarization. The DUC 2006 test data set is composed of 50 topics and 25 documents relevant to each topic from the AQUAINT corpus for query-relevant multi-document summarization [20] . To compare the performances of the proposed method using DUC 2006, we used the ROUGE evaluation software package [21] , which compares various summary results from several summarization methods with summaries generated by human beings. ROUGE has been applied by the DUC for performance evaluation. ROUGE includes five automatic evaluation methods, ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU. Each method estimates recall, precision, and f-measure between experts' reference summaries and candidate summaries of the proposed system. ROUGE-N uses n-gram recall between a candidate summary and a set of reference summaries. ROUGE-L computes the ratio between the length of the summaries' longest common subsequence (LCS) and the length of the reference summary as delineated. ROUGE-W uses the weighted LCS that favors LCS with consecutive matches. ROUGE-S uses the overlap ratio of the skip-bigram between a candidate summary and a set of reference summaries. ROUGE-SU is an extension of ROUGE-S with the addition of unigram as the counting unit [21] .
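To make the ROUGE-N description concrete, the following sketch computes n-gram recall between a tokenized candidate summary and a set of tokenized reference summaries, with matches clipped to the candidate's n-gram counts. It illustrates the measure only and does not reproduce the ROUGE software package; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=2):
    """Matched reference n-grams divided by total reference n-grams."""
    candidate_counts = ngrams(candidate, n)
    matched = 0
    total = 0
    for reference in references:
        reference_counts = ngrams(reference, n)
        total += sum(reference_counts.values())
        # Clip each match to the number of times the candidate has the n-gram.
        matched += sum(min(count, candidate_counts[gram])
                       for gram, count in reference_counts.items())
    return matched / total if total else 0.0
```

With n = 1 the measure counts shared words, and larger n rewards longer exact word sequences.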
To evaluate performance on a real data set, we issued queries to retrieve news documents from Yahoo-Korea News. Five independent evaluators were employed to manually create summaries of the 2000 documents retrieved from Yahoo Korea news using 20 queries. Each document from Yahoo Korea news selected by the evaluators has an average of 15.1 sentences. [Table 5] provides the particulars of the evaluation data set of Yahoo Korea. The retrieved news documents were preprocessed using the HAM, which is a Korean language analysis tool based on a morpheme analyzer [18].
In this paper, we used recall (R), precision (P), and F-measure to evaluate the performance of the proposed method on the real data of Yahoo Korea. Let S man and S sum be the sets of sentences selected by the human evaluators and by the summarizer, respectively. The standard definitions of recall (R), precision (P), and F-measure are as follows [3],[10],[18]:
$$R = \frac{|S_{man} \cap S_{sum}|}{|S_{man}|}, \qquad P = \frac{|S_{man} \cap S_{sum}|}{|S_{sum}|}, \qquad F = \frac{2RP}{R + P} \qquad (12)$$
We implemented six different summarization methods: QSS, GPS, SGM, GUSS, PNMF, and PFNMF. QSS [5] denotes Han's text summarization method, which uses positive relevance feedback depending on query splitting. GPS [4] denotes Diaz's summarization method using generic and personalized summaries. SGM [6] denotes Ko's snippet generation method using the PRF. GUSS [13] denotes Kumar's summary generation method using generic and user-specific sentence scorings. PNMF [7] denotes our previous personalized summarization method using sentence ranking by means of semantic features. PFNMF denotes our proposed method using generic and query-based summaries depending on the PRF and semantic features.
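The recall, precision, and F-measure used for the Yahoo Korea evaluation can be sketched over sets of sentence identifiers, using the standard set-overlap definitions. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch.

```python
def precision_recall_f(s_man, s_sum):
    """Return (precision, recall, f_measure) for human vs. system selections."""
    s_man, s_sum = set(s_man), set(s_sum)
    overlap = len(s_man & s_sum)  # sentences chosen by both
    precision = overlap / len(s_sum) if s_sum else 0.0
    recall = overlap / len(s_man) if s_man else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For instance, if the evaluators select sentences {1, 2, 3, 4} and the summarizer selects {2, 3, 5}, two sentences overlap, giving a precision of 2/3 and a recall of 1/2.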
Experiment (1) We compared the ROUGE results of the six different summarization methods using the DUC 2006 data set. In [Table 6], the average recall of PFNMF is approximately 42.21% higher than that of QSS, 13.56% higher than that of GPS, 20.26% higher than that of SGM, 14.08% higher than that of GUSS, and 10.73% higher than that of PNMF.
Table 6: Performance comparison using DUC 2006 with respect to average recall of ROUGE measures
In [Table 7], the average precision of PFNMF is approximately 24.55% higher than that of QSS, 40.38% higher than that of GPS, 44.39% higher than that of SGM, 24.15% higher than that of GUSS, and 15.03% higher than that of PNMF.
Table 7: Performance comparison using DUC 2006 with respect to average precision of ROUGE measures
In [Table 8], the average f-measure of PFNMF is approximately 19.84% higher than that of QSS, 27.32% higher than that of GPS, 28.84% higher than that of SGM, 20.69% higher than that of GUSS, and 12.06% higher than that of PNMF.
Table 8: Performance comparison using DUC 2006 with respect to f-measure of ROUGE measures
Experiment (2) We compared the evaluation measures of the six implemented methods using real data of Yahoo Korea and Equation (12). The evaluation results are shown in [Figure 7]. The average recall of the PFNMF is approximately 21.3% higher than that of the QSS, 4% higher than that of the GPS, 14.4% higher than that of the SGM, 5.7% higher than that of the GUSS, and 3.3% higher than that of the PNMF. The average precision of the PFNMF is approximately 19% higher than that of the QSS, 4.2% higher than that of the GPS, 16% higher than that of the SGM, 8.4% higher than that of the GUSS, and 3.2% higher than that of the PNMF. The average F-measure of the PFNMF is approximately 20.5% higher than that of the QSS, 4.1% higher than that of the GPS, 15.1% higher than that of the SGM, 6.9% higher than that of the GUSS, and 3.3% higher than that of the PNMF.
Figure 7: The evaluation results with respect to the average F-measure using Equation (12).
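As a rough illustration of how such sentence-level measures and "X% higher" comparisons can be computed, the sketch below assumes that the evaluation follows the usual set-based definitions of recall, precision, and F-measure over extracted versus relevant sentences, and that the reported gaps are relative improvements over each baseline. Both points are assumptions made here; the helper names `sentence_scores` and `relative_gain` are hypothetical.

```python
def sentence_scores(extracted, relevant):
    """Set-based recall, precision and F-measure for extracted sentence ids."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(extracted) if extracted else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent (assumed reading)."""
    return 100.0 * (ours - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical sentence ids: four extracted sentences, three judged relevant.
    print(sentence_scores([1, 2, 3, 4], [2, 3, 5]))
    # Hypothetical averages for two methods.
    print(relative_gain(0.6, 0.5))
```

If the reported figures were instead absolute differences in percentage points, `relative_gain` would be replaced by a plain subtraction.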
To better understand why our method works better than the other methods, we analyzed the properties of the summarization methods shown in [Figure 7]. The recall, precision, and F-measure of the GUSS are better than those of the QSS and SGM. This is because the GUSS uses a probability distribution over the terms in documents to extract the generic and user-specific sentences, whereas the QSS and SGM use term frequency-based methods. The GPS scores better than the GUSS because it combines generic and personalized methods. The PNMF scores better than the GPS because it generates a more meaningful summary by reflecting the inherent semantics of the document in both the generic and the personalized summary. The PFNMF achieves the best performance on all three measures because it selects meaningful sentences covering the major topics and reflects user interest in the search results by using the PRF with the NMF.
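The combination of NMF-based semantic features with pseudo relevance feedback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PFNMF algorithm as defined in this paper: the NMF uses the multiplicative updates of Lee and Seung [11], the query is expanded by adding the mean term vector of the top-ranked sentences (a Rocchio-style choice assumed here), and the names `nmf`, `cosine`, `rank_sentences`, and the weight `alpha` are hypothetical.

```python
import numpy as np


def nmf(A, r, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (terms x sentences) as W @ H."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, r)) + eps  # semantic features (columns)
    H = rng.random((r, n)) + eps  # semantic variables per sentence
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


def cosine(u, V):
    """Cosine similarity between vector u and every column of V."""
    return (u @ V) / (np.linalg.norm(u) * np.linalg.norm(V, axis=0) + 1e-12)


def rank_sentences(A, query, r=2, top_k=2, alpha=0.5):
    """Rank sentences by a mix of a generic and a query-based score."""
    W, H = nmf(A, r)
    # Generic score (assumed form): total weight of a sentence over all features.
    generic = H.sum(axis=0)
    generic = generic / (generic.max() + 1e-12)
    # Pseudo relevance feedback: treat the top_k sentences of a first
    # retrieval pass as relevant and add their mean term vector to the query.
    first_pass = cosine(query, A)
    top = np.argsort(-first_pass)[:top_k]
    expanded = query + A[:, top].mean(axis=1)
    # Query-based score: compare the expanded query with each sentence
    # in the semantic feature space.
    query_score = cosine(W.T @ expanded, H)
    combined = alpha * generic + (1 - alpha) * query_score
    return np.argsort(-combined), combined


if __name__ == "__main__":
    # Toy term-by-sentence matrix: 4 terms, 4 sentences, two topics.
    A = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 1, 2],
                  [0, 0, 2, 1]], dtype=float)
    query = np.array([1.0, 0.0, 0.0, 0.0])
    order, scores = rank_sentences(A, query)
    print(order, scores)
```

The weight `alpha` only stands in for whatever balance between generic and query-based relevance a real system would use; the scores in this toy example carry no evaluative meaning.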
5. Conclusion | |  |
A personalized summary adapts to a user so that the user can correctly judge whether a document is of real interest without having to read the whole document. In this paper, we propose an automatic personalized query-based document summarization method using PRF and semantic features. The proposed method extracts important sentences that amply cover the major ideas of the search results by using semantic features that reflect the internal structure of the retrieved documents. The method also selects significant sentences that are highly relevant to a user's interest, since it reduces the semantic gap between the user query and the summarization results by using PRF and semantic features. Finally, the method improves the quality of personalized summaries, since it extracts meaningful sentences from candidate sentence sets according to both the generic and the query-based relevance scores. Experimental results show that the proposed method outperforms the five other summarization methods examined in this paper. In the near future, we plan to enhance the personalized summarization method by using a variety of term weighting schemes. We anticipate that such weighting will improve the accuracy of the automatic personalized summarization by clustering the topics of the search results.
6. Acknowledgment | |  |
This work was supported by the Priority Research Centers Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0028295). This research was also supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-C1090-1121-0007).
References | |  |
| 1. | I Mani, "Automatic Summarization," Amsterdam, Netherlands: John Benjamins Publishing Company; 2001. |
| 2. | A Diaz and P Gervas, "Item Summarization in Personalization of News Delivery Systems," In Proceedings of the 7th International Conference on Text, Speech and Dialogue (TSD), LNAI 3206, Brno, Czech Republic, pp.49-56, Sep. 2004. |
| 3. | A Diaz and P Gervas, "Evaluation of a system for personalized summarization of web contents," In Proceedings of the 10th International Conference on User Modeling (UM), LNAI 3538, Edinburgh, Scotland, UK, pp.453-62, Jul. 2005. |
| 4. | A Diaz and P Gervas, "User-model based personalized summarization," Information Processing and Management, Vol. 43, pp.1715-34, Mar. 2007. |
| 5. | K S Han, D H Bea, and H C Rim, "Automatic Text Summarization Based on Relevance Feedback with Query Splitting," In Proceedings of the 5th International Workshop on Information Retrieval with Asian Language, Hong Kong, pp.201-2, Sep. 2000. |
| 6. | Y J Ko, H K An, and J Y Seo, "Pseudo-relevance feedback and statistical query expansion for web snippet generation," Information Processing Letters, Vol. 109, pp.18-22, 2008. |
| 7. | S Park, "Personalized Summarization Agent Using Non-negative Matrix Factorization," In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PRICAI), Hanoi, Vietnam, pp.1034-8, May 2008. |
| 8. | S Park and B R Cha, "Query based Personalized Summarization Agent using NMF and Relevance Feedback," In Proceedings of the International Conference on Convergence and Hybrid Information Technology (ICCIT), Busan, South Korea, Nov. 2008. |
| 9. | S Park and D U An, "Automatic Query-based Personalized Summarization that uses Pseudo Relevance Feedback with NMF," In Proceedings of the International Conference on Ubiquitous Information Management and Communication (ICUIMC), Suwon, Korea, pp.422-8, Jan. 2010. |
| 10. | R Baeza-Yates and B Ribeiro-Neto, "Modern Information Retrieval," New York: ACM Press; 1999. |
| 11. | D D Lee and H S Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, Vol. 401, pp.788-91, Oct. 1999. |
| 12. | S Wild, J Curry, and A Dougherty, "Motivating Non-Negative Matrix Factorizations," In Proceedings of SIAM ALA, 2003. |
| 13. | C Kumar, P Pingali, and V Varma, "Generating Personalized Summaries Using Public Available Web Documents," In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology, Sydney, Australia, pp.103-6, Dec. 2008. |
| 14. | Q Li and Y P Chen, "Personalized text snippet extraction using statistical language models," Pattern Recognition, Vol. 43, pp.378-86, 2010. |
| 15. | S Park, B R Cha, and D U An, "Automatic Multi-document Summarization Based on Clustering and Nonnegative Matrix Factorization," IETE Technical Review, Vol. 27, pp.167-78, Mar. 2010. |
| 16. | J H Lee, S Park, C M Ahn, and D H Kim, "Automatic Generic Document Summarization Based on Non-negative Matrix Factorization," Information Processing and Management, Vol. 45, pp.20-34, Jan. 2009. |
| 17. | W B Frakes and R Baeza-Yates, "Information Retrieval: Data Structures & Algorithms," New Jersey: Prentice-Hall; 1992. |
| 18. | S S Kang, "Information Retrieval and Morpheme Analysis," Seoul, Korea: HongReung Science Publishing Company; 2002. |
| 19. | S Park and J H Lee, "Topic-based Multi-document Summarization Using Non-negative Matrix Factorization and K-means," Journal of KIISE: Software and Applications, Vol. 35, No. 4, pp.255-64, 2008. |
| 20. | H T Dang, "Overview of DUC 2006," In Document Understanding Conference, New York City, Jun. 2006. |
| 21. | C Y Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," In Workshop on Text Summarization Branches Out, Post-Conference Workshop of Association for Computational Linguistics, Barcelona, Jul. 2004. |
Authors | |  |
Sun Park is a research professor at the Research Faculty Institute of Information Science and Engineering Research, Mokpo National University, Korea. He received the Ph.D. degree in Computer and Information Engineering from Inha University in 2007, the M.S. degree in Information and Communication Engineering from Hannam University in 2001, and the B.S. degree in Computer Engineering from Jeonju University in 1996. Prior to becoming a researcher at Mokpo National University, he worked as a postdoctoral researcher at Chonbuk National University and as a professor in the Dept. of Computer Engineering, Honam University, Korea. His research interests include Data Mining, Information Retrieval, and Information Summarization.
Byung Rae Cha is a research professor at the School of Information and Communication, GIST, Korea. He received the Ph.D. degree in computer engineering from Mokpo National University in 2004 and the M.S. degree in computer engineering from Honam University in 1997. Prior to becoming a research professor at GIST, he worked as a research professor in the Department of Information and Communication Engineering, Chosun University, and as a professor in the Department of Computer Engineering, Honam University, Korea. His research interests include Computer Security of IDS and P2P, Neural Networks Learning, Mobile-OTP, Future Internet, and Cloud Computing.
JangWoo Kwon received the B.S. degree in electronic engineering from INHA University in 1990, and the M.E. and Ph.D. degrees in electronic engineering from INHA University in 1992 and 1996, respectively. In 1992 he was a visiting researcher at the Department of Biomedical Engineering of Tokyo University, Tokyo, Japan. From 1996 to 1998 he was a deputy director of the Korea Industrial Property Office (KIPO), where his responsibility was to examine patents. From 1998 to 2009 he was an Associate Professor in the Department of Computer Engineering at Tongmyoung University, Pusan, Korea. He was Dean of the Research Institute for Information Eng. Tech. at Tongmyoung University from 2002 to 2006. From 2010 to 2012 he was an Associate Professor in the Department of Computer Eng. at Kyungwon University, Kyeong-gi Province, Korea. Since 2006, he has been a Director of the Human Resource Development Division of the National IT Industry Promotion Agency of Korea. Currently, his research area is sensor networks and human computer interaction using biomedical signals. For the last 20 years he has been working on biomedical signal analysis and recognition using artificial intelligence.