
Mining and Summarizing Customer Reviews


M.Q. Hu, B. Liu, Mining and Summarizing Customer Reviews, KDD (2004)


Abstract

The paper mines and summarizes customer reviews of a product.

It mines only the product features on which customers have expressed opinions, and determines whether each opinion is positive or negative.

Steps:

  1. Mine the product features that customers have commented on;

  2. Identify the opinion sentences in each review and decide whether each opinion sentence is positive or negative;

  3. Summarize the results.

1 Introduction

Background: the number of product reviews is growing rapidly, yet only a small share of review sentences actually express opinions.

Goal: generate feature-based summaries of customer reviews of products.

Features here means product features (or attributes) and functions.

Tasks:

  1. Identify the features of the product on which customers have expressed opinions (called product features);

  2. For each feature, identify the review sentences that give positive or negative opinions;

  3. Produce a summary using the discovered information.

Steps:

  1. Mine the product features that customers have commented on;

  2. Identify the opinion sentences in each review and decide whether each opinion sentence is positive or negative;

    • Identify the set of opinion words (a set of adjectives that express opinions);

    • Determine the semantic orientation (positive or negative) of each opinion word, by bootstrapping with WordNet;

    • Determine the opinion orientation of each sentence.

  3. Summarize the results.

The system that implements these steps is called FBS (Feature-Based Summarization).

2 Related Work

Compared with earlier work, the paper differs in that it (1) splits reviews into sentences and classifies the opinion sentences among them; (2) mines product features; (3) does not rely on a corpus. Its goal is to identify product features and customers' attitudes towards each of them, and to produce a summary that does not depend on a template.

2.1 Subjective Genre Classification

2.2 Sentiment Classification

A small list of seed adjectives is created manually and tagged with positive or negative labels; the list is then grown using WordNet.

2.3 Text Summarization

Text summarization methods: (1) template instantiation; (2) passage extraction. The former needs background knowledge about the domain and genre in order to identify and extract certain core entities and facts and package them in a template. The latter needs to identify the segments of the text (typically sentences) that are most representative of the document's content.

The paper's approach needs no template, is domain-independent, and does not extract representative passages. Instead, it identifies and extracts the specific product features and the opinions related to them.

2.4 Terminology Finding

Terminology finding methods: (1) symbolic approaches; (2) statistical approaches.

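3 The Proposed Techniques

[Figure: FBS system architecture]

Input: the product name and an entry Web page for all the reviews of the product.

Pipeline: (1) mine the product features that customers have commented on; (2) identify the opinion sentences in the reviews and label each one as positive or negative; (3) summarize the results.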

3.1 Part-of-Speech Tagging (POS)

Product features are usually nouns or noun phrases in review sentences. The paper uses the NLProcessor linguistic parser to tag each review with parts of speech.

Pre-processing: removal of stopwords, stemming, and fuzzy matching. Fuzzy matching handles word variants and misspellings.
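As a rough illustration of how tagged output can be turned into candidate features, the sketch below collects runs of consecutive noun tokens. It assumes Penn Treebank-style tags and hand-tagged input; it does not call NLProcessor, and it leaves out stemming and fuzzy matching. The `STOPWORDS` set and the function name `noun_candidates` are made up for this example.

```python
from typing import List, Tuple

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "a", "an", "of", "is", "this", "and"}


def noun_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal runs of consecutive noun tokens as candidate features."""
    candidates, run = [], []
    for word, tag in tagged:
        w = word.lower()
        if tag.startswith("NN") and w not in STOPWORDS:
            run.append(w)          # extend the current noun run
        else:
            if run:                # a non-noun token closes the run
                candidates.append(" ".join(run))
            run = []
    if run:                        # flush a run that ends the sentence
        candidates.append(" ".join(run))
    return candidates


# Hand-tagged sentence: "The picture quality is amazing and the battery lasts long"
tagged = [("The", "DT"), ("picture", "NN"), ("quality", "NN"), ("is", "VBZ"),
          ("amazing", "JJ"), ("and", "CC"), ("the", "DT"), ("battery", "NN"),
          ("lasts", "VBZ"), ("long", "RB")]
print(noun_candidates(tagged))  # ['picture quality', 'battery']
```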

3.2 Frequent Features Identification

Frequent features: product features on which many people have expressed opinions. The paper considers only features that appear explicitly as nouns or noun phrases.

Finding frequent features: the paper uses association mining (the association miner CBA, based on the Apriori algorithm) to find frequent itemsets. An itemset is a set of words or a phrase that occurs together in some sentences. An itemset that appears in at least 1% of the review sentences (the minimum support) is a frequent itemset, also called a candidate frequent feature.
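The frequent-itemset step can be sketched with a small level-wise (Apriori-style) search. This is a minimal illustration rather than the CBA miner the paper uses: each sentence is assumed to be already reduced to a set of nouns, and the `max_size` cap and the function name are assumptions of this sketch.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[str]],
                      min_support: float = 0.01,
                      max_size: int = 3) -> Dict[FrozenSet[str], int]:
    """Level-wise search for itemsets found in >= min_support of the sentences."""
    min_count = min_support * len(transactions)
    # Level 1: count single items.
    counts: Dict[FrozenSet[str], int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {k: c for k, c in counts.items() if c >= min_count}
    result = dict(current)
    size = 2
    while current and size <= max_size:
        items = sorted({i for k in current for i in k})
        # Keep a candidate only if all of its (size-1)-subsets are frequent.
        candidates = [frozenset(c) for c in combinations(items, size)
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {k: c for k, c in counts.items() if c >= min_count}
        result.update(current)
        size += 1
    return result


transactions = [{"picture", "quality"}, {"picture", "quality", "battery"},
                {"battery", "life"}, {"battery"}, {"zoom"}]
# A high threshold is used here only because the toy data set is tiny.
freq = frequent_itemsets(transactions, min_support=0.4)
for itemset, count in sorted(freq.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

With the default `min_support=0.01` the function applies the 1% threshold described above.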

Pruning:

  1. Compactness pruning: checks the features that contain at least two words (called feature phrases) and removes those that are likely to be meaningless.

  2. Redundancy pruning: targets redundant features that consist of a single word. Given a minimum $p$-support (set to 3 in the paper), a feature is considered redundant and pruned if its $p$-support is lower than the minimum and it is a subset of another feature phrase (which suggests that the feature alone may not be interesting). The $p$-support of a feature $ftr$ is the number of sentences in which $ftr$ appears as a noun or noun phrase and which contain no feature phrase that is a superset of $ftr$.
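The $p$-support definition and the redundancy rule can be written out as a short sketch. It simplifies by treating a sentence as a set of nouns and by testing containment with set inclusion, ignoring word order and adjacency. The function names and the toy data are illustrative.

```python
from typing import FrozenSet, List, Set


def p_support(ftr: FrozenSet[str], sentences: List[Set[str]],
              features: List[FrozenSet[str]]) -> int:
    """Count sentences that contain ftr but no feature phrase that is a superset of ftr."""
    supersets = [f for f in features if ftr < f]   # proper supersets only
    return sum(1 for s in sentences
               if ftr <= s and not any(f <= s for f in supersets))


def redundancy_prune(features: List[FrozenSet[str]], sentences: List[Set[str]],
                     min_p_support: int = 3) -> List[FrozenSet[str]]:
    """Drop single-word features that have low p-support and a superset phrase."""
    kept = []
    for f in features:
        has_superset = any(f < g for g in features)
        if (len(f) == 1 and has_superset
                and p_support(f, sentences, features) < min_p_support):
            continue
        kept.append(f)
    return kept


features = [frozenset({"life"}), frozenset({"battery", "life"}), frozenset({"battery"})]
sentences = ([{"battery", "life"}] * 4) + [{"life"}] + ([{"battery"}] * 3)

print(p_support(frozenset({"life"}), sentences, features))     # 1
print(p_support(frozenset({"battery"}), sentences, features))  # 3
print(redundancy_prune(features, sentences))
```

Here "life" is pruned because it occurs on its own in only one sentence, while "battery" survives with a $p$-support of 3.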

3.3 Opinion Words Extraction

Opinion words: words that express subjective opinions. The paper considers only adjectives as opinion words, and extracts them only from sentences that contain one or more product features.

Opinion sentence: a sentence that contains one or more product features and one or more opinion words.
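A minimal sketch of this extraction step, assuming the sentences are already POS-tagged with Penn Treebank-style tags and that a feature "occurs" in a sentence when all of its words do. The function name and the sample sentences are illustrative.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]


def extract_opinion_words(tagged_sentences: List[Tagged],
                          features: List[str]) -> Tuple[Set[str], List[int]]:
    """Return (opinion words, indices of opinion sentences)."""
    opinion_words: Set[str] = set()
    opinion_sentences: List[int] = []
    for idx, tagged in enumerate(tagged_sentences):
        words = {w.lower() for w, _ in tagged}
        # Only sentences that mention at least one product feature are considered.
        if not any(set(f.split()) <= words for f in features):
            continue
        adjectives = [w.lower() for w, tag in tagged if tag.startswith("JJ")]
        if adjectives:   # feature + opinion word => opinion sentence
            opinion_words.update(adjectives)
            opinion_sentences.append(idx)
    return opinion_words, opinion_sentences


tagged_sentences = [
    [("The", "DT"), ("picture", "NN"), ("quality", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("The", "DT"), ("weather", "NN"), ("was", "VBD"), ("horrible", "JJ")],   # no feature
    [("The", "DT"), ("battery", "NN"), ("is", "VBZ"), ("included", "VBN")],   # no adjective
]
words, indices = extract_opinion_words(tagged_sentences, ["picture quality", "battery"])
print(words, indices)  # {'amazing'} [0]
```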


3.4 Orientation Identification for Opinion Words

Semantic orientation of an opinion word: the direction in which the word deviates from the norm for its semantic group. The paper considers only two orientations, positive and negative.

WordNet does not record the semantic orientation of words, but it organizes adjectives into bipolar clusters. The paper uses WordNet's adjective synonym sets and antonym sets to bootstrap the orientation labels of adjectives.

In general, adjectives share the same orientation as their synonyms and the opposite orientation to their antonyms.

Labelling adjective orientations: (1) manually label a set of adjectives as the seed list (30 adjectives in the paper); (2) grow the list by searching WordNet.

If the synonyms and antonyms of an adjective have conflicting known orientations, the first orientation found is used for that adjective.
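The bootstrapping idea can be sketched as a loop that keeps growing the labelled set until nothing changes. To stay self-contained, two small hand-written dictionaries (`synonyms`, `antonyms`) stand in for WordNet lookups; they, the seed words, and the function name are assumptions of this example. Synonyms are checked before antonyms, so the first orientation found wins.

```python
from typing import Dict, List

FLIP = {"positive": "negative", "negative": "positive"}


def predict_orientations(adjectives: List[str], seeds: Dict[str, str],
                         synonyms: Dict[str, List[str]],
                         antonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Grow the seed list through synonym/antonym links until it stops changing."""
    known = dict(seeds)
    changed = True
    while changed:
        changed = False
        for w in adjectives:
            if w in known:
                continue
            for s in synonyms.get(w, []):
                if s in known:                 # same orientation as a known synonym
                    known[w] = known[s]
                    changed = True
                    break
            else:                              # no known synonym: try antonyms
                for a in antonyms.get(w, []):
                    if a in known:             # opposite of a known antonym
                        known[w] = FLIP[known[a]]
                        changed = True
                        break
    return known


seeds = {"good": "positive", "bad": "negative"}
synonyms = {"great": ["good"], "superb": ["great"], "awful": ["terrible"]}
antonyms = {"terrible": ["good"]}
adjectives = ["superb", "great", "awful", "terrible", "odd"]

labels = predict_orientations(adjectives, seeds, synonyms, antonyms)
print(labels)
```

"superb" and "awful" are only resolved on the second pass, after "great" and "terrible" have joined the labelled set; "odd" has no links and stays unlabelled.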

3.5 Infrequent Features Identification

Association mining can only find frequent features, so the paper uses the opinion words to look for features that the frequent feature generation step cannot find.

For a sentence that contains opinion words but no frequent feature, the nearest noun or noun phrase (the one that the opinion word modifies) is taken as an infrequent feature. This may pick up nouns or noun phrases that are irrelevant to the product. However, infrequent features make up only a small share (15%-20%) of all features, and frequent features are more important than infrequent ones. The paper ranks features by their $p$-support, so wrongly extracted infrequent features are ranked very low and will not affect most users.
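The nearest-noun heuristic can be sketched as follows. The example measures distance in tokens and returns single noun tokens rather than full noun phrases, which is a simplification; the function name and sample sentences are illustrative.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]


def infrequent_features(tagged_sentences: List[Tagged],
                        frequent_features: List[str],
                        opinion_words: Set[str]) -> List[str]:
    """For sentences without a frequent feature, take the noun nearest to each opinion word."""
    found = []
    for tagged in tagged_sentences:
        words = [w.lower() for w, _ in tagged]
        if any(set(f.split()) <= set(words) for f in frequent_features):
            continue                                   # already covered by a frequent feature
        noun_idx = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("NN")]
        if not noun_idx:
            continue
        for i, w in enumerate(words):
            if w in opinion_words:
                nearest = min(noun_idx, key=lambda j: abs(j - i))
                found.append(words[nearest])
    return found


tagged_sentences = [
    [("The", "DT"), ("picture", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
    [("The", "DT"), ("software", "NN"), ("is", "VBZ"), ("amazing", "JJ")],
]
result = infrequent_features(tagged_sentences, ["picture"], {"amazing"})
print(result)  # ['software']
```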

3.6 Predicting the Orientations of Opinion Sentences

Orientation of an opinion sentence: the dominant orientation of the opinion words in the sentence is taken as the orientation of the sentence. When the sentence contains the same number of positive and negative opinion words, the orientation is predicted from the average orientation of the effective opinions, or from the orientation of the previous opinion sentence.

An effective opinion is the opinion word closest to a feature in an opinion sentence.

The SentenceOrientation() procedure distinguishes three cases:

  1. The user likes or dislikes most or all of the features in the sentence, and the opinion words are mostly positive or mostly negative;

  2. The user likes or dislikes most or all of the features in the sentence, but there is an equal number of positive and negative opinion words;

  3. All other cases.

In case 1, the dominant orientation is used. In case 2, the average orientation of the effective opinions of all features is used. In case 3, the orientation of the previous opinion sentence is used, that is, context information.

If an opinion sentence contains a but clause (a sub-sentence that starts with "but", "however", etc.), which signals a change of sentiment for the features in the clause, the effective opinion in the clause is used first to decide the orientation of those features. If the clause contains no opinion word, the opposite of the sentence's orientation is used.

If a negation word such as "no", "not" or "yet" appears close to an opinion word, with a word distance that does not exceed a threshold (5), the orientation of that opinion word is flipped.
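The three cases and the negation rule can be combined in a compact sketch. It assumes a ready-made orientation lexicon (+1 or -1 per opinion word), single-token features, and whitespace-tokenized lower-case sentences; the but-clause rule is left out for brevity. `LEXICON`, `WINDOW`, and the function names are illustrative.

```python
from typing import Dict, List, Set

NEGATIONS = {"no", "not", "yet"}
WINDOW = 5  # maximum word distance between a negation word and an opinion word
LEXICON: Dict[str, int] = {"great": 1, "good": 1, "bad": -1, "small": -1}


def word_orientation(idx: int, tokens: List[str], lexicon: Dict[str, int]) -> int:
    """Orientation of the opinion word at idx, flipped if a negation word is nearby."""
    score = lexicon[tokens[idx]]
    lo, hi = max(0, idx - WINDOW), min(len(tokens), idx + WINDOW + 1)
    if any(tokens[j] in NEGATIONS for j in range(lo, hi) if j != idx):
        score = -score
    return score


def sentence_orientation(tokens: List[str], features: Set[str],
                         lexicon: Dict[str, int], previous: int = 0) -> int:
    """Return +1, -1, or the previous sentence's orientation as a fallback."""
    op_idx = [i for i, t in enumerate(tokens) if t in lexicon]
    total = sum(word_orientation(i, tokens, lexicon) for i in op_idx)   # case 1
    if total == 0 and op_idx:                                           # case 2
        for f_i in (i for i, t in enumerate(tokens) if t in features):
            nearest = min(op_idx, key=lambda j: abs(j - f_i))           # effective opinion
            total += word_orientation(nearest, tokens, lexicon)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return previous                                                     # case 3


print(sentence_orientation("the picture is great".split(), {"picture"}, LEXICON))      # 1
print(sentence_orientation("the picture is not great".split(), {"picture"}, LEXICON))  # -1
print(sentence_orientation("the zoom is good but the bag is small".split(), {"zoom"}, LEXICON))  # 1
print(sentence_orientation("it works".split(), {"zoom"}, LEXICON, previous=-1))        # -1
```

In the third call the positive and negative words cancel out, so the opinion word closest to the feature "zoom" decides the result.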

3.7 Summary Generation

Steps for generating the feature-based summary:

  • For each discovered feature, put the related opinion sentences into positive and negative categories according to their orientations, and compute a count of how many reviews give positive/negative opinions on the feature;

  • Rank all features by how often they appear in the reviews.
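The two steps above amount to a group-and-count followed by a sort. In the sketch below the input is a list of (feature, orientation, sentence) triples, and the ranking uses the number of opinion sentences per feature as a stand-in for the frequency of the feature in the reviews. Both the input shape and that proxy are assumptions of this example.

```python
from typing import Dict, List, Tuple


def summarize(opinion_sentences: List[Tuple[str, str, str]]
              ) -> List[Tuple[str, Dict[str, List[str]]]]:
    """Group opinion sentences by feature and orientation, then rank features."""
    summary: Dict[str, Dict[str, List[str]]] = {}
    for feature, orientation, text in opinion_sentences:
        buckets = summary.setdefault(feature, {"positive": [], "negative": []})
        buckets[orientation].append(text)
    # Most frequently discussed features first.
    return sorted(summary.items(),
                  key=lambda kv: len(kv[1]["positive"]) + len(kv[1]["negative"]),
                  reverse=True)


data = [
    ("battery", "positive", "Battery lasts all day."),
    ("picture", "positive", "The pictures are amazing."),
    ("picture", "negative", "Pictures come out hazy indoors."),
    ("picture", "positive", "Great picture quality."),
]
ranked = summarize(data)
for feature, buckets in ranked:
    print(f"Feature: {feature}")
    print(f"  Positive: {len(buckets['positive'])}")
    print(f"  Negative: {len(buckets['negative'])}")
```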


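4 Experimental Evaluation

System: FBS (Feature-Based Summarization).

Part-of-speech tagging tool: NLProcessor.

Table 1: precision and recall of feature identification. Compactness pruning (columns 5-6) and $p$-support pruning (columns 7-8) both have little effect on recall while effectively raising precision [that is, they reduce FP without hurting TP]. Infrequent feature identification (columns 9-10) raises recall markedly but lowers precision [that is, it reduces FN at the cost of more FP]. Because infrequent features are ranked rather low, they do not affect the results significantly.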


$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$
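As a small worked example of the two formulas, the sketch below compares a set of extracted features against a manually labelled set. Both sets are made-up illustrations, not data from the paper.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of extracted features against a gold set."""
    tp = len(extracted & gold)   # correctly extracted features
    fp = len(extracted - gold)   # extracted but not in the gold set
    fn = len(gold - extracted)   # gold features that were missed
    precision = tp / (tp + fp) if extracted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall


extracted = {"picture", "battery", "life", "zoom"}
gold = {"picture", "battery", "zoom", "flash", "size"}
precision, recall = precision_recall(extracted, gold)
print(precision, recall)  # TP=3, FP=1, FN=2
```

Here precision is 3/4 and recall is 3/5: pruning "life" would remove the one false positive, while finding "flash" and "size" would remove the two false negatives.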

Table 2: precision and recall of FASTR.

Table 3:
Limitations:

  1. Pronoun resolution: the paper does not resolve what pronouns refer to;

  2. Only adjectives are used as indicators of the opinion orientation of a sentence, although in practice verbs and nouns can express it as well;

  3. The strength of opinions is not considered.

5 Conclusions

6 Acknowledgements