

Paper title: Mining and Summarizing Customer Reviews

Authors:

Published in: Tenth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2004)

Citations as of 2017/2/17 20:20: 3763

(The information above comes from Baidu Scholar)

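The paper uses text mining on online customer reviews of a product to discover the product's features and the customers' sentiment toward each of them, and then builds a feature-based summary of the product.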

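The main steps are:

1. Collect the reviews and preprocess the text

2. Extract the product features from the reviews via association mining

3. Determine the sentiment of each sentence in the reviews

4. Count the reviews of each sentiment class for each feature to form the feature-based summary of the product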


Each step is described in detail below:

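1. Collect the reviews and preprocess the text

Collecting the reviews is a crawling task and is not covered here. Text processing mainly consists of tokenization and part-of-speech (POS) tagging. The paper uses the NLProcessor linguistic parser (http://www.infogistics.com/textanalysis.html) to tokenize and POS-tag the reviews, e.g.: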

<S> <NG><W C='PRP' L='SS' T='w' S='Y'> I </W> </NG>
<VG> <W C='VBP'> am </W><W C='RB'> absolutely
</W></VG> <W C='IN'> in </W> <NG> <W C='NN'> awe
</W> </NG> <W C='IN'> of </W> <NG> <W C='DT'> this
</W> <W C='NN'> camera </W></NG><W C='.'> .
</W></S>
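To make the tagged output easier to work with, here is a minimal Python sketch that reads the sample above with the standard library and pulls out the (word, tag) pairs. It assumes, judging only from this sample, that the `C` attribute of each `<W>` element holds the POS tag; the helper names `tagged_words` and `nouns` are mine, not part of NLProcessor.

```python
import xml.etree.ElementTree as ET

# The tagged sentence shown above, copied verbatim.
SAMPLE = """<S> <NG><W C='PRP' L='SS' T='w' S='Y'> I </W> </NG>
<VG> <W C='VBP'> am </W><W C='RB'> absolutely
</W></VG> <W C='IN'> in </W> <NG> <W C='NN'> awe
</W> </NG> <W C='IN'> of </W> <NG> <W C='DT'> this
</W> <W C='NN'> camera </W></NG><W C='.'> .
</W></S>"""


def tagged_words(xml_sentence):
    """Return (word, tag) pairs in document order.

    Assumption: the C attribute of <W> is the POS tag.
    """
    root = ET.fromstring(xml_sentence)
    return [((w.text or "").strip(), w.get("C")) for w in root.iter("W")]


def nouns(tagged):
    """Keep only the words whose tag starts with NN (NN, NNS, ...)."""
    return [word for word, tag in tagged if tag and tag.startswith("NN")]


pairs = tagged_words(SAMPLE)
print(pairs)
print(nouns(pairs))
```

Note that "awe" comes out as a noun here even though it is not a product feature, which is why the next step filters candidates by frequency.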


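2. Extract the product features from the reviews via association mining

The paper only considers features that appear explicitly as nouns or noun phrases in the reviews. Take "The pictures are very clear" and "While light, it will not easily fit in pockets": the first sentence explicitly names picture as a feature, while the second only implies the size feature. Implicit features such as the one in the second sentence are outside the paper's scope. The paper therefore takes the nouns and noun phrases that occur frequently as frequent features. The method is association mining (Liu, B., Hsu, W., Ma, Y. 1998. Integrating Classification and Association Rule Mining. KDD’98, 1998), which is based on the Apriori algorithm (Agrawal, R. & Srikant, R. 1994. Fast algorithm for mining association rules. VLDB’94, 1994).

The authors rely on the following assumption: "It is common that a customer review contains many things that are not directly related to product features. Different customers usually have different stories. However, when they comment on product features, the words that they use converge. Thus using association mining to find frequent itemsets is appropriate because those frequent itemsets are likely to be product features." In other words, reviewers choose similar words when they talk about features, so frequent itemsets tend to correspond to features.

Because the procedure is this simple, some of the extracted features may be of poor quality, so the authors additionally apply compactness pruning and redundancy pruning. Compactness pruning looks at multi-word candidates and removes those whose words do not appear together in a specific order in the review sentences (Hu, M., and Liu, B. 2004. Mining Opinion Features in Customer Reviews. To appear in AAAI’04, 2004). Redundancy pruning relies on the p-support of a feature ("p-support of feature ftr is the number of sentences that ftr appears in as a noun or noun phrase", counting only sentences that contain no feature phrase that is a superset of ftr): a word or phrase whose p-support is below the minimum p-support value (set to 3 in the paper) and that is a subset of another feature phrase is removed.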
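The frequent-itemset idea can be illustrated with a small level-wise (Apriori-style) sketch in Python. This is only my illustration, not the authors' implementation: each sentence is treated as a transaction holding its nouns, and the function name `frequent_itemsets`, the `min_support` threshold and the `max_size` cap are all assumptions of this sketch. Pruning is not included.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search for itemsets whose support fraction >= min_support.

    transactions: iterable of sets of nouns, one set per sentence.
    Returns a dict mapping frozenset(itemset) -> sentence count.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # Level 1: count single nouns.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c / n >= min_support}
    result = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        items = sorted({i for s in current for i in s})
        # Apriori property: a candidate is kept only if every subset
        # one item smaller was itself frequent at the previous level.
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(sub) in current
                   for sub in combinations(c, size - 1))
        ]
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c / n >= min_support}
        result.update(current)
    return result


sentences = [
    {"picture", "quality"},
    {"picture"},
    {"battery", "life"},
    {"battery", "life"},
    {"picture", "quality"},
]
for itemset, count in frequent_itemsets(sentences, min_support=0.4).items():
    print(sorted(itemset), count)
```

With this toy input, "picture quality" and "battery life" survive as two-word candidates, while a pair such as "picture" and "battery" never co-occurs and is dropped.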

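That covers the extraction of frequent features. Infrequent features are extracted as follows: when a sentence contains an opinion word (defined in the next part) but no frequent feature, the noun or noun phrase nearest to the opinion word is taken as an infrequent feature. Infrequent features may be only loosely related to the product and can therefore introduce errors, but the authors argue that they account for a small share of all features and should still be included for completeness.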
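The infrequent-feature rule can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: a sentence is a list of (word, tag) pairs, "nearest" is measured in token positions, only single-word nouns are considered, and the function names are hypothetical.

```python
def nearest_noun(tagged, opinion_index):
    """Return the noun closest (in token distance) to the opinion word."""
    best = None
    for i, (_, tag) in enumerate(tagged):
        if tag.startswith("NN"):
            if best is None or abs(i - opinion_index) < abs(best - opinion_index):
                best = i
    return None if best is None else tagged[best][0]


def infrequent_features(tagged, opinion_words, frequent_features):
    """Apply the rule: opinion word present, no frequent feature present."""
    words = [w.lower() for w, _ in tagged]
    if any(w in frequent_features for w in words):
        return []  # the sentence already mentions a frequent feature
    found = []
    for i, w in enumerate(words):
        if w in opinion_words:
            noun = nearest_noun(tagged, i)
            if noun is not None:
                found.append(noun.lower())
    return found


sentence = [("The", "DT"), ("software", "NN"), ("is", "VBZ"), ("amazing", "JJ")]
print(infrequent_features(sentence, {"amazing"}, {"picture"}))
```

Here "software" is picked up only because the sentence contains the opinion word "amazing" and none of the frequent features.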


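3. Determine the sentiment of each sentence in the reviews

A sentence's sentiment is determined by the sentiment values of its words: if the values sum to a positive number the sentence is positive, and if they sum to a negative number it is negative. The first task is therefore to decide the sentiment value of each word, which is done mainly through the adjectives in the reviews. The authors first collect the adjectives in the reviews into an adjective list, then take 30 common adjectives with known sentiment values as a seed list, and use WordNet to determine the sentiment values of the adjectives in the adjective list step by step. Each adjective that has been resolved is added to the seed list, so that eventually all the words have a sentiment value. When the values in a sentence sum to 0, the authors "predict the orientation using the average orientation of effective opinions or the orientation of the previous opinion sentence (recall that effective opinion is the closest opinion word for a feature in an opinion sentence)". That is, the sentence takes either the orientation of its effective opinions or the orientation of the previous sentence. The paper does not describe effective opinions in more detail.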
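Here is a small Python sketch of the seed-list expansion and the sentence-level sum. To keep it self-contained I replace WordNet with two toy dictionaries of synonyms and antonyms, so the lookup tables, the function names, and the simplified tie-break (fall back directly to the previous sentence's orientation, skipping the effective-opinion average) are all assumptions of this sketch rather than the paper's procedure.

```python
def expand_orientations(adjectives, seeds, synonyms, antonyms):
    """Grow the seed list until no more adjectives can be resolved.

    seeds: dict word -> +1 / -1.
    synonyms / antonyms: toy stand-ins for WordNet lookups (hypothetical).
    """
    known = dict(seeds)
    changed = True
    while changed:
        changed = False
        for adj in adjectives:
            if adj in known:
                continue
            for syn in synonyms.get(adj, ()):
                if syn in known:
                    known[adj] = known[syn]  # same orientation as a synonym
                    changed = True
                    break
            else:
                for ant in antonyms.get(adj, ()):
                    if ant in known:
                        known[adj] = -known[ant]  # opposite of an antonym
                        changed = True
                        break
    return known


def sentence_orientation(words, known, previous=0):
    """Sum the word orientations; on a tie, reuse the previous sentence's."""
    total = sum(known.get(w, 0) for w in words)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return previous


seeds = {"good": 1, "bad": -1}
synonyms = {"great": ["good"], "superb": ["great"]}
antonyms = {"poor": ["good"]}
known = expand_orientations(["superb", "great", "poor"], seeds, synonyms, antonyms)
print(known)
print(sentence_orientation(["superb", "picture"], known))
```

"superb" is only resolved on the second pass, after "great" has joined the seed list, which is the step-by-step growth described above.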


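4. Count the reviews of each sentiment class for each feature to form the feature-based summary of the product

For this step, an example of the final summary format is enough: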

Feature: picture
Positive: 12
• Overall this is a good camera with a really good
picture clarity.
• The pictures are absolutely amazing - the camera
captures the minutest of details.
• After nearly 800 pictures I have found that this camera
takes incredible pictures.

Negative: 2
• The pictures come out hazy if your hands shake even
for a moment during the entire process of taking a
picture.
• Focusing on a display rack about 20 feet away in a
brightly lit room during day time, pictures produced by
this camera were blurry and in a shade of orange.
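A summary in the format above could be assembled roughly as follows. This is only a sketch: it assumes the earlier steps yield (feature, orientation, sentence) triples with orientation +1 or -1, and the function names are my own.

```python
from collections import defaultdict


def build_summary(opinions):
    """Group opinion sentences by feature and by sentiment class."""
    grouped = defaultdict(lambda: {"Positive": [], "Negative": []})
    for feature, orientation, sentence in opinions:
        if orientation > 0:
            grouped[feature]["Positive"].append(sentence)
        elif orientation < 0:
            grouped[feature]["Negative"].append(sentence)
        # sentences with orientation 0 are left out of the summary
    return grouped


def format_summary(grouped):
    """Render the grouped sentences in the Feature/Positive/Negative layout."""
    lines = []
    for feature, by_label in grouped.items():
        lines.append(f"Feature: {feature}")
        for label in ("Positive", "Negative"):
            sentences = by_label[label]
            lines.append(f"{label}: {len(sentences)}")
            lines.extend(f"• {s}" for s in sentences)
    return "\n".join(lines)


opinions = [
    ("picture", 1, "The pictures are absolutely amazing."),
    ("picture", -1, "The pictures come out hazy if your hands shake."),
    ("picture", 1, "This camera takes incredible pictures."),
]
print(format_summary(build_summary(opinions)))
```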