Lucene
Full-Text Search
Categories of Data
- Structured data: data with a fixed format and a bounded length (for example rows in Oracle or MySQL); it is usually queried with SQL.
- Unstructured data: data with no fixed format and no bounded length (for example plain files).
Use Cases
- Search engines: Baidu, Google
- Site search: Taobao, JD.com
Lucene
Lucene is a toolkit for building full-text search: an open-source API maintained under the Apache Software Foundation.
Using Lucene
Download
Unzip lucene-8.0.0.zip.
We only need the core jar under \lucene-8.0.0\core\
and an analyzer jar under \lucene-8.0.0\analysis\.
There are many analyzers to choose from; here we use the standard analyzer jar under \lucene-8.0.0\analysis\common.
Lucene's Storage Structure
Lucene stores data as Document objects, one Document per record; each attribute value of the record goes into a Field.
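A minimal sketch of this model (the field names and values below are only illustrative):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
//One Document per record; each attribute of the record becomes a Field
Document doc = new Document();
//Indexed as a single term, not tokenized, stored
doc.add(new StringField("id", "1001", Store.YES));
//Tokenized, indexed, stored
doc.add(new TextField("title", "Lucene in Action", Store.YES));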
Creating the Reader and Writer Objects
import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
/**
* Create an index writer
* @param indexPath path of the index directory
* @param create true to overwrite the index, false to create or append
* @return
* @throws Exception
*/
public static IndexWriter getIndexWriter(String indexPath,boolean create) throws Exception {
//Point at the index directory
FSDirectory fsDirectory = FSDirectory.open((new File(indexPath)).toPath());
//Use the standard analyzer
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
if (create){
//Create a new index, overwriting any existing index
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
}else {
//Create the index if it does not exist, otherwise append to it
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
}
//Create the writer
IndexWriter writer = new IndexWriter(fsDirectory, iwc);
return writer;
}
/**
* Commit and close the index writer
* @throws Exception
*/
public static void closeIndexWriter(IndexWriter iw) throws Exception {
if(iw != null && iw.isOpen()) {
iw.commit();
iw.close();
}
}
/**
* Create an index searcher
* @param indexPath
* @return
* @throws Exception
*/
public static IndexSearcher getIndexSearch(String indexPath) throws Exception {
//1. Point at the index directory
Directory directory = FSDirectory.open((new File(indexPath)).toPath());
//2. Create the index reader
IndexReader ir = DirectoryReader.open(directory);
//3. Create the searcher on top of the reader
IndexSearcher is = new IndexSearcher(ir);
return is;
}
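A minimal usage sketch that ties the three helpers together (the index path below is a placeholder; replace it with a real directory on your machine):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
//Open the writer in overwrite mode, add one document, then commit and close
IndexWriter writer = getIndexWriter("E:\\dsgTemp\\indexRepoDemo", true);
Document doc = new Document();
doc.add(new TextField("content", "hello lucene", Store.YES));
writer.addDocument(doc);
closeIndexWriter(writer);
//Search for the term the StandardAnalyzer produced at index time ("hello")
IndexSearcher searcher = getIndexSearch("E:\\dsgTemp\\indexRepoDemo");
TopDocs hits = searcher.search(new TermQuery(new Term("content", "hello")), 10);
System.out.println(hits.totalHits);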
Field Types
All of Lucene's field classes extend org.apache.lucene.document.Field (which implements the IndexableField interface).
Field class | Supported data type | Analyzed (tokenized; supports full-text matching) | Indexed (supports exact lookup) | Stored (original value kept for display) | Notes |
---|---|---|---|---|---|
StringField | String | N | Y | Y or N | Indexes the whole string as a single term without analysis, e.g. order numbers, names, ID-card numbers, phone numbers. Whether the value is stored in the document is controlled by Store.YES / Store.NO. |
TextField | String or Reader | Y | Y | Y or N | If a Reader is supplied, Lucene assumes the content is large and treats it as unstored, e.g. web page content, file content. |
StoredField | Any type | N | N | Y | Used purely to store a value. |
LongPoint | long | N | Y | N | Not stored, but supports range/point queries; to also store the value, add a StoredField with the same name. |
IntPoint | int | N | Y | N | Not stored, but supports range/point queries; to also store the value, add a StoredField with the same name. |
FloatPoint | float | N | Y | N | Not stored, but supports range/point queries; to also store the value, add a StoredField with the same name. |
DoublePoint | double | N | Y | N | Not stored, but supports range/point queries; to also store the value, add a StoredField with the same name. |
BinaryPoint | byte[] | N | Y | N | Not stored, but supports range/point queries; to also store the value, add a StoredField with the same name. |
NumericDocValuesField | long | N | N | N | Not stored and not searchable; used only for sorting (int and long). |
FloatDocValuesField | float | N | N | N | Not stored and not searchable; used only for sorting (float). |
DoubleDocValuesField | double | N | N | N | Not stored and not searchable; used only for sorting (double). |
BinaryDocValuesField | BytesRef built from a byte[] | N | N | N | Not stored and not searchable; used only for sorting. |
SortedDocValuesField | BytesRef built from a byte[] | N | N | N | Not stored and not searchable; used only for sorting. |
Storing, Range Queries, and Sorting
The range queries below include their bounds (the first example is shown in full; code that repeats in the later examples is omitted).
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.document.DoubleDocValuesField;
import org.apache.lucene.document.FloatDocValuesField;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.BinaryPoint;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.search.Query;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
//1. Range search and sorting on a field indexed with IntPoint
//Indexing (the three fields below serve different purposes, must share the same name, and must be added to the same document)
//Index for range queries
document.add(new IntPoint("age", 30));
//Store the original value
document.add(new StoredField("age",30));
//Doc values for sorting
document.add(new NumericDocValuesField("age", 30));
//Query
//Sort field and comparator type; true = descending, false = ascending
SortField sortField = new SortField("age", SortField.Type.INT,true);
Sort sort = new Sort(sortField);
//Build the range query
Query query = IntPoint.newRangeQuery("age",20,50);
//Run the search: query, maximum number of hits, sort
TopDocs td = indexSearch.search(query, 12,sort);
//2. Range search and sorting on a LongPoint field, typically used for timestamps or prices
//Indexing
document.add(new LongPoint("time", 30));
document.add(new StoredField("time",30));
document.add(new NumericDocValuesField("time", 30));
//Query
SortField sortField = new SortField("time", SortField.Type.LONG,true);
Query query = LongPoint.newRangeQuery("time",10,20);
//3. Range search and sorting on a DoublePoint field
//Indexing
document.add(new DoublePoint("price", 7.28));
document.add(new StoredField("price",7.28));
document.add(new DoubleDocValuesField("price", 7.28));
//Query
SortField sortField = new SortField("price", SortField.Type.DOUBLE,true);
Query query = DoublePoint.newRangeQuery("price", 5.2, 10.5);
//4. Range search and sorting on a FloatPoint field
//Indexing
document.add(new FloatPoint("price", 7.28));
document.add(new StoredField("price",7.28));
document.add(new FloatDocValuesField("price", 7.28));
//Query
SortField sortField = new SortField("price", SortField.Type.FLOAT,true);
Query query = FloatPoint.newRangeQuery("price", 5.2f, 10.5f);
//5. BinaryPoint fields; sorting byte arrays is rarely meaningful, this type is mostly used for binary data such as file contents
//Indexing
byte[] arr = {1,2,3,4,5,6};
document.add(new BinaryPoint("view", arr));
document.add(new StoredField("view", arr));
document.add(new BinaryDocValuesField("view", new BytesRef(arr)));
//Query
SortField sortField = new SortField("view", SortField.Type.STRING_VAL,true);
byte[] arr1 = {1,2,3,4,5,6};
byte[] arr2 = {4,5,7,3,5,6};
Query query = BinaryPoint.newRangeQuery("view", arr1, arr2);
//6. Sorting on a StringField
//Indexing
document.add(new StringField("time", "2020-11-22 23:10:00", Store.YES));
document.add(new SortedDocValuesField("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8"))));
//Query (the range query must use the same doc-values type as the SortedDocValuesField added above)
SortField sortField = new SortField("time", SortField.Type.STRING,true);
Query query = SortedDocValuesField.newSlowRangeQuery("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8")), new BytesRef("2020-11-23 23:10:00".getBytes("utf-8")), true, true);
//7. Sorting on a TextField; unlike StringField, the value is tokenized
//Indexing
document.add(new TextField("time", "2020-11-22 23:10:00", Store.YES));
document.add(new SortedDocValuesField("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8"))));
//Query (as above, match the doc-values type of the SortedDocValuesField)
SortField sortField = new SortField("time", SortField.Type.STRING,true);
Query query = SortedDocValuesField.newSlowRangeQuery("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8")), new BytesRef("2020-11-23 23:10:00".getBytes("utf-8")), true, true);
Complete Example
//Create the index
public static void testIndexIntFieldStored() throws Exception {
IndexWriter iw = null;
//Generate sample data
List<Map<String,Object>> demoMaps = new ArrayList<Map<String,Object>>();
Random random = new Random();
for(int i = 0; i < 10;i++) {
Map<String,Object> map = new HashMap<String,Object>();
map.put("name", "张三" + i);
map.put("age", random.nextInt((i+1) * 6));
map.put("id", UUID.randomUUID().toString());
demoMaps.add(map);
}
//Wrap each record in a Document and collect the documents in a list
List<Document> list = new ArrayList<Document>();
for(Map<String,Object> map : demoMaps) {
Document document = new Document();
String name = map.get("name").toString();
int age = Integer.parseInt(map.get("age").toString());
String id = map.get("id").toString();
//name: searchable and stored, not sortable
document.add(new TextField("name", name, Store.YES));
//age: searchable, stored, and sortable
document.add(new IntPoint("age", age));
document.add(new StoredField("age", age));
document.add(new NumericDocValuesField("age", age));
//id: searchable, stored, and sortable
document.add(new TextField("id", id,Store.YES));
document.add(new SortedDocValuesField("id", new BytesRef(id.getBytes("utf-8"))));
list.add(document);
}
try {
iw = getIndexWriter("E:\\dsgTemp\\indexRepo2",false);
iw.addDocuments(list);
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}finally {
try {
iw.commit();
iw.close();
} catch (Exception e2) {
// TODO: handle exception
}
}
}
//Query the index
public static void testIntFieldSort() {
try {
IndexSearcher indexSearch = getIndexSearch("E:\\dsgTemp\\indexRepo2");
//Sort fields: the first entry is the primary sort, the second entry is the secondary sort
SortField[] sortFieldArr = new SortField[2];
sortFieldArr[0] = new SortField("age", SortField.Type.INT,true);
sortFieldArr[1] = new SortField("id", SortField.Type.STRING,false);
Sort sort = new Sort(sortFieldArr);
//Range query on the age field
Query query = IntPoint.newRangeQuery("age", 10, 60);
TopDocs search = indexSearch.search(query, 10,sort);
System.out.println("总记录数:" + search.totalHits);
ScoreDoc[] scoreDocs = search.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs) {
Document doc = indexSearch.doc(scoreDoc.doc);
String name = doc.get("name");
String age = doc.get("age");
String id = doc.get("id");
System.out.println("姓名:" + name + ";年龄:" + age + ";ID:" + id );
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Other Queries
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermQuery;
//Match all documents
Query query = new MatchAllDocsQuery();
//Term query: matches a single indexed term exactly; no analyzer is applied, so the term must match what the analyzer produced at index time
Query query = new TermQuery(new Term("name","张三"));
//Parse a query string with an explicitly chosen analyzer
QueryParser queryParser = new QueryParser("content", new StandardAnalyzer());
Query query = queryParser.parse("Hello world");
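A small sketch of the TermQuery caveat above, reusing the name field and the indexSearch object from the earlier examples: with the StandardAnalyzer, Chinese text is indexed one character per term, so a TermQuery has to target a single indexed term rather than the original phrase.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
//"张三" is indexed by the StandardAnalyzer as the two terms "张" and "三",
//so a TermQuery for the whole phrase "张三" finds nothing; query a single term instead
Query singleTerm = new TermQuery(new Term("name", "张"));
TopDocs hits = indexSearch.search(singleTerm, 10);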
Multi-Clause Queries
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
//Approach 1: build a BooleanQuery by hand
QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
Query query1 = queryParser.parse("苹果");
Query query2 = DoublePoint.newRangeQuery("price", 5.2, 10.5);
//MUST: the clause must match (like AND)
//FILTER: the clause must match, like MUST, but it does not contribute to the score
//SHOULD: the clause may match (like OR)
//MUST_NOT: the clause must not match (NOT)
BooleanClause bc1 = new BooleanClause(query1, Occur.MUST);
BooleanClause bc2 = new BooleanClause(query2, Occur.MUST);
Query query = new BooleanQuery.Builder().add(bc1).add(bc2).build();
//Approach 2: use MultiFieldQueryParser
String[] stringQuery = {"5.3", "苹果" };
String[] fields = { "price", "name" };
Occur[] occ = { Occur.MUST, Occur.MUST };
Query query = MultiFieldQueryParser.parse(stringQuery, fields, occ, new StandardAnalyzer());
Query Syntax
#Range query, bounds included
price:[5.2 TO 10.5]
#Range query, upper bound 10.5 excluded
price:[5.2 TO 10.5}
#Match all documents
*:*
#Analyzed (tokenized) query
name:张三
#Multi-clause query: + = MUST, no prefix = SHOULD, - = MUST_NOT, # = FILTER
+(name:苹 name:果) +price:[5.2 TO 10.5]
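A minimal sketch of feeding such strings to the classic QueryParser (the field names are the ones used above):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
//Default field is "name"; the StandardAnalyzer is applied to the query text
QueryParser parser = new QueryParser("name", new StandardAnalyzer());
//Match-all syntax
Query all = parser.parse("*:*");
//MUST / SHOULD prefixes; note that a range clause such as price:[5.2 TO 10.5] also parses,
//but the classic parser builds a term range query that will not match a DoublePoint field,
//so point ranges are better built with DoublePoint.newRangeQuery as shown earlier
Query must = parser.parse("+(name:苹 name:果)");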
Adding, Deleting, and Updating the Index
/**
* Delete all documents from the index
*/
public void deleteAllIndex(String indexPath) {
IndexWriter iw = null;
try {
iw = getIndexWriter(indexPath,false);
iw.deleteAll();
closeIndexWriter(iw);
} catch (Exception e) {
// TODO: handle exception
}
}
/**
* Add a document to the index
*/
public void addIndex(String indexPath) {
IndexWriter iw = null;
try {
iw = getIndexWriter(indexPath,false);
Document doc = new Document();
doc.add(new StringField("name", "schools", Store.YES));
doc.add(new TextField("content", "jiang an xiao xue, nan tong xiao xue,cheng zhong xiao xue", Store.YES));
iw.addDocument(doc);
closeIndexWriter(iw);
} catch (Exception e) {
// TODO: handle exception
}
}
/**
* Update a document (internally implemented as delete-then-add)
*/
public void updateIndex(String indexPath) {
IndexWriter iw = null;
try {
iw = getIndexWriter(indexPath,false);
Term term = new Term("name", "Spring.txt");
Document doc = new Document();
doc.add(new StringField("name", "spring",Store.YES));
doc.add(new StringField("size", "50",Store.YES));
iw.updateDocument(term, doc);
closeIndexWriter(iw);
} catch (Exception e) {
// TODO: handle exception
}
}
/**
* Delete the documents that match a term
*/
public void deleteIndexByQuery(String indexPath) {
IndexWriter iw = null;
try {
iw = getIndexWriter(indexPath,false);
iw.deleteDocuments(new Term("name","Spring.txt"));
closeIndexWriter(iw);
} catch (Exception e) {
// TODO: handle exception
}
}
Analyzers
Analyzers that ship with Lucene (a tokenization sketch follows the list):
- StandardAnalyzer: the standard analyzer; it splits Chinese text one character at a time, so its Chinese support is weak. Found under
\lucene-8.0.0\analysis\common
- CJKAnalyzer: the Chinese/Japanese/Korean analyzer; somewhat better than the standard analyzer for CJK text. Found under
\lucene-8.0.0\analysis\common
- SmartChineseAnalyzer: a Chinese analyzer, better again than CJKAnalyzer for Chinese, but its English support is poor and it tends to drop letters. Found under
\lucene-8.0.0\analysis\smartcn
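A minimal sketch that prints the tokens an analyzer produces, so the three analyzers can be compared; it assumes the smartcn jar is on the classpath:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
//Swap in StandardAnalyzer or CJKAnalyzer here to compare the output
Analyzer analyzer = new SmartChineseAnalyzer();
TokenStream ts = analyzer.tokenStream("content", "Lucene是实现全文检索的工具包");
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
//Print one token per line
System.out.println(term.toString());
}
ts.end();
ts.close();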
Lucene Summary
Lucene query utility class
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanQuery.Builder;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.FSDirectory;
/**Lucene index query utility class
* @author lenovo
*
*/
public class SearchUtil {
/**Build an IndexSearcher over every index directory under a parent directory
* @param parentPath
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherByParentPath(String parentPath,ExecutorService service) throws IOException{
MultiReader reader = null;
//Open a reader for every sub-directory and merge them with a MultiReader
try {
File[] files = new File(parentPath).listFiles();
IndexReader[] readers = new IndexReader[files.length];
for (int i = 0 ; i < files.length ; i ++) {
readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
}
reader = new MultiReader(readers);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return new IndexSearcher(reader,service);
}
/**Get a DirectoryReader for an index path
* @param indexPath
* @return
* @throws IOException
*/
public static DirectoryReader getIndexReader(String indexPath) throws IOException{
return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
}
/**Get an IndexSearcher for an index path
* @param indexPath
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherByIndexPath(String indexPath,ExecutorService service) throws IOException{
IndexReader reader = getIndexReader(indexPath);
return new IndexSearcher(reader,service);
}
/**If the index directory may have changed, use this method to obtain a fresh IndexSearcher; reopening this way uses fewer resources than opening from scratch
* @param oldSearcher
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher,ExecutorService service) throws IOException{
DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
//openIfChanged returns null when the index has not changed; keep the old searcher in that case
if(newReader == null){
return oldSearcher;
}
return new IndexSearcher(newReader, service);
}
/**Combine multiple queries like SQL IN (any clause may match)
* @param querys
* @return
*/
public static Query getMultiQueryLikeSqlIn(Query ... querys){
Builder builder = new BooleanQuery.Builder();
for (Query subQuery : querys) {
builder.add(subQuery,Occur.SHOULD);
}
return builder.build();
}
/**Combine multiple queries like SQL AND (every clause must match)
* @param querys
* @return
*/
public static Query getMultiQueryLikeSqlAnd(Query ... querys){
Builder builder = new BooleanQuery.Builder();
for (Query subQuery : querys) {
builder.add(subQuery,Occur.MUST);
}
return builder.build();
}
/**Get the full stored document for a docID from an IndexSearcher
* @param searcher
* @param docID
* @return
* @throws IOException
*/
public static Document getDefaultFullDocument(IndexSearcher searcher,int docID) throws IOException{
return searcher.doc(docID);
}
/**Get the document for a docID with only the listed stored fields loaded
* @param searcher
* @param docID
* @param listField
* @return
* @throws IOException
*/
public static Document getDocumentByListField(IndexSearcher searcher,int docID,Set<String> listField) throws IOException{
return searcher.doc(docID, listField);
}
/**Paged query
* @param page current page number (starting at 1)
* @param perPage number of hits per page
* @param searcher the searcher
* @param query the query
* @return
* @throws IOException
*/
public static TopDocs getScoreDocsByPerPage(int page,int perPage,IndexSearcher searcher,Query query) throws IOException{
TopDocs result = null;
if(query == null){
System.out.println(" Query is null return null ");
return null;
}
ScoreDoc before = null;
if(page != 1){
TopDocs docsBefore = searcher.search(query, (page-1)*perPage);
ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
if(scoreDocs.length > 0){
before = scoreDocs[scoreDocs.length - 1];
}
}
result = searcher.searchAfter(before, query, perPage);
return result;
}
public static TopDocs getScoreDocs(IndexSearcher searcher,Query query) throws IOException{
TopDocs docs = searcher.search(query, getMaxDocId(searcher));
return docs;
}
/**Count the documents in the index; this returns maxDoc, which also counts deleted documents, and is roughly equivalent to counting a MatchAllDocsQuery
* @param searcher
* @return
*/
public static int getMaxDocId(IndexSearcher searcher){
return searcher.getIndexReader().maxDoc();
}
}
Test
package cn.com.trueway;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
public class TestSearch {
public static void main(String[] args) {
ExecutorService service = Executors.newCachedThreadPool();
try {
IndexSearcher searcher = SearchUtil.getIndexSearcherByIndexPath("E:\\dsgTemp\\indexRepo2",service);
System.out.println(SearchUtil.getMaxDocId(searcher));
Query query = new MatchAllDocsQuery();
//Paged query
TopDocs docs = SearchUtil.getScoreDocsByPerPage(1, 20, searcher, query);
ScoreDoc[] scoreDocs = docs.scoreDocs;
System.out.println("所有的数据总数为:"+docs.totalHits);
System.out.println("本页查询到的总数为:"+scoreDocs.length);
for (ScoreDoc scoreDoc : scoreDocs) {
Document doc = SearchUtil.getDefaultFullDocument(searcher, scoreDoc.doc);
System.out.println(doc);
}
System.out.println("\n\n");
//Load only the selected stored fields
TopDocs docsAll = SearchUtil.getScoreDocs(searcher, query);
Set<String> fieldSet = new HashSet<String>();
fieldSet.add("name");
fieldSet.add("age");
for (int i = 0 ; i < Math.min(5, docsAll.scoreDocs.length) ; i ++) {
Document doc = SearchUtil.getDocumentByListField(searcher, docsAll.scoreDocs[i].doc,fieldSet);
System.out.println(doc);
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally{
service.shutdownNow();
}
}
}