
Full-Text Search

Data Classification

  • Structured data: data with a fixed format and a bounded length (for example, rows in Oracle or MySQL); it is normally queried with SQL.
  • Unstructured data: data with no fixed format and no bounded length (for example, plain files).

Application Scenarios

  • Search engines: Baidu, Google
  • Site search: Taobao, JD.com

Lucene

Lucene is a toolkit for building full-text search. It is an open-source Apache project that provides an API for implementing full-text indexing and search.

Using Lucene

Download

Download Lucene from the official Lucene website.

Unzip lucene-8.0.0.zip.

We only need the core jar under \lucene-8.0.0\core\ and an analyzer jar from \lucene-8.0.0\analysis\. There are many analyzers; here we use the standard analyzer jar from \lucene-8.0.0\analysis\common.

Lucene's Storage Structure

Lucene stores data in units of Documents; each attribute value of an object is stored in a Field of that Document.
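
As a minimal sketch of this mapping (the field names id and title are made up for illustration), each attribute of a record becomes one Field added to a Document, and the Document is what the index writer later receives:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
	//one Document per record, one Field per attribute (field names are illustrative)
	Document document = new Document();
	//exact-value field, stored so it can be shown in results
	document.add(new StringField("id", "1001", Store.YES));
	//tokenized field for full-text search
	document.add(new TextField("title", "Lucene full-text search", Store.YES));
	//an IndexWriter (see the next section) would then receive it via writer.addDocument(document)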

Creating the Reader and Writer Objects

import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
	/**
	 * Create an index writer
	 * @param indexPath path of the index directory
	 * @param create true to create/overwrite the index, false to create or append
	 * @return
	 * @throws Exception 
	 */
	public static IndexWriter getIndexWriter(String indexPath,boolean create) throws Exception {
		//point at the index directory
		FSDirectory fsDirectory = FSDirectory.open((new File(indexPath)).toPath());
		//use the standard analyzer
		Analyzer analyzer = new StandardAnalyzer();
		IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
	    if (create){
	    	//always create a new index, overwriting any existing one
	        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
	    }else {
	    	//create the index if it does not exist, otherwise append to it
	        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
	    }
	    //create the writer
	    IndexWriter writer = new IndexWriter(fsDirectory, iwc);
	    return writer;
	}

	/**
	 * Commit and close the index writer
	 * @throws Exception
	 */
	public static void closeIndexWriter(IndexWriter iw) throws Exception {
		if(iw != null && iw.isOpen()) {
			iw.commit();
			iw.close();
		}
	}

	/**
	 * Create an index searcher
	 * @param indexPath
	 * @return
	 * @throws Exception
	 */
	public static IndexSearcher getIndexSearch(String indexPath) throws Exception {
		//1. point at the index directory
		Directory directory = FSDirectory.open((new File(indexPath)).toPath());
		//2. open an index reader
		IndexReader ir = DirectoryReader.open(directory);
		//3. wrap the reader in an IndexSearcher
		IndexSearcher is = new IndexSearcher(ir);
		return is;
	}

Field Properties

All of Lucene's field types are subclasses of the org.apache.lucene.document.Field class. They differ in whether the value is analyzed, indexed, and stored.

For each field type below: the supported data type, whether the value is Analyzed (tokenized, so it can be matched by full-text search), Indexed (so it can be matched by exact or range search), and Stored (the original value is kept so it can be displayed), followed by notes.

  • StringField (String): Analyzed N, Indexed Y, Stored Y or N. The whole string is indexed as a single, unanalyzed token; whether it is stored is chosen with Store.YES or Store.NO. Suited to exact values such as order numbers, ID-card numbers, or phone numbers.
  • TextField (String or Reader): Analyzed Y, Indexed Y, Stored Y or N. A tokenized full-text field. If the value is a Reader, Lucene assumes the content is large and does not store it. Suited to web-page or file content.
  • StoredField (any type): Analyzed N, Indexed N, Stored Y. Used purely to store the original value.
  • LongPoint (long): Analyzed N, Indexed Y, Stored N. Indexed for range queries but not stored; add a same-named StoredField if the value must be retrievable.
  • IntPoint (int): same as LongPoint, for int values.
  • FloatPoint (float): same as LongPoint, for float values.
  • DoublePoint (double): same as LongPoint, for double values.
  • BinaryPoint (byte[]): same as LongPoint, for byte arrays.
  • NumericDocValuesField (long): Analyzed N, Indexed N, Stored N. Doc values only, used for sorting (int and long values).
  • FloatDocValuesField (float): doc values only, used for sorting float values.
  • DoubleDocValuesField (double): doc values only, used for sorting double values.
  • BinaryDocValuesField (BytesRef wrapping a byte[]): doc values only, used for sorting.
  • SortedDocValuesField (BytesRef wrapping a byte[]): doc values only, used for sorting.

Storage, Range Queries, and Sorting

Range queries include both boundaries. The first example is shown in full; code that repeats is abbreviated in the later ones.

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.BinaryPoint;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleDocValuesField;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FloatDocValuesField;
import org.apache.lucene.document.FloatPoint;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;
	//1. Sorted range search on a field indexed with IntPoint
	//(the snippets below assume a Document named document and an IndexSearcher named indexSearch, created as shown earlier)
		//storage: the three same-named fields below serve different purposes and must all be added to the same Document
		//index the value for range queries
		document.add(new IntPoint("age", 30));
		//store the original value so it can be returned in results
		document.add(new StoredField("age",30));
		//doc values for sorting
		document.add(new NumericDocValuesField("age", 30));
		//query
		//sort field and type; true = descending, false = ascending
		SortField sortField = new SortField("age", SortField.Type.INT,true);
		Sort sort = new Sort(sortField);
		//range query, bounds inclusive
		Query query = IntPoint.newRangeQuery("age",20,50);
		//run the search: query, maximum number of hits, sort
		TopDocs td = indexSearch.search(query, 12,sort);
	//2. Sorted range search on a field indexed with LongPoint, typically used for timestamps or prices
		//storage
		document.add(new LongPoint("time", 30));
		document.add(new StoredField("time",30));
		document.add(new NumericDocValuesField("time", 30));
		//query
		SortField sortField = new SortField("time", SortField.Type.LONG,true);
		Query query = LongPoint.newRangeQuery("time",10,20);
	//3. Sorted range search on a field indexed with DoublePoint
		//storage
		document.add(new DoublePoint("price", 7.28));
		document.add(new StoredField("price",7.28));
		document.add(new DoubleDocValuesField("price", 7.28));
		//query
		SortField sortField = new SortField("price", SortField.Type.DOUBLE,true);
		Query query = DoublePoint.newRangeQuery("price", 5.2, 10.5);
	//4. Sorted range search on a field indexed with FloatPoint
		//storage
		document.add(new FloatPoint("price", 7.28));
		document.add(new StoredField("price",7.28));
		document.add(new FloatDocValuesField("price", 7.28));
		//query
		SortField sortField = new SortField("price", SortField.Type.FLOAT,true);
		Query query = FloatPoint.newRangeQuery("price", 5.2f, 10.5f);	
	//5. Sorting on a field indexed with BinaryPoint; sorting byte arrays is rarely meaningful, and this type is usually used to store file contents
		//storage
		byte[] arr = {1,2,3,4,5,6};
		document.add(new BinaryPoint("view", arr));
		document.add(new StoredField("view", arr));
		document.add(new BinaryDocValuesField("view", new BytesRef(arr)));	
		//query
		SortField sortField = new SortField("view", SortField.Type.STRING_VAL,true);
		byte[] arr1 = {1,2,3,4,5,6};
		byte[] arr2 = {4,5,7,3,5,6};
		Query query = BinaryPoint.newRangeQuery("view", arr1, arr2);	
	//6. Sorting on a field indexed with StringField
		//storage
		document.add(new StringField("time", "2020-11-22 23:10:00", Store.YES));
		document.add(new SortedDocValuesField("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8"))));
		//query
		SortField sortField = new SortField("time", SortField.Type.STRING,true);
		Query query = SortedSetDocValuesField.newSlowRangeQuery("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8")), new BytesRef("2020-11-23 23:10:00".getBytes("utf-8")), true, true);
	//7. Sorting on a field indexed with TextField; unlike StringField, the value is tokenized
		//storage
		document.add(new TextField("time", "2020-11-22 23:10:00", Store.YES));
		document.add(new SortedDocValuesField("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8"))));
		//query
		SortField sortField = new SortField("time", SortField.Type.STRING,true);
		Query query = SortedSetDocValuesField.newSlowRangeQuery("time", new BytesRef("2020-11-22 23:10:00".getBytes("utf-8")), new BytesRef("2020-11-23 23:10:00".getBytes("utf-8")), true, true);
Complete Example
	//create the index
	public static void testIndexIntFieldStored() throws Exception {
		IndexWriter iw = null;
		//mock data
		List<Map<String,Object>> demoMaps = new ArrayList<Map<String,Object>>();
		Random random = new Random();
		for(int i = 0; i < 10;i++) {
			Map<String,Object> map = new HashMap<String,Object>();
			map.put("name", "张三" + i);
			map.put("age", random.nextInt((i+1) * 6));
			map.put("id", UUID.randomUUID().toString());
			demoMaps.add(map);
		}
		//wrap each record in a Document
		List<Document> list = new ArrayList<Document>();
		for(Map<String,Object> map : demoMaps) {
			Document document = new Document();
			String name = map.get("name").toString();
			int age = Integer.parseInt(map.get("age").toString());
			String id = map.get("id").toString();
			//name: stored and full-text searchable, not sortable
			document.add(new TextField("name", name, Store.YES));
			//age: range-searchable, stored, and sortable
			document.add(new IntPoint("age", age));
			document.add(new StoredField("age", age));
			document.add(new NumericDocValuesField("age", age));
			//id: stored, searchable, and sortable
			document.add(new TextField("id", id,Store.YES));
			document.add(new SortedDocValuesField("id", new BytesRef(id.getBytes("utf-8"))));
			
			list.add(document);
		}
		try {
			iw = getIndexWriter("E:\\dsgTemp\\indexRepo2",false);
			iw.addDocuments(list);
		} catch (Exception e) {
			e.printStackTrace();
		}finally {
			try {
				//null-safe commit and close via the helper defined above
				closeIndexWriter(iw);
			} catch (Exception e2) {
				e2.printStackTrace();
			}
		}
	}

	//query the index
	public static void testIntFieldSort() {
		try {
			IndexSearcher indexSearch = getIndexSearch("E:\\dsgTemp\\indexRepo2");
			//sort fields: the first entry is the primary sort key, the second is the secondary sort key
			SortField[] sortFieldArr = new SortField[2];
			sortFieldArr[0] = new SortField("age", SortField.Type.INT,true);
			sortFieldArr[1] = new SortField("id", SortField.Type.STRING,false);
			Sort sort = new Sort(sortFieldArr);
			//range query on the age field
			Query query = IntPoint.newRangeQuery("age", 10, 60);
			TopDocs search = indexSearch.search(query, 10,sort);
			System.out.println("Total hits: " + search.totalHits);
			ScoreDoc[] scoreDocs = search.scoreDocs;
			
			for(ScoreDoc scoreDoc : scoreDocs) {
				Document doc = indexSearch.doc(scoreDoc.doc);
				String name = doc.get("name");
				String age = doc.get("age");
				String id = doc.get("id");
				System.out.println("姓名:" + name + ";年龄:" + age + ";ID:" + id );
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

Other Queries

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
	//match all documents
	Query query = new MatchAllDocsQuery();
	//term query: matches a single indexed term exactly; the term must equal a token produced by the analyzer at index time
	Query query = new TermQuery(new Term("name","张三"));
	//parse a query string with an explicitly chosen analyzer
	QueryParser queryParser = new QueryParser("content", new StandardAnalyzer());
	Query query = queryParser.parse("Hello world");

Combining Multiple Conditions

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

	//Option 1: combine Query objects with a BooleanQuery
	QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
	Query query1 = queryParser.parse("苹果");
	Query query2 = DoublePoint.newRangeQuery("price", 5.2, 10.5);
	//MUST: the clause must match (like AND)
	//FILTER: the clause must match, but is not scored
	//SHOULD: the clause may match (like OR)
	//MUST_NOT: the clause must not match
	BooleanClause bc1 = new BooleanClause(query1, Occur.MUST);
	BooleanClause bc2 = new BooleanClause(query2, Occur.MUST);
	Query query = new BooleanQuery.Builder().add(bc1).add(bc2).build();

	//Option 2: MultiFieldQueryParser parses one query string per field
	String[] stringQuery = {"5.3", "苹果" };
	String[] fields = { "price", "name" };
	Occur[] occ = { Occur.MUST, Occur.MUST };
	Query query = MultiFieldQueryParser.parse(stringQuery, fields, occ, new StandardAnalyzer());

Query Syntax

	#range query, both bounds included
	price:[5.2 TO 10.5]
	#range query, 10.5 excluded
	price:[5.2 TO 10.5}
	#match all documents
	*:*
	#tokenized (full-text) query
	name:张三
	#combined query: + means MUST, no prefix means SHOULD, - means MUST_NOT, # means FILTER
	+(name:苹 name:果) +price:[5.2 TO 10.5]
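
As a minimal sketch of running such a query string, reusing the getIndexSearch helper and the index path from the earlier examples (both carried over from above as assumptions), the classic QueryParser turns the string into a Query. Note that the classic parser does not understand Point fields, so range syntax such as price:[5.2 TO 10.5] only matches term-indexed fields; build Point range queries programmatically as shown earlier.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
	//"name" is only the default field; the query string itself may reference other fields
	QueryParser queryParser = new QueryParser("name", new StandardAnalyzer());
	Query query = queryParser.parse("name:张三");
	//index path carried over from the examples above
	IndexSearcher indexSearch = getIndexSearch("E:\\dsgTemp\\indexRepo2");
	TopDocs td = indexSearch.search(query, 10);
	System.out.println("Total hits: " + td.totalHits);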

Adding, Updating, and Deleting Index Entries

	/**
	 * Delete every document in the index
	 */
	public void deleteAllIndex(String indexPath) {
		IndexWriter iw = null;
		try {
			iw = getIndexWriter(indexPath,false);
			iw.deleteAll();
			closeIndexWriter(iw);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Add a document to the index
	 */
	public void addIndex(String indexPath) {
		IndexWriter iw = null;
		try {
			iw = getIndexWriter(indexPath,false);
			Document doc = new Document();
			doc.add(new StringField("name", "schools", Store.YES));
			doc.add(new TextField("content", "jiang an xiao xue, nan tong xiao xue,cheng zhong xiao xue", Store.YES));
			
			iw.addDocument(doc);
			closeIndexWriter(iw);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Update a document (internally Lucene deletes the matching documents and adds the new one)
	 */
	public void updateIndex(String indexPath) {
		IndexWriter iw = null;
		try {
			iw = getIndexWriter(indexPath,false);
			Term term = new Term("name", "Spring.txt");
			Document doc = new Document();
			doc.add(new StringField("name", "spring",Store.YES));
			doc.add(new StringField("size", "50",Store.YES));
			iw.updateDocument(term, doc);
			closeIndexWriter(iw);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Delete the documents that match a term
	 */
	public void deleteIndexByQuery(String indexPath) {
		IndexWriter iw = null;
		try {
			iw = getIndexWriter(indexPath,false);
			iw.deleteDocuments(new Term("name","Spring.txt"));
			closeIndexWriter(iw);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

Analyzers

Analyzers that ship with Lucene:

  • StandardAnalyzer: the standard analyzer. It splits Chinese text one character at a time, so its Chinese support is poor. Jar location: \lucene-8.0.0\analysis\common
  • CJKAnalyzer: the Chinese/Japanese/Korean analyzer; noticeably better than the standard analyzer for Chinese. Jar location: \lucene-8.0.0\analysis\common
  • SmartChineseAnalyzer: a Chinese analyzer that is better again than CJKAnalyzer, but its English support is weak and it can drop letters. Jar location: \lucene-8.0.0\analysis\smartcn (a token-inspection sketch follows this list)
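
A minimal sketch for inspecting the tokens an analyzer produces (the field name content and the sample text are arbitrary); swapping CJKAnalyzer or SmartChineseAnalyzer in place of StandardAnalyzer shows the different segmentation, provided the corresponding jar is on the classpath:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
	//print every token the analyzer produces for the given text
	public static void printTokens(Analyzer analyzer, String text) throws IOException {
		TokenStream ts = analyzer.tokenStream("content", text);
		CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
		ts.reset();
		while (ts.incrementToken()) {
			System.out.println(term.toString());
		}
		ts.end();
		ts.close();
	}

	//StandardAnalyzer splits the Chinese part one character at a time
	printTokens(new StandardAnalyzer(), "jiang an xiao xue 江安小学");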

Lucene Wrap-Up

A Lucene Query Utility Class

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanQuery.Builder;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.FSDirectory;
 
/**Utility class for querying Lucene indexes
 * @author lenovo
 *
 */
public class SearchUtil {
	
	/**Get an IndexSearcher that spans every index directory under a parent directory
	 * @param parentPath
	 * @param service
	 * @return
	 * @throws IOException
	 */
	public static IndexSearcher getIndexSearcherByParentPath(String parentPath,ExecutorService service) throws IOException{
		//open a reader for each index directory under the parent and merge them into one view
		File[] files = new File(parentPath).listFiles();
		IndexReader[] readers = new IndexReader[files.length];
		for (int i = 0 ; i < files.length ; i ++) {
			readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
		}
		MultiReader reader = new MultiReader(readers);
		return new IndexSearcher(reader,service);
	}
	
	/**Get a DirectoryReader for an index path
	 * @param indexPath
	 * @return
	 * @throws IOException
	 */
	public static DirectoryReader getIndexReader(String indexPath) throws IOException{
		return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
	}
	
	/**Get an IndexSearcher for an index path
	 * @param indexPath
	 * @param service
	 * @return
	 * @throws IOException
	 */
	public static IndexSearcher getIndexSearcherByIndexPath(String indexPath,ExecutorService service) throws IOException{
		IndexReader reader = getIndexReader(indexPath);
		return new IndexSearcher(reader,service);
	}
	
	/**If the index directory may have changed, use this method to refresh the IndexSearcher; reopening reuses most of the old reader's resources
	 * @param oldSearcher
	 * @param service
	 * @return
	 * @throws IOException
	 */
	public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher,ExecutorService service) throws IOException{
		DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
		DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
		//openIfChanged returns null when nothing has changed; keep the old searcher in that case
		if(newReader == null){
			return oldSearcher;
		}
		return new IndexSearcher(newReader, service);
	}
	
	/**Combine several queries like SQL IN (any clause may match)
	 * @param querys
	 * @return
	 */
	public static Query getMultiQueryLikeSqlIn(Query ... querys){
		Builder builder = new BooleanQuery.Builder();
		for (Query subQuery : querys) {
			builder.add(subQuery,Occur.SHOULD);
		}
		return builder.build();
	}
	
	/**Combine several queries like SQL AND (every clause must match)
	 * @param querys
	 * @return
	 */
	public static Query getMultiQueryLikeSqlAnd(Query ... querys){
		Builder builder = new BooleanQuery.Builder();
		for (Query subQuery : querys) {
			builder.add(subQuery,Occur.MUST);
		}
		return builder.build();
	}
	
	/**Get the full stored document for a doc id
	 * @param searcher
	 * @param docID
	 * @return
	 * @throws IOException
	 */
	public static Document getDefaultFullDocument(IndexSearcher searcher,int docID) throws IOException{
		return searcher.doc(docID);
	}
	
	/**Get a document for a doc id, loading only the listed stored fields
	 * @param searcher
	 * @param docID
	 * @param listField
	 * @return
	 * @throws IOException
	 */
	public static Document getDocumentByListField(IndexSearcher searcher,int docID,Set<String> listField) throws IOException{
		return searcher.doc(docID, listField);
	}
	
	/**Paged query
	 * @param page current page number (1-based)
	 * @param perPage number of hits per page
	 * @param searcher the searcher
	 * @param query the query
	 * @return
	 * @throws IOException
	 */
	public static TopDocs getScoreDocsByPerPage(int page,int perPage,IndexSearcher searcher,Query query) throws IOException{
		TopDocs result = null;
		if(query == null){
			System.out.println(" Query is null return null ");
			return null;
		}
		ScoreDoc before = null;
		if(page != 1){
			TopDocs docsBefore = searcher.search(query, (page-1)*perPage);
			ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
			if(scoreDocs.length > 0){
				before = scoreDocs[scoreDocs.length - 1];
			}
		}
		result = searcher.searchAfter(before, query, perPage);
		return result;
	}
	
	public static TopDocs getScoreDocs(IndexSearcher searcher,Query query) throws IOException{
		TopDocs docs = searcher.search(query, getMaxDocId(searcher));
		return docs;
	}
	
	/**Returns maxDoc, i.e. one more than the largest document number (deleted documents included); used here as an upper bound on how many hits to fetch
	 * @param searcher
	 * @return
	 */
	public static int getMaxDocId(IndexSearcher searcher){
		return searcher.getIndexReader().maxDoc();
	}
	
}

Test

package cn.com.trueway;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TestSearch {
	
	public static void main(String[] args) {
		ExecutorService service = Executors.newCachedThreadPool();
		try {
			IndexSearcher searcher = SearchUtil.getIndexSearcherByIndexPath("E:\\dsgTemp\\indexRepo2",service);
			System.out.println(SearchUtil.getMaxDocId(searcher));
			Query query = new MatchAllDocsQuery();
			//paged query
			TopDocs docs = SearchUtil.getScoreDocsByPerPage(1, 20, searcher, query);
			ScoreDoc[] scoreDocs = docs.scoreDocs;
			System.out.println("Total hits: "+docs.totalHits);
			System.out.println("Hits on this page: "+scoreDocs.length);
			for (ScoreDoc scoreDoc : scoreDocs) {
				Document doc = SearchUtil.getDefaultFullDocument(searcher, scoreDoc.doc);
				System.out.println(doc);
			}
			System.out.println("\n\n");
			
			//load only selected stored fields
			TopDocs docsAll = SearchUtil.getScoreDocs(searcher, query);
			Set<String> fieldSet = new HashSet<String>();
			fieldSet.add("name");
			fieldSet.add("age");
			for (int i = 0 ; i < Math.min(5, docsAll.scoreDocs.length) ; i ++) {
				Document doc = SearchUtil.getDocumentByListField(searcher, docsAll.scoreDocs[i].doc,fieldSet);
				System.out.println(doc);
			}
			
		} catch (IOException e) {
			e.printStackTrace();
		}finally{
			service.shutdownNow();
		}
	}
}