How to cache Lucene search results?

I'm looking for a way to cache search results in Lucene.
When I used Solr, pagination was much easier.
My Solr code:
query.setStart(start);
query.setRows(rows);
QueryResponse response = solr.query(query);
Simple wildcard searching took about 400 ms for the first 100 results, and each subsequent page took about 20-70 ms.
But when I'm using Lucene I have to search again each time, and each page takes 400 ms.
My Lucene code:
Query query = queryParser.parse(text);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000000);
indexSearcher.search(query, collector);
TopDocs results = collector.topDocs(start, rows);
for (ScoreDoc scoreDoc : results.scoreDocs) {
    Document document = indexSearcher.doc(scoreDoc.doc);
    // ...
}
I tried making the TopScoreDocCollector and IndexSearcher static, but that didn't work.
Do you have any other solution?

I made the results static:
static TopDocs results;
results = indexSearcher.search(query, 100000);
public ArrayList meakeResult() throws IOException {
    ArrayList res = new ArrayList();
    ScoreDoc[] hits = results.scoreDocs;
    for (int i = start; i < start + rows; i++) {
        Document document = indexSearcher.doc(hits[i].doc);
        Answer tab = new Answer();
        tab.setAnswer(document.get("answer"));
        tab.setQuestion("question" + document.get("question"));
        tab.setProces("proces" + document.get("proces"));
        tab.setForm("form: " + document.get("form"));
        res.add(tab);
    }
    return res;
}
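A cheaper way to page without re-collecting a huge number of hits on every request might be IndexSearcher.searchAfter, which continues the search from the last hit of the previous page, so only one page of results is collected per call. This is only a minimal sketch, assuming a shared indexSearcher and the same parsed query; the PagedSearch class and nextPage method are purely illustrative:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class PagedSearch {

    // Last hit of the most recently returned page; keep this per user/session
    // so the next request can continue where the previous page stopped.
    private ScoreDoc lastHit = null;

    public List<Document> nextPage(IndexSearcher searcher, Query query, int pageSize)
            throws IOException {
        // Collect only the next pageSize hits after lastHit instead of the
        // top (start + rows) hits on every request.
        TopDocs page = (lastHit == null)
                ? searcher.search(query, pageSize)
                : searcher.searchAfter(lastHit, query, pageSize);

        List<Document> docs = new ArrayList<>();
        for (ScoreDoc scoreDoc : page.scoreDocs) {
            docs.add(searcher.doc(scoreDoc.doc));
        }
        if (page.scoreDocs.length > 0) {
            lastHit = page.scoreDocs[page.scoreDocs.length - 1];
        }
        return docs;
    }
}
This only moves forward page by page; for random access to an arbitrary page (like your start/rows version), caching the TopDocs as you do above is still the simpler option.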

Related

Azure Cosmos DB, select after row?

I'm trying to select some rows after x rows, something like:
SELECT * FROM collection WHERE ROWNUM >= 235 AND ROWNUM <= 250
Unfortunately it looks like ROWNUM isn't supported in Azure Cosmos DB.
Is there another way to do this? I've looked at using continuation tokens, but that isn't helpful if a user skips to page 50; would I need to keep querying with continuation tokens until I reach page 50?
I've tried playing around with the page size option, but that has some limitations in terms of how many items it can return at any one time.
For example, I have 1,000,000 records in Azure and I want to query rows 500,000 to 500,010. I can't do SELECT * FROM collection WHERE ROWNUM >= 500000 AND ROWNUM <= 500010, so how do I achieve this?
If you don't have any filters, you can't retrieve items in a specific range directly via a SQL query in Cosmos DB so far. So you need to use pagination to locate the items you want. As far as I know, pagination is supported based on continuation tokens only so far.
Please refer to the function below:
using JayGongDocumentDB.pojo;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
namespace JayGongDocumentDB.module
{
class QuerySample1
{
public static async void QueryPageByPage()
{
// Number of documents per page
const int PAGE_SIZE = 2;
int currentPageNumber = 1;
int documentNumber = 1;
// Continuation token for subsequent queries (NULL for the very first request/page)
string continuationToken = null;
do
{
Console.WriteLine($"----- PAGE {currentPageNumber} -----");
// Loads ALL documents for the current page
KeyValuePair<string, IEnumerable<Student>> currentPage = await QueryDocumentsByPage(currentPageNumber, PAGE_SIZE, continuationToken);
foreach (Student student in currentPage.Value)
{
Console.WriteLine($"[{documentNumber}] {student.Name}");
documentNumber++;
}
// Ensure the continuation token is kept for the next page query execution
continuationToken = currentPage.Key;
currentPageNumber++;
} while (continuationToken != null);
Console.WriteLine("\n--- END: Finished Querying ALL Dcuments ---");
}
public static async Task<KeyValuePair<string, IEnumerable<Student>>> QueryDocumentsByPage(int pageNumber, int pageSize, string continuationToken)
{
DocumentClient documentClient = new DocumentClient(new Uri("https://***.documents.azure.com:443/"), "***");
var feedOptions = new FeedOptions
{
MaxItemCount = pageSize,
EnableCrossPartitionQuery = true,
// IMPORTANT: Set the continuation token (NULL for the first ever request/page)
RequestContinuation = continuationToken
};
IQueryable<Student> filter = documentClient.CreateDocumentQuery<Student>("dbs/db/colls/item", feedOptions);
IDocumentQuery<Student> query = filter.AsDocumentQuery();
FeedResponse<Student> feedRespose = await query.ExecuteNextAsync<Student>();
List<Student> documents = new List<Student>();
foreach (Student t in feedRespose)
{
documents.Add(t);
}
// IMPORTANT: Ensure the continuation token is kept for the next requests
return new KeyValuePair<string, IEnumerable<Student>>(feedRespose.ResponseContinuation, documents);
}
}
}
Hope it helps you.
Update Answer:
There is no function like ROW_NUMBER() [How do I use ROW_NUMBER()?] in Cosmos DB so far. I also thought of SKIP and TOP; however, TOP is supported and SKIP is not yet (feedback). It seems SKIP is already in progress and will be released in the future.
I think you could submit feedback about the paging function, or just use the continuation token approach above as a temporary workaround.

How to fetch large SharePoint list items (>10000) and write them to a text file

My SharePoint site has a large list that contains a lot of data. I have to fetch all the items and show them in a GridView.
I am using the code below and getting the following error:
"The attempted operation is prohibited because it exceeds the list view threshold enforced by the administrator"
private void GetData()
{
using (SPSite site = new SPSite("URL"))
{
using (SPWeb web = site.OpenWeb())
{
SPList list = web.Lists.TryGetList("BulkData");
if (list != null)
{
SPQuery query = new SPQuery();
query.Query = "<Where><IsNotNull><FieldRef Name=\"Title\" /></IsNotNull></Where>";
query.QueryThrottleMode = SPQueryThrottleOption.Override;
SPListItemCollection items = list.GetItems(query);
int itemCount = items.Count;
StringBuilder sb = new StringBuilder();
string str1 = string.Empty;
foreach (SPListItem item in items)
{
int i = 1;
sb.Append("\r\n").Append(Convert.ToString(item["Title"])).Append("\r\n");
i++;
}
Log(sb);
}
}
}
}
#region Log File
public void Log(StringBuilder ErrorMessage)
{
string LogFileTime = DateTime.Now.ToString("ddMMyyyyHHmmss");
string LogFilePath = Server.MapPath(@"~\Logs\");
if (!File.Exists(LogFilePath + "BulkData" + LogFileTime + ".txt"))
{
var LogFileName = File.Create(LogFilePath + "BulkData" + LogFileTime + ".txt");
var WriteToLogFile = new StreamWriter(LogFileName);
WriteToLogFile.Write(ErrorMessage);
WriteToLogFile.Close();
LogFileName.Close();
}
}
#endregion
You have to modify the List View Threshold in Central Admin, set by default to 5,000 elements.
To minimize database contention, SQL Server often uses row-level locking as a strategy to ensure accurate updates without adversely impacting other users who are accessing other rows.
However, if a read or write database operation, such as a query, causes more than 5,000 rows to be locked at once, then it's more efficient for SQL Server to temporarily escalate the lock to the entire table until the database operation is completed.
See this link on MSDN
See this link for instructions step-by-step
However, if you must change this value in code, you can:
SPQuery q1 = new SPQuery();
q1.QueryThrottleMode = SPQueryThrottleOption.Override;
Caution! Remember to grant privileges to the account that will run the code.
See this link for details.
Try using SPQuery.RowLimit to specify the number of items to be fetched. (The MSDN link also has an example of loading a limited number of items across multiple pages.)
You may use two different approaches:
ContentIterator object - not available in SharePoint Foundation
SPWeb.ProcessBatchData with the Display method - available in SharePoint Foundation, but considerably more complex

Confused about Lucene StringField

I want to read some documents from an index that has already been created, and then put them into another index. But I cannot retrieve these documents in the other index.
The documents just have a StringField.
Can somebody help me?
The code:
public static void test() throws IOException{
IndexWriterConfig conf=new IndexWriterConfig(Version.LUCENE_43, new MapbarAnalyzer(TokenizerModle.COMMON));
conf.setOpenMode(OpenMode.CREATE);
conf.setMaxBufferedDocs(10000);
LogByteSizeMergePolicy policy=new LogByteSizeMergePolicy();
policy.setNoCFSRatio(1.0);
policy.setUseCompoundFile(true);
conf.setMergePolicy(policy);
Directory d=new RAMDirectory();
IndexWriter iw=new IndexWriter(d, conf);
Document doc=new Document();
doc.add(new StringField("type", "5B0", Store.YES));
iw.addDocument(doc);
iw.close();
IndexReader r=DirectoryReader.open(d);
IndexSearcher is=new IndexSearcher(r);
Query q=new TermQuery(new Term("type","5B0"));
TopDocs docs=is.search(q, 10);
System.out.println(docs.totalHits);
Directory d1=new RAMDirectory();
IndexWriter iw1=new IndexWriter(d1, conf);
int maxdoc=r.maxDoc();
for(int i=0;i<maxdoc;i++){
Document doc0=r.document(i);
iw1.addDocument(doc0);
}
iw1.close();
IndexReader r1=DirectoryReader.open(d1);
IndexSearcher is1=new IndexSearcher(r1);
Query q1=new TermQuery(new Term("type","5B0"));
TopDocs docs1=is1.search(q1, 10);
System.out.println(docs1.totalHits);
}
You can try to compare the differences between the two indexes/documents/queries.
It turns out doc0's field is set with a "tokenized" attribute.
Change the code like this:
for(int i=0;i<maxdoc;i++){
Document doc0=r.document(i);
Field f1 = (Field) doc0.getField("type");
f1.fieldType().setTokenized(false);
iw1.addDocument(doc0);
}
and you can get the result from the other index.
But I have no idea why the FieldType obtained from the IndexReader changed...
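As far as I can tell, documents read back with IndexReader.document() carry only the stored values, so the field comes back with a generic (tokenized) FieldType rather than the original StringField type, which would explain the surprise. An alternative sketch, assuming the same Lucene 4.3 API as in the question, is to rebuild the field explicitly instead of flipping the tokenized flag:
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

class CopyByRebuilding {
    // Rebuilds each document with an explicit StringField instead of re-adding
    // the Field instance that came back from the stored-fields reader.
    // (Ignores deleted documents for brevity.)
    static void copyTypeField(IndexReader reader, IndexWriter writer) throws IOException {
        for (int i = 0; i < reader.maxDoc(); i++) {
            Document stored = reader.document(i); // stored values only
            Document copy = new Document();
            copy.add(new StringField("type", stored.get("type"), Store.YES));
            writer.addDocument(copy);
        }
    }
}
Either way, the term type:5B0 ends up indexed verbatim in the second index, so the TermQuery can find it again.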

Lucene - SimpleAnalyzer - how to get matched word(s)?

I can't get the offset of the matched word, or the word itself, using the following code. Any help would be appreciated.
...
Analyzer analyzer = new SimpleAnalyzer();
MemoryIndex index = new MemoryIndex();
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
float score = index.search(parser.parse("+content:" + target));
if(score > 0.0f)
System.out.println("How to know matched word?");
Here is a whole in-memory index and search example. I have just written it for myself and it works perfectly. I understand that you need to store the index in memory, but the question is why you need MemoryIndex for that. Simply use RAMDirectory instead and your index will be stored in memory, so when you perform your search the index will be loaded from RAMDirectory (memory).
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
RAMDirectory directory = new RAMDirectory();
try {
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_OFFSETS));
indexWriter.addDocument(doc);
indexWriter.optimize();
indexWriter.close();
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", analyzer);
IndexSearcher searcher = new IndexSearcher(directory, true);
IndexReader reader = IndexReader.open(directory, true);
Query query = parser.parse(word);
TopScoreDocCollector collector = TopScoreDocCollector.create(10000, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
if (hits != null && hits.length > 0) {
for (ScoreDoc hit : hits) {
int docId = hit.doc;
Document hitDoc = searcher.doc(docId);
TermFreqVector termFreqVector = reader.getTermFreqVector(docId, "content");
TermPositionVector termPositionVector = (TermPositionVector) termFreqVector;
int termIndex = termFreqVector.indexOf(word);
TermVectorOffsetInfo[] termVectorOffsetInfos = termPositionVector.getOffsets(termIndex);
for (TermVectorOffsetInfo termVectorOffsetInfo : termVectorOffsetInfos) {
concordances.add(processor.processConcordance(hitDoc.get("content"), word, termVectorOffsetInfo.getStartOffset(), size));
}
}
}
analyzer.close();
searcher.close();
directory.close();
} catch (Exception e) {
e.printStackTrace();
}

Relevance the search result in Lucene

What I want is:
In the search method I will add an extra parameter, say a relevance parameter of type float, to set the cutoff relevance. So if the cutoff is 60%, I want only items with higher than 60% relevance.
Here is the current code of the search.
Say the search text is a, and in the Lucene file system I have the following descriptions:
1) abcdef
2) abc
3) abcd
For now it will fetch all three of the above documents; I want to fetch only those with higher than 60% relevance.
// for now I am not using relevanceparam anywhere in the method
public static string[] Search(string searchText,float relevanceparam)
{
//List of ID
List<string> searchResultID = new List<string>();
IndexSearcher searcher = new IndexSearcher(reader);
Term searchTerm = new Term("Text", searchText);
Query query = new TermQuery(searchTerm);
Hits hits = searcher.Search(query);
for (int i = 0; i < hits.Length(); i++)
{
float r = hits.Score(i);
Document doc = hits.Doc(i);
searchResultID.Add(doc.Get("ID"));
}
return searchResultID.ToArray();
}
Edit:
What if I set a boost on my query, say query.SetBoost(1.6); is this equivalent to 60 percent?
You can easily do this by ignoring those hits that score less than TopDocs.MaxScore * minRelativeRelevance, where minRelativeRelevance should be a value between 0 and 1.
I've modified your code to match the 3.0.3 release of Lucene.Net, and added a FieldSelector to your call to IndexSearcher.Doc to avoid loading non-required fields.
Calling Query.SetBoost(1.6) would only mean that the score calculated by that query is boosted by 60% (multiplied by 1.6). It may change the ordering of the results if other queries are involved (in a BooleanQuery, for example), but it won't change which results are returned.
public static String[] Search(IndexReader reader, String searchText,
Single minRelativeRelevance) {
var resultIds = new List<String>();
var searcher = new IndexSearcher(reader);
var searchTerm = new Term("Text", searchText);
var query = new TermQuery(searchTerm);
var hits = searcher.Search(query, 100);
var minScore = hits.MaxScore * minRelativeRelevance;
var fieldSelector = new MapFieldSelector("ID");
foreach (var hit in hits.ScoreDocs) {
if (hit.Score >= minScore) {
var document = searcher.Doc(hit.Doc, fieldSelector);
var hitId = document.Get("ID");
resultIds.Add(hitId);
}
}
return resultIds.ToArray();
}
