Optimize Lucene 4.6.1 for real-time search

I know there are some similar questions, but none of them addresses real-time searching.
My case is that I have a few million small text files. For indexing, I was able to achieve about 3 minutes per million files, which is fine. The problem is searching.
To get a real-time response, the search has to be as fast as possible. Currently it somehow takes about 8-10 seconds to return results. The query itself is fairly large, but that should not be the major reason.
When indexing, I used the following configuration:
public static LogMergePolicy optimizeIndex() {
    LogMergePolicy mergePolicy = new LogByteSizeMergePolicy();
    mergePolicy.setMergeFactor(2);
    mergePolicy.setMaxMergeDocs(50000);
    return mergePolicy;
}
and
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
config.setMergePolicy(optimizeIndex());
config.setUseCompoundFile(false);
config.setRAMBufferSizeMB(16);
config.setMaxBufferedDocs(50000);
So, any thoughts on what I should do to get real-time responses from search?

Related

NodeJS and Mongo live who's online

TL;DR
logging online users and reporting back a count (based on a mongo find)
We've got a SaaS app for schools and students, and as part of this I've been wanting a 'live' who's-online ticker.
Teachers from the schools will see the counter, and the students and parents will trigger it.
I've got a socket.io connection from the web app to a NodeJS app.
Where there is lots of traffic, the Node/Mongo servers can't handle it, and rather than throw more resources at it, I figured it's better to optimise the code - because I don't know what I'm doing :D
With each student page load:
Create a socket.io connection with the following object:
{
    'name': 'student or caregiver name',
    'studentID': 123456,
    'schoolID': 123,
    'role': 'student', // ( or 'mother' or 'father' )
    'page': window.location
}
In my Node script:
io.on('connection', function(client) {
    // if it's a student connection..
    if (client.handshake.query.studentID) {
        let student = client.handshake.query; // that student object
        student.reference = student.schoolID + student.studentID + student.role;
        student.online = new Date();
        student.offline = null;
        db.collection('students').updateOne(
            { "reference": student.reference },
            { $set: student },
            { upsert: true });
    }
    // IF STAFF::: just show count!
    if (client.handshake.query.staffID) {
        db.collection('students').find({ 'offline': null, 'schoolID': client.handshake.query.schoolID }).count(function(err, students_connected) {
            client.emit('online_users', students_connected);
        });
    }
    client.on('disconnect', function() {
        // then if the student leaves the page..
        if (client.handshake.query.studentID) {
            let student = client.handshake.query;
            let reference = student.schoolID + student.studentID + student.role;
            db.collection('students').updateMany({ "reference": reference }, { $set: { "offline": new Date().getTime() } })
                .catch(function(er) {});
        }
        // IF STAFF::: just show updated count!
        if (client.handshake.query.staffID) {
            db.collection('students').find({ 'offline': null, 'schoolID': client.handshake.query.schoolID }).count(function(err, students_connected) {
                client.emit('online_users', students_connected);
            });
        }
    });
});
What Mongo indexes would you add? Would you store online students differently (and in a different collection) from a 'page tracking' type deal like this?
(This logs the page and duration so I have another call later that pulls that - but that's not heavily used or causing the issue.)
If separately, then insert, then delete?
As for the emit() to staff users: how can I only emit to staff with the same schoolID as the students?
Thanks!
You have given a brief description of the issue but no diagnosis of why it is happening. Based on a few assumptions, I will try to answer your question.
First of all, you mentioned that you'd like suggestions on which indexes could help your cause. Based on what you have described, this is a write-heavy system, and indexes in principle will only slow the writes, because on every write the B-tree that backs the index has to be updated too. Reads, on the other hand, become much faster, especially in the case of a huge collection with a lot of data.
So an index can help you a lot if your collection has, let's say, 1 million documents. It lets you skim only the required data without even doing a scan on all the data, thanks to the B-tree.
An index should be created specifically based on the read calls you make.
For example:
{ "student_id" : "studentID", "student_fname" : "Fname" }
If the read call here is based on student_id, then create an index on that; and if multiple fields are involved (equality, sort, or range), then create a compound index on those fields, giving priority to the equality field first and the range and sort fields thereafter.
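For instance, a minimal sketch in the Node driver (the field names are taken from the illustrative document above, not from your actual schema):
// Hypothetical compound index: the equality field (student_id) first,
// then a field you sort on. Adjust to your real read patterns.
db.collection('students').createIndex(
    { "student_id": 1, "student_fname": 1 },
    { background: true }
);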
Now the second part of the question: what would be better in this scenario?
This is a subjective thing and I'm sure everyone will have a different approach to it. My solution is based on a few assumptions.
Assumption(s)
The system needs to cater to a specific feature where a student's online status is updated at some time interval, and that data is available for reads by parents, teachers, etc.
The sockets you are using: if they stay connected continuously, then that is that many concurrent connections to the server. Whether that is required or not, I don't know, but concurrent connections are heavy for the server, as you already know, and unless they are needed 100% of the time, try a mixed approach.
If it would be okay for you to disconnect for a while, or to hold a connection to the server only for a short interval, then please consider that. That basically means: disconnect from the server gracefully, reconnect, send data, and repeat.
Or just adopt a heartbeat system where your frontend app calls an API at a set interval and pings the server; based on that you can decide whether the student is online or not. A little time delay, yes, but easily scalable.
Please use Redis or any other in-memory data store for such frequent writes, especially when you don't need to persist the data for long.
For example, let's say we use a Redis list for every class/section of users and only update the timestamp (epoch) when the last heartbeat was received from the frontend.
In a class with 60 students, sort the students by student_id or something like that.
Create a list for that class.
For the student_id that comes first in the sorted student list, update the epoch like this:
LSET mylist 0 "1266126162661" //Epoch Time Stamp
0 is your first student and 59 is your 60th student; update it on every heartbeat, either via an API or via the same socket system you have. It depends on your use case.
When a read call is needed:
LRANGE classname/listname 0 59
Now you have the epochs of all users. Maintain the list of students either via the database or via another list where you can simply match the indexes to a specific student.
LSET studentList 0 "student_id" //Student id of the student or any other data, I am trying to explain the logic
On the frontend, when you have the epochs, take the latest epoch into account based on your use case. For example, let's say I want a student to count as online if the heartbeat was received within the last 5 minutes.
Current timestamp - stored timestamp: if less than 5 minutes (in seconds), then online; otherwise offline.
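To make that concrete, here is a minimal Node sketch of the heartbeat flow, assuming the ioredis package and the hypothetical list names used above (an illustration of the idea, not a drop-in implementation):
const Redis = require("ioredis");
const redis = new Redis(); // assumes a local Redis instance

// One-time setup per class: create the 60 slots, since LSET needs an existing index.
async function initClassList(classList, size = 60) {
    await redis.rpush(classList, ...Array(size).fill("0"));
}

// Called on every heartbeat: store the current epoch at the student's slot.
async function recordHeartbeat(classList, studentIndex) {
    await redis.lset(classList, studentIndex, Date.now().toString());
}

// Called when a teacher asks for the count: read all slots and compare epochs.
async function countOnline(classList, windowMs = 5 * 60 * 1000) {
    const epochs = await redis.lrange(classList, 0, -1);
    const now = Date.now();
    return epochs.filter(e => now - Number(e) < windowMs).length;
}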
This won't be a complete answer without discussing the problem some more, but figured I'd post some general suggestions.
First, we should figure out where the performance bottlenecks are. Is it a particular query? Is it too many simultaneous connections to MongoDB? Is it even just too much round trip time per query (if the two servers aren't within the same data center)? There's quite a bit to narrow down here. How many documents are in the collection? How much RAM does the MongoDB server have access to? This will give us an idea of whether you should be having scaling issues at this point. I can edit my answer later once we have more information about the problem.
Based on what we know currently, without making any model changes, you could consider indexing the reference field in order to make the upsert call faster (if that's the bottleneck). That could look something like:
db.collection('students').createIndex(
    { "reference": 1 },
    { background: true });
If the querying is the bottleneck, you could create an index like:
db.collection('students').createIndex(
    { "schoolID": 1 },
    { background: true });
I'm not confident (without knowing more about the data) that including offline in the index would help, because optimizing for "not null" can be tricky. Depending on the data, that may lead to storing the data differently (like you suggested).
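On the separate question of only emitting to staff with the same schoolID, one common socket.io approach is rooms: each staff socket joins a room named after its school, and counts are emitted to that room only. A rough sketch, assuming the handshake query fields from the question:
io.on('connection', function(client) {
    var query = client.handshake.query;

    // Staff sockets join a per-school room so they can be targeted later.
    if (query.staffID) {
        client.join('school_' + query.schoolID);
    }

    // When a student connects (or disconnects), notify only that school's staff.
    if (query.studentID) {
        db.collection('students')
            .find({ 'offline': null, 'schoolID': query.schoolID })
            .count(function(err, students_connected) {
                io.to('school_' + query.schoolID).emit('online_users', students_connected);
            });
    }
});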

Unable to delete large number of rows from Spanner

I have a 3-node Spanner instance, and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
    name STRING(MAX),
    ...,
    model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to setup a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version)
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)

database.run_in_transaction(delete_old_rows,
    model_version=104,
    timeout_secs=3600,
)
What's weird about this is that the timeout_secs parameter appears to be ignored, because I still get a 504 Deadline Exceeded error within a minute or two of executing the above code, despite the one-hour timeout.
Anyways, I'm not too sure what to try next, or whether or not I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure if that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance
The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction. It's the retry timeout for the transaction, so it is used to set the deadline after which the transaction will stop being retried.
We will update the docs for run_in_transaction to explain this better.
The root cause was that the total timeout for the streaming RPC calls was set too low in the client libraries: 120s for streaming APIs (e.g. ExecuteStreamingSQL, which is used by partitioned DML calls).
This has been fixed in the client library source code, changing them to a 60-minute timeout (which is the maximum), and will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect to your database. (I do not know how to set custom timeouts in Python, sorry.)
final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();
SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");
builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(UnaryCallSettings.Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });
builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);
builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);
Spanner spanner = builder.build().getService();
The first suggestion is to try gcloud instead:
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to pass a range of name as well, to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that each transaction touches a small number of rows; but first, you will need to know the domain of the name column values.
The last suggestion is to try to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours); however, that is more time to pull from the server, and more time and memory to plot. When looking at the larger trends, the small movements are noise, and I would be better off retrieving the same number of documents (720) but only every 3rd one, still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document where:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection that houses no more than weekly or monthly data. A new month/week means storing your data in a different collection. That way you won't be doing a full table scan and won't be collecting every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more suited to being a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
    {
        $bucketAuto: {
            groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
            buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
        }
    }
])
Each document in the result of the above query would look something like this:
{
    "_id": {
        "max": 3,
        "min": 1
    },
    "count": 2
}
If you had grouped by temperature, then each document would have the minimum and maximum temperature found in that sample.
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
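As a rough sketch of that idea in the Node driver (the collection name temperatures and the epoch field are placeholders; this assumes MongoDB 3.6+ for $expr and that readings land exactly on the 5-second grid):
// Keep only documents whose epoch lands on an n-th sample boundary (n = 3, i.e. every 15 s),
// then sort newest-first and cap the result at 720 documents.
var n = 3;
db.collection('temperatures').aggregate([
    { $match: { $expr: { $eq: [ { $mod: [ "$epoch", n * 5 ] }, 0 ] } } },
    { $sort: { epoch: -1 } },
    { $limit: 720 }
]).toArray(function(err, docs) {
    // docs now holds roughly 3 hours of readings at 15-second resolution
});
If the stored timestamps drift off the grid, bucketing on a truncated epoch (for example, $floor of epoch divided by the interval) is the safer variant of the same trick.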

How to perform massive data uploads to firebase firestore

I have about ~300MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to take the Node.JS route, but any solution in a language I know (Java, Python) will be welcomed.
Whenever I perform a batch set using the Node.JS firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6GB!), but it also tends to crash with errors that don't have a clear reason (up to page 4 of a Google search without a meaningful answer).
My code is frankly simple, this is it:
var collection = db.collection("items");
var batch = db.batch();

array.forEach(item => {
    var ref = collection.doc(item.id);
    batch.set(ref, item);
});

batch.commit().then((res) => {
    console.log("YAY", res);
});
I haven't found anywhere whether there is a limit on the number of writes in a limited span of time (I understand that doing 50-60k writes should be easy peasy with a backend the size of Firebase), and I also found that this can go up the RAM train and end up with 4-6GB of RAM allocated.
I can confirm that when the errors are thrown, or the RAM usage clogs my laptop, whichever happens first, I am still at less than 1-4% of my daily usage quotas, so that is not the issue.
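For what it's worth, Firestore batched writes have historically been capped at around 500 operations per commit, so one common workaround is to split the array into chunks and commit them one at a time, which also keeps memory bounded. A rough sketch along those lines (the chunk size and the strictly sequential awaits are choices, not requirements):
// Split the ~180k items into chunks and commit one batch per chunk,
// awaiting each commit so only one batch is in flight at a time.
const CHUNK_SIZE = 500;

async function uploadInChunks(db, array) {
    const collection = db.collection("items");
    for (let i = 0; i < array.length; i += CHUNK_SIZE) {
        const batch = db.batch();
        array.slice(i, i + CHUNK_SIZE).forEach(item => {
            batch.set(collection.doc(item.id), item);
        });
        await batch.commit();
        console.log("committed", Math.min(i + CHUNK_SIZE, array.length), "of", array.length);
    }
}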

Bugzilla search result is too long

Suppose I want to search for bugs reported in the last 2 years. The initial result page says "This result was limited to 500 bugs".
Apparently there are more than 500 bugs, so I click "See all search results for this query". This time, it shows 10000 bugs, but with a message saying "This list is too long for Bugzilla's little mind; the Next/Prev/First/Last buttons won't appear on individual bugs".
So my question is:
How do I know the exact number of bugs returned by my query? (It's unlikely to be exactly 10000.)
How do I view the entire set of search results? Currently it seems that if the search results exceed 10000, the results are truncated, and I didn't find any prev/next page buttons to navigate through the results.
You may not see all the bugs because of the configuration your administrator set on your bugzilla instance.
However, using the search function of the Bugzilla webservice you can retrieve the list of bugs. If the number of bugs returned by the query is capped, then iterate on the search query using a higher offset and limit. Here is some pseudocode:
offset = 0
limit = 5000
currentcount = ws.search(criterias, offset, limit).count
while currentcount == limit
{
    offset += limit
    currentcount = ws.search(criterias, offset, limit).count
}
totalbugs = currentcount + offset
The same algorithm would work if you also wanted to get the whole list of bugs instead of just the count.
If the idea of sending multiple queries to the webservice doesn't feel right, you may have to talk to the admin to find out what hard limits are set on your install of Bugzilla, and see how you can tweak them to get the results you need.
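For reference, a rough Node sketch of that paging loop against the REST endpoint (this assumes a Bugzilla 5.x instance exposing GET /rest/bug, Node 18+ for the global fetch, and placeholder base URL and search criteria):
// Count all bugs matching some criteria by paging through /rest/bug.
// BASE_URL and CRITERIA are placeholders for your own instance and search.
const BASE_URL = "https://bugzilla.example.com/rest/bug";
const CRITERIA = "product=MyProduct&creation_time=2022-01-01"; // placeholder search criteria

async function countBugs() {
    const limit = 5000;
    let offset = 0;
    let total = 0;
    while (true) {
        const res = await fetch(`${BASE_URL}?${CRITERIA}&limit=${limit}&offset=${offset}`);
        const { bugs } = await res.json();
        total += bugs.length;
        if (bugs.length < limit) break; // last (possibly partial) page
        offset += limit;
    }
    return total;
}
The same loop can collect the bug objects themselves instead of just counting them.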
