Sphinx search crashes when there are no indexes

I want searchd to start up, but no indexes have been populated yet. A separate cron job pulls data from a data source and then calls the indexer to generate the indexes.
So the first time searchd starts, the cron job has not yet run, hence there are no indexes, and searchd fails with errors like:
FATAL: no valid indexes to serve
Is there any way to get around this? E.g. to start searchd even when there are no indexes, so that if someone searches against it during that time, it just returns no docids. Later, when the cron job runs, the indexes will be populated and searchd can query those indexes.

if someone searched against it during that time, it just returns no docids.
That would require an actual index to search against.
Just create an empty index. Then when indexer runs, it recreates the index (with data this time) and notifies searchd using the --rotate switch.
Example of a way to produce an 'empty' index, as provided by #ctx (added Dec 2014):
source force {
    type            = xmlpipe2
    xmlpipe_command = cat /tmp/test.xml
}

index force {
    source       = force
    path         = /path/to/sphinx/datadir/filename
    charset_type = utf-8
}
/tmp/test.xml:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="subject"/>
  </sphinx:schema>
</sphinx:docset>
Run indexer force, and searchd should now be able to start.
Alternatively, you can use something like sql_query = SELECT 1, '', but that does require a connection to a real database server.
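Putting it together, the cron job's rebuild step might look like the following shell sketch. The rebuild_index wrapper is illustrative (not part of Sphinx); indexer and --rotate are the real Sphinx CLI, and the guard simply keeps the script harmless on machines where Sphinx is not installed:

```shell
#!/bin/sh
# Rebuild a Sphinx index and hot-swap it into the running searchd.
# --rotate builds the new index files alongside the old ones, then
# signals searchd (SIGHUP) to switch over without a restart.
rebuild_index() {
    if command -v indexer >/dev/null 2>&1; then
        indexer "$1" --rotate
    else
        echo "indexer not installed; skipping rebuild of $1"
    fi
}

rebuild_index force
```

The same wrapper works for the initial empty-index build (drop --rotate the very first time, before searchd is running).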

Related

How do I create a composite index for my Firestore query?

I'm trying to perform a Firestore query on a collection, which fails because an index needs to be created for the query I'm attempting. The error contains a link that is supposed to auto-create the missing index for me. However, when I follow the link and attempt to create the index that has been prepared for me, I encounter an error stating "name only indexes are not supported". I would also point out that I have been using the npm functions-framework to test the cloud function that contains the relevant query.
I have tried creating the composite index myself manually, but none of the indexes I have made seems to satisfy my attempted query.
Sample docs in my Items Collection:
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "en-us" <string>
}
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "en-us" <string>
}
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "fr" <string>
}
{
  descriptionLastModified: someTimestamp <a timestamp datatype>
  detectedLanguage: "en-us" <string>
}
These are all queries I have tried which fail:
let queryRef = itemsRef.where('descriptionLastModified','<=', oneDayAgoTimestamp).orderBy("descriptionLastModified","desc").where("detectedLanguage", '==', "en-us").get()
let queryRef = itemsRef.where('descriptionLastModified','<=', oneDayAgoTimestamp).where("detectedLanguage", '==', "en-us").get()
let queryRef = itemsRef.where("detectedLanguage", '==', "en-us").where('descriptionLastModified','<=', oneDayAgoTimestamp).get()
I have made the following composite indexes at the collection level to no avail:
CollectionId:items Fields: descriptionLastModified:DESC detectedLangauge: ASC
CollectionId:items Fields: descriptionLastModified:ASC detectedLangauge: ASC
CollectionId:items Fields: detectedLangauge: ASC descriptionLastModified:DESC
My expectation is that I should be able to filter my items by their descriptionLastModified timestamp field and additionally by the value of their detectedLanguage string field.
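For reference, a composite index matching the first query above could be declared in a firestore.indexes.json file and deployed with the Firebase CLI (firebase deploy --only firestore:indexes). This is only a sketch assuming default collection scope; note also that the manually created indexes listed above spell the field detectedLangauge, while the sample documents use detectedLanguage, and a misspelled field name would prevent an index from ever matching:

```json
{
  "indexes": [
    {
      "collectionGroup": "items",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "detectedLanguage", "order": "ASCENDING" },
        { "fieldPath": "descriptionLastModified", "order": "DESCENDING" }
      ]
    }
  ]
}
```

In a composite index the equality field comes before the range/orderBy field, which is why detectedLanguage is listed first here.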
In case anyone finds this in the future: it's 2021, and I still find that manually created composite indexes, despite being incredibly simple (or so you'd think; I fully understand why the OP thought his indexes would work), often just don't work. Doubtless there is some subtlety that reading some guides would make clear, but I haven't found the trick yet, and I have been using Firestore intensively at work for over 18 months.
The trick is to use the link the error creates, but this often fails: you get a dialogue box telling you an index will be created, but no details that would let you create it manually, and the friendly blue 'create' button does nothing. It neither creates the index nor dismisses the window.
For a while I had it working in Firefox, but it stopped. A colleague a couple of desks away, who has to create them a lot, tells me that Edge is the most reliable, and that you have to be very careful not to have multiple Google accounts signed in. If Edge (or Chrome) takes you to the wrong login when following the link (it will assume your default login rather than the one currently selected in your only Google Cloud console window), then even after switching back it only works about one time in three. He tells me that in Edge it works about 60% of the time.
I used to get about a 30% success rate in Firefox by hitting refresh and so on a few times, but I can't get it working other than in Edge now. In practice, unless there is a client with little cash who will notice, I just go for inefficient and costly queries that return a superset of the results and apply some filters to them in code. Mostly this runs in Node.js, and it's nippy enough for my purposes. It's a real shame to ramp up the read counts and the consequent bills, but there just doesn't seem to be a fix.

Query Google Cloud Datastore to retrieve matching results

I am using Google Cloud Datastore to save my application data. I need to add a query that returns all results matching Name, Brand, or Sku.
Querying with one of the fields returns records, but using all the fields together returns an error.
Query:
const term = "My Red";
const q = gstore.createQuery(req.params.orgId, "Variant")
    .filter('brand', '=', term)
    .filter('sku', '=', term)
    .limit(10);
Error:
{"msec":435.96913800016046,"error":"no matching index found.
 recommended index is:
 - kind: Variant
   properties:
   - name: brand
   - name: sku","data":{"code":412,"metadata":{"_internal_repr":{}},"isBoom":true,"isServer":true,"data":null,"output":{"statusCode":500,"payload":{"statusCode":500,"error":"Internal Server Error","message":"An internal server error occurred"},"headers":{}}}}
Debug: internal, error
Also, I want to perform an OR operation to get matching results, as the above returns data with AND semantics.
Please help me find the correct path to achieve the desired result.
Thanks in advance, and let me know if something is not clear.
The error indicates that the composite index required by the respective query is not in Serving state.
That means it's either not created/deployed or it was recently deployed and is still being built.
Composite indexes must be specifically created and deployed in your app.
If you didn't create it, you need to do so. The error message indicates the content the index configuration requires. If you're using the development server, it might create the index configuration automatically, but you still need to deploy it.
See the Indexes docs for more details.
If you recently deployed the composite index, note that it can take a significant amount of time until the matching index is built, depending on how many entities of that kind already exist in the Datastore. You can check the status of the index build in the developer console, on the Indexes page.
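As a sketch, the index recommended by the error message would look like this in an index.yaml file (for an App Engine-style deployment; deploy it with, e.g., gcloud datastore indexes create index.yaml):

```yaml
indexes:
- kind: Variant
  properties:
  - name: brand
  - name: sku
```

As for the OR part of the question: Datastore queries have no native OR operator across properties, so the usual approach is to run one query per field and merge/deduplicate the results client-side.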

Right way to delete and then reindex ES documents

I have a Python 3 script that attempts to reindex certain documents in an existing Elasticsearch index. I can't update the documents because I'm changing from an autogenerated id to an explicitly assigned id.
I'm currently attempting to do this by deleting existing documents using delete_by_query and then indexing once the delete is complete:
self.elasticsearch.delete_by_query(
    index='%s_*' % base_index_name,
    doc_type='type_a',
    conflicts='proceed',
    wait_for_completion=True,
    refresh=True,
    body={}
)
However, the index is massive, and so the delete can take several hours to finish. I'm currently getting a ReadTimeoutError, which is causing the script to crash:
WARNING:elasticsearch:Connection <Urllib3HttpConnection: X> has failed for 2 times in a row, putting on 120 second timeout.
WARNING:elasticsearch:POST X:9200/base_index_name_*/type_a/_delete_by_query?conflicts=proceed&wait_for_completion=true&refresh=true [status:N/A request:140.117s]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='X', port=9200): Read timed out. (read timeout=140)
Is my approach correct? If so, how can I make my script wait long enough for the delete_by_query to complete? There are two timeout parameters that can be passed to delete_by_query, search_timeout and timeout, but search_timeout defaults to no timeout (which I think is what I want), and timeout doesn't seem to do what I want. Is there some other parameter I can pass to delete_by_query to make it wait as long as it takes for the delete to finish? Or do I need to make my script wait some other way?
Or is there some better way to do this using the ElasticSearch API?
You should set wait_for_completion to False. In that case delete_by_query returns the task details immediately, and you can track the task's progress using the corresponding Tasks API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html#docs-delete-by-query-task-api
Just to spell out the approach Random described in code, for ES/Python newbies like me:
ES = Elasticsearch(['http://localhost:9200'])
query = {'query': {'match_all': {}}}
# With wait_for_completion=False, delete_by_query returns immediately
# with a task handle instead of waiting for the delete to finish.
task = ES.delete_by_query(index='index_name', doc_type='sample_doc',
                          wait_for_completion=False, body=query, ignore=[400, 404])
response_task = ES.tasks.get(task_id=task['task'])  # check on the task
isCompleted = response_task['completed']  # True once the task has finished
One can write a custom helper that checks at some interval, in a while loop, whether the task has completed.
I used Python 3.x and Elasticsearch 6.x.
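Such a polling loop might look like the following sketch. The wait_for_task name and the injected get_task callable are illustrative, not part of the elasticsearch client; with a real client you would pass something like lambda: ES.tasks.get(task_id=task['task']):

```python
import time

def wait_for_task(get_task, poll_interval=5.0, max_wait=None):
    """Poll a task-status callable until it reports completion.

    get_task is a zero-argument callable returning a task-status dict
    with a 'completed' key, e.g. lambda: ES.tasks.get(task_id=task['task']).
    Raises TimeoutError if max_wait seconds pass without completion.
    """
    waited = 0.0
    while True:
        status = get_task()
        if status.get("completed"):
            return status  # final task status, including any failures
        if max_wait is not None and waited >= max_wait:
            raise TimeoutError("task did not complete within %.0f s" % max_wait)
        time.sleep(poll_interval)
        waited += poll_interval
```

Because the long-running work happens server-side and the script only issues short status requests, the read-timeout problem from the question goes away.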
You can use the request_timeout parameter. This overrides the connection's timeout setting for that call, as mentioned here.
For example:
es.delete_by_query(index=<index_name>, body=<query>, request_timeout=300)
Or set it at the connection level, for example:
es = Elasticsearch(**(get_es_connection_parms()), timeout=60)

Hot Swappable Index in Sitecore 7.2 with Lucene

I was experimenting with the new Sitecore 7.2 functionality SwitchOnRebuildLuceneIndex.
Apparently, this functionality allows me to access my index in read-only mode while I am rebuilding it.
Is there any way to have a fully operational index (not read-only) while the index is rebuilding?
The test that I am performing is the following:
1) Rebuild a custom index with 30k items (it takes 30 sec)
2) While the index is rebuilding: add a Sitecore item (via code)
3) While the index is rebuilding: access the custom index (via code) to get the count of items
4) After the index has completed the rebuild: access the custom index (via code) to get the count of items
In step 3 it returns the original item count, 30000.
In step 4 it returns the updated item count, 30001.
Thanks for the help,
Stelio
I do not believe that this will be possible. Conceptually speaking, Sitecore is essentially software that makes databases more user-friendly and defines a structure that both technical and non-technical persons can understand and follow. What you are talking about goes against the concepts of ACID, database locks, and transactions. I have commented with more technical (database) annotations on your steps, inline, below:
Rebuild a custom index... - Place a lock on the items in the database and start transaction
meanwhile ...: add a sitecore item... - A separate transaction running against the items, though not affecting the locked set that the transaction started in step 1 is using
meanwhile ...: access the custom... - Another transaction runs after the transaction in step 2, thus including the count of all items (including the locked set and the newly added item)
after the index completed... - Transaction 1 completed and lock released; Get count of items from custom index returns a different count than count of items if not counted from the index (the latter is greater, as a new item was added)
As such, step 3 returns the new count of items and step 4 returns the original.
If you want to keep track of the changes that happen during the index rebuild, you could use the IntervalAsynchronousStrategy as your index rebuild strategy:
<strategies hint="list:AddStrategy">
  <intervalAsyncMaster type="Sitecore.ContentSearch.Maintenance.Strategies.IntervalAsynchronousStrategy, Sitecore.ContentSearch">
    <param desc="database">master</param>
    <param desc="interval">00:00:05</param>
    <!-- whether full index rebuild should be triggered if the number of items in history engine exceeds ContentSearch.FullRebuildItemCountThreshold -->
    <CheckForThreshold>true</CheckForThreshold>
  </intervalAsyncMaster>
</strategies>
This reads through the History table and updates your index accordingly.
If you check the Sitecore implementation of this class, you can see that it handles the rebuilding case: if a rebuild is running it does nothing and waits for the next time it is scheduled, and once the rebuild has finished it collects the entries from the History table and applies them to the index. See the Run method of the class:
...
if (IndexCustodian.IsIndexingPaused(this.index))
{
    CrawlingLog.Log.Debug(string.Format("[Index={0}] IntervalAsynchronousUpdateStrategy triggered but muted. Indexing is paused.", this.index.Name), null);
}
else if (IndexCustodian.IsRebuilding(this.index))
{
    CrawlingLog.Log.Debug(string.Format("[Index={0}] IntervalAsynchronousUpdateStrategy triggered but muted. Index is being built at the moment.", this.index.Name), null);
}
else
{
    CrawlingLog.Log.Debug(string.Format("[Index={0}] IntervalAsynchronousUpdateStrategy executing.", this.index.Name), null);
    HistoryEntry[] history = HistoryReader.GetHistory(this.Database, this.index.Summary.LastUpdated);
    ...

How can I manually reindex Solr using Sunspot?

I have CouchDB. Sunspot was correctly indexing everything, but the Solr server crashed. I need to reindex the whole thing. rake sunspot:reindex won't work, as it is tightly coupled with ActiveRecord. Sunspot.index(Model.all) didn't work; the Solr core says 0 indexed docs even after doing that. Is there a way out?
Post.solr_reindex
There are a number of options that can be passed to solr_reindex; they are the same options as for index. From the documentation:
# index in batches of 50, commit after each
Post.index

# index all rows at once, then commit
Post.index(:batch_size => nil)

# index in batches of 50, commit when all batches complete
Post.index(:batch_commit => false)

# include the associated +author+ object when loading to index
Post.index(:include => :author)
What I was looking for was this:
Post.index!(Model.all)
Something bad was happening when I tried to index assuming that batch commits would happen automatically. Anyway, this worked totally fine for me.
I usually use the commands below to index models. They work perfectly every time.
For a model, e.g. Post:
Sunspot.index Post.all
For a single row, e.g. Post.where(id: 5):
Sunspot.index Post.where(id: 5)
It will work.
Cheers!
