Hot Swappable Index in Sitecore 7.2 with Lucene - search

I have been experimenting with the new Sitecore 7.2 functionality SwitchOnRebuildLuceneIndex.
Apparently, this functionality allows me to access my index in read-only mode while the index is being rebuilt.
Is there any way to have a fully operational index (not read-only) while I am rebuilding the index?
The test that I am performing is the following one:
1) Rebuild a custom index with 30k items (it takes 30 sec)
2) While the index is rebuilding: add a Sitecore item (via code)
3) While the index is rebuilding: access the custom index (via code) to get the count of items
4) After the index has completed the rebuild: access the custom index (via code) to get the count of items
In step 3 it returns the original item count: 30,000
In step 4 it returns the updated item count: 30,001
Thanks for the help
Stelio

I do not believe that this will be possible. Conceptually speaking, Sitecore is essentially software that makes databases more user-friendly and defines a structure that both technical and non-technical people can understand and follow. What you are describing goes against the concepts of ACID, database locks and transactions. I have added more technical (database) annotations to your steps, inline, below:
Rebuild a custom index... - Place a lock on the items in the database and start a transaction
meanwhile ...: add a Sitecore item... - A separate transaction running against the items, though not affecting the locked set that the transaction from step 1 is using
meanwhile ...: access the custom... - Another transaction runs after the transaction in step 2, thus including the count of all items (including the locked set and the newly added item)
after the index completed... - Transaction 1 completed and lock released; Get count of items from custom index returns a different count than count of items if not counted from the index (the latter is greater, as a new item was added)
As such, step 3 returns the new count of items and step 4 returns the original.

If you want to keep track of the changes that happened during the index rebuild, you could use the IntervalAsynchronousStrategy as your index rebuild strategy.
<strategies hint="list:AddStrategy">
  <intervalAsyncMaster type="Sitecore.ContentSearch.Maintenance.Strategies.IntervalAsynchronousStrategy, Sitecore.ContentSearch">
    <param desc="database">master</param>
    <param desc="interval">00:00:05</param>
    <!-- whether full index rebuild should be triggered if the number of items in history engine exceeds ContentSearch.FullRebuildItemCountThreshold -->
    <CheckForThreshold>true</CheckForThreshold>
  </intervalAsyncMaster>
</strategies>
This reads through the History table and updates your index accordingly.
If you check the Sitecore implementation of this class, you can see that it handles the rebuilding event. If a rebuild is running, it does nothing and waits until the next time it is scheduled; if the rebuild has finished, it collects the entries from the History table and applies them to the index. See the Run method of the class:
...
if (IndexCustodian.IsIndexingPaused(this.index))
{
    CrawlingLog.Log.Debug(string.Format("[Index={0}] IntervalAsynchronousUpdateStrategy triggered but muted. Indexing is paused.", this.index.Name), null);
}
else if (IndexCustodian.IsRebuilding(this.index))
{
    CrawlingLog.Log.Debug(string.Format("[Index={0}] IntervalAsynchronousUpdateStrategy triggered but muted. Index is being built at the moment.", this.index.Name), null);
}
else
{
    CrawlingLog.Log.Debug(string.Format("[Index={0}] IntervalAsynchronousUpdateStrategy executing.", this.index.Name), null);
    HistoryEntry[] history = HistoryReader.GetHistory(this.Database, this.index.Summary.LastUpdated);
...

Related

Catalog synchronization is too slow, although the sync does complete

I am running catalog synchronization with products, categories, etc. However, I can see too many logs with the following info:
INFO [SyncWorker<000000S8 2 of 8>] [AbstractItemCopyContext] cannot create item due to pending attributes (..non-mandatory attributes)
INFO [SyncWorker<000000S8 3 of 8>] [AbstractItemCopyContext] cannot create item due to pending attributes (..non-mandatory attributes)
The sync is successful, but takes a lot of time to complete. The reason for this log is in ItemCreator.java:
if (cannotCreate) {
    throw new ItemCopyCreator.MissingInitialAttributes("cannot create item due to pending attributes (values " + this.getCopyContext().valuesToString(initialValues) + ")", this._sourceItem);
}
As per my understanding, this flag is true for all mandatory fields, hence it would also be triggered for the baseproduct attribute (logged when the base product is picked up).
Any input on how to avoid this scenario? Could this be the reason for the slow sync? Query time looks normal as per the JDBC logs.

Is there a way to specify a "master" or "index" migration?

I'm working on an existing Django 2.2 application comprising a custom app in conjunction with a Wagtail CMS, where I'm iteratively adding new wagtail page-types in separate user stories over time.
I want to be able to create a "master" or "index" migration that pre-builds each page-type in the database automatically when migrations are run (ours are performed in an Ansible task upon deployment). As far as I can tell, what I need requires:
The auto-built migration that modifies the DB schema for each page
A further migration that is always run last and which contains a dependencies attr, which can be updated with a single list entry representing the new page's migration name each time one is added (a rough skeleton follows this list).
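For the second point, I imagine a skeleton roughly like the following (the app label, file name and migration names are placeholders I've made up):
# myapp/migrations/9999_page_index.py  (hypothetical name)
from django.db import migrations

def create(apps, schema_editor):
    # page-build logic (see the snippet further down) would live here
    pass

class Migration(migrations.Migration):
    # Append one entry here each time a new page-type migration is added
    dependencies = [
        ('myapp', '0008_manageaccountpage'),  # placeholder entries
        ('myapp', '0009_edituserdetailpage'),
    ]
    operations = [
        migrations.RunPython(create, migrations.RunPython.noop),
    ]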
I can already auto-build page-types using the following logic in a create() method called from migrations.RunPython(), but at the moment this same page-build logic needs to exist in each page's migration. I'd prefer it if this existed in a single migration (or an alternative procedure, if one exists in Django) that can always be run.
Ideally, the page_types list below could be replaced by just iterating over BasePage.__subclasses__() (where all page-types inherit from BasePage) - see the sketch after the snippet - meaning this "master" migration need never be altered again.
Note: if it helps any, the project is still in development, so any solution that is slightly controversial or strictly "dev-only" is acceptable - assuming it can be made acceptable and therefore less controversial by merging migrations later.
...
...
# Fetch the pre-created root Page
root_page = BasePage.objects.all().first()
page_types = [
    ManageAccountPage,
    EditUserDetailPage,
]
path_init = int('000100020003')  # The last value for `path` from 0007_initialise_site_ttm.py

# Create, then add all child pages
for page_type in page_types:
    title_raw = page_type.__name__.replace('Page', '')
    page = page_type(
        title=utils.convert_camel_to_human(title_raw),
        slug=title_raw.lower(),
        show_in_menus='t',
        content_type=ContentType.objects.get_for_model(page_type),
        path=path_init + 1,
        depth=2
    )
    try:
        root_page.add_child(instance=page)
    except exceptions.ValidationError:
        continue
...
...
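The kind of replacement I have in mind for the hard-coded list is roughly this (untested; note that __subclasses__() only returns direct subclasses, so intermediate base classes would need extra handling):
# Hypothetical replacement for the hard-coded page_types list above
page_types = BasePage.__subclasses__()  # every model inheriting directly from BasePage

for page_type in page_types:
    ...  # same title/slug/add_child logic as in the snippet above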
What's the problem?
(See "What I've tried" below)
What I've tried:
A custom pin_curr_migration() method called from migrations.RunPython() that deletes the "master" migration's own record in django_migrations, allowing it to be re-run. This, however, results in errors where Django complains about previously built pages already existing.

How to get Salesforce REST API to paginate?

I'm using the simple_salesforce python wrapper for the Salesforce REST API. We have hundreds of thousands of records, and I'd like to split up the pull of the salesforce data so all records are not pulled at the same time.
I've tried passing a query like:
results = salesforce_connection.query_all("SELECT my_field FROM my_model limit 2000 offset 50000")
to see records 50K through 52K but receive an error that offset can only be used for the first 2000 records. How can I use pagination so I don't need to pull all records at once?
You're looking to use salesforce_connection.query(query=SOQL) and then .query_more(nextRecordsUrl, True).
Since .query() only returns 2000 records at a time, you need to use .query_more() to get the next page of results.
From the simple-salesforce docs
SOQL queries are done via:
sf.query("SELECT Id, Email FROM Contact WHERE LastName = 'Jones'")
If, due to an especially large result, Salesforce adds a nextRecordsUrl to your query result, such as "nextRecordsUrl" : "/services/data/v26.0/query/01gD0000002HU6KIAW-2000", you can pull the additional results with either the ID or the full URL (if using the full URL, you must pass ‘True’ as your second argument)
sf.query_more("01gD0000002HU6KIAW-2000")
sf.query_more("/services/data/v26.0/query/01gD0000002HU6KIAW-2000", True)
Here is an example of using this:
data = []  # list to hold all the records
SOQL = "SELECT my_field FROM my_model"
results = sf.query(query=SOQL)  # api call

## loop through the results and add the records
for rec in results['records']:
    rec.pop('attributes', None)  # remove extra data
    data.append(rec)  # add the record to the list

## check the 'done' attribute in the response to see if there are more records
## while 'done' == False (more records to fetch), get the next page of records
while results['done'] == False:
    ## attribute 'nextRecordsUrl' holds the url to the next page of records
    results = sf.query_more(results['nextRecordsUrl'], True)
    ## repeat the loop of adding the records
    for rec in results['records']:
        rec.pop('attributes', None)
        data.append(rec)
Looping through the records and using the data:
## loop through the records and get their attribute values
for rec in data:
    # the attribute name will always be the same as the salesforce api name for that value
    print(rec['my_field'])
Like the other answer says, though, this can start to use up a lot of resources, but it is what you're looking for if you want to achieve pagination.
Maybe create a more focused SOQL statement to get only the records needed for your use case at that specific moment.
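For instance, a minimal sketch (the WHERE clause and date literal are placeholders for whatever filter fits your data):
# Narrow the query instead of paging through every record
SOQL = "SELECT my_field FROM my_model WHERE CreatedDate = LAST_N_DAYS:7"
results = sf.query(query=SOQL)  # then page with query_more as shown above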
LIMIT and OFFSET aren't really meant to be used like that - what if somebody inserts or deletes a record at an earlier position (not to mention you don't have an ORDER BY in there)? SF will open a proper cursor for you; use it.
The https://pypi.org/project/simple-salesforce/ docs for "Queries" say that you can either call query and then query_more, or you can go with query_all. query_all will loop and keep calling query_more until you exhaust the cursor - but this can easily eat your RAM.
Alternatively, look into the bulk query stuff; there's some magic in the API but I don't know if it fits your use case. It'd be asynchronous calls and might not be implemented in the library. It's called PK Chunking. I wouldn't bother unless you have millions of records.
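If you do go the bulk route, here is a rough sketch, assuming your version of simple-salesforce exposes the bulk handler (sf is the authenticated connection from above; the object and field names come from the question):
# Bulk API query via simple-salesforce's bulk handler (availability depends on your library version)
records = sf.bulk.my_model.query("SELECT my_field FROM my_model")
for rec in records:
    print(rec['my_field'])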

Right way to delete and then reindex ES documents

I have a python3 script that attempts to reindex certain documents in an existing ElasticSearch index. I can't update the documents because I'm changing from an autogenerated id to an explicitly assigned id.
I'm currently attempting to do this by deleting existing documents using delete_by_query and then indexing once the delete is complete:
self.elasticsearch.delete_by_query(
    index='%s_*' % base_index_name,
    doc_type='type_a',
    conflicts='proceed',
    wait_for_completion=True,
    refresh=True,
    body={}
)
However, the index is massive, and so the delete can take several hours to finish. I'm currently getting a ReadTimeoutError, which is causing the script to crash:
WARNING:elasticsearch:Connection <Urllib3HttpConnection: X> has failed for 2 times in a row, putting on 120 second timeout.
WARNING:elasticsearch:POST X:9200/base_index_name_*/type_a/_delete_by_query?conflicts=proceed&wait_for_completion=true&refresh=true [status:N/A request:140.117s]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='X', port=9200): Read timed out. (read timeout=140)
Is my approach correct? If so, how can I make my script wait long enough for the delete_by_query to complete? There are two timeout parameters that can be passed to delete_by_query - search_timeout and timeout - but search_timeout defaults to no timeout (which I think is what I want), and timeout doesn't seem to do what I want. Is there some other parameter I can pass to delete_by_query to make it wait as long as it takes for the delete to finish? Or do I need to make my script wait some other way?
Or is there some better way to do this using the ElasticSearch API?
You should set wait_for_completion to False. In that case you'll get task details back and will be able to track the task's progress using the corresponding API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html#docs-delete-by-query-task-api
Just to expand, in code form, on what Random explained, for an ES/Python newbie like me:
ES = Elasticsearch(['http://localhost:9200'])
query = {'query': {'match_all': {}}}

# with wait_for_completion=False the call returns immediately with a task descriptor
response = ES.delete_by_query(index='index_name', doc_type='sample_doc', wait_for_completion=False, body=query, ignore=[400, 404])
task_id = response['task']

response_task = ES.tasks.get(task_id=task_id)  # check if the task is completed
isCompleted = response_task["completed"]  # if the 'completed' key is true, the task has finished
One can write a custom helper that checks in a while loop, at some interval, whether the task is completed (see the sketch below).
I have used Python 3.x and ElasticSearch 6.x.
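For example, a minimal polling sketch (the helper name and interval are my own, not part of the elasticsearch library):
import time

def wait_for_task(es, task_id, poll_interval=30):
    # poll the Tasks API until the delete_by_query task reports completion
    while True:
        status = es.tasks.get(task_id=task_id)
        if status.get('completed'):
            return status
        time.sleep(poll_interval)

# e.g. result = wait_for_task(ES, task_id)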
You can use the 'request_timeout' global param. This will override the connection's timeout settings, as mentioned here.
For example -
es.delete_by_query(index=<index_name>, body=<query>, request_timeout=300)
Or set it at the connection level, for example:
es = Elasticsearch(**(get_es_connection_parms()), timeout=60)

Creating a pagination index in CouchDB?

I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found.
I wrote the following map function, but the pageIndex variable doesn't reliably start at 1 - in fact, it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25 - all starting with a different file, though I seem to get the correct number of files emitted).
function(doc) {
    if (doc.type == 'log') {
        if (!pageIndex || pageIndex > 50) {
            pageIndex = 1;
            emit(doc.timestamp, null);
        }
        pageIndex++;
    }
}
What am I doing wrong here? How would a CouchDB expert build this view?
Note that I don't want to use the "startkey + count + 1" method that's been mentioned elsewhere, since I'd like to be able to jump to a particular page or the last page (user expectations and all). I'd like to have a friendly "?page=5" URI instead of "?startkey=348ca1829328edefe3c5b38b3a1f36d1e988084b", and I'd rather CouchDB did this work instead of bulking up my application, if I can help it.
Thanks!
View functions (map and reduce) are purely functional. Side-effects such as setting a global variable are not supported. (When you move your application to BigCouch, how could multiple independent servers with arbitrary subsets of the data know what pageIndex is?)
Therefore the answer will have to involve a traditional map function, perhaps keyed by timestamp.
function(doc) {
    if (doc.type == 'log') {
        emit(doc.timestamp, null);
    }
}
How can you get every 50th document? The simplest way is to add a skip=0, skip=50, or skip=100 parameter to your view query. However, that is not ideal (see below).
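As a rough illustration of the skip/limit approach (the database, design document and view names below are made up; any HTTP client works):
import requests

VIEW = "http://localhost:5984/mydb/_design/logs/_view/by_timestamp"  # hypothetical names

def get_page(page, per_page=50):
    # skip the rows before this page, then return exactly one page of rows
    params = {"skip": (page - 1) * per_page, "limit": per_page}
    return requests.get(VIEW, params=params).json()["rows"]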
A way to pre-fetch the exact IDs of every 50th document is a _list function which only outputs every 50th row. (In practice you could use Mustache.JS or another template library to build HTML.)
function() {
    var ddoc = this,
        pageIndex = 0,
        first = true,
        row;

    send("[");
    while (row = getRow()) {
        if (pageIndex % 50 == 0) {
            // separate entries with commas so the output stays valid JSON
            if (!first) { send(","); }
            send(JSON.stringify(row));
            first = false;
        }
        pageIndex += 1;
    }
    send("]");
}
This will work for many situations, however it is not perfect. Here are some considerations I am thinking of - not necessarily showstoppers, but it depends on your specific situation.
There is a reason the pretty URLs are discouraged. What does it mean if I load page 1, then a bunch of documents are inserted within the first 50, and then I click to page 2? If the data is changing a lot, there is no perfect user experience; the user must somehow feel the data changing.
The skip parameter and example _list function have the same problem: they do not scale. With skip you are still touching every row in the view starting from the beginning: finding it in the database file, reading it from disk, and then ignoring it, over and over, row by row, until you hit the skip value. For small values that's quite convenient but since you are grouping pages into sets of 50, I have to imagine that you will have thousands or more rows. That could make page views slow as the database is spinning its wheels most of the time.
The _list example has a similar problem, however you front-load all the work, running through the entire view from start to finish, and (presumably) sending the relevant document IDs to the client so it can quickly jump around the pages. But with hundreds of thousands of documents (you call them "log" so I assume you will have a ton) that will be an extremely slow query which is not cached.
In summary, for small data sets, you can get away with the page=1, page=2 form however you will bump into problems as your data set gets big. With the release of BigCouch, CouchDB is even better for log storage and analysis so (if that is what you are doing) you will definitely want to consider how high to scale.
