Delete Solr docs without committing

For my use case, I will be deleting all Solr docs every day and indexing new docs right after:
Delete:
import requests

# Enable remote streaming so stream.body can be passed to the update handler
conf = {
    "set-property": [
        {"requestDispatcher.requestParsers.enableRemoteStreaming": True},
        {"requestDispatcher.requestParsers.enableStreamBody": True},
    ]
}
resp = requests.post(f"http://{SOLR_HOST}:{SOLR_PORT}/solr/product_{country}/config", json=conf)

# Delete every document in the core
resp = requests.get(
    f"http://{SOLR_HOST}:{SOLR_PORT}/solr/product_{country}/update"
    + "?stream.body=<delete><query>*:*</query></delete>"
)
Insert:
pySolr.solr.add_objects(..., commit=True, softCommit=True)
This seems to work fine. However, if I add a breakpoint between the delete and the insert, I notice that my Solr core is empty (0 docs). Is there any way I can keep the old Solr docs available until the insert command runs successfully?

You can create a new core with a different name, index into it, and once indexing has completed, delete the old one.
After deleting the old one, you can rename the new core to the required name.
Here is the API for renaming a core:
admin/cores?action=RENAME&core=core-name&other=other-core-name
core: the name of the Solr core to be renamed.
other: the new name for the Solr core.
Note: you can also check whether SWAP works in your case.
SWAP atomically swaps the names used to access two existing Solr cores.
You can refer to the documentation here.
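A minimal sketch of the swap flow with requests, reusing the question's host variables; the staging core name product_us_new is illustrative and assumes it has already been created and fully indexed:

import requests

admin = f"http://{SOLR_HOST}:{SOLR_PORT}/solr/admin/cores"

# Swap the freshly indexed core into place: queries on product_us now
# hit the new data, while the old data stays reachable under the staging name.
requests.get(admin, params={"action": "SWAP", "core": "product_us", "other": "product_us_new"})

# Once satisfied, unload the core that now holds the old documents.
requests.get(admin, params={"action": "UNLOAD", "core": "product_us_new"})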

If your use case is just to delete all the records and re-index all of them, you may not need the delete step at all. This assumes that your 'id' field is custom generated rather than auto-generated, that the number of records being re-indexed is equal to or greater than the number already in the collection, and that every record that already exists is included in the re-index. Indexing a document whose id already exists replaces the existing document, which eliminates the separate delete step.
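A minimal sketch of that overwrite-by-id behaviour with pysolr (the core name and documents are illustrative):

import pysolr

solr = pysolr.Solr(f"http://{SOLR_HOST}:{SOLR_PORT}/solr/product_us")

# First add creates the document.
solr.add([{"id": "sku-1", "name": "old name"}], commit=True)

# A second add with the same id replaces the document in place,
# so no separate delete step is needed.
solr.add([{"id": "sku-1", "name": "new name"}], commit=True)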

Related

Comparing data in an Azure Index to data that is about to be uploaded

I have an index using the Azure Cognitive Search service. I'm writing a program to automate the upload of new data to this index, and I don't want to delete and re-create the index from scratch unnecessarily each time. Is there a way of comparing what is currently in the index with the data I am about to upload, without having to download that data first and compare it manually? I have been looking at the MS documentation and other articles but cannot see a way to do this comparison.
You can use the MergeOrUpload operation: if the document is not there it will be inserted, otherwise it is updated.
Please make sure the IDs are the same, otherwise you'll end up always adding new items.
IndexAction.MergeOrUpload(
    new Customer()
    {
        Id = "....",
        UpdatedBy = new
        {
            Id = "..."
        }
    }
)
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.search.models.indexactiontype?view=azure-dotnet
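For what it's worth, the newer azure-search-documents Python SDK exposes the same semantics; a minimal sketch, with illustrative endpoint, key, index, and field names:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<service>.search.windows.net",  # illustrative
    index_name="customers",                           # illustrative
    credential=AzureKeyCredential("<api-key>"),
)

# Inserts the document if its key is new, merges the fields otherwise.
client.merge_or_upload_documents(documents=[{"Id": "....", "UpdatedBy": {"Id": "..."}}])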

couchDB conflicts when supplying own ID with large inserts using _bulk_docs

The same code works fine when letting Couch auto-generate UUIDs. I am starting off with a completely empty database, yet I keep getting this:
error: conflict
reason: Document update conflict
To reiterate, I am posting new documents to an empty database, so I am not sure how I can get update conflicts when nothing is being updated. Even stranger, the conflicting documents still show up in the DB with only a single revision, but overall there are missing records.
I am trying to insert about 38,000 records with _bulk_docs in batches of 100. I am getting these records (100 at a time) from a RETS server; each record already has a unique ID that I want to use for the CouchDB _id instead of their UUIDs. I am using a promise-based library to get the records and axios to insert them into Couch. After getting the first batch of 100, I run this code to add an _id to each of the 100 records before inserting:
let batch = [];
batch = records.results.map((listing) => {
    let temp = listing;
    temp._id = listing.ListingKey;
    return temp;
});
Then insert:
axios.post('http://127.0.0.1:5984/rets_store/_bulk_docs', { docs: batch })
This is all inside a function that I call recursively.
I know this probably won't be enough to see the issue, but I thought I'd start here. I am fairly sure it has something to do with my map() and adding _id = ListingKey.
Thanks!
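One thing that may help narrow this down: _bulk_docs reports success or failure per submitted document, so you can log exactly which _ids conflict (a duplicate ListingKey within or across batches would show up here). A minimal sketch, in Python for brevity, against the same local database:

import requests

# Illustrative batch; in the real code this comes from the RETS results.
batch = [{"_id": "listing-1"}, {"_id": "listing-2"}, {"_id": "listing-1"}]

resp = requests.post("http://127.0.0.1:5984/rets_store/_bulk_docs", json={"docs": batch})

# _bulk_docs returns one result per doc; failures carry "error"/"reason"
# fields instead of a new "rev".
for result in resp.json():
    if "error" in result:
        print(result["id"], result["error"], result.get("reason"))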

Bulk delete documents of a specific type in elasticsearch

I have an index called "animals" in Elasticsearch that contains several documents of type "dogs".
I want to delete all "dogs" documents in the "animals" index. I am using the Python elasticsearch package.
My python code is as follows:
connection = Elasticsearch([{"host": "myhost", "port": "myport" } ])
body = {} # What should I put in the body???
connection.bulk(body, index="animals", doc_type="dogs", ignore=[400, 404])
Here I don't know what do I need to put in the body. Can anyone help me out?
The bulk method is defined at https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/client/__init__.py#L1002
There is no API in elasticsearch-py to delete all documents of a type. This is because, from Elasticsearch 2.x onwards, there is no API to delete a type.
From the documentation
In 1.x it was possible to delete a type mapping, along with all of the documents of that type, using the delete mapping API. This is no longer supported, because remnants of the fields in the type could remain in the index, causing corruption later on.
Instead, if you need to delete a type mapping, you should reindex to a new index which does not contain the mapping. If you just need to delete the documents that belong to that type, then use the delete-by-query plugin instead.
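For what it's worth, on Elasticsearch 5.x and later delete-by-query is back in core and exposed directly by elasticsearch-py, so a sketch like the following (host and port are illustrative) removes all "dogs" documents without the plugin:

from elasticsearch import Elasticsearch

connection = Elasticsearch([{"host": "myhost", "port": 9200}])

# Match every document of the "dogs" type and delete it.
connection.delete_by_query(
    index="animals",
    doc_type="dogs",
    body={"query": {"match_all": {}}},
)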

Neo4j-php retrieve node

I have been using Cypher queries almost exclusively with this Neo4j client, because there is no out-of-the-box way of doing many things. One of those is getting nodes: there is no way to retrieve a node without knowing its id, which is very low level. Any idea how to run something like
$client->findOne('property','value');
?
It should be straightforward, but it isn't clear from the documentation.
Create indexes on the properties you want to search. Starting from a newly created $personNode:
$personIndex = new \Everyman\Neo4j\NodeIndex($client, 'person');
$personIndex->add($personNode, 'name', $personNode->name);
Then later, to search, a new PHP object $personIndex will reference the same populated index as above:
$personIndex = new \Everyman\Neo4j\NodeIndex($client, 'person');
$match = $personIndex->findOne('name', 'edoceo');

Point-in-time restores of databases and documents using Cloudant

How can I save changes in CouchDB / Cloudant in order to later do point-in-time restores of my databases, or even specific documents?
We’re working on making this a first-class feature, but until we roll it out, this is how one of our customers did it:
You have collections, and within those collections, resources. So, you keep a logging database where every document has an ID like collection-resource, so for a collection named "cars" and a resource named "Ford", you'd have a document in your logging database named cars-ford. That document looks like this:
{
versions: [...]
}
Any time that resource is touched or modified, your application updates the logging document by appending the new version to the end of the versions field. That version might look like this:
{
timestamp: '...', # some integer timestamp, for sorting
doc: {...} # attributes of the document as of the save
}
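A minimal sketch of that logging step with requests, assuming the loggy database named later in this answer and an illustrative Cloudant host (the helper name log_version is hypothetical):

import time
import requests

CLOUDANT = "https://<account>.cloudant.com"  # illustrative

def log_version(collection, resource, doc):
    """Append the current state of a resource to its logging document."""
    url = f"{CLOUDANT}/loggy/{collection}-{resource}"
    entry = {"timestamp": int(time.time()), "doc": doc}

    resp = requests.get(url)
    if resp.status_code == 404:
        log_doc = {"_id": f"{collection}-{resource}", "versions": []}
    else:
        log_doc = resp.json()  # includes _rev, so the PUT below updates in place

    log_doc["versions"].append(entry)
    requests.put(url, json=log_doc)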
We'll then build views over these version entries to return lists of all versions of all documents, sorted by when each change occurred.
Then, here's how you use that to do restores and the like:
Getting the most recent version of a resource
Get the document in its entirety, and grab the last element in the versions field. That's the most recent version.
See all versions relative to a timestamp
We'll create a view to sort by timestamp. The view looks like this:
{
map: "function(doc) {
for(var i in doc.versions){
emit(doc.versions[i].timestamp, doc.versions[i].doc);
}
}"
}
Say our database is named loggy, the design doc where our views live is named restore, and the view itself is named time. Then we'll make a GET request to this URL:
{CLOUDANT_HOST}/loggy/_design/restore/_view/time?startkey='...'
...where the value for startkey is some timestamp. This, unmodified, will return every version after the indicated timestamp. Add limit=X and you'll get the X versions after the timestamp. Add descending=true and you'll get versions before the timestamp, instead of after.
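As a concrete (illustrative) call with requests, letting the library handle the query-string encoding:

import requests

resp = requests.get(
    f"{CLOUDANT_HOST}/loggy/_design/restore/_view/time",
    params={
        "startkey": 1425600000,   # some integer timestamp
        "limit": 10,              # only the 10 versions after that point
        # "descending": "true",   # walk backwards from startkey instead
    },
)
for row in resp.json()["rows"]:
    print(row["key"], row["value"])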
See the Nth revision for a resource
Much like above, but we'll tweak our view a little:
{
map: "function(doc){
for(var i in doc.versions){
emit(i, doc.versions[i].doc);
}
}"
}
Now our view results are keyed by index rather than timestamp. So, instead of passing a timestamp to startkey, we pass N to see the versions around the Nth revision.
Getting the number of revisions for a collection or resource
We'll use another view to group by collection and resource:
{
map: "function(doc){
// split the ID into collection and resource
var parts = doc._id.split('-');
// emit them as keys so we can group by them
emit([parts[0], parts[1]], null);
}",
reduce: "_count"
}
Use the query parameters group and group_level to group results by their keys. So, if we want the number of events that have touched resources in the cars collection, we would use a querystring like this:
?group=true&group_level=1&key="cars"
group groups results whose keys are the same, but group_level=1 says "only group on the first key", which in our case is the collection. key specifies to only return documents whose key matches the given value.
Getting all resources for a given collection
Using the _all_docs view, we'll use a querystring like this:
?reduce=false&startkey="{collection}-"&endkey="{collection}0"
Remember the reduce part of our function? That _count value means "return the number of records emitted by map". reduce=false means "Don't do that." Instead, only the map function is run.
That startkey and endkey pair uses how Cloudant sorts results to exclude everything but the values matching IDs that start with the given collection.
Updating docs
Once you've got the versions you'd like to restore, GET the current version of the resource, GET the past version from the loggy database, and PUT the past version to the resource using the current version's _rev value. Bam, restored. Rinse and repeat for point-in-time restore.
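A minimal sketch of that restore step (the database, document, and host names are illustrative, following the cars/loggy examples above):

import requests

CLOUDANT = "https://<account>.cloudant.com"  # illustrative

# 1. GET the current version of the resource, for its _rev.
current = requests.get(f"{CLOUDANT}/cars/ford").json()

# 2. GET the past versions from the loggy database and pick one.
log_doc = requests.get(f"{CLOUDANT}/loggy/cars-ford").json()
past = log_doc["versions"][0]["doc"]  # e.g. the oldest version

# 3. PUT the past version back under the current _rev. Bam, restored.
past["_id"] = current["_id"]
past["_rev"] = current["_rev"]
requests.put(f"{CLOUDANT}/cars/ford", json=past)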
