Keeping Elasticsearch in sync with a key or versioning

So I have a situation where I receive a lot of large XML files and I want that data synchronised to Elasticsearch.
Current way
Have index_1
When data is updated, create a blank index_2
Load all of the latest data into index_2
Switch the alias to index_2 and delete index_1 (sketched below)
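For reference, this full reindex-and-swap flow can be expressed against the plain REST API roughly as follows; the server URL, the alias name and the bulk-load step are placeholders, not part of the original question:

import requests

ES = "http://localhost:9200"                       # placeholder server

requests.put(f"{ES}/index_2")                      # create the new, empty index
# ... bulk load all of the latest XML data into index_2 ...
requests.post(f"{ES}/_aliases", json={"actions": [
    {"remove": {"index": "index_1", "alias": "my_data"}},   # swap the alias atomically
    {"add": {"index": "index_2", "alias": "my_data"}},
]})
requests.delete(f"{ES}/index_1")                   # drop the old index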
Proposed way
Have a synced.xml file which has been synchronised with Elasticsearch
When a new time-dated XML file is available, compare it against synced.xml
If anything is new in the time-dated XML file, add just that to ES
Rename the time-dated XML file to synced.xml
This means that out of 500,000 items I only have to add, say, the 5,000 items that have changed, rather than re-indexing all 500,000.
Question
In a scenario like this, how do I ensure they are synchronised? For example, what happens if Elasticsearch gets wiped; how can I tell my program that it would need to add the whole lot again? Is there a way to use some sort of synchronisation key on Elasticsearch, or perhaps a better approach?

Here is what I recommend...
Add a stored field to your type to store a hash like MD5
Use scan/scroll to export the ID and hash from ES
In your backing dataset, export the ID and hash
Use something like MapReduce to "join" the exported IDs from each set
Where there are differences, found by comparing the hashes or spotting missing keys, index/update
The hash is only useful if you want to detect document changes. This also assumes that you either persist ES's IDs back to your backing store or self-assign IDs.
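A minimal sketch of that export-and-diff step with the elasticsearch-py client could look like this; the index name, the content_md5 field and the backing-store hashes are illustrative assumptions:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")        # placeholder server

# 1. Export id -> hash from ES (scan/scroll under the hood)
es_hashes = {
    hit["_id"]: hit["_source"]["content_md5"]
    for hit in helpers.scan(es, index="items",
                            query={"_source": ["content_md5"], "query": {"match_all": {}}})
}

# 2. Export id -> hash from the backing dataset (stand-in values here)
backing_hashes = {"doc-1": "9a0364b9e99b...", "doc-2": "b6d767d2f8ed..."}

# 3. "Join" the two sets: new or changed documents get (re)indexed,
#    ids missing from the backing set could be deleted from ES
to_index = [i for i, h in backing_hashes.items() if es_hashes.get(i) != h]
to_delete = set(es_hashes) - set(backing_hashes)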

Related

How to get data from elastic search, if new data came then update it, and again inject it?

I have nearly 200,000 rows of tuples in my Pandas DataFrame. I injected that data into Elasticsearch. Now, when I run the program it should check whether the data is already present in Elasticsearch, and if not, insert it.
I'd recommend not worrying about it and just loading everything into Elasticsearch. As long as your _ids are consistent, the existing documents will be overwritten instead of duplicated. So just be sure to specify an _id for each document and you are fine; the bulk helpers in the elasticsearch-py client already support setting an _id value for each document.
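A minimal sketch of this, assuming a pandas DataFrame and the elasticsearch-py bulk helper; the index name and the "id" column are made up for illustration:

import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")        # placeholder server
df = pd.DataFrame([{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}])   # stand-in for the real data

def actions(frame):
    for rec in frame.to_dict(orient="records"):
        # a stable _id means re-running the load overwrites instead of duplicating
        yield {"_index": "my_index", "_id": rec["id"], "_source": rec}

helpers.bulk(es, actions(df))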

How to check for duplication before creating a new document in CouchDB/Cloudant?

We want to check if a document already exists in the database with the same fields and values as a new object we are trying to save, to prevent duplicated items.
Note: This question is not about updating documents or about duplicated document IDs; we only check the data to prevent saving a new document with the same data as an existing one.
Preferably we'd like to accomplish this with Mango/Cloudant queries and not rely on views.
The idea so far is:
1) Scan the data that we are trying to save and dynamically create a selector that matches that document's structure. (We can't have the selectors hardcoded because we have many types of documents.)
2) Query the DB for any documents matching that selector to see if any document already exists that matches those criteria.
However, I wonder about the performance of this approach, since many of the selector fields will not be indexed.
I would also much rather follow best practices than create something out of the blue, but I haven't been able to find any known solutions for this specific scenario.
If you happen to know of any, please share.
Option 1 - Define a meaningful ID for your documents
The ID could be a logical composition or a hash computed from the values that should be unique
If you want to check whether a document ID already exists you can use the HEAD method
HEAD /db/docId
which returns 200 OK if the docId exists in the database.
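For example (server URL and credentials are placeholders):

import requests

resp = requests.head("http://localhost:5984/db/docId", auth=("admin", "password"))
exists = resp.status_code == 200    # 404 means the id is free and the document can be created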
If you would like to check whether you have the same content in the new document as in the previous one, you may use the Validate Document Update Function, which allows you to compare both documents.
function(newDoc, oldDoc, userCtx, secObj) {
...
}
Option 2 - Use a content hash computed outside CouchDB
Before creating or updating a document, a hash should be computed using the values of the attributes that should be unique.
The hash is included in the document in a new attribute, e.g. "key_hash".
Create a Mango index on the "key_hash" attribute.
When a new doc is to be inserted, compute the hash and look for documents with the same hash value using a Mango expression before the doc is inserted.
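A rough sketch of this option against CouchDB's HTTP API; the database, credentials and the attributes being hashed are assumptions:

import hashlib
import requests

DB = "http://localhost:5984/mydb"                  # placeholder database
AUTH = ("admin", "password")                       # placeholder credentials

# One-time: create a Mango index on the key_hash attribute
requests.post(f"{DB}/_index", json={"index": {"fields": ["key_hash"]}}, auth=AUTH)

doc = {"name": "test.txt", "size": "512"}
doc["key_hash"] = hashlib.md5(f'{doc["name"]}|{doc["size"]}'.encode()).hexdigest()

# Only insert if no document with the same hash already exists
hits = requests.post(f"{DB}/_find", json={"selector": {"key_hash": doc["key_hash"]}}, auth=AUTH).json()
if not hits.get("docs"):
    requests.post(DB, json=doc, auth=AUTH)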
Option 3 - Compute hash in a View
Define a view which emits the computed hash for each document as the key.
CouchDB's JavaScript support does not include hashing functions, so this could be difficult to include in a design document.
Use Erlang to define the map function, where you can access Erlang's support for hashing.
Before creating a new document you should query the view using the hash, which you need to compute beforehand.
One solution would be to take Juanjo's and Alexis's comment one step further.
Select the keys you wish to keep unique
Put the values in a string and generate a hash
Set the document's _id to that hash
PUT the document to the database.
Check the return code for failure.
If another document already exists on the database with the same _id value, the PUT request will fail.
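A small sketch of this approach; the database URL, credentials and the fields chosen as the unique key are assumptions:

import hashlib
import requests

doc = {"name": "test.txt", "size": "512", "lastModified": "2016-11-08T15:44:29.563Z"}
doc_id = hashlib.md5("|".join([doc["name"], doc["size"], doc["lastModified"]]).encode()).hexdigest()

resp = requests.put(f"http://localhost:5984/mydb/{doc_id}", json=doc, auth=("admin", "password"))
if resp.status_code == 409:
    # conflict: a document built from the same key values already exists
    print("duplicate, not saved")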

ArangoDB: How to list or export all documents in a database, regardless of collections

I see that there are APIs for listing and exporting all documents within a given collection, but I need to list/export all documents (ids) in all collections in a database simultaneously. (This is for V&V of a service I'm developing). Is this possible, or must I query for each collection one at a time?
Thanks!
arangodump will dump all the collections within a database, but the output format for a DOCUMENT is like this:
{"type":2300,"data": DOCUMENT}
There is one such entry per document in a per-collection file, which is named like so:
COLLECTION_07cf4f8f5d8b76282917320715dda2ad.data.json
It would be easy enough to extract DOCUMENT, e.g. using jq one would basically write: jq .data
arangoexport does allow one to specify multiple collections in a single invocation, but they must be specified explicitly.
One possibility for automating the use of arangoexport would be to use arangosh to generate the collection names in a specific database (using db._collections()), and then construct the appropriate arangoexport command or commands.
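As a sketch, the same thing can be scripted against the HTTP API: list the collections, then run a small AQL query per collection to pull the document ids. The server, database and credentials are placeholders, and cursor paging is ignored for brevity:

import requests

BASE = "http://localhost:8529/_db/mydb"            # placeholder server and database
AUTH = ("root", "password")                        # placeholder credentials

cols = requests.get(f"{BASE}/_api/collection", params={"excludeSystem": "true"}, auth=AUTH).json()["result"]

all_ids = []
for col in cols:
    cursor = requests.post(f"{BASE}/_api/cursor", auth=AUTH, json={
        "query": "FOR d IN @@col RETURN d._id",
        "bindVars": {"@col": col["name"]},
        "batchSize": 1000,
    }).json()
    all_ids.extend(cursor["result"])               # NOTE: ignores cursor["hasMore"] paging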

How to prevent saving same named file on CouchDb?

I am using CouchDB with Divan - C# interfacing library for CouchDb.
A file can be uploaded many times to CouchDB. Every time, the "id" changes after the file is uploaded, but the "rev" remains the same.
This happens even if all custom attributes defined for the file being uploaded are the same as those of an existing file on CouchDB with the same name.
Is there any way to avoid uploading a file with the same name if all custom attributes are the same? Fetching all files and checking them for file-name repetition could be a way, but it is definitely not preferable, given the time it would take depending on other factors.
Thank you.
Let's say you have 3 attributes for a file :
name
size in bytes
Date of modification
I see two main possibilities to avoid duplicates in your database.
Client approach
You query the database with a view to check whether a document with the same attributes already exists. If it does not, create it.
User defined id
You could generate an id from the attributes, as this library does.
For example, if my document has these attributes:
"name":"test.txt",
"size":"512",
"lastModified":"2016-11-08T15:44:29.563Z"
You could build a unique id like this:
"_id":"test.txt/2016-11-08T15:44:29.563Z/512"

Can data in Solr be extended with manually defined meta data?

I have several documents in a Solr collection that I want to be able to search through. Most of the data comes from web sites I can easily crawl; however, I need to add some attributes manually.
So as an example I get the following info from a site (all attributes returned from crawled site):
Name: Porsche Boxter
Year: 1996
...
I want to add additional fields through a web interface (info not present on crawled sites):
Cool: yes
foo: bar
My questions:
Does it make sense at all to store additional information alongside the indexed data within Solr (inside the documents), or would best practice be to have only the crawled data in Solr and merge it with an externally managed database at query time? To me it makes more sense to have all the data that is eventually queried in Solr, as some of the manually added attributes are required search criteria (e.g. look only for cool cars from the 90s).
Is it possible to use Solr to store additional information about indexed documents? I know the entire schema in advance, perhaps this is useful?
If I store my data exclusively in Solr, how can I ensure that during the next crawl the manually added data is not overwritten? Would partial update be required?
Since I am new to Solr, it would also be very helpful if someone could point out what to look for in the documentation that describes my use case.
That depends on how often the external data changes. The more often it changes, the less meaningful it is to store it in Solr. Generally it is a good idea to store such data alongside the indexed data, because you get it back without an additional database query.
Yes. Use indexed:false and stored:true. If you do not know all of such fields in advance, you could use a dynamicField like <dynamicField name="*_stored" type="string" indexed="false" stored="true" />.
Yes. You have to use partial updates. This is no problem in your case, because the fields that are not updated have stored:true.
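A partial (atomic) update only touches the fields you send, so the stored crawled fields survive. A rough sketch; the collection name, document id and manually managed fields are assumptions:

import requests

update = [{"id": "porsche-boxster-1996",
           "Cool": {"set": "yes"},
           "foo": {"set": "bar"}}]
requests.post("http://localhost:8983/solr/cars/update?commit=true", json=update)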

Resources