Comparing data in an Azure Index to data that is about to be uploaded

I have an index in the Azure Cognitive Search service. I'm writing a program to automate the upload of new data to this index, and I don't want to unnecessarily delete and re-create the index from scratch each time. Is there a way of comparing what is currently in the index with the data I am about to upload, without having to download that data first and compare it manually? I have been looking at the MS documentation and other articles but cannot see a way to do this comparison.

You can use the MergeOrUpload operation: if the document is not in the index it will be inserted, otherwise it will be updated.
Make sure the document IDs match, otherwise you'll end up always adding new items.
IndexAction.MergeOrUpload(
    new Customer()
    {
        Id = "....",
        UpdatedBy = new
        {
            Id = "..."
        }
    }
)
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.search.models.indexactiontype?view=azure-dotnet
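If you're not on the .NET SDK, the same action is exposed through the service's REST API, where each document in the batch carries a @search.action field. A minimal Node.js sketch for illustration (the service name, index name, key field name and admin key are placeholders, and the api-version shown is only an example):
// Sketch: mergeOrUpload via the Azure Cognitive Search REST API.
// <service>, <index>, <admin-key> and the key field name are assumptions.
const body = {
    value: [
        {
            "@search.action": "mergeOrUpload", // insert if missing, merge if present
            id: "...." // must match the index's key field for updates to hit the same document
        }
    ]
};

fetch("https://<service>.search.windows.net/indexes/<index>/docs/index?api-version=2020-06-30", {
    method: "POST",
    headers: { "Content-Type": "application/json", "api-key": "<admin-key>" },
    body: JSON.stringify(body)
}).then(res => res.json()).then(console.log);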

Related

Why does my Azure Cosmos DB SQL API Container Refuse Multiple Items With Same Partition Key Value?

In Azure Cosmos DB (SQL API) I've created a container whose "partition key" is set to /part_key and I am now trying to create and edit data in Data Explorer.
I created an item that looks like this:
{
    "id": "test_id",
    "value": "val000",
    "magicNumber": 32,
    "part_key": "asdf"
}
I am now trying to create an item that looks like this:
{
    "id": "frank",
    "value": "val001",
    "magicNumber": 33,
    "part_key": "asdf"
}
Based on the documentation I believe that each item within a partition key needs a distinct id, which to me implies that multiple items can in fact share a partition key, which makes a lot of sense.
However, I get an error when I try to save this second item:
{"code":409,"body":{"code":"Conflict","message":"Entity with the specified id already exists in the system...
I see that if I change the value of part_key to something else (say asdf2), then I can save this new item.
Either my expectations about this functionality are wrong, or else I'm doing this wrong somehow. What is wrong here?
Your understanding is correct. This happens when you try to insert a new document with an id equal to the id of an existing document. That is not allowed, so the operation fails.
Before you insert the modified copy, you need to assign a new id to it. I tested the scenario and it works fine. Maybe try creating a new document and check.
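To illustrate the point, a short sketch with the @azure/cosmos Node SDK (endpoint, key, database and container names are placeholders): create() fails with 409 when the (id, partition key) pair already exists, while upsert() replaces the existing item instead.
// Sketch with the @azure/cosmos Node SDK; endpoint, key, database
// and container names are placeholders for illustration.
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: "<endpoint>", key: "<key>" });
const container = client.database("<db>").container("<container>");

async function demo() {
    // Succeeds: this (id, part_key) pair does not exist yet.
    await container.items.create({ id: "frank", part_key: "asdf", value: "val001" });

    // Fails with 409 Conflict: same id within the same partition.
    try {
        await container.items.create({ id: "frank", part_key: "asdf", value: "val002" });
    } catch (err) {
        console.log(err.code); // 409
    }

    // upsert() replaces the existing item instead of failing.
    await container.items.upsert({ id: "frank", part_key: "asdf", value: "val002" });
}

demo();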

Rethinkdb replace document if document exists, else insert document

I would like to insert a document if it doesn't exist (client_nr not found).
If it exists, replace the whole document with the new values.
The only other thing is that client_nr is not the primary key; the primary key is the default id generated by RethinkDB.
I tried the code below in Node.js, but nothing happened. The data is in the variable jsonArray, and I use the for loop to go through the whole jsonArray.
Any idea how to solve this problem?
Thanks!!!
for (var Ticker in jsonArray) {
    r.db(db).table('trades').filter({client_nr: jsonArray[Ticker].client_nr}).forEach(function(post) {
        return r.branch(
            post.eq(null),
            r.db(db).table('log').insert(jsonArray[Ticker]),
            r.db(db).table('log').replace(jsonArray[Ticker])
        )
    }).run()
}
This is much easier to do if client_nr is your primary key. I'd consider doing that instead of using the autogenerated IDs. That will also enforce uniqueness on the field, which is probably what you want.
I was also a little confused by your example because your description made it sound like you wanted to be inserting/replacing into the same table that you're filtering on, but your example is referencing two different tables.
Assuming you want to be using a single table, something like this should do it:
TABLE.filter({client_nr: jsonArray[Ticker].client_nr}).replace(function(row) {
    return r(jsonArray[Ticker]).merge(row.pluck('id'));
}).do(function(res) {
    return r.branch(
        res('replaced').add(res('unchanged')).eq(0),
        TABLE.insert(jsonArray[Ticker]),
        res);
})
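If you do make client_nr the primary key as suggested above, the whole insert-or-replace collapses into a single query using ReQL's conflict option. A minimal sketch (an open connection conn is assumed):
// Sketch: with client_nr as the primary key, insert with a conflict
// strategy does the insert-or-replace in one query.
// `conn` is assumed to be an open connection.
r.db(db).table('trades')
    .insert(jsonArray[Ticker], {conflict: 'replace'}) // replace the whole document if the key exists
    .run(conn);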

Neo4j-php retrieve node

I have been exclusively using Cypher queries with this client for Neo4j because there is no out-of-the-box way of doing many things. One of those is retrieving nodes: there is no way to retrieve them without knowing their id, which is very low level. Any idea on how to run a
$client->findOne('property','value');
?
It should be straightforward but it isn't from the documentation.
Make indexes on the properties you want to search. From a newly created $personNode:
$personIndex = new \Everyman\Neo4j\NodeIndex($client, 'person');
$personIndex->add($personNode, 'name', $personNode->name);
Then later, to search, the new PHP object $personIndex will reference the same populated index as above.
$personIndex = new \Everyman\Neo4j\NodeIndex($client, 'person');
$match = $personIndex->findOne('name', 'edoceo');

Referencing external doc in CouchDB view

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings, adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape, and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull in another doc in the view, everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare it. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
    if (doc._id.slice(0, 1) !== '$' && doc._id.slice(0, 1) !== '_') {
        var otherDoc = lookup('$test' + doc._id);
        if (otherDoc) {
            var keys = doc.value.keys();
            var same = true;
            keys.forEach(function(key) {
                if ((key.slice(0, 1) !== '_') && (key.slice(0, 1) !== '$') && (key !== 'expires')) {
                    if (!Object.equal(otherDoc[key], doc[key])) {
                        same = false;
                    }
                }
            });
            if (!same) {
                emit(doc._id, 1);
            }
        }
    }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities:
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of having your different batch jobs create different docs, have them place their results into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } }. Then your validation function could compare them, or you could create a view that indexes all the docs that don't match (sketched below). It all depends on where in your pipeline you want to do the error checking and correction.
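To make that last option concrete, here is a minimal sketch of a map function over the combined-doc structure above. The JSON.stringify comparison is a naive stand-in for a deep equality check and assumes both batches serialize their keys in the same order:
// Sketch: map function that indexes only the combined docs whose
// two batches differ. Assumes the batch_one/batch_two structure above.
function(doc) {
    if (doc.batch_one && doc.batch_two &&
            JSON.stringify(doc.batch_one) !== JSON.stringify(doc.batch_two)) {
        emit(doc._id, 1);
    }
}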
Personally I like the last option best, but only if you don't plan to use the database as-is in production; i.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.

Insert/update Doctrine object from Excel

On the project I am currently working on, I have to read an Excel file (with over 1000 rows), extract all of its rows, and insert/update them in a database table.
In terms of performance, is it better to add all the records to a Doctrine_Collection and insert/update them afterwards using the fromArray() method? The other possible approach is to create a new object for each row (an Excel row becomes an object) and then save it, but I think that is worse in terms of performance.
Every time the Excel file is uploaded, its rows need to be compared to the existing objects in the database. If a row does not exist as an object, it should be inserted, otherwise updated. My first approach was to turn both the objects and the rows into arrays (or Doctrine_Collections) and then compare the two before performing the needed operations.
Can anyone suggest any other possible approach?
We did a bit of this in a project recently with CSV data. It was fairly painless. There's a symfony plugin, tmCsvPlugin, but we have extended it quite a bit since, so the version in the plugin repo is pretty out of date. Must add that to the #TODO list :)
Question 1:
I don't explicitly know about performance, but I would guess that adding the records to a Doctrine_Collection and then calling Doctrine_Collection::save() would be the neatest approach. It would also come in handy if an exception was thrown somewhere and you had to roll back on your last save.
Question 2:
If you can use a row field as a unique identifier (let's assume a username), then you can search for an existing record. If you find one, and assuming that your imported row is an array, use Doctrine_Record::synchronizeWithArray() to update the record, then add it to a Doctrine_Collection. When complete, just call Doctrine_Collection::save().
A fairly rough 'n' ready implementation:
// set up a new collection
$collection = new Doctrine_Collection('User');

// assuming $row is an associative
// array representing one imported row.
foreach ($importedRows as $row) {

    // try to find an existing record
    // based on a unique identifier.
    $user = Doctrine_Core::getTable('User')
        ->findOneByUsername($row['username']);

    // create a new user record if
    // no existing record is found.
    if (!$user instanceof User) {
        $user = new User();
    }

    // sync record with current data.
    $user->synchronizeWithArray($row);

    // add to collection.
    $collection->add($user);
}

// done. save collection.
$collection->save();
Pretty rough but something like this worked well for me. This is assuming that you can use your imported row data in some way to serve as a unique identifier.
NOTE: be wary of synchronizeWithArray() if you're using sf1.2/doctrine 1.0 - if I remember correctly it was not implemented correctly. It works fine in Doctrine 1.2 though.
I have never worked with Doctrine_Collections, but I can answer in terms of database queries and code logic in a broader sense. I would apply the following logic:
1. Fetch all the rows of the corresponding database table in a single query and store them in an array $storedSheet.
2. Create a single array of all the rows of the uploaded Excel sheet; call it $uploadedSheet. The structures of $uploadedSheet and $storedSheet will be similar (both two-dimensional: rows and cells can be identified and compared).
3. Run foreach loops on $uploadedSheet as follows, and only identify which rows need to be inserted and which updated (do the actual queries later):
$rowsToBeUpdated = array();
$rowsToBeInserted = array();
foreach ($uploadedSheet as $row => $eachRow) {
    if (is_array($storedSheet[$row])) {
        foreach ($eachRow as $column => $value) {
            if ($value != $storedSheet[$row][$column]) {
                // This is a representation of comparison.
                $rowsToBeUpdated[$row] = true;
                break; // No need to check this row anymore - one difference detected.
            }
        }
    } else {
        $rowsToBeInserted[$row] = true;
    }
}
4. This way you have two arrays. Now perform two database queries:
bulk insert all those rows of $uploadedSheet whose row numbers are stored in the $rowsToBeInserted array;
bulk update all those rows of $uploadedSheet whose row numbers are stored in the $rowsToBeUpdated array.
These bulk queries are the key to faster performance.
Let me know if this helped, or you wanted to know something else.
