DocumentDB and Azure Search: Document removed from DocumentDB isn't updated in the Azure Search index

When I remove a document from DocumentDB, it won't be removed from the Azure Search index. The index does update if I change something in a document.
I'm not quite sure how I should use this "SoftDeleteColumnDeletionDetectionPolicy" in the data source.
My data source is as follows:
{
    "name": "mydocdbdatasource",
    "type": "documentdb",
    "credentials": {
        "connectionString": "AccountEndpoint=https://myDocDbEndpoint.documents.azure.com;AccountKey=myDocDbAuthKey;Database=myDocDbDatabaseId"
    },
    "container": {
        "name": "myDocDbCollectionId",
        "query": "SELECT s.id, s.Title, s.Abstract, s._ts FROM Sessions s WHERE s._ts > @HighWaterMark"
    },
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "_ts"
    },
    "dataDeletionDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": "isDeleted",
        "softDeleteMarkerValue": "true"
    }
}
And I have followed this guide:
https://azure.microsoft.com/en-us/documentation/articles/documentdb-search-indexer/
What am I doing wrong? Am I missing something?

I will describe what I understand about SoftDeleteColumnDeletionDetectionPolicy in a data source. As the name suggests, it is a soft delete policy, not a hard delete policy. In other words, the data is still there in your data source, but it is marked as deleted.
Essentially, the way it works is that the Search service periodically queries the data source and checks for entries that are deleted by inspecting the value of the attribute defined in SoftDeleteColumnDeletionDetectionPolicy. In your case, it queries the DocumentDB collection and finds the documents whose isDeleted attribute is true, then removes the matching documents from the index.
The reason it is not working for you is that you are actually deleting the records instead of changing the value of isDeleted from false to true. Thus the indexer never finds matching values, and no changes are made to the index.
One thing you could do is soft-delete documents in your DocumentDB collection to begin with, instead of hard-deleting them. When the Search service re-indexes your data, the soft-deleted documents will be removed from the index. Then, to save storage costs at the DocumentDB level, you simply delete these documents through a background process some time later.
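As a rough sketch (the field values are placeholders, and whether you store the marker as a boolean or as the string "true" should line up with the softDeleteMarkerValue you configured), a soft-deleted document would stay in the collection looking something like this; note that the data source query should then also project s.isDeleted so the indexer can see it:
{
    "id": "session-123",
    "Title": "Some session",
    "Abstract": "Some abstract",
    "isDeleted": true,
    "_ts": 1464505944
}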

Related

Why does my Azure Cosmos DB SQL API Container Refuse Multiple Items With Same Partition Key Value?

In Azure Cosmos DB (SQL API) I've created a container whose "partition key" is set to /part_key and I am now trying to create and edit data in Data Explorer.
I created an item that looks like this:
{
    "id": "test_id",
    "value": "val000",
    "magicNumber": 32,
    "part_key": "asdf"
}
I am now trying to create an item that looks like this:
{
    "id": "frank",
    "value": "val001",
    "magicNumber": 33,
    "part_key": "asdf"
}
Based on the documentation I believe that each item within a partition key needs a distinct id, which to me implies that multiple items can in fact share a partition key, which makes a lot of sense.
However, I get an error when I try to save this second item:
{"code":409,"body":{"code":"Conflict","message":"Entity with the specified id already exists in the system...
I see that if I change the value of part_key to something else (say asdf2), then I can save this new item.
Either my expectations about this functionality are wrong, or else I'm doing this wrong somehow. What is wrong here?
Your understanding is correct. This error can happen if you try to insert a new document with an id equal to the id of an existing document in the same partition. That is not allowed, so the operation fails.
Before you insert the modified copy, you need to assign a new id to it. I tested the scenario and it works fine. Maybe try creating a brand-new document and check.
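For example (a sketch reusing the values from the question; value and magicNumber are just illustrative), saving this item while the first one still exists would fail with 409, because the id test_id is already taken within partition asdf:
{
    "id": "test_id",
    "value": "val002",
    "magicNumber": 34,
    "part_key": "asdf"
}
An item with a distinct id and the same part_key is accepted, as described above.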

Get DynamoDB items where a nested key exists

I would like to return items where a nested key exists. I have the following table:
"users": [
{
"active": true,
"apps": {
"app-name-1": {
"active": true,
"group": "aaaaaaaaa",
"settings": {}
}
},
"username: "user1"
},
{
"active": true,
"apps": {
"app-name-2": {
"active": true,
"group": "bbbbbb",
"settings": {}
}
},
"username: "user2"
]
So I want to return all users that have "app-name-1" under "apps". Which operation is the best for this purpose?
The question you need to ask yourself isn't just which "operation" to use, but also how you model your data in DynamoDB, i.e., how the JSON array you showed translates into a DynamoDB table with hash and sort keys.
While DynamoDB nominally does support nested attributes, this support is only partial, with some features (notably secondary indexes) not supporting them, so as I'll show it is better not to use them. To model your data without nested attributes, you can use "username" as the hash key and "appname" as the sort key. Each item in this table is then one app belonging to one user. The user's "active" flag is a bit of a problem in this modeling, but you can handle it by using a fake app name for storing such user-level parameters, as in the sketch below.
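For example (a sketch using the attribute names from the question; the fake "_user" app name that holds the user-level attributes is just an illustration), the nested users above would flatten into items like:
{ "username": "user1", "appname": "app-name-1", "active": true, "group": "aaaaaaaaa", "settings": {} }
{ "username": "user1", "appname": "_user", "active": true }
{ "username": "user2", "appname": "app-name-2", "active": true, "group": "bbbbbb", "settings": {} }
{ "username": "user2", "appname": "_user", "active": true }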
This modeling makes it efficient to list all applications belonging to one user (I assume you need this feature as well), but not all users with a certain application. You, however, were looking for the reverse operation: getting the list of users given an app name.
You can get this reverse lookup with a Scan operation, but that is a full-table scan, and accordingly it can be very slow and expensive (you will be paying to read the entire table, even if only part of the data is actually returned to you).
If efficient search by app is important, you should create a global secondary index (GSI) whose hash key is appname and whose sort key is username (i.e., the opposite key order from that of the base table). You can then query this index to get, efficiently, the list of usernames that have a given app.
Note that such a GSI would not have been possible if you had insisted on modeling your "user" item with nested attributes, because GSIs don't support nested attributes as keys.
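A sketch of what that might look like with the low-level DynamoDB API (the table name, index name, and KEYS_ONLY projection are assumptions; a provisioned-capacity table would also need ProvisionedThroughput on the new index). First, an UpdateTable request that creates the index:
{
    "TableName": "users",
    "AttributeDefinitions": [
        { "AttributeName": "appname", "AttributeType": "S" },
        { "AttributeName": "username", "AttributeType": "S" }
    ],
    "GlobalSecondaryIndexUpdates": [
        {
            "Create": {
                "IndexName": "appname-username-index",
                "KeySchema": [
                    { "AttributeName": "appname", "KeyType": "HASH" },
                    { "AttributeName": "username", "KeyType": "RANGE" }
                ],
                "Projection": { "ProjectionType": "KEYS_ONLY" }
            }
        }
    ]
}
Then a Query request against the index returns all usernames that have "app-name-1":
{
    "TableName": "users",
    "IndexName": "appname-username-index",
    "KeyConditionExpression": "appname = :app",
    "ExpressionAttributeValues": { ":app": { "S": "app-name-1" } }
}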

Azure Search Service REST API Delete Error: "Document key cannot be missing or empty."

I am seeing some intermittent and odd behavior when trying to use the Azure Search Service REST API to delete a blob storage blob/document. It works sometimes, and then other times I get this:
The request is invalid. Details: actions : 0: Document key cannot be missing or empty.
Once I start getting this error, I get the same result when I try to delete any of the documents/blobs stored in that index. I do have 'metadata_storage_path' listed as my index key (see below).
I have not been able to get the query to succeed again, or I would examine the differences in Fiddler.
I have also tried the following with no luck:
Resetting and re-running the associated search indexer.
Creating a new indexer & index against the same container and deleting from that.
Creating a new container, indexer, & index and deleting from that.
Any additional suggestions or thoughts?
It was a copy/paste error on my end: "metadata_storage_name" should have been "metadata_storage_path" in my delete request.
[Insert head-banging-on-wall emoji here.]
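For reference, a sketch of a delete request keyed on metadata_storage_path (the key value below is a dummy placeholder; use the key value, typically the base64-encoded storage path, that your index actually contains):
{
    "value": [
        {
            "@search.action": "delete",
            "metadata_storage_path": "aHR0cHM6Ly9teWFjY291bnQuYmxvYi5jb3JlLndpbmRvd3MubmV0L2NvbnRhaW5lci9teWJsb2IucGRm0"
        }
    ]
}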
For those who are still searching for the solution...
Instead of id,
{
    "value": [
        {
            "@search.action": "delete",
            "id": "TDVRT0FPQXcxZGtTQUFBQUFBQUFBQT090fdf"
        }
    ]
}
use the rid of your document to delete it:
{
    "value": [
        {
            "@search.action": "delete",
            "rid": "TDVRT0FPQXcxZGtTQUFBQUFBQUFBQT090fdf"
        }
    ]
}
This is because, while creating the search index, you might have selected rid as your unique id (key) column.
Note: you can delete a document only by its unique id column.

Query from Azure Cosmos DB and save to Azure Table Storage using Data Factory

I want to save C._ts + C.ttl as one entity in my Azure Table Storage. I use the following query in my Copy Activity:
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "#concat('SELECT (C.ts+C.ttl) FROM C WHERE (C.ttl+C._ts)<= ', string(pipeline().parameters.bufferdays))",
"type": "Expression"
},
"nestingSeparator": "."
},
I don't want to copy all the fields from my source (Cosmos DB) to my sink (Table Storage). I just want to store the result of this query as one value. How can I do that?
According to my test, I presume you got a null value from your query because the collection-level TTL affects each document but does not generate a ttl property within the documents.
So when you execute SELECT c.ttl, c._ts FROM c, you only get _ts back; ttl is not defined at the document level, and the documents simply follow the collection-level TTL.
You need to bulk-add a ttl property to each document so that you can transfer the result of the _ts + ttl calculation.
Your Copy Activity settings look good; just add an alias in the SQL, or set the name of the field via column mapping.
Hope it helps you.
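For instance, a sketch of the source query with an alias (the alias name expiry is just an example; the sink column can then be mapped to it):
"query": {
    "value": "@concat('SELECT (C._ts + C.ttl) AS expiry FROM C WHERE (C.ttl + C._ts) <= ', string(pipeline().parameters.bufferdays))",
    "type": "Expression"
}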

Elasticsearch Mapping lost on sails lift with mongo-connector

I am developing an application using MongoDB, Sails.js and Elasticsearch.
MongoDB is used to write the records that are retrieved for the application. Elasticsearch is used for text search, geo-location distance search, etc.
I am using mongo-connector to keep my data in sync from MongoDB to Elasticsearch.
The issue is that I am not able to maintain my mappings (geo_point for the fields that store lat and lon, parent/child relations, analyzers, etc.). Every time the Sails server is lifted, I see in the Elasticsearch logs that all the mappings are removed, created and updated, and I lose the geo_point mapping that I had created manually via the REST API after everything was up and running, or even the mapping I created at Sails bootstrap time (as a workaround).
I have also tried to create a mapping file and place it in elasticsearch/config/mappings/index/mymapping.json, but I get an error:
Caused by: org.elasticsearch.index.mapper.MapperParsingException: Root type mapping not empty after parsing! Remaining fields: ...
Here I tried all the combinations to make this work, but with no success, e.g.
{"mappings" : {
"locations" : {
"dynamic": "false",
"properties":{
"location": {
"type": "geo_point"
}
}
}
}
}
I also tried using a template to create the mapping, but after that mongo-connector kicks in and overrides the mapping.
As of now, the only way I can make this work is to stop mongo-connector, delete the oplog.timestamp file, start the Sails server (at bootstrap time I delete and recreate the mapping for that document type), and then start mongo-connector. But this causes accidents if we forget one of the steps.
Am I doing anything wrong, or is there a better way to sync MongoDB to Elasticsearch without losing the custom mapping, or an alternative to mongo-connector?
According to the documentation, if you install a mapping on the filesystem, the file must be named <your_mapping>.json, so in your case it should be named locations.json and be placed either in
elasticsearch/config/mappings/_default/locations.json
or
elasticsearch/config/mappings/<your_index_name>/locations.json
Moreover, your mapping file shouldn't contain the mappings keyword; it should instead look like this:
{
    "locations": {
        "dynamic": "false",
        "properties": {
            "location": {
                "type": "geo_point"
            }
        }
    }
}
You should try again after correctly naming your mapping file and folders.
