Elasticsearch Mapping lost on sails lift with mongo-connector - node.js

I am developing an application using MongoDB, Sails JS and ElasticSearch.
MongoDB is used to write the records that are retrieved for the application. ElasticSearch is used for text search, geolocation distance search, etc.
I am using mongo-connector to keep my data in sync from MongoDB to ElasticSearch.
The issue is that I am not able to maintain my mappings: geo_point for the fields that store lat and lon, parent/child, analyzers, etc. Every time the Sails server is lifted, I see in the Elasticsearch logs that all the mappings are removed, created and updated, and I lose the geo_point mapping that I created manually via the REST API after everything was up and running, or even the one I created at Sails bootstrap time (as a workaround).
I have also tried to create a mapping file and place it in elasticsearch/config/mappings/index/mymapping.json, but I get an error:
Caused by: org.elasticsearch.index.mapper.MapperParsingException: Root type mapping not empty after parsing! Remaining fields: ...
Here I tried all the combinations to make this work, but with no success, e.g.:
{
  "mappings": {
    "locations": {
      "dynamic": "false",
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
I also tried using a template to create the mapping, but after that mongo-connector kicks in and overrides the mapping.
As of now, the only way I can make this work is to stop mongo-connector, delete the oplog.timestamp file, start the Sails server (at bootstrap time I delete and recreate the mapping for that document) and then start mongo-connector. But this causes accidents if we forget a step.
Am I doing anything wrong, or is there a better way to sync MongoDB to Elasticsearch without losing the custom mapping, or an alternative to mongo-connector?

According to the documentation, if you install a mapping on the filesystem, the file must be named <your_mapping>.json, so in your case it should be named locations.json and be placed in either
elasticsearch/config/mappings/_default/locations.json
or
elasticsearch/config/mappings/<your_index_name>/locations.json
Moreover, your mapping file shouldn't contain the mappings keyword; it should instead look like this:
{
  "locations": {
    "dynamic": "false",
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}
You should try again after correctly naming your mapping file and folders.
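As a minimal sketch of the two points above, the Node.js helper below strips a top-level mappings wrapper from a mapping object and derives the file name from the type name. toMappingFile is a hypothetical helper name, not part of Elasticsearch or mongo-connector, and it assumes the object describes exactly one type.

```javascript
'use strict';

// Hypothetical helper: turn a REST-style mapping body into the shape
// described above for a file under config/mappings/<index>/.
function toMappingFile(body) {
  // Drop the outer "mappings" wrapper if it is present.
  const unwrapped = body.mappings ? body.mappings : body;
  const types = Object.keys(unwrapped);
  if (types.length !== 1) {
    throw new Error('expected exactly one type, got ' + types.length);
  }
  // The file is named after the type, e.g. locations.json.
  return {
    fileName: types[0] + '.json',
    content: JSON.stringify(unwrapped, null, 2)
  };
}

const restBody = {
  mappings: {
    locations: {
      dynamic: 'false',
      properties: { location: { type: 'geo_point' } }
    }
  }
};

const file = toMappingFile(restBody);
console.log(file.fileName);
console.log(file.content);
```

Writing file.content to the path given by file.fileName is left out on purpose, since the target directory depends on your installation.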

Related

ExpressJS: How to cache on demand

I'm trying to build a REST API with express, sequelize (PostgreSQL dialect) and node.
Essentially I have two endpoints:
Method   Endpoint       Desc.
GET      /api/players   To get players info, including assets
POST     /api/assets    To create an asset
There is also a mechanism that updates a property (say, price) of the assets every 30 seconds.
Goal
I want to cache the results of GET /api/players, but with some control over the cache: whenever a user creates an asset (using POST /api/assets), the next request to GET /api/players should return the updated data (including the property that updates every 30 seconds) and cache it until it is updated in the next cycle.
Expected
The following should demonstrate it:
GET /api/players
JSON Response:
[
  {
    "name": "John Doe",
    "assets": [
      {
        "id": 1,
        "price": 10
      }
    ]
  }
]
POST /api/assets
JSON Request:
{
"id":2
}
GET /api/players
JSON Response:
{
  "name": "John Doe",
  "assets": [
    {
      "id": 1,
      "price": 10
    },
    {
      "id": 2,
      "price": 7.99
    }
  ]
}
What I have managed to do so far
I have made the routes, but GET /api/players has no cache mechanism and basically queries the database every time it is requested.
Some solutions I have found, but none seem to meet my scenario
apicache (https://www.youtube.com/watch?v=ZGymN8aFsv4&t=1360s): But I don't have a specific duration, because a user can create an asset anytime.
Example implementation
I have seen a (kind of) similar implementation of what I want in GitHub Actions workflows for caching, where you define a key and, unless the key has changed, it reuses the same packages instead of installing them every time (example: https://github.com/python-discord/quackstack/blob/6792fd5868f28573bb8f9565977df84e7ba50f42/.github/workflows/quackstack.yml#L39-L52).
Is there any package to do that? That way, while processing POST /api/assets, I can change the key in its handler so that GET /api/players gives me the updated result (I can also change the key in that 30-second cycle), and after that it gives me the cached result (until it is updated in the next cycle).
Note: If you have a solution, please try to stick with npm packages rather than something like Redis, unless it's the only/best solution.
Thanks in advance!
(P.S. I'm a beginner and this is my first question in SO)
Typically, caching is done with the help of Redis, an in-memory key-value store. You could handle the cache in the following manner:
In your handler for the POST operation, update or reset the cached entry for players.
In your handler for the GET operation, if Redis has the entry in its cache, return it; otherwise query the data, add the entry to the cache and return the data.
Alternatively, you could use Memcached.
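The pattern described here (often called cache-aside) does not depend on Redis itself. The sketch below uses a plain Map as a stand-in for the store so that it runs without any dependency; createCache, getOrLoad, invalidate and loadPlayers are hypothetical names, and loadPlayers only simulates the database query behind GET /api/players.

```javascript
'use strict';

// Minimal cache-aside store; a Map stands in for Redis or Memcached.
function createCache() {
  const store = new Map();
  return {
    // Return the cached value, or call loader() once and cache its result.
    getOrLoad(key, loader) {
      if (!store.has(key)) {
        store.set(key, loader());
      }
      return store.get(key);
    },
    // Drop one entry so that the next read goes back to the source.
    invalidate(key) {
      store.delete(key);
    }
  };
}

// Hypothetical stand-in for the database query behind GET /api/players.
let dbReads = 0;
const assets = [{ id: 1, price: 10 }];
function loadPlayers() {
  dbReads += 1;
  return [{ name: 'John Doe', assets: assets.map((a) => Object.assign({}, a)) }];
}

const cache = createCache();
cache.getOrLoad('players', loadPlayers); // miss: reads the "database"
cache.getOrLoad('players', loadPlayers); // hit: served from the cache
assets.push({ id: 2, price: 7.99 });     // what POST /api/assets would do
cache.invalidate('players');             // reset the cached entry
const players = cache.getOrLoad('players', loadPlayers); // miss again

console.log(dbReads, players[0].assets.length);
```

The 30-second price update could call invalidate('players') in the same way. With an asynchronous loader, the cached value would be the returned promise, which callers can await.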
A bit late to this answer but I was looking for a similar solution. I found that the apicache library not only allows for caching for specified durations, but the cache can also be manually cleared.
apicache.clear([target]) - clears cache target (key or group), or entire cache if no value passed, returns new index.
Here is an example for your implementation:
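// POST /api/assets
app.post('/api/assets', function (req, res, next) {
  // create the asset here, then clear the cache so that the next
  // GET /api/players is served fresh and cached again
  apicache.clear()
  // or only clear the specific players cache by using a parameter
  // apicache.clear('players')
  // response is whatever payload you return for the created asset
  res.send(response)
})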

U-SQL: How to skip files from analysis based on content

I have a lot of files each containing a set of json objects like this:
{ "Id": "1", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "2", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
{ "Id": "3", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}
etc.
Each file contains information about a specific type of session. In this case they are sessions from a web app, but they could also be sessions from a desktop app. In that case the value for Origin is "DesktopClient" instead of "WebClient".
For analysis purposes say I am only interested in DesktopClient sessions.
All files representing a session are stored in Azure Blob Storage like this:
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef77.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef78.json
container/2017/07/20/00399076-2b88-4dbc-ba56-c7afeeb9ef79.json
Is it possible to skip files whose first line already makes it clear that they are not DesktopClient session files, as in my example? I think it would save a lot of query resources if files that I know do not contain the right session type could be skipped, since they can be quite big.
At the moment my query reads the data like this:
@RawExtract = EXTRACT [RawString] string
    FROM @"wasb://plancare-events-blobs@centrallogging/2017/07/20/{*}.json"
    USING Extractors.Text(delimiter:'\b', quoting : false);
@ParsedJSONLines = SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple([RawString]) AS JSONLine
    FROM @RawExtract;
...
Or should I create my own version of Extractors.Text and, if so, how should I do that?
To answer some questions that popped up in the comments to the question first:
At this point we do not provide access to the Blob Store meta data. That means that you need to express any meta data either as part of the data in the file or as part of the file name (or path).
Depending on the cost of extraction and the sizes of the files, you can extract all the rows and then filter out the rows whose beginning does not fit your criteria. That will extract all files and all rows from all files, but does not need a custom extractor.
Alternatively, write a custom extractor that processes only the files that are appropriate (that may be useful if the first solution does not give you the performance you need and you can determine the conditions efficiently inside the extractor). Several example extractors can be found at http://usql.io in the example directory (including an example JSON extractor).
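The two options differ only in where the check happens. The sketch below shows both checks in JavaScript purely to illustrate the logic; it is not U-SQL, and a real custom extractor would be written in C#. filterByOrigin and fileMatches are hypothetical names, and the second function assumes that every row in a file shares the same Origin, as in the question.

```javascript
'use strict';

// Option 1: read every row, then keep only those with the wanted origin.
function filterByOrigin(lines, origin) {
  return lines
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line))
    .filter((row) => Boolean(row.Session) && row.Session.Origin === origin);
}

// Option 2: decide from the first line alone whether a file is worth reading.
function fileMatches(firstLine, origin) {
  const row = JSON.parse(firstLine);
  return Boolean(row.Session) && row.Session.Origin === origin;
}

const webFile = [
  '{ "Id": "1", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}',
  '{ "Id": "2", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "WebClient" }}'
];
const desktopFile = [
  '{ "Id": "3", "Timestamp":"2017-07-20T10:43:21.8841599+02:00", "Session": { "Origin": "DesktopClient" }}'
];

const keptRows = filterByOrigin(webFile.concat(desktopFile), 'DesktopClient');
const skipWebFile = !fileMatches(webFile[0], 'DesktopClient');

console.log(keptRows.length, skipWebFile);
```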

CouchDB: procedure to add a CommonJS module (show/list function)

What is the exact procedure to add a CommonJS module to CouchDB?
I've read tutorials like:
https://caolan.org/posts/commonjs_modules_in_couchdb.html
from official doc:
http://docs.couchdb.org/en/1.6.1/query-server/javascript.html#commonjs-modules
The CommonJS module can be added to a design document, like so:
{
  "views": {
    "lib": {
      "security": "function user_context(userctx, secobj) { ... }"
    }
  },
  "validate_doc_update": "function(newdoc, olddoc, userctx, secobj) {
    user = require('lib/security').user(userctx, secobj);
    return user.is_admin();
  }",
  "_id": "_design/test"
}
But where do I copy or paste that code? Must I save it to a file and add it with curl? In Fauxton I don't see where.
Managing CouchDB design documents is usually best accomplished via a tool like couchapp. It allows you to package up a directory of files and outputs a CouchDB design document.
You can manually edit that JSON in the futon/fauxton editor, but it's a pain and there are other tools out there depending on your toolchain. An external tool like this also aids in deployment, particularly across different environments.
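To make the idea concrete: a design document is an ordinary JSON document whose functions and modules are stored as strings, so any script can assemble it. The Node.js sketch below builds one by hand; the module path lib/security, the is_admin export and the forbidden message are illustrative assumptions, not taken from the CouchDB docs.

```javascript
'use strict';

// Source of a CommonJS module: whatever is assigned to `exports` becomes
// visible to require(). It is stored as a string in the design document.
const securityModule = [
  'exports.is_admin = function (userctx) {',
  "  return userctx.roles.indexOf('_admin') !== -1;",
  '};'
].join('\n');

// Written as a real function so that an editor can check the syntax, then
// stringified. `require` is resolved by CouchDB's query server at run time;
// this function is never called from Node.
function validate(newDoc, oldDoc, userCtx, secObj) {
  var security = require('lib/security');
  if (!security.is_admin(userCtx)) {
    throw { forbidden: 'Only admins may write to this database.' };
  }
}

const designDoc = {
  _id: '_design/test',
  lib: { security: securityModule },
  validate_doc_update: validate.toString()
};

// This JSON is the document body to save (for example as design.json).
const json = JSON.stringify(designDoc, null, 2);
console.log(json);
```

The resulting file could then be uploaded with something like curl -X PUT http://localhost:5984/mydb/_design/test -H "Content-Type: application/json" -d @design.json, where the host and the database name mydb are assumptions. Tools such as couchapp automate this assembly from a directory of files.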

DocumentDB and Azure Search: Document removed from documentDB isn't updated in Azure Search index

When I remove a document from DocumentDB, it is not removed from the Azure Search index. The index does update if I change something in a document.
I'm not quite sure how I should use this "SoftDeleteColumnDeletionDetectionPolicy" in the data source.
My data source is as follows:
{
  "name": "mydocdbdatasource",
  "type": "documentdb",
  "credentials": {
    "connectionString": "AccountEndpoint=https://myDocDbEndpoint.documents.azure.com;AccountKey=myDocDbAuthKey;Database=myDocDbDatabaseId"
  },
  "container": {
    "name": "myDocDbCollectionId",
    "query": "SELECT s.id, s.Title, s.Abstract, s._ts FROM Sessions s WHERE s._ts > @HighWaterMark"
  },
  "dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "_ts"
  },
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
    "softDeleteColumnName": "isDeleted",
    "softDeleteMarkerValue": "true"
  }
}
I have followed this guide:
https://azure.microsoft.com/en-us/documentation/articles/documentdb-search-indexer/
What am I doing wrong? Am I missing something?
I will describe what I understand about SoftDeleteColumnDeletionDetectionPolicy in a data source. As the name suggests, it is Soft Delete policy and not the Hard Delete policy. Or in other words, the data is still there in your data source but it is somehow marked as deleted.
Essentially, the Search service periodically queries the data source and looks for entries that are marked as deleted by checking the value of the attribute defined in SoftDeleteColumnDeletionDetectionPolicy. So in your case, it will query the DocumentDB collection and find the documents whose isDeleted attribute is true. It then removes the matching documents from the index.
The reason it is not working for you is that you are actually deleting the records instead of changing the value of isDeleted from false to true. Thus it never finds matching values and no changes are made to the index.
One thing you could possibly do is instead of doing Hard Delete, you do Soft Delete in your DocumentDB collection to begin with. When the Search Service re-indexes your data, because the document is soft deleted from the source it will be removed from the index. Then to save storage costs at the DocumentDB level, you simply delete these documents through a background process some time later.
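A minimal sketch of that two-phase approach, in plain JavaScript over in-memory objects: softDelete flags a document instead of removing it, and readyToPurge selects flagged documents that are old enough for the background cleanup. Both function names, the deletedAt field and the one-hour grace period are assumptions for illustration; a real implementation would write the flagged document back to DocumentDB so that its _ts changes and the indexer picks it up.

```javascript
'use strict';

// Soft delete: keep the document, but flag it so the indexer can see it.
// Returns a copy; the caller is expected to save it back to the collection.
function softDelete(doc, nowSeconds) {
  return Object.assign({}, doc, { isDeleted: true, deletedAt: nowSeconds });
}

// Background cleanup: select flagged documents that have been marked long
// enough for the indexer to have processed them, so they can be hard deleted.
function readyToPurge(docs, nowSeconds, graceSeconds) {
  return docs.filter(
    (d) => d.isDeleted === true && nowSeconds - d.deletedAt >= graceSeconds
  );
}

const session = { id: '42', Title: 'Keynote', _ts: 1000 };
const flagged = softDelete(session, 2000);
const purgeable = readyToPurge([session, flagged], 2000 + 3600, 3600);

console.log(flagged.isDeleted, purgeable.length);
```

With a custom query like the one in the question, the isDeleted property probably also has to appear in the SELECT list so that the indexer can read it; that is worth verifying against the Azure Search documentation.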

Change notification in CouchDB when a field is set

I'm trying to get notifications in a CouchDB change poll as soon as a predefined field is set or changed. I've already had a look at filters that can be used for filtering change events (db/_changes?filter=myfilter). However, I've not yet found a way to include this temporal information, because you can only get the current version of the document in these filter functions.
Is there any possibility to create such a filter?
If it does not work, I could export my field to a separate database and then only poll for changes in that db, but I'd prefer to keep my data together for obvious reasons.
Thanks in advance!
You are correct: filters and _changes feeds can only see snapshots of a document. What you need is a function which can see the old document and the new document and act correctly. But that is unavailable in _filters and _changes.
Obviously, your client code knows whether it updates that field. You could change your client code; however, there is a better solution.
Update functions can access both documents. I suggest you make an _update
function which notices the field change and flags that in the document. Next you
have a simple filter checking for that flag. The best part is, you can use a
rewrite function to make the HTTP API exactly the same as before.
1. Create an update function to flag interesting updates
Your _design/myapp would be {"updates": {"smart_updater": "(see below)"}}.
Update functions are very flexible (see my recent update handlers
walkthrough). However we only want to mimic the normal HTTP/JSON API.
Your updates.smart_updater field would look like this:
function (doc, req) {
  var INTERESTING = 'dollars'; // Set me to the interesting field.
  var newDoc = JSON.parse(req.body);
  if (newDoc.hasOwnProperty(INTERESTING)) {
    // dollars was set (which includes 0, false, null, undefined
    // values). You might test for newDoc[INTERESTING] if those
    // values should not trigger this code.
    if ((doc === null) || (doc[INTERESTING] !== newDoc[INTERESTING])) {
      // The field was changed or created!
      newDoc.i_was_changed = true;
    }
  }
  if (!newDoc._id) {
    // A UUID generator would be better here.
    newDoc._id = req.id || Math.random().toString();
  }
  // Return the same JSON the vanilla Couch API does.
  return [newDoc, {json: {'id': newDoc._id}}];
}
Now you can PUT or POST to /db/_design/myapp/_update/smart_updater/[doc_id] and it will feel
just like the normal API, except that if you update the dollars field, it will add
an additional flag, i_was_changed. That is how you will find this change
later.
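Because an update function is plain JavaScript, its flagging logic can be tried outside CouchDB before it goes into the design document. The sketch below repeats the function under the name smartUpdater and calls it with a fake request object; fakeReq is a hypothetical helper that only mimics the two request fields the function reads (id and body).

```javascript
'use strict';

// Copy of the update function above, with the same behavior.
function smartUpdater(doc, req) {
  var INTERESTING = 'dollars';
  var newDoc = JSON.parse(req.body);
  if (newDoc.hasOwnProperty(INTERESTING)) {
    if ((doc === null) || (doc[INTERESTING] !== newDoc[INTERESTING])) {
      newDoc.i_was_changed = true;
    }
  }
  if (!newDoc._id) {
    newDoc._id = req.id || Math.random().toString();
  }
  return [newDoc, { json: { id: newDoc._id } }];
}

// Hypothetical helper that fakes the request object CouchDB passes in.
function fakeReq(id, body) {
  return { id: id, body: JSON.stringify(body) };
}

const stored = { _id: 'a', dollars: 5 };

// New document: doc is null, so setting the field counts as a change.
const created = smartUpdater(null, fakeReq('a', { dollars: 5 }))[0];
// Same value as the stored document: no flag is added.
const unchanged = smartUpdater(stored, fakeReq('a', { _id: 'a', dollars: 5 }))[0];
// Different value: the flag is added.
const changed = smartUpdater(stored, fakeReq('a', { _id: 'a', dollars: 9 }))[0];

console.log(created.i_was_changed, unchanged.i_was_changed, changed.i_was_changed);
```

Note that the flag lives only in the body that was just written, so an update that leaves the field untouched stores a document without i_was_changed.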
2. Filter for documents with the changed field
This is very straightforward:
function(doc, req) {
return doc.i_was_changed;
}
Now you can query the _changes feed with a ?filter= parameter. (Replication
also supports this filter, so you could pull to your local system all documents
which most recently changed/created the field.)
That is the basic idea. The remaining steps will make your life easier if you
already have lots of client code and do not want to change the URLs.
3. Use rewriting to keep the HTTP API the same
This is available in CouchDB 0.11, and the best resource is Jan's blog post,
nice URLs in CouchDB.
Briefly, you want a vhost which sends all traffic to your rewriter (which itself
is a flexible "bouncer" to all design doc functionality based on the URL).
curl -X PUT http://example.com:5984/_config/vhosts/example.com \
-d '"/db/_design/myapp/_rewrite"'
Then you want a rewrites field in your design doc, something like (not
tested)
[
{
"comment": "Updates should go through the update function",
"method": "PUT",
"from": "db/*",
"to" : "db/_design/myapp/_update/*"
},
{
"comment": "Creates should go through the update function",
"method": "POST",
"from": "db/*",
"to" : "db/_design/myapp/_update/*"
},
{
"comment": "Everything else is just like normal",
"from": "*",
"to" : "../../../*"
}
]
(Once again, I got this code from examples and existing code I have lying
around, but it's not 100% debugged. However, I think it makes the idea very clear.
Also remember that this step is optional; the advantage is that you never have to
change your client code.)
