The documentation has the following example for merging a relationship when all of the end-points exist. If the MERGE were run from the JavaScript client, how can I find out whether the MERGE created the nodes and relationship or not? In other words, how can I deduce whether the nodes were present beforehand?
At the moment, it seems some of the ResultSet statistics methods might have to be used. Any thoughts?
GRAPH.QUERY DEMO_GRAPH
"MERGE (charlie { name: 'Charlie Sheen' })
MERGE (wallStreet:Movie { name: 'Wall Street' })
MERGE (charlie)-[r:ACTED_IN]->(wallStreet)"
The result-set statistics will let you know how many nodes/edges were created. If none were created, the pattern already existed; otherwise, MERGE created it.
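For example, with the redisgraph.js client something like this should work. A minimal sketch, assuming the client's ResultSet statistics API (getStatistics, nodesCreated, relationshipsCreated); exact method names may vary between versions:
const RedisGraph = require('redisgraph.js').Graph;

const graph = new RedisGraph('DEMO_GRAPH');

async function mergeAndCheck() {
  const res = await graph.query(
    "MERGE (charlie { name: 'Charlie Sheen' }) " +
    "MERGE (wallStreet:Movie { name: 'Wall Street' }) " +
    "MERGE (charlie)-[r:ACTED_IN]->(wallStreet)"
  );
  const stats = res.getStatistics();
  // If nothing was created, the whole pattern existed beforehand.
  const created = stats.nodesCreated() > 0 || stats.relationshipsCreated() > 0;
  console.log(created ? 'MERGE created the pattern' : 'pattern existed beforehand');
}

mergeAndCheck().catch(console.error);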
Hello all!
I am trying to use the Aggregate filter plugin of Logstash v7.7 to correlate and combine data from two different CSV file inputs which represent API data calls. The idea is to produce a record showing a combined picture. As you might expect, the data may or may not arrive in the right sequence.
Here is an example:
/data/incoming/source_1/*.csv
StartTime, AckTime, Operation, RefData1, RefData2, OpSpecificData1
231313232,44343545,Register,ref-data-1a,ref-data-2a,op-specific-data-1
979898999,75758383,Register,ref-data-1b,ref-data-2b,op-specific-data-2
354656466,98554321,Cancel,ref-data-1c,ref-data-2c,op-specific-data-2
/data/incoming/source_2/*.csv
FinishTime,Operation,RefData1, RefData2, FinishSpecificData
67657657575,Cancel,ref-data-1c,ref-data-2c,FinishSpecific-Data-1
68445590877,Register,ref-data-1a,ref-data-2a,FinishSpecific-Data-2
55443444313,Register,ref-data-1a,ref-data-2a,FinishSpecific-Data-2
I have a single pipeline that is receiving both these CSVs, and I am able to process and write them as individual records to a single index. However, the idea is to combine records from the two sources into one record, each representing a superset of Operation-related information.
Unfortunately, despite several attempts, I have been unable to figure out how to achieve this via the Aggregate filter plugin. My primary question is whether this is a suitable use of this specific plugin, and if so, any suggestions would be welcome!
At the moment, I have this:
input {
  file {
    path => ['/data/incoming/source_1/*.csv']
    tags => ["source1"]
  }
  file {
    path => ['/data/incoming/source_2/*.csv']
    tags => ["source2"]
  }
}
filter {
  # use the tags to do some source 1 and 2 related massaging, calculations, etc.
  aggregate {
    task_id => "%{Operation}_%{RefData1}_%{RefData2}"
    code => "
      map['source_files'] ||= []
      map['source_files'] << { 'source_file' => event.get('path') }
    "
    push_map_as_event_on_timeout => true
    timeout => 600 # assuming this is the furthest apart they will arrive
  }
  ...
}
output {
  elasticsearch { ... }
}
And other such variations. However, I keep getting individual records written to the index and am unable to get one combined record. Yet again, as you can see from the data set, there's no guarantee of the sequencing of records, so I am wondering if this filter is the right tool for the job to begin with? :-\
Or is it just me not being able to use it right! ;-)
In either case, any inputs/comments/suggestions are welcome. Thanks!
PS: This message is cross-posted from the Elastic forums; I am providing a link there in case some answers pop up there too.
The answer is to use Elasticsearch in upsert mode. Please see the specifics here.
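Concretely, "upsert mode" means having both sources write to the same document id and letting Elasticsearch merge the partial documents. A minimal sketch of the output block, assuming the same composite key as the aggregate example above (the hosts and index name are placeholders):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "operations"
    document_id => "%{Operation}_%{RefData1}_%{RefData2}"
    action => "update"
    doc_as_upsert => true
  }
}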
I recommend, first, ensuring that the information reaches you in order so that the filter can handle it better; secondly, you could set these options in your pipelines.yml: pipeline.workers: 1 and pipeline.ordered: true, thus guaranteeing the order of processing.
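For reference, a minimal sketch of those settings in pipelines.yml (the pipeline id and config path are placeholders):
- pipeline.id: correlate-csv
  path.config: "/etc/logstash/conf.d/correlate.conf"
  pipeline.workers: 1
  pipeline.ordered: true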
I have the following .xlsx file:
My software, regardless of its language, will produce the following graph:
My software iterates line by line, and on each line iteration it executes the following query:
MERGE (A:POINT {x:{xa},y:{ya}}) MERGE (B:POINT {x:{xb},y:{yb}}) MERGE (C:POINT {x:{xc},y:{yc}}) MERGE (A)-[:LINKS]->(B)-[:LINKS]->(C) MERGE (C)-[:LINKS]->(A)
Will this avoid inserting duplicate entries?
According to this question, yes, it will avoid writing duplicate entries.
The query above will match any existing nodes, and it will avoid writing duplicates.
A good rule of thumb: put each node that may be a duplicate into its own MERGE clause, and afterwards write the MERGE statements for each relationship between two nodes.
Update
After some experience with asynchronous technologies such as node.js, or even parallel threads, you must verify that you read the next line AFTER you have inserted the previous one. The reason is that doing multiple insertions asynchronously may result in multiple nodes in your graph that are actually the same one.
In a node.js project of mine, I read the Excel file like this:
// "_" is lodash; "config" holds the worksheet column mapping.
const _ = require('lodash');

const iterateWorksheet = function(worksheet, maxRows, row, callback) {
  process.nextTick(function() {
    // Skip the header row
    if (row === 1) {
      return iterateWorksheet(worksheet, maxRows, 2, callback);
    }
    if (row > maxRows) {
      return;
    }
    // Column letters from 'A' up to (but not including) config.excell.maxColumn
    const alphas = _.range('A'.charCodeAt(0), config.excell.maxColumn.charCodeAt(0));
    let rowData = {};
    _.each(alphas, (column) => {
      column = String.fromCharCode(column);
      const item = column + row;
      const key = config.excell.columnMap[column];
      if (worksheet[item] && key) {
        rowData[key] = worksheet[item].v;
      }
    });
    // The callback performs the insertion into a neo4j db
    return callback(rowData, (error) => {
      if (!error) {
        return iterateWorksheet(worksheet, maxRows, row + 1, callback);
      }
    });
  });
}
As you can see, I visit the next line only once I have successfully inserted the previous one. I have found no way yet to serialize the inserts the way most conventional RDBMSs do.
In the case of web or server applications, another UNTESTED approach is to use a queue server such as RabbitMQ or similar in order to queue the queries. Then the code responsible for the insertion reads from the queue, so the whole isolation concern lives in the queue.
Furthermore, ensure that all inserts happen inside a transaction.
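For example, with the official neo4j-driver package, each row insert can be wrapped in a managed write transaction. A minimal sketch, assuming the rowData keys mirror the query above and the newer $x parameter syntax rather than the older {x} form; the URI and credentials are placeholders, and older driver versions use session.writeTransaction instead of session.executeWrite:
const neo4j = require('neo4j-driver');

// Placeholders: substitute your own URI and credentials.
const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password')
);

async function insertRow(rowData) {
  const session = driver.session();
  try {
    // All five MERGEs run atomically inside one managed transaction.
    await session.executeWrite(tx => tx.run(
      'MERGE (a:POINT {x: $xa, y: $ya}) ' +
      'MERGE (b:POINT {x: $xb, y: $yb}) ' +
      'MERGE (c:POINT {x: $xc, y: $yc}) ' +
      'MERGE (a)-[:LINKS]->(b)-[:LINKS]->(c) ' +
      'MERGE (c)-[:LINKS]->(a)',
      rowData // expects { xa, ya, xb, yb, xc, yc }
    ));
  } finally {
    await session.close();
  }
}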
I'm just starting out with Neo4j and I've been trying to send over simple relationships using Postman. I'm not having any issues sending the nodes, but the second I try to create a relationship on its own, it creates two arbitrary grey nodes and builds the relationship between them. However, when I send a command like this:
CREATE (a:Person { name:'Tom Hanks', born:1956 })
-[r:ACTED_IN { roles: ['Forrest']}]->
(m:Movie { title:'Forrest Gump',released:1994 })
It properly displays the nodes and the relationship between them. See image below for more clarity.
This seems a bit odd, as I would assume you'd be able to easily add nodes or create relationships at any point, rather than only when the nodes are being created.
Any feedback would be greatly appreciated.
Yes, you can create a relationship after the nodes are created by using a MATCH clause.
For example:
MATCH (node1:node),(node2:node)
WHERE node1.val=20 AND node2.val=21
CREATE (node1)-[r:Rel]->(node2)
RETURN r
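Since you're working in Postman, here is a minimal sketch of what that query might look like as a request body against Neo4j's HTTP transactional endpoint (e.g. POST http://localhost:7474/db/neo4j/tx/commit on Neo4j 4.x; the path differs on older versions, and the parameter values are placeholders):
{
  "statements": [
    {
      "statement": "MATCH (node1:node), (node2:node) WHERE node1.val = $v1 AND node2.val = $v2 CREATE (node1)-[r:Rel]->(node2) RETURN r",
      "parameters": { "v1": 20, "v2": 21 }
    }
  ]
}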
I'm working on a custom adapter in sails#0.10.0-rc4 which will support associations but I am having trouble getting them working in conjunction with my adapter. My configuration is a one-to-many association between article and stats. My models and adapter are setup like this:
// api/models/article.js
module.exports = {
  connection: ['myadapter'],
  tableName: 'Knowledge_Base__kav',
  attributes: {
    KnowledgeArticleId: { type: 'string', primaryKey: true },
    stats: {
      collection: 'stats',
      via: 'parentId'
    }
  }
}

// api/models/stats.js
module.exports = {
  connection: ['myadapter'],
  tableName: 'KnowledgeArticleViewStat',
  attributes: {
    count: 'integer',
    ParentId: {
      model: 'article'
    }
  }
}
// adapter.js
find: function(connectionName, collectionName, options, cb) {
  console.dir(options)
  // output
  // {where: null}
  db.query(options, function(err, res) {
    cb(err, res)
  })
}
However, when I try to populate using Article.find().populate('stats').exec(console.log), my adapter gets {where: null} as options, when I would expect it to receive {where: {parentId: [<some-article-id>]}}. It returns a list of articles to me, but the field which is supposed to be populated from the other model (stats) is just an empty list.
I feel like this is related to my adapter not getting the proper where param to search for the related model on the primary key. To test this further, I set up a test one-to-many relationship using the sails-mongo adapter. In that case the adapter did receive the params I expected, and the association worked fine.
Does anyone have any idea on why .populate('stats') wouldn't be sending the proper "where" params to my adapter?
Update 3/7
So it seems that with associations, SomeModel.find() hits the adapter once, then .populate('othermodel') hits the adapter again using the primary keys from the first request, and the results of both are joined together. In my case, the second hit against the adapter isn't happening, for some unknown reason.
Update
The original issue was related to an attribute naming error that's mentioned in the comments below. However, there still appears to be some issue with the final population step mentioned by particlebanana:
Final step will: Take all of the query results from all the returned query operations and combine them in-memory to build up a result set you can return in the exec callback.
I'm seeing that all required queries are now firing but they are failing to actually populate the alias. Here's the call with some added debugging output in the form of a gist for easier consumption: https://gist.github.com/jasonsims/9423170
It looks like you are on the right track! The way the operation sets get built up, the .find() on Article should run with the first log (empty where), and the second query should then run with the parentId criteria in the log. The second query isn't running because Waterline can't build up that parentId array of primary keys when you don't return anything from the first query.
Short answer: you need to return something in the find callback to see the second log, which should match your expected criteria.
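In other words, the find method must hand rows back to Waterline. A minimal sketch (db.query and the row shape are placeholders for whatever your datastore client returns):
find: function(connectionName, collectionName, options, cb) {
  db.query(options, function(err, rows) {
    if (err) return cb(err);
    // Waterline collects the primary keys from these rows to build
    // the child criteria, e.g. { where: { parentId: [ ...ids ] } }
    return cb(null, rows);
  });
}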
The query lifecycle looks something like this:
Check if all query pieces are on the same connection, if not break out which queries will run on which connections
For all queries on a single connection, check if the adapter supports native joins (i.e. has a .join() method); if so, pass the criteria down and let the adapter handle the joins.
If no native join method is defined run the "parent" operation (in this case the Article.find())
Use the results of the parent operation to build up criteria for any populations that need to run (the parentId array in your criteria), and run the child queries.
Take all of the query results from all the returned query operations and combine them in-memory to build up a result set you can return in the exec callback.
I hope that helps some. Shoot me the URL of your repo, if it can be open sourced, and I will look through it and help some more if you come across any issues.
Just to summarize, there were multiple issues going on here which were causing associations not to populate:
Custom primary keys
There was a problem with waterline when joining data from models using custom primary keys. #particlebanana fixed this in 8eff54b and it should be included in the next rc of waterline (waterline#0.10.0-rc5).
Malformed SOQL query
When Waterline queries the adapter a second time in order to acquire the child rows, it does so using { foreignKey: [ value ] }. Since the value was a list, jsforce was incorrectly generating the SOQL query, because it expected all list values to be accompanied by either $in or $nin operators. I addressed this issue in github/jsforce#9 and it's now included in jsforce#1.1.2.
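For illustration, a sketch of the criteria shape jsforce translates cleanly (assuming jsforce's sobject().find() condition syntax; the ids are placeholders):
// jsforce can turn this into: WHERE ParentId IN ('a1', 'a2'),
// whereas a bare array value used to trip up the SOQL generator.
conn.sobject('KnowledgeArticleViewStat')
  .find({ ParentId: { $in: ['a1', 'a2'] } });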
Model attributes are case sensitive
The model attributes in my project were defined in camelCase, but the JSON response from Salesforce used EveryWordCapitalized. This caused 1-to-many joins in Waterline to reduce the many child records to one when it ran _.uniq(childRows, pk): since the model defined pk == id but the actual value returned from Salesforce was pk == Id, the call to uniq blew away all child records but one. I'm not entirely sure whether this should count as a Waterline bug, but fixing the capitalization in the model attribute definitions resolved it.
I am scraping a 90K-record database using JSON-RPC, and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records with a difference between them.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses linking to external docs via emit(), which is too late to let me compare them. In the example below, the hypothetical lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  // NOTE: lookup() is hypothetical; a CouchDB map function cannot
  // actually fetch another document (see the answer below).
  if (doc._id.slice(0, 1) !== '$' && doc._id.slice(0, 1) !== '_') {
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var keys = Object.keys(doc);
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0, 1) !== '_') && (key.slice(0, 1) !== '$') && (key !== 'expires')) {
          // Object.equal stands in for a deep-equality check
          if (!Object.equal(otherDoc[key], doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities:
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values, grouped by which import produced them ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match (see the sketch after this list).
A brute-force approach would be to use a view to pair the docs that should match: each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index, comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of having your different batch jobs create different docs, have them write into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } }. Then your validation function could compare them, or you could create a view that indexes all the docs that don't match. It all depends on where in your pipeline you want to do the error checking and correction.
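To make the first option concrete, here is a minimal sketch of such a view; batch_id and amount are hypothetical field names standing in for however your docs mark the import and the meaningful number:
// map: emit one row per doc, keyed by which import produced it
function(doc) {
  if (doc.batch_id && typeof doc.amount === 'number') {
    emit(doc.batch_id, doc.amount);
  }
}
// reduce: use the built-in "_sum", then query with ?group=true
// and compare the totals of the two batches client-side.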
Personally, I like the last option best, but only if you don't plan to use the database as-is in production. I.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.