How to merge nodes and relationships using py2neo v4 and Neo4j - python-3.x

I am trying to perform a basic merge operation to add nonexistent nodes and relationships to my graph by going through a csv file row by row. I'm using py2neo v4, and because there is basically no documentation or examples of how to use py2neo, I can't figure out how to actually get it done. This isn't my real code (it's very complicated to handle many different cases) but its structure is basically like this:
import py2neo as pn
graph = pn.Graph("bolt://localhost:###/", user="neo4j", password="py2neoSux")
matcher = pn.NodeMatcher(graph)
tx = graph.begin()
if (matcher.match("Prefecture", name="foo").first()) == None):
previousNode = pn.Node("Type1", name="fo0", yc=1)
else:
previousNode = matcher.match("Prefecture", name="foo").first())
thisNode = pn.Node("Type2", name="bar", yc=1)
tx.merge(previousNode)
tx.merge(thisNode)
theLink = pn.Relationship(thisNode, "PARTOF", previousNode)
tx.merge(theLink)
tx.commit()
Currently this throws the error
ValueError: Primary label and primary key are required for MERGE operation
the first time it needs to merge a node that it hasn't found (i.e., when creating a node). So then I change the line to:
tx.merge(thisNode,primary_label=list(thisNode.labels)[0], primary_key="name")
Which gives me the error IndexError: list index out of range from somewhere deep in the py2neo source code (....site-packages\py2neo\internal\operations.py", line 168, in merge_subgraph at node = nodes[i]). I tried to figure out what was going wrong there, but I couldn't decipher where the nodes list come from through various connections to other commands.
So, it currently matches and creates a few nodes without problem, but at some point it will match until it needs to create and then fails in trying to create that node (even though it is using the same code and doing the same thing under the same circumstances in a loop). It made it through all 20 rows in my sample once, but usually stops on the row 3-5.
I thought it had something to do with the transactions (see comments), but I get the same problem when I merge directly on the graph. Maybe it has to do with the py2neo merge function finding more identities for nodes than nodes. Maybe there is something wrong with how I specified my primarily label and/or key.
Because this error and code are opaque I have no idea how to move forward.
Anybody have any advice or instructions on merging nodes with py2neo?
Of course I'd like to know how to fix my current problem, but more generally I'd like to learn how to use this package. Examples, instructions, real documentation?

I am having a similar problem and just got done ripping my hair out to figure out what was wrong! SO! What I learned was that at least in my case.. and maybe yours too since we got similar error messages and were doing similar things. The problem lied for me in that I was trying to create a Node with a __primarykey__ field that had a different field name than the others.
PSEUDO EXAMPLE:
# in some for loop or complex code
node = Node("Example", name="Test",something="else")
node.__primarykey__ = "name"
<code merging or otherwise creating the node>
# later on in the loop you might have done something like this cause the field was null
node = Node("Example", something="new")
node.__primarykey__ = "something"
I hope this helps and was clear I'm still recovering from wrapping my head around things. If its not clear let me know and I'll revise.
Good luck.

Related

First Result Only Using Google-Search-API

I am using abenassi/Google-Search-API https://github.com/abenassi/Google-Search-API to make multiple Google queries in a small python script. I typically only need the first result (link) but the program is built to collect whole pages of results. So far I have been limiting the result as such:
results = google.search(query)
for result in iter(results[0:1]):
loc = result.link
The problem is that the script is slow as a result (I think) of having to wade through the whole page before I get my one link. Does anyone see something obvious I'm missing, or alternately, a simple way to modify the standard_search module https://github.com/abenassi/Google-Search-API/blob/master/google/modules/standard_search.py to limit results to first link only? Thanks!

Create automated report from web data

I have a set of multiple API's I need to source data from and need four different data categories. This data is then used for reporting purposes in Excel.
I initially created web queries in Excel, but my Laptop just crashes because there is too many querie which have to be updated. Do you guys know a smart workaround?
This is an example of the API I will source data from (40 different ones in total)
https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all
The data points I need are:
EstimatedMonthlyVisits, TopOrganicKeywords, OrganicSearchShare, TrafficSources
Any ideas how I can create an automated report which queries the above data on request?
Thanks so much.
If Excel is crashing due to the demand, and that doesn't surprise me, you should consider using Python or R for this task.
install.packages("XML")
install.packages("plyr")
install.packages("ggplot2")
install.packages("gridExtra")
require("XML")
require("plyr")
require("ggplot2")
require("gridExtra")
Next we need to set our working directory and parse the XML file as a matter of practice, so we're sure that R can access the data within the file. This is basically reading the file into R. Then, just to confirm that R knows our file is in XML, we check the class. Indeed, R is aware that it's XML.
setwd("C:/Users/Tobi/Documents/R/InformIT") #you will need to change the filepath on your machine
xmlfile=xmlParse("pubmed_sample.xml")
class(xmlfile) #"XMLInternalDocument" "XMLAbstractDocument"
Now we can begin to explore our XML. Perhaps we want to confirm that our HTTP query on Entrez pulled the correct results, just as when we query PubMed's website. We start by looking at the contents of the first node or root, PubmedArticleSet. We can also find out how many child nodes the root has and their names. This process corresponds to checking how many entries are in the XML file. The root's child nodes are all named PubmedArticle.
xmltop = xmlRoot(xmlfile) #gives content of root
class(xmltop)#"XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) #give name of node, PubmedArticleSet
xmlSize(xmltop) #how many children in node, 19
xmlName(xmltop[[1]]) #name of root's children
To see the first two entries, we can do the following.
# have a look at the content of the first child entry
xmltop[[1]]
# have a look at the content of the 2nd child entry
xmltop[[2]]
Our exploration continues by looking at subnodes of the root. As with the root node, we can list the name and size of the subnodes as well as their attributes. In this case, the subnodes are MedlineCitation and PubmedData.
#Root Node's children
xmlSize(xmltop[[1]]) #number of nodes in each child
xmlSApply(xmltop[[1]], xmlName) #name(s)
xmlSApply(xmltop[[1]], xmlAttrs) #attribute(s)
xmlSApply(xmltop[[1]], xmlSize) #size
We can also separate each of the 19 entries by these subnodes. Here we do so for the first and second entries:
#take a look at the MedlineCitation subnode of 1st child
xmltop[[1]][[1]]
#take a look at the PubmedData subnode of 1st child
xmltop[[1]][[2]]
#subnodes of 2nd child
xmltop[[2]][[1]]
xmltop[[2]][[2]]
The separation of entries is really just us, indexing into the tree structure of the XML. We can continue to do this until we exhaust a path—or, in XML terminology, reach the end of the branch. We can do this via the numbers of the child nodes or their actual names:
#we can keep going till we reach the end of a branch
xmltop[[1]][[1]][[5]][[2]] #title of first article
xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] #same command, but more readable
Finally, we can transform the XML into a more familiar structure—a dataframe. Our command completes with errors due to non-uniform formatting of data and nodes. So we must check that all the data from the XML is properly inputted into our dataframe. Indeed, there are duplicate rows, due to the creation of separate rows for tag attributes. For instance, the ELocationID node has two attributes, ValidYN and EIDType. Take the time to note how the duplicates arise from this separation.
#Turning XML into a dataframe
Madhu2012=ldply(xmlToList("pubmed_sample.xml"), data.frame) #completes with errors: "row names were found from a short variable and have been discarded"
View(Madhu2012) #for easy checking that the data is properly formatted
Madhu2012.Clean=Madhu2012[Madhu2012[25]=='Y',] #gets rid of duplicated rows
Here is a link that should help you get started.
http://www.informit.com/articles/article.aspx?p=2215520
If you have never used R before, it will take a little getting used to, but it's worth it. I've been using it for a few years now and when compared to Excel, I have seen R perform anywhere from a couple hundred percent faster to many thousands of percent faster than Excel. Good luck.

2 Sequential Transactions, setting Detail Number (Revit API / Python)

Currently, I made a tool to rename view numbers (“Detail Number”) on a sheet based on their location on the sheet. Where this is breaking is the transactions. Im trying to do two transactions sequentially in Revit Python Shell. I also did this originally in dynamo, and that had a similar fail , so I know its something to do with transactions.
Transaction #1: Add a suffix (“-x”) to each detail number to ensure the new numbers won’t conflict (1 will be 1-x, 4 will be 4-x, etc)
Transaction #2: Change detail numbers with calculated new number based on viewport location (1-x will be 3, 4-x will be 2, etc)
Better visual explanation here: https://www.docdroid.net/EP1K9Di/161115-viewport-diagram-.pdf.html
Py File here: http://pastebin.com/7PyWA0gV
Attached is the python file, but essentially what im trying to do is:
# <---- Make unique numbers
t = Transaction(doc, 'Rename Detail Numbers')
t.Start()
for i, viewport in enumerate(viewports):
setParam(viewport, "Detail Number",getParam(viewport,"Detail Number")+"x")
t.Commit()
# <---- Do the thang
t2 = Transaction(doc, 'Rename Detail Numbers')
t2.Start()
for i, viewport in enumerate(viewports):
setParam(viewport, "Detail Number",detailViewNumberData[i])
t2.Commit()
Attached is py file
As I explained in my answer to your comment in the Revit API discussion forum, the behaviour you describe may well be caused by a need to regenerate between the transactions. The first modification does something, and the model needs to be regenerated before the modifications take full effect and are reflected in the parameter values that you query in the second transaction. You are accessing stale data. The Building Coder provides all the nitty gritty details and numerous examples on the need to regenerate.
Summary of this entire thread including both problems addressed:
http://thebuildingcoder.typepad.com/blog/2016/12/need-for-regen-and-parameter-display-name-confusion.html
So this issue actually had nothing to do with transactions or doc regeneration. I discovered (with some help :) ), that the problem lied in how I was setting/getting the parameter. "Detail Number", like a lot of parameters, has duplicate versions that share the same descriptive param Name in a viewport element.
Apparently the reason for this might be legacy issues, though im not sure. Thus, when I was trying to get/set detail number, it was somehow grabbing the incorrect read-only parameter occasionally, one that is called "VIEWER_DETAIL_NUMBER" as its builtIn Enumeration. The correct one is called "VIEWPORT_DETAIL_NUMBER". This was happening because I was trying to get the param just by passing the descriptive param name "Detail Number".Revising how i get/set parameters via builtIn enum resolved this issue. See images below.
Please see pdf for visual explanation: https://www.docdroid.net/WbAHBGj/161206-detail-number.pdf.html

Allocating matrix / structure data instead of string name to variable

I have a script that opens a folder and does some processing on the data present. Say, there's a file "XYZ.tif".
Inside this tif file, there are two groups of datasets, which show up in the workspace as
data.ch1eXYZ
and
data.ch3eXYZ
If I want to continue with the 2nd set, I can use
A=data.ch3eXYZ
However, XYZ usually is much longer and varies per file, whereas data.ch3e is consistent.
Therefore I tried
A=strcat('data.ch3e','origfilename');
where origfilename of course is XYZ, which has (automatically) been extracted before.
However, that gives me a string A (since I practically typed
A='data.ch3eXYZ'
instead of the matrix that data.ch3eXYZ actually is.
I think it's just a problem with ()'s, []'s, or {}'s but Ican't seem to figure it out.
Thanks in advance!
If you know the string, dynamic field references should help you here and are far better than eval
Slightly modified example from the linked blog post:
fldnm = 'fred';
s.fred = 18;
y = s.(fldnm)
Returns:
y =
18
So for your case:
test = data.(['ch3e' origfilename]);
Should be sufficient
Edit: Link to the documentation

Using indexed types for ElasticSearch in Titan

I currently have a VM running Titan over a local Cassandra backend and would like the ability to use ElasticSearch to index strings using CONTAINS matches and regular expressions. Here's what I have so far:
After titan.sh is run, a Groovy script is used to load in the data from separate vertex and edge files. The first stage of this script loads the graph from Titan and sets up the ES properties:
config.setProperty("storage.backend","cassandra")
config.setProperty("storage.hostname","127.0.0.1")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","db/es")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
The second part of the script sets up the indexed types:
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make();
The third part loads in the data from the CSV files, this has been tested and works fine.
My problem is, I don't seem to be able to use the ElasticSearch functions when I do a Gremlin query. For example:
g.E.has("property",CONTAINS,"test")
returns 0 results, even though I know this field contains the string "test" for that property at least once. Weirder still, when I change CONTAINS to something that isn't recognised by ElasticSearch I get a "no such property" error. I can also perform exact string matches and any numerical comparisons including greater or less than, however I expect the default indexing method is being used over ElasticSearch in these instances.
Due to the lack of errors when I try to run a more advanced ES query, I am at a loss on what is causing the problem here. Is there anything I may have missed?
Thanks,
Adam
I'm not quite sure what's going wrong in your code. From your description everything looks fine. Can you try the follwing script (just paste it into your Gremlin REPL):
config = new BaseConfiguration()
config.setProperty("storage.backend","inmemory")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","/tmp/es-so")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
g = TitanFactory.open(config)
g.makeKey("name").dataType(String.class).make()
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make()
g.makeLabel("knows").make()
g.commit()
alice = g.addVertex(["name":"alice"])
bob = g.addVertex(["name":"bob"])
alice.addEdge("knows", bob, ["property":"foo test bar"])
g.commit()
// test queries
g.E.has("property",CONTAINS,"test")
g.query().has("property",CONTAINS,"test").edges()
The last 2 lines should return something like e[1t-4-1w][4-knows-8]. If that works and you still can't figure out what's wrong in your code, it would be good if you can share your full code (e.g. in Github or in a Gist).
Cheers,
Daniel

Resources