Is there any way to create Neo4j relationships from a Pandas dataframe? - python-3.x

I have created nodes using the py2neo package:
from py2neo import Graph
from py2neo import Node
This is my pandas dataframe
I can create nodes successfully.
I have been trying to create relationships, but I'm getting an error:
graph.create(Relationship(pmid, "Having_author", auth))
TypeError: Values of type <class 'pandas.core.series.Series'> are not supported
I have also referred to a Stack Overflow question but am still getting the error.
Here is the link
Is there any other way to create a relationship from a pandas dataframe?

Your code is failing because the ID for a node must be a literal (integer or string), but you set the ID as a Series when you wrote ...id = data['PMID']). It appears py2neo allowed you to create the node object with a faulty ID, but it really shouldn't, because all relationships with that node will fail since the ID is bad.
Recreate the Node objects with an integer ID, and then the Relationship between them should be created without issues.
Note, I haven't tested this code, but this is how you would loop through a df and create nodes as you go.
for i, row in data.iterrows():
    pmid = row['PMID']  # this is the integer PMID based on your df
    pmi_node = Node("PMID row " + str(i), id=pmid)  # create node
    authid = row['AU']  # this is a string author name based on your df
    auth_node = Node("Auth row " + str(i), id=authid)  # create node
    graph.create(pmi_node | auth_node)
    # create a relationship between the PMI and Auth nodes for that row
    graph.create(Relationship(pmi_node, "Having_author", auth_node))
PS -- The SO link you referenced is not using the py2neo package; it is simply sending Cypher query strings to the database using the neo4j Python package. I'd recommend that route if you are a beginner.
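For reference, here is an untested sketch of that Cypher-string approach with the neo4j Python driver; the bolt URI, credentials, and node labels are placeholders you would adjust for your own setup:
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for _, row in data.iterrows():
        # MERGE avoids creating duplicate nodes when a PMID or author repeats.
        session.run(
            "MERGE (p:PMID {id: $pmid}) "
            "MERGE (a:Author {name: $author}) "
            "MERGE (p)-[:HAVING_AUTHOR]->(a)",
            pmid=int(row['PMID']),
            author=row['AU'],
        )
driver.close()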

I converted the Series objects with zip and then it worked for me:
for pmid, au in tqdm(zip(data.PMID, data.AU), total=data.shape[0]):
    pmid_node = Node("pmid", name=int(pmid))
    au_node = Node("author", name=au)
    graph.create(Relationship(pmid_node, "HAVING_AUTHOR", au_node))

Related

How to get values based on 2 user inputs in Python

As per the data below, how can I use Python to get the Headers column value for a given input from the DB and Table columns?
DB Table Headers
Oracle Cust Id,Name,Mail,Phone,City,County
Oracle Cli Cid,shopNo,State
Oracle Addr Street,Area,City,Country
SqlSer Usr Name,Id,Addr
SqlSer Log LogId,Env,Stg
MySql Loc Flat,Add,Pin,Country
MySql Data Id,Txt,TaskId,No
Output: Suppose I pass Oracle & Cli as parameters; it should then return the value "Cid,shopNo,State" as a list.
I am trying this with a Python dictionary, but a dictionary only takes two values, a key and a value, and I have 3 values. How can I handle this?
Looks like your data is in some sort of tabular format. In that case I would recommend using the pandas package, which is very convenient if you are working with tabular data.
pandas can read data into a DataFrame from a CSV file using pandas.read_csv. You can then filter this dataframe using the column names and the required values.
In the example below I assume that your data is tab (\t) separated. I read in the data from a string using io.StringIO. Normally you would just use pandas.read_csv('filename.csv').
import pandas as pd
import io
data = """DB\tTable\tHeaders
Oracle\tCust\tId,Name,Mail,Phone,City,County
Oracle\tCli\tCid,shopNo,State
Oracle\tAddr\tStreet,Area,City,Country
SqlSer\tUsr\tName,Id,Addr
SqlSer\tLog\tLogId,Env,Stg
MySql\tLoc\tFlat,Add,Pin,Country
MySql\tData\tId,Txt,TaskId,No"""
dataframe = pd.read_csv(io.StringIO(data), sep='\t')
db_is_oracle = dataframe['DB'] == 'Oracle'
table_is_cli = dataframe['Table'] == 'Cli'
filtered_dataframe = dataframe[db_is_oracle & table_is_cli]
print(filtered_dataframe)
This will result in:
DB Table Headers
1 Oracle Cli Cid,shopNo,State
Or to get the actual headers of the first match:
print(filtered_dataframe['Headers'].iloc[0])
>>> Cid,shopNo,State
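Since the question asks for the headers as a list, you can split that matched string on the commas:
headers_list = filtered_dataframe['Headers'].iloc[0].split(',')
print(headers_list)
>>> ['Cid', 'shopNo', 'State']
If you would rather stay with a plain dictionary, one way around the "3 values" problem is to key the dictionary by (DB, Table) tuples; a quick sketch of that idea, with only two rows shown:
lookup = {
    ('Oracle', 'Cust'): 'Id,Name,Mail,Phone,City,County',
    ('Oracle', 'Cli'): 'Cid,shopNo,State',
    # add the remaining rows the same way
}
print(lookup[('Oracle', 'Cli')].split(','))
>>> ['Cid', 'shopNo', 'State']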

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: F.percentile_approx(x, 0.5) for x in df.columns if x not in exclusion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
Try createGlobalTempView. It worked for me.
For example:
df.createGlobalTempView("people")
(I don't know the root cause of why the local temp view does not work.)
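One detail worth noting (untested sketch): global temp views are registered under Spark's reserved global_temp database, so the query has to reference that database explicitly:
df.createGlobalTempView("people")
# Global temp views live in the reserved global_temp database.
spark.sql("SELECT * FROM global_temp.people LIMIT 5")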
I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.
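For anyone hitting the same error, below is a small self-contained sketch of that pattern; the SparkSession, sample rows, and column names are made up for illustration and are not part of the original transform:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up data standing in for the Merchant_Segments input.
df = spark.createDataFrame(
    [("A", "Retail", 10.0, 100.0),
     ("A", "Retail", 0.0, 300.0),
     ("B", "Online", 20.0, 200.0)],
    ["BANK_NAME", "BUS_SEGMENT", "REVENUE", "VOLUME"],
)

group_list = ["BANK_NAME", "BUS_SEGMENT"]
exclusion_list = group_list

# Calling percentile_approx inside F.expr avoids referencing a
# pyspark.sql.functions attribute that older versions do not expose.
df_result = df.groupBy(group_list).agg(
    *[F.expr("percentile_approx(nullif({0}, 0), 0.5)".format(col)).alias(col)
      for col in df.columns if col not in exclusion_list]
)
df_result.show()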

How to solve InvalidRequestError encountered during execution of pandas.to_sql() using sqlalchemy connection?

I am trying to replace an existing table in MySQL database. I used the below piece of code to convert the data frame called frame to a database table:
import pandas as pd
import sqlalchemy
from sqlalchemy.types import VARCHAR
database_username = 'root'
database_password = '1234'
database_ip = 'localhost'
database_name = 'my_new_database'
database_connection = sqlalchemy.create_engine('mysql+mysqlconnector://{0}:{1}@{2}/{3}'.format(database_username, database_password, database_ip, database_name), pool_size=3, pool_recycle=3600)
frame.to_sql(schema=database_name, con=database_connection, name='table1', if_exists='replace', chunksize=1000, dtype={'Enrollment No': VARCHAR(frame.index.get_level_values('Enrollment No').str.len().max())})
table1 gets created successfully. But when I rerun the last line of the above code i.e. frame.to_sql(), it throws the below error:
InvalidRequestError: Could not reflect: requested table(s) not available in Engine(mysql+mysqlconnector://root:***@localhost/my_new_database) schema 'my_new_database': (table1)
I want to know why this error is thrown when the table already exists, even though I've used if_exists='replace' and why it works correctly only when creating the table for the first time. What must be done to avoid getting this error?
N.B.: Answers to similar questions only suggest using the table name in lowercase, which I'm following by naming the table as 'table1'.

Trying to plot a pandas dataframe groupby with Bokeh

New here but I've been searching for hours now and can't seem to find the solution for this. What I'm trying to do is display an aggregate of a dataframe in a Bokeh chart. I tried using a groupby object but I get an error when passing the groupby object to the ColumnDataSource (as mentioned in the post below).
how use bokeh vbar chart parameter with groupby object?
Here's some sample code I'm using:
import numpy as np
import pandas
from bokeh.models import ColumnDataSource
df = pandas.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
group = df.groupby("A")
source = ColumnDataSource(group)
Getting this error:
ValueError: expected a dict or pandas.DataFrame, got <pandas.core.groupby.DataFrameGroupBy object at 0x103f7bfd0>
Any ideas on how to plot the groupby object in a chart with Bokeh?
Thanks in advance!
I haven't used Bokeh, but from what I can see you are passing a pandas.core.groupby.DataFrameGroupBy while ColumnDataSource is expecting a pd.DataFrame. The problem is that groupby creates a data structure that resembles key-value storage: each group in the groups object has a key and a value, and that value is the DataFrame you are looking for. Running the code below will help you understand the data structure that results from applying groupby() to a DataFrame:
groups = df.groupby('A')
for group in groups:
    # get group key
    key = group[0]
    # get group DataFrame
    group_df = group[1]
Notice that I replaced group = df.groupby('A') with groups = df.groupby('A').
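As a rough, untested sketch of where that leads: aggregate each group down to a regular DataFrame first (here with mean(), purely as an example), then hand that DataFrame to ColumnDataSource:
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))

# Collapse each group to a single row so the result is a plain DataFrame.
agg = df.groupby('A').mean().reset_index()
source = ColumnDataSource(agg)  # a DataFrame is accepted here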

Make Python dictionary available to all spark partitions

I am trying to develop an algorithm in pyspark for which I am working with the linalg.SparseVector class. I need to create a dictionary of key-value pairs as input to each SparseVector object. The keys have to be integers (in my case they represent user ids). I have a separate method that reads the input file and returns a dictionary where each user ID (string) is mapped to an integer index. When I go through the file again and do
FileRdd.map(lambda x: userid_idx[x[0]]), I receive a KeyError. I'm thinking this is because my dict is unavailable to all partitions. Is there a way to make the userid_idx dict available to all partitions, similar to a distributed map in MapReduce? Also, I apologize for the mess; I am posting this from my phone and will update in a while from my laptop.
The code as promised:
from pyspark.mllib.linalg import SparseVector
from pyspark import SparkContext
import glob
import sys
import time

"""We create user and item indices starting from 0 to #users and 0 to #items respectively. This is done to store them in sparseVectors as dicts."""

def create_indices(inputdir):
    items = dict()
    user_id_to_idx = dict()
    user_idx_to_id = dict()
    item_idx_to_id = dict()
    item_id_to_idx = dict()
    item_idx = 0
    user_idx = 0
    for inputfile in glob.glob(inputdir + "/*.txt"):
        print(inputfile)
        with open(inputfile) as f:
            for line in f:
                toks = line.strip().split("\t")
                try:
                    user_id_to_idx[toks[1].strip()]
                except KeyError:
                    user_id_to_idx[toks[1].strip()] = user_idx
                    user_idx_to_id[user_idx] = toks[1].strip()
                    user_idx += 1
                try:
                    item_id_to_idx[toks[0].strip()]
                except KeyError:
                    item_id_to_idx[toks[0].strip()] = item_idx
                    item_idx_to_id[item_idx] = toks[0].strip()
                    item_idx += 1
    return user_idx_to_id, user_id_to_idx, item_idx_to_id, item_id_to_idx, user_idx, item_idx

# pass in the hdfs path to the input files and the spark context.
def runKNN(inputdir, sc, user_id_to_idx, item_id_to_idx):
    rdd_text = sc.textFile(inputdir)
    try:
        new_rdd = rdd_text.map(lambda x: (item_id_to_idx[str(x.strip().split("\t")[0])], {user_id_to_idx[str(x.strip().split("\t")[1])]: 1})).reduceByKey(lambda x, y: x.update(y))
    except KeyError:
        sys.exit(1)
    new_rdd.saveAsTextFile("hdfs:path_to_output/user/hadoop/knn/output")

if __name__ == "__main__":
    sc = SparkContext()
    u_idx_to_id, u_id_to_idx, i_idx_to_id, i_id_to_idx, u_idx, i_idx = create_indices(sys.argv[1])
    u_idx_to_id_b = sc.broadcast(u_idx_to_id)
    u_id_to_idx_b = sc.broadcast(u_id_to_idx)
    i_idx_to_idx_b = sc.broadcast(i_idx_to_id)
    i_id_to_idx_b = sc.broadcast(i_id_to_idx)
    num_users = sc.broadcast(u_idx)
    num_items = sc.broadcast(i_idx)
    runKNN(sys.argv[1], sc, u_id_to_idx_b.value, i_id_to_idx_b.value)
In Spark, that dictionary will already be available to you, since it is shipped to every task. For example:
dictionary = {1:"red", 2:"blue"}
rdd = sc.parallelize([1,2])
rdd.map(lambda x: dictionary[x]).collect()
# Prints ['red', 'blue']
You will probably find that your issue is actually that your dictionary does not contain the key you are looking up!
From the Spark documentation:
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
A copy of local variables referenced will be sent to the node along with the task.
Broadcast variables will not help you here; they are simply a tool to improve performance by sending the data once per node rather than once per task.
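To make the two options concrete, here is an untested sketch with made-up IDs showing a dict captured in the task closure and the broadcast equivalent; both produce the same result:
from pyspark import SparkContext

sc = SparkContext()
userid_idx = {"u1": 0, "u2": 1}
rdd = sc.parallelize([("u1", "item1"), ("u2", "item2")])

# Closure capture: the dict is serialized and shipped with every task.
print(rdd.map(lambda x: (userid_idx[x[0]], x[1])).collect())
# [(0, 'item1'), (1, 'item2')]

# Broadcast: the dict is shipped once per node and read through .value.
userid_idx_b = sc.broadcast(userid_idx)
print(rdd.map(lambda x: (userid_idx_b.value[x[0]], x[1])).collect())
# [(0, 'item1'), (1, 'item2')]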
