Write a PySpark RDD or DataFrame to Neo4j - apache-spark

I'm experiencing an issue with py2neo and the Spark driver: I cannot insert nodes inside a foreach or map loop, as in the code below for example.
from py2neo import authenticate, Graph, cypher, Node
from pyspark import broadcast

infos = df.rdd

authenticate("localhost:7474", "neo4j", "admin")
graph = Graph(password='admin')
tx = graph.begin()

def node(row):
    query = Node("item", event_id=row[0], text=row[19])
    tx.create(query)

infos.foreach(node)
tx.commit()
Here is the end of the stack trace:
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in _wrap_function(sc, func, deserializer, serializer, profiler)
2386 assert serializer, "serializer should not be empty"
2387 command = (func, profiler, deserializer, serializer)
-> 2388 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
2389 return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
2390 sc.pythonVer, broadcast_vars, sc._javaAccumulator)
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
2372 # the serialized command will be compressed by broadcast
2373 ser = CloudPickleSerializer()
-> 2374 pickled_command = ser.dumps(command)
2375 if len(pickled_command) > (1 << 20): # 1M
2376 # The broadcast will have same life cycle as created PythonRDD
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/serializers.py in dumps(self, obj)
462
463 def dumps(self, obj):
--> 464 return cloudpickle.dumps(obj, 2)
465
466
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/cloudpickle.py in dumps(obj, protocol)
702
703 cp = CloudPickler(file,protocol
I think I can't pass the tx object into the loop.
We tried to work around this issue by instantiating a connection directly inside the loop, as in the code below. It works for small datasets, but when I try it with a 20-million-row one it stops at some point.
from py2neo import authenticate, Graph, cypher, Node

infos = df.rdd

authenticate("localhost:7474", "neo4j", "password")

def node(row):
    graph = Graph(password='admin')
    tx = graph.begin()
    query = Node("item", event_id=row[0], text=row[19])
    tx.create(query)
    tx.commit()

infos.foreach(node)
I did some research on the neo4j-spark-connector; it seems you can add the library, but there is no example provided and I'm not at all sure that such functionality is actually available in Python. What would be the best way to solve this problem?

A standard pattern for resolving this type of issue is to use foreachPartition:
def nodes(rows):
    graph = Graph(password='admin')
    tx = graph.begin()
    for row in rows:
        query = Node("item", event_id=row[0], text=row[19])
        tx.create(query)
    tx.commit()

infos.foreachPartition(nodes)
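With 20 million rows, a single transaction per partition can still grow very large. Below is a minimal sketch of the same foreachPartition pattern with periodic commits; the batch size of 10,000 is an assumed value, not from the original answer, so tune it to your memory budget:
def nodes(rows):
    from py2neo import Graph, Node  # same imports as in the question
    graph = Graph(password='admin')
    tx = graph.begin()
    count = 0
    for row in rows:
        tx.create(Node("item", event_id=row[0], text=row[19]))
        count += 1
        if count % 10000 == 0:   # assumed batch size
            tx.commit()          # flush the current batch to Neo4j
            tx = graph.begin()   # start a fresh transaction
    tx.commit()                  # commit any remaining rows

infos.foreachPartition(nodes)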

Related

How to create a PySpark pandas-on-Spark DataFrame from Snowflake SQL query?

NOTE: Need to use distributed processing, which is why I am utilizing Pandas API on Spark.
To create the pandas-on-Spark DataFrame, I attempted 2 different methods (outlined below: "OPTION 1", "OPTION 2").
Is either of these options feasible? If so, how do I proceed given the errors (outlined below in "ISSUE(S)" and in the error log for "OPTION 2")?
Alternatively, should I start with PySpark SQL Pandas UDFs for the query, and then convert to pandas-on-Spark DataFrame?
# (Spark 3.2.0, Scala 2.12, DBR 10.0)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## I. Import libraries & dependencies
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
import numpy as np
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql import Column
from pyspark.sql.functions import *
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## II. Load data + create Spark DataFrames
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
df_1 = spark.read.format("snowflake").options(**options).option("query","SELECT PROPERTY_ID,AVGRENT_MARKET FROM schema_1").load()
df_2 = spark.read.format("snowflake").options(**options).option("query","SELECT PROPERTY_ID,PROPERTY_ZIPCODE FROM schema_2").load()
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## III. OPTION 1: Union Spark DataFrames
## ISSUE(S): Results in 'None' values in PROPERTY_ZIPCODE column
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## Create merged dataframe from two Spark Dataframes
# df_3 = df_1.unionByName(df_2, allowMissingColumns=True)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## III. OPTION 2: Create Spark SQL DataFrame from SQL tables
## ISSUE(S): "AnalysisException: Reference 'PROPERTY_ID' is ambiguous, could be: table_1.PROPERTY_ID, table_2.PROPERTY_ID."
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## Create tables from two Spark DataFrames
df_1.createOrReplaceTempView("Table_1")
df_2.createOrReplaceTempView("Table_2")
## Specify SQL Snowflake query to merge tables
merge_tables = '''
SELECT Table_1.PROPERTY_ID,
       Table_1.AVGRENT_MARKET,
       Table_2.PROPERTY_ID,
       Table_2.PROPERTY_ZIPCODE
FROM Table_2
INNER JOIN Table_1
    ON Table_2.PROPERTY_ID = Table_1.PROPERTY_ID
LIMIT 25
'''
## Create merged Spark SQL dataframe based on query
df_3 = spark.sql(merge_tables)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
## Create a pandas-on-Spark DataFrame
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
df_3 = ps.DataFrame(df_3)
# df_3 = df_3.to_pandas_on_spark() # Alternative conversion option
Error log for "OPTION 2":
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<command-2142959205032388> in <module>
52 ##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~##
53 # df_3 = ps.DataFrame(df_3)
---> 54 df_3 = df_3.to_pandas_on_spark() # Alternative conversion option
/databricks/spark/python/pyspark/sql/dataframe.py in to_pandas_on_spark(self, index_col)
2777
2778 index_spark_columns, index_names = _get_index_map(self, index_col)
-> 2779 internal = InternalFrame(
2780 spark_frame=self, index_spark_columns=index_spark_columns, index_names=index_names
2781 )
/databricks/spark/python/pyspark/pandas/internal.py in __init__(self, spark_frame, index_spark_columns, index_names, index_fields, column_labels, data_spark_columns, data_fields, column_label_names)
633
634 # Create default index.
--> 635 spark_frame = InternalFrame.attach_default_index(spark_frame)
636 index_spark_columns = [scol_for(spark_frame, SPARK_DEFAULT_INDEX_NAME)]
637
/databricks/spark/python/pyspark/pandas/internal.py in attach_default_index(sdf, default_index_type)
865
866 if default_index_type == "sequence":
--> 867 return InternalFrame.attach_sequence_column(sdf, column_name=index_column)
868 elif default_index_type == "distributed-sequence":
869 return InternalFrame.attach_distributed_sequence_column(sdf, column_name=index_column)
/databricks/spark/python/pyspark/pandas/internal.py in attach_sequence_column(sdf, column_name)
878 @staticmethod
879 def attach_sequence_column(sdf: SparkDataFrame, column_name: str) -> SparkDataFrame:
--> 880 scols = [scol_for(sdf, column) for column in sdf.columns]
881 sequential_index = (
882 F.row_number().over(Window.orderBy(F.monotonically_increasing_id())).cast("long") - 1
/databricks/spark/python/pyspark/pandas/internal.py in <listcomp>(.0)
878 @staticmethod
879 def attach_sequence_column(sdf: SparkDataFrame, column_name: str) -> SparkDataFrame:
--> 880 scols = [scol_for(sdf, column) for column in sdf.columns]
881 sequential_index = (
882 F.row_number().over(Window.orderBy(F.monotonically_increasing_id())).cast("long") - 1
/databricks/spark/python/pyspark/pandas/utils.py in scol_for(sdf, column_name)
590 def scol_for(sdf: SparkDataFrame, column_name: str) -> Column:
591 """Return Spark Column for the given column name."""
--> 592 return sdf["`{}`".format(column_name)]
593
594
/databricks/spark/python/pyspark/sql/dataframe.py in __getitem__(self, item)
1657 """
1658 if isinstance(item, str):
-> 1659 jc = self._jdf.apply(item)
1660 return Column(jc)
1661 elif isinstance(item, Column):
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
121 # Hide where the exception came from that shows a non-Pythonic
122 # JVM exception message.
--> 123 raise converted from None
124 else:
125 raise
AnalysisException: Reference 'PROPERTY_ID' is ambiguous, could be: table_1.PROPERTY_ID, table_2.PROPERTY_ID.
If all you want is just a join, then use Spark's join function instead. It's much cleaner and more maintainable.
df_1 = spark.read...load()
df_2 = spark.read...load()
df_3 = df_1.join(df_2, on=['PROPERTY_ID'], how='inner')
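Once the join removes the duplicate PROPERTY_ID column, either conversion from the question should work on the result. A small sketch, assuming the same df_1/df_2 as above:
import pyspark.pandas as ps

df_3 = df_1.join(df_2, on=['PROPERTY_ID'], how='inner')

# Either of the original conversion options should now succeed
psdf = ps.DataFrame(df_3)
# psdf = df_3.to_pandas_on_spark()   # alternative, as in the question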

How best to parallelize grakn queries with Python?

I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?
My only experience with parallelization is on another project, where I submit a custom function to a Dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:
Traceback (most recent call last):
File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
...
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?
The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:
from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial

def write_query_batch(session, batch):
    tx = session.transaction().write()
    for query in batch:
        tx.query(query)
    tx.commit()

def multi_thread_write_query_batches(session, query_batches, num_threads=8):
    pool = ThreadPool(num_threads)
    pool.map(partial(write_query_batch, session), query_batches)
    pool.close()
    pool.join()

def generate_query_batches(my_data_entries_list, batch_size):
    batch = []
    for index, data_entry in enumerate(my_data_entries_list):
        batch.append(data_entry)
        if index % batch_size == 0 and index != 0:
            yield batch
            batch = []
    if batch:
        yield batch

# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")

query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)

session.close()
client.close()
The above is a generic method. As a concrete example, you can use it (omitting Part 2) to parallelise batches of insert statements from two files. Appending this to the code above should work:
import time

files = [
    {
        "file_path": f"/path/to/your/file.gql",
    },
    {
        "file_path": f"/path/to/your/file2.gql",
    }
]

KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000

# Entry point where migration starts
def migrate_graql_files():
    start_time = time.time()

    for file in files:
        print('==================================================')
        print(f'Loading from {file["file_path"]}')
        print('==================================================')

        open_file = open(file["file_path"], "r")  # Here we are assuming you have 1 Graql query per line!
        batches = generate_query_batches(open_file.readlines(), BATCH_SIZE)

        with GraknClient(uri=URI) as client:  # Using `with` auto-closes the client
            with client.session(KEYSPACE) as session:  # Using `with` auto-closes the session
                multi_thread_write_query_batches(session, batches, num_threads=16)  # Pick `num_threads` according to your machine

        elapsed = time.time() - start_time
        print(f'Time elapsed {elapsed:.1f} seconds')

    elapsed = time.time() - start_time
    print(f'Time elapsed {elapsed:.1f} seconds')

if __name__ == "__main__":
    migrate_graql_files()
You should also be able to see how you can load from a CSV or any other file type in this way, taking the values you find in that file and substituting them into Graql query string templates; a sketch of that idea follows below. Take a look at the migration example in the docs for more on that.
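For illustration, here is a minimal sketch of that CSV-to-Graql templating idea; the column names (person_id, name) and the insert template are assumptions, not part of the original answer. The resulting query strings can be fed straight into generate_query_batches above:
import csv

def csv_to_insert_queries(csv_path):
    """Build one Graql insert statement per CSV row (hypothetical schema)."""
    queries = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            queries.append(
                f'insert $p isa person, has person-id "{row["person_id"]}", '
                f'has name "{row["name"]}";'
            )
    return queries

# queries = csv_to_insert_queries("/path/to/people.csv")
# batches = generate_query_batches(queries, BATCH_SIZE)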
An alternative approach using multi-processing instead of multi-threading follows below.
We empirically found that multi-threading doesn't yield particularly large performance gains, compared to multi-processing. This is probably due to Python's GIL.
This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue

def batch_writer(database, kill_event, batch_queue):
    client = TypeDB.core_client("localhost:1729")
    session = client.session(database, SessionType.DATA)
    while not kill_event.is_set():
        try:
            batch = batch_queue.get(block=True, timeout=1)
            with session.transaction(TransactionType.WRITE) as tx:
                for query in batch:
                    tx.query().insert(query)
                tx.commit()
        except queue.Empty:
            continue
    print("Received kill event, exiting worker.")

def start_writers(database, kill_event, batch_queue, parallelism=4):
    processes = []
    for _ in range(parallelism):
        proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
        processes.append(proc)
        proc.start()
    return processes

def batch(iterable, n=1000):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

if __name__ == '__main__':
    batch_size = 100
    parallelism = 1
    database = "<database name>"
    file_path = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"

    with open(file_path, "r") as file:
        statements = file.read().splitlines()[:]

    batch_statements = batch(statements, n=batch_size)
    total_batches = int(len(statements) / batch_size)
    if len(statements) % batch_size > 0:
        total_batches += 1

    batch_queue = mp.Queue(parallelism * 4)
    kill_event = mp.Event()
    writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)

    for i, batch in enumerate(batch_statements):
        batch_queue.put(batch, block=True)
        if i * batch_size % 10000 == 0:
            print("Loaded: {0}/{1}".format(i * batch_size, total_batches * batch_size))

    kill_event.set()
    batch_queue.close()
    batch_queue.join_thread()
    for proc in writers:
        proc.join()
    print("Done loading")

Getting Type Error Expected Strings or Bytes Like Object

I am working on a dataset of tweets and I am trying to find the mentions of other users in a tweet; a tweet can mention no users, one user, or multiple users.
The following is the function that I created to extract the list of mentions in a tweet:
import re

def getMention(text):
    mention = re.findall('(^|[^@\w])@(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
I'm trying to create a new column in the DataFrame and apply the function with the following code:
df['mention'] = df['text'].apply(getMention)
On running this code I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-426da09a8770> in <module>
----> 1 df['mention'] = df['text'].apply(getMention)
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-42-d27373022afd> in getMention(text)
1 def getMention(text):
2
----> 3 mention = re.findall('(^|[^@\w])@(\w{1,15})', text)
4 if len(mention) > 0:
5 return [x[1] for x in mention]
~/anaconda3_501/lib/python3.6/re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
I can't comment (not enough rep) so here's what I suggest to troubleshoot the error.
It seems findall raises an exception because text is not a string, so you might want to check which type text actually is, using this:
def getMention(text):
    print(type(text))
    mention = re.findall(r'(^|[^@\w])@(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
(or the debugger if you know how to)
And if text can be converted to a string, maybe try this?
def getMention(text):
    mention = re.findall(r'(^|[^@\w])@(\w{1,15})', str(text))
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
P.S.: don't forget the r'...' prefix in front of your regexp, to avoid special characters being interpreted.
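One common reason for non-string values here is NaN in the text column (pandas stores missing text as float NaN). If that is what type(text) reveals, here is a sketch of one way to handle it before applying the function, assuming your column really is named 'text':
# Replace missing tweets with empty strings before applying getMention
df['text'] = df['text'].fillna('')
# df = df.dropna(subset=['text'])   # alternative: drop rows with no text

df['mention'] = df['text'].apply(getMention)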

Pool.apply_async().get() causes _thread.lock pickle error

I've recently made a python program that would benefit a lot from a consumer/producer parallel computing strategy.
I've tried to develop a module (class) to ease the implementation of such a processing strategy, but I quickly ran into a problem.
My ProducerConsumer class:
from collections import deque
from multiprocessing import Pool

class ProducerConsumer(object):
    def __init__(self, workers_qt, producer, consumer, min_producer_qt=1):
        self.producer_functor = producer        # Pointer to the producer function
        self.consumer_functor = consumer        # Pointer to the consumer function
        self.buffer = deque([])                 # Thread-safe double-ended queue for the intermediate result buffer
        self.workers_qt = workers_qt
        self.min_producer_qt = min_producer_qt  # Minimum quantity of active producers (if enough remaining input data)
        self.producers = []                     # List of producers' async results
        self.consumers = []                     # List of consumers' async results

    def produce(self, params, callback=None):
        result = self.producer_functor(*params)  # Execute the producer function
        if callback is not None:
            callback()                            # Call the callback (if there is one)
        return result

    def consume(self, params, callback=None):
        result = self.consumer_functor(params)    # Execute the consumer function
        if callback is not None:
            callback()                             # Call the callback (if there is one)
        return result

    # Map a list of producer's input data to a list of consumer's output data
    def map_result(self, producers_param):
        result = []                               # Result container
        producers_param = deque(producers_param)  # Convert input to a double-ended queue (for popleft())
        with Pool(self.workers_qt) as p:          # Create a worker pool
            while self.buffer or producers_param or self.consumers or self.producers:  # Work remaining
                # Create consumers
                if self.buffer and (len(self.producers) >= self.min_producer_qt or not producers_param):
                    consumer_param = self.buffer.popleft()  # Pop one set from the consumer param queue
                    if not isinstance(consumer_param, tuple):
                        consumer_param = (consumer_param,)  # Force tuple type
                    self.consumers.append(p.apply_async(func=self.consume, args=consumer_param))  # Start new consumer
                # Create producers
                elif producers_param:
                    producer_param = producers_param.popleft()  # Pop one set from the producer param queue
                    if not isinstance(producer_param, tuple):
                        producer_param = (producer_param,)      # Force tuple type
                    self.producers.append(p.apply_async(func=self.produce, args=producer_param))  # Start new producer

                # Filter finished async_tasks
                finished_producers = [r for r in self.producers if r.ready()] if self.producers else []
                finished_consumers = [r for r in self.consumers if r.ready()] if self.consumers else []

                # Remove finished async_tasks from the running tasks list
                self.producers = [r for r in self.producers if r not in finished_producers]
                self.consumers = [r for r in self.consumers if r not in finished_consumers]

                # Extract result from finished async_tasks
                for r in finished_producers:
                    assert r.ready()
                    self.buffer.append(r.get())  # Get the producer result and put it in the buffer
                for r in finished_consumers:
                    assert r.ready()
                    result.append(r.get())       # Get the consumer result and put it in the local result var
        return result
In the map_result() member, when I try to get() the result of the apply_async(...) call, I get the following error (note that I'm running Python 3):
Traceback (most recent call last):
File "ProducerConsumer.py", line 91, in <module>
test()
File "ProducerConsumer.py", line 85, in test
result = pc.map_result(input)
File "ProducerConsumer.py", line 64, in map_result
self.buffer.append(r.get()) # Get the producer result and put it in the buffer
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/usr/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
put(task)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle _thread.lock objects
And here is some code to reproduce my error (it depends on the class above, obviously):
def test_producer(val):
    return val * 12

def test_consumer(val):
    return val / 4

def test():
    pc = ProducerConsumer(4, test_producer, test_consumer)
    input = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]       # Input for the test of the ProducerConsumer class
    expected = [0, 3, 6, 9, 15, 18, 21, 23, 27]  # Expected output for the test of the ProducerConsumer class
    result = pc.map_result(input)
    print('got : {}'.format(result))
    print('expected : {}'.format(expected))

if __name__ == '__main__':
    test()
Note that in the map_result() member of my class I only "get()" results that are "ready()".
From what I know about pickling (which I admit is not that much), I'd say that calling Pool.apply_async(...) on a member function could play a role, but I'd really like to keep the class structure if I can.
Thank you for the help!
So, the problem was corrected once I also fixed some design errors:
My 3 buffer variables (buffer, producers, consumers) had no business being members of the class, since they were semantically bound to the map_result() member itself.
So the patch was to delete these members and recreate them as local variables of map_result().
The problem is, even though the design was faulty, I still have a hard time understanding why the worker couldn't pickle the lock (of the parameter, I now suppose), so...
If anyone has a clear explanation of what was going on (or a link to one), that would be really appreciated.
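A hedged explanation, since the question asks for one: submitting a bound method such as self.consume to apply_async means the whole instance (self) is pickled along with the task. Once self.producers and self.consumers hold AsyncResult objects, those get pickled too, and they carry synchronization primitives that cannot be pickled, hence the _thread.lock error; moving them to locals keeps them out of self and out of the pickle. A minimal sketch that reproduces the same failure, with hypothetical names:
import pickle
import threading

class Holder:
    def __init__(self):
        self.lock = threading.Lock()   # stands in for the locks hidden inside AsyncResult objects

    def work(self, x):
        return x

h = Holder()
try:
    # Pickling the bound method pickles the instance, and with it the lock
    pickle.dumps(h.work)
except TypeError as exc:
    print(exc)   # e.g. "cannot pickle '_thread.lock' object"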

NotImplementedError: executemany is implemented for simple INSERT statements only

I am trying to append to my Vertica (SQL-type) table through pandas, using SQLAlchemy:
import pandas as pd
import sqlalchemy as sa
Create an engine to Vertica:
def get_engine(base):
    engine = sa.create_engine(("{sys}+{dri}://{user}:" +
                               "{password}@{host}:{port}/{database}").format(**login[base]))
    return engine

engine = get_engine('vertica')
Just for clarity, a simple query:
table = '***'
sql = '''
    select *
    from public.{table}
'''.format(table=table)

connection = engine.connect()
data = pd.read_sql(sql, connection)
connection.close()
Data is not empty:
print(len(data))
569955
And I try to write to the same table:
fields = list(data.columns)

connection = engine.connect()
data.to_sql(table, connection, schema='public', index=False, if_exists='append', chunksize=30000,
            dtype={fields[0]: sa.types.Integer,
                   fields[1]: sa.types.VARCHAR,
                   fields[2]: sa.types.Integer,
                   fields[3]: sa.types.Integer,
                   fields[4]: sa.types.Integer,
                   fields[5]: sa.types.VARCHAR,
                   fields[6]: sa.types.VARCHAR,
                   fields[7]: sa.types.VARCHAR,
                   fields[8]: sa.types.VARCHAR,
                   fields[9]: sa.types.VARCHAR,
                   fields[10]: sa.types.VARCHAR,
                   fields[11]: sa.types.VARCHAR,
                   fields[12]: sa.types.DateTime})
connection.close()
And I get this error:
...
\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py in do_executemany(self, cursor, statement, parameters, context)
465
466 def do_executemany(self, cursor, statement, parameters, context=None):
--> 467 cursor.executemany(statement, parameters)
468
469 def do_execute(self, cursor, statement, parameters, context=None):
\Anaconda3\lib\site-packages\vertica_python\vertica\cursor.py in executemany(self, operation, seq_of_parameters)
153 else:
154 raise NotImplementedError(
--> 155 "executemany is implemented for simple INSERT statements only")
156
157 def fetchone(self):
NotImplementedError: executemany is implemented for simple INSERT statements only
I got the same error when I was trying to write my data to Vertica using SQLAlchemy. In my case the issue was the column names: it seems that it can't write column names that include special characters. I fixed the error by removing all the '_', '%' and whitespace characters from the column names in pandas, and then used df.to_sql() to write to Vertica, as sketched below.
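A minimal sketch of that clean-up, assuming data is the DataFrame from the question and that only '_', '%' and whitespace need to go (adjust the pattern to your own column names):
import re

# Strip the characters that reportedly caused trouble from the column names
data.columns = [re.sub(r"[_%\s]", "", str(col)) for col in data.columns]

# Then write as before (engine, table and schema as in the question)
data.to_sql(table, connection, schema='public', index=False, if_exists='append', chunksize=30000)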
