DataStax driver for Cassandra Version 3.25.0,
Python version 3.9
Session.execute() fetches the first 100 records. As per the documentation, the driver is supposed to
transparently fetch the next pages as we reach the end of the first page. However, it fetches the same page again and again, and hence the first 100 records are all that is ever accessible.
The for loop that prints the records runs forever.
ssl_context.verify_mode = CERT_NONE
cluster = Cluster(contact_points=[db_host], port=db_port,
                  auth_provider=PlainTextAuthProvider(db_user, db_pwd),
                  ssl_context=ssl_context)
session = cluster.connect()

query = "SELECT * FROM content_usage"
statement = SimpleStatement(query, fetch_size=100)
results = session.execute(statement)
for row in results:
    print(f"{row}")
I could find other similar threads, but they are unanswered as well. Has anyone encountered this issue before? Any help is appreciated.
I'm a bit confused by the initial statement of the problem. You mentioned that the initial page of results is fetched repeatedly and that these are the only results available to your program. You also indicated that the for loop responsible for printing results turns into an infinite loop when you run the program. These statements seem contradictory to me; how can you know what the driver has fetched if you never get any output? I'm assuming that's what you meant by "goes infinite"... if I'm wrong please correct me.
The following code seems to run as expected against Cassandra 4.0.0 using cassandra-driver 3.25.0 on Python 3.9.0:
import argparse
import logging
import time
from cassandra.cluster import Cluster, SimpleStatement
def setupLogging():
    log = logging.getLogger()
    log.setLevel('DEBUG')
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
    log.addHandler(handler)

def setupSchema(session):
    session.execute("""create keyspace if not exists "baz" with replication = {'class':'SimpleStrategy', 'replication_factor':1};""")
    session.execute("""create table if not exists baz.qux (run_ts bigint, idx int, uuid timeuuid, primary key (run_ts,idx))""")
    session.execute("""truncate baz.qux""")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-d','--debug', action='store_true')
    args = parser.parse_args()

    cluster = Cluster()
    session = cluster.connect()

    if args.debug:
        setupLogging()

    setupSchema(session)

    run_ts = int(time.time())
    insert_stmt = session.prepare("""insert into baz.qux (run_ts,idx,uuid) values (?,?,now())""")
    for idx in range(10000):
        session.execute(insert_stmt, [run_ts, idx])

    query = "select * from baz.qux"
    stmt = SimpleStatement(query, fetch_size=100)
    results = session.execute(stmt)
    for row in results:
        print(f"{row}")

    cluster.shutdown()
$ time (python foo.py | wc -l)
10000
real 0m12.452s
user 0m3.786s
sys 0m2.197s
You might try running your sample app with debug logging enabled (see sample code above for how to enable this). It sounds like something might be off in your Cassandra configuration (or perhaps your client setup); the additional logging might help you identify what (if anything) is getting in the way.
Your code only calls execute() once, so the contents of results will only ever be the same list of 100 rows.
You need to call execute() in your loop to get the next pages of results, like this:
query = "SELECT * FROM content_usage"
statement = SimpleStatement(query, fetch_size=100)
for row in session.execute(statement):
    process_row(row)
For more info, see Paging with the Python driver. Cheers!
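If you end up needing to drive the paging by hand while debugging, the driver's ResultSet also exposes the pages directly. A small sketch of that, reusing the session and query from the question (current_rows, has_more_pages and fetch_next_page are part of the ResultSet API):
from cassandra.query import SimpleStatement

statement = SimpleStatement("SELECT * FROM content_usage", fetch_size=100)
results = session.execute(statement)  # session as set up in the question

while True:
    for row in results.current_rows:   # only the rows of the currently fetched page
        print(row)
    if not results.has_more_pages:
        break
    results.fetch_next_page()           # synchronously fetches the next page into current_rows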
Below is the code snippet that finally worked for me, after restricting the driver version to 3.20:
statement = session.prepare(query)

# Execute the query once and retrieve the first page of results
results = session.execute(statement, params)
for row in results.current_rows:
    process_row(row)

# Fetch further pages until they are exhausted
while results.has_more_pages:
    page_state = results.paging_state
    results = session.execute(statement, parameters=params, paging_state=page_state)
    for row in results.current_rows:
        process_row(row)
I have to transfer data from one PostgreSQL DB (old) into another PostgreSQL DB (new).
The old one is encoded in win1252, the new one in utf-8.
I've already tried different methods, e.g. pandas.to_sql, sqlalchemy, psycopg2 and so on, but I fail every time due to encoding "issues". I've done some research, and the most plausible explanation looks like an issue on the driver side. As far as I know psycopg2 uses the Unicode driver, but with my source database version (PostgreSQL 9.4.20 on x86_64) I have to use ANSI to bypass these encoding issues.
I've tested with an ETL tool whether it's possible to export the affected table without encoding issues. It was possible without problems. Because of this test I'm fairly sure it's not a real encoding issue but rather a driver handling issue.
When I used a sample to test whether loading the data works in general, I already noticed that pandas is too slow; I have to load 1.2 million records, and this runs forever. Therefore the PostgreSQL COPY method is my preferred method. From my perspective psycopg2 uses the standard connection string (https://halvar.at/python/odbc_dsn_connection_strings/), but I have to use the ANSI driver.
I tried to pass an SQLAlchemy-style connection string to the psycopg2 connector, but this does not work:
stage_engine_string = ("{PostgreSQL ANSI}+psycopg2://" + str(stage_user) + ":" + str(stage_password) + "#" + str(stage_host) + ":" + str(stage_port) + "/" + str(stage_database))
because
conn = psycopg2.connect(**params)
only allows passing the arguments
host =
database =
user =
password =
port =
Before trying the above, I tried, for example,
cur.copy_to(open("sql_tmp_export.csv", "w", encoding="utf-8", errors="ignore"), "table", sep=";", columns=("no","description"))
as well as
conn.decode("win1250").encode("utf8")
and
conn.set_client_encoding("win1250")
but I receive an encoding issue every time. Based on the PostgreSQL docs, switching between utf8 and win1250 should never be a problem.
In the ETL tool I had a similar issue, but I was able to solve it by sending
set client_encoding="windows-1250"
after establishing the connection to the database.
But if I try the same in psycopg2 with
cur.execute("set client_encoding=\"windows-1250\";select * from table")
I still get the encoding issue.
Any clue whether I have an option to specify the driver when building up a psycopg2 connection? I think this would solve my issue.
My real issue (getting the data out of the db) wasn't solved, because of follow-up issues. If you want to get into it, I'm happy to discuss it on my next question: Downloading a postgreSQL pg_dump file from a remote server using Python
But I was able to solve this question. If you want to use the ANSI driver, you have to install the latest ODBC driver from https://www.postgresql.org/ftp/odbc/versions/msi/
Then you can switch from the psycopg2 connection to a pyodbc connection:
import pyodbc

conn_str = (
    "DRIVER={PostgreSQL Ansi(x64)};"
    "DATABASE="+database+";"
    "UID="+user+";"
    "PWD="+password+";"
    "SERVER="+host+";"
    "PORT="+port+";"
)
conn = pyodbc.connect(conn_str)

cur = conn.execute("SELECT 1")
row = cur.fetchone()
print(row)
cur.close()
conn.close()
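From there, the actual export into a UTF-8 CSV could look roughly like this. This is only a sketch: the connection values are placeholders, and the "table", "no" and "description" names are taken from the copy_to attempt in the question.
import csv
import pyodbc

# Placeholder connection values; fill in your own
conn_str = (
    "DRIVER={PostgreSQL Ansi(x64)};"
    "DATABASE=olddb;"
    "UID=user;"
    "PWD=password;"
    "SERVER=localhost;"
    "PORT=5432;"
)
conn = pyodbc.connect(conn_str)
cur = conn.cursor()

# Table and column names as used in the question's copy_to attempt
cur.execute('SELECT "no", "description" FROM "table"')

with open("sql_tmp_export.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow([column[0] for column in cur.description])  # header row
    for row in cur:
        writer.writerow(row)

cur.close()
conn.close()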
My general problem has been fixed now as well, but the solution was strange. In case someone gets stuck on something similar: I simply ran the same script twice, but first with a limit and offset.
import pandas

def any_postgres_method_to_load_data_from_db():
    conn = some_lib.connect(var1, var2)
    cur = conn.cursor()

    sql_pre_statement = """\
    set client_encoding = "Windows-1250"
    """
    cur.execute(sql_pre_statement)

    sql_statement = """\
    select * from n
    """
    cur.execute(sql_statement)

    df = pandas.read_sql_query(sql_statement, conn)
    df.to_csv("sql_tmp_export.csv", index=False)
The script above raised encoding errors several times.
After running the script once with the slight adjustment shown below, I was able to get the original one working.
def any_postgres_method_to_load_data_from_db():
    conn = some_lib.connect(var1, var2)
    cur = conn.cursor()

    sql_pre_statement = """\
    set client_encoding = "Windows-1250"
    """
    cur.execute(sql_pre_statement)

    sql_statement = """\
    select * from n offset 500 limit 1000
    """
    cur.execute(sql_statement)

    df = pandas.read_sql_query(sql_statement, conn)
    df.to_csv("sql_tmp_export.csv", index=False)
I can't really explain this. I just have the feeling that there was something strange in the cache on the remote db.
I have 500 continuously growing DataFrames, and I would like to submit operations on the (per-DataFrame independent) data to dask. My main question is: can dask hold the continuously submitted data, so that I can submit a function over all the submitted data, not just the newly submitted part?
Let me explain it with an example:
Creating a dask_server.py:
from dask.distributed import Client, LocalCluster

HOST = '127.0.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'

def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT, n_workers=8)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()

if __name__ == '__main__':
    run_cluster()
Now I can connect from my my_stream.py and start to submit and gather data:
import threading
import time

import pandas as pd
from dask.distributed import Client

DASK_CLIENT_IP = '127.0.0.1'
DASK_CLIENT_PORT = 8711  # the scheduler port from dask_server.py
dask_con_string = 'tcp://%s:%s' % (DASK_CLIENT_IP, DASK_CLIENT_PORT)
dask_client = Client(dask_con_string)

def my_dask_function(lines):
    return lines['a'].mean() + lines['b'].mean()

def async_stream_redis_to_d(max_chunk_size=1000):
    while 1:
        # This is a redis queue, but can be any queueing/file-stream/syslog or whatever
        lines = queue_IN.get(block=True, max_chunk_size=max_chunk_size)

        futures = []
        df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
        futures.append(dask_client.submit(my_dask_function, df))

        result = dask_client.gather(futures)
        print(result)

        time.sleep(0.1)

if __name__ == '__main__':
    max_chunk_size = 1000
    thread_stream_data_from_redis = threading.Thread(target=async_stream_redis_to_d, args=[max_chunk_size])
    # thread_stream_data_from_redis.setDaemon(True)
    thread_stream_data_from_redis.start()
    # Lets go
This works as expected and it is really quick!
But next, I would like to actually accumulate the lines before the computation takes place, and I wonder if this is possible. So in the example here, I would like to calculate the mean over all lines that have been submitted, not only the most recently submitted ones.
Questions / Approaches:
Is this cumulative calculation possible?
Bad Alternative 1: I cache all lines locally and submit all the data to the cluster every time a new row arrives. The overhead grows roughly quadratically over time. I tried it; it works, but it is slow!
Golden Option: Python program 1 pushes the data. Then it would be possible to connect with another client (from another Python program) to that accumulated data and move the analysis logic away from the inserting logic. I think Published Datasets are the way to go, but are they applicable for such high-speed appends?
Maybe related: Distributed Variables, Actors, Worker
Assigning a list of futures to a published dataset seems ideal to me. This is relatively cheap (everything is metadata) and you'll be up to date within a few milliseconds:
client.datasets["x"] = list_of_futures

def worker_function():
    client = get_client()
    futures = client.datasets["x"]
    data = client.gather(futures)
    # ... work with data
As you mention, there are other systems like PubSub or Actors. From what you say, though, I suspect that futures + published datasets are the simpler and more pragmatic option.
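To make that concrete, here is a minimal sketch under a few assumptions: the scheduler from dask_server.py is reachable at tcp://127.0.0.1:8711, the dataset name 'lines' is arbitrary, and the inline DataFrames stand in for chunks arriving from the queue. One process keeps republishing a growing list of futures; any other client can pick them all up later.
import pandas as pd
from dask.distributed import Client

SCHEDULER = 'tcp://127.0.0.1:8711'  # assumed: the scheduler started by dask_server.py

def producer():
    client = Client(SCHEDULER)
    futures = []
    for i in range(100):
        # Stand-in for "new lines arrived from the queue"
        df = pd.DataFrame({'a': range(i, i + 10), 'b': range(i, i + 10)})
        futures.append(client.scatter(df))        # keep the chunk on the cluster
        try:
            client.unpublish_dataset('lines')     # drop the previous snapshot, if any
        except KeyError:
            pass
        client.datasets['lines'] = list(futures)  # cheap: only metadata is published

def consumer():
    client = Client(SCHEDULER)
    futures = client.datasets['lines']            # everything published so far
    chunks = client.gather(futures)
    all_lines = pd.concat(chunks)
    return all_lines['a'].mean() + all_lines['b'].mean()
Republishing the whole list on every append only moves metadata around, so it should stay cheap even as the list grows.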
I am using Google Cloud Functions to connect to Google BigQuery and update some rows. The cloud function is written in Python 3.
I need help figuring out how to get the result message or the number of updated/changed rows whenever I run an UPDATE DML statement through the function. Any ideas?
from google.cloud import bigquery
def my_update_function(context, data):
    BQ = bigquery.Client()
    query_job = BQ.query("Update table set etc...")
    rows = query_job.result()
    return (rows)
I understand that rows always comes back as an _EmptyRowIterator object. Is there any way I can get a result or result message? The documentation says I have to get it from a BigQuery job method, but I can't seem to figure it out.
I think you are searching for QueryJob.num_dml_affected_rows. It contains the number of rows affected by an UPDATE or any other DML statement. If you use it in your code instead of rows in the return statement you will get the number as an int, or you can build a message like:
return("Number of updated rows: " + str(query_job.num_dml_affected_rows))
I hope it will help :)
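Dropped into the question's function, that might look roughly like this (the table and column names are placeholders; result() is called first so the DML job has finished before the count is read):
from google.cloud import bigquery

def my_update_function(context, data):
    client = bigquery.Client()
    query_job = client.query(
        "UPDATE `my_dataset.my_table` SET status = 'done' WHERE id = 1"
    )
    query_job.result()  # wait for the DML statement to complete
    return "Number of updated rows: " + str(query_job.num_dml_affected_rows)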
It seems like there is no mention of returned row counts in the BigQuery Python DB-API documentation: https://googleapis.dev/python/bigquery/latest/reference.html
I decided to use a roundabout method to deal with this issue by generating a SELECT statement first to check whether there are any matches for the WHERE clause in the UPDATE statement.
Example:
from google.cloud.bigquery import dbapi as bq

def my_update_function(context, data):
    try:
        bq_connection = bq.connect()
        bq_cursor = bq_connection.cursor()
        bq_cursor.execute("select * from table where ID = 1")
        results = bq_cursor.fetchone()
        if results is None:
            print("Row not found.")
        else:
            bq_cursor.execute("UPDATE table set name = 'etc' where ID = 1")
        bq_connection.commit()
        bq_connection.close()
    except Exception as e:
        db_error = str(e)
So I've queried data from an Oracle database using cursor.execute(). A relatively simple SELECT query. It works.
But when I try to fetch data from it, Python crashes.
The same occurs for fetchall(), fetchmany() and fetchone().
When the query first broke in fetchmany() I decided to loop through fetchone(), and it worked for the first two rows, then broke at the third.
I'm guessing it is because there's too much data in the third row.
So, is there any way to bypass this issue and pull the data?
EDIT:
I removed four columns with type ROWID. There was no issue after that; I was easily able to fetch 100 rows in one go.
So to confirm my suspicion I went ahead and created another copy with only those ROWID columns, and it crashes as expected.
So is there any issue with the ROWID type?
Test data for the same table:
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvAEyAAB0HOAAM','AAAL0DAEzAAClz7AAN','AAAVeuABHAAA4vdAAH','ABoeo+AIVAAE6dKAAQ');
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvABQAABKo6AAI','AAAL0DAEzAAClz7AAO','AAAVeuABHAAA4vdAAH','ABoeo+AIVAAE6dKAAQ');
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvABQAABKo6AAG','AAAL0DAEzAAClz7AAP','AAAVeuABHAAA4vdAAH','ABoeo+AHIAAN+OIAAM');
Insert into TEST_FOR_CX_ORACLE (Z$OEX0_LINES,Z$OEX0_ORDER_INVOICES,Z$OEX0_ORDERS,Z$ITEM_ROWID) values ('ABoeqvAEyAAB0HOAAK','AAAL0DAEzAACl0EAAC','AAAVeuABHAAA4vdAAH','ABoeo+AHIAAN+OIAAM');
Script:
from cx_Oracle import makedsn, connect, Cursor
from pandas import read_sql_table, DataFrame, Series
from time import time

def create_conn(host_link, port, service_name, user_name, password):
    dsn = makedsn(host_link, port, service_name=service_name)
    return connect(user=user_name, password=password, dsn=dsn)

def initiate_connection(conn):
    try:
        dbconnection = create_conn(*conn)
        print('Connected to ' + conn[2] + ' !')
    except Exception as e:
        print(e)
        dbconnection = None
    return dbconnection

def execute_query(query, conn):
    dbconnection = initiate_connection(conn)
    try:
        cursor = dbconnection.cursor()
        print('Cursor Created!')
        return cursor.execute(query)
    except Exception as e:
        print(e)
        return None

start_time = time()
query = '''SELECT * FROM test_for_cx_oracle'''

try:
    cx_read_query = execute_query(query, ecspat_c)  # ecspat_c: tuple of (host, port, service_name, user, password)
    time_after_execute_query = time()
    print('Query Executed')
    columns = [i[0] for i in cx_read_query.description]
    time_after_getting_columns = time()
except Exception as e:
    print(e)

print(time_after_execute_query - start_time, time_after_getting_columns - time_after_execute_query)
Unfortunately, this is a bug in the Oracle Client libraries. You will see it if you attempt to fetch the same rowid value multiple times in consecutive rows. If you avoid that situation all is well. You can also set the environment variable ORA_OCI_NO_OPTIMIZED_FETCH to the value 1 before you run the query to avoid the problem.
This has been reported earlier here: https://github.com/oracle/python-cx_Oracle/issues/120
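For completeness, a minimal sketch of the environment-variable workaround, with placeholder credentials and the test table from the question (the variable is set before the connection is opened so the Oracle Client libraries pick it up before the query runs):
import os
import cx_Oracle

# Per the workaround above: set before the query is executed
os.environ["ORA_OCI_NO_OPTIMIZED_FETCH"] = "1"

# Placeholder connection details
conn = cx_Oracle.connect(user="user", password="password", dsn="host:1521/service")
cur = conn.cursor()
cur.execute("SELECT * FROM test_for_cx_oracle")
for row in cur.fetchall():
    print(row)
conn.close()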
I have the following question: I want to set up a routine that iterates over a dataframe (pandas) to extract longitude and latitude data after supplying the address, using the geopy library.
The routine I created was:
import os
import time

import pandas as pd
from geopy.geocoders import GoogleV3

arquivo = pd.ExcelFile('path')
df = arquivo.parse("Table1")

def set_proxy():
    proxy_addr = 'http://{user}:{passwd}@{address}:{port}'.format(
        user='usuario', passwd='senha',
        address='IP', port=int('PORTA'))
    os.environ['http_proxy'] = proxy_addr
    os.environ['https_proxy'] = proxy_addr

def unset_proxy():
    os.environ.pop('http_proxy')
    os.environ.pop('https_proxy')

set_proxy()

geo_keys = ['<GOOGLE_API_KEY>']  # Google API key (redacted)
geolocator = GoogleV3(api_key=geo_keys)

for index, row in df.iterrows():
    location = geolocator.geocode(row['NO_LOGRADOURO'], timeout=10)
    time.sleep(2)
    lat = location.latitude
    lon = location.longitude
    address = location.address
    unset_proxy()
    print(str(lat) + ', ' + str(lon))
The problem I'm having is that when I run the code, the following error is thrown:
GeocoderQueryError: Your request was denied.
I tried creating the geolocator without passing the Google API key; however, then I get the following message:
KeyError: 'http_proxy'
and if I remove the unset_proxy() call from inside the for loop, the message I receive is:
GeocoderQuotaExceeded: The given key has gone over the requests limit in the 24 hour period or has submitted too many requests in too short a period of time.
But I only made 5 requests today, and I'm putting a 2-second sleep between requests. Should the waiting period be longer?
Any ideas?
The api_key argument of the GoogleV3 class must be a string, not a list of strings (that's the cause of your first issue).
geopy doesn't guarantee that the http_proxy/https_proxy env vars are respected (especially runtime modifications of os.environ). The usage of proxies advised by the docs is:
geolocator = GoogleV3(proxies={'http': proxy_addr, 'https': proxy_addr})
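Putting both points together, a minimal sketch of the corrected construction might look like this (the key, proxy address and sample address are placeholders):
from geopy.geocoders import GoogleV3

proxy_addr = 'http://usuario:senha@IP:PORTA'  # placeholder proxy URL
geolocator = GoogleV3(
    api_key='YOUR_NEW_API_KEY',               # a single string, not a list
    proxies={'http': proxy_addr, 'https': proxy_addr},
)
location = geolocator.geocode('1600 Amphitheatre Parkway, Mountain View, CA', timeout=10)
print(location.latitude, location.longitude)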
PS: Please don't ever post your API keys publicly. I suggest revoking the key you've posted in the question and generating a new one, to prevent it from being abused by someone else.