pandas.read_sql with sqlite is extremely slow - python-3.x

I'm using pandas.read_sql with an SQLite database and it is extremely slow.
I have a table with 800 rows and 49 columns (datatypes just TEXT and REAL) and it takes over 3 minutes to fetch the data from the database into the dataframe.
The DB file and the Python script are on the same machine and the same filesystem.
Is there any way to speed up pandas.read_sql?
This is the code fragment:
self.logger.info('{} - START read_sql: {}'.format(self.botid, table))
result = pd.read_sql("select * from {}".format(table), self.dbconn,
                     index_col=indexcolname)
self.logger.info('{} - END read_sql: {}'.format(self.botid, table))

I found the solution myself:
The problem was using the connection as an instance attribute: self.dbconn
If I always open a new connection and close it at the end, performance is absolutely no problem!
conn = self.create_connection(self.db_file)
self.logger.info('{} - START from sql: {}'.format(self.botid, table))
result = pd.read_sql("select * from {}".format(table), conn,
                     index_col=indexcolname)
self.logger.info('{} - END from sql: {}'.format(self.botid, table))
conn.close()
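For reference, a minimal sketch of the same per-call pattern with the connection lifetime made explicit via contextlib.closing; the helper name read_table and the plain sqlite3.connect call are illustrative, not the poster's exact code:

import sqlite3
from contextlib import closing

import pandas as pd

def read_table(db_file, table, indexcolname):
    # open a fresh connection just for this read and guarantee it is closed afterwards
    with closing(sqlite3.connect(db_file)) as conn:
        return pd.read_sql("select * from {}".format(table), conn,
                           index_col=indexcolname)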

Related

Write Pandas dataframe data to CSV file

I am trying to write a pipeline to bring Oracle database table data to AWS.
It only takes a few ms to fill the dataframe, but when I try to write the dataframe to a CSV file it takes more than 2 minutes to write 10,000 rows. In addition, one of the columns has a cx_Oracle LOB datatype.
I thought this meant that writing the data must take time, so I converted the data to categorical data, but then the operation takes more memory. Does anyone have any suggestions on how to optimize this process?
query = 'select * from tablename'
cursor.execute(query)
iter_idx = 0
while True:
    results = cursor.fetchmany()
    if not results:
        break
    iter_idx += 1
    df = pd.DataFrame(results)
    df.columns = field['source_field_names']
    rec_count = df.shape[0]
    t_rec_count += rec_count
    file = generate_micro_file()
    print('memory usage : \n', df.info(memory_usage='deep'))
    # sd = dd.from_pandas(df, npartitions=1)
    df.to_csv(file, encoding=str(encoding_type), header=False, index=False, escapechar='\\', chunksize=arraysize)
From the data access side, there is room for improvement by optimizing the fetching of rows across the network, either by:
passing a large num_rows value to fetchmany(); see the cx_Oracle doc on [Cursor.fetchmany()](https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#Cursor.fetchmany),
or increasing the value of Cursor.arraysize (a rough sketch of both follows).
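A rough sketch of that tuning applied to the original pandas flow, assuming an open cx_Oracle connection named connection, a query string, and a columns list (all placeholders here); each larger batch is appended to a single CSV instead of writing one file per batch:

import pandas as pd

with connection.cursor() as cursor:
    cursor.arraysize = 10000                # bigger batches mean fewer network round trips
    cursor.execute(query)
    first = True
    while True:
        rows = cursor.fetchmany(10000)      # explicit batch size for each fetchmany() call
        if not rows:
            break
        df = pd.DataFrame(rows, columns=columns)
        # write the header once, then append subsequent batches to the same file
        df.to_csv("output.csv", mode="w" if first else "a",
                  header=first, index=False)
        first = False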
Your question didn't explain enough about your LOB usage. See the sample return_lobs_as_strings.py for optimizing fetches.
See the cx_Oracle documentation Tuning Fetch Performance.
Is there a particular reason to spend the overhead of converting to a Pandas dataframe? Why not write directly using the csv module?
Maybe something like this:
import csv

with connection.cursor() as cursor:
    sql = "select * from all_objects where rownum <= 100000"
    cursor.arraysize = 10000
    with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
        writer = csv.writer(outputfile, lineterminator="\n")
        results = cursor.execute(sql)
        writer.writerows(results)
You should benchmark and choose the best solution.

Optimize inserting data to Cassandra database through Python driver

I'm trying to insert 150,000 generated rows into Cassandra using BATCH with the Python driver, and it takes approximately 30 seconds. What should I do to optimize this and insert the data faster?
Here is my code:
from cassandra.cluster import Cluster
from faker import Faker
import time

fake = Faker()
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()
session.default_timeout = 150
num = 0

def create_data():
    global num
    BATCH_SIZE = 1500
    BATCH_STMT = 'BEGIN BATCH'
    for i in range(BATCH_SIZE):
        BATCH_STMT += f" INSERT INTO tt(id, title) VALUES ('{num}', '{fake.name()}')"
        num += 1
    BATCH_STMT += ' APPLY BATCH;'
    prep_batch = session.prepare(BATCH_STMT)
    return prep_batch

tt = []
session.execute('USE ttest_2')
prep_batch = []
print("Start create data function!")
start = time.time()
for i in range(100):
    prep_batch.append(create_data())
end = time.time()
print("Time for create fake data: ", end - start)

start = time.time()
for i in range(100):
    session.execute(prep_batch[i])
    time.sleep(0.00000001)
end = time.time()
print("Time for execution insert into table: ", end - start)
The main problem is that you're using batches for inserting the data - in Cassandra, that's a bad practice (see the documentation for an explanation). Instead you need to prepare a query and insert the data one row at a time - this allows the driver to route each insert to a specific node, decreasing the load on that node, and makes data insertion faster. Pseudo-code would look like the following (see the Python driver docs for the exact syntax):
prep_statement = session.prepare("INSERT INTO tt(id, title) VALUES (?, ?)")
for your_loop:
    session.execute(prep_statement, [id, title])
Another problem is that you're using the synchronous API - this means that the driver waits until an insert happens and then fires the next one. To speed things up you need to use the asynchronous API instead (see the same doc for details). See the Developing applications with DataStax drivers guide for a list of best practices, etc.
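For illustration, a minimal sketch combining both suggestions (prepared statement plus concurrent execution) via the driver's execute_concurrent_with_args helper; the keyspace and table names follow the question, and the generated titles are simple placeholders rather than Faker names:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect('ttest_2')

# prepare once, bind per row - lets the driver route each insert to the right node
prepared = session.prepare("INSERT INTO tt(id, title) VALUES (?, ?)")

# 150,000 (id, title) pairs; id is text in the question's schema, hence str(i)
params = [(str(i), "name-{}".format(i)) for i in range(150000)]

# pipelines the inserts asynchronously, keeping up to `concurrency` requests in flight
execute_concurrent_with_args(session, prepared, params, concurrency=100)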
But really, if you just want to load the database with data, I recommend not reinventing the wheel, but instead either:
generate the data into a CSV file and load it into Cassandra using DSBulk, which is heavily optimized for loading data
use NoSQLBench to generate the data and populate Cassandra - it's also heavily optimized for data generation and loading (not only into Cassandra).

How does psycopg2 server side cursor operate when itersize is less than data size and fetch number is less than itersize?

I have read the documentation and several articles, posts, and threads, but I am not sure I understand this clearly. Let's suppose this scenario:
1. I have a server side cursor.
2. I set the itersize to 1000.
3. I execute a SELECT query which would normally return 10000 records.
4. I use fetchmany to fetch 100 records at a time.
My question is how is this done behind the scenes? My understanding is that the query is executed, but only 1000 of the records are read by the server-side cursor. The cursor refrains from reading the next 1000 until it scrolls past the last record of the currently read 1000. Furthermore, the server-side cursor holds the 1000 in the server's memory and scrolls over them 100 at a time, sending them to the client. I'm also curious what the RAM consumption would look like. By my understanding, if executing the full query takes 10000 KB of memory, the server-side cursor would consume only 1000 KB on the server because it reads only 1000 records at a time, and the client-side cursor would use 100 KB. Is my understanding correct?
UPDATE
Per the documents and the discussion we had in the responses, I would expect this code to print a list of 10 items at a time:
from psycopg2 import connect, extras as psg_extras

with connect(host="db_url", port="db_port", dbname="db_name",
             user="db_user", password="db_password") as db_connection:
    with db_connection.cursor(name="data_operator",
                              cursor_factory=psg_extras.DictCursor) as db_cursor:
        db_cursor.itersize = 10
        db_cursor.execute("SELECT rec_pos FROM schm.test_data;")
        for i in db_cursor:
            print(i)
            print(">>>>>>>>>>>>>>>>>>>")
However, in each iteration it prints just one record. The only way I get 10 records is if I use fetchmany:
from psycopg2 import connect, extras as psg_extras

with connect(host="db_url", port="db_port", dbname="db_name",
             user="db_user", password="db_password") as db_connection:
    with db_connection.cursor(name="data_operator",
                              cursor_factory=psg_extras.DictCursor) as db_cursor:
        db_cursor.execute("SELECT rec_pos FROM schm.test_data;")
        records = db_cursor.fetchmany(10)
        while len(records) > 0:
            print(records)
            print(">>>>>>>>>>>>>>>>>>>")
            records = db_cursor.fetchmany(10)
Based on these two code snippets, what I'm guessing is happening in the scenario mentioned before is that, given the code below...
from psycopg2 import connect, extras as psg_extras

with connect(host="db_url", port="db_port", dbname="db_name",
             user="db_user", password="db_password") as db_connection:
    with db_connection.cursor(name="data_operator",
                              cursor_factory=psg_extras.DictCursor) as db_cursor:
        db_cursor.itersize = 1000
        db_cursor.execute("SELECT rec_pos FROM schm.test_data;")
        records = db_cursor.fetchmany(100)
        while len(records) > 0:
            print(records)
            print(">>>>>>>>>>>>>>>>>>>")
            records = db_cursor.fetchmany(100)
... itersize is a server side thing. What it does is that when the query runs, it sets a limit to load only 1000 records from the database. But fetchmany is a client side thing. It gets 100 of the 1000 from the server. Each time fetchmany runs, the next 100 is fetched from the server. When all the 1000 on the server side are scrolled over, the next 1000 are fetched from the DB on the server side. But I'm rather confused because that does not seem to be what the docs imply. But then again... the code seems to imply that.
I would spend some time here: Server side cursors.
What you will find is that itersize only applies when you are iterating over a cursor:
for record in cur:
    print(record)
Since you are using fetchmany(size=100) you will only be working with 100 rows at a time. The server will not be holding 1000 rows in memory. I was wrong, sort of: if a named cursor is not used, the cursor will return all the rows to the client in memory and then fetchmany() will pull rows from there in the batch size specified. If a named cursor is used, then it will fetch from the server in the batch size.
UPDATE: showing how itersize and fetchmany() work.
Using itersize and fetchmany() with a named cursor:
cur = con.cursor(name='cp')
cur.itersize = 10
cur.execute("select * from cell_per")
for rs in cur:
    print(rs)
cur.close()
#Log
statement: DECLARE "cp" CURSOR WITHOUT HOLD FOR select * from cell_per
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: CLOSE "cp"
cur = con.cursor(name='cp')
cur.execute("select * from cell_per")
cur.fetchmany(size=10)
#Log
statement: DECLARE "cp" CURSOR WITHOUT HOLD FOR select * from cell_per
statement: FETCH FORWARD 10 FROM "cp"
Using fetchmany with unnamed cursor:
cur = con.cursor()
cur.execute("select * from cell_per")
rs = cur.fetchmany(size=10)
len(rs)
10
#Log
statement: select * from cell_per
So the named cursor fetches the rows (from the server) in batches set by itersize when iterated over, or by size when using fetchmany(size=n), whereas a non-named cursor pulls all the rows into client memory and then fetches them from there according to the size set in fetchmany(size=n).
Further Update.
itersize only applies when you are iterating over the cursor object itself:
cur = con.cursor(name="cp")
cur.itersize = 10
cur.execute("select * from cell_per")
for r in cur:
    print(r)
cur.close()
#Postgres log:
statement: DECLARE "cp" CURSOR WITHOUT HOLD FOR select * from cell_per
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: FETCH FORWARD 10 FROM "cp"
statement: CLOSE "cp"
In the above, r will be a single row fetched from each batch of 10 rows that the server-side (named) cursor returns. That batch size is equal to itersize. So when you iterate over the named cursor object itself, all the rows the query specifies will be returned by the iterator, just fetched in batches of itersize.
Not iterating over named cursor object. Using fetchmany(size=n):
cur = con.cursor(name="cp")
cur.itersize = 10
cur.execute("select * from cell_per")
cur.fetchmany(size=20)
cur.fetchmany(size=20)
cur.close()
#Postgres log:
statement: DECLARE "cp" CURSOR WITHOUT HOLD FOR select * from cell_per
statement: FETCH FORWARD 20 FROM "cp"
statement: FETCH FORWARD 20 FROM "cp"
CLOSE "cp"
The itersize was set but it has no effect, as the named cursor object is not being iterated over. Instead, fetchmany(size=20) has the server-side cursor send a batch of 20 records each time it is called.

How to speed up my sql index query in Python?

I have a SELECT query with an index hint in Python that I want to run, but when I run it, it is really slow; there are about 59,152 records in the index. Any reason why it runs so slow? I have a whole file, but this is the part where it slows down.
import pymssql

DbConnect = '********'
myDbConn = pymssql.connect("******", "*******", "**************", DbConnect)
cursor = myDbConn.cursor(as_dict=True)
cursor.execute("""select * from s20data with(INDEX(Storesort)) ;""")
s20data_rows = cursor.fetchall() or []
for s20data_rec in s20data_rows:
    storeid = s20data_rec['storeid']
    ssn = s20data_rec['ssn']
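As a rough sketch only (assuming only storeid and ssn are actually needed downstream), selecting just those columns and iterating the cursor, instead of calling fetchall(), avoids materializing all 59,152 wide rows in memory at once:

import pymssql

myDbConn = pymssql.connect("server", "user", "password", "database")  # placeholder credentials
cursor = myDbConn.cursor(as_dict=True)
# fetch only the columns that are used, streaming rows instead of fetchall()
cursor.execute("""select storeid, ssn from s20data with(INDEX(Storesort));""")
for s20data_rec in cursor:
    storeid = s20data_rec['storeid']
    ssn = s20data_rec['ssn']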

Azure SQL Data Warehouse hanging or not responding to simple query after large BCP operation

I have a preview version of Azure SQL Data Warehouse running which was working fine until I imported a large table (~80 GB) through BCP. Now all the tables, including the small ones, do not respond even to a simple query:
select * from <MyTable>
Queries to sys tables still work:
select * from sys.objects
The BCP process was left running over the weekend, so any statistics update should have been done by now. Is there any way to figure out what is making this happen? Or at least to see what is currently running, to check whether anything is blocking?
I'm using SQL Server Management Studio 2014 to connect to the Data Warehouse and execute queries.
@user5285420 - run the code below to get a good view of what's going on. You should be able to find the query easily by looking at the value in the "command" column. Can you confirm whether the BCP command still shows status = 'Running' when the query steps are all complete?
select top 50
    (case when requests.status = 'Completed' then 100
          when progress.total_steps = 0 then 0
          else 100 * progress.completed_steps / progress.total_steps end) as progress_percent,
    requests.status,
    requests.request_id,
    sessions.login_name,
    requests.start_time,
    requests.end_time,
    requests.total_elapsed_time,
    requests.command,
    errors.details,
    requests.session_id,
    (case when requests.resource_class is NULL then 'N/A'
          else requests.resource_class end) as resource_class,
    (case when resource_waits.concurrency_slots_used is NULL then 'N/A'
          else cast(resource_waits.concurrency_slots_used as varchar(10)) end) as concurrency_slots_used
from sys.dm_pdw_exec_requests AS requests
join sys.dm_pdw_exec_sessions AS sessions
    on (requests.session_id = sessions.session_id)
left join sys.dm_pdw_errors AS errors
    on (requests.error_id = errors.error_id)
left join sys.dm_pdw_resource_waits AS resource_waits
    on (requests.resource_class = resource_waits.resource_class)
outer apply (
    select count (steps.request_id) as total_steps,
           sum (case when steps.status = 'Complete' then 1 else 0 end) as completed_steps
    from sys.dm_pdw_request_steps steps
    where steps.request_id = requests.request_id
) progress
where requests.start_time >= DATEADD(hour, -24, GETDATE())
ORDER BY requests.total_elapsed_time DESC, requests.start_time DESC
Check out the resource utilization and possibly other issues from https://portal.azure.com/
You can also run sp_who2 from SSMS to get a snapshot of which threads are active and whether there's some crazy blocking chain that's causing problems.
