slow(er) SELECTs on Postgres index in Python script

I've recently migrated a simple single-column DB (the column is an indexed TEXT field) from SQLite to PostgreSQL. The column has ~100m rows, and I simply use it to check whether it contains a certain text value.
[Please avoid recommending better options for this simple problem, as I need to use PostgreSQL in the future for another app requiring fast SELECT queries anyway.]
The problem I'm having is that SELECT queries are around 8x slower than in SQLite, at around 140k SELECTs per minute when looping over the same text file.
I've simplified the code as much as possible (using the psycopg2 library here, and omitting private info):
with open('data.txt', 'r') as f:
    cnt = 0
    conn = get_conn()
    c = conn.cursor()
    for line in f:
        c.execute('SELECT mycol from mytbl where mycol = %s', (line.strip(),))
        if c.fetchone():
            pass
        else:
            pass
        cnt += 1
print(cnt)
The sqlite3 test is the same, with the same single-indexed-column DB structure.
Some clues:
The DB is hosted on my local PC
The index is created implicitly using the primary key constraint
Query plan: Index Only Scan using mycol_pkey on mytable
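Not part of the original question, but a minimal sketch of one way to test whether per-statement round trips are the bottleneck: batch the lookups so each execute() covers many lines. psycopg2 adapts a Python list to a PostgreSQL array, so a single ANY(%s) parameter can carry a whole batch (this assumes the same get_conn(), table and column names as above, and a hypothetical batch size):

BATCH_SIZE = 1000  # hypothetical batch size

with open('data.txt', 'r') as f:
    conn = get_conn()
    c = conn.cursor()
    found = 0
    batch = []

    def lookup(values):
        # One round trip for the whole batch; returns how many of the values exist in mytbl.
        c.execute('SELECT mycol FROM mytbl WHERE mycol = ANY(%s)', (values,))
        return len(c.fetchall())

    for line in f:
        batch.append(line.strip())
        if len(batch) >= BATCH_SIZE:
            found += lookup(batch)
            batch = []
    if batch:
        found += lookup(batch)

print(found)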

Related

mariadb python - executemany using SELECT

I'm trying to insert many rows into a table in a MariaDB database.
To do this I want to use executemany() to increase speed.
The inserted row depends on another table, which is looked up with a SELECT.
I have found statements that SELECT doesn't work in an executemany().
Are there other ways to solve this problem?
import mariadb

connection = mariadb.connect(host=HOST, port=PORT, user=USER, password=PASSWORD, database=DATABASE)
cursor = connection.cursor()
query = """INSERT INTO [db].[table1] ([col1], [col2], [col3])
           VALUES ((SELECT [colX] from [db].[table2] WHERE [colY]=? and
                    [colZ]=(SELECT [colM] from [db].[table3] WHERE [colN]=?)), ?, ?)
           ON DUPLICATE KEY UPDATE
               [col2]=?,
               [col3]=?;"""
values = [input_tuplets]
cursor.executemany(query, values)
When running the code I get the same value for [col1] (the SELECT statement) for every row, which corresponds to the values from the first tuplet.
If SELECT doesn't work in an executemany(), is there another workaround for what I'm trying to do?
Thanks a lot!
I think the workaround is to read out the tables needed,
do the lookup in Python,
and use executemany() to insert all the data.
It will require 2 more queries (to read the two tables) but should be OK when it comes to calculation time.
Thanks for your first question on Stack Overflow, which identified a bug in MariaDB Server.
Here is a simple script to reproduce the problem:
CREATE TABLE t1 (a int);
CREATE TABLE t2 LIKE t1;
INSERT INTO t2 VALUES (1),(2);
Python:
>>> cursor.executemany("INSERT INTO t1 VALUES ((SELECT a FROM t2 WHERE a=?))", [(1,), (2,)])
>>> cursor.execute("SELECT a FROM t1")
>>> cursor.fetchall()
[(1,), (1,)]
I have filed an issue in MariaDB Bug tracking system.
As a workaround, I would suggest reading the country table once into a dictionary (according to Wikipedia there are 195 different countries) and using these values instead of a subquery.
e.g.
countries = {}
cursor.execute("SELECT country, id FROM countries")
for row in cursor:
    countries[row[0]] = row[1]
and then in executemany
cursor.executemany("INSERT INTO region (region,id_country) values ('sounth', ?)", [(countries["fra"],) (countries["ger"],)])

creating python pop function for sqlite3

I'm trying to create a pop function that gets a row of data from a SQLite database and deletes that same row. I would like to avoid having to create an ID column, so I am using ROWID. I want to always get the first row and return it. This is the code I have:
import sqlite3

db = sqlite3.connect("Test.db")
c = db.cursor()

def sqlpop():
    c.execute("SELECT * from DATA WHERE ROWID=1")
    data = c.fetchall()
    c.execute("DELETE from DATA WHERE ROWID=1")
    db.commit()
    return(data)
When I call the function it gets the first item correctly, but after the first call the function returns nothing, like this:
>>> sqlpop()
[(1603216325, 'placeholder IP line 124', 'placeholder Device line 124', '1,2,0', 1528, 1564)]
>>> sqlpop()
[]
>>> sqlpop()
[]
>>> sqlpop()
[]
What do I need to change for this function to work correctly?
Update:
Using what Schwern said, I got the function to work:
def sqlpop():
    c.execute("SELECT * from DATA ORDER BY ROWID LIMIT 1")
    data = c.fetchone()
    c.execute("DELETE from DATA ORDER BY ROWID LIMIT 1")
    db.commit()
    return data
rowid is not the row order; it is a unique identifier for the row, created by SQLite unless you say otherwise.
SQL rows have no inherent order. You could grab just one row...
select * from table limit 1;
But you'll get them in no guaranteed order. And without a rowid you have no way to identify it again to delete it.
If you want to get the "first" row you must define what "first" means. To do that you need something to order by. For example, a timestamp. Or perhaps an auto-incrementing integer. You cannot use rowid, rowids are not guaranteed to be assigned in any particular order.
select *
from table
where created_at = (select max(created_at) from table)
limit 1
So long as created_at is indexed, that should work fine. Then delete by its rowid.
You also don't need to use fetchall to fetch one row, use fetchone. In general, fetchall should be avoided as it risks consuming all your memory by slurping all the data in at once. Instead, use iterators.
for row in c.execute(...)
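Putting the answer's advice together, here is a minimal sketch of a pop function. It assumes (this is not in the original schema) that the table has an auto-incrementing INTEGER PRIMARY KEY column named id as its first column, which defines insertion order:

import sqlite3

db = sqlite3.connect("Test.db")
c = db.cursor()

def sqlpop():
    # Define "first" explicitly via the id column, and fetch a single row.
    c.execute("SELECT * FROM DATA ORDER BY id LIMIT 1")
    row = c.fetchone()
    if row is None:
        return None                      # table is empty
    # Delete exactly the row we just read, identified by its id.
    c.execute("DELETE FROM DATA WHERE id = ?", (row[0],))
    db.commit()
    return row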

How to count the number of rows from query with SQLAlchemy when no model is specified?

I'm looking for a way to count with SQLAlchemy the number of rows that are returned from a given query (that potentially includes filters), but everything I find on the net makes explicit use of a model (example here). My problem is that I don't have a model, I only have a Table object (because I'm dealing with temporary tables that vary in format from time to time). For the moment I can do the following:
tbl = Table(mytablename, metadata, autoload=True, autoload_with=myengine, schema=myschemaname)
query = select([tbl])
filters = build_filters(...)  # my function that builds filters
query = query.where(and_(*filters))
conn = myengine.connect()
ResultProxy = conn.execute(query)
totalCount = len(ResultProxy.fetchall())
but it's very inefficient. Is there a way to do the count efficiently and without referring to any model?
Try the SQLAlchemy Core count function documented here. I believe you can attach your filters onto that like you're doing now. So (not guaranteeing my syntax here, but here's something to start you with)...
query = select([func.count()]).select_from(my_table).where(and_(*filters))
conn = myengine.connect()
ResultProxy = conn.execute(query)
totalCount = ResultProxy.fetchone()[0]
According to the documentation, I believe this will generate a SELECT COUNT on the database side, rather than bringing all the rows back from the DB and counting them.
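A small follow-up, using the same objects as above (not from the original answer): the result object also has a scalar() method, so the single COUNT value can be pulled out without indexing into a row:

# Equivalent to fetchone()[0] for a one-column, one-row result.
totalCount = conn.execute(query).scalar()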

How to insert values into an already created database table through pandas `df.to_sql()`

I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the database table, but it only creates the table; it doesn't insert any values into it, and it doesn't give any type of error either.
I have tried creating a new table through SQLAlchemy and inserting values, which worked, but it didn't work for an already created table.
import csv
import sys
import pandas as pd
from sqlalchemy import create_engine

conn, cur = create_conn()
engine = create_engine('postgresql://postgres:Shubham#123#localhost:5432/walmart')

create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()

file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"

data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False,
                   quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)

with open(file_name + '_bad_rows.txt', 'w') as f1:
    sys.stderr = f1
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
    data.close()
I want to insert the values into the database table using df.to_sql().
I'm not 100% certain whether this argument works with PostgreSQL, but I had a similar issue when doing this on MSSQL. .to_sql() already creates the table named by its first argument, new_table. The if_exists='append' also doesn't check for duplicate values: if the data in new_file is overwritten, or run through your function again, it will just keep adding to the table. As to why you're seeing the table name but not the data in it, it might be due to the size of the df. Try setting fast_executemany=True as an argument of create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, you can use your actual SQL table and join against this staging table for duplicate testing. The non-duplicates can be written to the actual table, converting datatypes on UPDATE to match the table's data type structure.
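A minimal sketch of that suggestion, under stated assumptions: the column names come from the question's CREATE TABLE, while the staging-table name, connection URL, and chunk size are placeholders. Since the TSV has no header row, read_csv assigns the names and to_sql creates/appends the staging table, with explicit SQLAlchemy types passed via dtype:

import csv
import pandas as pd
from sqlalchemy import create_engine, types

engine = create_engine('postgresql://user:password@localhost:5432/walmart')  # placeholder URL

cols = ["item_id", "product_id", "abstract_product_id", "product_name",
        "product_type", "ironbank_category", "primary_shelf",
        "apparel_category", "brand"]

reader = pd.read_csv("new_file.txt", delimiter="\t", names=cols, header=None,
                     quoting=csv.QUOTE_NONE, dtype="unicode", chunksize=500000)

for chunk in reader:
    # to_sql creates the staging table on the first chunk and appends afterwards;
    # dtype forces TEXT columns instead of letting pandas guess the types.
    chunk.to_sql('new_table_staging', engine, if_exists='append', index=False,
                 dtype={c: types.Text() for c in cols})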

I would like to copy a sqlite3 db from memory to the hard drive. How can I do it? [duplicate]

This question already has answers here:
Moving back and forth between an on-disk database and a fast in-memory database?
(2 answers)
Closed 7 years ago.
I would like to copy a sqlite db from memory to the hard drive. How can I do it?
I tried this way:
import sqlite3

conn_phy = sqlite3.connect("phy.db")
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c_phy = conn_phy.cursor()  # cursor on the on-disk database
for row in c.execute("select * from table"):
    c_phy.execute('insert into tablephy (column1, column2, column3) values (?,?,?)',
                  row)  # phy db column count equals memory db column count
conn_phy.commit()
The sqlite3 Python module does not expose the sqlite3_backup_*() C functions. There is probably no direct way to save an in-memory database into a database file on disk.
An analog of old_db.iterdump() + new_db.executescript() could be used (not tested):
import sqlite3
old_db = sqlite3.connect(':memory:')
# here's code that works with old_db
# ...

# dump old database in the new one
with sqlite3.connect('test.db') as new_db:
    new_db.executescript("".join(old_db.iterdump()))
Based on #mouad's answer to the opposite question (loading from a file into an in-memory db).
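A note not in the original answer: newer versions of the sqlite3 module (Python 3.7+) do expose the backup API as Connection.backup(), which copies an in-memory database to a file on disk directly:

import sqlite3

mem_db = sqlite3.connect(':memory:')
# ... work with mem_db ...

# Python 3.7+: copy the in-memory database into an on-disk file.
with sqlite3.connect('phy.db') as disk_db:
    mem_db.backup(disk_db)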
