Insert a table from Excel into MongoDB and query it

I have this table in Excel (shown in fig 1) that I want to insert into MongoDB. I am using the following code, which gives me the result in fig 2, but that is not enough because I cannot query any value:
import pandas as pd
from pymongo import MongoClient

df = pd.read_excel(r'path.xlsx')

# Connect to the MongoDB instance
client = MongoClient("mongodb://localhost:27017/")
db = client['test']
mycollection = db['mycollection']

# Convert the DataFrame to a list of documents; keys must be strings
data = df.to_dict('records')
data = [{str(k): v for k, v in row.items()} for row in data]
mycollection.insert_many(data)
I can use a for loop to set all the row headers as the ObjectID and then run a query, but I am looking for an easier way to do this.
So either help me query a value by row header based on the code I have,
OR
suggest a better way to insert the table into MongoDB that supports querying on an '_id' field set to the first value of each row.
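One simple approach, sketched below, is to rename the first column to `_id` before building the documents, so MongoDB keys each document by its row header instead of generating an ObjectId. The column name `Row` and the sample values here are placeholders standing in for the Excel sheet in fig 1:

```python
import pandas as pd

# Placeholder data standing in for the Excel sheet from fig 1
df = pd.DataFrame({'Row': ['r1', 'r2'], 'val1': [10, 20], 'val2': [30, 40]})

# Use the first column as MongoDB's _id so each row is keyed by its header
df = df.rename(columns={df.columns[0]: '_id'})
docs = [{str(k): v for k, v in row.items()} for row in df.to_dict('records')]
print(docs[0])  # {'_id': 'r1', 'val1': 10, 'val2': 30}

# With a live server you could then insert and query by row header:
# from pymongo import MongoClient
# coll = MongoClient("mongodb://localhost:27017/")['test']['mycollection']
# coll.insert_many(docs)
# coll.find_one({'_id': 'r1'})
```

Note that `_id` values must be unique, so this only works if the first column has no duplicates.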

Related

How can I convert from SQLite3 format to dictionary

How can I convert my SQLite3 table to a Python dictionary, where the column names and values of the table become the keys and values of the dictionary?
I have made a package to solve this issue, in case anyone runs into the same problem:
aiosqlitedict
Here is what it can do:
Easy conversion between an sqlite table and a Python dictionary, and vice-versa.
Get the values of a certain column in a Python list.
Order your list ascending or descending.
Insert any number of columns into your dict.
Getting Started
We start by connecting to our database, along with the reference column:
from aiosqlitedict.database import Connect
countriesDB = Connect("database.db", "user_id")
Make a dictionary
The dictionary should be inside an async function.
async def some_func():
    countries_data = await countriesDB.to_dict("my_table_name", 123, "col1_name", "col2_name", ...)
You can pass any number of columns, or you can get all of them by specifying the column name as '*':
countries_data = await countriesDB.to_dict("my_table_name", 123, "*")
So, now that you have made some changes to your dictionary, you want to export it back to SQL?
Convert dict to sqlite table
async def some_func():
    ...
    await countriesDB.to_sql("my_table_name", 123, countries_data)
But what if you want a list of values for a specific column?
Select method
You can get a list of all values of a certain column:
country_names = await countriesDB.select("my_table_name", "col1_name")
To limit your selection, use the limit parameter:
country_names = await countriesDB.select("my_table_name", "col1_name", limit=10)
You can also arrange your list with the ascending parameter and/or the order_by parameter, specifying a column to order your list by:
country_names = await countriesDB.select("my_table_name", "col1_name", order_by="col2_name", ascending=False)
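If you cannot add a dependency, the same table-to-dictionary conversion can be done with the standard library alone. A minimal sketch using sqlite3.Row (the table, column names, and data here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows now support access by column name
conn.execute("CREATE TABLE my_table (user_id INTEGER, name TEXT)")
conn.execute("INSERT INTO my_table VALUES (123, 'Egypt')")

row = conn.execute("SELECT * FROM my_table WHERE user_id = ?", (123,)).fetchone()
data = dict(row)  # dict() works because sqlite3.Row exposes keys()
print(data)       # {'user_id': 123, 'name': 'Egypt'}
```

Setting row_factory on the connection means every cursor it produces returns sqlite3.Row objects, so the conversion is a one-liner per row.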

creating python pop function for sqlite3

I'm trying to create a pop function that gets a row of data from an sqlite database and deletes that same row. I would like to avoid creating an ID column, so I am using ROWID. I want to always get the first row and return it. This is the code I have:
import sqlite3

db = sqlite3.connect("Test.db")
c = db.cursor()

def sqlpop():
    c.execute("SELECT * from DATA WHERE ROWID=1")
    data = c.fetchall()
    c.execute("DELETE from DATA WHERE ROWID=1")
    db.commit()
    return data
When I call the function, it gets the first item correctly, but after the first call the function returns nothing, like this:
>>> sqlpop()
[(1603216325, 'placeholder IP line 124', 'placeholder Device line 124', '1,2,0', 1528, 1564)]
>>> sqlpop()
[]
>>> sqlpop()
[]
>>> sqlpop()
[]
what do I need to change for this function to work correctly?
Update: using what Schwern said, I got the function to work:

def sqlpop():
    c.execute("SELECT * from DATA ORDER BY ROWID LIMIT 1")
    data = c.fetchone()
    c.execute("DELETE from DATA ORDER BY ROWID LIMIT 1")
    db.commit()
    return data

(Note: DELETE with ORDER BY/LIMIT only works if SQLite was built with SQLITE_ENABLE_UPDATE_DELETE_LIMIT; a portable alternative is DELETE from DATA WHERE ROWID = (SELECT min(ROWID) from DATA).)
rowid is not the row order, it is a unique identifier for the row created by SQLite unless you say otherwise.
SQL rows have no inherent order. You could grab just one row...
select * from table limit 1;
But you'll get them in no guaranteed order. And without a rowid you have no way to identify it again to delete it.
If you want to get the "first" row you must define what "first" means. To do that you need something to order by. For example, a timestamp. Or perhaps an auto-incrementing integer. You cannot use rowid, rowids are not guaranteed to be assigned in any particular order.
select *
from table
order by created_at desc
limit 1;
So long as created_at is indexed, that should work fine. Then delete by its rowid.
You also don't need fetchall to fetch one row; use fetchone. In general, fetchall should be avoided, as it risks consuming all your memory by slurping in all the data at once. Instead, use iterators:

for row in c.execute(...):
    ...
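Putting this advice together, a pop that selects the oldest row (here ordered by a created_at column) and then deletes it by its rowid might look like the sketch below; the table layout and sample data are illustrative, not from the question:

```python
import sqlite3

db = sqlite3.connect(":memory:")
c = db.cursor()
c.execute("CREATE TABLE DATA (created_at INTEGER, ip TEXT)")
c.executemany("INSERT INTO DATA VALUES (?, ?)",
              [(100, "first"), (200, "second")])
db.commit()

def sqlpop():
    # Fetch the oldest row together with its rowid, then delete by rowid
    row = c.execute(
        "SELECT rowid, * FROM DATA ORDER BY created_at LIMIT 1").fetchone()
    if row is None:
        return None
    c.execute("DELETE FROM DATA WHERE rowid = ?", (row[0],))
    db.commit()
    return row[1:]  # drop the rowid from the returned tuple

first = sqlpop()
print(first)      # (100, 'first')
second = sqlpop()
print(second)     # (200, 'second')
print(sqlpop())   # None -- table is now empty
```

Deleting by the fetched rowid avoids repeating the ORDER BY/LIMIT in the DELETE, so it works on any SQLite build.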

slow(er) SELECTs on Postgres index in Python script

I've recently migrated a simple single-column DB (the column is indexed TEXT) from SQLite to PostgreSQL. This column has ~100m rows, and I use it simply to check whether the column contains a certain text value.
[please avoid recommending better options for this simple problem as i need to use PostgreSQL in future for another app requiring fast select queries anyway.]
The problem I'm having is that SELECT queries are around 8x slower than SQLite, at around 140k selects per minute, looping over the same text file.
I've simplified the code as much as possible (using psycopg2 library here, and omitted pvt info):
with open('data.txt', 'r') as f:
    cnt = 0
    conn = get_conn()
    c = conn.cursor()
    for line in f:
        c.execute('SELECT mycol from mytbl where mycol = %s', (line.strip(),))
        if c.fetchone():
            pass
        else:
            pass
        cnt += 1
print(cnt)
The sqlite3 test is the same, with the same single-indexed-column DB structure.
Some clues:
The DB is hosted on my local PC
The index is created implicitly using the primary key constraint
Query plan: Index Only Scan using mycol_pkey on mytable
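With an index-only scan already in use, much of the gap is usually per-statement round-trip and parse overhead rather than index speed. One common mitigation is to check many values per query instead of one. The sketch below demonstrates the batching idea with an in-memory SQLite table as a runnable stand-in; with psycopg2 against Postgres the equivalent single statement would be `c.execute('SELECT mycol FROM mytbl WHERE mycol = ANY(%s)', (chunk,))`:

```python
import sqlite3

# Stand-in for the Postgres table so the sketch is runnable
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytbl (mycol TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO mytbl VALUES (?)", [("a",), ("b",), ("c",)])

lines = ["a", "x", "c", "y"]  # stand-in for the lines of data.txt
found = set()
CHUNK = 2  # use a few thousand per round-trip in practice
for i in range(0, len(lines), CHUNK):
    chunk = lines[i:i + CHUNK]
    placeholders = ",".join("?" * len(chunk))
    rows = conn.execute(
        f"SELECT mycol FROM mytbl WHERE mycol IN ({placeholders})", chunk)
    found.update(r[0] for r in rows)

print(sorted(found))  # ['a', 'c']
```

Membership for each input line is then a cheap set lookup (`line in found`) instead of one network round-trip per line.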

Reading a database table without Pandas

I am trying to read a table from a HANA database in Python using SQLAlchemy library. Typically, I would use the Pandas package and use the pd.read_sql() method for this operation. However, for some reason, the environment I am using does not support the Pandas package. Therefore, I need to read the table without the Pandas library. So far, the following is what I have been able to do:
query = ('''SELECT * FROM "<schema_name>"."<table_name>"'''
         ''' WHERE <conditional_clauses>''')

with engine.connect() as con:
    table = con.execute(query)
    row = table.fetchone()
However, while this technique allows me to read the table row by row, I do not get the column names of the table.
How can I fix this?
Thanks
I do not get the column names of the table
You won't get the column names of the table but you can get the column names (or aliases) of the result set:
with engine.begin() as conn:
    row = conn.execute(sa.text("SELECT 1 AS foo, 2 AS bar")).fetchone()
    print(row.items())  # [('foo', 1), ('bar', 2)]
    #
    # or, for just the column names
    #
    print(row.keys())  # ['foo', 'bar']
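On SQLAlchemy 1.4+, where the row-level items()/keys() shown above are legacy, the same information is available from Result.keys() and the row's _mapping attribute. A runnable sketch using an in-memory SQLite engine as a stand-in for the HANA connection:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")  # in-memory stand-in for the HANA engine
with engine.begin() as conn:
    result = conn.execute(sa.text("SELECT 1 AS foo, 2 AS bar"))
    cols = list(result.keys())  # column names/aliases of the result set
    row = result.fetchone()
    print(cols)                 # ['foo', 'bar']
    print(dict(row._mapping))   # {'foo': 1, 'bar': 2}
```

dict(row._mapping) gives you a plain dictionary per row, which is the closest substitute for what pd.read_sql would have returned.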

How to insert value in already created Database table through pandas `df.to_sql()`

I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the database table, but it only creates the table; it does not insert the values, and it does not raise any error either.
I tried creating a new table through SQLAlchemy and inserting values, which worked, but it did not work for the already-created table.
import sys
import csv
import pandas as pd
from sqlalchemy import create_engine

conn, cur = create_conn()
engine = create_engine('postgresql://postgres:Shubham#123#localhost:5432/walmart')

create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()

file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"
data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False,
                   quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)
with open(file_name + '_bad_rows.txt', 'w') as f1:
    sys.stderr = f1
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
data.close()
I want to insert values from df.to_sql() into database table
Not 100% certain this applies to PostgreSQL, but I had a similar issue with MSSQL. .to_sql() already creates the table named by its first argument, new_table. if_exists='append' also doesn't check for duplicate values: if the data in new_file is overwritten, or your function runs again, it will just keep adding to the table. As for why you see the table name but no data in it, that might be due to the size of the df; try setting fast_executemany=True as the second argument of create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, you can use your actual SQL table and join against this staging table for duplicate testing. The non-duplicates can be written to the actual table, converting data types on UPDATE to match the table's data type structure.
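For reference, a minimal runnable sketch of appending to a pre-created table with to_sql, using an in-memory SQLite engine as a stand-in for the Postgres one (the two-column schema here is a cut-down placeholder); index=False keeps the DataFrame index from being written as an extra column:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in for the postgres engine

# Pre-create the table, as in the question
with engine.begin() as conn:
    conn.execute(text('create table if not exists new_table ("item_id" TEXT, "brand" TEXT)'))

df = pd.DataFrame({"item_id": ["1", "2"], "brand": ["a", "b"]})
df.to_sql("new_table", engine, if_exists="append", index=False)

with engine.connect() as conn:
    n = conn.execute(text("select count(*) from new_table")).scalar()
print(n)  # 2
```

For this to append into the existing table rather than create a parallel one, the DataFrame's column names must match the table's columns exactly.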
