Updating column in database with new values from pandas dataframe - python-3.x

I am attempting to update a column in an Oracle database with new values that have been calculated and loaded into a pandas dataframe. The table in the database is protein_info and the column I want to update is pct. I get the following error when I run my code:
Traceback (most recent call last):
File "./update_nsaf.py", line 81, in
df.to_sql(protein_info, engine, index=False, if_exists='replace')
AttributeError: type object 'protein_info' has no attribute 'lower'
df = df[['id', 'pct']]
engine = create_engine('oracle://scott:tiger#localhost:5432/mydatabase', echo=False)
connect = engine.raw_connection()
df.to_sql(protein_info, engine, index=False, if_exists='replace')
sql = """
UPDATE protein_info
SET protein_info.pct = pct
FROM protein_info
WHERE protein_info.id = id
"""
connect.execute(sql)
connect.close()
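The immediate cause of the AttributeError is that to_sql expects the table name as a string, not a bare name, so it should be df.to_sql('protein_info', ...). Two further points worth checking: the # in the connection URL looks like it should be @, and Oracle does not support UPDATE ... FROM. A minimal sketch of one way to do the update (the staging-table name and the correlated UPDATE are assumptions on my part, not from the original post):

from sqlalchemy import create_engine, text

engine = create_engine('oracle://scott:tiger@localhost:5432/mydatabase', echo=False)

# df holds the recalculated id/pct values from the question
df = df[['id', 'pct']]
# table name passed as a string; write the new values to a staging table
df.to_sql('protein_info_stage', engine, index=False, if_exists='replace')

# Oracle has no UPDATE ... FROM, so use a correlated UPDATE against the staging table
sql = text("""
UPDATE protein_info p
SET p.pct = (SELECT s.pct FROM protein_info_stage s WHERE s.id = p.id)
WHERE EXISTS (SELECT 1 FROM protein_info_stage s WHERE s.id = p.id)
""")
with engine.begin() as conn:
    conn.execute(sql)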

Related

Can't query datetime column in SQLAlchemy with postgreSQL

I want to delete rows based on a datetime filter.
I created a table with a DateTime column (without timezone) using a script similar to this:
class VolumeInfo(Base):
    ...
    date: datetime.datetime = Column(DateTime, nullable=False)
Then I try to delete rows using a filter like this:
days_interval = 10
to_date = datetime.datetime.combine(
    datetime.datetime.utcnow().date(),
    datetime.time(0, 0, 0, 0),
).replace(tzinfo=None)
from_date = to_date - datetime.timedelta(days=days_interval)
query = delete(VolumeInfo).where(VolumeInfo.date < from_date)
Unexpectedly, sometimes there is no error, but sometimes there is the error:
Traceback (most recent call last):
...
File "script.py", line 381, in delete_volumes
db.execute(query)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1660, in execute
) = compile_state_cls.orm_pre_session_exec(
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1843, in orm_pre_session_exec
update_options = cls._do_pre_synchronize_evaluate(
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 2007, in _do_pre_synchronize_evaluate
matched_objects = [
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 2012, in <listcomp>
and eval_condition(state.obj())
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/evaluator.py", line 211, in evaluate
return operator(eval_left(obj), eval_right(obj))
TypeError: can't compare offset-naive and offset-aware datetimes
I am using Python 3.10 in Docker (image python:3.10-slim) and a PostgreSQL database with the psycopg2 driver.
I have already tried every option I could find, but this error still appears every once in a while.
How can I solve this? Or where did I make a mistake?
UPD1:
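One likely culprit, offered as an assumption rather than something from the original thread: with the default synchronize_session='evaluate', the ORM evaluates the WHERE clause in Python against VolumeInfo objects already in the session, so the error fires only when such an object happens to hold a tz-aware datetime while from_date is naive, which would also explain why it is intermittent. A minimal sketch that sidesteps the in-Python evaluation:

from sqlalchemy import delete

stmt = delete(VolumeInfo).where(VolumeInfo.date < from_date)
# 'fetch' selects the matching rows instead of evaluating the criteria
# in Python, so naive and aware datetimes are never compared in-process
db.execute(stmt, execution_options={"synchronize_session": "fetch"})
db.commit()

Alternatively, normalizing every datetime to naive UTC before it enters the session keeps the default evaluation safe.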

Decode an encoded value from DB using Python

I encoded values from an input file and inserted them into a SQLite DB:
cur.execute('''INSERT INTO Locations (address, geodata)
    VALUES ( ?, ? )''', (memoryview(address.encode()), memoryview(data.encode())))
Now I'm trying to decode it, but I'm getting:
Traceback (most recent call last):
File "return.py", line 9, in
print(c.decode('utf-8'))
AttributeError: 'tuple' object has no attribute 'decode'
My code looks like this:
import sqlite3

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()
cur.execute('SELECT address FROM Locations')
for c in cur:
    print(c.decode('utf-8'))
Regardless of how many columns you select, rows are returned as tuples. You get the first element of the tuple the usual way.
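A minimal sketch of the fixed loop (assuming address holds the bytes written via memoryview above):

for c in cur:
    print(c[0].decode('utf-8'))  # c is a 1-tuple of columns; index it before decoding

Tuple unpacking reads a little more clearly: for (address,) in cur: print(address.decode('utf-8')).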

get columns post group by in pyspark with dataframes

I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the solution from post1 I am running into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet:
df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")
On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
remove it like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
You called an action on that DataFrame and assigned the result to the df_agg variable; that's why your variable is NoneType (in Python) or Unit (in Scala): show() only prints rows and returns nothing.
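A minimal sketch of the corrected flow (same file and column names as in the question; spark is assumed to be an active SparkSession and func an alias for pyspark.sql.functions):

import pyspark.sql.functions as func

df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = (df.groupby("company")
            .agg(func.sum("raisedAmt").alias("TotalRaised"))
            .orderBy("TotalRaised", ascending=False))  # no .show(): keep the DataFrame
joinedDF = df.join(df_agg, "company")
joinedDF.show()  # call the action only when you want to print rows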

I am trying to read a csv file with pandas and then to search for a string in the first column, so I can use the whole row for calculations

I am reading a CSV file with pandas, and then I try to find a word like "Net income" in the first column. Then I want to use the whole row, which has the structure string/number/number/number/..., to do some calculations with the numbers.
The problem is that find is not working.
data = pd.read_csv(name)
data.str.find('Net income')
Traceback (most recent call last):
File "C:\Users\thoma\Desktop\python programme\manage.py", line 16, in <module>
data.str.find('Net income')
I am using CSV files from here: Income Statement for Deutsche Lufthansa AG (DLAKF) from Morningstar.com
I found this: Python | Pandas Series.str.find() - GeeksforGeeks
Traceback (most recent call last):
File "C:\Users\thoma\Desktop\python programme\manage.py", line 16, in <module>
data.str.find('Net income')
File "C:\Users\thoma\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
So, it works now. But I still have a question. After using the describe function with pandas I get this:
<bound method NDFrame.describe of 2014-12 615
2015-12 612
2016-12 636
2017-12 713
2018-12 736
Name: Goodwill, dtype: object>
I have problems using the data. How can I, for example, use the second column here? I tried to build a new table:
new_Table['Goodwill'] = data1['Goodwill'].describe
but this does not work.
I also would like to add more of these "second" columns to new_Table.
Hi, you should select the column first, like df['col name'].str.find(x); .str requires a Series, not a DataFrame.
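A minimal sketch of that idea, selecting the first column by position since the exact column name in the CSV is not shown:

# select the first column as a Series, then match on it
mask = data.iloc[:, 0].str.contains('Net income', na=False)
row = data[mask]  # the whole matching row(s), ready for calculations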
I recommend setting your header row if pandas isn't recognizing named rows in your CSV file.
Something like:
new_header = data.iloc[0]  # grab the first row for the header
data = data[1:]  # take the data less the header row
data.columns = new_header
From there you can summarize each column by name:
data['Net Income'].describe()
Note the parentheses: without them you get the bound method itself, which is the <bound method NDFrame.describe of ...> output you saw.
Edit: I looked at the csv file; I recommend reshaping the data before analyzing the columns. Something like...
data = data.transpose()  # flip the columns/rows
So in summation:
data = pd.read_csv(name)
data = data.transpose()  # flip the columns/rows
new_header = data.iloc[0]  # grab the first row for the header
data = data[1:]  # take the data less the header row
data.columns = new_header
data['Net Income'].describe()  # analyze

How to iterate over column names with PyTables?

I have a large matrix (15000 rows x 2500 columns) stored using PyTables and cannot see how to iterate over the columns of a row. In the documentation I only see how to access each column by name manually.
I have columns like:
ID
X20160730_Day10_123a_2
X20160730_Day10_123b_1
X20160730_Day10_123b_2
The ID column value is a string like '10692.RFX7' but all other cell values are floats. This selection works and I can iterate the rows of results, but I cannot see how to iterate over the columns and check their values:
from tables import *
import numpy

def main():
    h5file = open_file('carlo_seth.h5', mode='r', title='Three-file test')
    table = h5file.root.expression.readout
    condition = '(ID == b"10692.RFX7")'
    for row in table.where(condition):
        print(row['ID'].decode())
        for col in row.fetch_all_fields():
            print("{0}\t{1}".format(col, row[col]))
    h5file.close()

if __name__ == '__main__':
    main()
If I just iterate with "for col in row", nothing happens. With the code as above, I get a stack trace:
10692.RFX7
Traceback (most recent call last):
File "tables/tableextension.pyx", line 1497, in tables.tableextension.Row.__getitem__ (tables/tableextension.c:17226)
KeyError: b'10692.RFX7'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tables/tableextension.pyx", line 126, in tables.tableextension.get_nested_field_cache (tables/tableextension.c:2532)
KeyError: b'10692.RFX7'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./read_carlo_pytable.py", line 31, in <module>
main()
File "./read_carlo_pytable.py", line 25, in main
print("{0}\t{1}".format(col, row[col]))
File "tables/tableextension.pyx", line 1501, in tables.tableextension.Row.__getitem__ (tables/tableextension.c:17286)
File "tables/tableextension.pyx", line 133, in tables.tableextension.get_nested_field_cache (tables/tableextension.c:2651)
File "tables/utilsextension.pyx", line 927, in tables.utilsextension.get_nested_field (tables/utilsextension.c:8707)
AttributeError: 'numpy.bytes_' object has no attribute 'encode'
Closing remaining open files:carlo_seth.h5...done
fetch_all_fields() returns the field values, not the field names, so row[col] ends up indexing the row with a value like b'10692.RFX7', which is why the KeyError shows your ID string. You can access a column value by name in each row:
for row in table:
    print(row['X20160730_Day10_123a_2'])
Iterate over all columns:
names = table.coldescrs.keys()
for row in table:
    for name in names:
        print(name, row[name])
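Tying that back to the where() selection from the question (same file, condition, and names as above; table.colnames is PyTables' list of field names):

from tables import open_file

with open_file('carlo_seth.h5', mode='r') as h5file:
    table = h5file.root.expression.readout
    for row in table.where('(ID == b"10692.RFX7")'):
        print(row['ID'].decode())
        for name in table.colnames:  # iterate over names, not values
            if name != 'ID':
                print("{0}\t{1}".format(name, row[name]))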
