I am using psycopg2 in a Python script.
The script parses JSON files and puts them into a Postgres RDS database.
When a value is missing in the JSON file, the script is supposed to skip the specific column
(so it is supposed to insert a NULL value in the table, but instead it puts NaN).
Has anybody encountered this issue?
The part that checks if the column is empty -
if (str(df.loc[0][col]) == "" or df.loc[0][col] is None or str(df.loc[0][col]) == 'None' or str(df.loc[0][col]) == 'NaN' or str(df.loc[0][col]) == 'null'):
    df.drop(col, axis=1, inplace=True)
else:
    cur.execute("call mrr.add_column_to_table('{0}', '{1}');".format(table_name, col))
The insertion part -
def copy_df_to_sql(df, conn, table_name):
    if len(df) > 0:
        df_columns = list(df)
        # create ("col1","col2",...)
        columns = '","'.join(df_columns)
        # create VALUES(%s, %s, ...) -- one %s per column
        values = "VALUES({})".format(",".join(["%s" for _ in df_columns]))
        # create INSERT INTO table ("col1", ...) VALUES(%s, ...)
        emp = '"'
        insert_stmt = 'INSERT INTO mrr.{} ({}{}{}) {}'.format(table_name, emp, columns, emp, values)

        cur = conn.cursor()
        import psycopg2.extras
        psycopg2.extras.execute_batch(cur, insert_stmt, df.values)
        conn.commit()
        cur.close()
OK, so the reason this is happening is probably that pandas represents missing values as NaN,
so when I insert the DataFrame into the table it inserts pandas' missing-value marker, which is NaN rather than NULL.
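One common fix (not from the original post, just a hedged sketch) is to convert the NaN values to Python None before handing the rows to psycopg2, since psycopg2 maps None to SQL NULL:
# Replace pandas' NaN markers with None so psycopg2 inserts NULL instead of NaN.
# astype(object) keeps None from being coerced back to NaN in numeric columns.
df = df.astype(object).where(df.notna(), None)
After this conversion, df.values yields None for the missing cells and execute_batch sends them to Postgres as NULL.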
Code to create dataframe:
source_df = spark.createDataFrame(
[
("Jose", "BLUE"),
("lI", "BrOwN")
],
["name", "eye_color"]
)
I have written the following code to convert the 'eye_color' column to lowercase:
actual_df = source_df
for col_name in actual_df.columns if column == 'eye_color' else column for column in actual_df.columns:
actual_df = actual_df.withColumn(col_name, lower(col(col_name)))
I am getting the following error:
Cell In [26], line 2
for col_name in actual_df.columns if column == 'eye_color' else column for column in actual_df.columns:
^
SyntaxError: invalid syntax
This is more a Python problem than a Spark problem: your Python syntax is not correct.
If you want to keep the same structure, that is make a transformation for each column that matches some criteria, there are multiple ways to do it:
# using an if
for col_name in actual_df.columns:
    if col_name == 'eye_color':
        actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

# using filter
for col_name in filter(lambda column: column == 'eye_color', actual_df.columns):
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

# using a list comprehension
for col_name in [column for column in actual_df.columns if column == 'eye_color']:
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))
But in your situation, as mentioned in one comment, since you only transform one column I would not use a loop. A single withColumn would do the trick.
source_df.withColumn('eye_color', lower(col('eye_color')))
I'm using the hdbcli package to load data from SAP HANA.
Problem: When loading data, I only get the value rows without the actual headers of the SQL table.
When I load only 3 columns (as below), I can add the headers manually, even though that is very ugly. This becomes impossible when I execute a SELECT * statement, as I really don't want to add them manually and might not know when the table changes.
Question: Is there a flag / command to get the column headers from a table?
Code-MRE:
from hdbcli import dbapi

# Initialize your connection
conn = dbapi.connect(
    address='00.0.000.00',
    port='39015',
    user='User',
    password='Password',
    encrypt=True,
    sslValidateCertificate=False
)
cursor = conn.cursor()
sql_command = "select TITLE, FIRSTNAME, NAME from HOTEL.CUSTOMER;"
cursor.execute(sql_command)
rows = cursor.fetchall()  # returns only the data, not the column headers

for row in rows:
    for col in row:
        print("%s" % col, end=" ")
    print(" ")

cursor.close()
conn.close()
Thanks to @astentx's comment, I found a solution:
cursor = conn.cursor()
sql_command = "select TITLE, FIRSTNAME, NAME from HOTEL.CUSTOMER;"
cursor.execute(sql_command)
rows = cursor.fetchall()  # returns only the data, not the column headers
column_headers = [i[0] for i in cursor.description]  # get column headers
cursor.close()
conn.close()

result = [column_headers]  # insert the header row
for row in rows:  # insert the data rows
    current_row = []
    for cell in row:
        current_row.append(cell)
    result.append(current_row)
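If the end result is going to be tabular anyway, a shorter variant (just a sketch, assuming pandas is acceptable in this context) reuses the rows and column_headers from above to build a DataFrame directly:
import pandas as pd

# Build a DataFrame from the fetched rows, labelled with the captured headers.
df = pd.DataFrame(rows, columns=column_headers)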
I'm trying to delete a row from my PySimpleGUI table that will also delete the same row's data from my sqlite3 database. Using events, I've tried to use the index, e.g. -TABLE- {'-TABLE-': [1]}, to get the row position from values['-TABLE-'] like so:
if event == 'Delete':
    row_index = 0
    for num in values['-TABLE-']:
        row_index = num + 1

    c.execute('DELETE FROM goals WHERE item_id = ?', (row_index,))
    conn.commit()
    window.Element('-TABLE-').Update(values=get_table_data())
I realized that this wouldn't work since I'm using a ROW_ID in my database that auto-increments with every new row of data and stays fixed, like so (this is just to show how my database is set up):
conn = sqlite3.connect('goals.db')
c = conn.cursor()
c.execute('''CREATE TABLE goals (item_id INTEGER PRIMARY KEY, goal_name text, goal_type text)''')
conn.commit()
conn.close()
Is there a way to use the index (values['-TABLE-']) to find the data inside the selected row in PySimpleGUI and then use the selected row's data to find and delete the row in my sqlite3 database, or is there any other way of doing this that I'm not aware of?
////////////////////////////////////////
FIX:
Upon more reading of the docs I discovered a .get() method. It returns a nested list of all table rows and is callable on the '-TABLE-' element. Using values['-TABLE-'] I can find the row index, then use .get() to index the specific list holding the data I want to delete.
Here is the edited code that made it work for me:
if event == 'Delete':
    row_index = 0
    for num in values['-TABLE-']:
        row_index = num

    # Returns a nested list of all table rows
    all_table_vals = window.element('-TABLE-').get()
    # Index the selected row
    object_name_deletion = all_table_vals[row_index]
    # [0] to index the goal_name of the selected row
    selected_goal_name = object_name_deletion[0]

    c.execute('DELETE FROM goals WHERE goal_name = ?', (selected_goal_name,))
    conn.commit()
    window.Element('-TABLE-').Update(values=get_table_data())
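A related design note: deleting by goal_name breaks if two goals ever share a name. A hedged alternative sketch, assuming get_table_data() returns item_id as the first column of each row, deletes by the primary key instead:
if event == 'Delete':
    all_table_vals = window.Element('-TABLE-').get()
    for row_index in values['-TABLE-']:
        # item_id kept as the first column of the table data
        selected_item_id = all_table_vals[row_index][0]
        c.execute('DELETE FROM goals WHERE item_id = ?', (selected_item_id,))
    conn.commit()
    window.Element('-TABLE-').Update(values=get_table_data())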
Here is a small example to delete a row from a table:
import sqlite3

def deleteRecord():
    try:
        sqliteConnection = sqlite3.connect('SQLite_Python.db')
        cursor = sqliteConnection.cursor()
        print("Connected to SQLite")

        # Deleting a single record now
        sql_delete_query = """DELETE from SqliteDb_developers where id = 6"""
        cursor.execute(sql_delete_query)
        sqliteConnection.commit()
        print("Record deleted successfully")
        cursor.close()
    except sqlite3.Error as error:
        print("Failed to delete record from sqlite table", error)
    finally:
        if (sqliteConnection):
            sqliteConnection.close()
            print("the sqlite connection is closed")

deleteRecord()
In your case, id will be the name of any column that has a unique value for every row in the table of the database.
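As a small follow-up sketch, the hard-coded id = 6 above can be parameterized so the same query works for any row (developer_id is a hypothetical variable, not part of the original answer):
developer_id = 6  # hypothetical value chosen by the caller
cursor.execute("DELETE FROM SqliteDb_developers WHERE id = ?", (developer_id,))
sqliteConnection.commit()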
I am able to get the column names and table name using sqlparse, but only for simple SELECT statements.
Can somebody help with how to get the column names and table name from more complex SQL statements?
Here is a solution for extracting column names from complex SQL SELECT statements (Python 3.9).
import sqlparse

def get_query_columns(sql):
    stmt = sqlparse.parse(sql)[0]
    columns = []
    column_identifiers = []

    # get column identifiers
    in_select = False
    for token in stmt.tokens:
        if isinstance(token, sqlparse.sql.Comment):
            continue
        if str(token).lower() == 'select':
            in_select = True
        elif in_select and token.ttype is None:
            for identifier in token.get_identifiers():
                column_identifiers.append(identifier)
            break

    # get column names
    for column_identifier in column_identifiers:
        columns.append(column_identifier.get_name())

    return columns
def test():
    sql = '''
    select
        a.a,
        replace(coalesce(a.b, 'x'), 'x', 'y') as jim,
        a.bla as sally  -- some comment
    from
        table_a as a
    where
        c > 20
    '''
    print(get_query_columns(sql))

test()
# outputs: ['a', 'jim', 'sally']
This is how you print the table name with sqlparse:
1) Using a SELECT statement
>>> import sqlparse
>>> parse = sqlparse.parse('select * from dbo.table')
>>> print([str(t) for t in parse[0].tokens if t.ttype is None][0])
dbo.table
(OR)
2) Using an INSERT statement:
def extract_tables(sql):
    """Extract the table names from an SQL statement.

    Returns a list of (schema, table, alias) tuples.
    """
    # Note: extract_from_part and extract_table_identifiers are helper
    # generators defined alongside this function (not shown here).
    parsed = sqlparse.parse(sql)
    if not parsed:
        return []

    # INSERT statements must stop looking for tables at the first piece of
    # punctuation, e.g.: INSERT INTO abc (col1, col2) VALUES (1, 2)
    # abc is the table name, but if we don't stop at the first lparen, then
    # we'll identify abc, col1 and col2 as table names.
    insert_stmt = parsed[0].token_first().value.lower() == "insert"
    stream = extract_from_part(parsed[0], stop_at_punctuation=insert_stmt)
    return list(extract_table_identifiers(stream))
The column names may be tricky because column names can be ambiguous or even derived. However, you can get the column names, sequence and type from virtually any query or stored procedure.
Until the FROM keyword is encountered, all the column names are fetched.
import sqlparse
from sqlparse.sql import Identifier, IdentifierList
from sqlparse.tokens import Keyword

def parse_sql_columns(sql):
    columns = []
    parsed = sqlparse.parse(sql)
    stmt = parsed[0]
    for token in stmt.tokens:
        if isinstance(token, IdentifierList):
            for identifier in token.get_identifiers():
                columns.append(str(identifier))
        if isinstance(token, Identifier):
            columns.append(str(token))
        if token.ttype is Keyword:  # from
            break
    return columns
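A quick usage sketch (the query below is a made-up example, not from the question):
sql = "SELECT id, name, created_at FROM users WHERE id > 10"
print(parse_sql_columns(sql))  # expected output: ['id', 'name', 'created_at']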
I'm writing a function to automatically check the number of null values per column in a data frame and then, if the number of nulls is less than or equal to 2000, drop the rows containing null values in that column.
I've written some code that successfully outputs the text strings marking which column it has analyzed:
def drop_na(df, cols):
    for i in cols:
        missing_vals = df[i].isnull().sum()
        if missing_vals <= 2000:
            df = df.dropna(subset=[i])
        print(f'finished checking column "{i}"')
    print('FINISHED ALL!')
I check whether the null-containing rows have been dropped with data.isnull().sum() after running the code successfully (where data is the name of my data frame), but the same null counts still exist in the columns.
I call the function with drop_na(data, data.columns)
It looks like you are only deleting the rows inside the function, so the caller's data frame never changes. Doing it in place solves the problem, as in the following code:
def drop_na(data):
    cols = data.columns
    subset = []
    # Determine which columns qualify, and store them in the `subset` list.
    for i in cols:
        missing_vals = data[i].isnull().sum()
        if missing_vals <= 2000:
            subset.append(i)
    # Now drop the null-containing rows for those columns at once, in place.
    data.dropna(subset=subset, inplace=True)
    print('FINISHED ALL!')
If you don't want to do it in place, return the data frame from the function and assign the returned value to a new variable, e.g. df2 = drop_na(data), as sketched below. Do not forget to re-index the new data frame if you need to.
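A minimal sketch of that return-based variant (assuming the same 2000-null threshold and the drop_na(data, data.columns) call from the question):
def drop_na(data, cols):
    # Columns whose null count is small enough that dropping their null rows is acceptable.
    subset = [c for c in cols if data[c].isnull().sum() <= 2000]
    # Return a new, re-indexed DataFrame instead of mutating the argument.
    return data.dropna(subset=subset).reset_index(drop=True)

data = drop_na(data, data.columns)  # reassign the result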