Getting NONE in the last row of dataframe when using pd.read_sql_query - python-3.x

I am trying to create a db using sqlite3. i created methods to read write delete and show table. however in order to view table in proper format on Command line, i decided to use pandas (pd.read_sql_query). However, when i do that i get None in the last row of the first column.
I tried writing the table to a csv and there was no none value there.
def show_table():
df = pd.read_sql_query("SELECT * FROM ticket_info", SQLITEDB.conn, index_col='resource_id')
print(df)
df.to_csv('hahaha.csv')
def fetch_from_db(query):
df = pd.read_sql_query('SELECT * FROM ticket_info WHERE {}'.format(query), SQLITEDB.conn, index_col='resource_id')
print(df)
here's the output as a picture.output image
Everything is correct but the last None value, where is it coming from? and how do i gt rid of it?

You are adding query as a variable. You might have a query that doesn't return any data from you table.

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
df = spark.sql('select * from Merchant_Segments_View limit 5')
return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: percentile_approx("x", 0.5) for x in df.columns if x is not exclustion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
try createGlobalTempView. It worked for me.
eg:
df.createGlobalTempView("people")
(Don't know the root cause why localTempView dose not work )
I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
*[ F.expr('percentile_approx(nullif('+col+',0), 0.5)').alias(col) for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.

creating python pop function for sqlite3

I'm trying to create a pop function getting a row of data from a sqlite database and deleting that same row. I would like to not have to create an ID column so I am using ROWID. I want to always get the first row and return it. This is the code I have:
import sqlite3
db = sqlite3.connect("Test.db")
c=db.cursor()
def sqlpop():
c.execute("SELECT * from DATA WHERE ROWID=1")
data = c.fetchall()
c.execute("DELETE from DATA WHERE ROWID=1")
db.commit()
return(data)
when I call the function it gets the first item correctly, but after the first call the function returns nothing. like this:
>>> sqlpop()
[(1603216325, 'placeholder IP line 124', 'placeholder Device line 124', '1,2,0', 1528, 1564)]
>>> sqlpop()
[]
>>> sqlpop()
[]
>>> sqlpop()
[]
what do I need to change for this function to work correctly?
update:
using what Schwern said I got the funtion to work:
def sqlpop():
c.execute("SELECT * from DATA ORDER BY ROWID LIMIT 1")
data = c.fetchone()
c.execute("DELETE from DATA ORDER BY ROWID LIMIT 1")
db.commit()
return data
rowid is not the row order, it is a unique identifier for the row created by SQLite unless you say otherwise.
SQL rows have no inherent order. You could grab just one row...
select * from table limit 1;
But you'll get them in no guaranteed order. And without a rowid you have no way to identify it again to delete it.
If you want to get the "first" row you must define what "first" means. To do that you need something to order by. For example, a timestamp. Or perhaps an auto-incrementing integer. You cannot use rowid, rowids are not guaranteed to be assigned in any particular order.
select *
from table
where created_at = max(created_at)
limit 1
So long as created_at is indexed, that should work fine. Then delete by its rowid.
You also don't need to use fetchall to fetch one row, use fetchone. In general, fetchall should be avoided as it risks consuming all your memory by slurping all the data in at once. Instead, use iterators.
for row in c.execute(...)

Reading a database table without Pandas

I am trying to read a table from a HANA database in Python using SQLAlchemy library. Typically, I would use the Pandas package and use the pd.read_sql() method for this operation. However, for some reason, the environment I am using does not support the Pandas package. Therefore, I need to read the table without the Pandas library. So far, the following is what I have been able to do:
query = ('''SELECT * FROM "<schema_name>"."<table_name>"'''
''' WHERE <conditional_clauses>'''
)
with engine.connect() as con:
table = con.execute(query)
row = table.fetchone()
However, while this technique allows me to read table row by row, I am do not get the column names of the table.
How can I fix this?
Thanks
I am do not get the column names of the table
You won't get the column names of the table but you can get the column names (or aliases) of the result set:
with engine.begin() as conn:
row = conn.execute(sa.text("SELECT 1 AS foo, 2 AS bar")).fetchone()
print(row.items()) # [('foo', 1), ('bar', 2)]
#
# or, for just the column names
#
print(row.keys()) # ['foo', 'bar']

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV, then if any data is missing I need to return a CSV with an additional column to indicate which rows are missing data. Colleague suggested that I import CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows that would match up to "dfinput".
Have Googled, "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd
dfinput = None
try:
dfinput = pd.read_csv(r"C:\file.csv")
except:
print("Uh oh!")
if dfinput is None:
print("Ack!")
quit(10)
dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
col_fill='')
dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1,2,3],
[4,None,6],
[None,8,None]],
columns=['foo','bar','baz'])
# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()
# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s:colnames[s], axis=1)
# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [colnames[x] for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store dfs which I have created from data.
store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
try:
with gzip.open(file) as f:
data = json.loads(f.read())
df = json_normalize(data)
store.append(storekey, df, format='table', append=True)
except TypeError:
pass
#File Error
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but
this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved but this is not a viable solution as I do not know the max length I will encounter and all the columns which I will encounter the problem.
Is there a solution to automatically catch this exception and handle it each item it occur?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
basically it's very similar to the following create table SQL statement:
create table df(
id int,
str varchar(200)
);
where 200 is the maximal allowed length for the str column
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex

Resources