Using parameterized SQL query while reading large table into pandas dataframe using COPY - python-3.x

I am trying to read a large table (10-15M rows) from a database into a pandas dataframe, and I'm using the following code:
import tempfile
import pandas

def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
This works if I pass a simple query like the following into the function above:
'''SELECT * from hourly_data'''
But what if I want to pass a variable into the query, e.g.
'''SELECT * from hourly_data where starttime >= %s '''
Now where do I pass the parameter?

You cannot use parameters with COPY. Unfortunately, that extends to the query inside COPY, even though that query could otherwise take parameters on its own.
You will have to construct a query string including the parameter (beware of SQL injection) and use that with COPY.
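Concretely, one way to keep the driver's quoting while still using COPY (a sketch assuming psycopg2, which the copy_expert call implies) is to render the parameterized query to a complete SQL string with cursor.mogrify first and then embed the result in the COPY statement:
import tempfile
import pandas

def read_sql_tmpfile(query, db_engine, params=None):
    conn = db_engine.raw_connection()
    cur = conn.cursor()
    # COPY cannot take bind parameters, so let psycopg2 quote the values
    # into the query string first; mogrify returns bytes in Python 3.
    rendered = cur.mogrify(query, params).decode() if params else query
    copy_sql = "COPY ({q}) TO STDOUT WITH CSV HEADER".format(q=rendered)
    with tempfile.TemporaryFile() as tmpfile:
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        return pandas.read_csv(tmpfile)

# usage (db_engine and the timestamp are placeholders)
df = read_sql_tmpfile(
    "SELECT * FROM hourly_data WHERE starttime >= %s",
    db_engine,
    params=("2021-01-01",),
)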

Related

How to get variable by select in spark

I want to get a variable with a SQL query:
dt = spark.sql("""select max(dt) from table""")
script = """select * from table where dt > """ + dt
spark.sql(script)
but when I try to substitute the variable into the query I get an error:
"Can only concatenate str (not dataframe) to str"
How do I get the variable as a string and not a dataframe?
To get the result in a variable, you can use collect() and extract the value. Here's an example that pulls the max date-month (YYYYMM) from a table and stores it in a variable.
max_mth = spark.sql('select max(mth) from table').collect()[0][0]
print(max_mth)
# 202202
You can either cast the value to a string in the SQL statement, or call str() on the variable when you use it, to convert the integer value to a string.
P.S. - the [0][0] selects the first row and first column of the collected result.
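For completeness, here is a sketch of plugging that value back into the second query from the question (the table and column names are the question's placeholders):
max_mth = spark.sql('select max(mth) from table').collect()[0][0]

# str() converts the integer month so it can be concatenated into the SQL text
script = 'select * from table where mth > ' + str(max_mth)
result_df = spark.sql(script)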

Python: replace dynamic named parameters in a string (for SQL)

I have a SQL as below:
sql= '''
select name from table1 where asof between '$varA' and '$varB'
union
select name from table2 where asof between '$varC' and '$varD'
'''
This sql contains dynamic variables.
Template.substitute can replace the variables with values, but in my situation the variable names are dynamic. That is to say, I don't know in advance whether they are $varA, $varB, etc.
Is there a way I can do dynamic substitution?
Thanks
I got it!
from string import Template

def parseSQL(sql, params):
    template = Template(sql)
    # print(params)
    try:
        sql = template.substitute(**params)
    except KeyError:
        print('Incomplete substitution resulted in KeyError!')
    finally:
        return sql
usage:
params = {'startDate': '2021-01-01', 'endDate': '2021-01-31'}
sql = parseSQL(sql, params)
Just pass a different "params" dict to substitute values into each SQL string.
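If some placeholders may legitimately be missing from the dict, string.Template.safe_substitute leaves them in place instead of raising KeyError, a small variation on the same idea:
from string import Template

sql = "select name from table1 where asof between '$startDate' and '$endDate'"
params = {'startDate': '2021-01-01'}  # endDate intentionally missing

# safe_substitute keeps unknown placeholders untouched rather than raising
print(Template(sql).safe_substitute(**params))
# select name from table1 where asof between '2021-01-01' and '$endDate'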

python code to execute multiple queries and create csv

Hi, I am pretty new to Python. I have code that reads a MySQL query through pandas and, if there is data, converts it into a CSV. Now I need to go a step further, add another query, and create another CSV. I am not sure what the best way to do this is. Any help is appreciated, thanks.
My code is something like this
def data_to_df(connection):
    query = """
    select * from abs
    """
    data = pd.read_sql(sql=query, con=connection)
    return data

def main():
    # DB connection and data retrieval
    cnx = database_connection(db_credentials)
    print(cnx)
    exit()
    df = data_to_df(cnx)
    # Convert dataframe to text file
    df.to_csv(file_location, sep='|', na_rep='NULL', index=False, quoting=csv.QUOTE_NONE)
    print('File created successfully! \n')
How can I add another query that will be executed and create a different file altogether?
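A minimal sketch of one approach (database_connection, db_credentials, and the file names are placeholders carried over from the question): factor the query and the output path into a helper, so each query/file pair becomes a single call.
import csv
import pandas as pd

def query_to_csv(connection, query, file_location):
    # Run one query and write the result to one pipe-delimited file
    df = pd.read_sql(sql=query, con=connection)
    df.to_csv(file_location, sep='|', na_rep='NULL', index=False, quoting=csv.QUOTE_NONE)
    print('File {} created successfully!'.format(file_location))

def main():
    cnx = database_connection(db_credentials)  # placeholder from the question
    query_to_csv(cnx, 'select * from abs', 'abs.csv')
    query_to_csv(cnx, 'select * from other_table', 'other_table.csv')  # hypothetical second query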

How to convert sql query to list?

I am trying to convert my sql query output into a list to look a certain way.
Here is my code:
def get_sf_metadata():
    import sqlite3

    # Tables I want to be dynamically created
    table_names = ['AcceptedEventRelation', 'Asset', 'Book']

    # SQLite connection
    conn = sqlite3.connect('aaa_test.db')
    c = conn.cursor()

    # Select the metadata table records
    c.execute("select name, type from sf_field_metadata1 limit 10 ")
    print(list(c))

get_sf_metadata()
Here is my output:
[('Id', 'id'), ('RelationId', 'reference'), ('EventId', 'reference')]
Is there any way to make the output look like this:
[Id id, RelationId reference, EventId reference]
You can try
print(["{} {}".format(i[0], i[1]) for i in list(c)])
That will print:
['Id id', 'RelationId reference', 'EventId reference']
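Equivalently, with fetchall() and an f-string (purely a stylistic alternative):
rows = c.fetchall()
print([f"{name} {col_type}" for name, col_type in rows])
# ['Id id', 'RelationId reference', 'EventId reference']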

BigQuery Storage API: Best Practice for Using Client from Spark Pandas UDF?

I have a spark script that needs to make 60 api calls for every row. Currently I am using BigQuery as a data warehouse. I was wondering if there was a way I can use either the BigQuery API or BigQuery Storage API to query the database from my udf? Maybe a way to perform batch queries? Would pandas-gbq be a better solution? Each query that I need to make per row is a select count(*) from dataset.table where {...} query.
Currently I am using the big query client as shown in the code snippet below, but I am not sure if this is the best way to utilize my resources. Apologies if the code is not done properly for this use case, I am new to spark and BigQuery.
import os
import google.auth
from google.cloud import bigquery, bigquery_storage_v1beta1
from pyspark.sql.functions import pandas_udf, PandasUDFType

def clients():
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/hadoop/credentials.json'
    credentials, your_project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    # Make clients.
    bqclient = bigquery.Client(
        credentials=credentials,
        project=your_project_id,
    )
    bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
        credentials=credentials
    )
    return bqclient, bqstorageclient

def query_cache(query):
    bqclient, bqstorageclient = clients()
    dataframe = (
        bqclient.query(query)
        .result()
        .to_dataframe(bqstorage_client=bqstorageclient)
    )
    return dataframe['f0_'][0]

@pandas_udf(schema(), PandasUDFType.GROUPED_MAP)
def calc_counts(df):
    query = "select count(*) from dataset.table where ...{some column filters}..."
    df['count'] = df.apply(query_cache, args=(query), axis=1)
The simpler option is to use the spark-bigquery-connector, which lets you query BigQuery directly and get the result as a Spark dataframe. Converting this dataframe to pandas is then simple:
spark_df = spark.read.format('bigquery').option('table', table).load()
pandas_df = spark_df.toPandas()
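If only a subset of rows is needed, the connector can also push a filter down to BigQuery, which avoids pulling the whole table (a sketch; the table name and predicate are placeholders):
spark_df = (
    spark.read.format('bigquery')
    .option('table', 'dataset.table')
    .option('filter', "starttime >= '2021-01-01'")  # hypothetical predicate, pushed down to BigQuery
    .load()
)
pandas_df = spark_df.toPandas()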
