How to get variable by select in spark - apache-spark

I want to get a variable from a SQL query:
dt = spark.sql("""select max(dt) from table""")
script = """select * from table where dt > """ + dt
spark.sql(script)
but when I try to substitute the variable into the query I get this error:
"Can only concatenate str (not dataframe) to str"
How do I get the variable as a string and not a dataframe?

To get the result in a variable, you can use collect() and extract the value. Here's an example that pulls the max date-month (YYYYMM) from a table and stores it in a variable.
max_mth = spark.sql('select max(mth) from table').collect()[0][0]
print(max_mth)
# 202202
You can either cast the value to a string in the SQL statement, or call str() on the variable when you use it, to convert the integer value to a string.
P.S. The [0][0] selects the first column of the first row.
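A minimal sketch of the full round trip, assuming an existing SparkSession is passed in as spark. The helper names fetch_scalar and build_filter_query are hypothetical, not part of the Spark API; they rely only on spark.sql(...) and collect().

```python
def fetch_scalar(spark, query):
    """Run a query that returns one row and one column; return the bare value."""
    rows = spark.sql(query).collect()  # collect() returns a list of Row objects
    return rows[0][0]                  # first row, first column


def build_filter_query(table, column, value):
    """Build the follow-up query; str() avoids the str + DataFrame TypeError."""
    return "select * from " + table + " where " + column + " > " + str(value)


# Intended usage (needs a live SparkSession, so it is left as a comment):
# max_dt = fetch_scalar(spark, "select max(dt) from table")
# result = spark.sql(build_filter_query("table", "dt", max_dt))
```

If the column holds strings or dates rather than integers, the value also needs to be quoted inside the SQL text.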

Related

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in snowflake as;
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
This creates the correct table, which we verified with the DESC command. Then, using the Snowflake Python connector, we try to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
The variables are defined just before this query. The main challenge is getting the current timestamp written into Snowflake. Here the value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Could I get some help on how to format the datetime value here? Any help is appreciated.
In addition to the answer @Lukasz provided, you could also define current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
Snowflake will then automatically assign the insert time to TIME_PREDICTED.
Educated guess. When performing insert with:
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
This is string interpolation. The value of ct is pasted in as the unquoted string representation of the datetime, which is not a valid timestamp literal, hence the error.
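To see what the database actually receives, it can help to print the interpolated text. A small sketch with made-up values standing in for accountId and risk_score:

```python
import datetime

account_id = 42      # hypothetical value for illustration
risk_score = 0.87    # hypothetical value for illustration
ct = datetime.datetime(2021, 4, 30, 21, 54, 41, 676406)

# Same interpolation as in the question, reduced to the VALUES clause.
values_clause = f"VALUES ({account_id}, {risk_score},{ct});"
print(values_clause)
# VALUES (42, 0.87,2021-04-30 21:54:41.676406);
```

The timestamp arrives without quotes, which is consistent with the parser complaining about an unexpected '21' right after the date part.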
I would suggest using proper variable binding instead:
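# Numeric (:1, :2) binding assumes snowflake.connector.paramstyle = "numeric" was set before connecting.
ctx.cursor().execute(
    "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
    "(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
    "VALUES(:1, :2, :3)",
    (accountId, risk_score, ("TIMESTAMP_LTZ", ct)),
)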
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) "
    "VALUES({col1}, '{col2}')".format(
        col1=789,
        col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
You forgot to put quotes around {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)

Need help fetching data from a column

Sorry for this, but I'm really new to SQLite: I've created a database from an Excel sheet I had, and I can't seem to fetch the values of the column I need.
query = """ SELECT GNCR from table"""
cur.execute(query)
This actually works, but
query = """ SELECT ? from table"""
cur.execute(query, my_tuple)
doesn't
Here's my code:
import sqlite3

def print_col(to_print):
    db = sqlite3.connect('my_database.db')
    cur = db.cursor()
    query = " SELECT ? FROM my_table "
    cur.execute(query, to_print)
    results = cur.fetchall()
    print(results)

print_col(('GNCR',))
The result is:
[('GNCR',), ('GNCR',), ('GNCR',), ('GNCR',), [...]]
instead of the actual values.
What's the problem? I can't figure it out.
The "?" character in a query is a placeholder for parameter substitution. SQLite binds the parameter you pass as a value, so after substitution your query is effectively SELECT 'GNCR' FROM my_table, where 'GNCR' is a text literal. You therefore get that text back once for every row the query returns, instead of the values in that column.
In short, use query parameters only where a value is expected, such as in a WHERE clause. You can't use them for column names.
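One common workaround is to check the requested column name against a fixed list and only then put it into the SQL text. The sketch below uses an in-memory database; the ALLOWED_COLUMNS set, the NAME column, and the fetch_column helper are hypothetical names chosen for illustration.

```python
import sqlite3

ALLOWED_COLUMNS = {"GNCR", "NAME"}  # hypothetical whitelist of real column names


def fetch_column(conn, column):
    """Return all values of one whitelisted column from my_table."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column!r}")
    # Interpolating is acceptable here only because the name was checked above.
    sql = f'SELECT "{column}" FROM my_table ORDER BY rowid'
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (GNCR INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, "a"), (2, "b")])

# A bound parameter is a value, so every row yields the literal text.
literal_rows = conn.execute("SELECT ? FROM my_table", ("GNCR",)).fetchall()
print(literal_rows)   # [('GNCR',), ('GNCR',)]

column_values = fetch_column(conn, "GNCR")
print(column_values)  # [1, 2]
```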

Using parameterized SQL query while reading large table into pandas dataframe using COPY

I am trying to read a large table (10-15M rows) from a database into a pandas DataFrame, and I'm using the following code:
import tempfile
import pandas

def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
This works when I pass a simple query like the following into the function above:
'''SELECT * from hourly_data'''
But what if I want to pass a variable into this query, e.g.
'''SELECT * from hourly_data where starttime >= %s '''
Now where do I pass the parameter?
You cannot use parameters with COPY. Unfortunately that extends to the query you use inside COPY, even if you could use parameters with the query itself.
You will have to construct a query string including the parameter (beware of SQL injection) and use that with COPY.
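One way to build that string without hand-rolled quoting is to let the driver bind the values client-side first. The sketch below assumes the cursor comes from psycopg2 (the copy_expert call in the question suggests so) and uses its mogrify method; the build_copy_sql helper is a hypothetical name.

```python
def build_copy_sql(cur, query, params):
    """Bind params into query client-side, then wrap the result in COPY."""
    # cursor.mogrify returns the query as psycopg2 would send it, with the
    # parameters quoted and escaped by the driver (bytes under Python 3).
    bound = cur.mogrify(query, params)
    if isinstance(bound, bytes):
        bound = bound.decode("utf-8")  # assumption: UTF-8 client encoding
    return "COPY ({}) TO STDOUT WITH CSV HEADER".format(bound)


# Intended usage inside read_sql_tmpfile (needs a live connection):
# copy_sql = build_copy_sql(
#     cur, "SELECT * FROM hourly_data WHERE starttime >= %s", (start,)
# )
# cur.copy_expert(copy_sql, tmpfile)
```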

How to pass list values with single quotes as well as double quotes to a PostgreSQL SELECT query

I'm executing a SELECT query against a PostgreSQL database. After fetching the results, I append them to a list and then pass that list as input to another PostgreSQL SELECT query.
But when the values are put into a list, a value containing an apostrophe, such as cat's, is shown in double quotes as "cat's". When the second SELECT query runs, the double-quoted value is not found, because the database stores it without double quotes, as cat's.
So I get an error saying that the value is not present.
I have tried the JSON dumps method, but it isn't working because I cannot convert the JSON list to a tuple and pass it as input to the PostgreSQL SELECT query.
select_query = """select "Unique_Shelf_Names" from "unique_shelf" where category = 'Accessory'"""
cur.execute(select_query)
count = cur.fetchall()
query_list = []
for co in count:
    for c in co:
        query_list.append(c)
Output of query_list:
query_list = ['parrot', 'dog', "leopard's", 'cat', "zebra's"]
Now query_list is converted to a tuple and passed as input to another SELECT query.
list2 = tuple(query_list)
query = """select category from "unique_shelf" where "Unique_Shelf_Names" in {} """.format(list2)
cur.execute(query)
This is where I get an error saying "leopard's" doesn't exist, even though leopard's exists in the database.
I want all the values in query_list to be in double quotes so that this error doesn't arise.
Do not use format to construct the query. Simply use %s and pass the tuple into execute:
query = """select category from "unique_shelf" where "Unique_Shelf_Names" in %s """
cur.execute(query,(list2,))
See "Tuples adaptation" in the psycopg2 documentation.
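The reason format fails here can be seen by printing what it produces. This sketch reuses the list from the question and needs no database:

```python
query_list = ["parrot", "dog", "leopard's", "cat", "zebra's"]

# str.format inserts the Python repr of the tuple, not SQL literals.
rendered = 'where "Unique_Shelf_Names" in {}'.format(tuple(query_list))
print(rendered)
# where "Unique_Shelf_Names" in ('parrot', 'dog', "leopard's", 'cat', "zebra's")
```

Python switches to double quotes for strings that contain an apostrophe, and PostgreSQL reads a double-quoted token as an identifier (such as a column name) rather than a string value, which matches the "doesn't exist" error.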

How to ensure Python3 infers numbers as a string instead of an integer?

I have a line of code here:
query = """SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = {}""".format(str(primary_account_number))
I tried to load in the string value of the number, but psycopg2 still throws this error.
psycopg2.ProgrammingError: operator does not exist: character varying = integer
What options do I have to ensure Psycopg2 sees this as a string? Or should I just change the overall structure of the database to just integers?
It's (almost) always better to let psycopg2 interpolate query parameters for you. (http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters)
query = """SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = %s"""
cur.execute(query, (str(primary_account_number),))
This way psycopg2 will deal with the proper type formatting based on the type of the python value passed.
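The original attempt fails because str() changes the Python type but not the text that ends up in the SQL. A tiny sketch, with a made-up account number, shows that both variants render identically:

```python
primary_account_number = 12345  # hypothetical account number

template = "WHERE v.vendor_account = {}"
with_str = template.format(str(primary_account_number))
without_str = template.format(primary_account_number)

print(with_str)                 # WHERE v.vendor_account = 12345
print(with_str == without_str)  # True
```

Either way the database sees a bare 12345, which it treats as an integer when comparing against the varchar column.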
Use
query = """
SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = '{}'
""".format(primary_account_number)
That way the number inside your query is passed as a string, provided your v.vendor_account column is of a string type (e.g. varchar). The important part is the ' before and after {}, so that the query sees the value as a string.
As Jon Clements pointed out, it is better to let the api handle the conversion:
query = """
SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = %s
"""
cursor.execute(query, (str(primary_account_number),))
Docs: Psycopg - Passing parameters to SQL queries
