Pyspark.sql throwing AnalysisException - Azure

Issue with a Spark SQL query:
Error:
AnalysisException: cannot resolve '(global_temp.temptab.Agent.AgencyCode = '720')' due to data type mismatch: differing types in '(global_temp.temptab.Agent.AgencyCode = '720')' (array and string).; line 1 pos 64;
Here Agent is a column of array type. When a record has multiple Agent entries, how do I write the query so that every element is checked? More generally, how should a Spark SQL query be written to iterate over array and dict (struct) fields?
Sample code I tried from my Azure notebook:
df=spark.read.format('json').option("header","true").load("/mnt/bankabc/Test/Cust.json/")
df.createOrReplaceGlobalTempView("temptab")
display(spark.sql("select * from global_temp.temptab where id='73' and Agent.AgencyCode='720'"))
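Not part of the original question, but a minimal PySpark sketch of the usual workarounds, assuming Agent is an array of structs that each carry an AgencyCode field: extracting Agent.AgencyCode yields an array of strings that array_contains can test, or the array can be exploded so that each element becomes its own row.
from pyspark.sql import functions as F

df = spark.read.format("json").load("/mnt/bankabc/Test/Cust.json/")
df.createOrReplaceGlobalTempView("temptab")

# Agent.AgencyCode on an array-of-structs column yields an array of strings,
# so array_contains can check whether any element equals '720'.
display(spark.sql(
    "select * from global_temp.temptab "
    "where id = '73' and array_contains(Agent.AgencyCode, '720')"
))

# Alternatively, explode the array so each Agent element becomes its own row
# and can be filtered like an ordinary struct column.
exploded = df.select("id", F.explode("Agent").alias("agent"))
display(exploded.where((F.col("id") == "73") & (F.col("agent.AgencyCode") == "720")))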

Related

How to insert multiple rows of a pandas dataframe into Azure Synapse SQL DW using pyodbc?

I am using pyodbc to establish a connection with Azure Synapse SQL DW, and the connection is successfully established. However, when inserting a pandas dataframe into the database, I get an error if I try to insert multiple rows as values, although it works if I insert rows one by one. Inserting multiple rows together as values used to work fine with AWS Redshift and MS SQL Server, but it fails with Azure Synapse SQL DW. I believe Azure Synapse SQL uses T-SQL rather than plain MS SQL, but I have been unable to find any relevant documentation on this.
I have a pandas df named 'df' that looks like this:
student_id  admission_date
1           2019-12-12
2           2018-12-08
3           2018-06-30
4           2017-05-30
5           2020-03-11
This code below works fine:
import pandas as pd
import pyodbc
#conn object below is the pyodbc 'connect' object
batch_size = 1
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
As you can see, it inserts just one row of the df. So yes, I can loop through and insert rows one by one, but that takes a very long time for larger dataframes.
This code below does not work when I try to insert all rows together:
import pandas as pd
import pyodbc
batch_size = 5
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
The error I get is the one below:
ProgrammingError: ('42000', "[42000]
[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Parse error at
line: 1, column: 74: Incorrect syntax near ','. (103010)
(SQLExecDirectW)")
This is the sample SQL query for 2 rows which fails:
insert INTO myTable values (1, '2009-12-12'),(2, '2018-12-12')
That's because Azure Synapse SQL does not support multi-row insert via the values constructor.
One workaround is to chain "select (value list) union all" statements. Your pseudo SQL should look like this:
insert INTO {table}
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)} union all
...
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)}
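For illustration only (not from the original answer), a sketch of how that chunked UNION ALL statement might be built in Python from the pandas dataframe above; myTable and the column order are assumptions carried over from the question:
# Build one INSERT ... SELECT ... UNION ALL statement per chunk.
# Dates are emitted as quoted string literals.
rows = zip(chunk.student_id.values, chunk.admission_date.values.astype(str))
selects = ' union all '.join(
    "select {}, '{}'".format(student_id, admission_date)
    for student_id, admission_date in rows
)
sql = 'insert into {} {}'.format('myTable', selects)
cursor = conn.cursor()
cursor.execute(sql)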
The COPY statement in Azure Synapse Analytics is a better way to load your data into a Synapse SQL pool.
COPY INTO test_parquet
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
    FILE_FORMAT = myFileFormat,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
)
You can save your pandas dataframe to blob storage and then trigger the COPY command with the cursor's execute method, along the lines of the sketch below.
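A rough sketch of that approach, assuming the azure-storage-blob package and placeholder account, container, blob path, and SAS token values (adjust the COPY options to your file type):
# Stage the dataframe as a CSV blob, then load it into Synapse with COPY INTO.
# The URL, container, blob name, and SAS token below are placeholders.
from azure.storage.blob import BlobClient

csv_data = df.to_csv(index=False, header=False)
blob = BlobClient.from_blob_url(
    'https://myaccount.blob.core.windows.net/myblobcontainer/staging/students.csv?<Your_SAS_Token>'
)
blob.upload_blob(csv_data, overwrite=True)

copy_sql = """
COPY INTO myTable
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/staging/students.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
)
"""
cursor = conn.cursor()
cursor.execute(copy_sql)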

Replacing blanks with Null in PySpark

I am working on a Hive table on Hadoop and doing data wrangling with PySpark. I read the dataset:
from pyspark.sql import functions as F
df = sqlContext.sql('select * from db.table1')
df.select("var1").printSchema()
|-- var1: string (nullable = true)
I have some empty values in the dataset that Spark seems unable to recognize! I can easily find null values with
df.where(F.isNull(F.col("var1"))).count()
10163101
but when I use
df.where(F.col("var1") == '').count()
it gives me zero, whereas when I check in SQL I have 6908 empty values.
Here are SQL queries and their results:
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1=''
6908
And
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1 is null
10163101
The counts for the SQL table and the PySpark dataframe are the same:
df.count()
10171109
and
SELECT count(*)
FROM [Y].[dbo].[table1]
10171109
And when I try to find blanks by using length or size, I get an error:
df.where(F.size(F.col("var1")) == 0).count()
AnalysisException: "cannot resolve 'size(var1)' due to data type
mismatch: argument 1 requires (array or map) type, however, 'var1'
is of string type.;"
How should I address this issue? My Spark version is 1.6.3.
Thanks
I tried a regexp and was finally able to find those blanks!
dfnew = df.withColumn('test', F.regexp_replace(F.col('var1'), r'\s+|,', ''))
dfnew.where(F.col('test') == '').count()
6908
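As a follow-up sketch (not part of the original answer): once the blanks can be matched, they can be replaced with real nulls using when/otherwise, and F.length (rather than F.size, which is for arrays and maps) checks string lengths:
# Replace whitespace/comma-only values in var1 with real nulls.
dfclean = df.withColumn(
    'var1',
    F.when(F.regexp_replace(F.col('var1'), r'\s+|,', '') == '', None)
     .otherwise(F.col('var1'))
)
dfclean.where(F.isNull(F.col('var1'))).count()

# F.length is the string counterpart of F.size for length checks.
df.where(F.length(F.col('var1')) == 0).count()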

Spark Data frame search column starting with a string

I have a requirement to filter a data frame based on the condition that a column value should start with a predefined string.
I am trying the following:
val domainConfigJSON = sqlContext.read
.jdbc(url, "CONFIG", prop)
.select("DID", "CONF", "KEY").filter("key like 'config.*'")
And I get this exception:
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
You have an error in your SQL syntax; check the manual that
corresponds to your MariaDB server version for the right syntax to use
near 'KEY = 'config.*'' at line 1
Using spark: 1.6.1
You can use the startsWith function from the Column class.
myDataFrame.filter(col("columnName").startswith("PREFIX"))
I used the same function but was getting errors, so I checked what the error was: we actually need to use startsWith(literal: String), but the function above uses the lowercase startswith().
Ex: df.filter(col("ACCOUNT_NUMBER").startsWith("9"))
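Side note (not from the original answers): in PySpark the Column method is spelled with a lowercase w, so the equivalent filter there would be:
from pyspark.sql import functions as F

df.filter(F.col("KEY").startswith("config."))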

What does "Correlated scalar subqueries must be Aggregated" mean?

I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
f.ID as TID,
f.BldgID as TBldgID,
f.LeaseID as TLeaseID,
f.Period as TPeriod,
coalesce(
(select
f ChargeAmt
from
Fact_CMCharges f
where
f.BldgID = Fact_CMCharges.BldgID
limit 1),
0) as TChargeAmt1,
f.ChargeAmt as TChargeAmt2,
l.EFFDATE as TBreakDate
from
Fact_CMCharges f
join
CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE)
where
f.ActualProjected = 'Lease'
except(
select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your sub-query, by definition (and not by data), only returns a single row; otherwise the Spark analyzer complains while parsing the SQL statement.
When Catalyst cannot be 100% sure, just by looking at the SQL statement (without looking at your data), that the sub-query only returns a single row, this exception is thrown.
If you are sure that your subquery only gives a single row, you can wrap it in one of the following standard aggregation functions so the analyzer is happy (a sketch follows the list):
first
avg
max
min
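For example, a sketch (written with PySpark, not from the original answer) of the coalesce subquery from the question rewritten with first() so the analyzer can prove it returns a single value, simplified to the relevant columns:
fixed = spark.sql("""
    select
        f.ID as TID,
        coalesce(
            (select first(ff.ChargeAmt)
             from Fact_CMCharges ff
             where ff.BldgID = f.BldgID),
            0) as TChargeAmt1,
        f.ChargeAmt as TChargeAmt2
    from Fact_CMCharges f
    where f.ActualProjected = 'Lease'
""")
fixed.show()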

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
  sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0), s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
  sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name by using the matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as each would need to be matched on the left. (This is from a very useful comment in the source code for the Row companion object.)
