I am working on a Hive table on Hadoop and doing data wrangling with PySpark. I read the dataset:
df = sqlContext.sql('select * from db.table1')
df.select("var1").printSchema()
|-- var1: string (nullable = true)
I have some empty values in the dataset that Spark seems unable to recognize! I can easily find null values with
df.where(F.isnull(F.col("var1"))).count()
10163101
but when I use
df.where(F.col("var1") == '').count()
it returns zero. However, when I check in SQL, I have 6908 empty values.
Here are SQL queries and their results:
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1=''
6908
And
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1 is null
10163101
The counts for the SQL and PySpark tables are the same:
df.count()
10171109
and
SELECT count(*)
FROM [Y].[dbo].[table1]
10171109
And when I try to find blanks by using length or size, I get an error:
df.where(F.size(F.col("var1")) == 0).count()
AnalysisException: "cannot resolve 'size(var1)' due to data type
mismatch: argument 1 requires (array or map) type, however, 'var1'
is of string type.;"
How should I address this issue? My Spark version is 1.6.3.
Thanks
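A note on the last error: size() only applies to array and map columns; for a string column the analogue is F.length (available since Spark 1.5), which at least removes the type mismatch. A minimal sketch:
from pyspark.sql import functions as F
# length() is the string counterpart of size(); this counts rows where var1 is truly the empty string
df.where(F.length(F.col("var1")) == 0).count()
If the problem values contain whitespace rather than being truly empty, this still returns zero, which the regexp approach below works around.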
I tried a regexp and was finally able to find those blanks:
dfnew = df.withColumn('test', F.regexp_replace(F.col('var1'), r'\s+|,', ''))
dfnew.where(F.col('test') == '').count()
6908
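The same replacement can also be applied directly in the filter, without materializing a helper column. A sketch assuming, as the regexp above suggests, that the blanks are runs of whitespace or commas:
from pyspark.sql import functions as F
# Strip whitespace/commas inline and compare the result against the empty string
df.where(F.regexp_replace(F.col('var1'), r'\s+|,', '') == '').count()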
Related
Issue with a Spark SQL query:
Error:
AnalysisException: cannot resolve '(global_temp.temptab.Agent.AgencyCode = '720')' due to data type mismatch: differing types in '(global_temp.temptab.Agent.AgencyCode = '720')' (array and string).; line 1 pos 64;
Here Agent is of type array. In the case of multiple records, how do I iterate over them in the query?
For array and map (dict) fields, how do I write a Spark SQL query that iterates over all records?
Sample code I tried from my Azure notebook:
df=spark.read.format('json').option("header","true").load("/mnt/bankabc/Test/Cust.json/")
df.createOrReplaceGlobalTempView("temptab")
display(spark.sql("select * from global_temp.temptab where id='73' and Agent.AgencyCode='720'"))
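Since Agent is an array of structs, comparing Agent.AgencyCode (itself an array) against a single string fails with the type-mismatch error above. A minimal sketch of two common workarounds, assuming Agent is an array of structs with an AgencyCode field and reusing the view from the sample code:
from pyspark.sql import functions as F
# Option 1: Agent.AgencyCode on an array of structs yields an array of strings, so array_contains works
display(spark.sql("select * from global_temp.temptab where id = '73' and array_contains(Agent.AgencyCode, '720')"))
# Option 2: explode the array so each Agent element can be filtered as an ordinary struct
exploded = df.withColumn("agent", F.explode("Agent"))
exploded.where((F.col("id") == "73") & (F.col("agent.AgencyCode") == "720")).show()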
I want to get a variable with a SQL query:
dt = spark.sql("""select max(dt) from table""")
script = """select * from table where dt > """ + dt
spark.sql(script)
but when I try to substitute the variable into the query, I get an error:
"Can only concatenate str (not dataframe) to str"
How do I get the variable as a string and not a dataframe?
To get the result in a variable, you can use collect() and extract the value. Here's an example that pulls the max date-month (YYYYMM) from a table and stores it in a variable.
max_mth = spark.sql('select max(mth) from table').collect()[0][0]
print(max_mth)
# 202202
You can either cast the value to a string in the SQL statement, or wrap the variable with str() when you use it, to convert the integer value to a string.
P.S. - the [0][0] selects the first row and first column.
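Applied to the original snippet, a minimal sketch (assuming dt is a date or string column, hence the quotes in the generated SQL):
# Pull the scalar out of the DataFrame first, then build the query string around it
max_dt = spark.sql("select max(dt) from table").collect()[0][0]
script = "select * from table where dt > '{}'".format(max_dt)
result = spark.sql(script)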
Parquet Entry Example (All entries have is_active_entity as true)
{
"is_active_entity": true,
"is_removed": false
}
Query that demonstrates all values are read as NULL:
select $1:IS_ACTIVE_ENTITY::boolean, count(*) from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet group by $1:IS_ACTIVE_ENTITY::boolean;
Output has only one group for NULL
$1:IS_ACTIVE_ENTITY::BOOLEAN COUNT(*)
NULL 4930277
I don't know where I am going wrong. Spark writes the correct schema to Parquet, as is evident from the example, but Snowflake reads it as NULL.
How do I fix this?
The column names in your file are quoted. As a consequence, "is_active_entity" is not the same as "IS_ACTIVE_ENTITY".
Please try this query:
select $1:is_active_entity::boolean, count(*) from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet group by $1:is_active_entity::boolean;
More info: https://docs.snowflake.com/en/sql-reference/identifiers-syntax.html#:~:text=The%20identifier%20is%20case%2Dsensitive.
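If you control the Spark job that writes the Parquet file, another option is to upper-case the column names at write time so that the upper-case path in the original query resolves against the file. A PySpark sketch with an illustrative output path:
# Hypothetical example: rename columns to upper case before writing so that
# $1:IS_ACTIVE_ENTITY in Snowflake matches the Parquet column name exactly
renamed = df.toDF(*[c.upper() for c in df.columns])
renamed.write.mode("overwrite").parquet("/mnt/stage/practitioner_delta/")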
I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
f.ID as TID,
f.BldgID as TBldgID,
f.LeaseID as TLeaseID,
f.Period as TPeriod,
coalesce(
(select
f.ChargeAmt
from
Fact_CMCharges f
where
f.BldgID = Fact_CMCharges.BldgID
limit 1),
0) as TChargeAmt1,
f.ChargeAmt as TChargeAmt2,
l.EFFDATE as TBreakDate
from
Fact_CMCharges f
join
CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE)
where
f.ActualProjected = 'Lease'
except(
select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your sub-query by definition (and not by data) only returns a single row. Otherwise the Spark analyzer complains while parsing the SQL statement.
So when Catalyst can't be 100% sure, just by looking at the SQL statement (without looking at your data), that the sub-query returns a single row, this exception is thrown.
If you are sure that your sub-query only returns a single row, you can wrap the selected column in one of the following standard aggregate functions, so the Spark analyzer is happy (see the sketch after the list):
first
avg
max
min
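For example, applying this to the coalesce sub-query in the question, wrapping ChargeAmt in first() is enough to satisfy the analyzer. A minimal sketch of just the relevant fragment (shown via spark.sql in PySpark for brevity; it assumes Fact_CMCharges is registered as a view, and the rest of the original query is unchanged):
# The inner select now returns a single aggregated value per outer row instead of relying on limit 1
tchargeamt1 = spark.sql("""
select
  f.ID as TID,
  coalesce(
    (select first(f2.ChargeAmt)
     from Fact_CMCharges f2
     where f2.BldgID = f.BldgID),
    0) as TChargeAmt1
from Fact_CMCharges f
""")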
I am reading some data from a Hive table using a Hive context in Spark, and the output is a Row with only one column. I need to convert this to an array of Double. I have tried all possible ways to do it myself with no success. Can somebody please help with this?
val qRes = hiveContext.sql("""
Select Sum(EQUnit) * Sum( Units)
From pos_Tran_orc T
INNER JOIN brand_filter B
On t.mbbrandid = b.mbbrandid
inner join store_filter s
ON t.msstoreid = s.msstoreid
Group By Transdate
""")
What next?
You can simply map using the Row.getDouble method:
qRes.map(_.getDouble(0)).collect()
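This collects an Array[Double] onto the driver, so it is only appropriate when the grouped result is small. If you are working from PySpark rather than Scala, a roughly equivalent sketch:
# Collect the single numeric column into a plain Python list of floats
values = qRes.rdd.map(lambda row: float(row[0])).collect()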