regexp_extract_all not working with Spark SQL - apache-spark

I'm using a Databricks notebook to extract all field occurrences from a text column using the regexp_extract_all function.
Here is the input:
field_map#'IFDSIMP.7'.$1.$0 == 'X') OR (field_map#'IFDSIMP.14'.$1.$0 == 'X')
I'm able to extract the values using the df registered as a view:
SELECT regexp_extract_all(raw, "field_map#\'.*?\'", 0) as field1 from fieldViews
+-----------------------+
|field1                 |
+-----------------------+
|field_map#'IFDSDEP.7'  |
|field_map#'IFDSIMP.14' |
+-----------------------+
However, I get an empty result set with Spark SQL:
spark.sql("SELECT regexp_extract_all(raw, 'field_map#\'.*?\'', 0) as field1 from fieldViews")

Related

Can't use a new created derived column in same spark sql code

I am creating a new column using an "AS" statement in PySpark SQL code:
accounts_2.price - COALESCE(cast(accounts_2.price as Numeric(38,8)), 0) AS follow_on_price
As you can see, I am creating a new column "follow_on_price",
but when I try to use this newly created column later in the same Spark SQL code:
, accounts_2.unlock_price - COALESCE(cast(accounts_2.upfront_price as Numeric(38,8)), 0) AS follow_on_price
, follow_on_price * exchange_rate_usd AS follow_on_price_usd
It does not recognise follow_on_price when it is used immediately in the same Spark SQL statement. When I create a new temp view and use it as a new table for the next step, it works. Please explain why. Why can't Spark SQL take the new column reference from the same query, so that I don't need an extra step for "follow_on_price * exchange_rate_usd AS follow_on_price_usd" and it can be done in a single step, like in a normal SQL database such as Postgres?
This is SQL standard behavior, which prevents ambiguities in a query: you cannot reference column aliases in the same SELECT list.
You can use an inner query instead, like below:
>>> data2 = [{"COL_A": 'A',"COL_B": "B","price":212891928.90},{"COL_A": "A1","COL_B": "cndmmkdssac","price":9496943.4593},{"COL_A": 'A',"COL_B": "cndds","price":4634609.6994}]
>>> df=spark.createDataFrame(data2)
>>> df.show()
+-----+-----------+-------------+
|COL_A| COL_B| price|
+-----+-----------+-------------+
| A| B|2.128919289E8|
| A1|cndmmkdssac| 9496943.4593|
| A| cndds| 4634609.6994|
+-----+-----------+-------------+
>>> df.registerTempTable("accounts_2")
>>> spark.sql("select s1.follow_on_price, s1.follow_on_price*70 as follow_on_price_usd from (select COALESCE(accounts_2.price) AS follow_on_price from accounts_2) s1")
+---------------+-------------------+
|follow_on_price|follow_on_price_usd|
+---------------+-------------------+
|  2.128919289E8|    1.4902435023E10|
|   9496943.4593|    6.64786042151E8|
|   4634609.6994|    3.24422678958E8|
+---------------+-------------------+
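Equivalently (a small optional sketch, not from the original answer), the inner query can be written as a CTE, which can read more naturally for longer pipelines:
spark.sql("""
WITH s1 AS (
    SELECT COALESCE(accounts_2.price) AS follow_on_price
    FROM accounts_2
)
SELECT follow_on_price,
       follow_on_price * 70 AS follow_on_price_usd  -- 70 stands in for exchange_rate_usd
FROM s1
""").show()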

How to pick latest record in spark structured streaming join

I am using Spark SQL 2.4.x and the DataStax spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have currency rates metadata, a sample of which is below:
val ratesMetaDataDf = Seq(
("EUR","5/10/2019","1.130657","USD"),
("EUR","5/9/2019","1.13088","USD")
).toDF("base_code", "rate_date","rate_value","target_code")
.withColumn("rate_date", to_date($"rate_date" ,"MM/dd/yyyy").cast(DateType))
.withColumn("rate_value", $"rate_value".cast(DoubleType))
The sales records I receive from the Kafka topic look like this (sample):
val kafkaDf = Seq((15,2016, 4, 100.5,"USD","2021-01-20","EUR",221.4)
).toDF("companyId", "year","quarter","sales","code","calc_date","c_code","prev_sales")
To calculate "prev_sales" , I need get its "c_code" 's respective "rate_value" which is nearest to the "calc_date" i.e. rate_date"
Which i am doing as following
val w2 = Window.orderBy(col("rate_date") desc)
val rateJoinResultDf = kafkaDf.as("k").join(ratesMetaDataDf.as("e"))
.where( ($"k.c_code" === $"e.base_code") &&
($"rate_date" < $"calc_date")
).orderBy($"rate_date" desc)
.withColumn("row",row_number.over(w2))
.where($"row" === 1).drop("row")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast(DoubleType))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
In the above, to get the nearest record (i.e. "5/10/2019" from ratesMetaDataDf) for the given "calc_date", I am using a window with the row_number function and sorting the records in descending order of "rate_date".
But in Spark Structured Streaming this causes the error below:
"Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;"
So how can I fetch the first matching record for the join above?
Replace the last part of your code with the code below. It does a left join and calculates the date difference between calc_date and rate_date; a window function then picks the nearest date and calculates prev_sales using your same calculation.
Please note I have added one filter condition, filter(col("diff") >= 0),
which handles the case of calc_date < rate_date. I have added a few
more records for a better understanding of this case.
scala> ratesMetaDataDf.show
+---------+----------+----------+-----------+
|base_code| rate_date|rate_value|target_code|
+---------+----------+----------+-----------+
| EUR|2019-05-10| 1.130657| USD|
| EUR|2019-05-09| 1.12088| USD|
| EUR|2019-12-20| 1.1584| USD|
+---------+----------+----------+-----------+
scala> kafkaDf.show
+---------+----+-------+-----+----+----------+------+----------+
|companyId|year|quarter|sales|code| calc_date|c_code|prev_sales|
+---------+----+-------+-----+----+----------+------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| EUR| 221.4|
| 15|2016| 4|100.5| USD|2019-06-20| EUR| 221.4|
+---------+----+-------+-----+----+----------+------+----------+
scala> val W = Window.partitionBy("companyId","year","quarter","sales","code","calc_date","c_code","prev_sales").orderBy(col("diff"))
scala> val rateJoinResultDf= kafkaDf.alias("k").join(ratesMetaDataDf.alias("r"), col("k.c_code") === col("r.base_code"), "left")
.withColumn("diff",datediff(col("calc_date"), col("rate_date")))
.filter(col("diff") >= 0)
.withColumn("closedate", row_number.over(W))
.filter(col("closedate") === 1)
.drop("diff", "closedate")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast("Decimal(14,5)"))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
scala> rateJoinResultDf.show
+---------+----+-------+-----+----+----------+----------+
|companyId|year|quarter|sales|code| calc_date|prev_sales|
+---------+----+-------+-----+----+----------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| 256.46976|
| 15|2016| 4|100.5| USD|2019-06-20| 250.32746|
+---------+----+-------+-----+----+----------+----------+
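For readers working in PySpark rather than Scala, a rough equivalent of the same left-join + datediff + row_number approach (a sketch only; it assumes batch DataFrames named kafkaDf and ratesMetaDataDf with the columns shown above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy(
    "companyId", "year", "quarter", "sales", "code", "calc_date", "c_code", "prev_sales"
).orderBy("diff")

rateJoinResultDf = (
    kafkaDf.alias("k")
    .join(ratesMetaDataDf.alias("r"), F.col("k.c_code") == F.col("r.base_code"), "left")
    .withColumn("diff", F.datediff(F.col("calc_date"), F.col("rate_date")))  # days between dates
    .filter(F.col("diff") >= 0)                                              # ignore future rates
    .withColumn("closedate", F.row_number().over(w))                         # 1 = nearest rate_date
    .filter(F.col("closedate") == 1)
    .drop("diff", "closedate")
    .withColumn("prev_sales", (F.col("prev_sales") * F.col("rate_value")).cast("decimal(14,5)"))
    .select("companyId", "year", "quarter", "sales", "code", "calc_date", "prev_sales")
)
rateJoinResultDf.show()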

Replacing blanks with Null in PySpark

I am working on a Hive table on Hadoop and doing data wrangling with PySpark. I read the dataset:
dt = sqlContext.sql('select * from db.table1')
df.select("var1").printSchema()
|-- var1: string (nullable = true)
I have some empty values in the dataset that Spark seems unable to recognize! I can easily find the Null values with
df.where(F.isNull(F.col("var1"))).count()
10163101
but when I use
df.where(F.col("var1")=='').count()
it gives me zero. However, when I check in SQL, I have 6908 empty values.
Here are SQL queries and their results:
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1=''
6908
And
SELECT count(*)
FROM [Y].[dbo].[table1]
where var1 is null
10163101
The counts for the SQL and PySpark tables are the same:
df.count()
10171109
and
SELECT count(*)
FROM [Y].[dbo].[table1]
10171109
And when I try to find blanks by using length or size, I get an error:
dt.where(F.size(F.col("var1")) == 0).count()
AnalysisException: "cannot resolve 'size(var1)' due to data type
mismatch: argument 1 requires (array or map) type, however, 'var1'
is of string type.;"
How should I address this issue? My Spark version is 1.6.3.
Thanks
I tried regexp and finally was able to find those blanks!!
dtnew = dt.withColumn('test',F.regexp_replace(F.col('var1') , '\s+|,',''))
dtnew.where(F.col('test')=='').count()
6908
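Since the title asks about replacing blanks with Null, a possible follow-up step (a sketch, not part of the original answer; it reuses the dt DataFrame and var1 column from above) is to turn any string that is empty after trimming whitespace into a real null:
from pyspark.sql import functions as F

# Convert strings that are empty after trimming into real nulls.
dtnew = dt.withColumn(
    'var1',
    F.when(F.trim(F.col('var1')) == '', None).otherwise(F.col('var1'))
)
dtnew.where(F.isNull(F.col('var1'))).count()  # now also counts the former blanks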

Spark: subset a few columns and remove null rows

I am running Spark 2.1 on Windows 10. I have fetched data from MySQL into Spark using JDBC, and the table looks like this:
x y z
------------------
1 a d1
Null v ed
5 Null Null
7 s Null
Null bd Null
I want to create a new Spark Dataset with only the x and y columns from the above table, and I want to keep only those rows which do not have null in either of those 2 columns. My resulting table should look like this:
x y
--------
1 a
7 s
The following is the code:
val load_DF = spark.read.format("jdbc").option("url", "jdbc:mysql://100.150.200.250:3306").option("dbtable", "schema.table_name").option("user", "uname1").option("password", "Pass1").load()
val filter_DF = load_DF.select($"x".isNotNull,$"y".isNotNull).rdd
// lets print first 5 values of filter_DF
filter_DF.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])
As shown, the above result doesn't give me the actual values; it returns Boolean values (true when the value is not Null and false when it is Null).
Try this:
val load_DF = spark.read.format("jdbc").option("url", "jdbc:mysql://100.150.200.250:3306").option("dbtable", "schema.table_name").option("user", "uname1").option("password", "Pass1").load()
Now;
load_DF.select($"x",$"y").filter("x !== null").filter("y !== null")
Spark provides DataFrameNaFunctions for this purpose of dropping null values, etc.
In your example above you just need to call the following on a DataSet that you load
val noNullValues = load_DF.na.drop("all", Seq("x", "y"))
This will drop records where nulls occur in either field x or y but not z. You can read up on DataFrameNaFunctions for further options to fill in data, or translate values if required.
Apply "any" in na.drop:
df = df.select("x", "y")
.na.drop("any", Seq("x", "y"))
You are simply applying a function (in this case isNotNull) to the values when you do a select - instead you need to replace select with filter.
val filter_DF = load_DF.filter($"x".isNotNull && $"y".isNotNull)
or if you prefer:
val filter_DF = load_DF.filter($"x".isNotNull).filter($"y".isNotNull)

Result display showing weird output with SQL : Spark

I am trying to do some analysis with Spark. I tried the same query with foreach, which shows the results correctly, but if I use show or run it in SQL the output is weird: it is not showing anything.
sqlContext.sql("select distinct device from TestTable1 where id = 23233").collect.foreach(println)
[ipad]
[desktop]
[playstation]
[iphone]
[android]
[smarTv]
gives the proper devices, but if I just use show or any SQL:
sqlContext.sql("select distinct device from TestTable1 where id = 23233").show()
%sql
select distinct device from TestTable1 where id = 23233
+-----------+
|device |
+-----------+
| |
| |
|ion|
| |
| |
| |
+-----------+
I need graphs and charts, so I would like to use %sql. But this is giving weird results with %sql. Does anyone have any idea why I am getting this?
show is a formatted output of your data, whereas collect.foreach(println) is merely printing the Row data. They are two different things. If you want to format your data in a specific way, then stick with foreach...keeping in mind you are printing a sequence of Row. You'll have to pull the data out of the row if you want to get your own formatting for each column.
I can probably provide more specific information if you provide the version of spark and zeppelin that you are using.
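If you do want your own formatting, a small sketch (assuming the same TestTable1 and device column as in the question) of pulling the value out of each Row before printing:
# Extract the column value from each Row instead of printing the Row itself.
rows = sqlContext.sql("select distinct device from TestTable1 where id = 23233").collect()
for row in rows:
    print(repr(row['device']))  # repr makes stray whitespace or control characters visible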
You stated that you are using %sql because you need Zeppelin's graphs and charts--i.e. you wouldn't be swapping to %sql if you didn't have to.
You can just stick with using Spark dataframes by using z.show(), for example:
%pyspark
df = sqlContext.createDataFrame([
(23233, 'ipad'),
(23233, 'ipad'),
(23233, 'desktop'),
(23233, 'playstation'),
(23233, 'iphone'),
(23233, 'android'),
(23233, 'smarTv'),
(12345, 'ipad'),
(12345, 'palmPilot'),
], ('id', 'device'))
foo = df.filter('id = 23233').select('device').distinct()
z.show(foo)
In the above, z.show(foo) renders the default Zeppelin table view, with options for the other chart types.
