I have a table with a single column whose one row holds a where clause.
from pyspark.sql.types import *
where_clause_df=spark.createDataFrame([('A > 1',)],schema=StructType([StructField("a_where", StringType(), nullable=True)]))
where_clause_df.createOrReplaceTempView("where_clause")
spark.sql("select * from where_clause").show()
+-------+
|a_where|
+-------+
| A > 1|
+-------+
With another table,
sample_df=spark.createDataFrame([(1,)],schema=StructType([StructField("A", IntegerType(), nullable=True)]))
sample_df.createOrReplaceTempView("sample")
spark.sql("select * from sample").show()
I want to use this a_where value to filter the sample table. Something like:
spark.sql("""
select * from sample where (select a_where from where_clause)
""").show()
Is this possible with Spark SQL?
tl;dr Use collect on the where_clause table.
Think of the data as something that lives (almost) always on the executors, from which you are not allowed to execute queries. That's by design.
Since you want to execute queries, you need everything relevant on the driver, so you have to bring this extra query metadata (like where clauses) to the driver. Bingo! That's exactly what collect does.
Mind though that the data you "download" to the driver using collect has to fit within the memory available to this single driver process (and that's likely the case here).
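A minimal PySpark sketch of that idea, reusing the temp views from the question (variable names are just for illustration):
# Bring the where clause string to the driver with collect...
where_clause = spark.sql("select a_where from where_clause").collect()[0][0]
# ...then splice it into the query text built on the driver
spark.sql(f"select * from sample where {where_clause}").show()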
You are trying to extract the where clause string from your temp view, hence the error.
You can modify your code slightly to achieve this:
where_clause_df = spark.createDataFrame([('A > 1',)], schema=StructType([StructField("a_where", StringType(), nullable=True)]))
where_clause_df.createOrReplaceTempView("where_clause")
spark.sql("select * from where_clause").show()
sample_df = spark.createDataFrame([(1,)], schema=StructType([StructField("A", IntegerType(), nullable=True)]))
sample_df.createOrReplaceTempView("sample")
spark.sql("select * from sample").show()
# Extract the where clause string ('A > 1') to the driver
where_clause = spark.sql("select a_where from where_clause").collect()[0][0]
query = f"""
select *
from sample
where {where_clause}
"""
spark.sql(query).show()
+---+
| A|
+---+
+---+
Further, if there are multiple conditions, you can iterate over them and modify the query in each iteration to extract the results, as sketched below.
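A rough sketch of that iteration, assuming where_clause holds one condition per row (names are illustrative):
# Collect all conditions to the driver, then run one query per condition
conditions = [row.a_where for row in spark.sql("select a_where from where_clause").collect()]
for cond in conditions:
    spark.sql(f"select * from sample where {cond}").show()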
Related
I have a spark dataframe that has a list of timestamps (partitioned by uid, ordered by timestamp). Now, I'd like to query the dataframe to get either previous or next record.
from pyspark.sql.functions import desc

df = myrdd.toDF().repartition("uid").sort(desc("timestamp"))
df.show()
+------------+-------------------+
|uid         |timestamp          |
+------------+-------------------+
|Peter_Parker|2020-09-19 02:14:40|
|Peter_Parker|2020-09-19 01:07:38|
|Peter_Parker|2020-09-19 00:04:39|
|Peter_Parker|2020-09-18 23:02:36|
|Peter_Parker|2020-09-18 21:58:40|
+------------+-------------------+
So for example if I were to query:
ts=datetime.datetime(2020, 9, 19, 0, 4, 39)
I want to get the previous record on (2020-09-18 23:02:36), and only that one.
How can I get the previous one?
It's possible to do it using withColumn() and a diff, but is there a smarter, more efficient way of doing that? I really don't need to calculate a diff for ALL events, since the data is already ordered. I just want the previous/next record.
You can use a filter and order by, and then limit the results to 1 row:
df2 = (df.filter("uid = 'Peter_Parker' and timestamp < timestamp('2020-09-19 00:04:39')")
.orderBy('timestamp', ascending=False)
.limit(1)
)
df2.show()
+------------+-------------------+
| uid| timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-18 23:02:36|
+------------+-------------------+
Or by using row_number after filtering:
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
.withColumn("rn", F.row_number().over(Window.orderBy(F.desc("timestamp")))) \
.filter("rn = 1").drop("rn")
df1.show()
#+------------+-------------------+
#| uid| timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-18 23:02:36|
#+------------+-------------------+
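To get the next record instead, the same pattern should work with the comparison and the sort direction flipped (a sketch along the lines of the first answer, not a tested snippet):
df_next = (df.filter("uid = 'Peter_Parker' and timestamp > timestamp('2020-09-19 00:04:39')")
             .orderBy('timestamp', ascending=True)
             .limit(1))
df_next.show()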
My column has data like this:
col
---
abc|#|pqr|#|xyz
aaa|#|sss|#|sdf
It is delimited by |#| (pipe, #, pipe).
How can I split this with Spark SQL?
I am trying spark.sql("select split(col,'|#|')").show() but it is not giving me the proper result.
I tried escaping with \ but still no luck.
Does anyone know what is going on here?
Note: I need a solution for Spark SQL only.
I am not sure if I have understood your problem statement properly, but splitting a string by its delimiter is fairly simple and can be done in a variety of ways.
One of the methods is to use SUBSTRING_INDEX -
val data = Seq(("abc|#|pqr|#|xyz"),("aaa|#|sss|#|sdf")).toDF("col1")
data.createOrReplaceTempView("testSplit")
followed by -
%sql
select *,
       substring_index(col1,'|#|',1) as value1,
       substring_index(col1,'|#|',2) as value2,
       substring_index(col1,'|#|',3) as value3
from testSplit
Result -
OR - Split Function Documentation
%sql
select *,SPLIT(col1,'\\|#\\|') as SplitString from testSplit
Result -
Do let me know if this fulfills your requirement or not.
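Since the question uses spark.sql from PySpark, the same SPLIT call should look roughly like this there as well (a sketch; it assumes the testSplit view created above):
spark.sql(r"select col1, split(col1, '\\|#\\|') as SplitString from testSplit").show(truncate=False)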
Check the code below.
scala> adf.withColumn("split_data",split($"data","\\|#\\|")).show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
scala> spark.sql("select * from split_data").show(false)
+---------------+
|data |
+---------------+
|abc|#|pqr|#|xyz|
|aaa|#|sss|#|sdf|
+---------------+
scala> spark.sql("""select data,split('abc|#|pqr|#|xyz', '\\|\\#\\|') as split_data from split_data""").show(false)
+---------------+---------------+
|data |split_data |
+---------------+---------------+
|abc|#|pqr|#|xyz|[abc, pqr, xyz]|
|aaa|#|sss|#|sdf|[aaa, sss, sdf]|
+---------------+---------------+
Note: inside the spark.sql function, pass your select query between """ """ and escape special symbols with \\.
I'm having trouble with Spark SQL. I tried to import a CSV file into a Spark database. My columns are separated by semicolons. I have tried to separate the columns by using sep, but to my dismay, the columns are not separated properly.
Is this how Spark SQL works, or is there a difference between conventional Spark SQL and the one in Databricks? I am new to Spark SQL, a whole new environment compared to the original SQL language, so please pardon my limited knowledge of it.
USE CarSalesP1935727;
CREATE TABLE IF NOT EXISTS Products
USING CSV
OPTIONS (path "/FileStore/tables/Products.csv", header "true", inferSchema
"true", sep ";");
SELECT * FROM Products LIMIT 10
Not sure about the problem; this works well for me.
Please note that the environment is not Databricks.
val path = getClass.getResource("/csv/test2.txt").getPath
println(path)
/**
* file data
* -----------
* id;sequence1;sequence2
* 1;657985;657985
* 2;689654;685485
*/
spark.sql(
s"""
|CREATE TABLE IF NOT EXISTS Products
|USING CSV
|OPTIONS (path "$path", header "true", inferSchema
|"true", sep ";")
""".stripMargin)
spark.sql("select * from Products").show(false)
/**
* +---+---------+---------+
* |id |sequence1|sequence2|
* +---+---------+---------+
* |1 |657985 |657985 |
* |2 |689654 |685485 |
* +---+---------+---------+
*/
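If it helps to sanity-check the separator outside of SQL, the DataFrame reader takes the same options; a PySpark sketch using the path from the question:
df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .option("sep", ";")
          .csv("/FileStore/tables/Products.csv"))
df.show(10)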
Is there any provision for doing "INSERT IF NOT EXISTS ELSE UPDATE" in Spark SQL?
I have a Spark SQL table "ABC" that has some records.
And then I have another batch of records that I want to insert/update in this table based on whether they already exist in it or not.
Is there a SQL command that I can use in a SQL query to make this happen?
In regular Spark this could be achieved with a join followed by a map like this:
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("id1", "original"), ("id2", "original"))).toDF("df1_id", "df1_status")
val df2 = spark.sparkContext.parallelize(List(("id1", "new"), ("id3","new"))).toDF("df2_id", "df2_status")
val df3 = df1
.join(df2, 'df1_id === 'df2_id, "outer")
.map(row => {
if (row.isNullAt(2))
(row.getString(0), row.getString(1))
else
(row.getString(2), row.getString(3))
})
This yields:
scala> df3.show
+---+--------+
| _1| _2|
+---+--------+
|id3| new|
|id1| new|
|id2|original|
+---+--------+
You could also use select with udfs instead of map, but in this particular case with null-values, I personally prefer the map variant.
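For comparison, here is a rough PySpark sketch of the same null-preference logic using coalesce in a select instead of a map/udf (column names follow the Scala example above; this is illustrative, not the answer's own code):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("id1", "original"), ("id2", "original")], ["df1_id", "df1_status"])
df2 = spark.createDataFrame([("id1", "new"), ("id3", "new")], ["df2_id", "df2_status"])

# Prefer the new (df2) values and fall back to the original (df1) ones
df3 = (df1.join(df2, df1["df1_id"] == df2["df2_id"], "outer")
          .select(F.coalesce(df2["df2_id"], df1["df1_id"]).alias("id"),
                  F.coalesce(df2["df2_status"], df1["df1_status"]).alias("status")))
df3.show()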
You can use Spark SQL like this:
select * from (
  select c.*, row_number() over (partition by tac order by tag desc) as TAG_NUM
  from (
    select
      a.tac
      ,a.name
      ,0 as tag
    from tableA a
    union all
    select
      b.tac
      ,b.name
      ,1 as tag
    from tableB b
  ) c
) d
where TAG_NUM = 1
tac is the column you want to insert/update by.
I know it's a bit late to share my code, but to add or update my database, I wrote a function that looks like this:
import pandas as pd

# Returns a spark dataframe with added and updated data
# key parameter is the primary key of the dataframes
# The two parameters dfToUpdate and dfToAddAndUpdate are spark dataframes
def AddOrUpdateDf(dfToUpdate, dfToAddAndUpdate, key):
    # Cast the spark dataframe dfToUpdate to a pandas dataframe
    dfToUpdatePandas = dfToUpdate.toPandas()
    # Cast the spark dataframe dfToAddAndUpdate to a pandas dataframe
    dfToAddAndUpdatePandas = dfToAddAndUpdate.toPandas()
    # Update the table records with the latest records, adding new records where needed
    AddOrUpdatePandasDf = pd.concat([dfToUpdatePandas, dfToAddAndUpdatePandas]).drop_duplicates([key], keep='last').sort_values(key)
    # Cast back to get a spark dataframe
    AddOrUpdateDf = spark.createDataFrame(AddOrUpdatePandasDf)
    return AddOrUpdateDf
As you can see, we need to cast the Spark dataframes to pandas dataframes to be able to do the pd.concat, and especially the drop_duplicates with keep='last'; then we cast back to a Spark dataframe and return it.
I don't think this is the best way to handle the add-or-update, but at least it works.
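If the toPandas round-trip is a concern (it collects everything to the driver), the same keep-the-latest logic can be sketched entirely in Spark with a union and a window; a rough, untested variant under the same assumptions:
from pyspark.sql import Window
from pyspark.sql import functions as F

def add_or_update_df(dfToUpdate, dfToAddAndUpdate, key):
    # Tag the new batch so it wins over the existing rows for the same key
    unioned = (dfToUpdate.withColumn("_prio", F.lit(0))
               .unionByName(dfToAddAndUpdate.withColumn("_prio", F.lit(1))))
    w = Window.partitionBy(key).orderBy(F.desc("_prio"))
    return (unioned.withColumn("_rn", F.row_number().over(w))
            .filter("_rn = 1")
            .drop("_prio", "_rn"))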
The problem arises when I call the describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would like now to get the standard deviation and the mean from statsDF, but when I am trying to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I am getting Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
I considered a toy dataset I had containing some health disease data:
val stddev_tobacco = rawData.describe().rdd.map{
case r : Row => (r.getAs[String]("summary"),r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr() which also runs optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
myDataFrame.describe().filter($"summary"==="stddev").show()
This worked quite nicely on Spark 2.3.0
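If you need the actual numbers rather than a printed table, keep in mind that describe() returns every value as a string, so a cast is needed; a small PySpark sketch reusing the names from the question ("count" is the described column's name here):
# describe() returns string values, so cast before using them
stats = {row["summary"]: float(row["count"]) for row in myDataFrame.describe().collect()}
stddev_value = stats["stddev"]
mean_value = stats["mean"]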