The result of applying "limit" in Spark SQL is not as expected - apache-spark

import spark.implicits._
import org.apache.spark.sql.functions.col

var data = Seq[(String, Int)]()
for (i <- 1 until 10000) {
  val str = f"value: ${i}"
  data = data :+ (str, i)
}
val df = spark.sparkContext.parallelize(data).toDF()
df.createOrReplaceTempView("v_logs")
val a = spark.sql(
  f"""
  SELECT * FROM v_logs limit 20 -- <---- query
  """
)
a.show()                    // <----- 1
a.show()                    // <----- 2
a.show()                    // <----- 3
a.select(col("_2")).show()  // <----- 4
a.select(col("_2")).show()  // <----- 5
a.select(col("_2")).show()  // <----- 6
This is some Spark code written in Scala.
I expected the results of 1, 2, and 3 to be the same, and 4, 5, and 6 to be the same, but they weren't.
Of course, adding "order by _2" to the query gives the expected result. I think this is because of the inner workings of Spark, but I'm not sure. Could you please elaborate on this?

a.select(col("_2")) doesn't order the column
I tried your code but get expected results:
1,2,3 are all listing:
+---------+---+
| _1| _2|
+---------+---+
| value: 1| 1|
| value: 2| 2|
| value: 3| 3|
| value: 4| 4|
| value: 5| 5|
| value: 6| 6|
| value: 7| 7|
| value: 8| 8|
| value: 9| 9|
|value: 10| 10|
|value: 11| 11|
|value: 12| 12|
|value: 13| 13|
|value: 14| 14|
|value: 15| 15|
|value: 16| 16|
|value: 17| 17|
|value: 18| 18|
|value: 19| 19|
|value: 20| 20|
+---------+---+
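For what it's worth, and this is my reading of Spark's behaviour rather than part of the original answer: a DataFrame is just a query plan, so every show() re-executes the job, and a LIMIT without an ORDER BY is free to take its 20 rows from whichever partitions happen to be read first. Whether repeated runs agree therefore depends on partitioning and scheduling, which would explain why the behaviour reproduces in one environment and not another. A minimal PySpark sketch of the deterministic variant, assuming an active SparkSession named spark:
# Hypothetical PySpark reproduction; column names _1/_2 mirror the Scala example.
df = spark.createDataFrame([(f"value: {i}", i) for i in range(1, 10000)], ["_1", "_2"])
df.createOrReplaceTempView("v_logs")

unordered = spark.sql("SELECT * FROM v_logs LIMIT 20")            # returned rows may differ between actions
ordered = spark.sql("SELECT * FROM v_logs ORDER BY _2 LIMIT 20")  # deterministic: the 20 smallest _2 values

ordered.show()               # repeated calls now return the same rows
ordered.select("_2").show()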

Related

window function on a subset of data

I have a table like the one below. I want to calculate an average of median, but only for Q=2 and Q=3. I don't want to include the other Qs, but I still want to preserve the data.
df = spark.createDataFrame([('2018-03-31',6,1),('2018-03-31',27,2),('2018-03-31',3,3),('2018-03-31',44,4),('2018-06-30',6,1),('2018-06-30',4,3),('2018-06-30',32,2),('2018-06-30',112,4),('2018-09-30',2,1),('2018-09-30',23,4),('2018-09-30',37,3),('2018-09-30',3,2)],['date','median','Q'])
+----------+--------+---+
| date| median | Q |
+----------+--------+---+
|2018-03-31| 6| 1|
|2018-03-31| 27| 2|
|2018-03-31| 3| 3|
|2018-03-31| 44| 4|
|2018-06-30| 6| 1|
|2018-06-30| 4| 3|
|2018-06-30| 32| 2|
|2018-06-30| 112| 4|
|2018-09-30| 2| 1|
|2018-09-30| 23| 4|
|2018-09-30| 37| 3|
|2018-09-30| 3| 2|
+----------+--------+---+
Expected output:
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| null|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| null|
|2018-06-30| 6| 1| null|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| null|
|2018-09-30| 2| 1| null|
|2018-09-30| 23| 4| null|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
OR
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| 15|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| 15|
|2018-06-30| 6| 1| 18|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| 18|
|2018-09-30| 2| 1| 20|
|2018-09-30| 23| 4| 20|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
I tried the following code but when I include the where statement it drops Q=1 and Q=4.
window = (
    Window
    .partitionBy("date")
    .orderBy("date")
)
df_avg = (
    df
    .where(
        (F.col("Q") == 2) |
        (F.col("Q") == 3)
    )
    .withColumn("result", F.avg("median").over(window))
)
For both of your expected outputs, you can use conditional aggregation: avg combined with when (and otherwise where needed).
If you want the 1st expected output:
import pyspark.sql.functions as F
from pyspark.sql import Window

window = (
    Window
    .partitionBy("date", F.col("Q").isin([2, 3]))
)
df_avg = (
    df.withColumn("result", F.when(F.col("Q").isin([2, 3]), F.avg("median").over(window)))
)
For the 2nd expected output:
window = (
    Window
    .partitionBy("date")
)
df_avg = (
    df.withColumn("result", F.avg(F.when(F.col("Q").isin([2, 3]), F.col("median"))).over(window))
)
Alternatively, since you are really aggregating a (small?) subset, you can replace the window with an aggregate plus a join back onto the original dataframe:
>>> from pyspark.sql.functions import avg, col
>>> # average of median over the Q=2 and Q=3 rows only, per date
>>> df_avg = df.where(col("Q").isin([2, 3])).groupBy("date").agg(avg("median").alias("result"))
>>> # the left join keeps the Q=1 and Q=4 rows, matching the 2nd expected output
>>> df_result = df.join(df_avg, ["date"], "left")
This might turn out to be faster than using a window.
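If you instead want the join route to produce the 1st expected output (nulls for Q=1 and Q=4), one option, shown here as my own sketch assuming F is pyspark.sql.functions, is to mask the joined column afterwards:
import pyspark.sql.functions as F

df_result = (
    df.join(df_avg, ["date"], "left")
      .withColumn("result", F.when(F.col("Q").isin([2, 3]), F.col("result")))
)
Without an otherwise clause, when() leaves the non-matching rows as null.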

how to split one column and keep other columns in pyspark dataframe?

I have data like this:
>>> data = sc.parallelize([[1,5,10,0,[1,2,3,4,5,6]],[0,10,20,1,[2,3,4,5,6,7]],[1,15,25,0,[3,4,5,6,7,8]],[0,30,40,1,[4,5,6,7,8,9]]]).toDF(('a','b','c',"d","e"))
>>> data.show()
+---+---+---+---+------------------+
| a| b| c| d| e|
+---+---+---+---+------------------+
| 1| 5| 10| 0|[1, 2, 3, 4, 5, 6]|
| 0| 10| 20| 1|[2, 3, 4, 5, 6, 7]|
| 1| 15| 25| 0|[3, 4, 5, 6, 7, 8]|
| 0| 30| 40| 1|[4, 5, 6, 7, 8, 9]|
+---+---+---+---+------------------+
# columns that should be kept in the result
keep_cols = ["a", "b"]
# column 'e' should be split into split_e_cols
split_e_cols = ["one", "two", "three", "four", "five", "six"]
# I hope the result dataframe has keep_cols + split_e_cols
I want to split column e into multiple columns and keep columns a and b at the same time.
I have tried:
data.select(*(col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(len(split_e_cols)))))
and
data.select("e").rdd.flatMap(lambda x:x).toDF(split_e_cols)
neither can keep columns a and b.
Could anyone help me? Thanks.
Try this:
from pyspark.sql.functions import col

select_cols = [col(c) for c in keep_cols] + [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
data.select(*select_cols).show()
#+---+---+---+---+-----+----+----+---+
#| a| b|one|two|three|four|five|six|
#+---+---+---+---+-----+----+----+---+
#| 1| 5| 1| 2| 3| 4| 5| 6|
#| 0| 10| 2| 3| 4| 5| 6| 7|
#| 1| 15| 3| 4| 5| 6| 7| 8|
#| 0| 30| 4| 5| 6| 7| 8| 9|
#+---+---+---+---+-----+----+----+---+
Or using a for loop and withColumn:
data = data.select(keep_cols + ["e"])
for i in range(len(split_e_cols)):
    data = data.withColumn(split_e_cols[i], col("e").getItem(i))
data.drop("e").show()
You can concatenate the lists using +:
from pyspark.sql.functions import col
data.select(
keep_cols +
[col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
).show()
+---+---+---+---+-----+----+----+---+
| a| b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
| 1| 5| 1| 2| 3| 4| 5| 6|
| 0| 10| 2| 3| 4| 5| 6| 7|
| 1| 15| 3| 4| 5| 6| 7| 8|
| 0| 30| 4| 5| 6| 7| 8| 9|
+---+---+---+---+-----+----+----+---+
A more pythonic way is to use enumerate instead of range(len()):
from pyspark.sql.functions import col
data.select(
keep_cols +
[col("e").getItem(i).alias(c) for (i, c) in enumerate(split_e_cols)]
).show()
+---+---+---+---+-----+----+----+---+
| a| b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
| 1| 5| 1| 2| 3| 4| 5| 6|
| 0| 10| 2| 3| 4| 5| 6| 7|
| 1| 15| 3| 4| 5| 6| 7| 8|
| 0| 30| 4| 5| 6| 7| 8| 9|
+---+---+---+---+-----+----+----+---+

create unique id for combination of a pair of values from two columns in a spark dataframe

I have a Spark dataframe with six columns, say (col1, col2, ..., col6). I want to create a unique id for each combination of values from "col1" and "col2" and add it to the dataframe. Can someone help me with some pyspark code for how to do it?
You can achieve it using monotonically_increasing_id (pyspark > 1.6) or monotonicallyIncreasingId (pyspark < 1.6):
>>> from pyspark.sql.functions import monotonically_increasing_id
>>> rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
>>> df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
>>> df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
| 12| 23| 3| 4| 5| 6|
| 12| 23| 56| 67| 89| 20|
| 12| 23| 0| 0| 0| 0|
| 12| 2| 12| 12| 12| 23|
| 1| 2| 3| 4| 56| 7|
| 1| 2| 3| 4| 56| 7|
+-----+-----+-----+-----+-----+-----+
>>> df_1=df.groupBy(df.col_1,df.col_2).count().withColumn("id", monotonically_increasing_id()).select(['col_1','col_2','id'])
>>> df_1.show()
+-----+-----+-------------+
|col_1|col_2| id|
+-----+-----+-------------+
| 12| 23| 34359738368|
| 1| 2|1434519076864|
| 12| 2|1554778161152|
+-----+-----+-------------+
>>> df.join(df_1,(df.col_1==df_1.col_1) & (df.col_2==df_1.col_2)).drop(df_1.col_1).drop(df_1.col_2).show()
+-----+-----+-----+-----+-----+-----+-------------+
|col_3|col_4|col_5|col_6|col_1|col_2| id|
+-----+-----+-----+-----+-----+-----+-------------+
| 3| 4| 5| 6| 12| 23| 34359738368|
| 56| 67| 89| 20| 12| 23| 34359738368|
| 0| 0| 0| 0| 12| 23| 34359738368|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 12| 12| 12| 23| 12| 2|1554778161152|
+-----+-----+-----+-----+-----+-----+-------------+
If you really need to generate the unique ID from col1 and col2, you can also create a hash value leveraging Spark's sha2 function.
First let's generate some dummy data with:
from random import randint
max_range = 10
df1 = spark.createDataFrame(
    [(x, x * randint(1, max_range), x * 10 * randint(1, max_range)) for x in range(1, max_range)],
    ['C1', 'C2', 'C3'])
>>> df1.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 1| 1| 60|
| 2| 14|180|
| 3| 21|270|
| 4| 16|360|
| 5| 35|250|
| 6| 30|480|
| 7| 28|210|
| 8| 80|320|
| 9| 45|360|
+---+---+---+
Then create a new uid column from columns C2 and C3 with the following code:
from pyspark.sql.functions import col, sha2, concat
df1.withColumn("uid", sha2(concat(col("C2"), col("C3")), 256)).show(10, False)
And the output:
+---+---+---+--------------------+
| C1| C2| C3| uid|
+---+---+---+--------------------+
| 1| 1| 60|a512db2741cd20693...|
| 2| 14|180|2f6543dc6c0e06e4a...|
| 3| 21|270|bd3c65ddde4c6f733...|
| 4| 16|360|c7a1e8c59fc9dcc21...|
| 5| 35|250|cba1aeb7a72d9ae27...|
| 6| 30|480|ad7352ff8927cf790...|
| 7| 28|210|ea7bc25aa7cd3503f...|
| 8| 80|320|02e1d953517339552...|
| 9| 45|360|b485cf8f710a65755...|
+---+---+---+--------------------+
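One caveat worth adding to the hash approach, as my own note rather than part of the original answer: concat("12", "3") and concat("1", "23") produce the same string, so two different (C2, C3) pairs can end up with the same uid, and concat returns null if either column is null. Using concat_ws with a separator avoids both issues:
from pyspark.sql.functions import col, concat_ws, sha2

# same idea, but with a separator so pairs like (12, 3) and (1, 23) no longer collide
df1.withColumn("uid", sha2(concat_ws("||", col("C2"), col("C3")), 256)).show(10, False)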

pyspark sqlfunction expr function not working as expected?

The pyspark sql function expr is not working as expected.
My test1.txt contains:
101|10|4
101|12|1
101|13|3
101|14|2
My test2.txt contains:
101|10|4
101|11|1
101|13|3
101|14|2
I have created two dataframes from the above data with the code below:
import pyspark.sql.functions as sf

df3 = spark.createDataFrame(sc.textFile("C://Users//cravi//Desktop//test1.txt").map(lambda x: x.split("|")[:3]), ["cid", "pid", "pr"])
df4 = spark.createDataFrame(sc.textFile("C://Users//cravi//Desktop//test2.txt").map(lambda x: x.split("|")[:3]), ["cid", "pid", "p"])
df5 = df4.withColumnRenamed("p", "p")\
    .join(df3.withColumnRenamed("pr", "Pr")\
    , ["cid", "pid"], "outer")\
    .na.fill(0)
tt = df5.withColumn('flag', sf.expr("case when p>0 and pr=='null' then 'N'\
    when p=0 and Pr>0 then 'D'\
    when p=Pr then 'R'\
    else 'U' end"))
tt.show()
I am getting output like below
+---+---+----+----+----+
|cid|pid| p| Pr|flag|
+---+---+----+----+----+
|101| 14| 2| 2| R|
|101| 10| 4| 4| R|
|101| 11| 1|null| U|
|101| 12|null| 1| U|
|101| 13| 3| 3| R|
+---+---+----+----+----+
The pyspark expr function is not working as expected.
If p and Pr are the same, the flag should be 'R'.
If p has some value and Pr is null, the flag should be 'N'.
If p is null and Pr has some value, the flag should be 'D'.
In any other case, the flag should be 'U'.
In this case the expected output is:
+---+---+----+----+----+
|cid|pid| p| Pr|flag|
+---+---+----+----+----+
|101| 14| 2| 2| R|
|101| 10| 4| 4| R|
|101| 11| 1|null| N|
|101| 12|null| 1| D|
|101| 13| 3| 3| R|
+---+---+----+----+----+
The built-in isNull and isNotNull functions should solve your issue; they can be used in the query as:
tt=df5.withColumn('flag', sf.expr("case when isNotNull(`p`) and isNull(`pr`) then 'N'\
when isNull(`p`) and isNotNull(`Pr`) then 'D'\
when p=Pr then 'R'\
else 'U' end"))
Thus you should get
+---+---+----+----+----+
|cid|pid| p| Pr|flag|
+---+---+----+----+----+
|101| 14| 2| 2| R|
|101| 10| 4| 4| R|
|101| 11| 1|null| N|
|101| 12|null| 1| D|
|101| 13| 3| 3| R|
+---+---+----+----+----+
Note: na.fill(0) is useless here; it has no effect since the columns are StringType().
I hope the answer is helpful
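For reference, and as my own sketch rather than part of the original answer, the same logic can also be written with the Column API instead of a SQL expression string; this assumes sf is pyspark.sql.functions, as in the question:
tt = df5.withColumn(
    "flag",
    sf.when(sf.col("p").isNotNull() & sf.col("Pr").isNull(), "N")
      .when(sf.col("p").isNull() & sf.col("Pr").isNotNull(), "D")
      .when(sf.col("p") == sf.col("Pr"), "R")
      .otherwise("U"),
)
tt.show()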

How to use variable arguments _* in udf with Scala/Spark?

I have a dataframe where the number of columns is variable. Every column's type is Int, and I want to get the sum of all columns. I thought of using :_*; this is my code:
val arr = Array(1, 4, 3, 2, 5, 7, 3, 5, 4, 18)
val input = new ArrayBuffer[(Int, Int)]()
for (i <- 0 until 10) {
  input.append((i, arr(i % 10)))
}
var df = sc.parallelize(input, 3).toDF("value1", "value2")
val cols = new ArrayBuffer[Column]()
val colNames = df.columns
for (name <- colNames) {
  cols.append(col(name))
}
val func = udf((s: Int*) => s.sum)
df.withColumn("sum", func(cols: _*)).show()
But I get an error:
Error:(101, 27) ')' expected but identifier found.
val func = udf((s: Int*) => s.sum)
How can I use :_* in a udf?
My expected result is:
+------+------+---+
|value1|value2|sum|
+------+------+---+
| 0| 1| 1|
| 1| 4| 5|
| 2| 3| 5|
| 3| 2| 5|
| 4| 5| 9|
| 5| 7| 12|
| 6| 3| 9|
| 7| 5| 12|
| 8| 4| 12|
| 9| 18| 27|
+------+------+---+
This may be what you expect:
val func = udf((s: Seq[Int]) => s.sum)
df.withColumn("sum", func(array(cols: _*))).show()
where array is org.apache.spark.sql.functions.array, which creates a new array column; the input columns must all have the same data type.
A Spark UDF does not support variable-length arguments.
Here is a solution for your problem:
import spark.implicits._
import org.apache.spark.sql.functions.col

val input = Array(1, 4, 3, 2, 5, 7, 3, 5, 4, 18).zipWithIndex
var df = spark.sparkContext.parallelize(input, 3).toDF("value2", "value1")
df.withColumn("total", df.columns.map(col(_)).reduce(_ + _)).show()
Output:
+------+------+-----+
|value2|value1|total|
+------+------+-----+
| 1| 0| 1|
| 4| 1| 5|
| 3| 2| 5|
| 2| 3| 5|
| 5| 4| 9|
| 7| 5| 12|
| 3| 6| 9|
| 5| 7| 12|
| 4| 8| 12|
| 18| 9| 27|
+------+------+-----+
Hope this helps
You can try VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.DenseVector

// setInputCols takes the columns to assemble, e.g. Array("value1", "value2") for the dataframe above
val assembler = new VectorAssembler().
  setInputCols(Array("your column name")).
  setOutputCol("allNum")
val assembledDF = assembler.transform(df)
assembledDF.show
+------+------+----------+
|value1|value2| allNum|
+------+------+----------+
| 0| 1| [0.0,1.0]|
| 1| 4| [1.0,4.0]|
| 2| 3| [2.0,3.0]|
| 3| 2| [3.0,2.0]|
| 4| 5| [4.0,5.0]|
| 5| 7| [5.0,7.0]|
| 6| 3| [6.0,3.0]|
| 7| 5| [7.0,5.0]|
| 8| 4| [8.0,4.0]|
| 9| 18|[9.0,18.0]|
+------+------+----------+
def yourSumUDF = udf((allNum:Vector) => new DenseVector(allNum.toArray).sum)
assembledDF.withColumn("sum", yourSumUDF($"allNum")).show
+------+------+----------+----+
|value1|value2| allNum| sum|
+------+------+----------+----+
| 0| 1| [0.0,1.0]| 1.0|
| 1| 4| [1.0,4.0]| 5.0|
| 2| 3| [2.0,3.0]| 5.0|
| 3| 2| [3.0,2.0]| 5.0|
| 4| 5| [4.0,5.0]| 9.0|
| 5| 7| [5.0,7.0]|12.0|
| 6| 3| [6.0,3.0]| 9.0|
| 7| 5| [7.0,5.0]|12.0|
| 8| 4| [8.0,4.0]|12.0|
| 9| 18|[9.0,18.0]|27.0|
+------+------+----------+----+
