bucketing with QuantileDiscretizer using groupBy function in pyspark - apache-spark

I have a large dataset like so:
+-------+------+
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|239.58|
|3462099|239.53|
|3462099|239.66|
|3462099|239.63|
|3462099|239.52|
|3462099|239.58|
|3462099|239.52|
|3462099|239.64|
|3462099|239.71|
|3462099|239.64|
|3462099|239.65|
|3462099|239.54|
|3462099| 239.6|
|3462099|239.56|
|3462099|239.67|
+-------+------+
The RESULT column is grouped by the SEQ_ID column.
I want to bucket/bin the RESULT values based on the counts of each group. After applying some aggregations, I have a data frame with the number of buckets that each SEQ_ID must be binned into, like so:
+-------+-----------+
| SEQ_ID|num_buckets|
+-------+-----------+
|3760290|         12|
|3462099|          5|
|3462099|          5|
|3760290|         13|
|3462099|         13|
|3760288|         10|
|3760288|          5|
|3461201|          6|
|3760288|         13|
|3718665|         18|
+-------+-----------+
So for example, this tells me that the RESULT values that belong to the 3760290 SEQ_ID must be binned in 12 buckets.
For a single group, I would collect() the num_buckets value and do:
from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=num_buckets, inputCol='RESULT', outputCol='buckets')
df_binned = discretizer.fit(df).transform(df)
I understand that when using QuantileDiscretizer this way, each group would result in a separate dataframe, which I can then union all together.
But how can I use QuantileDiscretizer to bin the various groups without using a for loop?
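For reference, here is a rough sketch of the per-group loop described above (the very approach the question hopes to avoid); the names df (the RESULT rows) and df_buckets (the per-SEQ_ID bucket counts) are assumptions for illustration:
from functools import reduce
from pyspark.ml.feature import QuantileDiscretizer

# hypothetical per-group loop: fit one discretizer per SEQ_ID, then union the results
binned_parts = []
for row in df_buckets.dropDuplicates(['SEQ_ID']).collect():
    group_df = df.where(df.SEQ_ID == row['SEQ_ID'])
    discretizer = QuantileDiscretizer(numBuckets=row['num_buckets'],
                                      inputCol='RESULT', outputCol='buckets')
    binned_parts.append(discretizer.fit(group_df).transform(group_df))

df_binned = reduce(lambda a, b: a.union(b), binned_parts)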

Related

How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns including a type, date and value column.
Example DataFrame:
+----+----------+-----+
|type|      date|value|
+----+----------+-----+
|   1|2020-01-21|    6|
|   1|2020-01-16|    5|
|   2|2020-01-20|    8|
|   2|2020-01-15|    4|
+----+----------+-----+
I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like
stream.groupby('type').agg(last('date'), last('value')).writeStream
but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregations is also not supported in structured streaming.
Do you have any suggestions on how to approach this challenge?
Simply use the to_timestamp() function (available from pyspark.sql.functions) on the date column so that you can use the window function.
E.g.:
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])

# convert the string column to a proper timestamp so window() can be used on it
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
df.show(truncate=False)
+---+---------------+-------------------+
|id |input_timestamp|timestamp |
+---+---------------+-------------------+
|1 |2020-01-21 |2020-01-21 00:00:00|
+---+---------------+-------------------+
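As a hedged follow-up sketch (not part of the original answer), the converted timestamp could then feed a time-window aggregation on the stream; the stream variable is the one from the question, and the watermark and window lengths are arbitrary assumptions:
from pyspark.sql.functions import to_timestamp, window, last

# hypothetical: derive a timestamp, then keep the last value per type and time window
agg = (stream
       .withColumn("timestamp", to_timestamp("date"))
       .withWatermark("timestamp", "1 day")
       .groupBy("type", window("timestamp", "1 day"))
       .agg(last("value").alias("value")))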
"but using windows on non-timestamp columns is not supported"
are you saying this from stream point of view, because same i am able to do.
Here is the solution to your problem.
from pyspark.sql import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-16| 5| 1|
| 1|2020-01-21| 6| 2|
| 2|2020-01-15| 4| 1|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
import pyspark.sql.functions as F

w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-21| 6| 2|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+

How could I split an array column from a df into a new df?

I have a dataframe with some columns; one of them is an array of hours, and I want to split this array of hours into new columns, one per index.
For example:
If my array holds 24 hours, I have to create a new df with 24 new columns, one per hour.
You can try Spark's built-in functions posexplode, concat, groupBy, and pivot for this case.
Example:
import org.apache.spark.sql.functions.first
import spark.implicits._

// test dataframe
val df = Seq(("rome", "escuels", Seq(0, 1, 2, 3, 4, 5)),
             ("madrid", "farmacia", Seq(0, 1, 2, 3, 4, 5)))
  .toDF("city", "institute", "monday_hours")

df.selectExpr("posexplode(monday_hours) as (p,c)", "*") // posexplode gives position and value
  .selectExpr("concat('monday_',p) as m", "c", "city", "institute")
  .groupBy("city", "institute")
  .pivot("m")        // pivot on the m column
  .agg(first("c"))   // take the first value of c for each pivoted column
  .show()
Result:
+------+---------+--------+--------+--------+--------+--------+--------+
| city|institute|monday_0|monday_1|monday_2|monday_3|monday_4|monday_5|
+------+---------+--------+--------+--------+--------+--------+--------+
|madrid| farmacia| 0| 1| 2| 3| 4| 5|
| rome| escuels| 0| 1| 2| 3| 4| 5|
+------+---------+--------+--------+--------+--------+--------+--------+
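If the asker is on PySpark rather than Scala, a rough equivalent of the same posexplode/pivot idea might look like this (column names copied from the Scala example above; everything else is illustrative):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("rome", "escuels", [0, 1, 2, 3, 4, 5]),
     ("madrid", "farmacia", [0, 1, 2, 3, 4, 5])],
    ["city", "institute", "monday_hours"])

(df.select("city", "institute", F.posexplode("monday_hours").alias("p", "c"))
   .withColumn("m", F.concat(F.lit("monday_"), F.col("p").cast("string")))  # build per-index column names
   .groupBy("city", "institute")
   .pivot("m")
   .agg(F.first("c"))
   .show())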

Spark union column order

I've come across something strange recently in Spark. As far as I understand, given the column-based storage method of Spark DataFrames, the order of the columns really doesn't have any meaning; they're like keys in a dictionary.
During a df1.union(df2), does the order of the columns matter? I would've assumed that it shouldn't, but according to the wisdom of SQL forums it does.
So we have df1
df1
+---+----+
|  a|   b|
+---+----+
|  1| asd|
|  2|asda|
|  3| f1f|
+---+----+
df2
+----+---+
|   b|  a|
+----+---+
| asd|  1|
|asda|  2|
| f1f|  3|
+----+---+
result
+----+----+
|   a|   b|
+----+----+
|   1| asd|
|   2|asda|
|   3| f1f|
| asd|   1|
|asda|   2|
| f1f|   3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have been appended following the column order of the original dataframes.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
Code to create the test set, if anyone wants to try:
import pandas as pd

d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1 = pd.DataFrame(d1)
pdf2 = pd.DataFrame(d2)
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)
test = df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark >= 2.3 you can use unionByName to union two dataframes with the columns resolved by name.
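For example, a small sketch using the df1/df2 from the question:
# resolves columns by name instead of by position (Spark >= 2.3)
test = df1.unionByName(df2)
test.show()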
In Spark, union is not done on the metadata of the columns, and the data is not shuffled as you might think. Rather, the union is done by column position: if you are unioning two DataFrames, both must have the same number of columns, and you have to take the positions of your columns into consideration before doing the union. Unlike SQL, Oracle, or other RDBMSs, the underlying files in Spark are physical files. Hope that answers your question.

When I use partitionBy in Window, why do I get a different result with Spark/Scala?

I use the Window sum function to get the sum of a value, but when I convert the DataFrame back to an RDD, I find that the result has only one partition. When does the repartitioning occur?
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val df = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
df.show()
println(s"numPartitions ${df.rdd.getNumPartitions}")
// 1
//df is:
// +------+----+
// |values|csum|
// +------+----+
// | 1| 1|
// | 2| 3|
// | 3| 6|
// | 4| 10|
// | 5| 15|
// | 6| 21|
// | 7| 28|
// | 8| 36|
// +------+----+
I added partitionBy to the Window, but the result is wrong. What should I do? This is my changed code:
val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._

val df = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy("values").orderBy("values")))
df.show()
println(s"numPartitions ${df.rdd.getNumPartitions}")
//1
//df is:
// +------+----+
// |values|csum|
// +------+----+
// | 1| 1|
// | 6| 6|
// | 3| 3|
// | 5| 5|
// | 4| 4|
// | 8| 8|
// | 7| 7|
// | 2| 2|
// +------+----+
The Window function API has partitionBy for grouping the dataframe and orderBy for ordering the grouped rows in ascending or descending order.
In your first case you hadn't defined partitionBy, so all the values were grouped into a single group for ordering purposes, which shuffled the data into one partition.
But in your second case you had partitionBy defined on values itself. Since each value is distinct, each row ends up in its own group.
The number of partitions in the second case is 200, as that is the default number of shuffle partitions (spark.sql.shuffle.partitions) in Spark when you haven't configured it and a shuffle occurs.
To get the same result from your second case as with the first case, you need to group your dataframe as in your first case, i.e. into one group. For that you will need to create another column with a constant value and use that column for partitionBy, as sketched below.
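A minimal sketch of that constant-column trick, written in PySpark here for consistency with the rest of the page (the question itself uses Scala); names other than "values" and "csum" are made up:
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame([(v,) for v in [1, 3, 2, 4, 5, 6, 7, 8]], ["values"])

# put every row in the same window partition via a constant column,
# then compute the running sum ordered by "values"
w = Window.partitionBy("grp").orderBy("values")
df_csum = (df.withColumn("grp", F.lit(1))
             .withColumn("csum", F.sum("values").over(w))
             .drop("grp"))
df_csum.show()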
When you create a column as
withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
The Window.orderBy("values") is ordering the values of column "values" in single partition since you haven't defined partitionBy() method to define the partition.
This is changing the number of partition from initial 4 to 1.
The partition is 200 in your second case since the partitionBy()method uses 200 as default partition. if you need the number of partition as 4 you can use methods like repartition(4) or coalesce(4)
Hope you got the point!

Add columns on a Pyspark Dataframe

I have a Pyspark Dataframe with this structure:
+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+
I originally had two dataframes, but I outer joined them using user as the key, so there can also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+----+----+
|user| A/B| C|
+----+----+----+
| 1 | 1| 3|
| 2 | 4| 2|
+----+----+----+
Also note that there could be many columns with equal names, so literally selecting each column is not an option. In pandas this was possible by using "user" as the index and then adding both dataframes. How can I do this in Spark?
I have a workaround for this:
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)
Now do the join; the output will contain the values under different names.
Then build the list of tuples of the column names to be combined:
val newlist = df1.columns.filter(!_.equals("user")).zip(dataFrameOneColumns.filter(!_.equals("user")))
And then combine the values of the columns within each tuple to get the desired output!
PS: I am guessing you can write the logic for combining, so I am not spoon-feeding!
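For reference, a rough PySpark sketch of the same idea (rename one side, outer join on user, then sum each pair of equally named columns); df1, df2, and the _1 suffix follow the answer above, everything else is illustrative:
from pyspark.sql import functions as F

# rename every column except the join key on one side
df1_renamed = df1.select(
    [F.col(c).alias(c if c == "user" else c + "_1") for c in df1.columns])

joined = df1_renamed.join(df2, on="user", how="outer")

# sum each pair of equally named columns, treating nulls from the outer join as 0
summed = joined.select(
    "user",
    *[(F.coalesce(F.col(c + "_1"), F.lit(0)) + F.coalesce(F.col(c), F.lit(0))).alias(c)
      for c in df2.columns if c != "user"])
summed.show()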
