Apache Spark Pivot Query Stuck (PySpark)

I have simple data as:
+--------------------+-----------------+-----+
| timebucket_start| user| hits|
+--------------------+-----------------+-----+
|[2017-12-30 01:02...| Messi| 2|
|[2017-12-30 01:28...| Jordan| 9|
|[2017-12-30 11:12...| Jordan| 462|
+--------------------+-----------------+-----+
I am trying to pivot it so that I get the counts for each user in each of the time buckets.
So, my query in PySpark is (using DataFrames):
user_time_matrix = df.groupBy('timebucket_start').pivot("user").sum('hits')
Now, this query just keeps running forever. I tried it on a scaled-up cluster too, doubling my cluster size, but I hit the same issue.
Is the query wrong? Can it be optimized? Why can't Spark finish it?

It's essentially the same thing, but you can try:
import pyspark.sql.functions as F
user_time_matrix = df.groupBy('timebucket_start').pivot("user").agg(F.sum('hits'))
Let me know if you still get an error or it never finishes. Also, when you use this code, the users become the columns, like this:
Input:
+----+----------+------+
|hits| time| user|
+----+----------+------+
| 2|2017-12-30| Messi|
| 3|2017-12-30|Jordan|
| 462|2017-12-30|Jordan|
| 2|2017-12-31| Messi|
| 2|2017-12-31| Messi|
+----+----------+------+
Output:
+----------+------+-----+
| time|Jordan|Messi|
+----------+------+-----+
|2017-12-31| null| 4|
|2017-12-30| 465| 2|
+----------+------+-----+
I would recommend:
user_time_matrix = df.groupBy('timebucket_start', 'user').sum('hits')
Output:
+----------+------+---------+
| time| user|sum(hits)|
+----------+------+---------+
|2017-12-31| Messi| 4|
|2017-12-30|Jordan| 465|
|2017-12-30| Messi| 2|
+----------+------+---------+
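If the pivot step itself is the bottleneck, one extra suggestion (not part of the original answer, just a sketch): pass the pivot values explicitly, so Spark can skip the extra job it otherwise runs to compute the distinct user values first.
import pyspark.sql.functions as F

# Assumes the set of users is known up front (or cheap to collect once).
users = ['Messi', 'Jordan']
user_time_matrix = (
    df.groupBy('timebucket_start')
      .pivot('user', users)
      .agg(F.sum('hits'))
)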

Related

How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns including a type, date and value column.
Example DataFrame:
+----+----------+-----+
|type|      date|value|
+----+----------+-----+
|   1|2020-01-21|    6|
|   1|2020-01-16|    5|
|   2|2020-01-20|    8|
|   2|2020-01-15|    4|
+----+----------+-----+
I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use window functions, but window functions over non-timestamp columns are not supported. Another option would look like
stream.groupby('type').agg(last('date'), last('value')).writeStream
but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregations is also not supported in Structured Streaming.
Do you have any suggestions on how to approach this challenge?
Simply use the to_timestamp() function (which can be imported via from pyspark.sql.functions import *) on the date column so that you can use the window function.
For example:
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])
df.printSchema()

# parse the string column into a timestamp so window functions can use it
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
df.show(truncate=False)
+---+---------------+-------------------+
|id |input_timestamp|timestamp |
+---+---------------+-------------------+
|1 |2020-01-21 |2020-01-21 00:00:00|
+---+---------------+-------------------+
"but using windows on non-timestamp columns is not supported"
are you saying this from stream point of view, because same i am able to do.
Here is the solution to your problem.
from pyspark.sql import functions as F
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-16| 5| 1|
| 1|2020-01-21| 6| 2|
| 2|2020-01-15| 4| 1|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-21| 6| 2|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
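If the rank-based approach above cannot be applied to the stream itself (window functions over non-time columns are not supported in Structured Streaming, as noted in the question), a commonly suggested alternative, sketched here as an assumption rather than part of the answer, is to aggregate the maximum of a (date, value) struct per type; struct comparison goes field by field, so the ordering column comes first.
import pyspark.sql.functions as F

# 'stream' is the streaming DataFrame from the question, with columns type, date, value.
# A non-windowed streaming aggregation like this typically runs in update or complete output mode.
latest_per_type = (
    stream.groupBy("type")
          .agg(F.max(F.struct("date", "value")).alias("latest"))
          .select("type", "latest.date", "latest.value")
)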

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
You do, however, seem to have a keyColumn and a sortKey, so I'd simply suggest the following:
import pyspark
import pyspark.sql.functions as f
w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
withColumn("rn_asc", f.row_number().over(w_asc)). \
withColumn("rn_desc", f.row_number().over(w_desc)). \
where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
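As an optional follow-up (a sketch of what you might do next, not part of the original answer), you can label the edge rows and drop the helper columns:
# Uses the f alias imported above. A single-row group satisfies both
# conditions and is labelled "first" here.
labeled = (
    res_df
    .withColumn("position", f.when(f.col("rn_asc") == 1, "first").otherwise("last"))
    .drop("rn_asc", "rn_desc")
)
labeled.show()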
Scala: I think repartition here is not by some key column; it takes an integer for how many partitions you want to set. I found a way to select the first and last row by using Spark's Window function.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then, I use the Window function twice, because the last row is not easy to identify directly, but in reverse order it is quite easy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")
df.withColumn("marker", when(rank.over(a) === 1, "Y").otherwise("N"))
.withColumn("marker", when(rank.over(d) === 1, "Y").otherwise(col("marker")))
.filter(col("marker") === "Y")
.drop("marker").show
The final result is then,
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach using mapPartitions from RDD API. We iterate over the elements of each partition until we reach the end. I would expect this iteration to be very fast since we skip all the elements of the partition except the two edges. Here is the code:
df = spark.createDataFrame([
    ["Tom", "a"],
    ["Dick", "b"],
    ["Harry", "c"],
    ["Elvis", "d"],
    ["Elton", "e"],
    ["Sandra", "f"]
], ["name", "toy"])

def get_first_last(it):
    try:
        first = last = next(it)
    except StopIteration:
        return []  # empty partition: nothing to emit
    for last in it:
        pass
    # Attention: if first equals last by reference, return only one!
    if first is last:
        return [first]
    return [first, last]

# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)
spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()
# +------+---+
# | name|toy|
# +------+---+
# | Tom| a|
# | Harry| c|
# | Elvis| d|
# |Sandra| f|
# +------+---+
PS: Odd positions will contain the first element of a partition and the even ones the last element. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the extra createDataFrame call.

Spark SQL doesn't respect the DataFrame format

I am analyzing Twitter files in JSON format with Spark SQL, with the goal of finding the trending topics.
After taking all the text from a tweet and splitting it into words, my DataFrame looks like this (a sketch of how such a frame can be built follows the table):
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
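For context, a frame with this line/words layout is typically produced with split plus explode; the sketch below is only illustrative, and tweets_df and the text column name are assumptions:
from pyspark.sql.functions import split, explode, col

# Hypothetical input: tweets_df with a 'text' column holding the raw tweet text.
df = (tweets_df
      .withColumn("line", split(col("text"), " "))    # array of tokens per tweet
      .withColumn("words", explode(col("line"))))     # one row per token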
I just need the words column, so I convert my table to a temp view.
df.createOrReplaceTempView("Twitter_test_2")
With the help of Spark SQL it should be very easy to get the trending topics; I just need a SQL query using the LIKE operator in the WHERE condition: words like '#%'.
spark.sql("select words,
count(words) as count
from words_Twitter
where words like '#%'
group by words
order by count desc limit 10").show(20,False)
but I am getting some strange results that I can't explain.
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason, in the rows with counts 89 and 32, the two hashtags that contain Arabic characters are not where they should be; the text appears swapped with the counter.
Other times I am confronted with this kind of format:
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After that query on my DataFrame, the output looks strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is that happening, and how can I avoid it?

How to rename duplicated columns after join? [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to join 3 dataframes, but there are some columns we don't need or that have names duplicated across the dataframes, so I want to drop some columns, like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
.join(cc_df, 'id', 'left')
.withColumnRenamed(bb_df.status, 'user_status'))
Please note that the status column exists in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried to use withColumn, but the new column is created and the old column still exists.
If you are trying to rename the status column of the bb_df dataframe, then you can do so while joining:
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to use join with 3 dataframe, but there are some columns we don't need or have some duplicate name with other dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly, I only now noticed the Symbol-based variants.)
NOTE: There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop columns, but to select what you want included in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work with positives, which could also be an argument for select.
So, as you asked and I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers using the variant of the join operator that takes the join columns or condition (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
.select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question how to use withColumnRenamed when there are two matching columns (after join).
Let's assume you ended up with the following query and so you've got two id columns (per join side).
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Please see the docs for withColumnRenamed().
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have a 'status' column in both dataframes, you can include it in the join as aa_df.join(bb_df, ['id', 'status'], 'left'), assuming aa_df and bb_df share that column. This way you will not end up with two 'status' columns, as sketched below.
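A minimal sketch of that approach, reusing the dataframe names from the question (it assumes aa_df and bb_df both contain id and status, and cc_df contains id):
result_df = (
    aa_df
    .join(bb_df, ['id', 'status'], 'left')   # single id and status column in the output
    .join(cc_df, 'id', 'left')
)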

SparkSQL - got duplicate rows after join & groupBy

I have 2 dataframes with columns as shown below.
Note: Column uid is not a unique key, and there are duplicate rows with the same uid in the dataframes.
val df1 = spark.read.parquet(args(0)).drop("sv")
val df2 = spark.read.parquet(args(1))
scala> df1.orderBy("uid").show
+----+----+---+
| uid| hid| sv|
+----+----+---+
|uid1|hid2| 10|
|uid1|hid1| 10|
|uid1|hid3| 10|
|uid2|hid1| 2|
|uid3|hid2| 10|
|uid4|hid2| 3|
|uid5|hid3| 5|
+----+----+---+
scala> df2.orderBy("uid").show
+----+----+---+
| uid| pid| sv|
+----+----+---+
|uid1|pid2| 2|
|uid1|pid1| 1|
|uid2|pid1| 2|
|uid3|pid1| 3|
|uid3|pidx|999|
|uid3|pid2| 4|
|uidx|pid1| 2|
+----+----+---+
scala> df1.drop("sv")
.join(df2, "uid")
.groupBy("hid", "pid")
.agg(count("*") as "xcnt", sum("sv") as "xsum", avg("sv") as "xavg")
.orderBy("hid").show
+----+----+----+----+-----+
| hid| pid|xcnt|xsum| xavg|
+----+----+----+----+-----+
|hid1|pid1| 2| 3| 1.5|
|hid1|pid2| 1| 2| 2.0|
|hid2|pid2| 2| 6| 3.0|
|hid2|pidx| 1| 999|999.0|
|hid2|pid1| 2| 4| 2.0|
|hid3|pid1| 1| 1| 1.0|
|hid3|pid2| 1| 2| 2.0|
+----+----+----+----+-----+
In this demo case, everything looks good.
But when I apply the same operations to the large production data, the final output contains many duplicate rows (with the same (hid, pid) pair).
I thought the groupBy operator would behave like select distinct hid, pid from ..., but obviously not.
So what's wrong with my operation? Should I repartition the dataframe by hid, pid?
Thanks!
-- Update
And if I add .drop("uid") once I join the dataframes, then some rows are missed from the final output.
scala> df1.drop("sv")
.join(df2, "uid").drop("uid")
.groupBy("hid", "pid")
.agg(count("*") as "xcnt", sum("sv") as "xsum", avg("sv") as "xavg")
.orderBy("hid").show
To be honest, I think there are problems with the data, not the code. Of course there shouldn't be any duplicates if the pid and hid values are truly distinct (I've seen rogue Cyrillic symbols in data before).
To debug this issue, you can try to see which combinations of uid and sv values make up each duplicate row.
df1.drop( "sv" )
.join(df2, "uid")
.groupBy( "hid", "pid" )
.agg( collect_list( "uid" ), collect_list( "sv" ) )
.orderBy( "hid" )
.show
After that you'll have a starting point for assessing your data. Or, if the lists of uid (and sv) are identical for the duplicated rows, file a bug.
I think I might have found the root cause.
Maybe this is caused by the AWS S3 consistency model.
The background: I submitted 2 Spark jobs to create 2 tables, and submitted a third job to join the two tables (I split them so that if one fails I don't need to re-run the others).
I put these 3 spark-submit calls in a shell script running in sequence, and got the result with duplicated rows.
When I re-ran the last job just now, the result looks good.
