I'm analyzing Twitter files, in JSON format, with Spark SQL, with the goal of extracting the trending topics.
After taking all the text from each tweet and splitting it into words, my DataFrame looks like this:
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
I only need the words column, so I convert my DataFrame to a temp view:
df.createOrReplaceTempView("Twitter_test_2")
With the help of Spark SQL it should be very easy to get the trending topics; I just need a query in SQL using the LIKE operator in the WHERE condition: words like '#%'.
spark.sql("select words,
count(words) as count
from words_Twitter
where words like '#%'
group by words
order by count desc limit 10").show(20,False)
but I'm getting some strange results that I can't find an explanation for:
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason the rows with counts 89 and 32, the two that contain Arabic characters, don't appear the way they should: the text and the counter look swapped.
At other times I'm confronted with this kind of format:
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After running that query on my DataFrame, the output looks very strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is this happening, and how can I avoid it?
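One possible reading of the two oddities above (an assumption based only on the printed output): the Arabic hashtags are right-to-left text, so the console renders the count visually on the other side of the cell even though the stored values are correct, and the tokens that contain line breaks suggest the tweets were split only on spaces rather than on all whitespace. A minimal PySpark sketch, assuming the raw tweet text sits in a DataFrame tweets with a text column, that tokenizes on any whitespace before counting:
from pyspark.sql import functions as F

# Split on any whitespace (spaces, tabs, newlines) and keep only hashtags.
words = (tweets
    .select(F.explode(F.split(F.col("text"), r"\s+")).alias("words"))
    .where(F.col("words").startswith("#")))

(words.groupBy("words")
      .count()
      .orderBy(F.col("count").desc())
      .show(10, truncate=False))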
I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns including a type, date and value column.
Example DataFrame:
+----+----------+-----+
|type|      date|value|
+----+----------+-----+
|   1|2020-01-21|    6|
|   1|2020-01-16|    5|
|   2|2020-01-20|    8|
|   2|2020-01-15|    4|
+----+----------+-----+
I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use window functions, but using windows on non-timestamp columns is not supported. Another option would look like
stream.groupby('type').agg(last('date'), last('value')).writeStream
but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregations is also not supported in structured streaming.
Do you have any suggestions on how to approach this challenge?
Simply use the to_timestamp() function, which can be imported with from pyspark.sql.functions import *, on the date column so that you can use the window function.
e.g.
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])
df.printSchema()

# cast the string date to a proper timestamp so window() can be applied to it
df.withColumn("timestamp", to_timestamp("input_timestamp")).show(truncate=False)
+---+---------------+-------------------+
|id |input_timestamp|timestamp |
+---+---------------+-------------------+
|1 |2020-01-21 |2020-01-21 00:00:00|
+---+---------------+-------------------+
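Building on that, a rough sketch of what the streaming aggregation could then look like; the column names follow the question, while the watermark and window durations are placeholder assumptions:
from pyspark.sql import functions as F

# Once 'date' is a real timestamp, a time window can be used in the streaming groupBy.
# max over a struct keeps the value that belongs to the newest timestamp (a common trick,
# not something the example above shows).
latest = (stream
    .withColumn("ts", F.to_timestamp("date"))
    .withWatermark("ts", "1 day")
    .groupBy("type", F.window("ts", "1 day"))
    .agg(F.max(F.struct("ts", "value")).alias("latest")))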
"but using windows on non-timestamp columns is not supported"
are you saying this from stream point of view, because same i am able to do.
Here is the solution to your problem.
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-16| 5| 1|
| 1|2020-01-21| 6| 2|
| 2|2020-01-15| 4| 1|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
import pyspark.sql.functions as F

w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-21| 6| 2|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
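A side note, not part of the answer above: the same "latest row per type" can be obtained in a single pass by numbering the rows from newest to oldest. Like the rank approach, this only works on batch data, since window functions are not supported on streaming DataFrames:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number rows per type from newest date to oldest and keep only the first one.
w = Window.partitionBy("type").orderBy(F.col("date").desc())
latest = (df.withColumn("rn", F.row_number().over(w))
            .where(F.col("rn") == 1)
            .drop("rn"))
latest.show()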
I want to update the values of a row (having index numberInt) of a given Dataset dFIdx using the values of another row from another Dataset dFInitIdx (the row of the second Dataset having a different index j). I tried the following in Java:
for (String colName : dFInitIdx.columns())
dFIdx = dFIdx.where(col("id").equalTo(numberInt)).withColumn(colName,dFInitIdx.where(col("id").equalTo(j)).col(colName));
But I am getting this error:
Attribute(s) with the same name appear in the operation: id. Please
check if the right attribute(s) are used
How can I achieve this update of one row in Java (preferably a one-liner)?
Thanks
Since both of your Datasets seem to have the same columns, you can use a join() method to merge them together based on your numberInt and j conditions, in order to select() (at least) the id column value of the first Dataset dFIdx and all the other columns from the second Dataset dFInitIdx.
dFIdx data sample:
+---+--------+---------+
| id|hundreds|thousands|
+---+--------+---------+
| 1| 100| 1000|
| 2| 200| 2000|
| 3| 300| 3000|
+---+--------+---------+
dFInitIdx data sample:
+---+--------+---------+
| id|hundreds|thousands|
+---+--------+---------+
| 1| 101| 1001|
| 2| 201| 2001|
| 3| 301| 3001|
+---+--------+---------+
Let's say that for the given data samples numberInt and j are (hardcoded) set as:
numberInt == 1
j == 2
The solution will look like this:
dFIdx.join(dFInitIdx, dFIdx.col("id").equalTo(numberInt).and(dFInitIdx.col("id").equalTo(j)))
     .select(dFIdx.col("id"), dFInitIdx.col("hundreds"), dFInitIdx.col("thousands"))
And we can see the result of the query with show() as seen below:
+---+--------+---------+
| id|hundreds|thousands|
+---+--------+---------+
| 1| 201| 2001|
+---+--------+---------+
Suppose we have a table A and we do a left join with a large table B (to fetch field colB).
The output is then left joined with a large table C (to fetch field colC), and finally we left join this with a table D (to fetch field colD).
So the 3 left joins above help to create a final dataset that is shared by multiple consumers.
As a consumer of this code, I do a select of colA and colD from the final dataset (I don't need colB and colC).
Is there a feature which will skip the 2 joins with B and C (since colB and colC are not required downstream in my case)?
FYI:
I don't want to change the implementation (i.e. the 3 joins), since this method is used by multiple teams.
I don't want to create my own implementation (to avoid code duplication and to stay up to date with the logic that is used across the teams).
PS for clarity:
B,C,D are huge dim tables
A is a fact table (relatively smaller than B,C,D)
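For concreteness, here is a sketch of how I read the setup; the join key id and the DataFrame names A, B, C, D are assumptions for illustration:
# Shared implementation maintained across teams: three left joins on an assumed key 'id'.
final_df = (A.join(B.select("id", "colB"), on="id", how="left")
             .join(C.select("id", "colC"), on="id", how="left")
             .join(D.select("id", "colD"), on="id", how="left"))

# My consumer code only needs colA and colD.
result = final_df.select("id", "colA", "colD")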
I do not think that this is possible without changing the original code. The reason is that even if the final result does not contain columns from tables B and C, the result might still depend on which tables were part of the join chain.
An example: let's assume we have this data and we want to join the four tables on the id column.
Table A Table B Table C Table D
+---+----+ +---+----+ +---+----+ +---+----+
| id|colA| | id|colB| | id|colC| | id|colD|
+---+----+ +---+----+ +---+----+ +---+----+
| 1| A1| | 1| B1| | 1| C1| | 1| D1|
| 2| A2| | 2| B2| | 2| C2| | 2| D2|
+---+----+ +---+----+ | 2| C2b| +---+----+
+---+----+
The important point to note is that the table C contains a duplicate value in the join column.
If the four tables are joined with left joins and the columns colA and colD are selected, the result would be
+---+----+----+----+----+ +---+----+----+
| id|colA|colB|colC|colD| | id|colA|colD|
+---+----+----+----+----+ +---+----+----+
| 1| A1| B1| C1| D1| ==> | 1| A1| D1|
| 2| A2| B2| C2b| D2| | 2| A2| D2|
| 2| A2| B2| C2| D2| | 2| A2| D2|
+---+----+----+----+----+ +---+----+----+
On the other hand, if only the tables A and D are joined directly without tables B and C, the result would be
+---+----+----+
| id|colA|colD|
+---+----+----+
| 1| A1| D1|
| 2| A2| D2|
+---+----+----+
So even if the final result contains no columns from tables B and C, the result is different if you join A->D or A->B->C->D. So the Spark code cannot skip the joins of the tables B and C.
The good news: if you go the way A->B->C->D and exclude the columns from tables B and C, Spark will only process the join column(s) of tables B and C and skip (for example during a shuffle) all other columns. So at least the amount of data that is processed will be lower when you do not select columns from tables B and C.
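One way to see this pruning in action (a sketch under the same assumed names as in the illustration above) is to compare the physical plans: with only colA and colD selected, the scans of B and C should project just the join key.
# Pruned plan: B and C contribute only the join key to the shuffles.
final_df.select("id", "colA", "colD").explain()

# Full plan: colB and colC are carried through every join.
final_df.explain()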
I have simple data as:
+--------------------+-----------------+-----+
| timebucket_start| user| hits|
+--------------------+-----------------+-----+
|[2017-12-30 01:02...| Messi| 2|
|[2017-12-30 01:28...| Jordan| 9|
|[2017-12-30 11:12...| Jordan| 462|
+--------------------+-----------------+-----+
I am trying to pivot it so that I get the counts of each user for each of the time buckets.
So, my query in PySpark (using DataFrames) is:
user_time_matrix = df.groupBy('timebucket_start').pivot("user").sum('hits')
Now, this query just keeps running forever. I tried it on a scaled cluster too, doubling my cluster size, but ran into the same issue.
Is the query wrong? Can it be optimized? Why can't Spark finish it?
It's the same thing, but you can try:
import pyspark.sql.functions as F
user_time_matrix = df.groupBy('timebucket_start').pivot("user").agg(F.sum('hits'))
Let me know if there is any error or an infinite loop. Also, when you use this code, the users will become the columns:
Input :
+----+----------+------+
|hits| time| user|
+----+----------+------+
| 2|2017-12-30| Messi|
| 3|2017-12-30|Jordan|
| 462|2017-12-30|Jordan|
| 2|2017-12-31| Messi|
| 2|2017-12-31| Messi|
+----+----------+------+
Output:
+----------+------+-----+
| time|Jordan|Messi|
+----------+------+-----+
|2017-12-31| null| 4|
|2017-12-30| 465| 2|
+----------+------+-----+
I would recommend:
user_time_matrix = df.groupBy('timebucket_start', 'user').sum('hits')
Output :
+----------+------+---------+
| time| user|sum(hits)|
+----------+------+---------+
|2017-12-31| Messi| 4|
|2017-12-30|Jordan| 465|
|2017-12-30| Messi| 2|
+----------+------+---------+
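If the pivot itself is what never finishes, one tweak worth trying (a sketch, assuming the set of users is small enough to collect or to hard-code) is to pass the pivot values explicitly, so Spark can skip the extra job that computes the distinct users before pivoting:
import pyspark.sql.functions as F

# Collect the distinct users once, then pivot with an explicit value list.
users = [r["user"] for r in df.select("user").distinct().collect()]
user_time_matrix = (df.groupBy("timebucket_start")
                      .pivot("user", users)
                      .agg(F.sum("hits")))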
I'm working with Spark 2.2.0.
I have a DataFrame holding more than 20 columns. In the example below, PERIOD is a week number and TYPE is a type of store (Hypermarket or Supermarket).
table.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE| etc......
+--------------------+-------------------+-----------------+
| W1| HM|
| W2| SM|
| W3| HM|
etc...
I want to do a simple groupBy (here with PySpark, but Scala or Spark SQL give the same results):
from pyspark.sql.functions import countDistinct

total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...| BORGO| 1|
| C ATHIS MONS| ATHIS MONS CEDEX| 1|
| CMA BOSC LE HARD| BOSC LE HARD| 1|
The problem is not in the calculation: the columns got mixed up. PERIOD contains store names, TYPE contains city names, etc.
I have no clue why. Everything else works fine.
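As a side check (a sketch, not an explanation of the mix-up), the rename can also be done inside the aggregation with alias, which keeps the output column names explicit in one place:
from pyspark.sql.functions import countDistinct

# Name the aggregated column directly instead of renaming it afterwards.
total_stores2 = (table.groupby("PERIOD", "TYPE")
                      .agg(countDistinct("STORE_DESC").alias("NB STORES (TOTAL)")))
total_stores2.show(10)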