Suppose we have a table A and we left join it with a large table B (to fetch field colB).
The output is then left joined with a large table C (to fetch field colC), and finally we left join that with a table D (to fetch field colD).
These 3 left joins create a final dataset that is shared by multiple consumers.
As a consumer of this code, I select A's columns and colD from the final dataset (I don't need colB and colC).
Is there a feature that will skip the 2 joins with B and C (since colB and colC are not required downstream in my case)?
FYI:
I don't want to change the implementation (i.e. the 3 joins), since this
method is used by multiple teams.
I don't want to create my own implementation (to avoid code duplication
and to stay up to date with the logic that is used across the teams).
PS, for clarity:
B, C and D are huge dim tables.
A is a fact table (relatively smaller than B, C and D).
I do not think that this is possible without changing the original code. The reason is that even if the final result does not contain columns from tables B and C, the result might still depend on which tables were part of the join chain.
An example: let's assume we have this data and we want to join the four tables on the id column.
Table A        Table B        Table C        Table D
+---+----+     +---+----+     +---+----+     +---+----+
| id|colA|     | id|colB|     | id|colC|     | id|colD|
+---+----+     +---+----+     +---+----+     +---+----+
|  1|  A1|     |  1|  B1|     |  1|  C1|     |  1|  D1|
|  2|  A2|     |  2|  B2|     |  2|  C2|     |  2|  D2|
+---+----+     +---+----+     |  2| C2b|     +---+----+
                              +---+----+
The important point to note is that the table C contains a duplicate value in the join column.
If the four tables are joined with left joins and only the columns id, colA and colD are selected, the result would be
+---+----+----+----+----+          +---+----+----+
| id|colA|colB|colC|colD|          | id|colA|colD|
+---+----+----+----+----+          +---+----+----+
|  1|  A1|  B1|  C1|  D1|   ==>    |  1|  A1|  D1|
|  2|  A2|  B2| C2b|  D2|          |  2|  A2|  D2|
|  2|  A2|  B2|  C2|  D2|          |  2|  A2|  D2|
+---+----+----+----+----+          +---+----+----+
On the other hand, if only the tables A and D are joined directly without tables B and C, the result would be
+---+----+----+
| id|colA|colD|
+---+----+----+
|  1|  A1|  D1|
|  2|  A2|  D2|
+---+----+----+
So even if the final result contains no columns from tables B and C, the result differs depending on whether you join A->D or A->B->C->D. Spark therefore cannot skip the joins with tables B and C.
The good news: if you go the way A->B->C->D and exclude the columns from tables B and C, Spark will only process the join column(s) of tables B and C and skip all other columns (for example during a shuffle). So at least the amount of data that is processed will be lower when you do not select columns from tables B and C.
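The difference in row counts described above can be reproduced with a small plain-Python model of a left join. This is only a sketch of the semantics, not Spark code; the left_join and project helpers and the row dictionaries are made up for illustration.

```python
# Plain-Python model of a left join on "id"; it mimics the semantics only.
def left_join(left, right, key="id"):
    """Emit one row per matching right row, or the left row alone if nothing matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if not matches:
            out.append(dict(l))
        for r in matches:
            out.append({**l, **r})
    return out

A = [{"id": 1, "colA": "A1"}, {"id": 2, "colA": "A2"}]
B = [{"id": 1, "colB": "B1"}, {"id": 2, "colB": "B2"}]
# Table C has a duplicate value (2) in the join column.
C = [{"id": 1, "colC": "C1"}, {"id": 2, "colC": "C2"}, {"id": 2, "colC": "C2b"}]
D = [{"id": 1, "colD": "D1"}, {"id": 2, "colD": "D2"}]

def project(rows):
    """Keep only id, colA and colD, like the consumer's select."""
    return [(r["id"], r["colA"], r["colD"]) for r in rows]

chained = project(left_join(left_join(left_join(A, B), C), D))
direct = project(left_join(A, D))

print(len(chained))  # 3 rows: id 2 appears twice because of the duplicate in C
print(len(direct))   # 2 rows
```

Because the two results differ, an optimizer that silently dropped the joins with B and C would change the output.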
I have a DataFrame that consists of 5 columns. I need to add a new column at the 3rd position. How can I achieve this in Spark?
df.show()
+---------+--------+---+----------+--------+
|last_name|position|age|salary_inc| segment|
+---------+--------+---+----------+--------+
|   george|      IT| 10|      2313|     one|
|     jhon|  non-it| 21|     34344|    null|
|     mark|      IT| 11|     16161|   third|
|  spencer|      it| 31|      2322|    null|
|  spencer|  non-it| 41|      2322|Valuable|
+---------+--------+---+----------+--------+
Add new_column at position 3
+---------+--------+-----------+---+----------+--------+
|last_name|position|new_column |age|salary_inc| segment|
+---------+--------+-----------+---+----------+--------+
Can you please help me with this?
(
    df.withColumn("new_column", ...)
      .select("last_name",
              "position",
              "new_column",
              ...)
      .show()
)
The first ellipsis stands for the expression that creates your new column called "new_column"; for example, lit(1) would give you a literal (constant) 1 of type IntegerType. The second ellipsis stands for the remaining columns, in the order in which you wish to select them.
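If the DataFrame has many columns, the select list does not have to be typed by hand. A minimal sketch, assuming a helper of our own named insert_at (it is not part of Spark) that builds the column order from a list such as df.columns:

```python
def insert_at(columns, name, position):
    """Return a new column list with `name` placed at the 1-based `position`."""
    # Drop the name first so the helper also works if it is already in the list.
    others = [c for c in columns if c != name]
    return others[:position - 1] + [name] + others[position - 1:]

# Column names as they appear in the question's df.show() output.
columns = ["last_name", "position", "age", "salary_inc", "segment"]
ordered = insert_at(columns, "new_column", 3)
print(ordered)
```

The resulting list could then be unpacked into the select call from the snippet above, e.g. .select(*ordered), instead of listing every column name.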
I'm analyzing Twitter files in JSON format with Spark SQL, with the goal of finding the trending topics.
After taking all the text from each tweet and splitting it into words, my DataFrame looks like this:
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
I just need the words column, so I convert my table to a temp view.
df.createOrReplaceTempView("Twitter_test_2")
With Spark SQL it should be very easy to get the trending topics: I just need a SQL query that uses the LIKE operator in the WHERE clause, i.e. words like '#%'.
spark.sql("""select words,
                    count(words) as cnt
             from Twitter_test_2
             where words like '#%'
             group by words
             order by cnt desc limit 10""").show(20, False)
But I'm getting some strange results that I can't find an explanation for.
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason, the rows with counts 89 and 32, the two that contain Arabic characters, are not displayed the way they should be: the text has been swapped with the count.
Other times I am confronted with this kind of format.
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After running that query, the output looks very strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is this happening, and how can I avoid it?
GraphFrames has a nice example of stateful motifs.
How can I explicitly return the counts? As you can see, the output only contains vertices and friends but not the counts.
How can I modify it so that I have access not (only) to the edges but to the labels of the vertices as well?
when(relationship === "friend", cnt + 1).otherwise(cnt)
I.e. how could I enhance the count to compute
the friends of each vertex with age > 30
the percentage friendsGreater30 / allFriends
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
import org.graphframes.examples

val g = examples.Graphs.friends // get example graph
// Find chains of 4 vertices.
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
// Query on sequence, with state (cnt)
// (a) Define method for updating state given the next element of the motif.
def sumFriends(cnt: Column, relationship: Column): Column = {
  when(relationship === "friend", cnt + 1).otherwise(cnt)
}
// (b) Use sequence operation to apply method to sequence of elements in motif.
// In this case, the elements are the 3 edges.
val condition = Seq("ab", "bc", "cd").
  foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
// (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)
(Example taken from http://graphframes.github.io/user-guide.html)
chainWith2Friends2.show()
Which will output
+-------------+------------+-------------+------------+-------------+------------+--------------+
| a| ab| b| bc| c| cd| d|
+-------------+------------+-------------+------------+-------------+------------+--------------+
|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]| [e,Esther,32]|
|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]| [b,Bob,36]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,f,follow]| [f,Fanny,36]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]| [b,Bob,36]|[b,c,follow]|[c,Charlie,30]|
| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|
+-------------+------------+-------------+------------+-------------+------------+--------------+
Note that sumFriends returns a Column, so condition is a column. This is why you can access it in a where statement without quotes. So all you have to do is add that column to your dataframe. After running the above code, I can run
chain4.withColumn("condition",condition).select("condition").show
+---------+
|condition|
+---------+
| 1|
| 0|
| 0|
| 0|
| 0|
| 3|
| 3|
| 3|
| 2|
| 2|
| 3|
| 1|
+---------+
You could also use chain4.select(condition).
Hope this helps.
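To see why condition holds one count per row, the fold can be modeled in plain Python with functools.reduce. This only illustrates the logic and is not GraphFrames code; the row dictionary is a hand-written stand-in for the David -> Alice -> Esther -> Fanny row of the chainWith2Friends2 output.

```python
from functools import reduce

def sum_friends(cnt, relationship):
    """Plain-Python counterpart of sumFriends: add 1 for a "friend" edge."""
    return cnt + 1 if relationship == "friend" else cnt

# One motif row with its three edges (src, dst, relationship).
row = {
    "ab": {"src": "d", "dst": "a", "relationship": "friend"},
    "bc": {"src": "a", "dst": "e", "relationship": "friend"},
    "cd": {"src": "e", "dst": "f", "relationship": "follow"},
}

# Same shape as Seq("ab", "bc", "cd").foldLeft(lit(0))(...): start at 0, visit each edge.
count = reduce(
    lambda cnt, e: sum_friends(cnt, row[e]["relationship"]),
    ["ab", "bc", "cd"],
    0,
)
print(count)  # 2
```

In the Scala version the fold builds a single Column expression instead of a number, which is why it can be passed to where, withColumn or select.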
I have a Pyspark Dataframe with this structure:
+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+
I originally had two dataframes, but I outer joined them using user as the key, so there could also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+----+----+
|user| A/B|   C|
+----+----+----+
|   1|   1|   3|
|   2|   4|   2|
+----+----+----+
Also note that there could be many equal columns, so selecting literally each column is not an option. In pandas this was possible using "user" as Index and then adding both dataframes. How can I do this on Spark?
I have a workaround for this:
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)
Now do the join; the output will then contain the values under different names.
Then build the list of column pairs to be combined:
val newlist = df1.columns.filterNot(_.equals("user")).zip(dataFrameOneColumns.filterNot(_.equals("user")))
Then combine the values of the columns within each tuple to get the desired output!
PS: I am guessing you can write the combining logic yourself, so I am not spoon-feeding it!
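For reference, the pairing and combining steps can be sketched in plain Python on already-joined rows. This models the logic only and is not Spark code. Treating null (None) as 0 is an assumption based on the outer join mentioned in the question, and the row for user 3 is a hypothetical addition to show that case.

```python
columns = ["user", "A/B", "C"]

# Pair every non-key column with its renamed counterpart, e.g. ("C", "C_1").
pairs = [(c, c + "_1") for c in columns if c != "user"]

def combine(row, pairs):
    """Sum each column pair of a joined row, treating None (null) as 0."""
    out = {"user": row["user"]}
    for left, right in pairs:
        out[left] = (row.get(left) or 0) + (row.get(right) or 0)
    return out

joined = [
    {"user": 1, "A/B": 0, "C": 1, "A/B_1": 1, "C_1": 2},
    {"user": 2, "A/B": 0, "C": 2, "A/B_1": 4, "C_1": 0},
    # Hypothetical user present in only one dataframe (nulls after the outer join).
    {"user": 3, "A/B": None, "C": None, "A/B_1": 5, "C_1": 1},
]
result = [combine(r, pairs) for r in joined]
print(result)
```

In Spark the same idea would typically be written per pair as coalesce(col(a), lit(0)) + coalesce(col(b), lit(0)), so that a null on one side does not turn the whole sum into null.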