I'm working with Spark 2.2.0.
I have a DataFrame holding more than 20 columns. In the example below, PERIOD is a week number and TYPE is the type of store (hypermarket or supermarket).
table.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE| etc......
+--------------------+-------------------+-----------------+
| W1| HM|
| W2| SM|
| W3| HM|
etc...
I want to do a simple groupby (here with PySpark, but Scala or Spark SQL gives the same results):
from pyspark.sql.functions import countDistinct
total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...| BORGO| 1|
| C ATHIS MONS| ATHIS MONS CEDEX| 1|
| CMA BOSC LE HARD| BOSC LE HARD| 1|
The problem is not the calculation itself: the columns have got mixed up. PERIOD now holds store names, TYPE holds cities, and so on. I have no clue why; everything else works fine.
Related
I've come across something strange recently in Spark. As far as I understand, given the column-based storage method of Spark DataFrames, the order of the columns really doesn't have any meaning; they're like keys in a dictionary.
During a df.union(df2), does the order of the columns matter? I would've assumed that it shouldn't, but according to the wisdom of the SQL forums it does.
So we have df1
df1
| a| b|
+---+----+
| 1| asd|
| 2|asda|
| 3| f1f|
+---+----+
df2
| b| a|
+----+---+
| asd| 1|
|asda| 2|
| f1f| 3|
+----+---+
result
| a| b|
+----+----+
| 1| asd|
| 2|asda|
| 3| f1f|
| asd| 1|
|asda| 2|
| f1f| 3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have been appended following the column order of the original DataFrames.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
Code to create the test set, if anyone wants to try it:
import pandas as pd
d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1 = pd.DataFrame(d1)
pdf2 = pd.DataFrame(d2)
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)
test = df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark 2.3 you can use unionByName to union two DataFrames, with the columns resolved by name.
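For example, a minimal sketch with the DataFrames from the question (shown in Scala; the PySpark call has the same name):
// Spark >= 2.3: columns are matched by name, not by position,
// so df2's (b, a) ordering no longer matters.
val test = df1.unionByName(df2)
test.show()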
In Spark, union is not resolved using column metadata, and the data is not shuffled around the way you might expect; rather, the union is done purely by column position. If you are unioning two DataFrames, both must have the same number of columns, so you have to take the positions of your columns into consideration before doing the union. Unlike Oracle or other RDBMSs, the data underlying a Spark DataFrame is just physical files. Hope that answers your question.
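For Spark versions before 2.3, the usual workaround is the one the question already hints at: reorder one side's columns explicitly so the positions line up before the union. A hedged Scala sketch:
import org.apache.spark.sql.functions.col

// Project df2 into df1's column order, then union by position as usual.
val test = df1.union(df2.select(df1.columns.map(col): _*))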
Given two DataFrames, which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
What you need is a full outer join, possibly renaming one of the columns; something like df1.join(df2.withColumnRenamed("length", "length2"), Seq("timestamp"), "full_outer")
See this example, built from yours (just less typing)
// data shaped like your example
case class t1(ts: Int, width: Int, l: Int)
case class t2(ts: Int, name: String, l: Int)
// create the data frames (spark.implicits._ is needed for toDF)
import spark.implicits._
val df1 = Seq(t1(1, 10, 20), t1(3, 5, 3)).toDF
val df2 = Seq(t2(0, "sample", 3), t2(2, "test", 6)).toDF
df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
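The answer stops at the joined table, so the following is only a hypothetical continuation, not part of the original: once the frames are joined and sorted, the null pattern tells you which source each row came from, which is enough to rebuild the question's typed objects in global timestamp order. It assumes the shortened column names used above (ts, width, l, name, l2; note that in this compressed example the df1 side's "width" column actually holds the question's length values) and case classes shaped like the question's object types.
case class ObjectType1(timestamp: Int, name: String, length: Int)
case class ObjectType2(timestamp: Int, length: Int, width: Int)

val joined = df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts")

val objs = joined.collect().toList.map { row =>
  if (row.isNullAt(row.fieldIndex("name")))
    // the df2 side is null, so this row came from df1
    ObjectType2(row.getInt(row.fieldIndex("ts")),
                row.getInt(row.fieldIndex("width")),
                row.getInt(row.fieldIndex("l")))
  else
    // this row came from df2
    ObjectType1(row.getInt(row.fieldIndex("ts")),
                row.getString(row.fieldIndex("name")),
                row.getInt(row.fieldIndex("l2")))
}
// objs: List(ObjectType1(0,sample,3), ObjectType2(1,10,20), ObjectType1(2,test,6), ObjectType2(3,5,3))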
I am new to Spark, sorry if this question seems too easy for you. I'm trying to come up with a Spark-like solution, but can't figure out the way to do it.
My Dataset looks like the following:
+----------------------+
|input |
+----------------------+
|debt ceiling |
|declaration of tax |
|decryption |
|sweats |
|ladder |
|definite integral |
I need to calculate the distribution of rows by length (number of words), e.g.:
1st option:
500 rows contain 1 or more words
120 rows contain 2 or more words
70 rows contain 3 or more words
2nd option:
300 rows contain 1 word
250 rows contain 2 words
220 rows contain 3 words
270 rows contain 4 or more words
Is there a possible solution using Java Spark functions?
All I can think of is writing some kind of UDF that would have a broadcast counter, but I'm likely missing something, since there should be a better way to do this in Spark.
Welcome to SO!
Here is a solution in Scala you can easily adapt to Java.
import org.apache.spark.sql.functions.{size, split}
import spark.implicits._

val df = spark.createDataset(Seq(
  "debt ceiling", "declaration of tax", "decryption", "sweats"
)).toDF("input")

df.select(size(split('input, "\\s+")).as("words"))
  .groupBy('words)
  .count
  .orderBy('words)
  .show
This produces
+-----+-----+
|words|count|
+-----+-----+
| 1| 2|
| 2| 1|
| 3| 1|
+-----+-----+
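This exact-count result matches the question's second option except for the "4 or more words" bucket. A hedged extension of the same approach (assuming the same df as above) that folds everything above a threshold into one bucket:
import org.apache.spark.sql.functions.{col, size, split, when}

df.select(size(split(col("input"), "\\s+")).as("words"))
  .withColumn("bucket", when(col("words") >= 4, "4+").otherwise(col("words").cast("string")))
  .groupBy("bucket")
  .count
  .orderBy("bucket")
  .show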
I'm analyzing Twitter files (in JSON format) with Spark SQL, with the goal of extracting the trending topics.
After taking all the text from a tweet and splitting it into words, my DataFrame looks like this:
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
I just need the words column, so I convert my table to a temp view.
df.createOrReplaceTempView("Twitter_test_2")
With the help of Spark SQL it should be very easy to get the trending topics; I just need a SQL query using the LIKE operator in the WHERE condition: words like '#%'.
spark.sql("select words,
count(words) as count
from words_Twitter
where words like '#%'
group by words
order by count desc limit 10").show(20,False)
but I'm getting some strange results that I can't find an explanation for.
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason the rows with counts 89 and 32, the two that contain Arabic characters, are not displayed where they should be; the text looks as if it has been swapped with the counter.
Other times I am confronted with this kind of format.
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After that query, my DataFrame looks very strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is this happening, and how can I avoid it?
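No answer is recorded here, but one hedged guess: the rows themselves are probably fine, and what looks broken is the console rendering. Words containing embedded newlines (such as "#rt\n2") break the ASCII table, and right-to-left Arabic text makes the text and count columns merely appear swapped. A sketch (assuming the DataFrame is df with a words column) that extracts a clean, whitespace-free hashtag token before grouping:
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

// Keep only a contiguous hashtag token, dropping rows without one.
val trending = df
  .select(regexp_extract($"words", "#\\S+", 0).as("tag"))
  .where($"tag" =!= "")
  .groupBy("tag")
  .count()
  .orderBy($"count".desc)
trending.show(10, false)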
I am trying to compare two DataFrames with the same schema (in Spark 1.6.0, using Scala) to determine which rows in the newer table have been added (i.e. are not present in the older table).
I need to do this by ID (i.e. examining a single column, not the whole row, to see what is new). Some rows may have changed between the versions, in that they have the same id in both versions, but the other columns have changed - I do not want these in the output, so I cannot simply subtract the two versions.
Based on various suggestions, I am doing a left-outer join on the chosen ID column, then selecting rows with nulls in columns from the right side of the join (indicating that they were not present in the older version of the table):
def diffBy(field: String, newer: DataFrame, older: DataFrame): DataFrame = {
  newer.join(older, newer(field) === older(field), "left_outer")
    .where(older(field).isNull)
  // TODO just select the leftmost columns, removing the nulls
}
However, this does not work (row 3 exists only in the newer version, so it should be output):
scala> newer.show
+---+-------+
| id| value|
+---+-------+
| 3| three|
| 2|two-new|
+---+-------+
scala> older.show
+---+-------+
| id| value|
+---+-------+
| 1| one|
| 2|two-old|
+---+-------+
scala> diffBy("id", newer, older).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
The join is working as expected:
scala> val joined = newer.join(older, newer("id") === older("id"), "left_outer")
scala> joined.show
+---+-------+----+-------+
| id| value| id| value|
+---+-------+----+-------+
| 2|two-new| 2|two-old|
| 3| three|null| null|
+---+-------+----+-------+
So the problem is in the selection of the column for filtering.
joined.where(older("id").isNull).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
Perhaps it is due to the duplicate id column names in the join? But if I use the value column (which is also duplicated) instead to detect nulls, it works as expected:
joined.where(older("value").isNull).show
+---+-----+----+-----+
| id|value| id|value|
+---+-----+----+-----+
| 3|three|null| null|
+---+-----+----+-----+
What is going on here - and why is the behaviour different for id and value?
You can solve the problem using a special Spark join type called "leftanti". It is similar to MINUS in Oracle SQL.
val joined = newer.join(older, newer("id") === older("id"), "leftanti")
This will only select columns from newer.
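With the example frames from the question, this should (as far as I can tell) leave exactly the one row whose id exists only in newer:
joined.show()
// +---+-----+
// | id|value|
// +---+-----+
// |  3|three|
// +---+-----+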
I have found a solution to my problem, though not an explanation for why it occurs.
It seems to be necessary to create an alias in order to refer unambiguously to the rightmost id column, and then use a textual WHERE clause so that I can substitute in the qualified column name from the variable field:
newer.join(older.as("o"), newer(field) === older(field), "left_outer")
.where(s"o.$field IS NULL")