Distribution of text by word count in Spark Java - apache-spark

I am new to Spark, sorry if this question seem to easy for you. I'm trying to come up with the Spark-like solution, but can't figure out the way to do it.
My DataSet looks like following:
+----------------------+
|input |
+----------------------+
|debt ceiling |
|declaration of tax |
|decryption |
|sweats |
|ladder |
|definite integral |
I need to calculate distribution of Rows by length, e.g:
1st option:
500 rows contain 1 and more words
120 rows contain 2 and more words
70 rows contain 2 and more words
2nd option:
300 rows contain 1 word
250 rows contain 2 words
220 rows contain 3 words
270 rows contain 4 and more words
Is there a possible solution using Java Spark functions?
All I can think of, is writing some kind of UDF, that would have a broadcasted counter, but I'm likely missing something, since there should be a better way to do this in spark.

Welcome to SO!
Here is a solution in Scala you can easily adapt to Java.
val df = spark.createDataset(Seq(
"debt ceiling", "declaration of tax", "decryption", "sweats"
)).toDF("input")
df.select(size(split('input, "\\s+")).as("words"))
.groupBy('words)
.count
.orderBy('words)
.show
This produces
+-----+-----+
|words|count|
+-----+-----+
| 1| 2|
| 2| 1|
| 3| 1|
+-----+-----+

Related

Spark union column order

I've come across something strange recently in Spark. As far as I understand, given the column based storage method of spark dfs, the order of the columns really don't have any meaning, they're like keys in a dictionary.
During a df.union(df2), does the order of the columns matter? I would've assumed that it shouldn't, but according to the wisdom of sql forums it does.
So we have df1
df1
| a| b|
+---+----+
| 1| asd|
| 2|asda|
| 3| f1f|
+---+----+
df2
| b| a|
+----+---+
| asd| 1|
|asda| 2|
| f1f| 3|
+----+---+
result
| a| b|
+----+----+
| 1| asd|
| 2|asda|
| 3| f1f|
| asd| 1|
|asda| 2|
| f1f| 3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have joined following the order of their original dataframes.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
code to create test set if anyone wants to try
d1={'a':[1,2,3], 'b':['asd','asda','f1f']}
d2={ 'b':['asd','asda','f1f'], 'a':[1,2,3],}
pdf1=pd.DataFrame(d1)
pdf2=pd.DataFrame(d2)
df1=spark.createDataFrame(pdf1)
df2=spark.createDataFrame(pdf2)
test=df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does >deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark >= 2.3 you can use unionByName to union two dataframes were the column names get resolved.
in spark Union is not done on metadata of columns and data is not shuffled like you would think it would. rather union is done on the column numbers as in, if you are unioning 2 Df's both must have the same numbers of columns..you will have to take in consideration of positions of your columns previous to doing union. unlike SQL or Oracle or other RDBMS, underlying files in spark are physical files. hope that answers your question

How to combine and sort different dataframes into one?

Given two dataframes, which may have completely different schemas, except for a index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
what you need is a full outer join, possibly renaming one of the columns, something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"),"full_outer")
See this example, built from yours (just less typing)
// data shaped as your example
case class t1(ts:Int, width:Int,l:Int)
case class t2(ts:Int, width:Int,l:Int)
// create data frames
val df1 = Seq(t1(1,10,20),t1(3,5,3)).toDF
val df2 = Seq(t2(0,"sample",3),t2(2,"test",6)).toDF
df1.join(df2.withColumnRenamed("l","l2"),Seq("ts"),"full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+

Spark SQL , doesn´t respect the Dataframe format

I m analyzing Twitter Files with the scope to take the trending topic, in json format with Spark SQL
After to take all the text form a Tweet and split the words, my dataFrame look like this
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
I just need the column words, I coconvert my table to a temView.
df.createOrReplaceTempView("Twitter_test_2")
With the help of spark sql should be very easy to take the trending topic, I just need a query in sql using in the where condition operator "Like". words like "#%"
spark.sql("select words,
count(words) as count
from words_Twitter
where words like '#%'
group by words
order by count desc limit 10").show(20,False)
but I m getting some strange results that I can't find an explanation for them.
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
for some reason the #89 and the #32 the twoo thar have arab characteres are no where they should been. The text had been exchanged with the counter.
others times I am confrontig tha kind of format.
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After that Query to my dataframe, it look like so strange
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is happening that, and how can avoid it ?

Spark (or pyspark) columns content shuffle with GroupBy

I'm working with Spark 2.2.0.
I have a DataFrame holding more than 20 columns. In the below example, PERIOD is a week number and type a type of store (Hypermarket or Supermarket)
table.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE| etc......
+--------------------+-------------------+-----------------+
| W1| HM|
| W2| SM|
| W3| HM|
etc...
I want to do a simple groupby (here with pyspark, but Scala or pyspark-sql give the same results)
total_stores = table.groupby("PERIOD", "TYPE").agg(countDistinct("STORE_DESC"))
total_stores2 = total_stores.withColumnRenamed("count(DISTINCT STORE_DESC)", "NB STORES (TOTAL)")
total_stores2.show(10)
+--------------------+-------------------+-----------------+
| PERIOD| TYPE|NB STORES (TOTAL)|
+--------------------+-------------------+-----------------+
|CMA BORGO -SANTA ...| BORGO| 1|
| C ATHIS MONS| ATHIS MONS CEDEX| 1|
| CMA BOSC LE HARD| BOSC LE HARD| 1|
The problem is not in the calculation: the columns got mixed up: PERIOD has STORE NAMES, TYPE has CITY, etc....
I have no clue why. Everything else works fine.

how to index categorical features in another way when using spark ml

The VectorIndexer in spark indexes categorical features according to the frequency of variables. But I want to index the categorical features in a different way.
For example, with a dataset as below, "a","b","c" will be indexed as 0,1,2 if I use the VectorIndexer in spark. But I want to index them according to the label.
There are 4 rows data which are indexed as 1, and among them 3 rows have feature 'a',1 row feautre 'c'. So here I will index 'a' as 0,'c' as 1 and 'b' as 2.
Is there any convienient way to implement this?
label|feature
-----------------
1 | a
1 | c
0 | a
0 | b
1 | a
0 | b
0 | b
0 | c
1 | a
If I understand your question correctly, you are looking to replicate the behaviour of StringIndexer() on grouped data. To do so (in pySpark), we first define an udf that will operate on a List column containing all the values per group. Note that elements with equal counts will be ordered arbitrarily.
from collections import Counter
from pyspark.sql.types import ArrayType, IntegerType
def encoder(col):
# Generate count per letter
x = Counter(col)
# Create a dictionary, mapping each letter to its rank
ranking = {pair[0]: rank
for rank, pair in enumerate(x.most_common())}
# Use dictionary to replace letters by rank
new_list = [ranking[i] for i in col]
return(new_list)
encoder_udf = udf(encoder, ArrayType(IntegerType()))
Now we can aggregate the feature column into a list grouped by the column label using collect_list() , and apply our udf rowwise:
from pyspark.sql.functions import collect_list, explode
df1 = (df.groupBy("label")
.agg(collect_list("feature")
.alias("features"))
.withColumn("index",
encoder_udf("features")))
Consequently, you can explode the index column to get the encoded values instead of the letters:
df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
| 0| 2|
| 1| 0|
| 1| 1|
| 1| 0|
| 1| 0|
+-----+-----+

Resources