I have two PySpark dataframes which, after some manipulation, consist of one column each, but they are different lengths. Dataframe 1 holds one ingredient name per row; dataframe 2 contains rows of long strings of ingredients.
DATAFRAME 1:
ingcomb.show(10,truncate=False)
+---------------------------------+
|products |
+---------------------------------+
|rebel crunch granola |
|creamed honey |
|mild cheddar with onions & chives|
|berry medley |
|sweet relish made with sea salt |
|spanish peanuts |
|stir fry seasoning mix |
|swiss all natural cheese |
|yellow corn meal |
|shredded wheat |
+---------------------------------+
only showing top 10 rows
DATAFRAME 2:
reging.show(10, truncate=30)
+------------------------------+
| ingredients|
+------------------------------+
|apple bean cookie fruit kid...|
|bake bastille day bon appét...|
|dairy fennel gourmet new yo...|
|bon appétit dairy free dinn...|
|bake bon appétit california...|
|bacon basil bon appétit foo...|
|asparagus boil bon appétit ...|
|cocktail party egg fruit go...|
|beef ginger gourmet quick &...|
|dairy free gourmet ham lunc...|
+------------------------------+
only showing top 10 rows
I need to create a loop (any other suggestions are welcome too!) that goes through dataframe 1, compares each value to the dataframe 2 strings via "like", and gives me the total count of matches.
Desired outcome:
+--------------------+-----+
|         ingredients|count|
+--------------------+-----+
|rebel crunch granola|  183|
|       creamed honey|   87|
|        berry medley|   67|
|     spanish peanuts|   10|
+--------------------+-----+
I know that the following code works:
reging.filter("ingredients like '%sugar%'").count()
and was trying to implement something like
for i in ingcomb:
    x = reging.select("ingredients").filter("ingredients like '%i%'").count()
But I cannot get PySpark to treat 'i' as a value from ingcomb rather than the literal character i.
I have tried the solutions from
Spark Compare two dataframe and find the match count
but unfortunately they do not work.
I am running this in GCP and get an error when I try to run toPandas; due to permissions I cannot install pandas.
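For what it's worth, one way to get the original loop working (a sketch, assuming the ingcomb.products and reging.ingredients columns shown above) is to collect the small ingredient list to the driver and interpolate each value with an f-string, so the filter sees the actual name rather than the literal character i:

# Collect the (small) list of ingredient names to the driver,
# then interpolate each name into the LIKE filter with an f-string.
results = []
for row in ingcomb.select("products").collect():
    ing = row["products"]
    # note: a name containing a single quote would break this SQL string
    cnt = reging.filter(f"ingredients like '%{ing}%'").count()
    results.append((ing, cnt))

Each count() launches a separate Spark job, so this gets slow as ingcomb grows; the join-based sketch after the workaround below avoids the loop entirely.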
We were actually able to do a workaround, where we get counts within the dataframe first and then match with a join later. Please feel free to give better suggestions; we are newbies to coding.
counts = (reging.select(f.explode(f.expr("array(Ingredients)")).alias('col'))
          .groupBy('col').count().orderBy("count", ascending=False))
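Since better suggestions were invited: another option (a sketch, again assuming the ingcomb.products and reging.ingredients columns above) is to let a single join do all of the matching, which avoids both the loop and toPandas:

from pyspark.sql import functions as f

# Cross join the small ingredient list against the ingredient strings,
# keep rows where the string contains the name, and count per name.
matches = (reging.crossJoin(f.broadcast(ingcomb))
           .where(f.col("ingredients").contains(f.col("products")))
           .groupBy("products")
           .count()
           .orderBy("count", ascending=False))
matches.show(truncate=False)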
Related
I have a dataframe with the below data and columns:
sales_df.select('sales', 'monthly_sales').show()
+----------------+--------------------------+
| sales | monthly_sales |
+----------------+--------------------------+
| mid| 50.0|
| low| 21.0|
| low| 25.0|
| high| 70.0|
| mid| 60.0|
| high| 75.0|
| high| 95.0|
|................|..........................|
|................|..........................|
|................|..........................|
| low| 25.0|
| low| 20.0|
+----------------+--------------------------+
I am trying to find the average of each sales type, so that my final dataframe has only three rows (one for each sales type), with the columns:
sale & average_sale
I used groupBy to achieve this.
sales_df.groupBy("sales").avg("monthly_sales").alias('average_sales').show()
and I was able to get the average sale as well.
+----------------+-------------------------------+
| sales | average sales |
+----------------+-------------------------------+
| mid| 5.568177828054298|
| high| 1.361184210526316|
| low| 3.014350758853288|
+----------------+-------------------------------+
This ran fast because my logic is running on test data with only 200 rows, so the code finished in no time. But I have huge data in my actual application, and then there is the problem of the data shuffle caused by groupBy.
Is there any better way to find the average without using groupBy?
Could anyone let me know an efficient way to achieve this, considering the huge data size?
groupBy is exactly what you're looking for. Spark is designed to handle big data (of any size, really), so what you should do is configure your Spark application properly (i.e. give it the right amount of memory, increase the number of cores, use more executors, improve parallelism, ...).
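For example, the resources can be set when the Spark session is built; a minimal sketch (the values here are placeholders to tune for your cluster, not recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sales-average")
         .config("spark.executor.instances", "4")        # number of executors
         .config("spark.executor.cores", "4")            # cores per executor
         .config("spark.executor.memory", "8g")          # memory per executor
         .config("spark.sql.shuffle.partitions", "200")  # partitions used by the groupBy shuffle
         .getOrCreate())

Also note that a groupBy on a low-cardinality key like sales shuffles very little data, because Spark pre-aggregates within each partition before the shuffle, so only a few partial sums and counts per partition cross the network.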
I have 2 dataframes - one is a data source dataframe and another is reference dataframe.
I want to create an additional column in df1 based on a comparison of those 2 dataframes.
df1 - data source
No | Name
213344 | Apple
242342 | Orange
234234 | Pineapple
df2 - reference table
RGE_FROM | RGE_TO | Value
2100 | 2190 | Sweet
2200 | 2322 | Bitter
2400 | 5000 | Neutral
Final result: if the first 4 characters of df1.No fall within the range df2.RGE_FROM to df2.RGE_TO, put df2.Value into the derived column df1.DESC; otherwise, leave it blank.
No | Name | DESC
213344 | Apple | Sweet
242342 | Orange    | Neutral
234234 | Pineapple |
Any help is appreciated!
Thank you!
We can create an IntervalIndex from the columns RGE_FROM and RGE_TO and set it as the index of the Value column to create a mapping series. Then slice the first four characters of the No column and, using Series.map, substitute the values from the mapping series.
i = pd.IntervalIndex.from_arrays(df2['RGE_FROM'], df2['RGE_TO'], closed='both')
df1['Value'] = df1['No'].astype(str).str[:4].astype(int).map(df2.set_index(i)['Value'])
No Name Value
0 213344 Apple Sweet
1 242342 Orange Neutral
2 234234 Pineapple NaN
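If the two dataframes are actually Spark dataframes rather than pandas, a hedged sketch of the same logic (assuming the df1/df2 names and columns shown in the question) is a left join on a range condition:

from pyspark.sql import functions as f

# First 4 characters of No, as an integer, compared against the reference ranges.
prefix = f.substring(f.col("No").cast("string"), 1, 4).cast("int")

result = (df1.join(f.broadcast(df2),                       # reference table is small, so broadcast it
                   prefix.between(f.col("RGE_FROM"), f.col("RGE_TO")),
                   how="left")                             # keep rows with no matching range
          .select("No", "Name",
                  f.coalesce(f.col("Value"), f.lit("")).alias("DESC")))
result.show()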
I have 2 different datasets that I would like to join, but there is no easy way to do it because they don't have a common column, and crossJoin is not a good solution when we work with big data. I already asked the question on Stack Overflow, but I really couldn't find an optimized solution to join them. My question on Stack Overflow is: looking if String contain a sub-string in differents Dataframes
I saw the solutions below, but I didn't find a good fit for my case.
Efficient string suffix detection
Efficient string suffix detection
Efficient string matching in Apache Spark
Today, I found a funny solution :) I'm not sure if it will work, but let's try.
I add a new column to df_1 containing the line numbers.
Example df_1:
name | id
----------------
abc | 1232
----------------
azerty | 87564
----------------
google | 374856
----------------
new df_1:
name | id | new_id
----------------------------
abc | 1232 | 1
----------------------------
azerty | 87564 | 2
----------------------------
google | 374856 | 3
----------------------------
explorer| 84763 | 4
----------------------------
The same for df_2:
Example df_2:
adress |
-----------
UK |
-----------
USA |
-----------
EUROPE |
-----------
new df_2:
adress | new_id
-------------------
UK | 1
-------------------
USA | 2
-------------------
EUROPE | 3
-------------------
Now that I have a common column between the 2 dataframes, I can do a left join using new_id as the key.
My question: is this solution efficient?
How can I add the new_id columns, numbering the lines, to each dataframe?
Since Spark uses lazy evaluation, execution will not start until an action is triggered.
So what you can do is simply call the SparkSession's createDataFrame function and pass a list of selected columns from df1 and df2. It will create a new dataframe as you need.
e.g. df3 = spark.createDataFrame([df1.select(''), df2.select('')])
Upvote if it works.
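For the row-numbering part of the question, one hedged sketch (assuming df_1 and df_2 as shown above) uses zipWithIndex on the underlying RDD, which gives consecutive numbers without pulling everything into a single partition:

def with_line_number(df, id_col="new_id"):
    # Append a 1-based line number column based on the current row order.
    return (df.rdd.zipWithIndex()
              .map(lambda pair: (*pair[0], pair[1] + 1))
              .toDF(df.columns + [id_col]))

df_1_numbered = with_line_number(df_1)
df_2_numbered = with_line_number(df_2)

joined = df_1_numbered.join(df_2_numbered, on="new_id", how="left")

Whether the result is meaningful depends on whether the two dataframes happen to be in a matching order, since the join is purely positional; that, rather than raw speed, is probably the main caveat of this approach.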
I am new to Spark, sorry if this question seems too easy for you. I'm trying to come up with a Spark-like solution but can't figure out a way to do it.
My DataSet looks like following:
+----------------------+
|input |
+----------------------+
|debt ceiling |
|declaration of tax |
|decryption |
|sweats |
|ladder |
|definite integral     |
+----------------------+
I need to calculate the distribution of rows by word count, e.g.:
1st option:
500 rows contain 1 or more words
120 rows contain 2 or more words
70 rows contain 3 or more words
2nd option:
300 rows contain 1 word
250 rows contain 2 words
220 rows contain 3 words
270 rows contain 4 or more words
Is there a possible solution using Java Spark functions?
All I can think of is writing some kind of UDF with a broadcast counter, but I'm likely missing something, since there should be a better way to do this in Spark.
Welcome to SO!
Here is a solution in Scala you can easily adapt to Java.
import org.apache.spark.sql.functions.{size, split}
import spark.implicits._

val df = spark.createDataset(Seq(
  "debt ceiling", "declaration of tax", "decryption", "sweats"
)).toDF("input")

df.select(size(split('input, "\\s+")).as("words"))
  .groupBy('words)
  .count
  .orderBy('words)
  .show
This produces
+-----+-----+
|words|count|
+-----+-----+
| 1| 2|
| 2| 1|
| 3| 1|
+-----+-----+
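For anyone doing the same thing from PySpark, a roughly equivalent sketch (assuming a dataframe df with an input column, as in the Scala example):

from pyspark.sql import functions as f

df = spark.createDataFrame(
    [("debt ceiling",), ("declaration of tax",), ("decryption",), ("sweats",)],
    ["input"])

# Count words per row, then count how many rows fall in each word-count bucket.
(df.select(f.size(f.split("input", r"\s+")).alias("words"))
   .groupBy("words")
   .count()
   .orderBy("words")
   .show())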
First of all, thank you in advance.
The problem I am facing is that I have two different values I need to combine when I look up against a different table; however, I do not know which columns those two combinations will be in, and they can be different per row. Hopefully the example will help.
look up table
ID  | Benefit | Option | Tier | Benefit | Option | Tier
123 | 1       | 1      | 3    | 2       | 7      | 3
456 | 2       | 3      | 1    | 1       | 3      | 2
current table
ID  | Benefit
123 | 1
123 | 2
456 | 1
456 | 2
In the example I am giving there are only two possibilities it can be in, but in my actual program it could be in maybe 20 different locations. The one positive I have is that it will always be under a Benefit column, so what I was thinking is to concatenate Benefit & O4 and use INDEX/MATCH. I would like to dynamically concatenate based on the row my lookup is on.
Here is what I have so far, but it's not working:
=INDEX(T3:X4,MATCH(N4,$S$3:$S$4,0),MATCH($O$3&O4,T2:X2&ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),20):ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),24),0))
where
ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),20) does return T3
and ADDRESS(ROW(INDEX($S$3:$S$4,MATCH(N4,$S$3:$S$4,0))),24) returns X3,
so I was hoping it would combine Benefit&1 and see that it's a match in T3.
I guess you are trying to find a formula to put in P4 to P7?
=INDEX($S$2:$X$4,MATCH(N4,$S$2:$S$4,0),SUMPRODUCT(($S$2:$X$2="wtwben")*(OFFSET($S$2:$X$2,MATCH(N4,$S$3:$S$4,0),0)=O4)*(COLUMN($S$2:$X$2)-COLUMN($S$2)+1))+1)
If the values to return are always numeric and there is only one match for each ID/Benefit combination (as it appears in your sample), then you can get the Option value with this formula in P4, copied down:
=SUMPRODUCT((S$3:S$4=N4)*(T$2:W$2="Benefit")*(T$3:W$4=O4),U$3:X$4)
[assumes the headers are per the first table shown in your question, i.e. where T2 value is "Benefit"]
Notice how the ranges change
...or to return text values, or if the ID/Benefit combination repeats (this will give you the "first" match, where "first" means by row):
=INDIRECT(TEXT(AGGREGATE(15,6,(ROW(U$3:X$4)*1000+COLUMN(U$3:X$4))/(S$3:S$4=N4)/(T$2:W$2="Benefit")/(T$3:W$4=O4),1),"R0C000"),FALSE)