How to convert rows into string values in Apache Spark
I have a Spark DataFrame like this:
fruit | name
--------------
fruit | apple
fruit | orange
fruit | mango
I want to convert it into this:
fruit | string
----------------------------
fruit | apple, orange, mango
How can I achieve this in Apache Spark?
Look at collect_list, one of Spark's built-in SQL functions:
df.groupBy("fruit").agg(collect_list("name"))
It groups the values and creates an array of them as a new column.
If you want a string instead of an array, please see this question (thanks @mtoto).
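To make the group-then-join idea concrete without a cluster, here is a minimal plain-Python sketch of the logic. The `rows` list and the `rows_to_strings` helper are hypothetical names, not part of Spark. In Spark itself, wrapping the aggregate as `concat_ws(", ", collect_list("name"))` should give the same kind of string, with the caveat that Spark does not guarantee the order of collected values after a shuffle.

```python
from collections import defaultdict

# Hypothetical rows mirroring the (fruit, name) table in the question.
rows = [("fruit", "apple"), ("fruit", "orange"), ("fruit", "mango")]


def rows_to_strings(rows, sep=", "):
    """Group names by key, then join each group into a single string."""
    grouped = defaultdict(list)
    for key, name in rows:
        grouped[key].append(name)  # plays the role of collect_list
    # Joining each list plays the role of concat_ws
    return {key: sep.join(names) for key, names in grouped.items()}


result = rows_to_strings(rows)
print(result)
```

Unlike Spark, this sketch keeps the input order, so the joined string follows the order of the rows.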
Related
I have 2 dataframes - one is a data source dataframe and another is reference dataframe.
I want to create an additional column in df1 based on the comparison of those 2 dataframes
df1 - data source
No | Name
213344 | Apple
242342 | Orange
234234 | Pineapple
df2 - reference table
RGE_FROM | RGE_TO | Value
2100 | 2190 | Sweet
2200 | 2322 | Bitter
2400 | 5000 | Neutral
final
If the first 4 characters of df1.No fall within the range df2.RGE_FROM to df2.RGE_TO, take df2.Value for the derived column df1.DESC; otherwise leave it blank.
No | Name | DESC
213344 | Apple | Sweet
242342 | Orange | Neutral
234234 | Pineapple |
Any help is appreciated!
Thank you!
We can create an IntervalIndex from the columns RGE_FROM and RGE_TO and set it as the index of column Value to get a mapping series. Then we slice the first four characters of column No, convert them to integers, and use Series.map to substitute the values from the mapping series.
import pandas as pd
# Closed intervals [RGE_FROM, RGE_TO], used as the lookup index for Value
i = pd.IntervalIndex.from_arrays(df2['RGE_FROM'], df2['RGE_TO'], closed='both')
df1['Value'] = df1['No'].astype(str).str[:4].astype(int).map(df2.set_index(i)['Value'])
No Name Value
0 213344 Apple Sweet
1 242342 Orange Neutral
2 234234 Pineapple NaN
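For readers who want to see the lookup rule spelled out, here is a minimal plain-Python sketch of the same logic. The `source` and `ranges` lists and the `describe` helper are hypothetical names that simply copy the tables from the question; this is not the pandas implementation, only the rule it applies.

```python
# Hypothetical copies of the two tables from the question.
source = [(213344, "Apple"), (242342, "Orange"), (234234, "Pineapple")]
ranges = [(2100, 2190, "Sweet"), (2200, 2322, "Bitter"), (2400, 5000, "Neutral")]


def describe(no, ranges):
    """Return the Value whose closed range contains the first four digits of no."""
    prefix = int(str(no)[:4])
    for low, high, value in ranges:
        if low <= prefix <= high:  # closed on both ends, like closed='both'
            return value
    return ""  # no matching range: leave the description blank


final = [(no, name, describe(no, ranges)) for no, name in source]
for row in final:
    print(row)
```

The linear scan is fine for a small reference table; for a large one, the interval index in the answer above is the better fit.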
I have 2 PySpark dataframes which, after some manipulation, consist of 1 column each but have different lengths. Dataframe 1 holds ingredient names; dataframe 2 contains rows of long ingredient strings.
DATAFRAME 1:
ingcomb.show(10,truncate=False)
+---------------------------------+
|products |
+---------------------------------+
|rebel crunch granola |
|creamed honey |
|mild cheddar with onions & chives|
|berry medley |
|sweet relish made with sea salt |
|spanish peanuts |
|stir fry seasoning mix |
|swiss all natural cheese |
|yellow corn meal |
|shredded wheat |
+---------------------------------+
only showing top 10 rows
DATAFRAME 2:
reging.show(10, truncate=30)
+------------------------------+
| ingredients|
+------------------------------+
|apple bean cookie fruit kid...|
|bake bastille day bon appét...|
|dairy fennel gourmet new yo...|
|bon appétit dairy free dinn...|
|bake bon appétit california...|
|bacon basil bon appétit foo...|
|asparagus boil bon appétit ...|
|cocktail party egg fruit go...|
|beef ginger gourmet quick &...|
|dairy free gourmet ham lunc...|
+------------------------------+
only showing top 10 rows
I need to create a loop (any other suggestions are welcome too!) that goes through dataframe 1, compares each value to the strings in dataframe 2 via "like", and gives me the total count of matches.
Desired outcome:
+--------------------+-----+
| ingredients|count|
+--------------------+-----+
|rebel crunch granola| 183|
|creamed honey | 87|
|berry medley | 67|
|spanish peanuts | 10|
+--------------------+-----+
I know that the following code works:
reging.filter("ingredients like '%sugar%'").count()
and was trying to implement something like
for i in ingcomb:
    x = reging.select("ingredients").filter("ingredients like '%i%'").count()
But I cannot get PySpark to treat 'i' as a value from ingcomb instead of the literal character i.
I have tried the solutions from
Spark Compare two dataframe and find the match count
but unfortunately they do not work.
I am running this in GCP and get an error when I try to run toPandas: due to permissions, I cannot install pandas.
We were actually able to find a workaround: we get the counts within the dataframe first and then match them with a join later. Please feel free to suggest better approaches. Newbies to coding here.
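import pyspark.sql.functions as f
# f.expr parses "array(Ingredients)" as SQL; a bare string is read as a column name
counts = (reging.select(f.explode(f.expr("array(Ingredients)")).alias('col'))
          .groupBy('col').count().orderBy("count", ascending=False))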
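On the original loop: '%i%' inside the quotes is a literal pattern, so the loop variable is never substituted. A minimal plain-Python sketch of the substitution and of the per-product count follows. The `products` and `ingredients` lists and both helper names are hypothetical stand-ins for the two dataframes. In PySpark the product values would first have to be brought to the driver (for example with collect()), and values containing quote characters would need escaping before going into a filter string.

```python
# Hypothetical stand-ins for the two single-column dataframes.
products = ["creamed honey", "berry medley", "spanish peanuts"]
ingredients = [
    "apple creamed honey cookie",
    "berry medley creamed honey tart",
    "bacon basil dinner",
]


def like_filter(product):
    """Build the SQL filter string with the product value substituted in."""
    return f"ingredients like '%{product}%'"


def match_counts(products, ingredients):
    """Count, per product, how many ingredient strings contain it."""
    return {p: sum(p in row for row in ingredients) for p in products}


counts = match_counts(products, ingredients)
print(like_filter("sugar"))
print(counts)
```

The substring test `p in row` mirrors what `like '%...%'` checks; it does not respect word boundaries, which matches the behaviour of the filter in the question.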
I am trying to use the function "SUMIFS" with "dynamic criteria".
See tables below.
Cell B2 of Table Overview holds the formula I am trying to figure out:
a SUMIFS over Table Data for all Fruits listed in Table Criteria, column A:A.
If there is a new product, e.g. apple, I would like to add it to Table Criteria in A4 as "apple", and my Overview should then add the amount of apples to Fruits.
Any ideas?
Table "Overview"
|_| A | B |
|1| **Subject** **Count**
|2| Fruits 10
|3| Vegetables 20
|4|
Table "Criteria"
|_| A | B |
|1| **Fruits** **Vegetables**
|2| Banana Carrot
|3| Kiwi Broccoli
|4|
Table "Data"
|_| A | B |
|1| **Product** **Count**
|2| Banana 2
|3| Kiwi 3
|4| Banana 5
|5| Carrot 5
|6| Broccoli 15
Use:
=SUMPRODUCT(SUMIFS(B:B,A:A,INDEX($D$2:$E$2:INDEX(D:E,MATCH("zzz",INDEX(D:E,0,MATCH(G2,$D$1:$E$1,0))),MATCH(G2,$D$1:$E$1,0)),0,MATCH(G2,$D$1:$E$1,0))))
This is dynamic: items can be added to both input lists without changing the formula, and the number of iterations stays as small as possible. The SUMPRODUCT forces the SUMIFS criteria to iterate. We could pass the full column as criteria, but then it would iterate about 1.04 million times (once per worksheet row), which would slow down the calculation.
If these are true structured tables in Excel, the formula can be simplified, because the table limits the iterations:
=SUMPRODUCT(SUMIFS(Data[Count],Data[Product],INDEX(Criteria,0,MATCH([#Subject],Criteria[#Headers],0))))
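For anyone who finds the nested formula hard to read, here is a minimal Python sketch of what it computes: for each subject, sum the counts of every product listed under that subject. The `criteria` and `data` structures and the `subject_total` helper are hypothetical names that copy the tables from the question.

```python
# Hypothetical copies of the "Criteria" and "Data" tables from the question.
criteria = {"Fruits": ["Banana", "Kiwi"], "Vegetables": ["Carrot", "Broccoli"]}
data = [("Banana", 2), ("Kiwi", 3), ("Banana", 5), ("Carrot", 5), ("Broccoli", 15)]


def subject_total(subject, criteria, data):
    """Sum the counts of all products listed under the given subject."""
    wanted = set(criteria[subject])  # the criteria column for this subject
    return sum(count for product, count in data if product in wanted)


overview = {subject: subject_total(subject, criteria, data) for subject in criteria}
print(overview)
```

Adding "Apple" to the Fruits list changes the Fruits total without touching the function, which is the same property the dynamic range gives the formula.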
In the Excel data sheet below, the values in the third column are entered manually. I need a formula to automate this.
TYPE | CATEGORY | Expected_Value
fruits | apple | 1
fruits | apple | 2
fruits | apple | 3
fruits | banana | 1
fruits | banana | 2
fruits | mango | 1
fruits | mango | 2
fruits | mango | 3
fruits | mango | 4
fruits | mango | 5
Expected_Value represents the n-th duplicate of a given (TYPE, CATEGORY) couple.
Could someone help?
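You need to use COUNTIF with a range that mixes an absolute reference and a relative reference, so that the range grows as the formula is filled down. If your Category column spanned from B2 to B12, you could enter this in the first data row and fill it down: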
=COUNTIF($B$2:B2,B2)
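The expanding range means each row counts how often its own value has appeared so far. Here is a minimal Python sketch of that running count; the `pairs` list and the `running_counts` helper are hypothetical. It keys on the full (TYPE, CATEGORY) couple as the question describes, whereas the COUNTIF above looks at CATEGORY only. In Excel, a COUNTIFS over both columns, such as =COUNTIFS($A$2:A2,A2,$B$2:B2,B2), would presumably be the equivalent.

```python
from collections import Counter

# Hypothetical copy of the (TYPE, CATEGORY) rows from the question.
pairs = (
    [("fruits", "apple")] * 3
    + [("fruits", "banana")] * 2
    + [("fruits", "mango")] * 5
)


def running_counts(values):
    """For each value, return how many times it has occurred so far."""
    seen = Counter()
    result = []
    for value in values:
        seen[value] += 1  # same effect as counting over the range $B$2:B<row>
        result.append(seen[value])
    return result


expected_value = running_counts(pairs)
print(expected_value)
```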
I have 3 columns, the first column has data in it, against which the second one has some data.
Now I need the third column to list all the items from the first column.
The sheet is as below:
Name | QTY | ACTIVE
----------------------
Apple | |
----------------------
Oranges | 10 |
----------------------
Pears | 5 |
----------------------
Plums | |
It needs to look like this
Name | QTY | ACTIVE
----------------------
Apple | | Oranges
----------------------
Oranges | 10 | Pears
----------------------
Pears | 5 |
----------------------
Plums | |
How can I do this, using either a formula or a script?
What I've put above is just an example; it's actually a long list of items which may or may not have quantities against them, so I only need a list of the items that do have quantities.
Thanks in advance.
If you're using Google Sheets, you can use the FILTER function. Enter the formula in one cell and the results will be listed in the cells below it. Only items from column A that have data in column B will be displayed.
=filter(A2:A,NOT(ISBLANK(B2:B)))
How it works
=filter(range,criteria)
You can also pull both the Name and Quantity by widening your range to include column B
=filter(A2:B,NOT(ISBLANK(B2:B)))
Note: If you get a #REF! error then you may not have left enough blank cells below your filter cell for the results to be displayed.
https://support.google.com/docs/answer/3093197?hl=en
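If you go the script route instead, the same rule can be sketched in a few lines of Python: keep a name only when its quantity cell is not blank. The `rows` list is a hypothetical copy of the example sheet, with None standing in for an empty cell.

```python
# Hypothetical copy of the example sheet; None stands in for a blank QTY cell.
rows = [("Apple", None), ("Oranges", 10), ("Pears", 5), ("Plums", None)]

# Same rule as =filter(A2:A,NOT(ISBLANK(B2:B))): keep names with a quantity.
active = [name for name, qty in rows if qty is not None]

# Widening the selection to both columns, like =filter(A2:B,...).
active_with_qty = [(name, qty) for name, qty in rows if qty is not None]

print(active)
print(active_with_qty)
```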