pyspark - merge 2 columns of sets - apache-spark

I have a Spark dataframe with 2 columns formed by the function collect_set. I would like to combine these 2 columns of sets into 1 column of sets. How should I do so? They are both sets of strings.
For instance, I have 2 columns formed by calling collect_set:
Fruits | Meat
[Apple, Orange, Pear] | [Beef, Chicken, Pork]
How do I turn it into:
Food
[Apple, Orange, Pear, Beef, Chicken, Pork]
Thank you very much in advance for your help.

I was also figuring this out in Python, so here is a port of Ramesh's solution to Python:
df = spark.createDataFrame([(['Pear','Orange','Apple'], ['Chicken','Pork','Beef'])],
                           ("Fruits", "Meat"))
df.show(1, False)
from pyspark.sql.functions import udf, col
mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1, False)
Output:
+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
Kudos to Ramesh!
EDIT: Note that you might have to specify the column type explicitly. A udf without a declared return type defaults to StringType, so without it you can end up with a string column instead of an array:
from pyspark.sql.types import *
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
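As an aside (not part of the original answers): if you are on Spark 2.4 or later, you can likely skip the UDF entirely, since the built-in concat also accepts array columns. A minimal sketch, assuming the same df as above:
from pyspark.sql import functions as F
# Sketch, assumes Spark 2.4+: concat accepts array columns, so no UDF is needed.
df.withColumn("Food", F.concat(F.col("Fruits"), F.col("Meat"))).show(1, False)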

Given that you have a dataframe as
+---------------------+---------------------+
|Fruits               |Meat                 |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
You can write a udf function to merge the sets of two columns into one.
import scala.collection.mutable
import org.apache.spark.sql.functions._

def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)
And then call the udf function as
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)
You should have your desired final dataframe
+---------------------+---------------------+------------------------------------------+
|Fruits               |Meat                 |Food                                      |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+

Let's say df has
+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+
then
import itertools
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect()
combines Fruits & Meat into a single list per row (note it is a plain list, so duplicates are not removed), i.e.
[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
Hope this helps!
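If you do want duplicates removed here, one option (my own sketch, not part of the answer above) is to pass the chained values through a Python set, at the cost of losing element order:
import itertools
# Sketch: deduplicate by going through a Python set; element order is not preserved.
df.rdd.map(lambda x: list(set(itertools.chain(x.Fruits, x.Meat)))).collect()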

Adding a solution here for the case where a set is expected not to contain duplicates. It also avoids any performance concerns with Python udfs.
Requires Spark 2.4+
from pyspark.sql import functions as F
df = spark.createDataFrame([(['Chicken','Pork','Beef',"Tuna"], ["Salmon", "Tuna"])],
                           ("Meat", "Fish"))
df.show(1,False)
df_union = df.withColumn("set_union", F.array_distinct(F.array_union("Meat", "Fish")))
df_union.show(1, False)
results in
+---------------------------+--------------+-----------------------------------+
|Meat                       |Fish          |set_union                          |
+---------------------------+--------------+-----------------------------------+
|[Chicken, Pork, Beef, Tuna]|[Salmon, Tuna]|[Chicken, Pork, Beef, Tuna, Salmon]|
+---------------------------+--------------+-----------------------------------+
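A small note on the above: as far as I know, array_union already returns the union without duplicates, so the array_distinct wrapper should be redundant (though harmless) and this should give the same result:
df_union = df.withColumn("set_union", F.array_union("Meat", "Fish"))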

Related

Extract specific strings from a column and place them in a sequence

I have a dataframe like this:
df = [{'id': 1, 'id1': '859A;'},
      {'id': 2, 'id1': '209A/229A/509A;'},
      {'id': 3, 'id1': '(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;'},
      {'id': 4, 'id1': '111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A;'},
      {'id': 5, 'id1': '(651A/851A)+U17/861A;'},
      ]
df = spark.createDataFrame(df)
I want to split the "id1" column into two columns.
One column needs to only extract strings which end with "A" and put them in a sequence with "/" between strings.
The other column needs to extract the remaining strings and place them in a separate column as shown below.
Taking "id3", "id5" and "id2" as example, the desired output should be:
newcolumn1
(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)
(651A/851A,861A)
(209A,229A/509A)
newcolumn2
+TZ+-494;
+U17;
blank
All codes that share the same leading digit and end with "A" should form one group, separated with commas, and the groups themselves should be separated with "/".
Your best bet is to use regex. regexp_extract_all is not yet directly available in the Python API, but you can use expr to reach it. You will also need a couple of consecutive aggregations.
from pyspark.sql import functions as F
cols = df.columns
df = df.withColumn('_vals', F.explode(F.expr(r"regexp_extract_all(id1, '\\d+A', 0)")))
df = (df
    .groupBy(*cols, F.substring('_vals', 1, 1)).agg(
        F.array_join(F.array_sort(F.collect_list('_vals')), ',').alias('_vals')
    ).groupBy(cols).agg(
        F.array_join(F.array_sort(F.collect_list('_vals')), '/').alias('newcolumn1')
    ).withColumn('newcolumn1', F.format_string('(%s)', 'newcolumn1')
    ).withColumn('newcolumn2', F.regexp_replace('id1', r'\d+A|/|\(|\)', ''))
)
df.show(truncate=0)
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |id |id1 |newcolumn1 |newcolumn2|
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
# |3 |(105A/111A/121A/131A/201A/205A/211A/221A/231A/509A/801A/805A/811A/821A)+TZ+-494;|(105A,111A,121A,131A/201A,205A,211A,221A,231A/509A/801A,805A,811A,821A)|+TZ+-494; |
# |5 |(651A/851A)+U17/861A; |(651A/851A,861A) |+U17; |
# |2 |209A/229A/509A; |(209A,229A/509A) |; |
# |4 |111A/114A/121A/131A/201A/211A/221A/231A/651A+-(Y05/U17)/801A/804A/821A; |(111A,114A,121A,131A/201A,211A,221A,231A/651A/801A,804A,821A) |+-Y05U17; |
# |1 |859A; |(859A) |; |
# +---+--------------------------------------------------------------------------------+-----------------------------------------------------------------------+----------+
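As a side note (an assumption on my part, not from the answer above): Spark 3.5+ exposes regexp_extract_all directly in the Python API, so the expr() detour for that step could likely be replaced with something like:
from pyspark.sql import functions as F
# Sketch, assumes Spark 3.5+ where F.regexp_extract_all exists; the pattern is passed as a literal column.
df = df.withColumn('_vals', F.explode(F.regexp_extract_all('id1', F.lit(r'\d+A'), 0)))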

Convert Multiple columns into a single row with a variable amount of columns

I have a Spark dataframe containing businesses with their contact numbers in 2 columns; however, some of my businesses are repeated with different contact info. For example:
Name: | Phone:
bus1  | 082...
bus1  | 087...
bus2  | 076...
bus3  | 081...
bus3  | 084...
bus3  | 086...
I want to have 3 lines, 1 for each business with varying phone numbers in each, for example:
Name: | Phone1: | Phone2: | Phone3:
bus1  | 082...  | 087...  |
bus2  | 076...  |         |
bus3  | 081...  | 084...  | 086...
I have tried using select('Name','Phone').distinct(), but I don't know how to pivot it to a single row matching on the 'Name' column... please help
First construct the phone array based on name, and then split the array into multiple columns.
from pyspark.sql import functions as F

df = df.groupBy('Name').agg(F.collect_list('Phone').alias('Phone'))
df = df.select('Name', *[F.col('Phone')[i].alias(f'Phone{str(i+1)}') for i in range(3)])
df.show(truncate=False)
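If the maximum number of phone numbers per business is not known up front, you could derive it from the data instead of hard-coding range(3); a small sketch of my own, assuming df is the grouped dataframe from the first line above:
from pyspark.sql import functions as F
# Sketch: find the longest phone list, then generate that many PhoneN columns.
n = df.agg(F.max(F.size('Phone'))).first()[0]
df = df.select('Name', *[F.col('Phone')[i].alias(f'Phone{i+1}') for i in range(n)])
df.show(truncate=False)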
Try something like below -
Input DataFrame
df = spark.createDataFrame([('bus1', '082...'), ('bus1', '087...'), ('bus2', '076...'), ('bus3', '081...'),('bus3', '084...'),('bus3', '086...')], schema=["Name", "Phone"])
df.show()
+----+------+
|Name| Phone|
+----+------+
|bus1|082...|
|bus1|087...|
|bus2|076...|
|bus3|081...|
|bus3|084...|
|bus3|086...|
+----+------+
Collecting all the Phone values into an array using collect_list
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.groupBy("Name").agg(collect_list(col("Phone")).alias("Phone")).select( "Name", "Phone")
df1.show(truncate=False)
+----+------------------------+
|Name|Phone                   |
+----+------------------------+
|bus1|[082..., 087...]        |
|bus2|[076...]                |
|bus3|[081..., 084..., 086...]|
+----+------------------------+
Splitting Phone into multiple columns
df1.select(['Name'] + [df1.Phone[x].alias(f"Phone{x+1}") for x in range(0,3)]).show(truncate=False)
+----+------+------+------+
|Name|Phone1|Phone2|Phone3|
+----+------+------+------+
|bus1|082...|087...|null  |
|bus2|076...|null  |null  |
|bus3|081...|084...|086...|
+----+------+------+------+
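Since the question mentions pivoting: another option (my own rough sketch, not from the answers above) is to number each phone within a business and pivot on that number, assuming df is the original Name/Phone dataframe and there are at most a handful of phones per name:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sketch: rank phones within each Name, then pivot the rank into columns (named 1, 2, 3, ...).
w = Window.partitionBy("Name").orderBy("Phone")
(df.withColumn("n", F.row_number().over(w))
   .groupBy("Name")
   .pivot("n")
   .agg(F.first("Phone"))
   .show())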

Combine columns into list of key, value pairs (no UDF)

I'd like to create a new column that is a JSON representation of some other columns: key/value pairs in a list.
Source:
origin   | destination | count
toronto  | ottawa      | 5
montreal | vancouver   | 10
What I want:
origin   | destination | count | json
toronto  | ottawa      | 5     | [{"origin":"toronto"}, {"destination":"ottawa"}, {"count":"5"}]
montreal | vancouver   | 10    | [{"origin":"montreal"}, {"destination":"vancouver"}, {"count":"10"}]
(everything can be a string, doesn't matter).
I've tried something like:
df.withColumn('json', to_json(struct(col('origin'), col('destination'), col('count'))))
But it creates the column with all the key:value pairs in one object:
{"origin":"United States","destination":"Romania"}
Is this possible without a UDF?
A way to hack around this:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'json',
    F.array(
        F.to_json(F.struct('origin')),
        F.to_json(F.struct('destination')),
        F.to_json(F.struct('count'))
    ).cast('string')
)
df2.show(truncate=False)
+--------+-----------+-----+--------------------------------------------------------------------+
|origin |destination|count|json |
+--------+-----------+-----+--------------------------------------------------------------------+
|toronto |ottawa |5 |[{"origin":"toronto"}, {"destination":"ottawa"}, {"count":"5"}] |
|montreal|vancouver |10 |[{"origin":"montreal"}, {"destination":"vancouver"}, {"count":"10"}]|
+--------+-----------+-----+--------------------------------------------------------------------+
Another way is to create an array-of-maps column before calling to_json:
from pyspark.sql import functions as F
df1 = df.withColumn(
    'json',
    F.to_json(F.array(*[F.create_map(F.lit(c), F.col(c)) for c in df.columns]))
)
df1.show(truncate=False)
#+--------+-----------+-----+------------------------------------------------------------------+
#|origin |destination|count|json |
#+--------+-----------+-----+------------------------------------------------------------------+
#|toronto |ottawa |5 |[{"origin":"toronto"},{"destination":"ottawa"},{"count":"5"}] |
#|montreal|vancouver |10 |[{"origin":"montreal"},{"destination":"vancouver"},{"count":"10"}]|
#+--------+-----------+-----+------------------------------------------------------------------+

Python Spark: How to join 2 datasets containing >2 elements for each tuple

I'm trying to join data from these two datasets, based on the common "stock" key
stock, sector
GOOG Tech
stock, date, volume
GOOG 2015 5759725
The join method should join these together, however the resulting RDD I got is of the form:
GOOG, (Tech, 2015)
I'm trying to obtain:
(Tech, 2015) 5759725
Additionally, how do I go about reducing the results by the keys (e.g. (Tech, 2015)) in order to obtain a numerical summation for each sector and year?
from pyspark.sql.functions import struct, col, sum
# sample data
df1 = sc.parallelize([['GOOG', 'Tech'],
                      ['AAPL', 'Tech'],
                      ['XOM', 'Oil']]).toDF(["stock", "sector"])
df2 = sc.parallelize([['GOOG', '2015', '5759725'],
                      ['AAPL', '2015', '123'],
                      ['XOM', '2015', '234'],
                      ['XOM', '2016', '789']]).toDF(["stock", "date", "volume"])

# final output
df = df1.join(df2, ['stock'], 'inner').\
    withColumn('sector_year', struct(col('sector'), col('date'))).\
    drop('stock', 'sector', 'date')
df.show()

# numerical summation for each sector and year
df.groupBy('sector_year').agg(sum('volume')).show()
Output is:
+-------+-----------+
| volume|sector_year|
+-------+-----------+
| 123|[Tech,2015]|
| 234| [Oil,2015]|
| 789| [Oil,2016]|
|5759725|[Tech,2015]|
+-------+-----------+
+-----------+-----------+
|sector_year|sum(volume)|
+-----------+-----------+
|[Tech,2015]| 5759848.0|
| [Oil,2015]| 234.0|
| [Oil,2016]| 789.0|
+-----------+-----------+
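Since the question was phrased in terms of RDDs, here is a rough RDD-only sketch of the same join and per-(sector, year) summation. This is my own sketch, not part of the answer above, and it assumes both datasets have already been keyed by stock:
# Sketch with plain RDDs: key both datasets by stock, join, re-key by (sector, year), then sum volumes.
sectors = sc.parallelize([("GOOG", "Tech"), ("AAPL", "Tech"), ("XOM", "Oil")])
volumes = sc.parallelize([("GOOG", ("2015", 5759725)), ("AAPL", ("2015", 123)),
                          ("XOM", ("2015", 234)), ("XOM", ("2016", 789))])

joined = sectors.join(volumes)  # (stock, (sector, (date, volume)))
by_key = joined.map(lambda kv: ((kv[1][0], kv[1][1][0]), kv[1][1][1]))  # ((sector, date), volume)
totals = by_key.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # e.g. [(('Tech', '2015'), 5759848), ...]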

Apache Spark - Finding Array/List/Set subsets

I have 2 dataframes, each having Array[String] as one of the columns. For each entry in one dataframe, I need to find the subsets, if any, in the other dataframe. An example is here:
DF1:
----------------------------------------------------
id : Long | labels : Array[String]
---------------------------------------------------
10 | [label1, label2, label3]
11 | [label4, label5]
12 | [label6, label7]
DF2:
----------------------------------------------------
item : String | labels : Array[String]
---------------------------------------------------
item1 | [label1, label2, label3, label4, label5]
item2 | [label4, label5]
item3 | [label4, label5, label6, label7]
After the subset operation I described, the expected o/p should be
DF3:
----------------------------------------------------
item : String | id : Long
---------------------------------------------------
item1 | [10, 11]
item2 | [11]
item3 | [11, 12]
It is guaranteed that the DF2, will always have corresponding subsets in DF1, so there won't be any left over elements.
Can someone please help with the right approach here? It looks like for each element in DF2, I need to scan DF1 and do a subset operation (or set subtraction) on the 2nd column until I find all the subsets and exhaust the labels in that row, accumulating the list of "id" fields as I go. How do I do this in a compact and efficient manner? Any help is greatly appreciated. Realistically, I may have 100s of elements in DF1 and 1000s of elements in DF2.
I'm not aware of any way to perform this kind of operation efficiently. However, here is one possible solution using a UDF as well as a Cartesian join.
The UDF takes two sequences and checks whether all strings in the first exist in the second:
import org.apache.spark.sql.functions.{collect_list, udf}

val matchLabel = udf((array1: Seq[String], array2: Seq[String]) => {
  array1.forall{x => array2.contains(x)}
})
To use a Cartesian join, it needs to be enabled explicitly, since it is computationally expensive.
val spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", true)
The two dataframes are joined together utilizing the UDF. Afterwards the resulting dataframe is grouped by the item column to collect a list of all ids. Using the same DF1 and DF2 as in the question:
val DF3 = DF2.join(DF1, matchLabel(DF1("labels"), DF2("labels")))
  .groupBy("item")
  .agg(collect_list("id").as("id"))
The result is as follows:
+-----+--------+
| item| id|
+-----+--------+
|item3|[11, 12]|
|item2| [11]|
|item1|[10, 11]|
+-----+--------+
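For what it's worth, on Spark 2.4+ the subset check can likely be expressed without a UDF using the built-in array_except: the labels in DF1 are a subset of the labels in DF2 exactly when array_except of the two is empty. A PySpark sketch of my own, under the same cross-join caveats as above:
from pyspark.sql import functions as F

# Sketch, assumes Spark 2.4+: DF1's labels are a subset of DF2's labels iff removing DF2's labels leaves nothing.
DF3 = (DF2.join(DF1, F.size(F.array_except(DF1["labels"], DF2["labels"])) == 0)
          .groupBy("item")
          .agg(F.collect_list("id").alias("id")))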
