create a dataframe with the duplicated and non-duplicated rows - apache-spark

I have a dataframe like this:
column_1  column_2  column_3  column_4  column_5  column_6  column_7
34432     apple     banana    mango     pine      lemon     j93jk84
98389     grape     orange    pine      kiwi      cherry    j93jso3
94749     apple     banana    mango     pine      lemon     ke03jfr
48948     apple     banana    mango     pine      lemon     9jef3f4
...       ...       ...       ...       ...       ...       ...
90493     pear      apricot   papaya    plum      lemon     93jd30d
90843     grape     orange    pine      kiwi      cherry    03nd920
I want to have two dataframes.
Dataframe_1:
I want to ignore column_1 and column_7, drop all the duplicated rows, and keep only the unique rows based on all other columns.
Dataframe_2:
column_1  column_2  column_3  column_4  column_5  column_6  column_7  type        tag
34432     apple     banana    mango     pine      lemon     j93jk84   unique      1
98389     grape     orange    pine      kiwi      cherry    j93jso3   unique      2
94749     apple     banana    mango     pine      lemon     ke03jfr   duplicated  1
48948     apple     banana    mango     pine      lemon     9jef3f4   duplicated  1
...       ...       ...       ...       ...       ...       ...       ...         ...
90493     pear      apricot   papaya    plum      lemon     93jd30d   unique      3
90843     grape     orange    pine      kiwi      cherry    03nd920   duplicated  2
As you can see in the example dataframe_2, I need two new columns: "type", which specifies whether the row is unique or duplicated, and "tag", to easily identify the unique row and the duplicated rows that belong to it.
Can someone tell me how to achieve both of these dataframes in PySpark?
Code I tried:
import pyspark.sql.functions as F

# to drop the duplicates ignoring column_1 and column_7
df_unique = df.dropDuplicates(["column_6", "column_2", "column_3", "column_4", "column_5"])
df_duplicates = df.subtract(df_unique)

# adding a type column to both dataframes and concatenating the two dataframes
df_unique = df_unique.withColumn("type", F.lit("unique"))
df_duplicates = df_duplicates.withColumn("type", F.lit("duplicate"))
df_combined = df_unique.unionByName(df_duplicates)

# unable to create the tag column
..

If I understood your question correctly, essentially you need to:
Tag the first row as unique
Tag all subsequent rows as duplicated if the values of all columns are the same, ignoring column_1 and column_7
Let me know if this is not the case.
Using row_number: use all the columns to compare as the partition key and generate a row number for each partition. If there are multiple rows for a given set of column values, they fall into the same partition and get row numbers accordingly. (You can use orderBy to mark specific rows as unique if that is a requirement.)
df.withColumn("asArray", F.array(*[x for x in df.schema.names if x!="column_1" and x!="column_7"]))\
.withColumn("rn", F.row_number().over(Window.partitionBy("asArray").orderBy(F.lit("dummy"))))\
.withColumn("type", F.when(F.col("rn")==1, "Unique").otherwise("Duplicated"))\
.withColumn("tag", F.dense_rank().over(Window.orderBy(F.col("asArray"))))\
.show(truncate=False)
I've collected the values of all columns to compare in an array to make it easy.
Edit: corrected the tag logic after checking the output against data similar to your dataset, with more than 2 duplicates.

I did something like this; here is the input:
import pyspark.sql.functions as F
from pyspark.sql import Window
data = [
    {"column_1": 34432, "column_2": "apple", "column_3": "banana", "column_4": "mango", "column_5": "pine", "column_6": "lemon", "column_7": "j93jk84"},
    {"column_1": 98389, "column_2": "grape", "column_3": "orange", "column_4": "pine", "column_5": "kiwi", "column_6": "cherry", "column_7": "j93jso3"},
    {"column_1": 94749, "column_2": "apple", "column_3": "banana", "column_4": "mango", "column_5": "pine", "column_6": "lemon", "column_7": "ke03jfr"},
    {"column_1": 48948, "column_2": "grape", "column_3": "banana", "column_4": "mango", "column_5": "pine", "column_6": "lemon", "column_7": "9jef3f4"},
    {"column_1": 90493, "column_2": "pear", "column_3": "apricot", "column_4": "papaya", "column_5": "plum", "column_6": "lemon", "column_7": "93jd30d"},
    {"column_1": 90843, "column_2": "grape", "column_3": "orange", "column_4": "pine", "column_5": "kiwi", "column_6": "cherry", "column_7": "03nd920"}
]
df = spark.createDataFrame(data)
Imo for the first df you can use dropDuplicates, where you can pass a subset of columns:
firstDf = df.dropDuplicates(["column_2","column_3","column_4","column_5","column_6"])
For the second df you can do something like this:
windowSpec = Window.partitionBy("column_2", "column_3", "column_4", "column_5", "column_6").orderBy("column_1")
secondDf = (
    df.withColumn("row_number", F.row_number().over(windowSpec))
    .withColumn(
        "type",
        F.when(F.col("row_number") == F.lit(1), F.lit("unique")).otherwise(
            F.lit("duplicated")
        ),
    )
    # use the first column_1 of each group as a group key, then dense_rank it into 1, 2, 3, ...
    .withColumn("tag", F.first("column_1").over(windowSpec))
    .withColumn("tag", F.dense_rank().over(Window.partitionBy().orderBy("tag")))
    .drop("row_number")
)
secondDf.show()
Output
+--------+--------+--------+--------+--------+--------+--------+----------+---+
|column_1|column_2|column_3|column_4|column_5|column_6|column_7| type|tag|
+--------+--------+--------+--------+--------+--------+--------+----------+---+
| 34432| apple| banana| mango| pine| lemon| j93jk84| unique| 1|
| 94749| apple| banana| mango| pine| lemon| ke03jfr|duplicated| 1|
| 48948| grape| banana| mango| pine| lemon| 9jef3f4| unique| 2|
| 90493| pear| apricot| papaya| plum| lemon| 93jd30d| unique| 3|
| 90843| grape| orange| pine| kiwi| cherry| 03nd920| unique| 4|
| 98389| grape| orange| pine| kiwi| cherry| j93jso3|duplicated| 4|
+--------+--------+--------+--------+--------+--------+--------+----------+---+
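As a small follow-up (not from the original answer): assuming secondDf above is assigned the tagged dataframe as shown, Dataframe_1 can also be derived from it by keeping only the rows marked unique:
# hedged sketch: derive Dataframe_1 from the tagged result built above
firstDf_from_tagged = secondDf.filter(F.col("type") == "unique").drop("type", "tag")
firstDf_from_tagged.show()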

Related

Get first product ordered by customer

I have an existing table like below. I want to replace the NULLs in the first_product column with the first product a customer has ordered.
INPUT:
customer_id  product     order_date_id  first_product
C0001        apple       20220224       NULL
C0001        pear        20220101       NULL
C0002        strawberry  20220224       NULL
C0001        apple       20220206       NULL
OUTPUT:
customer_id  product     order_date_id  first_product
C0001        apple       20220224       pear
C0001        pear        20220101       pear
C0002        strawberry  20220224       strawberry
C0001        apple       20220206       pear
I have thought about using row numbers as below, but I am not sure how to pull it all together and update the first_product column with this code.
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY customer_id, order_date_id) AS first_occurrance
Some pseudo-code:
REPLACE first_product FROM table WITH product WHERE
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY customer_id, order_date_id) AS first_occurrance = 1
Hey, you can use the first window function to achieve this.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._  // for toDF and the $ column syntax (already in scope in spark-shell)

val cust_data = Seq[(String, String, Int, String)](
  ("C0001", "apple", 20220224, null),
  ("C0001", "pear", 20220101, null),
  ("C0002", "strawberry", 20220224, null),
  ("C0001", "apple", 20220206, null)
).toDF("cust_id", "product", "date_id", "first_prod")

val out_df = cust_data.withColumn("first_prod", first($"product").over(Window.partitionBy($"cust_id").orderBy($"date_id")))
out_df.show()
+-------+----------+--------+----------+
|cust_id| product| date_id|first_prod|
+-------+----------+--------+----------+
| C0001| pear|20220101| pear|
| C0001| apple|20220206| pear|
| C0001| apple|20220224| pear|
| C0002|strawberry|20220224|strawberry|
+-------+----------+--------+----------+
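For reference, a possible PySpark port of the Scala snippet above (a sketch, not part of the original answer; the explicit schema avoids type-inference problems with the all-null first_prod column, and spark is assumed to be an existing SparkSession):
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("cust_id", StringType()),
    StructField("product", StringType()),
    StructField("date_id", IntegerType()),
    StructField("first_prod", StringType()),
])
cust_data = spark.createDataFrame([
    ("C0001", "apple", 20220224, None),
    ("C0001", "pear", 20220101, None),
    ("C0002", "strawberry", 20220224, None),
    ("C0001", "apple", 20220206, None),
], schema)

# earliest product per customer, ordered by date_id
w = Window.partitionBy("cust_id").orderBy("date_id")
out_df = cust_data.withColumn("first_prod", F.first("product").over(w))
out_df.show()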

How to find the total length of a column value that has multiple values in different rows for another column

Is there a way to find the IDs that have both Apple and Strawberry, and then find the total length? And the IDs that have only Apple, and the IDs that have only Strawberry?
df:
   ID   Fruit
0  ABC  Apple       <- ABC has Apple and Strawberry
1  ABC  Strawberry  <- ABC has Apple and Strawberry
2  EFG  Apple       <- EFG has Apple only
3  XYZ  Apple       <- XYZ has Apple and Strawberry
4  XYZ  Strawberry  <- XYZ has Apple and Strawberry
5  CDF  Strawberry  <- CDF has Strawberry
6  AAA  Apple       <- AAA has Apple only
Desired output:
Length of IDs that have Apple and Strawberry: 2
Length of IDs that have Apple only: 2
Length of IDs that have Strawberry only: 1
Thanks!
If the values in the Fruit column are always only Apple or Strawberry, you can compare sets per group and then count the IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
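If you need the three specific lengths from the question as plain numbers, one option is to look them up in that frozenset Series (a small self-contained sketch building on the idea above, not part of the original answer):
import pandas as pd

df = pd.DataFrame({
    "ID": ["ABC", "ABC", "EFG", "XYZ", "XYZ", "CDF", "AAA"],
    "Fruit": ["Apple", "Strawberry", "Apple", "Apple", "Strawberry", "Strawberry", "Apple"],
})

# one frozenset of fruits per ID, counted per distinct combination
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()

# read off the three counts, defaulting to 0 if a combination does not occur
both = s.get(frozenset({'Apple', 'Strawberry'}), 0)
apple_only = s.get(frozenset({'Apple'}), 0)
strawberry_only = s.get(frozenset({'Strawberry'}), 0)
print(both, apple_only, strawberry_only)  # 2 2 1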
You can use pivot_table and value_counts for DataFrames (pandas >= 1.1.0):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple  Strawberry
1      1             2
       0             2
0      1             1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()

Search (row values) data from another dataframe

I have two dataframes, df1 and df2 respectively.
In one dataframe I have a list of search values (Actually Col1)
Col1 Col2
A1 val1, val2
B2 val4, val1
C3 val2, val5
I have another dataframe where I have a list of items
value items
val1 apples, oranges
val2 honey, mustard
val3 banana, milk
val4 biscuit
val5 chocolate
I want to iterate through the first DF and use each val as a key to search for items in the second DF.
Expected output:
A1 apples, oranges, honey, mustard
B2 biscuit, apples, oranges
C3 honey, mustard, chocolate
I am able to add the values into a dataframe and iterate through the 1st DF:
for index, row in df1.iterrows():
    # list to hold all the values
    finalList = []
    values = row['Col2'].split(', ')
    for i in values:
        print(i)
I just need help to fetch values from the second dataframe.
Would appreciate any help. Thanks.
The idea is to use a lambda function with split and a lookup dictionary:
d = df2.set_index('value')['items'].to_dict()
df1['Col2'] = df1['Col2'].apply(lambda x: ', '.join(d[y] for y in x.split(', ') if y in d))
print (df1)
Col1 Col2
0 A1 apples, oranges, honey, mustard
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
If the items values are lists, the solution changes to include flattening:
d = df2.set_index('value')['items'].to_dict()
f = lambda x: ', '.join(z for y in x.split(', ') if y in d for z in d[y])
df1['Col2'] = df1['Col2'].apply(f)
print (df1)
Col1 Col2
0 A1 apples, oranges
1 B2 biscuit, apples, oranges
2 C3 honey, mustard, chocolate
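If you prefer to avoid apply, an alternative sketch (a different approach than the answer above, assuming pandas 0.25+ for explode) is to explode the search values and merge against the lookup table:
import pandas as pd

df1 = pd.DataFrame({'Col1': ['A1', 'B2', 'C3'],
                    'Col2': ['val1, val2', 'val4, val1', 'val2, val5']})
df2 = pd.DataFrame({'value': ['val1', 'val2', 'val3', 'val4', 'val5'],
                    'items': ['apples, oranges', 'honey, mustard',
                              'banana, milk', 'biscuit', 'chocolate']})

# one row per search value, joined against the lookup table
exploded = df1.assign(value=df1['Col2'].str.split(', ')).explode('value')
merged = exploded.merge(df2, on='value', how='left')

# re-assemble a single comma-separated items string per Col1
out = merged.groupby('Col1', sort=False)['items'].agg(', '.join).reset_index()
print(out)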

group by to create multiple files

I have written code using pandas groupby and it is working.
My question is: how can I save each group to an Excel sheet?
For example, if you have groups of fruits ['apple', 'grapes', ..., 'mango'], I want to save apple in one Excel file and grapes in a different Excel file.
import pandas as pd

df = pd.read_excel('C://Desktop/test/file.xlsx')
g = df.groupby('fruits')

for fruits, fruits_g in g:
    print(fruits)
    print(fruits_g)
Mango
name id purchase fruits
1 john 877 2 Mango
apple
name id purchase fruits
0 ram 654 5 apple
3 Sam 546 5 apple
BlueB
name id purchase fruits
7 david 767 9 black
grapes
name id purchase fruits
2 Dan 454 1 grapes
4 sys 890 7 grapes
mango
name id purchase fruits
5 baka 786 6 mango
strawB
name id purchase fruits
6 silver 887 9 straw
How can I create an Excel file for each group of fruit?
This can be accomplished using pandas.DataFrame.to_excel:
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["apple", "orange", "banana", "apple", "orange"],
    "Name": ["John", "Sam", "David", "Rebeca", "Sydney"],
    "ID": [877, 546, 767, 887, 890],
    "Purchase": [1, 2, 5, 6, 4]
})
grouped = df.groupby("Fruit")

# run this to generate separate Excel files
for fruit, group in grouped:
    group.to_excel(excel_writer=f"{fruit}.xlsx", sheet_name=fruit, index=False)

# run this to generate a single Excel file with separate sheets
with pd.ExcelWriter("fruits.xlsx") as writer:
    for fruit, group in grouped:
        group.to_excel(excel_writer=writer, sheet_name=fruit, index=False)
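One practical caveat (not from the original answer): Excel sheet names are limited to 31 characters and cannot contain the characters [ ] : * ? / \, so if the group keys come from free-form data it can help to sanitise them first. A hypothetical helper sketch, reusing pd and grouped from the snippet above:
import re

def safe_sheet_name(name):
    # hypothetical helper: replace characters Excel forbids in sheet names and trim to 31 chars
    return re.sub(r'[\[\]:*?/\\]', '_', str(name))[:31]

with pd.ExcelWriter("fruits_safe.xlsx") as writer:
    for fruit, group in grouped:
        group.to_excel(excel_writer=writer, sheet_name=safe_sheet_name(fruit), index=False)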

pyspark - merge 2 columns of sets

I have a Spark dataframe that has 2 columns formed from the function collect_set. I would like to combine these 2 columns of sets into 1 column of sets. How should I do so? They are both sets of strings.
For Instance I have 2 columns formed from calling collect_set
Fruits | Meat
[Apple,Orange,Pear] [Beef, Chicken, Pork]
How do I turn it into:
Food
[Apple,Orange,Pear, Beef, Chicken, Pork]
Thank you very much for your help in advance
I was also figuring this out in Python, so here is a port of Ramesh's solution to Python:
from pyspark.sql.functions import udf, col

df = spark.createDataFrame([(['Pear', 'Orange', 'Apple'], ['Chicken', 'Pork', 'Beef'])],
                           ("Fruits", "Meat"))
df.show(1, False)

mergeCols = udf(lambda fruits, meat: fruits + meat)
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(1, False)
Output:
+---------------------+---------------------+
|Fruits |Meat |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
+---------------------+---------------------+------------------------------------------+
|Fruits |Meat |Food |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
Kudos to Ramesh!
EDIT: Note that you might have to manually specify the column type (not sure why it worked for me only in some cases without explicit type specification - in other cases I was getting a string type column).
from pyspark.sql.types import *
mergeCols = udf(lambda fruits, meat: fruits + meat, ArrayType(StringType()))
Given that you have a dataframe as
+---------------------+---------------------+
|Fruits |Meat |
+---------------------+---------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|
+---------------------+---------------------+
You can write a udf function to merge the sets of two columns into one.
import org.apache.spark.sql.functions._
import scala.collection.mutable

def mergeCols = udf((fruits: mutable.WrappedArray[String], meat: mutable.WrappedArray[String]) => fruits ++ meat)
And then call the udf function as
df.withColumn("Food", mergeCols(col("Fruits"), col("Meat"))).show(false)
You should have your desired final dataframe
+---------------------+---------------------+------------------------------------------+
|Fruits |Meat |Food |
+---------------------+---------------------+------------------------------------------+
|[Pear, Orange, Apple]|[Chicken, Pork, Beef]|[Pear, Orange, Apple, Chicken, Pork, Beef]|
+---------------------+---------------------+------------------------------------------+
Let's say df has
+--------------------+--------------------+
| Fruits| Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+
then
import itertools
df.rdd.map(lambda x: [item for item in itertools.chain(x.Fruits, x.Meat)]).collect()
combines Fruits & Meat into one list per row, i.e.
[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
Hope this helps!
Adding a solution here for the definition of a set as not containing duplicates. It also avoids any performance concerns with Python UDFs.
Requires Spark 2.4+
from pyspark.sql import functions as F
df = spark.createDataFrame([(['Chicken','Pork','Beef',"Tuna"], ["Salmon", "Tuna"])],
("Meat", "Fish"))
df.show(1,False)
df_union = df.withColumn("set_union", F.array_distinct(F.array_union("Meat", "Fish")))
df_union.show(1, False)
results in
+---------------------------+--------------+-----------------------------------+
|Meat |Fish |set_union |
+---------------------------+--------------+-----------------------------------+
|[Chicken, Pork, Beef, Tuna]|[Salmon, Tuna]|[Chicken, Pork, Beef, Tuna, Salmon]|
+---------------------------+--------------+-----------------------------------+
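If you instead want plain concatenation that keeps duplicates, F.concat also accepts array columns on Spark 2.4+; a small sketch reusing the df (Meat/Fish) from the snippet above:
# plain concatenation of the two array columns, duplicates kept (Spark 2.4+)
df_concat = df.withColumn("all_items", F.concat("Meat", "Fish"))
df_concat.show(1, False)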
