Get first product ordered by customer - apache-spark

I have an existing table like the one below. I want to replace the NULLs in the first_product column with the first product a customer has ordered.
INPUT:
customer_id  product     order_date_id  first_product
C0001        apple       20220224       NULL
C0001        pear        20220101       NULL
C0002        strawberry  20220224       NULL
C0001        apple       20220206       NULL
OUTPUT:
customer_id  product     order_date_id  first_product
C0001        apple       20220224       pear
C0001        pear        20220101       pear
C0002        strawberry  20220224       strawberry
C0001        apple       20220206       pear
I have thought about using row numbers as below, but I'm not sure how to pull it all together or how to update the first_product column with it.
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY customer_id, order_date_id) AS first_occurrence
Some pseudo-code:
REPLACE first_product FROM table WITH product WHERE
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY customer_id, order_date_id) AS first_occurrence = 1
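One way to pull the ROW_NUMBER idea together is to join every order back to its customer's first-occurrence row. A minimal PySpark SQL sketch, assuming the input is loaded as a DataFrame named df and registered as a temporary view called orders (both hypothetical names):

df.createOrReplaceTempView("orders")
first_products = spark.sql("""
    WITH firsts AS (
        SELECT customer_id,
               product AS first_product,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY order_date_id) AS first_occurrence
        FROM orders
    )
    SELECT o.customer_id, o.product, o.order_date_id, f.first_product
    FROM orders o
    JOIN firsts f
      ON o.customer_id = f.customer_id
     AND f.first_occurrence = 1
""")
first_products.show()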

You can use the first window function to achieve this.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._

val cust_data = Seq[(String, String, Int, String)](
  ("C0001", "apple", 20220224, null),
  ("C0001", "pear", 20220101, null),
  ("C0002", "strawberry", 20220224, null),
  ("C0001", "apple", 20220206, null)
).toDF("cust_id", "product", "date_id", "first_prod")

val out_df = cust_data.withColumn("first_prod",
  first($"product").over(Window.partitionBy($"cust_id").orderBy($"date_id")))
out_df.show()
+-------+----------+--------+----------+
|cust_id| product| date_id|first_prod|
+-------+----------+--------+----------+
| C0001| pear|20220101| pear|
| C0001| apple|20220206| pear|
| C0001| apple|20220224| pear|
| C0002|strawberry|20220224|strawberry|
+-------+----------+--------+----------+
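For anyone working in PySpark rather than Scala, a minimal equivalent sketch of the same first window-function approach (column names as above):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

cust_data = spark.createDataFrame(
    [("C0001", "apple", 20220224),
     ("C0001", "pear", 20220101),
     ("C0002", "strawberry", 20220224),
     ("C0001", "apple", 20220206)],
    ["cust_id", "product", "date_id"],
)

# first() over a window ordered by date_id picks the earliest product per customer
w = Window.partitionBy("cust_id").orderBy("date_id")
cust_data.withColumn("first_prod", F.first("product").over(w)).show()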

Related

create a dataframe with the duplicated and non-duplicated rows

I have a dataframe like this:
column_1 column_2 column_3 column_4 column_5 column_6 column_7
34432 apple banana mango pine lemon j93jk84
98389 grape orange pine kiwi cherry j93jso3
94749 apple banana mango pine lemon ke03jfr
48948 apple banana mango pine lemon 9jef3f4
. . . . . . .
90493 pear apricot papaya plum lemon 93jd30d
90843 grape orange pine kiwi cherry 03nd920
I want to have two dataframes.
Dataframe_1:
I want to ignore column_1 and column_7, and drop all the duplicated data and keep only unique rows based on all other columns.
Dataframe_2:
column_1 column_2 column_3 column_4 column_5 column_6 column_7 type Tag
34432 apple banana mango pine lemon j93jk84 unique 1
98389 grape orange pine kiwi cherry j93jso3 unique 2
94749 apple banana mango pine lemon ke03jfr duplicated 1
48948 apple banana mango pine lemon 9jef3f4 duplicated 1
. . . . . . .
90493 pear apricot papaya plum lemon 93jd30d unique 3
90843 grape orange pine kiwi cherry 03nd920 duplicated 2
As you can see in the example dataframe_2, I need two new columns: "type", which specifies whether the row is unique or duplicated, and "tag", to easily identify the unique row and the duplicated rows that belong to it.
Can someone tell me how to achieve both these dataframes in PySpark?
Code I tried:
# to drop the duplicates ignoring column_1 and column_7
df_unique = df.dropDuplicates(["column_6","column_2","column_3","column_4","column_5"])
df_duplicates = df.subtract(df_unique)
# adding a type column to both dataframes and concatenating the two dataframes
df_unique = df_unique.withColumn("type", F.lit("unique"))
df_duplicates = df_duplicates.withColumn("type", F.lit("duplicated"))
df_combined = df_unique.unionByName(df_duplicates)
# unable to create the tag column
If I understood your question correctly, essentially you need to:
Tag the first row as unique
Tag all subsequent rows as duplicated if the values of all columns except column_1 and column_7 are the same
Let me know if this is not the case.
Using row_number: use all the compared columns as the partition key and generate a row number for each partition; rows that share the same set of column values land in the same partition and get consecutive row numbers. (You can use orderBy to control which row in each partition is marked unique, if that is a requirement.)
import pyspark.sql.functions as F
from pyspark.sql import Window

# collect every column except column_1 and column_7 into an array for comparison
df.withColumn("asArray", F.array(*[x for x in df.schema.names if x != "column_1" and x != "column_7"]))\
    .withColumn("rn", F.row_number().over(Window.partitionBy("asArray").orderBy(F.lit("dummy"))))\
    .withColumn("type", F.when(F.col("rn") == 1, "Unique").otherwise("Duplicated"))\
    .withColumn("tag", F.dense_rank().over(Window.orderBy(F.col("asArray"))))\
    .show(truncate=False)
I've collected the values of all columns to compare in an array to make it easy.
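If the deduplicated Dataframe_1 is also needed, a hedged follow-up sketch: assign the chained expression above (without the final .show()) to a variable, say df_tagged (a hypothetical name), then split it:

# df_tagged is assumed to hold the chained result above, minus the final .show()
dataframe_1 = df_tagged.filter(F.col("rn") == 1).drop("asArray", "rn", "type", "tag")  # unique rows only
dataframe_2 = df_tagged.drop("asArray", "rn")  # every row, with the type and tag columns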
I did something like this; here is the input:
import pyspark.sql.functions as F
from pyspark.sql import Window
data = [
{"column_1": 34432, "column_2": "apple", "column_3": "banana", "column_4": "mango", "column_5": "pine", "column_6": "lemon", "column_7": "j93jk84"},
{"column_1": 98389, "column_2": "grape", "column_3": "orange", "column_4": "pine", "column_5": "kiwi", "column_6": "cherry", "column_7": "j93jso3"},
{"column_1": 94749, "column_2": "apple", "column_3": "banana", "column_4": "mango", "column_5": "pine", "column_6": "lemon", "column_7": "ke03jfr"},
{"column_1": 48948, "column_2": "grape", "column_3": "banana", "column_4": "mango", "column_5": "pine", "column_6": "lemon", "column_7": "9jef3f4"},
{"column_1": 90493, "column_2": "pear", "column_3": "apricot", "column_4": "papaya", "column_5": "plum", "column_6": "lemon", "column_7": "93jd30d"},
{"column_1": 90843, "column_2": "grape", "column_3": "orange", "column_4": "pine", "column_5": "kiwi", "column_6": "cherry", "column_7": "03nd920"}
]
df = spark.createDataFrame(data)
IMO, for the first df you can use dropDuplicates, where you can pass a subset of columns:
firstDf = df.dropDuplicates(["column_2","column_3","column_4","column_5","column_6"])
For the second df you can do something like this:
windowSpec = Window.partitionBy("column_2","column_3","column_4","column_5","column_6").orderBy("column_1")
secondDf = (
    df.withColumn("row_number", F.row_number().over(windowSpec))
    .withColumn(
        "type",
        F.when(F.col("row_number") == F.lit(1), F.lit("unique")).otherwise(
            F.lit("duplicated")
        ),
    )
    .withColumn("tag", F.first("column_1").over(windowSpec))
    .withColumn("tag", F.dense_rank().over(Window.partitionBy().orderBy("tag")))
    .drop("row_number")
)
secondDf.show()
Output
+--------+--------+--------+--------+--------+--------+--------+----------+---+
|column_1|column_2|column_3|column_4|column_5|column_6|column_7| type|tag|
+--------+--------+--------+--------+--------+--------+--------+----------+---+
| 34432| apple| banana| mango| pine| lemon| j93jk84| unique| 1|
| 94749| apple| banana| mango| pine| lemon| ke03jfr|duplicated| 1|
| 48948| grape| banana| mango| pine| lemon| 9jef3f4| unique| 2|
| 90493| pear| apricot| papaya| plum| lemon| 93jd30d| unique| 3|
| 90843| grape| orange| pine| kiwi| cherry| 03nd920| unique| 4|
| 98389| grape| orange| pine| kiwi| cherry| j93jso3|duplicated| 4|
+--------+--------+--------+--------+--------+--------+--------+----------+---+

Pandas, combine unique value from two column into one column while preserving order

I have data in four columns as shown below. Some values are present in column 1, and some column 1 values are duplicated in column 3. I would like to combine column 1 with column 3 while removing the duplicates from column 3, and I would also like to preserve the column order. Column 1 is associated with column 2 and column 3 is associated with column 4, so it would be nice if the column 1 items stay with their column 2 values and the column 3 items stay with their column 4 values during the merge. Any help will be appreciated.
Input table:
Item   Price  Item   Price
Car    105    Truck  54822
Chair  20     Pen    1
Cup    2      Car    105
              Glass  1
Output table:
Item   Price
Car    105
Chair  20
Cup    2
Truck  54822
Pen    1
Glass  1
Thank you in advance.
After separating the input table into the left and right part, we can concatenate the left hand items with the unduplicated right hand items quite simply with boolean indexing:
import pandas as pd
# this initial section only recreates your sample input table
from io import StringIO
input = pd.read_table(StringIO("""| Item | Price | Item | Price |
|-------|-------|------|-------|
| Car | 105 | Truck| 54822 |
| Chair | 20 | Pen | 1 |
| Cup | 2 | Car | 105 |
| | | Glass| 1 |
"""), ' *\| *', engine='python', usecols=[1,2,3,4], skiprows=[1], keep_default_na=False)
input.columns = list(input.columns[:2])*2
# now separate the input table into the left and right part
left = input.iloc[:,:2].replace("", pd.NA).dropna().set_index('Item')
right = input.iloc[:,2:] .set_index('Item')
# finally construct the output table by concatenating without duplicates
output = pd.concat([left, right[~right.index.isin(left.index)]])
       Price
Item
Car      105
Chair     20
Cup        2
Truck  54822
Pen        1
Glass      1
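An alternative sketch using the same left and right frames: concatenate both halves and keep only the first occurrence of each Item (the left-hand rows come first, so they win ties):

output = (
    pd.concat([left, right])        # left rows first, then right rows
      .reset_index()                # turn the Item index back into a column
      .drop_duplicates(subset="Item", keep="first")
      .set_index("Item")
)
print(output)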

How to find the total length of a column value that has multiple values in different rows for another column

Is there a way to find the IDs that have both Apple and Strawberry and get their total count, as well as the count of IDs that have only Apple and of IDs that have only Strawberry?
df:
ID Fruit
0 ABC Apple <-ABC has Apple and Strawberry
1 ABC Strawberry <-ABC has Apple and Strawberry
2 EFG Apple <-EFG has Apple only
3 XYZ Apple <-XYZ has Apple and Strawberry
4 XYZ Strawberry <-XYZ has Apple and Strawberry
5 CDF Strawberry <-CDF has Strawberry
6 AAA Apple <-AAA has Apple only
Desired output:
Length of IDs that have Apple and Strawberry: 2
Length of IDs that have Apple only: 2
Length of IDs that have Strawberry only: 1
Thanks!
If the values in the Fruit column are always only Apple or Strawberry, you can compare sets per group and then count the IDs by summing the True values:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print (out)
2
EDIT: If there are many values:
s = df.groupby('ID')['Fruit'].agg(frozenset).value_counts()
print (s)
{Apple} 2
{Strawberry, Apple} 2
{Strawberry} 1
Name: Fruit, dtype: int64
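A hedged follow-up sketch for pulling the three requested counts out of the frozenset-indexed result s (converting to a plain dict avoids any ambiguity when indexing with set-like keys):

counts = s.to_dict()
both = counts.get(frozenset({"Apple", "Strawberry"}), 0)
apple_only = counts.get(frozenset({"Apple"}), 0)
strawberry_only = counts.get(frozenset({"Strawberry"}), 0)
print(f"Length of IDs that have Apple and Strawberry: {both}")
print(f"Length of IDs that have Apple only: {apple_only}")
print(f"Length of IDs that have Strawberry only: {strawberry_only}")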
You can use pivot_table and value_counts for DataFrames (pandas 1.1.0+):
df.pivot_table(index='ID', columns='Fruit', aggfunc='size', fill_value=0)\
.value_counts()
Output:
Apple  Strawberry
1      1             2
       0             2
0      1             1
Alternatively you can use:
df.groupby(['ID', 'Fruit']).size().unstack('Fruit', fill_value=0)\
.value_counts()

Presto SQL - distinct in selective group by

ID Name Gender Country
1 Arun Male India
2 Akshay Male England
3 Chethna Female India
4 Priya Female China
5 Piyush Male India
6 Arun Male England
7 Tisha Female England
8 Chethna Female China
I want to group them into male/female first, then the country associated.
Query1 : select Gender, count(distinct name) from Table group by Gender
Output:
Gender count(distinct name)
Male 3
Female 3
Copying the result into JSON like this:
result : {male : {count : 3}, female : {count : 3} }
Query2 : select Gender, Country, count(distinct name) from Table group by Gender, Country
Output:
Gender Country count(distinct name)
Male India 2
Male England 2
Female India 1
Female China 2
Female England 1
Adding this result to the above JSON:
result : {Male:{count:3,India:{count:2},England:{count:2}},Female:{count:3,India:{count:1},China:{count:2},England:{count:1}}}
So can I achieve this in a single query?
You can compute the counts by gender and by gender+country in a single query by using GROUPING SETS:
WITH data(id, name, gender, country) AS (
VALUES
(1, 'Arun', 'Male' , 'India'),
(2, 'Akshay', 'Male' , 'England'),
(3, 'Chethna', 'Female', 'India'),
(4, 'Priya', 'Female', 'China'),
(5, 'Piyush', 'Male' , 'India'),
(6, 'Arun', 'Male' , 'England'),
(7, 'Tisha', 'Female', 'England'),
(8, 'Chethna', 'Female', 'China'))
SELECT gender, country, count(distinct name)
FROM data
GROUP BY GROUPING SETS ((gender), (gender, country))
which produces:
gender | country | _col2
--------+---------+-------
Male | England | 2
Female | China | 2
Male | NULL | 3
Female | NULL | 3
Female | India | 1
Male | India | 2
Female | England | 1
(7 rows)
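The nested JSON can then be assembled from that single result set in the application layer. A minimal Python sketch, assuming the rows are fetched as (gender, country, count) tuples and that country itself is never NULL in the data (so NULL marks the gender-level subtotal):

rows = [
    ("Male", "England", 2), ("Female", "China", 2),
    ("Male", None, 3), ("Female", None, 3),
    ("Female", "India", 1), ("Male", "India", 2), ("Female", "England", 1),
]

result = {}
for gender, country, count in rows:
    bucket = result.setdefault(gender, {})
    if country is None:              # subtotal row from the (gender) grouping set
        bucket["count"] = count
    else:                            # detail row from the (gender, country) grouping set
        bucket[country] = {"count": count}

print(result)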

Copying data from multiple columns of source hive table to single column of target hive table in different rows

I have a requirement of copying data from one Hive source table to other target table. Below is source table structure with sample data:
source_table
Userid Name Phone1 Phone2 Phone3 Address1 Address2 Address3
123 Jitu 123456 987654 111111 DELHI GURGAON NOIDA
234 Mark 123456 987654 111111 UK USA IND
While copying data from source to target, my requirement is to collapse Phone1, Phone2, Phone3 together with the corresponding Address1, Address2 and Address3 columns into a single phone column and a single address column in the target table. Below is how the data should look in the target table:
Target_table
Userid Name Phone_no Address
123 Jitu 123456 DELHI
123 Jitu 987654 GURGAON
123 Jitu 111111 NOIDA
234 Mark 123456 UK
234 Mark 987654 USA
234 Mark 111111 IND
I know the simplest way to do this would be multiple inserts into the target table, one per phone/address column pair from the source table, using either Hive query language or Spark DataFrames.
Is there any other, more efficient method I can use to achieve this?
The original dataframe can be selected several times, once per column index, and the selected dataframes then combined into one with union:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(
(123, "Jitu", "123456", "987654", "111111", "DELHI", "GURGAON", "NOIDA"),
(234, "Mark", "123456", "987654", "111111", "UK", "USA", "IND")
).toDF(
"Userid", "Name", "Phone1", "Phone2", "Phone3", "Address1", "Address2", "Address3"
)
val columnIndexes = Seq(1, 2, 3)
val onlyOneIndexDfs = columnIndexes.map(idx =>
df.select(
$"Userid",
$"Name",
col(s"Phone$idx").alias("Phone_no"),
col(s"Address$idx").alias("Address")))
val result = onlyOneIndexDfs.reduce(_ union _)
Output:
+------+----+--------+-------+
|Userid|Name|Phone_no|Address|
+------+----+--------+-------+
|123 |Jitu|123456 |DELHI |
|123 |Jitu|111111 |NOIDA |
|123 |Jitu|987654 |GURGAON|
|234 |Mark|123456 |UK |
|234 |Mark|987654 |USA |
|234 |Mark|111111 |IND |
+------+----+--------+-------+
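A PySpark alternative sketch (assuming the same source data is loaded into a PySpark DataFrame, here also called df): the built-in stack expression unpivots the three phone/address pairs in a single select instead of a union of three:

unpivoted = df.selectExpr(
    "Userid",
    "Name",
    "stack(3, Phone1, Address1, Phone2, Address2, Phone3, Address3) AS (Phone_no, Address)",
)
unpivoted.show()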
Just in case you are interested in a Hive solution as well: lateral view yields a Cartesian product when joining multiple exploded arrays, so you can achieve the same results using posexplode and keeping only the matching positions, as shown below:
select Userid,Name,phone,address
from source_table
lateral view posexplode(array(Phone1,Phone2,Phone3)) valphone as x,phone
lateral view posexplode(array(Address1,Address2,Address3)) valaddress as t,address
where x=t
;
hive> set hive.cli.print.header=true;
userid name phone address
123 Jitu 123456 DELHI
123 Jitu 987654 GURGAON
123 Jitu 111111 NOIDA
234 Mark 123456 UK
234 Mark 987654 USA
234 Mark 111111 IND
Time taken: 2.759 seconds, Fetched: 6 row(s)
