I am trying to write a Spark job in Python. I have two CSV files containing the following information:
File-1) product_prices.csv
product1 10
product2 20
product3 30
File-2) Sales_information.csv
id buyer transaction_date seller sales_data
1 buyer1 2015-1-01 seller1 {"product1":12,"product2":44}
2 buyer2 2015-1-01 seller3 {"product2":12}
3 buyer1 2015-1-01 seller3 {"product3":60,"product1":42}
4 buyer3 2015-1-01 seller2 {"product2":9,"product3":2}
5 buyer3 2015-1-01 seller1 {"product2":8}
Now, on the above two files, I want to run a Spark job that computes two things and writes the results to CSV files:
1) Total sales for each seller, written to a total_sellers_sales.csv file as
`seller_id total_sales`
`seller1 1160`
2) The buyers list for each seller, written to sellers_buyers_list.csv as follows:
seller_id buyers
seller1 buyer1, buyer3
So can anyone tell me the correct way to write such a Spark job?
Note: I need the code in Python.
Here is my PySpark code in Zeppelin 0.7.2.
First, I created your sample DataFrames manually:
from pyspark.sql.functions import *
from pyspark.sql import functions as F
products = [ ("product1", 10), ("product2",20), ("product3",30)]
dfProducts = sqlContext.createDataFrame(products, ['product', 'price'])
sales = [(1, "buyer1", "seller1", "product1", 12), (1, "buyer1", "seller1", "product2", 44),
         (2, "buyer2", "seller3", "product2", 12), (3, "buyer1", "seller3", "product3", 60),
         (3, "buyer1", "seller3", "product1", 42), (4, "buyer3", "seller2", "product2", 9),
         (4, "buyer3", "seller2", "product3", 2), (5, "buyer3", "seller1", "product2", 8)]
dfSales= sqlContext.createDataFrame(sales, ['id', 'buyer', 'seller','product','countVal'])
Total sales for each seller:
dfProducts.alias('p') \
    .join(dfSales.alias('s'), col('p.product') == col('s.product')) \
    .groupBy('s.seller') \
    .agg(F.sum(dfSales.countVal * dfProducts.price)) \
    .show()
Output: the total sales per seller (1160 for seller1, matching the expected result above).
Buyers list for each seller:
dfSales.groupBy("seller").agg(F.collect_set("buyer")).show()
Output: Buyers list for each seller
You can save the results as CSV using the df.write.csv('filename.csv') method.
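If you want to start from the two CSV files themselves rather than from hand-built DataFrames, here is a minimal sketch. It assumes Spark 2.1+ (for from_json), space-delimited files with a header row in Sales_information.csv, and a spark session being available (as in Zeppelin with Spark 2.x); the paths, delimiters and the coalesce(1) trick for a single output file are all assumptions to adjust:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

dfProducts = (spark.read.csv("product_prices.csv", sep=" ")
              .toDF("product", "price")
              .withColumn("price", F.col("price").cast("int")))
dfSalesRaw = spark.read.csv("Sales_information.csv", sep=" ", header=True)

# parse the JSON sales_data column into a map, then explode it into one (product, countVal) row per product
dfSales = (dfSalesRaw
           .withColumn("sales_map", F.from_json("sales_data", MapType(StringType(), IntegerType())))
           .select("id", "buyer", "seller", F.explode("sales_map"))
           .withColumnRenamed("key", "product")
           .withColumnRenamed("value", "countVal"))

totals = (dfSales.join(dfProducts, "product")
          .groupBy("seller")
          .agg(F.sum(F.col("countVal") * F.col("price")).alias("total_sales")))
buyers = (dfSales.groupBy("seller")
          .agg(F.concat_ws(", ", F.collect_set("buyer")).alias("buyers")))

# each write produces a directory of part files; coalesce(1) keeps it to a single CSV part
totals.coalesce(1).write.mode("overwrite").csv("total_sellers_sales.csv", header=True)
buyers.coalesce(1).write.mode("overwrite").csv("sellers_buyers_list.csv", header=True)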
Hope this helps.
Suppose we have a very large table that we'd like to process statistics for incrementally.
Date        Amount  Customer
2022-12-20  30      Mary
2022-12-21  12      Mary
2022-12-20  12      Bob
2022-12-21  15      Bob
2022-12-22  15      Alice
We'd like to be able to calculate incrementally how much we made per distinct customer for a date range. So from 12-20 to 12-22 (inclusive), we'd have 3 distinct customers, but 12-20 to 12-21 there are 2 distinct customers.
If we want to run this pipeline once a day and there are many customers, how can we keep a rolling count of distinct customers for an arbitrary date range? Is there a way to do this without storing a huge list of customer names for each day?
We'd like to support a frontend that has a date range filter and can quickly calculate results for that date range. For example:
Start Date  End Date    Average Income Per Customer
2022-12-20  2022-12-21  (30+12+12+15)/2 = 34.5
2022-12-20  2022-12-22  (30+12+12+15+15)/3 = 28
The only approach I can think of is to store a set of customer names for each day, and when viewing the results calculate the size of the joined set of sets to calculate distinct customers. This seems inefficient. In this case we'd store the following table, with the customer column being extremely large.
Date        Total Income  Customers
2022-12-20  42            set(Mary, Bob)
2022-12-21  27            set(Mary, Bob)
2022-12-22  15            set(Alice)
For me the best solution is to do some pre-calculation on the existing data; then, for the new data that comes in every day, run the calculation only on the new data and add the results to the previously calculated data. Also partition on the date column, since we filter on dates: that lets Spark push down the filters and accelerates your queries.
There are two parts: one to get the summed amount between two dates, and another for the distinct customers between two dates.
For the amount, use a prefix sum: each day stores the sum of all previous days plus its own. Then, to get the total between two dates, you only need to subtract the values stored for those two days instead of looping over all the dates in between.
For the distinct customers, the best approach I can think of is to save the date and customer columns in a separate file, partitioned by date (which helps optimize the queries), and then use the fast approx_count_distinct.
Here's some code:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, sum, lag, lit, approx_count_distinct

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = [
    ["2022-12-20", 30, "Mary"],
    ["2022-12-21", 12, "Mary"],
    ["2022-12-20", 12, "Bob"],
    ["2022-12-21", 15, "Bob"],
    ["2022-12-22", 15, "Alice"],
]
df = spark.createDataFrame(data).toDF("Date", "Amount", "Customer")

def init_amount_data(df):
    # prefix sum: each day carries the running total of all previous days plus its own
    w = Window.orderBy(col("Date"))
    amount_sum_df = df.groupby("Date").agg(sum("Amount").alias("Amount")) \
        .withColumn("amount_sum", sum(col("Amount")).over(w)) \
        .withColumn("prev_amount_sum", lag("amount_sum", 1, 0).over(w)) \
        .select("Date", "amount_sum", "prev_amount_sum")
    amount_sum_df.write.mode("overwrite").partitionBy("Date").parquet("./path/amount_data_df")
    amount_sum_df.show(truncate=False)

# keep only the customer data to avoid unnecessary data when querying; partitioning by Date
# makes the queries faster thanks to Spark's filter push-down mechanism
def init_customers_data(df):
    df.select("Date", "Customer").write.mode("overwrite").partitionBy("Date").parquet("./path/customers_data_df")

# each day (for example at midnight) update the amount data with yesterday's data only:
# take the last amount_sum and add yesterday's amount to it
def update_amount_data(last_partition):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    maxDate = getMaxDate("./path/amount_data_df")  # implement a Hadoop method to get the last partition date
    lastMaxPartition = amountDataDf.filter(col("Date") == maxDate)
    lastPartitionAmountSum = lastMaxPartition.select("amount_sum").first()[0]
    yesterday_amount_sum = last_partition.groupby("Date").agg(sum("Amount").alias("amount_sum"))
    newPartition = yesterday_amount_sum.withColumn("amount_sum", col("amount_sum") + lastPartitionAmountSum) \
        .withColumn("prev_amount_sum", lit(lastPartitionAmountSum))
    newPartition.write.mode("append").partitionBy("Date").parquet("./path/amount_data_df")

def update_customers_data(last_partition):
    last_partition.select("Date", "Customer").write.mode("append").partitionBy("Date").parquet("./path/customers_data_df")

def query_amount_date(beginDate, endDate):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    endDateAmount = amountDataDf.filter(col("Date") == endDate).select("amount_sum").first()[0]
    beginDateAmount = amountDataDf.filter(col("Date") == beginDate).select("prev_amount_sum").first()[0]
    return endDateAmount - beginDateAmount

def query_customers_date(beginDate, endDate):
    customersDataDf = spark.read.parquet("./path/customers_data_df")
    distinct_customers_nb = customersDataDf.filter(col("Date").between(lit(beginDate), lit(endDate))) \
        .agg(approx_count_distinct(col("Customer")).alias("distinct_customers")).first()[0]
    return distinct_customers_nb

# This should be executed the first time only
init_amount_data(df)
init_customers_data(df)

# This should be executed every day at midnight, with the data of the last day only
last_day_partition = df.filter(col("Date") == yesterday_date)  # yesterday_date provided by your scheduler
update_amount_data(last_day_partition)
update_customers_data(last_day_partition)

# Optimized queries, to be executed with the date range coming from the frontend
beginDate = "2022-12-20"
endDate = "2022-12-22"
answer = query_amount_date(beginDate, endDate) / query_customers_date(beginDate, endDate)
print(answer)
If calculating the distinct customers is not fast enough, there is another approach: keep the same kind of running pre-computation, but for the count of distinct customers, together with a second table of the customers already seen. Each day, if a new customer appears, increment the count in the first table and add that customer to the second table; otherwise do nothing. See the sketch below.
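A minimal sketch of that idea (the seen_customers and distinct_count tables, their paths and column names are assumptions layered on top of the code above):
from pyspark.sql import functions as F

SEEN_PATH = "./path/seen_customers"     # one column: Customer
COUNT_PATH = "./path/distinct_count"    # columns: Date, distinct_customers_so_far

def update_distinct_customers(last_partition, yesterday_date):
    seen = spark.read.parquet(SEEN_PATH)
    # customers of the new day that were never seen before (left anti join)
    new_customers = (last_partition.select("Customer").distinct()
                     .join(seen, on="Customer", how="left_anti"))
    nb_new = new_customers.count()
    if nb_new > 0:
        new_customers.write.mode("append").parquet(SEEN_PATH)
    # previous running count + number of genuinely new customers
    prev = (spark.read.parquet(COUNT_PATH)
            .orderBy(F.col("Date").desc())
            .select("distinct_customers_so_far").first()[0])
    spark.createDataFrame([(yesterday_date, prev + nb_new)],
                          ["Date", "distinct_customers_so_far"]) \
        .write.mode("append").parquet(COUNT_PATH)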
Finally, there are some tricks for optimizing the groupBy or window functions, such as salting or extended partitioning; a small salting sketch follows.
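As a generic illustration of salting (not tied to the tables above, and the bucket count is arbitrary): pre-aggregate on (key, salt) so that a very frequent key is split across several tasks, then aggregate again without the salt:
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # illustrative value

salted_sum = (df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
                .groupBy("Date", "salt").agg(F.sum("Amount").alias("partial_sum"))
                .groupBy("Date").agg(F.sum("partial_sum").alias("Amount")))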
You can achieve this by filtering the rows with dates between start_date and end_date, grouping by customer, calculating the sum of amounts per customer, and then taking the average of those sums. This approach works for only one start_date/end_date pair, so you would rerun the code with different parameters for other date ranges.
from pyspark.sql import functions as F

start_date = '2022-12-20'
end_date = '2022-12-21'
(
    df
    .withColumn('isInRange', F.col('date').between(start_date, end_date))
    .filter(F.col('isInRange'))
    .groupby('customer')
    .agg(F.sum('amount').alias('sum'))       # total per customer
    .agg(F.avg('sum').alias('avg income'))   # average of those per-customer totals
).show()
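If you need the result for several ranges (as the question's date-range filter suggests), one option is to wrap the same logic in a small function; this is just a sketch and the function name is illustrative:
def avg_income_per_customer(df, start_date, end_date):
    # same logic as above, parameterized by the date range
    return (df
            .filter(F.col('date').between(start_date, end_date))
            .groupby('customer')
            .agg(F.sum('amount').alias('customer_total'))
            .agg(F.avg('customer_total').alias('avg_income'))
            .first()[0])

print(avg_income_per_customer(df, '2022-12-20', '2022-12-22'))  # 28.0 on the sample data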
I have a dataframe - df1
Marketing          Sales       IT
marketing manager  sales lead  software eng
marketing spec     sales mgr   data scientist
And another dataframe - df2 -
Profession  Job Title
IT          data science manager
Marketing   Marketing manager
I need to fuzzy match all the job titles from dataframe 2 (df2) against all the rows and columns in df1 and find the fuzzy score of each element.
Sample output, only for "data science manager" (I need to create such a table for every Job Title in df2 and find the column with the maximum score above 90, so as to classify the job title into a profession):
Marketing  Sales  IT
50         0      30
0          0      91
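A minimal sketch of the scoring step, using only the standard library's difflib (thefuzz or rapidfuzz would be drop-in alternatives; note these scores are SequenceMatcher ratios scaled to 0-100, so the exact numbers will differ from the sample output above):
import difflib
import pandas as pd

df1 = pd.DataFrame({
    "Marketing": ["marketing manager", "marketing spec"],
    "Sales": ["sales lead", "sales mgr"],
    "IT": ["software eng", "data scientist"],
})

def fuzzy_score(a, b):
    # similarity on a 0-100 scale, case-insensitive
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())

def score_table(job_title, df):
    # one score per cell, same shape as df1
    return df.applymap(lambda cell: fuzzy_score(job_title, cell))

scores = score_table("data science manager", df1)
print(scores)
# classify into the profession (column) whose best score clears the threshold, if any
best_per_column = scores.max()
print(best_per_column.idxmax() if best_per_column.max() >= 90 else "no column above 90")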
I have two different data frames pertaining to sales analytics. I would like to merge them together to make a new data frame with the columns customer_id, name, and total_spend. The two data frames are as follows:
import pandas as pd
import numpy as np
customers = pd.DataFrame(
    [[100, 'Prometheus Barwis', 'prometheus.barwis#me.com', '(533) 072-2779'],
     [101, 'Alain Hennesey', 'alain.hennesey#facebook.com', '(942) 208-8460'],
     [102, 'Chao Peachy', 'chao.peachy#me.com', '(510) 121-0098'],
     [103, 'Somtochukwu Mouritsen', 'somtochukwu.mouritsen#me.com', '(669) 504-8080'],
     [104, 'Elisabeth Berry', 'elisabeth.berry#facebook.com', '(802) 973-8267']],
    columns=['customer_id', 'name', 'email', 'phone'])
orders = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26],
     [1003, 100, 194.6], [1004, 100, 307.72], [1005, 101, 36.69],
     [1006, 104, 39.59], [1007, 104, 430.94], [1008, 103, 31.4],
     [1009, 104, 180.69], [1010, 102, 383.35], [1011, 101, 256.2],
     [1012, 103, 930.56], [1013, 100, 423.77], [1014, 101, 309.53],
     [1015, 102, 299.19]],
    columns=['order_id', 'customer_id', 'order_total'])
When I group by customer_id and order_id I get the following table:
customer_id order_id order_total
100 1000 144.82
1001 140.93
1003 194.60
1004 307.72
1013 423.77
101 1005 36.69
1011 256.20
1014 309.53
102 1002 104.26
1010 383.35
1015 299.19
103 1008 31.40
1012 930.56
104 1006 39.59
1007 430.94
1009 180.69
This is where I get stuck. I do not know how to sum up all of the orders for each customer_id in order to make a total_spent column. If anyone knows of a way to do this it would be much appreciated!
IIUC, you can do something like below
orders.groupby('customer_id')['order_total'].sum().reset_index(name='Customer_Total')
Output
customer_id Customer_Total
0 100 1211.84
1 101 602.42
2 102 786.80
3 103 961.96
4 104 651.22
You can create an additional table then merge back to your current output.
# group by customer id and order id to match your current output
df = orders.groupby(['customer_id', 'order_id']).sum()
# create a new lookup table called total by customer
totalbycust = orders.groupby('customer_id').sum()
totalbycust = totalbycust.reset_index()
# only keep the columns you want
totalbycust = totalbycust[['customer_id', 'order_total']]
# merge back to your current table
df = df.merge(totalbycust, left_on='customer_id', right_on='customer_id')
df = df.rename(columns = {"order_total_x": "order_total", "order_total_y": "order_amount_by_cust"})
# expected output
df
df_merge = customers.merge(orders, how='left', left_on='customer_id', right_on='customer_id').filter(['customer_id','name','order_total'])
df_merge = df_merge.groupby(['customer_id','name']).sum()
df_merge = df_merge.rename(columns={'order_total':'total_spend'})
df_merge.sort_values(['total_spend'], ascending=False)
Results in:
total_spend
customer_id name
100 Prometheus Barwis 1211.84
103 Somtochukwu Mouritsen 961.96
102 Chao Peachy 786.80
104 Elisabeth Berry 651.22
101 Alain Hennesey 602.42
A step-by-step explanation:
Start by merging your orders table onto your customers table using a left join. For this you will need pandas' .merge() method. Be sure to set the how argument to left because the default merge type is inner (which would ignore customers with no orders).
This step requires some basic understanding of SQL-style merge methods. You can find a good visual overview of the various merge types in this thread.
You can chain the .filter() method onto your merge to keep only your columns of interest (in your case: customer_id, name and order_total).
Now that you have your merged table, we still need to sum up all the order_total values per customer. To achieve this we need to group all non-numeric columns using .groupby() and then apply an aggregation method on the remaining numeric columns (.sum() in this case).
The .groupby() documentation link above provides some more examples on this. It is also worth knowing that this is a pattern referred to as "split-apply-combine" in the pandas documentation.
Next you will need to rename your numeric column from order_total to total_spend using the .rename() method and setting its columns argument.
And last, but not least, sort your customers by your total_spend column using .sort_values().
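For reference, the same steps can also be condensed into a single chain with pandas named aggregation (pandas 0.25+); it should reproduce the table above:
result = (customers
          .merge(orders, on='customer_id', how='left')
          .groupby(['customer_id', 'name'], as_index=False)
          .agg(total_spend=('order_total', 'sum'))
          .sort_values('total_spend', ascending=False))
print(result)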
I hope that helps.
I have two csv files. Contacts and Users.
How I load data into dataframes and merge them
First, I load a dataframe with the name of the users:
import pandas as pd
import numpy as np
df_users= pd.read_csv('./Users_001.csv',sep=',',usecols=[0,2,3])
Then I load the information from contacts of each user
df_contacts = pd.read_csv('./Contacts_001.csv',sep=',',usecols=[0,1,5,48,55,56,57,83,58])
The df_users column names are: user_id, Name, Surname
The df_contacts column names are: Contact ID, id user owner, fullname, qualification, ...
I want to merge both dataframes using user_id and 'id user owner', since they represent the same information. To do this, I first change the column names on df_contacts and then merge:
dfcontactos.columns = ['ID de Contacto','user_id','fullname','qualification','accesibility' ... ]
df_us_cont = pd.merge(dfcontactos,df_usuarios,on='user_id')
Now df_us_cont has the information from users and contacts.
What I want to do
There are only 18 user_ids but there are 500 contacts. For each user I want to know:
Number of contacts with qualification < 100
For the contacts that have qualification < 100: how many have accesibility >= 4 (accesibility is a discrete number, 0-5)
Number of contacts with qualification > 100 and < 300
Number of contacts with qualification > 300
What I have tried and where it fails
df_qua_lower100 = df_us_cont[df_us_cont['qualification']<100]
df_qua_lower100['user_id'].value_counts()
So far with this I am able to get, for each user_id, how many contacts have qualification < 100. But I am unable to find out how many of those also have accesibility >= 4.
I have tried to explain this as best I could.
First of all, you can merge without changing the column names:
df_us_cont = dfcontactos.merge(df_usuarios, left_on='id user owner', right_on='user_id')
You can add as many conditions as you want if you use loc
df_us_cont.loc[(df_us_cont['qualification']<100) & (df_us_cont['accesibility']>=4),'user_id'].value_counts()
Number of contacts with qualification > 100 and < 300:
df_us_cont.loc[(df_us_cont['qualification']>100) & (df_us_cont['qualification']<300),'user_id'].value_counts()
Number of contacts with qualification > 300:
df_us_cont.loc[df_us_cont['qualification']>300,'user_id'].value_counts()
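If you would rather get all of these counts per user in one pass, here is a sketch built on the merged df_us_cont from above (the helper column names are illustrative):
# boolean helper columns; summing booleans per user gives the counts
summary = (
    df_us_cont
    .assign(
        qual_lt_100=df_us_cont['qualification'] < 100,
        qual_lt_100_acc_ge_4=(df_us_cont['qualification'] < 100) & (df_us_cont['accesibility'] >= 4),
        qual_100_to_300=(df_us_cont['qualification'] > 100) & (df_us_cont['qualification'] < 300),
        qual_gt_300=df_us_cont['qualification'] > 300,
    )
    .groupby('user_id')[['qual_lt_100', 'qual_lt_100_acc_ge_4', 'qual_100_to_300', 'qual_gt_300']]
    .sum()
)
print(summary)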
I have two files customer and sales like below
Customer :
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both the files are \t delimited.
I want to join both files based on cu_id from customer and sa_id from sales, using PySpark without using Spark SQL/DataFrames.
Your help is very much appreciated.
You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
# key each RDD by its first tab-separated field (cu_id / sa_id) and keep the rest of the row as the value
customerRDD = sc.textFile("customers.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
salesRDD = sc.textFile("sales.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
# inner join on the key: only ids present in both files are kept
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD that contains only the joined records from both the customer and sales files.
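As a small follow-up, each joined element has the shape (key, (customer_value, sales_value)); assuming the header lines have been filtered out beforehand, one way to flatten the result back into tab-delimited lines and save it is:
# e.g. ('2', ('Raghu\tSE\tHYD\tTS', '100000\tIND')) for cu_id/sa_id 2
flatRDD = joinedRDD.map(lambda kv: "\t".join([kv[0], kv[1][0], kv[1][1]]))
flatRDD.saveAsTextFile("joined_output")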