Reading parquet files in GCP using wildcards in spark - apache-spark

I am trying to read parquet files using spark,
if I want to read the data for June, I'll do the following:
"gs://bucket/Data/year=2021/month=6/file.parquet"
if I want to read the data for all the months, I'll do the following:
"gs://bucket/Data/year=2021/month=6/file.parquet"
if I want to read the first two days of May:
"gs://bucket/Data/year=2021/month=5/day={1,2}file.parquet"
if I want to read November and December:
"gs://bucket/Data/year=2021/month={11,12}/file.parquet"
you get the idea... but what if I have a dictionary of month: days key-value pairs?
for example {1: [1,2,3], 4: [10,11,12,13]}, which means I need to read days [1,2,3] from January and days [10,11,12,13] from April. How would I reflect that as a wildcard in the path?
Thank you

You can pass a list of paths to DataFrameReader:
months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}
paths = [
f"gs://bucket/Data/year=2021/month={k}/day={{{','.join([str(d) for d in v])}}}/*.parquet"
for k, v in months_dict.items()
]
print(paths)
# ['gs://bucket/Data/year=2021/month=1/day={1,2,3}/*.parquet', 'gs://bucket/Data/year=2021/month=4/day={10,11,12,13}/*.parquet']
df = spark.read.parquet(*paths)
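If building the paths by hand gets unwieldy, a possible alternative (just a sketch, assuming the same year/month/day layout shown in the question) is to read the partitioned root and let Spark prune partitions from filters on the partition columns:
from pyspark.sql import functions as F

# read the partitioned root once; Spark discovers year/month/day as partition columns
df = spark.read.parquet("gs://bucket/Data")

# filters on partition columns are pushed down, so only the matching directories are scanned
df_filtered = df.where(
    (F.col("year") == 2021) & (
        ((F.col("month") == 1) & F.col("day").isin(1, 2, 3)) |
        ((F.col("month") == 4) & F.col("day").isin(10, 11, 12, 13))
    )
)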

Related

pandas groupby performance / combine 2 functions

I am learning python and trying to understand the best practices of data queries.
Here is some dummy data (customer sales) to test with:
import pandas as pd
df = pd.DataFrame({'Name':['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],'Amount':[3, 2, 5, 1, 10, 100], 'Date':["01.02.2022", "02.02.2022", "03.02.2022", "01.02.2022", "03.02.2022", "05.02.2022"]})
df.Date = pd.to_datetime(df.Date, format='%d.%m.%Y')
I want to investigate 2 kinds of queries:
1. How long has a person been our customer?
2. What is the period between the first and last purchase?
How can I run the first query without writing loops manually?
What I have done so far for the second part is this:
result = df.groupby("Name").max() - df.groupby("Name").min()
Is it possible to combine these two groupby queries into one to improve the performance?
P.S. I am trying to understand pandas and the key concepts of how to optimize queries. Different approaches and explanations are highly appreciated.
You can use GroupBy.agg with a custom function to get the difference between the max and min date.
df.groupby('Name')['Date'].agg(lambda x: x.max()-x.min())
As you already have datetime type, this will nicely yield a Timedelta object, which by default is shown as a string in the form 'x days'.
You can also save the GroupBy object in a variable and reuse it. This way, computation of the groups occurs only once:
g = df.groupby("Name")['Date']
g.max() - g.min()
output:
Name
bob 1 days
jack 4 days
tom 0 days
Name: Date, dtype: timedelta64[ns]
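If you prefer a plain integer over the 'x days' display, the Timedelta result can be converted, e.g.:
(g.max() - g.min()).dt.days
# Name
# bob     1
# jack    4
# tom     0
# Name: Date, dtype: int64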

Setting number format for floats when writing dataframe to Excel

I have a script producing multiple sheets for processing into a database, but I have strict number formats for certain columns in my dataframes.
I have created a sample dict based on the column headers and the number format required, and a sample df.
import pandas as pd
df_int_headers=['GrossRevenue', 'Realisation', 'NetRevenue']
df={'ID': [654398,456789],'GrossRevenue': [3.6069109,7.584326], 'Realisation': [1.5129510,3.2659478], 'NetRevenue': [2.0939599,4.3183782]}
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}
df=pd.DataFrame.from_dict(df)
def formatter(header):
    for key, value in df_formats.items():
        for head in header:
            return header.round(value).astype(str).astype(float)
df[df_int_headers] = df[df_int_headers].apply(formatter)
df.to_excel('test.xlsx',index=False)
When using the current code, all columns are returned with 3 d.p. in my Excel sheet, whereas I require a different format for each column.
Look forward to your replies.
Pass a dictionary to DataFrame.round. For your original key-value 'NetRevenue': 4, only 3 decimals are displayed because the rounded number ends in 0, which is dropped, so I use 5 here instead:
df={'ID': 654398,'GrossRevenue': 3.6069109,
'Realisation': 1.5129510, 'NetRevenue': 2.0939599}
df = pd.DataFrame(df, index=[0])
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 5}
df_int_headers = list(df_formats.keys())
df[df_int_headers] = df[df_int_headers].round(df_formats)
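If the requirement is an actual Excel number format (how the cells are displayed) rather than rounding the stored values, one possible sketch, assuming the xlsxwriter engine is available, is to set a column format while writing:
import pandas as pd

with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # turn each column's decimal count into an Excel format string, e.g. 3 -> '0.000'
    for i, col in enumerate(df.columns):
        if col in df_formats:
            fmt = workbook.add_format({'num_format': '0.' + '0' * df_formats[col]})
            worksheet.set_column(i, i, None, fmt)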

Dynamic Time Analysis using PySpark

Suppose we have a dataset with the following structure:
df = sc.parallelize([['a','2015-11-27', 1], ['a','2015-12-27',0], ['a','2016-01-29',0], ['b','2014-09-01', 1], ['b','2015-05-01', 1] ]).toDF(("user", "date", "category"))
What I want to analyze is the users' attributes with regard to their lifetime in months. For example, I want to sum up the column "category" for each month of a user's lifetime. For user 'a', this would look like:
output = sc.parallelize([['a',0, 1], ['a',1,0], ['a',2,0]]).toDF(("user", "user_lifetime_in_months", "sum(category)"))
What is the most efficient way in Spark to do that? E.g., window functions?
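A possible approach (a sketch, not guaranteed to be the most efficient) is a window to find each user's first date, then months_between to bucket rows into lifetime months, assuming date is a 'yyyy-MM-dd' string as above:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("user")

result = (
    df.withColumn("date", F.to_date("date"))
      # months elapsed since this user's first recorded date
      .withColumn("first_date", F.min("date").over(w))
      .withColumn("user_lifetime_in_months",
                  F.floor(F.months_between("date", "first_date")).cast("int"))
      .groupBy("user", "user_lifetime_in_months")
      .agg(F.sum("category").alias("sum(category)"))
)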

How to compare one file's data to other file's data using python?

I have 90 files in csv format which have data like this-
PID, STARTED,%CPU,%MEM,COMMAND
1,Wed Sep 12 10:10:21 2018, 0.0, 0.0,init
2,Wed Sep 12 10:10:21 2018, 0.0, 0.0,kthreadd
Now I have to check whether file2 has any data (PID, STARTED, %CPU, %MEM, COMMAND) repeated from file1 or not.
If file2 has repeated data, then pick the repeated rows with all their values (PID, COMMAND, STARTED, %CPU, %MEM) and store them in a separate file.
I have to do the same process for all 90 files.
My code (approach) is here. Please have a look:
file=open(r"Latest_27_02_2019.csv","r")
pidList=[]
pNameList=[]
memList=[]
startTimeList=[]
df=pd.read_csv(file)
pidList=df.index
df.columns = df.columns.str.strip()
pidList = df['PID']
pNameList=df['COMMAND']
memList=df['%MEM']
startTimeList=df['STARTED']
After that, compare one by one.
But since I have a large number of files, this will take more time and more iterations.
I have found that it can be done in an easier way with Python (the pandas library), but I don't know how. Please help me.
Here is a solution for comparing 2 files:
#read file1 into df1
#your header doesn't look great with the blanks, so I rename it
df1 = pd.read_csv('file1', sep=',', header=0, names=['PID','STARTED','%CPU','%MEM','COMMAND'])
#df1 is your first file, df2 the second
df_compare = df1.merge(df2.drop_duplicates(), on=['PID','STARTED','%CPU','%MEM','COMMAND'],
how='right', indicator=True)
print(df_compare)
#in result, you'll have a column '_merge' with both or right_only
#right_only means only in df2 and not in df1
#after that, you just filter:
mask = df_compare._merge == 'both'
df_compare = df_compare[mask].drop(['_merge'], axis=1)
#in df_compare you have the repeated rows from df2 and df1; you could reindex if you want
print(df_compare)
Or another solution (better, I think):
df_compare = df1[df1.index.isin(df1.merge(df2, how='inner', on=['PID','STARTED','%CPU','%MEM','COMMAND']).index)]
print(df_compare)
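To run this over all 90 files, a rough sketch (the glob pattern and output file names are hypothetical, adjust to your setup) could compare consecutive pairs in a loop:
import glob
import pandas as pd

cols = ['PID', 'STARTED', '%CPU', '%MEM', 'COMMAND']
files = sorted(glob.glob('*.csv'))  # hypothetical pattern for your 90 files

for f1, f2 in zip(files, files[1:]):
    df1 = pd.read_csv(f1, header=0, names=cols, skipinitialspace=True)
    df2 = pd.read_csv(f2, header=0, names=cols, skipinitialspace=True)
    # rows present in both files, i.e. repeated data
    repeated = df1.merge(df2.drop_duplicates(), on=cols, how='inner')
    repeated.to_csv('repeated_' + f2, index=False)  # hypothetical output naming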

pandas sampling from series based on distribution

I currently have this df (res unique values below) of strings and a distribution p =[0.5, 0.33, 0.12, 0.05]
vid res
v1 '1072X1920'
v2 '240X416'
v3 '360X640'
v4 '720X1280'
The series is about 5000+ rows and I need to sample 3000 videos with the above distribution. I know I can do this by splitting the df into 4 parts, one for each res, and using df.sample(int(p[i] * 3000)), like
df1072 = df[df['res'] == '1072X1920']
df1072 = df1072.sample(int(0.5 * 3000))
but is there a better way to do this? If I have 10 unique res then I would need to create 10 df in memory and that doesn't scale well. I was thinking np.random.choice() can help but not sure at the moment.
For example, use sample to shuffle your df into a random order, then use np.split:
import numpy as np

df = pd.DataFrame({'A': np.arange(100)})
n = len(df)
df = df.sample(n)
l = np.split(df, [int(0.5*n), int(0.83*n), int(0.95*n)])
Test:
list(map(len,l))
Out[1134]: [50, 33, 12, 5]
pd.concat(l).duplicated().any()
Out[1135]: False
For your example, you may need a groupby for loop:
d = {}
for y, x in df.groupby('res'):
    n = len(x)
    x = x.sample(n)
    l = np.split(x, [int(0.5*n), int(0.83*n), int(0.95*n)])
    d[y] = l
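If you want to draw the 3000 samples directly with the stated proportions instead of splitting each group, a small sketch (the mapping of p onto the res values below is my assumption) is to sample each group with its own n:
p = {'1072X1920': 0.5, '240X416': 0.33, '360X640': 0.12, '720X1280': 0.05}
total = 3000

# sample each res group with its share of the 3000 rows, then stitch them together
sampled = pd.concat(
    g.sample(n=int(round(p[res] * total)), random_state=0)
    for res, g in df.groupby('res')
)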
