Reading parquet files in GCP using wildcards in spark - apache-spark

I am trying to read parquet files using spark,
if I want to read the data for June, I'll do the following:
"gs://bucket/Data/year=2021/month=6/file.parquet"
if I want to read the data for all the months, I'll do the following:
"gs://bucket/Data/year=2021/month=6/file.parquet"
if I want to read the first two days of May:
"gs://bucket/Data/year=2021/month=5/day={1,2}file.parquet"
if I want to read November and December:
"gs://bucket/Data/year=2021/month={11,12}/file.parquet"
you get the idea... but what if I have a dictionary of month: days key-value pairs?
for example {1: [1,2,3], 4: [10,11,12,13]}, which means I need to read days [1,2,3] from January and days [10,11,12,13] from April. How would I reflect that as a wildcard in the path?
Thank you

You can pass a list of paths to DataFrameReader:
months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}
paths = [
f"gs://bucket/Data/year=2021/month={k}/day={{{','.join([str(d) for d in v])}}}/*.parquet"
for k, v in months_dict.items()
]
print(paths)
# ['gs://bucket/Data/year=2021/month=1/day={1,2,3}/*.parquet', 'gs://bucket/Data/year=2021/month=4/day={10,11,12,13}/*.parquet']
df = spark.read.parquet(*paths)
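If building the paths by hand gets unwieldy, a possible alternative (just a sketch, assuming the same year/month/day layout shown in the question) is to read the partitioned root and let Spark prune partitions from filters on the partition columns:
from pyspark.sql import functions as F

# read the partitioned root once; Spark discovers year/month/day as partition columns
df = spark.read.parquet("gs://bucket/Data")

# filters on partition columns are pushed down, so only the matching directories are scanned
df_filtered = df.where(
    (F.col("year") == 2021) & (
        ((F.col("month") == 1) & F.col("day").isin(1, 2, 3)) |
        ((F.col("month") == 4) & F.col("day").isin(10, 11, 12, 13))
    )
)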

Related

pandas groupby performance / combine 2 functions

I am learning python and trying to understand the best practices of data queries.
Here is some dummy data (customer sales) to test with:
import pandas as pd
df = pd.DataFrame({'Name':['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],'Amount':[3, 2, 5, 1, 10, 100], 'Date':["01.02.2022", "02.02.2022", "03.02.2022", "01.02.2022", "03.02.2022", "05.02.2022"]})
df.Date = pd.to_datetime(df.Date, format='%d.%m.%Y')
I want to investigate 2 kinds of queries:
1. How long has a person been our customer?
2. What is the period between the first and last purchase?
How can I run the first query without writing loops manually?
What I have done so far for the second part is this:
result = df.groupby("Name").max() - df.groupby("Name").min()
Is it possible to combine these two groupby queries into one to improve the performance?
P.S. I am trying to understand pandas and the key concepts of how to optimize queries. Different approaches and explanations are highly appreciated.
You can use GroupBy.agg with a custom function to get the difference between the max and min date.
df.groupby('Name')['Date'].agg(lambda x: x.max()-x.min())
As you already have datetime type, this will nicely yield a Timedelta object, which by default is shown as a string in the form 'x days'.
You can also save the GroupBy object in a variable and reuse it. This way, computation of the groups occurs only once:
g = df.groupby("Name")['Date']
g.max() - g.min()
output:
Name
bob 1 days
jack 4 days
tom 0 days
Name: Date, dtype: timedelta64[ns]
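If you prefer a plain integer over the 'x days' display, the Timedelta result can be converted, e.g.:
(g.max() - g.min()).dt.days
# Name
# bob     1
# jack    4
# tom     0
# Name: Date, dtype: int64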

Setting number format for floats when writing dataframe to Excel

I have a script producing multiple sheets for processing into a database, but I have strict number formats for certain columns in my dataframes.
I have created a sample dict based on the column headers and the number format required, and a sample df.
import pandas as pd
df_int_headers=['GrossRevenue', 'Realisation', 'NetRevenue']
df={'ID': [654398,456789],'GrossRevenue': [3.6069109,7.584326], 'Realisation': [1.5129510,3.2659478], 'NetRevenue': [2.0939599,4.3183782]}
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 4}
df=pd.DataFrame.from_dict(df)
def formatter(header):
    for key, value in df_formats.items():
        for head in header:
            return header.round(value).astype(str).astype(float)
df[df_int_headers] = df[df_int_headers].apply(formatter)
df.to_excel('test.xlsx',index=False)
When using the current code, all columns are returned with 3 d.p. in my Excel sheet, whereas I require a different format for each column.
Look forward to your replies.
Pass a dictionary to DataFrame.round. For your original key-value 'NetRevenue': 4, only 3 decimals are displayed because the rounded number ends in 0, which is dropped, so I use 5 here instead:
df={'ID': 654398,'GrossRevenue': 3.6069109,
'Realisation': 1.5129510, 'NetRevenue': 2.0939599}
df = pd.DataFrame(df, index=[0])
df_formats = {'GrossRevenue': 3, 'Realisation': 6, 'NetRevenue': 5}
df_int_headers = list(df_formats.keys())
df[df_int_headers] = df[df_int_headers].round(df_formats)
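If the requirement is an actual Excel number format (how the cells are displayed) rather than rounding the stored values, one possible sketch, assuming the xlsxwriter engine is available, is to set a column format while writing:
import pandas as pd

with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # turn each column's decimal count into an Excel format string, e.g. 3 -> '0.000'
    for i, col in enumerate(df.columns):
        if col in df_formats:
            fmt = workbook.add_format({'num_format': '0.' + '0' * df_formats[col]})
            worksheet.set_column(i, i, None, fmt)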

Dynamic Time Analysis using PySpark

Suppose we have a dataset with the following structure:
df = sc.parallelize([['a','2015-11-27', 1], ['a','2015-12-27',0], ['a','2016-01-29',0], ['b','2014-09-01', 1], ['b','2015-05-01', 1] ]).toDF(("user", "date", "category"))
What I want to analyze is the users' attributes with regard to their lifetime in months. For example, I want to sum up the column "category" for each month of a user's lifetime. For user 'a', this would look like:
output = sc.parallelize([['a',0, 1], ['a',1,0], ['a',2,0]]).toDF(("user", "user_lifetime_in_months", "sum(category)"))
What is the most efficient way in Spark to do that? E.g., window functions?
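A possible approach (a sketch, not guaranteed to be the most efficient) is a window to find each user's first date, then months_between to bucket rows into lifetime months, assuming date is a 'yyyy-MM-dd' string as above:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("user")

result = (
    df.withColumn("date", F.to_date("date"))
      # months elapsed since this user's first recorded date
      .withColumn("first_date", F.min("date").over(w))
      .withColumn("user_lifetime_in_months",
                  F.floor(F.months_between("date", "first_date")).cast("int"))
      .groupBy("user", "user_lifetime_in_months")
      .agg(F.sum("category").alias("sum(category)"))
)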

How to compare one file's data to other file's data using python?

I have 90 files in csv format which have data like this-
PID, STARTED,%CPU,%MEM,COMMAND
1,Wed Sep 12 10:10:21 2018, 0.0, 0.0,init
2,Wed Sep 12 10:10:21 2018, 0.0, 0.0,kthreadd
Now I have to check whether file2 has any data (PID, STARTED, %CPU, %MEM, COMMAND) repeated from file1 or not.
If file2 has repeated data, then pick the repeated rows with all their values (PID, COMMAND, STARTED, %CPU, %MEM) and store them in a separate file.
I have to do the same process for all 90 files.
My code (approach) is here. Please have a look:
file=open(r"Latest_27_02_2019.csv","r")
pidList=[]
pNameList=[]
memList=[]
startTimeList=[]
df=pd.read_csv(file)
pidList=df.index
df.columns = df.columns.str.strip()
pidList = df['PID']
pNameList=df['COMMAND']
memList=df['%MEM']
startTimeList=df['STARTED']
After that, compare one by one.
But since I have a large number of files, this will take more time and more iterations.
I have found that it can be done in an easier way with Python (the pandas library), but I don't know how. Please help me.
Here is a solution for comparing 2 files:
#read file1 into df1
#your header doesn't look great with the blanks, so I rename it
df1 = pd.read_csv('file1', sep=',', header=0, names=['PID','STARTED','%CPU','%MEM','COMMAND'])
#df1 is your first file, df2 the second
df_compare = df1.merge(df2.drop_duplicates(), on=['PID','STARTED','%CPU','%MEM','COMMAND'],
how='right', indicator=True)
print(df_compare)
#in result, you'll have a column '_merge' with both or right_only
#right_only means only in df2 and not in df1
#after that, you just filter:
mask = df_compare._merge == 'both'
df_compare = df_compare[mask].drop(['_merge'], axis=1)
#in df_compare you have the repeated rows from df2 and df1; you could reindex if you want
print(df_compare)
Or another solution (better, I think):
df_compare = df1[df1.index.isin(df1.merge(df2, how='inner', on=['PID','STARTED','%CPU','%MEM','COMMAND']).index)]
print(df_compare)
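To run this over all 90 files, a rough sketch (the glob pattern and output file names are hypothetical, adjust to your setup) could compare consecutive pairs in a loop:
import glob
import pandas as pd

cols = ['PID', 'STARTED', '%CPU', '%MEM', 'COMMAND']
files = sorted(glob.glob('*.csv'))  # hypothetical pattern for your 90 files

for f1, f2 in zip(files, files[1:]):
    df1 = pd.read_csv(f1, header=0, names=cols, skipinitialspace=True)
    df2 = pd.read_csv(f2, header=0, names=cols, skipinitialspace=True)
    # rows present in both files, i.e. repeated data
    repeated = df1.merge(df2.drop_duplicates(), on=cols, how='inner')
    repeated.to_csv('repeated_' + f2, index=False)  # hypothetical output naming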

pandas sampling from series based on distribution

I currently have this df (res unique values below) of strings and a distribution p =[0.5, 0.33, 0.12, 0.05]
vid res
v1 '1072X1920'
v2 '240X416'
v3 '360X640'
v4 '720X1280'
The series is about 5000+ rows and I need to sample 3000 videos with the above distribution. I know I can do this by splitting the df into 4 parts, one for each res, and using df.sample(int(p[i] * 3000)), like
df1072 = df[df['res'] == '1072X1920']
df1072 = df1072.sample(int(0.5 * 3000))
but is there a better way to do this? If I have 10 unique res then I would need to create 10 df in memory and that doesn't scale well. I was thinking np.random.choice() can help but not sure at the moment.
For example, use sample to shuffle your df into a random order, then use np.split:
import numpy as np

df = pd.DataFrame({'A': np.arange(100)})
n = len(df)
df = df.sample(n)
l = np.split(df, [int(0.5*n), int(0.83*n), int(0.95*n)])
Test:
list(map(len,l))
Out[1134]: [50, 33, 12, 5]
pd.concat(l).duplicated().any()
Out[1135]: False
For your example, you may need a groupby for loop:
d = {}
for y, x in df.groupby('res'):
    n = len(x)
    x = x.sample(n)
    l = np.split(x, [int(0.5*n), int(0.83*n), int(0.95*n)])
    d[y] = l
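If you want to draw the 3000 samples directly with the stated proportions instead of splitting each group, a small sketch (the mapping of p onto the res values below is my assumption) is to sample each group with its own n:
p = {'1072X1920': 0.5, '240X416': 0.33, '360X640': 0.12, '720X1280': 0.05}
total = 3000

# sample each res group with its share of the 3000 rows, then stitch them together
sampled = pd.concat(
    g.sample(n=int(round(p[res] * total)), random_state=0)
    for res, g in df.groupby('res')
)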
