How to groupby after partitioning in pyspark?

How to groupby after partitioning in pyspark? - apache-spark

I have a series of a graph edges in the format of src_node dst_node edge_weigth. I want to to read them from a .txt file and then use them for some algorithms. But before applying the algorithms, I need to clean up and group the data. Also I want to use only RDD's.
The problem with the the input data is that it has loops, first I want to filter the loops out and then re-partition the RDD to 8 partitions. However want to have access to each nodes outgoing edges etc, so for each node I need to store set of (dst,weight)s. Also I want to have a good performance(I'm afraid that a repatriation and a groupby will cause two shuffles)
I have came-up with this code until now but I don't know how to finish it :
links = sc.textFile("./inputs/graph1.txt") \
.map(lambda link: links.split(" "))\
.filter(lambda link: link[0] != link[1])
This is a sample input data:
1 2 5
1 3 2
1 1 2
2 3 1
3 1 2
3 2 4
What should I do?

Related

Csv file split comma separated values into separate rows and dividing the corresponding dollar amount by the number of comma separated values in panda

beginner here!
I have a csv file with comma separated values. I want to split each comma separated value in different rows in pandas. However, the corresponding dollar amounts should be divided by the number of comma separated values in each cell and export the result in a different csv file.
the csv table and the desired output table
I have used df.explode(IDs) but couldn’t figure out how to divide the Dollar_Amount by the number of IDs in the corresponding cells.
import pandas as pd
in_csv = pd.read_csv(‘inputCSV.csv’)
new_csv = df.explode(‘IDs’)
new_csv.to_csv(‘outputCSV.csv’)

You can divide the dollar amount by the number of ids in each row before using explode. This can be done as follows:
# Preprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].str[1:].str.replace(',', '').astype(float)
df['IDs'] = df['IDs'].str.split(",")
# Compute the new dollar amount and explode
df['Dollar_Amount'] = df['Dollar_Amount'] / df['IDs'].str.len()
df = df.explode('IDs')
# Postprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].round(2).apply(lambda x: '${0:,.2f}'.format(x))
With an example input:
IDs Dollar_Amount A
0 1,2,3,4 $100,000.00 4
1 5,6,7 $50,000.00 3
2 9 $20,000.00 1
3 10,11 $20,000.00 2
The result is as follows:
IDs Dollar_Amount A
0 1 $25,000.00 4
0 2 $25,000.00 4
0 3 $25,000.00 4
0 4 $25,000.00 4
1 5 $16,666.67 3
1 6 $16,666.67 3
1 7 $16,666.67 3
2 9 $20,000.00 1
3 10 $10,000.00 2
3 11 $10,000.00 2

There will be a one line way to do this with a lambda function (if you are new, read up on lambda functions!) but as a slightly less new beginner, I think its easier to think about this as two separate operations.
Operation 1 - get the count of ids, Operation 2 - do the division
If you take a look here https://towardsdatascience.com/count-occurrences-of-a-value-pandas-e5dad02303e9 you'll get a good lesson on how to do the group by you need to get the count of ids and join it back to your data frame. I'd read that because its a much more detailed explainer, but if you want a simple line of code consider this Pandas, how to count the occurance within grouped dataframe and create new column?
Once you have it, the divison is as simple as df['new_col'] = df['col1']/df['col2']

Multilevel index of rows of a dataframe using pandas [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!

You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access viat new_df.index where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.

For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex

Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])

There are two ways to do it, albeit not exactly like you have shown, but it works.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:

The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has it's index set to ['user_id','account_num']
newmulti.index will return the MultiIndex object.

Combine multiple rows based on Id and other column using pandas python [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!

You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access viat new_df.index where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.

For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex

Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])

There are two ways to do it, albeit not exactly like you have shown, but it works.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:

The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has it's index set to ['user_id','account_num']
newmulti.index will return the MultiIndex object.

How to call a created funcion with pandas apply to all rows (axis=1) but only to some specific rows of a dataframe?

I have a function which sends automated messages to clients, and takes as input all the columns from a dataframe like the one below.
name
phone
status
date
name_1
phone_1
sending
today
name_2
phone_2
sending
yesterday
I iterate through the dataframe with a pandas apply (axis=1) and use the values on the columns of each row as inputs to my function. At the end of it, after sending, it changes the status to "sent". The thing is I only want to send to the clients whose date reference is "today". Now, with pandas.apply(axis=1) this is perfectly doable, but in order to slice the clients with "today" value, I need to:
create a new dataframe with today's value,
remove it from the original, and then
reappend it to the original.
I thought about running through the whole dataframe and ignore the rows which have dates different than "today", but if my dataframe keeps growing, I'm afraid of the whole process becoming slower.
I saw examples of this being done with mask, although usually people only use 1 column, and I need more than just the one. Is there any way to do this with pandas apply?
Thank you.

I think you can use .loc to filter the data and apply func to it.
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.334781 0.521263 0.402030 0.973504 0.903314
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 0.902107 0.226398 0.596697 0.489761 0.535270
if we want double the values of rows where the value in first column > 0.3
Out[16]:
0 1 2 3 4
2 0.334781 0.521263 0.402030 0.973504 0.903314
4 0.902107 0.226398 0.596697 0.489761 0.535270
In [18]: df.loc[df[0] > 0.3] = df.loc[df[0] > 0.3].apply(lambda x: x*2, axis=1)
In [19]: df
Out[19]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.669563 1.042527 0.804061 1.947008 1.806628
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 1.804213 0.452797 1.193394 0.979522 1.070540

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data but they are not in the same order. For example, if file 1 has the results of measurements A, B, C and D from River X printed in columns 1, 2, 3 and 4, respectively but file 2 has the same measurements taken for a different river, River Y, printed in columns 6, 7, 8, and 9 respectively, is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe (i.e. make it so that Sheet2 has the measurements for River Y printed in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout of data is erratic with regards to each individual file, it would be useful to be able to put a single order on every spreadsheet without having to manually shift columns on Excel.

Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::]]
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list() # Get a list of column names use .sort() to sort in place or
sorted_col_list = sorted(col_list, reverse=False) # Use reverse True to invert the order
def rearrange_df_cols(df, target_order):
df = df[target_order]
print(df)
return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on what's above and apply it to all of your file/sheets provided that all columns names exist (NB the must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle you should check the dimension of your data e.g. df.shape and based of the shape you can either use df.transpose() and then your function to reorder the columns names or directly use your function to reorder the column names.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to groupby after partitioning in pyspark? - apache-spark

Related

Csv file split comma separated values into separate rows and dividing the corresponding dollar amount by the number of comma separated values in panda

Multilevel index of rows of a dataframe using pandas [duplicate]

Combine multiple rows based on Id and other column using pandas python [duplicate]

How to call a created funcion with pandas apply to all rows (axis=1) but only to some specific rows of a dataframe?

How to organise different datasets on Excel into the same layout/order (using pandas)

Categories

Resources