Groupby without an aggregation function and sort that data - python-3.x

I have customer IDs and purchase dates. I need to sort the purchase dates for each customer ID separately.
In other words, I need a groupby operation without an aggregation that sorts the purchase dates within each customer.
I tried it this way:
new_data = data.groupby('custID').sort_values('purchase_date')
AttributeError: Cannot access callable attribute 'sort_values' of
'DataFrameGroupBy' objects, try using the 'apply' method
The expected result looks like this:
custID purchase_date
100 23/01/2019
100 29/01/2019
100 03/04/2019
120 02/05/2018
120 09/03/2019
120 11/05/2019

# import the pandas library
import pandas as pd
data = {
    'purchase_date': ['23/01/2019', '19/01/2019', '12/01/2019', '23/01/2019', '11/01/2019',
                      '23/01/2019', '06/05/2019', '05/05/2019', '05/01/2019', '02/07/2019'],
    'custID': [100, 160, 100, 110, 160, 110, 110, 110, 110, 160]
}
df = pd.DataFrame(data)
sortedData = df.groupby('custID').apply(
    lambda x: x.sort_values(by='purchase_date', ascending=True))
sortedData = sortedData.reset_index(drop=True, inplace=False)
OUTPUT:
print(sortedData)
Index custID purchase_date
0 100 12/01/2019
1 100 23/01/2019
2 110 05/01/2019
3 110 05/05/2019
4 110 06/05/2019
5 110 23/01/2019
6 110 23/01/2019
7 160 02/07/2019
8 160 11/01/2019
9 160 19/01/2019
print(sortedData.to_string(index=False))
custID purchase_date
100 12/01/2019
100 23/01/2019
110 05/01/2019
110 05/05/2019
110 06/05/2019
110 23/01/2019
110 23/01/2019
160 02/07/2019
160 11/01/2019
160 19/01/2019
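Note that purchase_date is stored as a plain string here, so the sort above is lexicographic; for dd/mm/yyyy values that is not chronological order (e.g. 23/01/2019 sorts after 05/05/2019). A minimal sketch of a chronological version, assuming the dates should be parsed as dd/mm/yyyy:
# Sketch: parse the strings as dates so the sort is chronological,
# then sort by customer and date in one call (no groupby needed for plain sorting).
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='%d/%m/%Y')
sortedData = df.sort_values(['custID', 'purchase_date']).reset_index(drop=True)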

Related

test/train splits in pycaret using a column for grouping rows that should be in the same split

My dataset contains a column that I need to use for splitting the rows into groups: rows belonging to the same group should not be divided between train and test, but should be sent as a whole to one of the splits, using PyCaret.
10 row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
Every unique group_id should be sent, in full, to one of the splits, like this (using an 80/20 split):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888
You can try the following per the documentation
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
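A minimal sketch of that approach, assuming your version of PyCaret's setup() accepts the fold_strategy and fold_groups parameters and that you have a target column (names here are hypothetical; check the linked docs):
from pycaret.classification import setup

# Hypothetical sketch: make the cross-validation folds respect group boundaries.
exp = setup(
    data=df,                     # dataframe containing group_id, the measures and a target
    target='target',             # hypothetical target column name
    fold_strategy='groupkfold',  # group-aware cross-validation
    fold_groups='group_id'       # column that defines the groups
)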
One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations

def is_possible_sum(numbers, n):
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError

def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test

if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Much of this is because the mapping from group_id to sample count cannot simply be inverted, since several group_ids may have the same count. You could probably do this with numpy arrays as well, but I doubt that the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
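If a randomized, group-aware split is acceptable, a common alternative (not part of the answer above) is scikit-learn's GroupShuffleSplit; a minimal sketch:
# Alternative sketch using scikit-learn's GroupShuffleSplit:
# rows sharing a Group_ID always land in the same split, and the split is randomized.
# Note that train_size refers to the fraction of groups, so the row split is only roughly 80/20.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    'Group_ID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'Measurement': np.random.random(10),
})

splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df['Group_ID']))
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]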

adding reversed columns to dataframe [duplicate]

This question already has an answer here:
Reversing the order of values in a single column of a Dataframe
(1 answer)
Closed 1 year ago.
I'm trying to add a reversed column to a data frame, but it gets added in the original order. It looks like the assignment is just following the index of the dataframe. Is it possible to reorder the index?
df_reversed = df['Buy'].iloc[::-1]
Data["newColumn"] = df_reversed
(The original post included screenshots of the current output, of df_reversed, and of the desired output.)
With a slight modification of @Chicodelarose's answer, you can reverse just the values and get the result you want as follows:
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

df["calories_reversed"] = df["calories"].values[::-1]
print(df)
Output will be:
calories duration
0 420 50
1 380 40
2 390 45
calories duration calories_reversed
0 420 50 390
1 380 40 380
2 390 45 420
You need to call reset_index before assigning the values to the new column; otherwise pandas aligns the reversed Series on its index and puts the values straight back in their original positions:
Example:
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

df["calories_reversed"] = df["calories"][::-1].reset_index(drop=True)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
calories duration calories_reversed
0 420 50 390
1 380 40 380
2 390 45 420
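The reason both answers work is that plain assignment aligns on the index, so a reversed Series would otherwise land back in its original rows; a small sketch of the difference:
# Sketch: why plain assignment does not reverse the column.
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# The reversed Series keeps its original index labels (2, 1, 0); assignment
# aligns on those labels, so the values end up back in their original rows.
df["not_reversed"] = df["calories"][::-1]

# Dropping the index (via .to_numpy(), .values or reset_index) avoids the alignment.
df["reversed"] = df["calories"].to_numpy()[::-1]
print(df)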

Pivot a Two Column DataFrame With No Numeric Column To Aggregate On

I have a dataframe with input like this:
df1 = pd.DataFrame({
    'StoreId': [244, 391, 246, 246, 130, 130],
    'PackageStatus': ['IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT', 'COLLECTED']
})
StoreId PackageStatus
0 244 IN TRANSIT
1 391 IN TRANSIT
2 246 IN TRANSIT
3 246 IN TRANSIT
4 130 IN TRANSIT
5 130 COLLECTED
The output I'm expecting looks like this, with the package statuses pivoted into columns and their counts becoming the values:
StoreId IN TRANSIT COLLECTED
244 1 0
391 1 0
246 2 0
130 1 1
All the examples I come across involve a third, numeric column on which some aggregation (sum, mean, etc.) is performed.
When I tried
df1.pivot_table(index='StoreId',values='PackageStatus', aggfunc='count')
I get the following instead:
PackageStatus
StoreId
130 2
244 1
246 2
391 1
In my case I need a simple transpose/pivot with the count. How to accomplish this? Thank you.
Use columns="PackageStatus" parameter:
print(
    df1.pivot_table(
        index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
    )
)
Prints:
PackageStatus COLLECTED IN TRANSIT
StoreId
130 1 1
244 0 1
246 0 2
391 0 1
With .reset_index():
print(
    df1.pivot_table(
        index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
    )
    .reset_index()
    .rename_axis("", axis=1)
)
Prints:
StoreId COLLECTED IN TRANSIT
0 130 1 1
1 244 0 1
2 246 0 2
3 391 0 1
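For what it's worth, the same count table can typically be produced without pivot_table via pd.crosstab; a minimal sketch:
# Sketch: pd.crosstab counts StoreId/PackageStatus combinations directly.
out = pd.crosstab(df1['StoreId'], df1['PackageStatus']).reset_index().rename_axis('', axis=1)
print(out)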

Windowing Data into Rows in Pyspark

I'm preparing a dataset to develop a supervised model that predicts a value given the 5 values preceding it. For example, given the sample data below, I would predict the 6th column given columns 1:5, or the 8th column given columns 3:7.
id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
a 150 110 130 80 136 150 190 110 150 110 130 136 100 150 190 110
b 100 100 130 100 136 100 160 230 122 130 15 200 100 100 136 100
c 130 122 140 140 122 130 15 200 100 100 130 100 136 100 160 230
To that end, I want to reorganize the sample data above into rows of 6 columns, taking every slice/window of 6 values possible (e.g. 1:6, 2:7, 3:8). How can I do that? Is it possible in PySpark/SQL? Example of output below, index just for clarification:
1 2 3 4 5 6
a[1:6] 150 110 130 80 136 150
a[2:7] 110 130 80 136 150 190
a[3:8] 130 80 136 150 190 110
...
c[1:6] 130 122 140 140 122 130
c[2:7] 122 140 140 122 130 15
...
c[10:16] 130 100 136 100 160 230
You can convert your columns into an array of arrays or array of structs and then explode, for example:
from pyspark.sql.functions import struct, explode, array, col
# all columns except the first
cols = df.columns[1:]
# size of the splits
N = 6
Use array of arrays:
df_new = df.withColumn('dta', explode(array(*[ array(*cols[i:i+N]) for i in range(len(cols)-N+1) ]))) \
    .select('id', *[ col('dta')[i].alias(str(i+1)) for i in range(N) ])
df_new.show()
+---+---+---+---+---+---+---+
| id| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+
| a|150|110|130| 80|136|150|
| a|110|130| 80|136|150|190|
| a|130| 80|136|150|190|110|
| a| 80|136|150|190|110|150|
| a|136|150|190|110|150|110|
| a|150|190|110|150|110|130|
| a|190|110|150|110|130|136|
| a|110|150|110|130|136|100|
| a|150|110|130|136|100|150|
| a|110|130|136|100|150|190|
| a|130|136|100|150|190|110|
| b|100|100|130|100|136|100|
+---+---+---+---+---+---+---+
Use array of structs (spark 2.4+):
df_new = df.withColumn('dta', array(*cols)) \
    .selectExpr("id", f"""
        inline(transform(sequence(0,{len(cols)-N}), i -> ({','.join(f'dta[i+{j}] as `{j+1}`' for j in range(N))})))
    """)
For N=6, the SQL generated inside the f-string above is equivalent to the following:
inline(transform(sequence(0,10), i -> struct(dta[i] as `1`, dta[i+1] as `2`, dta[i+2] as `3`, dta[i+3] as `4`, dta[i+4] as `5`, dta[i+5] as `6`)))
Yes, you can use this code (and modify it to get what you need):
partitions = []
# note: toLocalIterator() pulls the rows to the driver, so this approach
# does not scale well to very large dataframes
for row in df.rdd.toLocalIterator():
    row_list = list(row)
    num_elements = 6
    # slide a window of num_elements values across the row
    # (adjust the start index/range to skip the id column and to include the last window)
    for i in range(0, len(row_list) - num_elements):
        partition = row_list[i : i + num_elements]
        partitions.append(partition)
output_df = spark.createDataFrame(partitions)

pandas remove rows with multiple criteria

Consider the following pandas Data Frame:
df = pd.DataFrame({
'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
'cid': [1, 1, 2, 2, 1, 1, 2, 2],
'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})
When printed looks like this:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 1 101 272.5 535.9 880.9
5 1051 1 102 750.0 838.8 340.3
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
I need to remove rows where 'case_id' is in a list of values and 'cid' is in a list of values. For simplicity, let's use lists with a single value each: cases = [1051] and ids = [1]. In this scenario I want the new data frame to have (6) rows of data. It should look like this, because the two rows matching my criteria have been removed:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
4 1051 2 101 333.4 526.7 890.7
5 1051 2 102 104.2 119.4 422.1
I've tried a few different things like:
df2 = df[(df.case_id != subcase) & (df.cid != commit_id)]
But this returns the inverse of what I was expecting:
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
I've also tried using .query(): df.query('(case_id != 1051) & (cid != 1)')
but got the same (2) rows of results.
Any help and/or explanations would be greatly appreciated.
Your code selects the rows that satisfy the negated conditions rather than dropping the matching ones: (df.case_id != 1051) & (df.cid != 1) keeps only rows where both conditions hold, and by De Morgan's law the complement of (case_id == 1051) & (cid == 1) is an OR of the negations, not an AND. You can drop the matching rows using .drop().
Use the following:
df.drop(df.loc[(df['case_id'].isin(cases)) & (df['cid'].isin(ids))].index)
Output:
case_id cid elm_id fx fy fz
0 1050 1 101 736.1 992.0 428.4
1 1050 1 102 16.5 261.3 611.0
2 1050 2 101 98.8 798.3 948.3
3 1050 2 102 158.5 452.0 523.9
6 1051 2 101 333.4 526.7 890.7
7 1051 2 102 104.2 119.4 422.1
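Equivalently (a sketch, not part of the answer above), you can negate the combined condition with a boolean mask:
# Sketch: keep the rows that do NOT match both conditions at once.
cases = [1051]
ids = [1]
df2 = df[~(df['case_id'].isin(cases) & df['cid'].isin(ids))]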
