Python Dataframe rename column compared to value - python-3.x

With Pandas I'm trying to rename unnamed columns in dataframe with values on the first ligne of data.
My dataframe:
id
store
unnamed: 1
unnamed: 2
windows
unnamed: 3
unnamed: 4
0
B1
B2
B3
B1
B2
B3
1
2
c
12
15
15
14
2
4
d
35
14
14
87
My wanted result:
id
store_B1
store_B3
store_B2
windows_B1
windows_B2
windows_B3
0
B1
B2
B3
B1
B2
B3
1
2
c
12
15
15
14
2
4
d
35
14
14
87
I don't know how I can match the column name with the value in my data. Thanks for your help. Regards

You can use df.columns.where to make unnamed: columns NaN, then convert it to a Series and use ffill:
df.columns = pd.Series(df.columns.where(~df.columns.str.startswith('unnamed:'))).ffill() + np.where(~df.columns.isin(['id','col2']), ('_' + df.iloc[0].astype(str)).tolist(), '')
Output:
>>> df
id store_B1 store_B2 store_B3 windows_B1 windows_B2 windows_B3
0 0 B1 B2 B3 B1 B2 B3
1 1 2 c 12 15 15 14
2 2 4 d 35 14 14 87

Related

Align value counts of two dataframes side by side

If I have two dataframes - df1 (data for current day), and df2 (data for previous day).
Both dataframes have 40 columns, and all columns are object data type.
how do I compare Top 3 value_counts for both dataframes, ideally so that the result is side by side, like the following;
df1 df2
Column a Value count 1 Value count 1
Value count 2 Value count 2
Value count 3 Value count 3
Column b Value count 1 Value count 1
Value count 2 Value count 2
Value count 3 Value count 3
The main idea is to check for data anomalies between the data for the two days.
I only know that for each column per dataframe, I must do something like this -
df1.Column.value_counts().head(3)
But this doesn't show combined results as I want. Please help!
You can compare if same columns names in both DataFrames - first use lambda function with Series.value_counts, top3 and create default index for both DataFrames and then join them with concat and for expected order add DataFrame.stack:
np.random.seed(2022)
df1 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df2 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df11 = df1.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df22 = df2.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
Or use DataFrame.compare:
df = (df11.compare(df22,keep_equal=True)
.rename(columns={'self':'df1','other':'df2'})
.stack(0)
.sort_index(level=1))
print (df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
EDIT: For add categories use f-strings for join indices and values of Series in list comprehension:
np.random.seed(2022)
df1 = 'Cat1' + pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df2 = 'Cat2' + pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df11 = df1.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df22 = df2.apply(lambda x:[f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
df1 df2
0 c0 Cat18 - 7 Cat29 - 8
1 c0 Cat11 - 6 Cat24 - 8
2 c0 Cat19 - 6 Cat23 - 6
0 c1 Cat17 - 8 Cat24 - 9
1 c1 Cat10 - 7 Cat26 - 7
2 c1 Cat14 - 7 Cat20 - 7
0 c2 Cat13 - 9 Cat28 - 7
1 c2 Cat11 - 7 Cat25 - 7
2 c2 Cat19 - 7 Cat26 - 6
0 c3 Cat15 - 9 Cat20 - 7
1 c3 Cat18 - 7 Cat24 - 7
2 c3 Cat13 - 7 Cat27 - 6
0 c4 Cat12 - 11 Cat25 - 14
1 c4 Cat13 - 7 Cat20 - 8
2 c4 Cat15 - 7 Cat26 - 7

panda dataframe row wise iteration with referencing previous row values for conditional matching

I have to find out how many times a bike was on overspeed, and in each instances for how long(for simplicity how many kms)
df = pd.DataFrame({'bike':['b1']*15, 'km':list(range(1,16)), 'speed':[20,30,38,33,28,39,26,33,35,46,53,27,37,42,20]})
>>> df
bike km speed
0 b1 1 20
1 b1 2 30
2 b1 3 38
3 b1 4 33
4 b1 5 28
5 b1 6 39
6 b1 7 26
7 b1 8 33
8 b1 9 35
9 b1 10 46
10 b1 11 53
11 b1 12 27
12 b1 13 37
13 b1 14 42
14 b1 15 20
#Expected result is
bike last_OS_loc for_how_long_on_OS
b1 4 2km
b1 11 5km
b1 15 1km
Now Logic-
has to flag the speed >= 30 as Overspeed_Flag
If the speed remains on more than 30 for 1 or 1+km then those continuation are treated as a overspeed session (eg: when b1 was in between 2 to 4 km, 6to11, 13-14km, MARK it was not a overspeed session when b1 was at 6km, as it was only for that row, no continuation on >30 found).
then measure for a session, how long/for how many kms he remains at overspeed limit. Refer expected result tab.
also finding out for a overspeed session what was the last km mark.
Kindly suggest how can i achieve this. And do let me know if anything is not clear in the question.
P:S: i am also trying, but it is little complex for me(Pretty confused on how to mark if it is a continuation of OS_flag or a single instance of OS.), Will get back if successful in doing this. Thanks in ADV.
You can use:
#boolean mask
mask = df['speed'] >= 30
#consecutive groups
df['g'] = mask.ne(mask.shift()).cumsum()
#get size of each group
df['count'] = mask.groupby(df['g']).transform('size')
#filter by mask and remove unique rows
df = df[mask & (df['count'] > 1)]
print (df)
bike km speed g count
1 b1 2 30 2 3
2 b1 3 38 2 3
3 b1 4 33 2 3
7 b1 8 33 6 4
8 b1 9 35 6 4
9 b1 10 46 6 4
10 b1 11 53 6 4
12 b1 13 37 8 2
13 b1 14 42 8 2
#aggregate first and last values
df1 = df.groupby(['bike','g'])['km'].agg([('last_OS_loc', 'last'),
('for_how_long_on_OS','first')])
#substract last with first
df1['for_how_long_on_OS'] = df1['last_OS_loc'] - df1['for_how_long_on_OS']
#data cleaning
df1 = df1.reset_index(level=1, drop=True).reset_index()
print (df1)
bike last_OS_loc for_how_long_on_OS
0 b1 4 2
1 b1 11 3
2 b1 14 1
EDIT:
print (pd.concat([mask,
mask.shift(),
mask.ne(mask.shift()),
mask.ne(mask.shift()).cumsum()], axis=1,
keys=('mask', 'shifted', 'not equal (!=)', 'cumsum')))
mask shifted not equal (!=) cumsum
0 False NaN True 1
1 True False True 2
2 True True False 2
3 True True False 2
4 False True True 3
5 True False True 4
6 False True True 5
7 True False True 6
8 True True False 6
9 True True False 6
10 True True False 6
11 False True True 7
12 True False True 8
13 True True False 8
14 False True True 9
Here is another approach using a couple of helper Series and a lambda func:
os_session = (df['speed'].ge(30) & (df['speed'].shift(-1).ge(30) | df['speed'].shift().ge(30))).astype(int)
groups = (os_session.diff(1) != 0).astype('int').cumsum()
f_how_long = lambda x: x.max() - x.min()
grouped_df = (df.groupby([os_session, groups, 'bike'])['km']
.agg([('last_OS_loc', 'max'),
('for_how_long_on_OS',f_how_long)])
.xs(1, level=0)
.reset_index(level=0, drop=True))
print(grouped_df)
last_OS_loc for_how_long_on_OS
bike
b1 4 2
b1 11 3
b1 14 1

Excel - Sorting based on metric values

I have a dataset which looks like:
Product Metrics C1 C2 C3
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
Now I want to sort the values based on metric Q1 - From smallest to largest (comparing against the product -A1,A2) then the final dataset should look like,
Product Metrics C1 C2 C3
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
hope this gives a clear picture. Thanks in advance guys
The way I would probably do it is transpose your columns and rows so that you have columns for Q1, Q2, Q3, Q4.
Like this:
Product Metrics Q1 Q2 Q3 Q4
A1 C1 20 213123 454 3
A1 C2 30 2312 65 4
A1 C3 10 32123 45 6
A2 C1 10 123 454 3
A2 C2 5 13 65 45
A2 C3 1 23 45 6
Then you can sort by Q1 using Data>Sort & Filter
CBRF23 already pointed in the right direction but I believe you have to go even a little bit further and flatten each product related sub-array into a single row like
A | B C D | E F G | H I J | K L M
---| Q1 --------| Q2 ------------ | Q3 ------- | Q4 -------
Pr | C1 C2 C3 | C1 C2 C3 | C1 C2 C3 | C1 C2 C3
A1 | 20 30 10 | 213123 2312 32123 | 454 65 45 | 3 4 6
A2 | 10 5 1 | 123 13 23 | 454 65 45 | 3 4 6
A3 | 18 6 3 | 123 13 23 | 454 65 45 | 3 4 6
(The first row just shows the Excel columns, second row the flattened Q1,Q2,Q3 and Q4 sections and third row the sub-headers for each column)
Now you can safely sort by column B. In case you want to sort by the sum of all Q1 metrics you could introduce another column N being the sum of B,C and D and use that for sorting.
Update:
To get your desired output format back there are basically to possibilities:
If the number of records is known and fixed you can set-up a "results" page in your excel folder with a list of small "sub-tables". The fields of each sub-array then directly reference the "transposed" fields in a line of the sorted master results array.
If the number of results is variable you will have to construct/reconstruct the results page mentioned above using a suitable vba script. The vba generated page can of course also consist of the sorted values directly rather than referencing the values in the sorted master array.

excel:compare 2 columns and copy data on other columns

need help.. i trying to compare 2 columns and copy data in other columns..
Columns:
A B C D
1 3 10
2 4 20
3 1 30
4 2 40
5 0 50
i want to compare column A to B to find its duplicate and copy data from column C if column A has a duplicate at column B...
Result must be:
A B C D
1 3 10 0
2 4 20 40
3 6 30 10
4 2 40 20
5 0 50 0
thanks in advance...
An answer as I understand the question (assuming the change in col B is just a typo):
Input
A B C D
1 3 10
2 4 20
3 6 30
4 2 40
5 0 50
Output
A B C D
1 3 10 0
2 4 20 40
3 6 30 10
4 2 40 20
5 0 50 0
Formula in D2 (filled down): =IF(COUNTIF(B$2:B$6, $A2)>0, VLOOKUP($A2,$B$2:$C$6, 2, FALSE), 0).
COUNTIF(B$2:B$6, $A2) returns the number of times the value in A2 appears in the array B2:B6. If this value is greater than 0 (meaning that A2 is in B2:B6), the IF() function looks looks up A2 in col B and returns the value in the 2nd row (col C); if A2 is not in B2:B6, the formula returns 0.

Excel SUMIF based on array using text string

Is there a way to substitute the cell address containing a text string as the array criteria in the following formula?
=SUM(SUMIF(A5:A10,{1,22,3},E5:E10))
So instead of {1,22,3}, "1, 22, 3" is entered in cell A2 the formula becomes
=SUM(SUMIF(A5:A10,A2,E5:E10))
I have tried but get 0 as a result (refer C16)
A B C D E F G H
1 Tree
2 {1,22,3} 1
3 22
4 Tree Profit 3
5 1 105
6 2 96
7 1 105
8 1 75
9 2 76.8
10 1 45
11
12 330 =SUM(SUMIF(A5:A10,{1,22,3},B5:B10))
13
14 330 =SUMPRODUCT(SUMIF(A5:A10,E2:E3,B5:B10))
15
16 0 =SUM(SUMIF(A5:A10,A2,B5:B10))
17 NB: Custom Format "{"#"}" on Cell A2 I enter 1,22,3 so it displays {1,22,3}
Ok so after some further searching (see Excel string to criteria) and trial and error I have come up with the following solution.
Using Name Manager I created UDF called GetList which Refers to:
=EVALUATE(Sheet1!$A$3) NB: Cell A3 has this formula in it =TEXT(A2,"{#}")
I then used the following formula:
=SUMPRODUCT(SUMIF($A$5:$A$12,GetList,$B$5:$B$12))
which gives the desired result of 321 as per the other two formulas (see D12 below).
If anyone can suggest a better solution then feel free to do so.
Thanks to Dennis to my original post regarding table
A B C D E
1 Tree
2 1,22,3 1
3 {1,22,3} =TEXT(A2,"{#}") 22
4 Tree Profit 3
5 11 105
6 22 96
7 1 105
8 3 75
9 2 76.8
10 1 45
11
12 321 =SUMPRODUCT(SUMIF($A$5:$A$12,GetList,$B$5:$B$12))
13
14 321 =SUM(SUMIF(A5:A10,{1,22,3},B5:B10))
15
16 321 =SUMPRODUCT(SUMIF(A5:A10,E2:E3,B5:B10))
17
18 0 =SUM(SUMIF(A5:A10,A2,B5:B10))
19 NB: Custom Format "{"#"}" on Cell A2 I enter 1,22,3 so it displays {1,22,3}

Resources