How can I reshape this dataframe with Pandas
id | col1 | col2 | col3 | value
-----------------------------------
1 | A1 | B1 | before | 20
2 | A1 | B1 | after | 13
3 | A1 | B2 | before | 11
4 | A1 | B2 | after | 21
5 | A2 | B1 | before | 18
6 | A2 | B1 | after | 22
... into the following format?
col1 | col2 | before | after
-------------------------------
A1 | B1 | 20 | 13
A1 | B2 | 11 | 21
A1 | B1 | 18 | 22
EDIT: A1 in the last line of the second table is supposed to be A2.
As the data is paired (e.g. "before" and "after") I need the columns to be aligned without 'NAs'.
df.pivot(index='col1', columns='col3', values='value')
does not work because col1 does not result in a unique index. I could create an additional column whose values would be unique. Is that the only way to go?
What do you want col1 and col2 to look like after you pivot? Your example output shows A1 and B1 for the final row, yet neither of those values is associated with the 18 and 22. I have a couple of options:
In [234]: tmp = pd.DataFrame(
              {'id': [1,2,3,4,5,6],
               'col1': ['A1','A1','A1','A1','A2','A2'],
               'col2': ['B1','B1','B2','B2','B1','B2'],
               'col3': ['before','after','before','after','before','after'],
               'value': [20,13,11,21,18,22]},
              columns=['id','col1','col2','col3','value'])
Option 1:
In [236]: pivoted = pd.pivot_table(tmp, values='value',
                                   index=['col1','col2'],
                                   columns=['col3'])
In [237]: pivoted
Out[237]:
col3 after before
col1 col2
A1 B1 13 20
B2 21 11
A2 B1 NaN 18
B2 22 NaN
This doesn't sound like the kind of behavior you want.
Option 2:
In [238]: pivoted = pivoted.bfill().dropna()
Out[238]:
col3 after before
col1 col2
A1 B1 13 20
B2 21 11
A2 B1 22 18
In [245]: pivoted.reset_index()
Out[245]:
col3 col1 col2 after before
0 A1 B1 13 20
1 A1 B2 21 11
2 A2 B1 22 18
This gets you pretty close. Again, I'm not sure how you want col1 and col2 to behave, but this has the right values in the before and after columns.
As indicated by your matrix data, col1 cannot be an index because, as you said, it "does not result in a unique index".
I think your best bet is:
grouped = df.groupby('col3')
pandas.merge(grouped.first(), grouped.last(), on=['col1','col2'])
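For reference, on current pandas versions (where pivot_table takes index= and columns= instead of the old rows=/cols=) the whole reshape can be sketched in one pipeline. This uses the question's table, in which row 6 has col2 == 'B1', so no NaNs appear; variable names here are my own:

```python
import pandas as pd

# Sample data as in the question's table (row 6 has col2 == 'B1')
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'col1': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2'],
    'col2': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1'],
    'col3': ['before', 'after', 'before', 'after', 'before', 'after'],
    'value': [20, 13, 11, 21, 18, 22],
})

# Pivot on the pair (col1, col2) so the index is unique,
# then flatten the index back into ordinary columns
out = (df.pivot_table(index=['col1', 'col2'], columns='col3', values='value')
         .reset_index()[['col1', 'col2', 'before', 'after']])
print(out)
```

Because the index is the (col1, col2) pair, each before/after pair lands on one row and no fillna/bfill step is needed.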
I have an Excel table listing rooms, types of windows, and how many windows are in each room:
Example:
COL-A | COL-B | COL-C
Row1 | Room | Window | Qty
Row2 | A | W1 | 1
Row3 | A | W2 | 1
Row4 | A | W1 | 1
Row5 | B | W1 | 1
Row6 | B | W1 | 1
Row7 | B | W1 | 1
Row8 | B | W1 | 1
...
I need to get a list telling how many windows of each type there are in each room:
COL-A | COL-B | COL-C
Row1 | Room | Window | Qty
Row2 | A | W1 | 2
Row3 | A | W2 | 1
Row4 | B | W1 | 4
...
It means I have to add values in Column C (QTY) if the combination of values in Column A and B are the same.
I have tried all sorts of formula combinations like =SUMIFS(UNIQUE(A2:A100);AND;UNIQUE(B:100)), however without success.
Any help would be appreciated
I just inserted a Pivot Table and clicked on all fields to add them to the report; this is a screenshot of the result:
(The sigma value "Sum of Quantity" is generated automatically; that's how basic this is.)
Actually you need a pivot table. You can also achieve it with a formula. Try the following:
=LET(x,UNIQUE(A2:B8),y,SUMIFS(C2:C8,A2:A8,INDEX(x,,1),B2:B8,INDEX(x,,2)),HSTACK(x,y))
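If the sheet ends up in Python anyway, the same aggregation is a one-liner in pandas. This is just a sketch, assuming the columns are named Room/Window/Qty as in the example:

```python
import pandas as pd

df = pd.DataFrame({
    'Room':   ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Window': ['W1', 'W2', 'W1', 'W1', 'W1', 'W1', 'W1'],
    'Qty':    [1, 1, 1, 1, 1, 1, 1],
})

# Sum Qty over each (Room, Window) combination
totals = df.groupby(['Room', 'Window'], as_index=False)['Qty'].sum()
print(totals)
```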
My dataframe:
| query_name | position_description |
|------------|----------------------|
| A1 | [1-10] |
| A1 | [3-5] |
| A2 | [1-20] |
| A3 | [1-15] |
| A4 | [10-20] |
| A4 | [1-15] |
How can I remove rows that (i) have the same query_name and (ii) whose position_description range is entirely contained within another row's range?
Desired output:
| query_name | position_description |
|------------|----------------------|
| A1 | [1-10] |
| A2 | [1-20] |
| A3 | [1-15] |
| A4 | [10-20] |
| A4 | [1-15] |
If no more than one row can be contained in another, we can use:
import pandas as pd
from ast import literal_eval

df2 = pd.DataFrame(df['position_description'].str.replace('-', ',')
                     .apply(literal_eval).tolist(),
                   index=df.index).sort_values(0)
print(df2)
0 1
0 1 10
2 1 20
3 1 15
5 1 15
1 3 5
4 10 20
check = df2.groupby(df['query_name']).shift()
df.loc[~(df2[0].gt(check[0]) & df2[1].lt(check[1]))]
query_name position_description
0 A1 [1-10]
2 A2 [1-20]
3 A3 [1-15]
4 A4 [10-20]
5 A4 [1-15]
This should work no matter how many ranges are contained in other ranges:
First, extract the boundaries
import numpy as np
import pandas as pd

df = pd.DataFrame({
'query_name': ['A1', 'A1', 'A2', 'A3', 'A4', 'A4'],
'position_description': ['[1-10]', '[3-5]', '[1-20]', '[1-15]', '[10-20]', '[1-15]'],
})
df[['pos_x', 'pos_y']] = df['position_description'].str.extract(r'\[(\d+)-(\d+)\]').astype(int)
Then we define a function that chooses which ranges to keep:
def non_contained_ranges(df):
    # Duplicated ranges would be seen as contained by one another and would
    # all fail the check below, so keep only the first of each duplicate.
    df = df.drop_duplicates('position_description', keep='first')
range_min = df['pos_x'].min()
range_max = df['pos_y'].max()
range_size = range_max - range_min + 1
b = np.zeros((len(df), range_size))
for i, (x, y) in enumerate(df[['pos_x', 'pos_y']].values - range_min):
b[i, x: y+1] = 1.
b2 = np.logical_and(np.logical_xor(b[:, np.newaxis], b), b).any(axis=2)
np.fill_diagonal(b2, True)
b3 = b2.all(axis=0)
return df[b3]
If there are N ranges within a group (query_name), this function will do N x N comparisons, using boolean array operations.
Then we can do groupby and apply the function to yield the expected result
df.groupby('query_name')\
.apply(non_contained_ranges)\
.droplevel(0, axis=0).drop(columns=['pos_x', 'pos_y'])
Outcome:
query_name position_description
0 A1 [1-10]
2 A2 [1-20]
3 A3 [1-15]
4 A4 [10-20]
5 A4 [1-15]
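A third option, sketched here as a self-join: merge each query_name group with itself and flag any row whose range is strictly contained in another row's range. Unlike the drop_duplicates step above, exact duplicate ranges survive this check; all variable names are my own:

```python
import pandas as pd

df = pd.DataFrame({
    'query_name': ['A1', 'A1', 'A2', 'A3', 'A4', 'A4'],
    'position_description': ['[1-10]', '[3-5]', '[1-20]', '[1-15]', '[10-20]', '[1-15]'],
})

# parse the numeric bounds out of '[lo-hi]'
df[['lo', 'hi']] = df['position_description'].str.extract(r'\[(\d+)-(\d+)\]').astype(int)

# self-join rows within each query_name, then flag rows strictly contained
# in another row's range (ties on both bounds are NOT dropped)
pairs = df.reset_index().merge(df.reset_index(), on='query_name',
                               suffixes=('', '_other'))
mask = ((pairs['index'] != pairs['index_other'])
        & (pairs['lo'] >= pairs['lo_other'])
        & (pairs['hi'] <= pairs['hi_other'])
        & ((pairs['lo'] > pairs['lo_other']) | (pairs['hi'] < pairs['hi_other'])))
contained = pairs.loc[mask, 'index'].unique()

result = df.drop(index=contained).drop(columns=['lo', 'hi'])
print(result)
```

The self-join is O(N²) in rows per group, like the boolean-array approach, but stays entirely in pandas.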
I have "objects" (each represented by several rows of a table) which are described across multiple rows. The problem is that some objects are missing rows. My goal is a DataFrame in which every object has the same number of rows (the same shape), with an object's missing rows filled by empty rows.
For example:
object 1
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
0 | A 11 | A 21 | ... | key N1 | | |
0 | A 13 | A 23 | ... | key N3 | | |
0 | A 16 | A 26 | ... | key N6 | | |
object 2
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
1 | A 12 | A 22 | ... | key N2 | | |
1 | A 13 | A 23 | ... | key N3 | | |
1 | A 14 | A 24 | ... | key N4 | | |
"O-ID" is the Object-ID. We can see that there are 6 different kinds of rows in total. In the end, I want each object to have all 6 rows. key 1 ... key N are keys in the sense of key-value pairs (with value 1 ... value N).
The result should look like this:
object 1:
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
0 | A 11 | A 21 | ... | key N1 | | |
0 | A 12 | A 22 | ... | key N2 | Null | Null | Null
0 | A 13 | A 23 | ... | key N3 | | |
0 | A 14 | A 24 | ... | key N4 | Null | Null | Null
0 | A 15 | A 25 | ... | key N5 | Null | Null | Null
0 | A 16 | A 26 | ... | key N6 | | |
object 2:
O-ID | key 1 | key 2 | ... | key N | value 1 | value 2 | value N
1 | A 11 | A 21 | ... | key N1 | Null | Null | Null
1 | A 12 | A 22 | ... | key N2 | | |
1 | A 13 | A 23 | ... | key N3 | | |
1 | A 14 | A 24 | ... | key N4 | | |
1 | A 15 | A 25 | ... | key N5 | Null | Null | Null
1 | A 16 | A 26 | ... | key N6 | Null | Null | Null
I don't know how to do this other than with a slow for-loop...
Do you know a better/faster way to find out which rows are missing, and how to insert the "Null" rows?
I already had the idea of grouping by "O-ID" and then using a map over the groups. But how do I insert the "null" rows in the right order in a fast way?
I'm using the latest pandas version and the latest python 3
First we create a multiindex from all the keys we need in the result dataframe res. Then we reindex our dataframe with this new multiindex. In the last step we convert the key tuples back to individual columns and reorder the columns and sort the rows as needed.
import pandas as pd
df = pd.DataFrame( {'O_ID': [0,0,0,1,1,1,2],
'key_1': ['A11', 'A13', 'A16', 'A12', 'A13', 'A14', 'A15'],
'key_2': ['A21', 'A23', 'A26', 'A22', 'A23', 'A24', 'A25'],
'key_n': ['key N1', 'key N3', 'key N6', 'key N2', 'key N3', 'key N4', 'key N5'],
'value_1': [11,12,13,14,15,16,17],
'value_2': [21,22,23,24,25,26,27],
'value_n': [121,122,123,124,125,126,127]
})
keycols = [c for c in df.columns if c.startswith('key')]
valcols = [c for c in df.columns if c.startswith('value')]
# create multiindex of all combinations of O_ID and key tuples
keys = df[keycols].apply(tuple, axis=1)
idx = pd.MultiIndex.from_product([df.O_ID.unique(), keys.unique()], names=['O_ID','key_tuples'])
# set index of O_ID and key tuples and reindex with new multiindex
res = df.set_index(['O_ID',keys]).drop(columns=keycols)
res = res.reindex(idx).reset_index()
# split key tuples back into individual columns and reorder/sort as needed
res = pd.DataFrame(res.key_tuples.to_list(), index=res.index, columns=keycols).join(res).drop(columns=['key_tuples'])
res = res.reindex(columns=['O_ID']+keycols+valcols).sort_values(['O_ID']+keycols)
Result:
O_ID key_1 key_2 key_n value_1 value_2 value_n
0 0 A11 A21 key N1 11.0 21.0 121.0
3 0 A12 A22 key N2 NaN NaN NaN
1 0 A13 A23 key N3 12.0 22.0 122.0
4 0 A14 A24 key N4 NaN NaN NaN
5 0 A15 A25 key N5 NaN NaN NaN
2 0 A16 A26 key N6 13.0 23.0 123.0
6 1 A11 A21 key N1 NaN NaN NaN
9 1 A12 A22 key N2 14.0 24.0 124.0
7 1 A13 A23 key N3 15.0 25.0 125.0
10 1 A14 A24 key N4 16.0 26.0 126.0
11 1 A15 A25 key N5 NaN NaN NaN
8 1 A16 A26 key N6 NaN NaN NaN
12 2 A11 A21 key N1 NaN NaN NaN
15 2 A12 A22 key N2 NaN NaN NaN
13 2 A13 A23 key N3 NaN NaN NaN
16 2 A14 A24 key N4 NaN NaN NaN
17 2 A15 A25 key N5 17.0 27.0 127.0
14 2 A16 A26 key N6 NaN NaN NaN
(I had to add a third object with key A15; otherwise it would be unclear from your sample data where this key should come from, since this method uses all existing keys. If you know all the key values in advance and want to build the result dataframe from those keys, whether or not they occur in the input dataframe, create your multiindex from these known key values instead of the unique keys present in the input data.)
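For comparison, here is an alternative sketch of the same idea built from a cross join (merge(how='cross'), available since pandas 1.2) instead of a MultiIndex, using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'O_ID': [0, 0, 0, 1, 1, 1, 2],
                   'key_1': ['A11', 'A13', 'A16', 'A12', 'A13', 'A14', 'A15'],
                   'key_2': ['A21', 'A23', 'A26', 'A22', 'A23', 'A24', 'A25'],
                   'key_n': ['key N1', 'key N3', 'key N6', 'key N2', 'key N3', 'key N4', 'key N5'],
                   'value_1': [11, 12, 13, 14, 15, 16, 17],
                   'value_2': [21, 22, 23, 24, 25, 26, 27],
                   'value_n': [121, 122, 123, 124, 125, 126, 127]})

keycols = [c for c in df.columns if c.startswith('key')]

# every O_ID paired with every observed key combination ...
full = df[['O_ID']].drop_duplicates().merge(df[keycols].drop_duplicates(),
                                            how='cross')

# ... left-joined back onto the data: missing rows become NaN
res = (full.merge(df, on=['O_ID'] + keycols, how='left')
           .sort_values(['O_ID'] + keycols))
print(res)
```

With 3 objects and 6 unique key tuples this yields 18 rows, of which the 7 original rows keep their values and the other 11 hold NaN.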
I have a table in google sheets like this one,
-------------------
| A | B | C | D |
-------------------
1 |C1 |C2 |C3 |C4 |
2 | 1 | 2 | 1 | 2 |
3 | 2 | 3 | 4 | 3 |
4 | 5 | 7 | 1 | 6 |
-------------------
My goal is to find which 2 of the columns C1, C2, C3 are closest to C4,
by calculating the average difference between each column and column C4,
e.g. column C1 will have an average of abs( ( (1-2)+(2-3)+(5-6) ) /3 ),
which is abs( ( (A2-D2)+(A3-D3)+(A4-D4) )/(number of rows) ).
I'm using ARRAYFORMULA to get the average difference for one column, and then I drag it horizontally so the As increase to Bs and so on:
=ArrayFormula({A1;abs(average( (checks if there is empty cell) ,$D2:$D-(A2:A) )))})
If I use it in cell Z1, Z1 will show 'C1' and Z2 will show the average difference for column C1,
but I'm not sure how to use a single nested formula to do it for all columns A:C at once, without having to drag it,
i.e. I type =FORMULA(...) in Z1 and a table shows up.
Thank you
Try the formula:
=QUERY(ARRAYFORMULA(ABS((ROW(A2:C)*COLUMN(A2:C))^0*D2:D26-A2:C26)),
"select avg(Col"&JOIN("), avg(Col",ArrayFormula(row(INDIRECT("A1:A"&COLUMNS(A2:C)))))&")")
Explanation
(ROW(A2:C)*COLUMN(A2:C))^0*D2:D26 -- broadcasts column D (C4) so it can be compared against each column
"select avg(Col"&JOIN("), avg(Col"... -- compose query to get the average for each column.
Note: in your formula, abs(average( must be replaced by average(abs( so that ABS is applied before averaging.
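For anyone doing this outside Sheets, the same computation is a short pandas sketch (using the question's sample numbers; as per the note above, the absolute differences are averaged, not the other way round):

```python
import pandas as pd

# the question's sample table: C1..C3 compared against C4
df = pd.DataFrame({'C1': [1, 2, 5], 'C2': [2, 3, 7],
                   'C3': [1, 4, 1], 'C4': [2, 3, 6]})

# mean absolute difference of each column against C4
dist = df[['C1', 'C2', 'C3']].sub(df['C4'], axis=0).abs().mean()

# the two closest columns
closest = dist.nsmallest(2).index.tolist()
print(dist)
print(closest)
```

Here C2 averages 1/3 and C1 averages 1, so those two are the closest.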
I have realised that structured references are clean to write when making formulas. However, using them in a formula to refer to the values in a column only returns the value on the same row as the formula, which is not ideal.
Is there a way to refer to the values of a structured-reference column starting from the top of the column, with the formula on a different row? I looked into this answer, but it only covers the top value in the column.
So for example, say I have a Table1 with 3 data columns, and I want to make a Table2 with a calculation column as simple as =Table1[Data1]*Table1[Data2]. How would I get that formula to work down the column of Table2 even if Table2 is not on the same rows as Table1?
| Data1 | Data2 | Data3 |
|------------------------
| 22 | 7 | Joe |
| 24 | 2 | Bob |
| 44 | 7 | Ben |
| 29 | 8 | Sue |
| Calc1 |
|----------
| Formula | (Formula = 22 x 7)
| Formula | (Formula = 24 x 2)
| Formula | (Formula = 44 x 7)
| Formula | (Formula = 29 x 8)
Edit: just so readers understand, this is an oversimplification of what I am trying to achieve; managing to use structured references as stated above will help solve the bigger problem.