Pandas column: List of columns in specific order - python-3.x

I have a dataframe, but I'm trying to add a new column which is a list of the column names in order of their values, for each row.
Searching has proved to be difficult, as the search terms have so much in common with doing a column sort overall. Instead, I'm trying to customize the list for each row.
df = pd.DataFrame([
["a",88,3,78,8,40 ],
["b",100,20,29,13,91 ],
["c",77,92,42,72,58 ],
["d",39,53,69,7,40 ],
["e",26,62,77,33,86 ],
["f",94,5,28,96,7 ]
], columns=['id','x1','x2','x3','x4','x5'])
have = df.set_index('id')
+----+-----+----+----+----+----+----------------------------+
| id | x1 | x2 | x3 | x4 | x5 | ordered_cols |
+----+-----+----+----+----+----+----------------------------+
| a | 88 | 3 | 78 | 8 | 40 | ['x2','x4','x5','x3','x1'] |
| b | 100 | 20 | 29 | 13 | 91 | ['x4','x2','x3','x5','x1'] |
| c | 77 | 92 | 42 | 72 | 58 | … |
| d | 39 | 53 | 69 | 7 | 40 | … |
| e | 26 | 62 | 77 | 33 | 86 | … |
| f | 94 | 5 | 28 | 96 | 7 | … |
+----+-----+----+----+----+----+----------------------------+

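Try stack with sort_values and groupby.
This assumes your dataframe is called df and that id has already been set as the index (as in have from the question), so that only the numeric columns are stacked.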
df["sorted_cols"] = (
df.stack().sort_values().reset_index(1).groupby(level=0)["level_1"].agg(list)
)
print(df)
x1 x2 x3 x4 x5 sorted_cols
id
a 88 3 78 8 40 [x2, x4, x5, x3, x1]
b 100 20 29 13 91 [x4, x2, x3, x5, x1]
c 77 92 42 72 58 [x3, x5, x4, x1, x2]
d 39 53 69 7 40 [x4, x1, x5, x2, x3]
e 26 62 77 33 86 [x1, x4, x2, x3, x5]
f 94 5 28 96 7 [x2, x5, x3, x1, x4]

The solution by Manakin will be the fastest option, because it is vectorized.
Use pandas.DataFrame.apply with axis=1, and a list comprehension to sort the column names by the row values.
The list comprehension is from SO: Sorting list based on values from another list, and does not require importing any additional packages.
import pandas as pd
# add the new column
df['ordered_cols'] = df.apply(lambda y: [x for _, x in sorted(zip(y, df.columns))], axis=1)
# display(df)
x1 x2 x3 x4 x5 ordered_cols
id
a 88 3 78 8 40 [x2, x4, x5, x3, x1]
b 100 20 29 13 91 [x4, x2, x3, x5, x1]
c 77 92 42 72 58 [x3, x5, x4, x1, x2]
d 39 53 69 7 40 [x4, x1, x5, x2, x3]
e 26 62 77 33 86 [x1, x4, x2, x3, x5]
f 94 5 28 96 7 [x2, x5, x3, x1, x4]
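To see what the list comprehension does in isolation, here is a minimal sketch on row a alone. The variable names values and names are illustrative and not part of the answer. Tuples produced by zip are compared by value first, so if two values were equal the tie would be broken by column name.

```python
# Row "a" from the question, written out as plain Python lists.
names = ["x1", "x2", "x3", "x4", "x5"]
values = [88, 3, 78, 8, 40]

# Pair each value with its column name, sort the pairs by value,
# then keep only the names.
ordered = [name for _, name in sorted(zip(values, names))]
print(ordered)  # ['x2', 'x4', 'x5', 'x3', 'x1']
```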

Here is a simple one-line solution using apply and np.argsort:
import numpy as np
have["ordered_cols"] = have.apply(lambda row: have.columns[np.argsort(row.values)].values, axis=1)
have
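Because apply with axis=1 still calls a Python function once per row, the same argsort idea can also be applied to the whole array in one call. This is a minimal sketch, assuming every value column is numeric and id is the index; the names order, names and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["a", 88, 3, 78, 8, 40],
        ["b", 100, 20, 29, 13, 91],
        ["c", 77, 92, 42, 72, 58],
        ["d", 39, 53, 69, 7, 40],
        ["e", 26, 62, 77, 33, 86],
        ["f", 94, 5, 28, 96, 7],
    ],
    columns=["id", "x1", "x2", "x3", "x4", "x5"],
)
have = df.set_index("id")

# Column positions that would sort each row in ascending order.
order = np.argsort(have.to_numpy(), axis=1, kind="stable")

# Index the array of column names with those positions (one row of names per id).
names = have.columns.to_numpy()[order]

# Work on a copy so the original frame keeps only its numeric columns.
result = have.copy()
result["ordered_cols"] = pd.Series(names.tolist(), index=have.index)
print(result)
```

Each cell of ordered_cols holds a plain Python list, which matches the layout requested in the question.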

Hey,
you can try looping over the rows and sorting the columns by the values in each row. The code below will do the trick:
ordered_cols = []
for index, row in have.iterrows():
    # sort the columns by the values in this row and keep the resulting column order
    ordered_cols.append(list(have.sort_values(by=index, ascending=True, axis=1).columns))
have['ordered_cols'] = ordered_cols
have
Output:
x1 x2 x3 x4 x5 ordered_cols
id
a 88 3 78 8 40 [x2, x4, x5, x3, x1]
b 100 20 29 13 91 [x4, x2, x3, x5, x1]
c 77 92 42 72 58 [x3, x5, x4, x1, x2]
d 39 53 69 7 40 [x4, x1, x5, x2, x3]
e 26 62 77 33 86 [x1, x4, x2, x3, x5]
f 94 5 28 96 7 [x2, x5, x3, x1, x4]
I hope this was helpful.
Cheers!

Related

Cannot turn off "sort" function in pandas.concat

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'col3':np.random.randint(1,10,5),'col1':np.random.randint(30,80,5)})
df2 = pd.DataFrame({'col4':np.random.randint(30,80,5),'col5':np.random.randint(100,130,5)})
df3 = pd.DataFrame({'col9':np.random.randint(1,10,5),'col8':np.random.randint(30,80,5)})
x1 = pd.concat([df1,df2,df3],axis=1,sort=False)
x1.columns = pd.MultiIndex.from_product([['I2'],x1.columns])
x2 = pd.concat([df1,df2,df3],axis=1,sort=False)
x2.columns = pd.MultiIndex.from_product([['I3'],x2.columns])
x3 = pd.concat([df1,df2,df3],axis=1,sort=False)
x3.columns = pd.MultiIndex.from_product([['I1'],x3.columns])
pd.concat([x1,x2,x3],axis=0,sort=False)
I was trying to get an aggregated dataframe with exactly the same column order as x1, x2 and x3 (which already share the same order), as Figure 1 shows:
Figure 1: I was trying to get this
But the code above actually created the dataframe shown in Figure 2:
Figure 2: The code actually created this
I am wondering why the sort=False parameter of pandas.concat() did not prevent the columns from being sorted, at either the first or the second level?
Is there any other way that I can get the dataframe that I want?
Great thanks for your time and intelligence!
You could use join instead of using concat
x1.join(x2,how='outer').join(x3,how='outer')
Result:
I2 I3 I1
col3 col1 col4 col5 col9 col8 col3 col1 col4 col5 col9 col8 col3 col1 col4 col5 col9 col8
0 7 54 42 128 8 79 7 54 42 128 8 79 7 54 42 128 8 79
1 1 56 56 102 1 77 1 56 56 102 1 77 1 56 56 102 1 77
2 9 34 52 108 4 68 9 34 52 108 4 68 9 34 52 108 4 68
3 3 42 51 108 8 75 3 42 51 108 8 75 3 42 51 108 8 75
4 3 34 70 100 5 78 3 34 70 100 5 78 3 34 70 100 5 78
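If the goal is the side-by-side layout that the join produces, a similar result can probably be built in one step with the keys argument of pd.concat along axis=1. This is a sketch under that assumption, not a diagnosis of the sort=False behaviour; the seeded random generator and the names base and wide are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df1 = pd.DataFrame({"col3": rng.integers(1, 10, 5), "col1": rng.integers(30, 80, 5)})
df2 = pd.DataFrame({"col4": rng.integers(30, 80, 5), "col5": rng.integers(100, 130, 5)})
df3 = pd.DataFrame({"col9": rng.integers(1, 10, 5), "col8": rng.integers(30, 80, 5)})

# Place the three frames next to each other; the column order is kept as given.
base = pd.concat([df1, df2, df3], axis=1)

# keys adds an outer column level, in the order the keys are listed.
wide = pd.concat([base, base, base], axis=1, keys=["I2", "I3", "I1"])
print(wide.columns.tolist()[:6])
```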

Apply z-score across all attributes by country

I'm trying to clean up a dataset that has data on every country in the world from 2000 to 2015. The population data by year is quite bad, so I want to compute a z-score for each country's population values across the years and use it to see which data points to drop as outliers. How would I do this? I'm thinking I need to use groupby(), but I'm not sure how to apply it.
I'm working with this WHO Kaggle dataset: https://www.kaggle.com/kumarajarshi/life-expectancy-who/data#
The data generally looks like this:
Example
Maybe something like this will work:
import numpy as np, pandas as pd
l1 = ['a'] * 5 + ['b'] * 10 + ['c'] * 8
l2 = list(np.random.randint(10,20,size=5)) + list(np.random.randint(100,150, size=10)) + list(np.random.randint(75,100, size=8))
df = pd.DataFrame({'cat':l1, 'values':l2}) #creating a dummy dataframe
df
cat values
0 a 18
1 a 17
2 a 11
3 a 13
4 a 11
5 b 102
6 b 103
7 b 119
8 b 113
9 b 100
10 b 113
11 b 102
12 b 108
13 b 128
14 b 126
15 c 75
16 c 96
17 c 81
18 c 90
19 c 80
20 c 95
21 c 96
22 c 86
df['z-score'] = df.groupby('cat')['values'].transform(lambda x: (x - x.mean()) / x.std())  # transform keeps the original row index, so the result aligns with df
df
cat values z-score
0 a 18 1.206045
1 a 17 0.904534
2 a 11 -0.904534
3 a 13 -0.301511
4 a 11 -0.904534
5 b 102 -0.919587
6 b 103 -0.821759
7 b 119 0.743496
8 b 113 0.156525
9 b 100 -1.115244
10 b 113 0.156525
11 b 102 -0.919587
12 b 108 -0.332617
13 b 128 1.623951
14 b 126 1.428295
15 c 75 -1.520176
16 c 96 1.059516
17 c 81 -0.783121
18 c 90 0.322461
19 c 80 -0.905963
20 c 95 0.936674
21 c 96 1.059516
22 c 86 -0.168908
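For the original goal of spotting outliers per country, the same idea can be sketched with groupby(...).transform, which returns values aligned to the original rows. The column names Country and Population, the toy numbers, and the cutoff of 2 are assumptions made for illustration; they are not taken from the Kaggle file.

```python
import pandas as pd

# Toy data: country "A" has one suspicious value (100), country "B" has none.
data = pd.DataFrame({
    "Country": ["A"] * 10 + ["B"] * 10,
    "Population": [10, 11, 9, 10, 12, 8, 10, 11, 9, 100,
                   20, 21, 19, 20, 22, 18, 20, 21, 19, 20],
})

# Per-country mean and sample standard deviation, broadcast back to every row.
grouped = data.groupby("Country")["Population"]
data["z"] = (data["Population"] - grouped.transform("mean")) / grouped.transform("std")

# Flag rows whose z-score is far from the country's own mean.
threshold = 2.0  # assumed cutoff; choose one that suits the data
outliers = data[data["z"].abs() > threshold]
cleaned = data[data["z"].abs() <= threshold]
print(outliers)
```

Note that with only a handful of rows per group the sample z-score cannot grow very large, so a cutoff of 3 may never trigger on short series.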

dataframe columns to multiindex dataframe

I have the following data in an Excel sheet and I want to read it as a multiindex dataframe:
Y1 Y1 Y2 Y2
B H1 H2 H1 H2
1 80 72 79.2 84.744
2 240 216 237.6 254.232
3 40 36 39.6 42.372
4 160 144 158.4 169.488
5 240 216 237.6 254.232
6 0 0 0 0
I am reading it as:
DATA = pd.read_excel('data.xlsx',sheet_name=None)
since I am reading other sheets too.
Question 1:
This data is not read as multi-index data. How do I make it read it as multi-index? Or maybe I should read it as a dataframe and then convert it to multi-index?
Current result of reading as dataframe
DATA['Load']
Y1 Y1.1 Y2 Y2.1
bus H1 H2 H1 H2
1 80 72 79.2 84.744
2 240 216 237.6 254.232
3 40 36 39.6 42.372
4 160 144 158.4 169.488
5 240 216 237.6 254.232
6 0 0 0 0
Question 2 and probably the more fundamental question:
How do I deal with multi-indexing when one or more of the indexes are on the columns side? In this example, I want to access the data by specifying B, Y, H. I know how to work with multi-index when they are all as index, but can't get the hang of it when the indexes are on the columns.
Thank you very much for your help :)
PS:
Another sheet may look like the following:
from to x ratea
1 2 0.4 10
1 4 0.6 80
1 5 0.2 10
2 3 0.2 10
2 4 0.4 10
2 6 0.3 10
3 5 0.2 10
4 6 0.3 10
where I will set from and to as the index (set_index(['from','to'])) to get a multi-index dataframe.
To read a sheet like this as a multi-index dataframe, use the header parameter of pd.read_excel():
df = pd.read_excel('myFile.xlsx', header=[0,1])
Y1 Y2
B H1 H2 H1 H2
1 80 72 79.2 84.744
2 240 216 237.6 254.232
3 40 36 39.6 42.372
4 160 144 158.4 169.488
5 240 216 237.6 254.232
6 0 0 0.0 0.000
This tells pandas that the sheet has two header rows, 0 and 1.
After our conversation:
df = pd.read_excel('Book2.xlsx', header=[0,1])
df2 = df.unstack().to_frame()
idx = df2.swaplevel(0,2).swaplevel(1,2).index.set_names(['B', 'Y', 'H'])
df2.set_index(idx, inplace=True)
0
B Y H
1 Y1 H1 80.000
2 Y1 H1 240.000
3 Y1 H1 40.000
4 Y1 H1 160.000
5 Y1 H1 240.000
6 Y1 H1 0.000
1 Y1 H2 72.000
2 Y1 H2 216.000
3 Y1 H2 36.000
4 Y1 H2 144.000
5 Y1 H2 216.000
6 Y1 H2 0.000
1 Y2 H1 79.200
2 Y2 H1 237.600
3 Y2 H1 39.600
4 Y2 H1 158.400
5 Y2 H1 237.600
6 Y2 H1 0.000
1 Y2 H2 84.744
2 Y2 H2 254.232
3 Y2 H2 42.372
4 Y2 H2 169.488
5 Y2 H2 254.232
6 Y2 H2 0.000
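On the second question, a frame with a MultiIndex on the columns can be accessed directly, without moving the levels to the rows first. The sketch below rebuilds the sheet by hand instead of reading the Excel file; the level names B, Y and H follow the question, and the variable names load and long are illustrative.

```python
import pandas as pd

# Rebuild the "Load" sheet with two column levels (Y, H) and B as the index.
columns = pd.MultiIndex.from_product([["Y1", "Y2"], ["H1", "H2"]], names=["Y", "H"])
load = pd.DataFrame(
    [
        [80, 72, 79.2, 84.744],
        [240, 216, 237.6, 254.232],
        [40, 36, 39.6, 42.372],
        [160, 144, 158.4, 169.488],
        [240, 216, 237.6, 254.232],
        [0, 0, 0, 0],
    ],
    index=pd.Index(range(1, 7), name="B"),
    columns=columns,
)

# Single value: row label first, then a tuple with one entry per column level.
value = load.loc[2, ("Y2", "H1")]

# All H columns for one Y: select on the outer column level.
y1_block = load["Y1"]

# One H across every Y: take a cross-section on the inner column level.
h1_block = load.xs("H1", axis=1, level="H")

# Move both column levels to the rows to get a Series indexed by (B, Y, H).
long = load.unstack().reorder_levels(["B", "Y", "H"])
print(value, long.loc[(2, "Y2", "H1")])
```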

How to get last written data location in a PDF page during PDF creation using iTextSharp?

I am adding hundreds of tables one after the other to a PDF document using iTextSharp. The problem is that we don't know when to start a new page, and sometimes half of a table stays on the current page while the other half goes to the next one. Is there any way to get the last written location so that I can decide whether to create a new page?
I found some code on Stack Overflow, but none of it worked for me.
I tried to get the location using the code below before adding new data to the document.
float y = PdfPageHeight;
for (int i = 0; i < 100; i++)
{
    if (y <= document.BottomMargin)
    {
        document.NewPage();
    }
    mainTableHeader = new PdfPTable(1);
    mainTableHeader.SetWidthPercentage(new float[] { PageSize.A4.Width }, PageSize.A4);
    AddContent(ref mainTableHeader); // Adding some cells to the table
    document.Add(mainTableHeader);
    y = writer.GetVerticalPosition(false);
}
Please help me if any one knows how to do this.
I took your code with minor changes (replaced your unknown method AddContent with code actually adding some cells to the table; added some Console outputs):
using (Document document = new Document())
{
    PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(dest, FileMode.Create, FileAccess.Write));
    document.Open();
    float y = document.PageSize.Height;
    for (int i = 0; i < 100; i++)
    {
        if (y <= document.BottomMargin)
        {
            Console.Write("New Page!\n");
            document.NewPage();
        }
        PdfPTable mainTableHeader = new PdfPTable(1);
        mainTableHeader.SetWidthPercentage(new float[] { PageSize.A4.Width }, PageSize.A4);
        mainTableHeader.AddCell("Test");
        mainTableHeader.AddCell(writer.GetVerticalPosition(false).ToString());
        mainTableHeader.AddCell(writer.GetVerticalPosition(false).ToString());
        document.Add(mainTableHeader);
        y = writer.GetVerticalPosition(false);
        Console.Write("After table {0} y is at {1}\n", i, y);
    }
}
Running this code I see on the console:
After table 0 y is at 758
After table 1 y is at 710
After table 2 y is at 662
After table 3 y is at 614
After table 4 y is at 566
After table 5 y is at 518
After table 6 y is at 470
After table 7 y is at 422
After table 8 y is at 374
After table 9 y is at 326
After table 10 y is at 278
After table 11 y is at 230
After table 12 y is at 182
After table 13 y is at 134
After table 14 y is at 86
After table 15 y is at 38
After table 16 y is at 758
After table 17 y is at 710
After table 18 y is at 662
After table 19 y is at 614
After table 20 y is at 566
After table 21 y is at 518
After table 22 y is at 470
After table 23 y is at 422
After table 24 y is at 374
After table 25 y is at 326
After table 26 y is at 278
After table 27 y is at 230
After table 28 y is at 182
After table 29 y is at 134
After table 30 y is at 86
After table 31 y is at 38
After table 32 y is at 758
After table 33 y is at 710
After table 34 y is at 662
After table 35 y is at 614
After table 36 y is at 566
After table 37 y is at 518
After table 38 y is at 470
After table 39 y is at 422
After table 40 y is at 374
After table 41 y is at 326
After table 42 y is at 278
After table 43 y is at 230
After table 44 y is at 182
After table 45 y is at 134
After table 46 y is at 86
After table 47 y is at 38
After table 48 y is at 758
After table 49 y is at 710
After table 50 y is at 662
After table 51 y is at 614
After table 52 y is at 566
After table 53 y is at 518
After table 54 y is at 470
After table 55 y is at 422
After table 56 y is at 374
After table 57 y is at 326
After table 58 y is at 278
After table 59 y is at 230
After table 60 y is at 182
After table 61 y is at 134
After table 62 y is at 86
After table 63 y is at 38
After table 64 y is at 758
After table 65 y is at 710
After table 66 y is at 662
After table 67 y is at 614
After table 68 y is at 566
After table 69 y is at 518
After table 70 y is at 470
After table 71 y is at 422
After table 72 y is at 374
After table 73 y is at 326
After table 74 y is at 278
After table 75 y is at 230
After table 76 y is at 182
After table 77 y is at 134
After table 78 y is at 86
After table 79 y is at 38
After table 80 y is at 758
After table 81 y is at 710
After table 82 y is at 662
After table 83 y is at 614
After table 84 y is at 566
After table 85 y is at 518
After table 86 y is at 470
After table 87 y is at 422
After table 88 y is at 374
After table 89 y is at 326
After table 90 y is at 278
After table 91 y is at 230
After table 92 y is at 182
After table 93 y is at 134
After table 94 y is at 86
After table 95 y is at 38
After table 96 y is at 758
After table 97 y is at 710
After table 98 y is at 662
After table 99 y is at 614
Thus, your claims in comments that
every time I calculate the "y" value it remains constant
or
every time I add a new table to document I use to do that test. And every time I am getting same value.
cannot be reproduced with your code: y obviously is wildly changing all the time.
Thus, if you want help, please do provide sample code that allows people to reproduce your problem, not to disprove your claims.
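The numbers in the log are consistent with simple arithmetic: an A4 page 842 points high, 36-point top and bottom margins, and tables 48 points tall. The following Python sketch is only a model of that arithmetic, not iTextSharp code. All constants are inferred from the output above, and the function name simulate is hypothetical. In this model y never drops to the bottom margin, because a table that no longer fits starts at the top of the next page, which would also explain why "New Page!" never appears in the log.

```python
PAGE_HEIGHT = 842.0    # A4 height in points (assumed)
TOP_MARGIN = 36.0      # assumed; gives a starting position of 806
BOTTOM_MARGIN = 36.0   # assumed
TABLE_HEIGHT = 48.0    # inferred from the step 758 -> 710 in the log


def simulate(n_tables):
    """Return the modelled vertical position after each table is added."""
    positions = []
    y = PAGE_HEIGHT - TOP_MARGIN
    for _ in range(n_tables):
        # If the table does not fit above the bottom margin,
        # it starts at the top of a fresh page instead.
        if y - TABLE_HEIGHT < BOTTOM_MARGIN:
            y = PAGE_HEIGHT - TOP_MARGIN
        y -= TABLE_HEIGHT
        positions.append(y)
    return positions


positions = simulate(100)
print(positions[:3], positions[15], positions[16])
```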

curve fitting - linear regression

I have the following values for the Xs and the corresponding values of Y:
X1 X2 X3 X4 X5 X6 Y
13 14 15 16 16 N/A 25587
13 14 20 22 22 25 19672
16 17 18 23 27 30 9652
23 23 25 26 28 N/A 3603
To find a formula that expresses the relationship, I used multiple linear regression
(with X1, X2, X3, X4, and X5), but unfortunately the error was too large.
Is there another statistical method that I can use to model the relationship (which might be nonlinear)?
