Sorting and Grouping in Pandas data frame column alphabetically - python-3.x

I want to sort and group by a pandas data frame column alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" such that it is in alphabetical order and is grouped as well i.e., the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?

You can use the sort_values method of pandas:
result = dataframe.sort_values('a')
This sorts the dataframe by column a. Because equal values end up next to each other, the rows are effectively grouped as well.
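A minimal sketch of that idea on the question's data follows. The mergesort option and the extra index step are my own additions, not part of the answer: mergesort is a stable sort, so rows keep their original order within each value of a, and moving a into the index is one way to have pandas print each label once per block.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "a": ["sales", "purchase", "purchase", "sales", "purchase"],
    "b": [2, 130, 10, 122, 103],
    "c": [None, 230.0, 20.0, 245.0, 320.0],
})

# Stable sort: rows with the same "a" keep their original relative order
result = df.sort_values("a", kind="mergesort")

# Optional (assumption about the desired display): put "a" in front of the
# original index so repeated labels are shown once per block when printed
grouped_view = result.set_index("a", append=True).swaplevel(0, 1)

print(result)
print(grouped_view)
```

If you need per-group calculations rather than just a grouped display, df.groupby('a') is the tool for that; sorting alone only reorders rows.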

Related

how to update rows based on previous row of dataframe python

I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
My real data is high dimensional; this is a simplified version with two value columns {price, amount}. I am trying to transform it into changes relative to the previous date, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the change for each product from one date to the next. If no previous date exists for a product, the value should be NaN.
Is there a function that does this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
output :
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
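For reference, here is a self-contained sketch of the same groupby/diff approach. The value_cols list and the explicit sort by product and date are assumptions on my part: diff() compares each row with the previous row of the same group, so the rows need to be in date order for the result to mean "change since the previous date".

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["11/01/2019", "11/02/2019", "11/03/2019", "11/04/2019", "11/05/2019"],
        format="%m/%d/%Y",
    ),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 25, 40, 50],
    "amount": [20, 20, 15, 50, 60],
})

# Hypothetical list of value columns; extend it for high-dimensional data
value_cols = ["price", "amount"]

# Make sure rows are in date order within each product before differencing
df = df.sort_values(["product", "date"])

# First row of every product has no predecessor, so it becomes NaN
df[value_cols] = df.groupby("product")[value_cols].diff()

print(df)
```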

Select top n columns based on another column

I have a database as the following:
I would like to obtain a pandas dataframe that keeps, for each date, the 2 rows with the highest population. The output should look like this:
I know that pandas offers a method called nlargest:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it is usable for this use case. Is there any workaround?
Thanks so much in advance!
I have mimicked your dataframe below and provided one way to get the desired output; I hope it is helpful.
Your Dataframe:
>>> df
Date country population
0 2019-12-31 A 100
1 2019-12-31 B 10
2 2019-12-31 C 1000
3 2020-01-01 A 200
4 2020-01-01 B 20
5 2020-01-01 C 3500
6 2020-01-01 D 12
7 2020-02-01 D 2000
8 2020-02-01 E 54
Your Desired Solution:
You can use the nlargest method along with the set_index and groupby methods.
This is what you will get:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date country
2019-12-31 C 1000
A 100
2020-01-01 C 3500
A 200
2020-02-01 D 2000
E 54
Name: population, dtype: int64
Since you want a regular DataFrame again, reset the index, which gives the following:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Another way around:
With groupby and apply, use reset_index with the parameters drop=True and level=:
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
# df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
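A third option, not from the answer above, is to sort first and then take the head of each group. This is only a sketch on the mimicked dataframe; it avoids apply and keeps all original columns without any index juggling.

```python
import pandas as pd

# Same mimicked data as in the answer
df = pd.DataFrame({
    "Date": ["2019-12-31"] * 3 + ["2020-01-01"] * 4 + ["2020-02-01"] * 2,
    "country": list("ABCABCDDE"),
    "population": [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
})

# Sort so the largest populations come first within each date,
# then keep the first 2 rows of every date group
top2 = (
    df.sort_values(["Date", "population"], ascending=[True, False])
      .groupby("Date")
      .head(2)
      .reset_index(drop=True)
)

print(top2)
```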

New dataframe row by row from a different format dataframe

I'm wondering if something like this could be achieved with Python.
I currently have the following dataframe (df1):
A B C D E F
1.1.1 amba 131 1 50 4
2.2.2 erto 50 7 131 8
3.3.3 gema 131 2 50 5
And I would like to get this output in a new dataframe (df2):
ID User 131 50
1.1.1 amba 1 4
2.2.2 erto 8 7
3.3.3 gema 2 5
Keep in mind that df1 has an undetermined number of rows and df2 should have the same number of rows as df1. The first and second columns stay the same. Columns C and E in df1 store attribute IDs, while columns D and F store the attributes' values. For example, in the first row of df1, 131=1 and 50=4. Attribute IDs are not always in the same column: a given attribute ID can appear in column C or in column E.
I am thinking of creating df2 with a loop and analyzing rows with a lambda, but so far I have not managed to get anything working. Any ideas?
I have understood every part of the code and I am now adding columns, but I am wondering whether this could be done with a loop or something similar. This is how the code looks after adding 4 extra columns:
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 131 1 50 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep=r"\s+")
df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
])
.rename(columns={"A":"ID","B":"User"})
.set_index(["ID","User","key"])
.unstack(2)
.reset_index()
)
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
print(df2)
And this is the output:
ID User 40 50 131 150
0 1.1.1 amba 3 4 1 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
So everything is working fine, but I would like to find a way to do this with a loop instead of writing many lines (I have about 70 columns per row). Thank you very much for the help.
I have just one more question. In my actual table some rows have 60 columns and others have only 30 or so, which means the rows with fewer columns contain many NaN values, and I get an error when I try to unstack. I have read about pivot_table, drop_duplicates, etc., but I am not sure how to make any of these options work with this code. Thanks!
Logically, your keys are partly in the rows and partly in the columns. Use concat() to construct a dataframe in which the whole key is part of the row. Then it is a simple case of using unstack() to get what you want:
df1 = pd.read_csv(io.StringIO(""" A B C D E F
1.1.1 amba 131 1 50 4
2.2.2 erto 50 7 131 8
3.3.3 gema 131 2 50 5"""), sep=r"\s+")
df2 = (pd.concat([df1.drop(columns=["C","D"]).rename(columns={"E":"key","F":"val"}),
df1.drop(columns=["E","F"]).rename(columns={"C":"key","D":"val"}),
])
.rename(columns={"A":"ID","B":"User"})
.set_index(["ID","User","key"])
.unstack(2)
.reset_index()
)
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
output
ID User 50 131
1.1.1 amba 4 1
2.2.2 erto 7 8
3.3.3 gema 5 2
......
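Regarding the follow-up questions (many column pairs, and rows with fewer pairs), here is a rough sketch of how the concat step could be built in a loop. It assumes that every column after the first two belongs to a consecutive (key, value) pair, and that a missing pair shows up as NaN in both columns; the names id_cols, pairs and long_df are made up for the example. Dropping the rows whose key is NaN before unstack() is my guess at what avoids the duplicate-index error.

```python
import io
import pandas as pd

# Small made-up table: the second row has one pair fewer than the others
df1 = pd.read_csv(io.StringIO("""A B C D E F G H
1.1.1 amba 131 1 50 4 40 3
2.2.2 erto 50 7 131 8 NaN NaN
3.3.3 gema 131 2 50 5 40 1"""), sep=r"\s+")

id_cols = ["A", "B"]
pair_cols = [c for c in df1.columns if c not in id_cols]
# Consecutive columns form (key, value) pairs: (C, D), (E, F), (G, H), ...
pairs = list(zip(pair_cols[0::2], pair_cols[1::2]))

# One slice per pair, all renamed to the same key/val columns, stacked together
long_df = pd.concat(
    [df1[id_cols + [k, v]].rename(columns={k: "key", v: "val"}) for k, v in pairs],
    ignore_index=True,
)

# Rows with fewer pairs leave NaN keys behind; drop them before unstacking
long_df = long_df.dropna(subset=["key"])
long_df["key"] = long_df["key"].astype(int)

df2 = (
    long_df.rename(columns={"A": "ID", "B": "User"})
           .set_index(["ID", "User", "key"])["val"]
           .unstack("key")
           .reset_index()
)
df2.columns.name = None

print(df2)
```

Selecting the single val column before unstack() also means the columns come out flat, so the separate flattening step is not needed in this variant.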

Reshape Pandas dataframe based on values in two columns

In Python, I would like to search through all rows in the dataframe with two possible paths (dataframe is populated from csv files). If the 'Group' column for a given row is zero, move that row's data to the next row of a new dataframe using the 'Channel_1' and 'Data_1' columns. If the 'Group' column for a given row is non-zero, then get all three rows with the same 'Group' column value (also identified by 'sub-group' column as 1, 2 or 3) and add to the next row of the new dataframe.
Code to generate dataframe from csv file:
import glob
import pandas as pd
for name in glob.glob(search_string):
    r_file = pd.read_csv(name)
Current Data Format:
Channel_Num Group Sub_Group Data
1000 1 1 100
1001 1 2 105
1002 1 3 110
1003 0 0 200
1004 2 1 400
1005 2 2 405
1006 2 3 410
1007 0 0 500
Desired Data Format:
Group Channel_1 Data_1 Channel_2 Data_2 Channel_3 Data_3
1 1000 100 1001 105 1002 110
0 1003 200 NaN NaN NaN NaN
2 1004 400 1005 405 1006 410
0 1007 500 NaN NaN NaN NaN
I've tried the GroupBy and pivot_table methods but without success. After the data is in the desired format, there are other calculations that need run against the newly organized data but getting it in the desired format is the key.
This is essentially a pivot problem once an additional key has been created with diff and cumsum, so I am using pivot_table and then flattening the multi-level columns:
df.loc[df.Sub_Group==0,'Sub_Group']=1
df['newkey']=df.Group.diff().ne(0).cumsum()
s=df.pivot_table(index=['Group','newkey'],columns=['Sub_Group'],values=['Channel_Num','Data'],aggfunc='first').sort_index(level=1,axis=1)
s.columns=s.columns.map('{0[0]}_{0[1]}'.format)
s.reset_index(level=0).sort_index()
Out[25]:
Group Channel_Num_1 Data_1 ... Data_2 Channel_Num_3 Data_3
newkey ...
1 1 1000.0 100.0 ... 105.0 1002.0 110.0
2 0 1003.0 200.0 ... NaN NaN NaN
3 2 1004.0 400.0 ... 405.0 1006.0 410.0
4 0 1007.0 500.0 ... NaN NaN NaN
[4 rows x 7 columns]
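To make the key construction easier to follow, here is a self-contained variation on the same idea. It is not the answer's exact code: the run key additionally starts a new run on every Group == 0 row (an assumption, so that two consecutive zero rows would not be merged), the position inside each run comes from cumcount instead of Sub_Group, and Channel_Num is renamed to Channel to match the desired headers.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "Channel_Num": range(1000, 1008),
    "Group": [1, 1, 1, 0, 2, 2, 2, 0],
    "Sub_Group": [1, 2, 3, 0, 1, 2, 3, 0],
    "Data": [100, 105, 110, 200, 400, 405, 410, 500],
})

# A new run starts where Group changes, or on any Group == 0 row
run_id = (df["Group"].ne(df["Group"].shift()) | df["Group"].eq(0)).cumsum()

# Position of each row inside its run: 1, 2, 3, ...
slot = df.groupby(run_id).cumcount() + 1

wide = (
    df.rename(columns={"Channel_Num": "Channel"})
      .assign(run=run_id, slot=slot)
      .pivot(index=["run", "Group"], columns="slot", values=["Channel", "Data"])
      .sort_index(axis=1, level=1)  # interleave as Channel_1, Data_1, Channel_2, ...
)

# Flatten the two-level column labels, then drop the helper key
wide.columns = [f"{name}_{i}" for name, i in wide.columns]
wide = wide.reset_index().drop(columns="run")

print(wide)
```

Passing a list as index to pivot needs a reasonably recent pandas version; on older versions pivot_table with aggfunc='first', as in the answer, does the same job.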

Using Pandas filtering non-numeric data from two columns of a Dataframe

I'm loading a Pandas dataframe which has many data types (loaded from Excel). Two particular columns should be floats, but occasionally a researcher entered a random comment like "not measured." I need to drop every row where the value in either of these two columns is not a number, while preserving non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows...)
import pandas as pd
df = pd.DataFrame(dict(A = pd.Series([1,2,3,4,5]), B = pd.Series([96,33,45,'',8]), C = pd.Series([12,'Not measured',15,66,42]), D = pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))
Which results in this data table:
A B C D
0 1 96 12 apples
1 2 33 Not measured oranges
2 3 45 15 peaches
3 4 66 plums
4 5 8 42 pears
I'm not clear how to get to this table:
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
I tried dropna, but the types are "object" since there are non-numeric entries.
I can't convert the values to floats without either converting the whole table, or doing one series at a time which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?
You can first take the subset with columns B and C, apply to_numeric, and check that all values in each row are not null. Then use boolean indexing:
print(df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1))
0 True
1 False
2 True
3 False
4 True
dtype: bool
print(df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
Another solution uses str.isdigit with isnull, notnull and xor (^):
print(df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull())
0 True
1 False
2 True
3 False
4 True
dtype: bool
print(df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()])
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
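The solution with to_numeric, isnull and notnull is the fastest. Note that both xor (^) variants rely on at most one of the two columns being invalid in any given row; a row where B and C are both non-numeric would be kept.
print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
^ pd.to_numeric(df['C'], errors='coerce').isnull()])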
A B C D
0 1 96 12 apples
2 3 45 15 peaches
4 5 8 42 pears
Timings:
#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)
In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop
In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop
In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.49 ms per loop
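If rows can have bad values in both columns at once, combining the per-column checks with "all" (or &) rather than xor is the safer choice. The sketch below uses the question's data plus one extra, made-up row (A=6) where both B and C are invalid, and also writes the converted float values back, which the question asked about. I have not timed this variant.

```python
import pandas as pd

# Question's data plus one hypothetical row where both B and C are invalid
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": [96, 33, 45, "", 8, "n/a"],
    "C": [12, "Not measured", 15, 66, 42, "n/a"],
    "D": ["apples", "oranges", "peaches", "plums", "pears", "kiwis"],
})

# Anything that cannot be parsed as a number becomes NaN
numeric = df[["B", "C"]].apply(pd.to_numeric, errors="coerce")

# Keep only rows where BOTH columns converted successfully
clean = df[numeric.notna().all(axis=1)].copy()

# Replace the object columns with their float versions; other columns untouched
clean[["B", "C"]] = numeric.loc[clean.index]

print(clean)
print(clean.dtypes)
```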
