Pandas DataFrame: Groupby.First - Index limitation? - python-3.x

I have the DataFrame t below:
import pandas as pd
t = pd.DataFrame(data=[['AFG', 'Afghanistan', 38928341],
                       ['CHE', 'Switzerland', 8654618],
                       ['SMR', 'San Marino', 33938]],
                 columns=['iso_code', 'location', 'population'])
g = t.groupby('location')
g.size()
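For reference, g.size() here prints:
location
Afghanistan    1
San Marino     1
Switzerland    1
dtype: int64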
I can see that each group contains only one record, which is expected.
However, when I run the code below, it doesn't raise any error:
g.first(10)
It shows:
             population
location
Afghanistan    38928341
San Marino        33938
Switzerland     8654618
My understanding was that first(n) returns the nth record of each group, but each of my location groups has only one record - so how did pandas give me that record?
Thanks

I think you're looking for g.nth(10).
g.first(10) is NOT doing what you think it is. The first (optional) parameter of first is numeric_only and takes a boolean, so you're actually running g.first(numeric_only=True) as bool(10) evaluates to True.
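A quick way to confirm this (a minimal sketch; iso_code is non-numeric, which is why it disappears from the output above):
g.first(10).equals(g.first(numeric_only=True))  # True, since bool(10) is True
g.first(numeric_only=False)                     # keeps iso_code as well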

After reading the comments from mozway and Henry Ecker/sammywemmy, I finally got it.
t = pd.DataFrame(data=[['AFG', 'Afghanistan', 38928341, 'A1'],
                       ['CHE', 'Switzerland', 8654618, 'C1'],
                       ['SMR', 'San Marino', 33938, 'S1'],
                       ['AFG', 'Afghanistan', 38928342, 'A2'],
                       ['AFG', 'Afghanistan', 38928343, 'A3']],
                 columns=['iso_code', 'location', 'population', 'code'])
g = t.groupby('location')
Then
g.nth(0)
g.nth(1)
g.first(True)
g.first(False)
g.first(min_count=2)
show the difference.
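Roughly what each call returns (a sketch, assuming pandas >= 2.0, where nth acts as a row filter):
g.nth(0)              # the first original row of each group
g.nth(1)              # only Afghanistan's second row; the other groups have no row 1
g.first(True)         # first(numeric_only=True): population column only
g.first(False)        # first row per group, all columns
g.first(min_count=2)  # NaN for groups with fewer than 2 non-null values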

Related

Transpose or consolidate Dataframe

Got a tricky situation. I tried my best via pivot and other methods but gave up. Please help if possible.
I'd like to take each value = 1 in a medicine column and replace it with the Date of that row.
After this mapping, the 'Date' field is no longer needed, so I am OK with deleting it.
My sample dataset:
df1 = pd.DataFrame({'Patient': ['John', 'John', 'John', 'Smith', 'Smith', 'Smith'],
                    'Date': [20200101, 20200102, 20200105, 20220101, 20220102, 20220105],
                    'Ibrufen': ['NaN', 'NaN', 1, 'NaN', 'NaN', 1],
                    'Tylenol': [1, 'NaN', 'NaN', 1, 'NaN', 'NaN'],
                    })
My desired output:
df2 = pd.DataFrame({'Patient': ['John', 'Smith'],
                    'Ibrufen': ['20200105', '20220105'],
                    'Tylenol': ['20200101', '20220101'],
                    'Steroid': ['20200102', '20220102'],
                    })
A possible solution, based on the idea of first creating an auxiliary column containing, for each row, the corresponding medicine:
df1['aux'] = df1.apply(lambda x:
                       'Ibrufen' if (x['Ibrufen'] == 1) else
                       'Tylenol' if (x['Tylenol'] == 1) else
                       'Steroid', axis=1)
(df1.pivot(index='Patient', columns='aux', values='Date')
    .reset_index()
    .rename_axis(None, axis=1))
Output:
  Patient   Ibrufen   Steroid   Tylenol
0    John  20200105  20200102  20200101
1   Smith  20220105  20220102  20220101
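If apply is too slow on larger data, a vectorized sketch for building the same auxiliary column (it assumes, as above, that at most one medicine is 1 per row):
med = df1[['Ibrufen', 'Tylenol']].eq(1)
df1['aux'] = med.idxmax(axis=1).where(med.any(axis=1), 'Steroid')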

Dealing with duplicates in a pandas query

I have the following DataFrame:
data = {'Customer_ID': ['123', '2', '1010', '123'],
        'Date_Create': ['12/08/2010', '04/10/1998', '27/05/2010', '12/08/2010'],
        'Purchase': [1, 1, 0, 1]
        }
df = pd.DataFrame(data, columns=['Customer_ID', 'Date_Create', 'Purchase'])
I want to perform this query:
df_2 = (df[['Customer_ID', 'Date_Create', 'Purchase']]
        .groupby(['Customer_ID'], as_index=False)
        .sum()
        .sort_values(by='Purchase', ascending=False))
The objective of this query is to sum all purchases (a boolean field) and output a DataFrame with 3 columns: 'Customer_ID', 'Date_Create', 'Purchase'.
The problem is that Date_Create drops out of the query: it is duplicated per customer, since the creation date of the account does not change.
How can I solve it?
Thanks
If I'm understanding it correctly and your source data has some duplicates, there's a function specifically for this: DataFrame.drop_duplicates()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
To only consider some columns in the duplicate check, use subset:
df2 = df.drop_duplicates(subset=['Customer_ID','Date_Create'])
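For the sample data this keeps rows 0-2 and drops the duplicated row 3 (a sketch of the output):
  Customer_ID Date_Create  Purchase
0         123  12/08/2010         1
1           2  04/10/1998         1
2        1010  27/05/2010         0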
You can add the column Date_Create to the groupby if it has the same value per Customer_ID:
(df.groupby(['Customer_ID', 'Date_Create'], as_index=False)['Purchase']
   .sum()
   .sort_values(by='Purchase', ascending=False))
If not, use some aggregation function - e.g. GroupBy.first for the first date per group:
(df.groupby('Customer_ID')
   .agg(Purchase=('Purchase', 'sum'), Date_Create=('Date_Create', 'first'))
   .reset_index()
   .sort_values(by='Purchase', ascending=False))
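Roughly the expected result (a sketch; the index labels reflect the order before sorting):
  Customer_ID  Purchase Date_Create
1         123         2  12/08/2010
2           2         1  04/10/1998
0        1010         0  27/05/2010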

Group pandas elements according to a column

I have the following pandas dataframe:
import pandas as pd
data = {'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                      'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
        'Time': [1, 0, 0, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data)
print(df)
I was wondering how to extract the "Sentences" according to the "Time" column. I want to gather the sentences from each "1" up to the last "0" before the next "1".
Maybe the expected output explains it better:
[['Sentence1', 'Sentence2', 'Sentence3'], ['Sentences4', 'Sentences5', 'Sentences6'], ['Sentences7', 'Sentences8']]
Is this somehow possible ? Sorry, I am very new to pandas.
Try this:
s = df['Time'].cumsum()
df.set_index([s, df.groupby(s).cumcount()])['Sentences'].unstack().to_numpy().tolist()
Output:
[['Sentence1', 'Sentence2', 'Sentence3'],
['Sentences4', 'Sentences5', 'Sentences6'],
['Sentences7', 'Sentences8', nan]]
Details:
Use cumsum to build a group id that increments at each Time = 1 and covers the following Time = 0 rows.
Next, use groupby with cumcount to number the rows within each group.
Lastly, use set_index and unstack to reshape the dataframe.
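If the trailing nan padding from unstack is unwanted, a shorter sketch that collects each group directly into a list:
s = df['Time'].cumsum()
out = df.groupby(s)['Sentences'].agg(list).tolist()
# [['Sentence1', 'Sentence2', 'Sentence3'],
#  ['Sentences4', 'Sentences5', 'Sentences6'],
#  ['Sentences7', 'Sentences8']]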

How do I set an unnamed column as the index?

In all the examples I have found, a column name is usually required to set it as the index.
Instead of going into Excel to add a column header, I was wondering if it's possible to set a column with an empty header as the index. The column has all the values I want, but lacks a name:
My script is currently:
import pandas as pd
data = pd.read_csv('file.csv')
data
You could also just select the column by position with iloc:
data = data.set_index(data.iloc[:, 0])
Or when you call pd.read_csv(), specify index_col:
data = pd.read_csv('path.csv', index_col=0)
You don't need to rename the first column in Excel; it's just as easy in pandas:
new_columns = data.columns.values
new_columns[0] = 'Month'
data.columns = new_columns
Afterwards, you can set the index:
data = data.set_index('Month')
You can do as follows:
import pandas as pd
data = pd.read_csv('file.csv',index_col=0)
data
When I have encountered columns missing names, pandas always names them 'Unnamed: n', where n = column number - 1, i.e. 'Unnamed: 0' for the first column, 'Unnamed: 1' for the second, etc. So in your case the following code should be useful:
# set your column as the dataframe index
data.index = data['Unnamed: 0']
# now delete the column
data.drop('Unnamed: 0', axis=1, inplace=True)
# also clear the index name, which is still 'Unnamed: 0'
# (del data.index.name no longer works in recent pandas)
data.index.name = None
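The same steps in one chained call (a sketch; set_index drops the column for you, and rename_axis(None) clears the leftover index name):
data = data.set_index('Unnamed: 0').rename_axis(None)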

Adding a column according to rows found by str.contains() - pandas

How do I add a column according to the values found with str.contains? I am looking for men's names and adding a gender column.
df[df.loc[:,'name'].str.contains("John|Jon")]['gender'] = 'male'
I thought this would work, but then
df
returns df without the new column. What is the best way to make these kinds of changes?
Thank you
import pandas as pd
df = pd.DataFrame({"name": ["John|Jon", "ABC"], "age": [34, 45]})
df["gender"] = "unknown"
# assign through .loc; chained indexing like df["gender"][mask] = ...
# triggers SettingWithCopyWarning and can fail to modify df
df.loc[df["name"].str.contains("John"), "gender"] = "male"
