I know I can drop NaN rows from a DataFrame with df.dropna(). But what if I want to move those NaN rows to a new DataFrame?
Dataframe looks like
FNAME, LNAME, ADDRESS, latitude, logitude, altitude
BOB, JONES, 555 Seaseme Street, 38.00,-91.00,0.0
JOHN, GREEN, 111 Maple Street, 34.00,-75.00,0.0
TOM, SMITH, 100 A Street, 20.00,-80.00,0.0
BETTY, CROCKER, 5 Elm Street, NaN,NaN,NaN
I know I can group and move to a new DataFrame like this
grouped = df.groupby(df.FNAME)
df1 = grouped.get_group("BOB")
and it will give me a new DataFrame with FNAME of BOB but when I try
grouped = df.groupby(df.altitude)
df1 = grouped.get_group("NaN")
I get a KeyError: 'NaN'. So how can I group by Nan or Null values?
Assuming you're satisfied that all 'Nan' values in a column are to be grouped together, what you can do is use DataFrame.fillna() to convert the 'Nan' into something else, to be grouped.
df.fillna(value={'altitude':'null_altitudes'}
This fills every null in the altitude column with the string 'null_altitudes'. If you do a groupby now, all 'null_altitudes' will be together. You can fill multiple columns at once using multiple key value pairs: values = {'col_1':'val_1', 'col_2':'val_2', etc}
You can use isna with any on rows:
# to get rows with NA in a new df
df1 = df[df.isna().any(axis=1)]
Related
Hello
I have 2 dataframes one old and one new. After comparing the 2 dataframes I want output generated with the column names for each id and the only changes in values as shown below.
I could merge the 2 dataframes and find the differences for each column separately like
a=df1.merge(df2, on='Ids')
a[a['ColA_x'] != a['ColA_y']]
But I have 80 columns and I want to get the difference with column names and values as shown in the output. Is there a way to do this?
Stack each dataframe to convert column names into row indexes. Concatenate the dataframes side by side:
combined = pd.concat([df1.stack(), df2.stack()], axis=1)
Now, extract the rows with the values that do not match:
combined[combined[0] != combined[1]]
# 0 1
#Ids
#123 ColA AH AB
#234 ColB GO MO
#456 ColA GH AB
#...
I have a Table in Excel that has this structure:
Country
County
20/01/2020
21/01/2020
Country
County
Value 1
Value 2
I would like to be able to convert the table into the following format so that I could add it in a table in my database.
Country
County
Date
Values
Country
County
20/01/2020
Value 1
Country
County
20/01/2020
Value 2
Is there a quick way to do this or should I iterate over each row and create a dataframe from there? The excel file has millions of entries.
pd.melt is what you're looking for:
pd.melt(df, id_vars = ['Country','County'], var_name = 'Date')
supply the dataframe as the first argument, then id_vars tells the function which columns you want to keep for each row. the rest of the columns will be turned into values in a new column in the melted dataframe. then var_name says what you want that new column to be called.
I have the following data, with each row per ID and DATE. A person with the same ID can occupy multiple rows, hence multiple dates. I want to aggregate it into one person (or ID) per row, and the dates will be aggregated into a list of date
From this
ID DATE
1 2012-03-04
1 2013-04-15
1 2019-01-09
2 2013-04-09
2 2016-01-01
2 2018-05-09
To this
ID DATE
1 [2012-03-04, 2013-04-15, 2019-01-09]
2 [2013-04-09, 2016-01-01, 2018-05-09]
Here is my attempt
df.sort_values(by=['ID', 'DATE'], ascending=True, inplace=True)
df = df[['ID', 'DATE']]
df_pivot = df.groupby('ID').aggregate(lambda tdf: tdf.unique().tolist())
df_pivot = pd.DataFrame(df_pivot.to_records())
The problem is it returns something like this
ID DATE
1 [1375228800000000000, 1411948800000000000, 1484524800000000000]
2 [1524528000000000000, 1529539200000000000, 1529542200000000000]
What kind of date format is this? I can't seem to find the right function to convert it back to the typical date format.
If need unique values in lists use DataFrame.drop_duplicates before aggregate lists:
df = (df.sort_values(by=['ID', 'DATE'], ascending=True)
.drop_duplicates(['ID', 'DATE'])
.groupby('ID')['DATE']
.agg(list))
In your solution should working, but it is slow:
df_pivot = df.groupby('ID')['DATE'].aggregate(lambda tdf: tdf.drop_duplicates().tolist())
What kind of date format is this?
If is native datetimes, called also unix datetime in nanoseconds.
Many ways... agg preferred because apply can be very slow
df.groupby('ID')['DATE'].agg(list)
Or
df.groupby('ID')['DATE'].apply(lambda x: x.to_list())
Simply use groupby() and apply() method:
result=df.groupby('ID')['DATE'].apply(list)
OR
result=df.groupby('ID')['DATE'].agg(list)
Now If you print result you will get your desired output:
ID
1 [ 2012-03-04, 2013-04-15, 2019-01-09]
2 [ 2013-04-09, 2016-01-01, 2018-05-09]
Name: DATE, dtype: object
The above code is giving you Series,If you want Dataframe Then use:
result=df.groupby('ID')['DATE'].apply(list).reset_index()
I'm trying to work on this data set drinks by country and find out the mean of beer servings of each country in each continent sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, paraguay 213
and like that for the other continents (Don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy like below:
ddf = pd.read_csv(drinks.csv)
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
but how to do the sorting??
Thanks a lot.
In this case you can just sort values by 'continent' and 'beer_servings' without applying .mean():
ddf = pd.read_csv('drinks.csv')
#sorting by continent and beer_servings columns
ddf = ddf.sort_values(by=['continent','beer_servings'], ascending=True)
#making the dataframe with only needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
#exporting to csv
ddf.to_csv("drinks1.csv")
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
I want to have a dataframe with repeated values with the same id number. But i want to split the repeated rows into colunms.
data = [[10450015,4.4],[16690019 4.1],[16690019,4.0],[16510069 3.7]]
df = pd.DataFrame(data, columns = ['id', 'k'])
print(df)
The resulting dataframe would have n_k (n= repated values of id rows). The repeated id gets a individual colunm and when it does not have repeated id, it gets a 0 in the new colunm.
data_merged = {'id':[10450015,16690019,16510069], '1_k':[4.4,4.1,3.7], '2_k'[0,4.0,0]}
print(data_merged)
Try assiging the column idx ref, using DataFrame.assign and groupby.cumcount then DataFrame.pivot_table. Finally use a list comprehension to sort column names:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
.pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
1_k 2_k
id
10450015 4.4 0
16510069 3.7 0
16690019 4.1 4