Sorting a multi-column grouped DataFrame - python-3.x

I'm working with the drinks-by-country data set and want to find the mean beer servings of each country within each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and like that for the other continents (Don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
but how to do the sorting??
Thanks a lot.

In this case you can just sort values by 'continent' and 'beer_servings' without applying .mean():
ddf = pd.read_csv('drinks.csv')
#sorting by continent and beer_servings columns
ddf = ddf.sort_values(by=['continent','beer_servings'], ascending=True)
#making the dataframe with only needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
#exporting to csv
ddf.to_csv("drinks1.csv", index=False)
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
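The question asks for highest to lowest within each continent, whereas the fragment above is in ascending order. A minimal sketch of the descending variant follows, assuming a small hand-built frame in place of drinks.csv (the values are taken from the figures quoted in this thread); `ranked` is a hypothetical name. Passing a list to `ascending` gives each sort key its own direction.

```python
import pandas as pd

# Toy stand-in for drinks.csv (assumption: one row per country)
ddf = pd.DataFrame({
    "continent": ["South America", "South America", "South America", "Africa", "Africa"],
    "country": ["Brazil", "Venezuela", "Paraguay", "Namibia", "Gabon"],
    "beer_servings": [245, 333, 213, 376, 347],
})

ranked = (
    ddf.groupby(["continent", "country"])["beer_servings"].mean()
       .reset_index()
       # continents A-Z, beer servings high-to-low inside each continent
       .sort_values(["continent", "beer_servings"], ascending=[True, False])
       .reset_index(drop=True)
)
print(ranked)
```

The groupby/mean step only matters if a country can appear on more than one row; with one row per country, the `sort_values` call alone is enough.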

Related

Getting data in database format from excel file [Python]

I have a table in Excel that has this structure:
Country | County | 20/01/2020 | 21/01/2020
Country | County | Value 1 | Value 2
I would like to be able to convert the table into the following format so that I could add it in a table in my database.
Country | County | Date | Values
Country | County | 20/01/2020 | Value 1
Country | County | 20/01/2020 | Value 2
Is there a quick way to do this or should I iterate over each row and create a dataframe from there? The excel file has millions of entries.
pd.melt is what you're looking for:
pd.melt(df, id_vars=['Country', 'County'], var_name='Date', value_name='Values')
Supply the dataframe as the first argument; id_vars tells the function which columns to keep on each row. The rest of the columns are turned into values of a new column in the melted dataframe, and var_name sets that column's name. value_name names the column that holds the cell values (it defaults to 'value').
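As a rough illustration of the wide-to-long reshape, here is a small sketch on made-up data. The frame contents and the name `long_df` are assumptions; the column names follow the question. `value_name` is optional and only renames the default 'value' column.

```python
import pandas as pd

# Hypothetical wide table: one column per date
df = pd.DataFrame({
    "Country": ["A", "B"],
    "County": ["a1", "b1"],
    "20/01/2020": [1, 3],
    "21/01/2020": [2, 4],
})

# Each date column becomes rows: one (Country, County, Date, Values) row per cell
long_df = pd.melt(
    df,
    id_vars=["Country", "County"],
    var_name="Date",
    value_name="Values",
)
print(long_df)
```

melt is vectorised, so it should scale far better than building the frame row by row.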

Python Pandas Move Na or Null values to a new dataframe

I know I can drop NaN rows from a DataFrame with df.dropna(). But what if I want to move those NaN rows to a new DataFrame?
Dataframe looks like
FNAME, LNAME, ADDRESS, latitude, logitude, altitude
BOB, JONES, 555 Seaseme Street, 38.00,-91.00,0.0
JOHN, GREEN, 111 Maple Street, 34.00,-75.00,0.0
TOM, SMITH, 100 A Street, 20.00,-80.00,0.0
BETTY, CROCKER, 5 Elm Street, NaN,NaN,NaN
I know I can group and move to a new DataFrame like this
grouped = df.groupby(df.FNAME)
df1 = grouped.get_group("BOB")
and it will give me a new DataFrame with FNAME of BOB but when I try
grouped = df.groupby(df.altitude)
df1 = grouped.get_group("NaN")
I get a KeyError: 'NaN'. So how can I group by NaN or null values?
Assuming you're happy for all NaN values in a column to be grouped together, you can use DataFrame.fillna() to convert the NaN into something else that can be grouped.
df = df.fillna(value={'altitude': 'null_altitudes'})
This fills every null in the altitude column with the string 'null_altitudes'. If you do a groupby now, all 'null_altitudes' rows will be together. You can fill multiple columns at once using multiple key-value pairs: values = {'col_1': 'val_1', 'col_2': 'val_2'}, etc.
You can use isna with any on rows:
# to get rows with NA in a new df
df1 = df[df.isna().any(axis=1)]
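To actually move the rows rather than just copy them, the same boolean mask can be inverted to get the remainder. A minimal sketch on a cut-down version of the question's data follows; `has_na`, `df_missing` and `df_complete` are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FNAME": ["BOB", "JOHN", "TOM", "BETTY"],
    "LNAME": ["JONES", "GREEN", "SMITH", "CROCKER"],
    "latitude": [38.0, 34.0, 20.0, np.nan],
    "altitude": [0.0, 0.0, 0.0, np.nan],
})

# True for every row that has at least one missing value
has_na = df.isna().any(axis=1)

df_missing = df[has_na].copy()    # rows with NaN go to the new frame
df_complete = df[~has_na].copy()  # same result as df.dropna()
print(df_missing)
```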

ValueError when filtering and adding a new column at once

I'm getting the error code:
ValueError: Wrong number of items passed 3, placement implies 1.
What I want to do is import a dataset, count the duplicated values, drop the duplicates, and add a column that says how many duplicates there were of each value.
This is to try and sort a dataset of 13 000 rows and 45 columns.
I've tried different solutions found online, but none of them seem to help. I'm pretty new to programming and all help is really appreciated.
import pandas as pd
# Making file ready
data = pd.read_excel(r'Some file.xlsx', header = 0)
data.rename(columns={'Dato': 'Last ordered', 'ArtNr': 'Item No:'}, inplace = True)
#Formatting dates
pd.to_datetime(data['Last ordered'], format = '%Y-%m-%d %H:%M:%S')
#Creates new table content and order
df = data[['Item No:','Last ordered', 'Description']]
df['Last ordered'] = df['Last ordered'].dt.strftime('%Y-/%m-/%d')
df = df.sort_values('Last ordered', ascending = False)
#Adds total sold quantity column
df['Quantity'] = df.groupby('Item No:').transform('count')
df2 = df.drop_duplicates('Item No:').reset_index(drop=True)
#Prints to environment and creates new excel file
print(df2)
df2.to_excel(r'New Sorted File.xlsx')
I expect it to provide a new excel file with columns:
Item No | Last ordered | Description | Quantity
I also want to be able to add other columns from the original dataset later on if I need to.
The problem is at this line:
df['Quantity'] = df.groupby('Item No:').transform('count')
The right-hand side of the assignment is a DataFrame with several columns, and you are trying to fit it into a single column. You need to select only one of the columns. Something like
df['Quantity'] = df.groupby('Item No:').transform('count')['Description']
should work.
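An equivalent way to get a single Series is to select one column before calling transform, so the result never becomes a DataFrame. A small sketch with made-up rows (the item numbers and descriptions are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "Item No:": [101, 101, 102, 101],
    "Description": ["bolt", "bolt", "nut", "bolt"],
})

# Selecting one column first makes transform return a Series,
# aligned row-for-row with df
df["Quantity"] = df.groupby("Item No:")["Item No:"].transform("count")

# Keep the first row of each item; Quantity already holds the group size
df2 = df.drop_duplicates("Item No:").reset_index(drop=True)
print(df2)
```

Note that 'count' ignores NaN in the selected column, which is why counting the grouping column itself is a safe choice here.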

How to read particular row from groupby dataframe?

I am preprocessing my car sales data, where the price of many used cars is 0. I want to replace each 0 with the mean price of similar cars.
Here, I have found the mean values for each car with the groupby function:
df2= df1.groupby(['car','body','year','engV'])['price'].mean()
This is my dataframe extracted from the actual data, containing the rows where the price is zero:
rep_price=df[df['price']==0]
I want to assign mean price value from df2['Land Rover'] to rep_price['Land Rover'] which is 0
Since I'm not sure which columns those dataframes have, I'll take a guess to give you a head start. Let's say 'Land Rover' is a value in a column called 'Car_Type' in df1, and you grouped your data like this:
df2= df1.groupby(['Car_Type'])['price'].mean()
In that case something like this should cover your need:
df1.loc[df1['Car_Type']=='Land Rover','price'] = df2['Land Rover']
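To fill every group at once instead of one car type at a time, a transform-based variant may help. This is only a sketch under assumptions: the toy frame and its 'car' column are made up, and unlike the snippet above it treats 0 as missing before averaging, so the zeros do not pull the group mean down.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "car": ["Land Rover", "Land Rover", "Land Rover", "Audi", "Audi"],
    "price": [30000.0, 50000.0, 0.0, 20000.0, 0.0],
})

# Treat 0 as "unknown" so it is excluded from the mean
prices = df["price"].replace(0, np.nan)

# Per-row mean of the non-missing prices in the same car group
group_mean = prices.groupby(df["car"]).transform("mean")

# Keep known prices, fill unknown ones with their group's mean
df["price"] = prices.fillna(group_mean)
print(df)
```

Grouping by several keys, as in the question, would only change the groupby argument, for example a list of key Series.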

How to calculate mean with hierarchical index in pandas

I have a pandas dataframe with 1 million rows and hierarchical indexes (country, state, city, in this order), where each row is a price observation of a product. How can I calculate the mean and standard deviation (std) for each country, state and city, avoiding loops since my df is big?
For each level of mean and std, I want to save the values in new columns in this dataframe for future access.
Use groupby with the level argument to group your data, then use mean and std. If you want the mean as a new column in your existing dataframe, use transform, which returns a Series with the same index as your df:
grouped = df.groupby(level = ['Country','State', 'City'])
df['Mean'] = grouped['price_observation'].transform('mean')
df['Std'] = grouped['price_observation'].transform('std')
If you want to read more on grouping, you can read the pandas documentation
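Since the question asks for statistics at each level of the hierarchy, one possible sketch is to repeat the groupby with progressively more index levels. The toy index values, the 'price' column and the generated column names (mean_country, std_city, and so on) are all assumptions for illustration. The short loop runs over the three level names, not over the rows.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("US", "CA", "LA"), ("US", "CA", "LA"), ("US", "CA", "SF"), ("US", "NY", "NYC")],
    names=["country", "state", "city"],
)
df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]}, index=idx)

levels = ["country", "state", "city"]
for depth in range(1, len(levels) + 1):
    keys = levels[:depth]  # ['country'], then ['country', 'state'], ...
    grouped = df.groupby(level=keys)["price"]
    # transform broadcasts the group statistic back onto every row
    df[f"mean_{keys[-1]}"] = grouped.transform("mean")
    df[f"std_{keys[-1]}"] = grouped.transform("std")

print(df)
```

Groups with a single observation get NaN for std, because the sample standard deviation is undefined for one value.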
