How to create and populate new column in 2nd Pandas dataframe based on condition in 1st df? - python-3.x

I have a pandas dataframe which looks like the following.
df1 = pd.DataFrame({'City': ['London', 'Basel', 'Kairo'],
                    'Country': ['UK', 'Switzerland', 'Egypt']})
     City      Country
0  London           UK
1   Basel  Switzerland
2   Kairo        Egypt
Now I want to assign a new column called 'Continent' that should be filled based on the values.
So df2 should look like:
     City      Country Continent
0  London           UK    Europe
1   Basel  Switzerland    Europe
2   Kairo        Egypt    Africa
So df1 indicates the values for the new column in df2. What is the easiest way to do this?

With a dictionary mapping country to continent, you can pass it to the Series.replace() method, like this:
df['Continent'] = df['Country'].replace(country_dict)
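For instance, a minimal runnable sketch, with the contents of country_dict inferred from the desired output above:

import pandas as pd

df = pd.DataFrame({'City': ['London', 'Basel', 'Kairo'],
                   'Country': ['UK', 'Switzerland', 'Egypt']})

# mapping inferred from the desired df2 above
country_dict = {'UK': 'Europe', 'Switzerland': 'Europe', 'Egypt': 'Africa'}

df['Continent'] = df['Country'].replace(country_dict)

df['Country'].map(country_dict) behaves the same here; the difference is that map turns countries missing from the dictionary into NaN, while replace leaves them unchanged.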

Related

Create a parent child tree dictionary based on two columns in dataframe

Suppose I have a dataframe containing countries and cities that looks like this:
data = {'Parent':['Netherlands','Belgium','Germany','France'],'Child':['Amsterdam','Brussels','Berlin', '']}
I want to create a tree dictionary depicting which city belongs to which country.
In this example the country France has no cities; I don't want the empty value to appear as a child node in the dictionary.
Can someone point me into the right direction on what to use to reach this solution?
Kind regards
Using pandas:
df = pd.DataFrame(data)
df.transpose()
                  0         1        2       3
Parent  Netherlands   Belgium  Germany  France
Child     Amsterdam  Brussels   Berlin
Or using zip:
dict(zip(*data.values()))
{'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin', 'France': ''}
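Neither snippet drops France's empty entry yet. A small sketch that skips empty children while building the dictionary (assuming missing cities are always empty strings, as in the sample data):

data = {'Parent': ['Netherlands', 'Belgium', 'Germany', 'France'],
        'Child': ['Amsterdam', 'Brussels', 'Berlin', '']}

# keep only pairs with a non-empty child
tree = {parent: child for parent, child in zip(data['Parent'], data['Child']) if child}
print(tree)
# {'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin'}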

Getting data in database format from excel file [Python]

I have a table in Excel that has this structure:
Country   County   20/01/2020   21/01/2020
Country   County   Value 1      Value 2
I would like to convert the table into the following format so that I can load it into a table in my database:
Country   County   Date         Values
Country   County   20/01/2020   Value 1
Country   County   21/01/2020   Value 2
Is there a quick way to do this, or should I iterate over each row and build a dataframe from there? The Excel file has millions of entries.
pd.melt is what you're looking for:
pd.melt(df, id_vars = ['Country','County'], var_name = 'Date')
Supply the dataframe as the first argument; id_vars tells the function which columns to keep as identifiers on each row. The remaining columns are unpivoted: their names become values in a new column, and var_name sets what that new column is called (the cell values land in a column named 'value' unless you pass value_name).
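A minimal sketch of that call on the structure from the question (the frame is built inline here; in practice it would come from pd.read_excel, and value_name is added so the value column matches the desired 'Values' header):

import pandas as pd

# stand-in for pd.read_excel(...), matching the structure in the question
df = pd.DataFrame({'Country': ['Country'], 'County': ['County'],
                   '20/01/2020': ['Value 1'], '21/01/2020': ['Value 2']})

long_df = pd.melt(df, id_vars=['Country', 'County'],
                  var_name='Date', value_name='Values')
print(long_df)
#    Country  County        Date   Values
# 0  Country  County  20/01/2020  Value 1
# 1  Country  County  21/01/2020  Value 2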

Aggregate a column on rows with condition on another column using groupby

Let's say I have the following PySpark dataframe:
Country Direction Quantity Price
Belgium In 5 10
Belgium Out 2 8
Belgium Out 3 9
France In 2 3
France Out 3 2
France Out 4 3
Is it possible to group this dataframe by the "Country" column, aggregating the average of "Price" as normal, but using the function "first" for "Quantity" only over rows where "Direction" is "Out"?
I imagine it should be something like this:
df.groupby("Country").agg(F.mean('Price'), F.first(F.col('Quantity').filter(F.col('Direction') == "Out")))
You can mask the Quantity for rows where Direction != 'Out' and take the first value with ignorenulls:
from pyspark.sql import functions as F

df.groupby("Country").agg(
    F.mean('Price'),
    F.first(
        # null out Quantity on non-"Out" rows, then take the first non-null
        F.when(F.col('Direction') == "Out", F.col('Quantity')),
        ignorenulls=True
    )
)
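On the sample data this returns, per country, the mean price and the first 'Out' quantity: Belgium gets 9.0 and 2, France gets about 2.67 and 3. One caveat: first() is non-deterministic unless the row order is guaranteed, so if it matters which 'Out' row counts as first, impose an explicit ordering (for example with a window function) rather than relying on the incoming order.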

Python Pandas Move Na or Null values to a new dataframe

I know I can drop NaN rows from a DataFrame with df.dropna(). But what if I want to move those NaN rows to a new DataFrame?
The DataFrame looks like:
FNAME, LNAME, ADDRESS, latitude, longitude, altitude
BOB, JONES, 555 Seaseme Street, 38.00,-91.00,0.0
JOHN, GREEN, 111 Maple Street, 34.00,-75.00,0.0
TOM, SMITH, 100 A Street, 20.00,-80.00,0.0
BETTY, CROCKER, 5 Elm Street, NaN,NaN,NaN
I know I can group and move to a new DataFrame like this
grouped = df.groupby(df.FNAME)
df1 = grouped.get_group("BOB")
and it will give me a new DataFrame with FNAME of BOB but when I try
grouped = df.groupby(df.altitude)
df1 = grouped.get_group("NaN")
I get a KeyError: 'NaN'. So how can I group by NaN or null values?
Assuming you're satisfied that all NaN values in a column should be grouped together, you can use DataFrame.fillna() to convert the NaN into something else to group on:
df = df.fillna(value={'altitude': 'null_altitudes'})
This fills every null in the altitude column with the string 'null_altitudes'. If you do a groupby now, all 'null_altitudes' rows land together. You can fill multiple columns at once with multiple key-value pairs: values = {'col_1': 'val_1', 'col_2': 'val_2', ...}
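Tying this back to the original get_group attempt, a minimal sketch (only the relevant columns from the question; the sentinel string is arbitrary):

import pandas as pd
import numpy as np

df = pd.DataFrame({'FNAME': ['BOB', 'JOHN', 'TOM', 'BETTY'],
                   'altitude': [0.0, 0.0, 0.0, np.nan]})

df = df.fillna(value={'altitude': 'null_altitudes'})
# sort=False avoids comparing the now-mixed float/string group keys
grouped = df.groupby('altitude', sort=False)
df1 = grouped.get_group('null_altitudes')  # the rows whose altitude was NaN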
You can use isna with any on rows:
# to get rows with NA in a new df
df1 = df[df.isna().any(axis=1)]
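The complement then gives the clean rows, so the original frame splits cleanly in two:

# rows with at least one NA
df1 = df[df.isna().any(axis=1)]
# rows with no NA at all
df2 = df[df.notna().all(axis=1)]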

sorting multi-column grouped by data frame

I'm trying to work with this dataset of drinks by country and find the mean beer servings of each country in each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and the same for the other continents (I don't want to mix countries from different continents!)
Creating the grouped data without the sorting is quite easy like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
But how do I do the sorting?
Thanks a lot.
In this case each country appears only once in the dataset, so the mean per country is just its own value, and you can sort by 'continent' and 'beer_servings' directly without applying .mean():
ddf = pd.read_csv('drinks.csv')
# sort by continent (ascending), then beer_servings from highest to lowest
ddf = ddf.sort_values(by=['continent', 'beer_servings'], ascending=[True, False])
# keep only the needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
# export to csv
ddf.to_csv('drinks1.csv', index=False)
Output fragment:
continent,country,beer_servings
...
Africa,Namibia,376
Africa,Gabon,347
Africa,South Africa,225
Africa,Angola,217
Africa,Botswana,173
...
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
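If you do want the groupby from the question, a sketch under the assumption that drinks.csv has one row per country (so the mean per country is just its value, but the pattern also works when countries repeat):

means = ddf.groupby(['continent', 'country'])['beer_servings'].mean().reset_index()
# continent ascending, beer servings from highest to lowest within each continent
result = means.sort_values(['continent', 'beer_servings'], ascending=[True, False])
print(result)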
