Suppose I have a dataframe containing countries and cities that looks like this:
data = {'Parent':['Netherlands','Belgium','Germany','France'],'Child':['Amsterdam','Brussels','Berlin', '']}
I want to create a tree dictionary depicting which city belongs to which country.
In this example the country France has no cities; I don't want the empty values to appear as child nodes in the dictionary.
Can someone point me in the right direction on how to achieve this?
Kind regards
Using pandas:
df = pd.DataFrame(data)
df.transpose()
0 1 2 3
Parent Netherlands Belgium Germany France
Child Amsterdam Brussels Berlin
Or using zip:
dict(zip(*data.values()))
{'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin', 'France': ''}
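If you want the empty France entry dropped as well, a dict comprehension over the same pairs does it; a minimal sketch using the data dict from the question:
# build the tree dict, keeping only parents with a non-empty child
tree = {parent: child
        for parent, child in zip(data['Parent'], data['Child'])
        if child}
print(tree)
# {'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin'}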
I have an excel file as below (sample data):
Business Name Business Address Suburb Zip Questions Label Answers
ABC address1 Truga 3000 Employee numbers 5
ABC address1 Truga 3000 Manager name Alan
ABC address1 Truga 3000 Store Type Retail
ABC address1 Truga 3000 Store Location CBD
ABC address1 Truga 3000 Compliant Yes
I would like to have the "Questions Label" data converted into columns for the same business name (ABC), with the corresponding answers underneath, like this:
Business Name Business Address Suburb Zip Employee numbers Manager name Store Type Store Location Compliant
ABC address1 Truga 3000 5 Alan Retail CBD Yes
I tried "transpose" within Excel; however, I am not getting the desired output.
Any help would be appreciated !!!
Thanks in advance
import pandas as pd

x = pd.read_csv('test.csv')  # reading it as csv for now
columns = x['Questions Label'].tolist()  # question labels become the column names
answers = x['Answers'].tolist()          # answers become the single row of values

# map them into a dictionary: {label: [answer]}
dictionary = {}
for i in range(len(columns)):
    dictionary[columns[i]] = [answers[i]]

# make a dataframe out of the dictionary
df = pd.DataFrame.from_dict(dictionary)
print(df)
Now you just have to merge this dataframe with your original dataframe, minus the 'Questions Label' and 'Answers' columns.
Edit 1:
df = pd.pivot_table(x, values="Answers",columns=["Questions Label"],aggfunc = 'first').reset_index()
This is another way to get what you want.
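If you also want to keep the business columns in the same step, here's a sketch along the same lines, assuming the column names from the sample data above:
import pandas as pd

x = pd.read_csv('test.csv')
# spread the question labels into columns, keeping the business columns as the index
df = x.pivot_table(index=['Business Name', 'Business Address', 'Suburb', 'Zip'],
                   columns='Questions Label',
                   values='Answers',
                   aggfunc='first').reset_index()
df.columns.name = None  # drop the leftover 'Questions Label' axis name
print(df)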
I know I can drop NaN rows from a DataFrame with df.dropna(). But what if I want to move those NaN rows to a new DataFrame?
The DataFrame looks like:
FNAME, LNAME, ADDRESS, latitude, longitude, altitude
BOB, JONES, 555 Seaseme Street, 38.00,-91.00,0.0
JOHN, GREEN, 111 Maple Street, 34.00,-75.00,0.0
TOM, SMITH, 100 A Street, 20.00,-80.00,0.0
BETTY, CROCKER, 5 Elm Street, NaN,NaN,NaN
I know I can group and move to a new DataFrame like this
grouped = df.groupby(df.FNAME)
df1 = grouped.get_group("BOB")
and it will give me a new DataFrame with FNAME of BOB but when I try
grouped = df.groupby(df.altitude)
df1 = grouped.get_group("NaN")
I get a KeyError: 'NaN'. So how can I group by NaN or null values?
Assuming you're happy for all NaN values in a column to be grouped together, you can use DataFrame.fillna() to convert the NaN into something that can be grouped:
df.fillna(value={'altitude': 'null_altitudes'})
This fills every null in the altitude column with the string 'null_altitudes'. If you do a groupby now, all 'null_altitudes' rows will be grouped together. You can fill multiple columns at once using multiple key-value pairs: values = {'col_1': 'val_1', 'col_2': 'val_2', ...}
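With the sample data above, a minimal sketch of the fill-then-group approach:
# fill the null altitudes, then group and pull out that group
filled = df.fillna(value={'altitude': 'null_altitudes'})
grouped = filled.groupby('altitude')
df1 = grouped.get_group('null_altitudes')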
You can use isna with any on rows:
# to get rows with NA in a new df
df1 = df[df.isna().any(axis=1)]
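The complement of the same mask gives the remaining rows, so moving the NaN rows out is a two-line split; a minimal sketch:
# split the frame into NaN rows and complete rows
mask = df.isna().any(axis=1)
df_nan = df[mask]     # rows with at least one NaN (BETTY CROCKER here)
df_rest = df[~mask]   # everything else, equivalent to df.dropna() here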
I'm trying to work with this drinks-by-country data set and find the mean beer servings of each country in each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and similarly for the other continents (I don't want to mix countries of different continents!).
Creating the grouped data without the sorting is quite easy, like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
but how to do the sorting??
Thanks a lot.
In this case, since each country appears only once, you can just sort the values by 'continent' and 'beer_servings' without applying .mean() (use ascending=[True, False] instead if you want beer servings from highest to lowest within each continent):
ddf = pd.read_csv('drinks.csv')
# sort by continent, then by beer_servings within each continent
ddf = ddf.sort_values(by=['continent', 'beer_servings'], ascending=True)
# keep only the needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
# export to csv
ddf.to_csv('drinks1.csv')
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
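If you do want the grouped mean itself sorted from highest to lowest within each continent, a sketch along the same lines, assuming the same 'drinks.csv':
import pandas as pd

ddf = pd.read_csv('drinks.csv')
# mean beer servings per (continent, country), sorted descending within each continent
result = (ddf.groupby(['continent', 'country'])['beer_servings']
             .mean()
             .reset_index()
             .sort_values(['continent', 'beer_servings'], ascending=[True, False]))
print(result)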
I have a pandas dataframe which looks like the following.
df1 = pd.DataFrame({'City': ['London', 'Basel', 'Kairo'], 'Country': ['UK', 'Switzerland', 'Egypt']})
City Country
0 London UK
1 Basel Switzerland
2 Kairo Egypt
Now I want to assign a new column called 'Continent' that should be filled based on the Country values.
So df2 should look like:
City Country Continent
0 London UK Europe
1 Basel Switzerland Europe
2 Kairo Egypt Africa
So df1 indicates the values for the new column in df2. What is the easiest way to do this?
With a dictionary mapping country to continent, you can pass it to the .replace() method, like this:
df['Continent'] = df['Country'].replace(country_dict)
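A minimal end-to-end sketch; the country_dict here is a hypothetical mapping you would extend with the countries you need:
import pandas as pd

df = pd.DataFrame({'City': ['London', 'Basel', 'Kairo'],
                   'Country': ['UK', 'Switzerland', 'Egypt']})

# hypothetical country -> continent mapping
country_dict = {'UK': 'Europe', 'Switzerland': 'Europe', 'Egypt': 'Africa'}

df['Continent'] = df['Country'].replace(country_dict)
print(df)
Note that .replace() leaves countries missing from the dict unchanged, whereas df['Country'].map(country_dict) would give NaN for them.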
I have two files, customer and sales, like below.
Customer:
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both the files are \t delimited.
I want to join the two files on cu_id from customer and sa_id from sales, using PySpark, without using Spark SQL/DataFrames.
Your help is very much appreciated.
You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
customerRDD = sc.textFile("customers.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
salesRDD = sc.textFile("sales.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD containing only the joined records, i.e. the keys that appear in both the customer and sales files.
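One thing to watch: both files start with a header row, which the maps above would also turn into key/value pairs. A minimal sketch that filters the headers out first, assuming an existing SparkContext sc:
# key each line by its first tab-separated field
def key_by_first_field(line):
    fields = line.split('\t')
    return (fields[0], fields[1:])

customer_raw = sc.textFile("customers.tsv")
customer_header = customer_raw.first()
customerRDD = customer_raw.filter(lambda line: line != customer_header).map(key_by_first_field)

sales_raw = sc.textFile("sales.tsv")
sales_header = sales_raw.first()
salesRDD = sales_raw.filter(lambda line: line != sales_header).map(key_by_first_field)

# inner join on the key: only cu_ids 2 and 3 appear in both files
joinedRDD = customerRDD.join(salesRDD)
print(joinedRDD.collect())
# e.g. [('2', (['Raghu', 'SE', 'HYD', 'TS'], ['100000', 'IND'])),
#       ('3', (['Rohith', 'ME', 'BNLR', 'KA'], ['230000', 'USA']))]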