Flag repeating entries in pandas time series - python-3.x

I have a data frame that takes this form (but is several millions of rows long):
import pandas as pd
data = {'id': ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
        'year': ["2000", "2001", "2002", "2001", "2002", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
        'vacation': ["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
        'new': [1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
A 2000 France
A 2001 Morocco
A 2002 Morocco
B 2001 Germany
B 2002 Germany
B 2003 Germany
C 1999 Japan
C 2000 Australia
C 2001 Japan
D 2000 Canada
D 2000 Mexico
D 2001 China
For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year.
I would like to flag the rows where a participant goes on holiday to a destination to which they had not gone the year before (i.e., the destination is new). For the data above, the output would be:
id year vacation new
A 2000 France 1
A 2001 Morocco 1
A 2002 Morocco 0
B 2001 Germany 1
B 2002 Germany 0
B 2003 Germany 0
C 1999 Japan 1
C 2000 Australia 1
C 2001 Japan 1
D 2000 Canada 1
D 2000 Mexico 1
D 2001 China 1
For A, B, C, and D, the first holiday destination in the data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany three years in a row, the 2nd and 3rd occurrences are not flagged. When C goes to Japan twice, both occurrences are flagged, because C did not go to Japan two years in a row. D goes to 3 different destinations (albeit 2 of them in 2000), and all of them are flagged.
I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.
I'd appreciate any input; thanks.

IIUC,
we group by id & vacation and check that the year is not equal to the year above, or equivalently select the first instance of that combination.
Hopefully that's clear; let me know if you need any more help.
df["new_2"] = (
df.groupby(["id", "vacation"])["id", "year"]
.apply(lambda x: x.ne(x.shift()))
.all(axis=1)
.add(0)
)
print(df)
id year vacation new_2
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
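If the literal "not the year before" rule matters (so that C's Japan trips in 1999 and 2001 both count as new), here is a shift-based sketch on the question's data; treating year as an integer is an assumption, since the sample stores years as strings:
df = df.sort_values(["id", "vacation", "year"])
df["year"] = df["year"].astype(int)  # years are strings in the sample data
prev = df.groupby(["id", "vacation"])["year"].shift()  # previous visit to the same place
df["new_2"] = df["year"].sub(prev).ne(1).astype(int)  # first visit (NaN gap) is also new
A gap of exactly one year marks a repeat; any other gap, including no previous visit at all, is flagged as new.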

Here's one solution I came up with, using groupby and transform:
df = df.sort_values(["id", "vacation", "year"])
df["new"] = (
    df.groupby(["id", "vacation"])
    .transform(lambda x: x.iloc[0])
    .year.eq(df.year)
    .astype(int)
)
You'll get
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1

Here is a way using groupby + cumcount and Series.mask:
df['new'] = df.groupby(['id', 'vacation']).cumcount().add(1).mask(lambda x: x.gt(1), 0)
print(df)
id year vacation new
0 A 2000 France 1
1 A 2001 USA 1
2 A 2002 France 0
3 B 2001 Germany 1
4 B 2002 Germany 0
5 B 2003 Germany 0
6 C 1999 Japan 1
7 C 2000 Australia 1
8 C 2001 France 1
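The transform and cumcount approaches both boil down to flagging the first occurrence of each (id, vacation) pair, which duplicated expresses directly; a one-line sketch under the same reading of the problem (any repeat is not new, regardless of the gap):
df["new"] = (~df.duplicated(subset=["id", "vacation"])).astype(int)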

Related

How to merge data with duplicates using pandas python

I have two dataframes below, and I'd like to merge them to get Id onto df1. However, using merge I cannot get the Id when a name appears more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code is below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NaN
5 R Africa NaN
6 S NA NaN
How do I find all the id for these duplicate names?
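A minimal sketch of the usual fix, assuming df2 really has one row per Name: a plain left merge on Name broadcasts each Id to every duplicate row in df1, so the cumcount keys are not needed here.
import pandas as pd

df1 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S', 'R', 'R', 'S'],
                    'Region': ['Asia', 'Eur', 'Africa', 'NA', 'Africa', 'Africa', 'NA']})
df2 = pd.DataFrame({'Name': ['P', 'Q', 'R', 'S'],
                    'Id': [1234, 1244, 1233, 1111]})

# one row per Name in df2, so the merge fans each Id out
# to all matching rows in df1, duplicates included
xy = df1.merge(df2, on='Name', how='left')
print(xy)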

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics,
what is the percentage of medals won by each one of them?
I have combined all the Excel files into one using a pandas DataFrame, but am now stuck finding the percentages.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried till now:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)

final_Data = {}
for i in data['Country']:
    t1 = data[data.Country == i].Total.tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})

t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to put it in a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print(df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
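For the pie chart mentioned at the end of the question, a minimal matplotlib sketch built on the Total_percent column computed above:
import matplotlib.pyplot as plt

# one slice per country, labeled with its percentage share
df['Total_percent'].plot.pie(autopct='%1.1f%%')
plt.ylabel('')  # suppress the default series-name label
plt.title('Share of all medals won')
plt.show()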
I don't have the exact dataset you have, so I'm explaining with a similar one. Try adding a column with the sum of medals across each row, then find the percentage by dividing each row's sum by the sum of the entire column.
I am posting this as a model; check it:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'ExshowroomPrice': [21000, 26000, 28000, 34000],
        'RTOPrice': [2200, 250, 2700, 3500]}
df = pd.DataFrame(cars, columns=['Brand', 'ExshowroomPrice', 'RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.

Update multiple columns from another dataframe based on one common column in Pandas

Given the following two dataframes:
df1:
id city district year price
0 1 bjs cyq 2018 12
1 2 bjs cyq 2019 6
2 3 sh hp 2018 4
3 4 shs hpq 2019 3
df2:
id city district year
0 1 bj cy 2018
1 2 bj cy 2019
2 4 sh hp 2019
Let's say some of the city and district values in df1 have errors, so I need to update the city and district values in df1 with those of df2, matching on id. My expected result is:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3
How could I do that in Pandas? Thanks.
Update:
Solution 1:
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['id'].map(cities)
df1['district'] = df1['id'].map(district)
Solution 2:
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
print(df1)
Out:
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 NaN NaN 2018 4
3 4 sh hp 2019 3
Note that city and district for id 3 become NaN, but I want to keep the values from df1 there.
Try combine_first:
df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
Output:
id city district price year
0 1 bj cy 12.0 2018.0
1 2 bj cy 6.0 2019.0
2 3 sh hp 4.0 2018.0
3 4 sh hp 3.0 2019.0
Try this
df1[["city","district"]] = pd.merge(df1,df2,on=["id"],how="left")[["city_y","district_y"]]
IIUC, we can use .map
edit - input changed.
import numpy as np

target_cols = ['city', 'district']
df1.loc[df1['id'].isin(df2['id']), target_cols] = np.nan
cities = df2.set_index('id')['city']
district = df2.set_index('id')['district']
df1['city'] = df1['city'].fillna(df1['id'].map(cities))
df1['district'] = df1['district'].fillna(df1['id'].map(district))
print(df1)
id city district year price
0 1 bj cy 2018 12
1 2 bj cy 2019 6
2 3 sh hp 2018 4
3 4 sh hp 2019 3
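Another option worth sketching is DataFrame.update, which overwrites df1 in place only where df2 supplies a value, so id 3 keeps its original city and district:
df1 = df1.set_index('id')
df1.update(df2.set_index('id')[['city', 'district']])  # rows absent from df2 stay untouched
df1 = df1.reset_index()
print(df1)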

How can I sum columns in a DataFrame for each date in time series data

Here's the example
import pandas as pd
df = pd.DataFrame({'Country': ['United States', 'China', 'Italy', 'spain'],
                   '2020-01-01': [0, 2, 1, 0],
                   '2020-01-02': [1, 0, 1, 2],
                   '2020-01-03': [0, 3, 2, 0]})
df
I want to sum the values of the columns by date so that each next column holds the accumulated value, which means 2020-01-02 becomes the added value of (2020-01-01 + 2020-01-02), and so on.
Convert the Country column to the index with DataFrame.set_index and use DataFrame.cumsum across rows with axis=1:
df = df.set_index('Country').cumsum(axis=1)
print (df)
2020-01-01 2020-01-02 2020-01-03
Country
United States 0 1 1
China 2 2 5
Italy 1 2 4
spain 0 2 2
Or select all columns without first by DataFrame.iloc before cumsum:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
print (df)
Country 2020-01-01 2020-01-02 2020-01-03
0 United States 0 1 1
1 China 2 2 5
2 Italy 1 2 4
3 spain 0 2 2

How to calculate common elements in a dataframe depending on another column

I have a dataframe like this.
sport Country(s)
Foot_ball brazil
Foot_ball UK
Volleyball UK
Volleyball South_Africa
Volleyball brazil
Rugger UK
Rugger South_africa
Rugger Australia
Carrom UK
Carrom Australia
Chess UK
Chess Australia
I want to calculate the number of sports shared by two countries. For example:
Foot_ball and Volleyball are common to brazil and UK, so the number of common sports played by brazil and UK is 2.
Carrom, Chess and Rugger are common to Australia and UK, so the number of sports shared by Australia and UK is 3.
Like this, is there any way I can get a count across the whole dataframe for
brazil, South_Africa
brazil, Australia
South_Africa, UK
etc.?
Can anybody suggest how to do this in pandas or any other way?
With the sample data you provided, you can generate the desired output with the code below:
import pandas as pd

df = pd.DataFrame(
    [["Foot_ball", "brazil"],
     ["Foot_ball", "UK"],
     ["Volleyball", "UK"],
     ["Volleyball", "South_Africa"],
     ["Volleyball", "brazil"],
     ["Rugger", "UK"],
     ["Rugger", "South_Africa"],
     ["Rugger", "Australia"],
     ["Carrom", "UK"],
     ["Carrom", "Australia"],
     ["Chess", "UK"],
     ["Chess", "Australia"]],
    columns=["sport", "Country"])

# Function to get the number of sports in common
def countCommonSports(row):
    sports1 = df["sport"][df["Country"] == row["Country 1"]]
    sports2 = df["sport"][df["Country"] == row["Country 2"]]
    return len(set(sports1).intersection(sports2))

# Generate the combinations of countries from the original DataFrame
from itertools import combinations
comb = combinations(df["Country"].unique(), 2)
out = pd.DataFrame(list(comb), columns=["Country 1", "Country 2"])

# Find the sports in common between countries
out["common Sports count"] = out.apply(countCommonSports, axis=1)
output is then:
>>> out
Country 1 Country 2 common Sports count
0 brazil UK 2
1 brazil South_Africa 1
2 brazil Australia 0
3 UK South_Africa 2
4 UK Australia 3
5 South_Africa Australia 1
pd.factorize and itertools.combinations
import pandas as pd
import numpy as np
from itertools import combinations, product
# Fix Capitalization
df['Country(s)'] = ['_'.join(map(str.title, x.split('_'))) for x in df['Country(s)']]
c0, c1 = zip(*[(a, b)
               for s, c in df.groupby('sport')['Country(s)']
               for a, b in combinations(c, 2)])
i, r = pd.factorize(c0)
j, c = pd.factorize(c1)
n, m = len(r), len(c)
o = np.zeros((n, m), np.int64)
np.add.at(o, (i, j), 1)
result = pd.DataFrame(o, r, c)
result
Australia Uk South_Africa Brazil
Uk 3 0 2 1
Brazil 0 1 0 0
South_Africa 1 0 0 1
Make symmetrical
result = result.align(result.T, fill_value=0)[0]
result
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
pd.crosstab
This will be slower... almost certainly.
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in combinations(c, 2)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1).pipe(
    lambda d: d.align(d.T, fill_value=0)[0]
)
Australia Brazil South_Africa Uk
Australia 0 0 0 0
Brazil 0 0 0 1
South_Africa 1 1 0 0
Uk 3 1 2 0
Or including all sports within a single country
c0, c1 = map(pd.Series, zip(*[(a, b)
                              for s, c in df.groupby('sport')['Country(s)']
                              for a, b in product(c, c)]))

pd.crosstab(c0, c1).rename_axis(None).rename_axis(None, axis=1)
Australia Brazil South_Africa Uk
Australia 3 0 1 3
Brazil 0 2 1 2
South_Africa 1 1 2 2
Uk 3 2 2 5
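A compact alternative, sketched as a self-merge on sport: every pair of countries that share a sport becomes one row, and the group sizes give the shared-sport counts (the < filter keeps each unordered pair once and drops self-pairs):
pairs = df.merge(df, on='sport')  # suffixes _x/_y land on the overlapping country column
pairs = pairs[pairs['Country(s)_x'] < pairs['Country(s)_y']]
print(pairs.groupby(['Country(s)_x', 'Country(s)_y']).size())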
