My dataset looks like the following. I am trying to subset my pandas DataFrame so that only the responses given by all 3 people are selected. For example, in the data frame below, the responses given by all 3 people are "I like to eat" and "You have nice day", so only those rows should be kept. I am not sure how to achieve this in pandas.
Note: I am new to Python, so please provide an explanation with your code.
DataFrame example:
import pandas as pd
data = {'Person':['1', '1','1','2','2','2','2','3','3'],'Response':['I like to eat','You have nice day','My name is ','I like to eat','You have nice day','My name is','This is it','I like to eat','You have nice day'],
}
df = pd.DataFrame(data)
print (df)
Output:
Person Response
0 1 I like to eat
1 1 You have nice day
2 1 My name is
3 2 I like to eat
4 2 You have nice day
5 2 My name is
6 2 This is it
7 3 I like to eat
8 3 You have nice day
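If I understand correctly, you can use transform with nunique: count the distinct people per response, then keep the rows where that count equals the total number of distinct people.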
yourdf=df[df.groupby('Response').Person.transform('nunique')==df.Person.nunique()]
yourdf
Out[463]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
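Since an explanation was requested, here is a minimal sketch that splits the one-liner above into named steps. The names n_people, per_response and mask are only illustrative and do not appear in the original answer.

```python
import pandas as pd

data = {'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
        'Response': ['I like to eat', 'You have nice day', 'My name is ',
                     'I like to eat', 'You have nice day', 'My name is',
                     'This is it', 'I like to eat', 'You have nice day']}
df = pd.DataFrame(data)

# Total number of distinct people in the whole frame
n_people = df['Person'].nunique()

# For every row, the number of distinct people who gave that row's response.
# transform returns a Series aligned with df, one value per original row.
per_response = df.groupby('Response')['Person'].transform('nunique')

# Keep only the rows whose response was given by everyone
mask = per_response == n_people
yourdf = df[mask]
print(yourdf)
```

Note that in the sample data Person 1's 'My name is ' has a trailing space, so pandas treats it as a different response from Person 2's 'My name is'. If such differences should be ignored, calling df['Response'].str.strip() first is one option.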
Method 2: use groupby with filter, keeping only the responses whose group contains every person
df.groupby('Response').filter(lambda x : pd.Series(df['Person'].unique()).isin(x['Person']).all())
Out[467]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
Related
I have a pandas DataFrame, shown below, where the number of list elements in column info equals the number of rows for each unique file in column id:
id text info
1 great ['son','daughter']
1 excellent ['son','daughter']
2 nice ['father','mother','brother']
2 good ['father','mother','brother']
2 bad ['father','mother','brother']
3 awesome nan
I want each list element to become a separate row value for each file, like this:
id text info
1 great son
1 excellent daughter
2 nice father
2 good mother
2 bad brother
3 awesome nan
Let us try explode after drop_duplicates
df['info'] = df['info'].drop_duplicates().explode().values
df
Out[298]:
id text info
0 1 great son
1 1 excellent daughter
2 2 nice father
3 2 good mother
4 2 bad brother
5 3 awesome NaN
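Because Python lists are not hashable, calling drop_duplicates directly on a column of lists can raise a TypeError in some setups. A minimal sketch of a variant that deduplicates on id instead is shown below. It assumes, as in the question, that the rows of each id are contiguous and that every list has as many elements as its id has rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3],
    'text': ['great', 'excellent', 'nice', 'good', 'bad', 'awesome'],
    'info': [['son', 'daughter'], ['son', 'daughter'],
             ['father', 'mother', 'brother'],
             ['father', 'mother', 'brother'],
             ['father', 'mother', 'brother'],
             np.nan],
})

# Keep one row per id (hashing only the id column, never the lists),
# then expand each list into one row per element; NaN stays NaN.
exploded = df.drop_duplicates('id')['info'].explode()

# Assign positionally; the lengths match under the stated assumption
df['info'] = exploded.to_numpy()
print(df)
```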
New to Python and pandas and trying to figure this out.
I'm dealing with a data set that's pretty messy. There are 500 rows and 9 columns. In a few instances, data that should be in column 9 has ended up in column 8, alongside the column 8 data.
... Col 8 Col 9
0 2 weeks No. 13
1 1 week No. 2
2 12 weeks, No 1
3 15 weeks No. 8
4 7 weeks, No. 1
How can I separate the data and move to the proper column?
I applied a split(), but don't know how to move it over.
I'm thinking I need to use the apply(), but not sure on how.
Any suggestions?
You can split() with expand=True, then fillna() to fill the missing values:
df[['Col 8', 'Col 9']] = df['Col 8'].str.split(',', expand=True).fillna({1: df['Col 9']})
# Col 8 Col 9
# 0 2 weeks No. 13
# 1 1 week No. 2
# 2 12 weeks No 1
# 3 15 weeks No. 8
# 4 7 weeks No. 1
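The same idea can be written out step by step. This is only a sketch: the sample values are reconstructed from the question, and the separator pattern (a comma followed by optional whitespace) is an assumption chosen so that the moved values carry no leading space.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col 8': ['2 weeks', '1 week', '12 weeks, No 1', '15 weeks', '7 weeks, No. 1'],
    'Col 9': ['No. 13', 'No. 2', np.nan, 'No. 8', np.nan],
})

# Split at the first comma (plus any following whitespace) into two columns.
# Rows without a comma get a missing value in the second column.
parts = df['Col 8'].str.split(r',\s*', n=1, expand=True)

df['Col 8'] = parts[0]
# Use the split-off piece where it exists, otherwise keep the old 'Col 9' value
df['Col 9'] = parts[1].fillna(df['Col 9'])
print(df)
```

If no row contains a comma, the split produces a single column and parts[1] does not exist, so that case would need its own check.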
I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here, possibly more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria are readability of the code first, then time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
import numpy as np
# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)
# Apply conc element-wise over the two arrays
conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
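Another simple way is to use add and radd. Convert both frames to strings, prepend the '_' separator to df2 with radd, then add the result to df1:
df1.astype(str).add(df2.astype(str).radd('_'))
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10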
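Since readability is the first criterion, the same concatenation can also be spelled with the plain + operator. This is a small sketch on the question's sample frames, using the '_' separator from the required output.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'A': [10, 20, 30, 40, 50], 'B': [50, 40, 30, 20, 10]})

# Cast both frames to strings, then concatenate element-wise.
# The frames are aligned on index and column labels before the operation.
required = df1.astype(str) + '_' + df2.astype(str)
print(required)
```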
I am trying to perform a window operation on the following pandas data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'visitor_id': ['a','a','a','a','a','a','b','b','b','b','c','c','c','c','c'],
'time_on_site' : [3,5,6,4,5,3,7,6,7,8,1,2,2,1,2],
'site_visit': [1,2,3,4,5,6,1,2,3,4,1,2,3,4,5],
'feature_visit' : [np.nan,np.nan,1,np.nan,2,3,1,2,3,4,np.nan,1,2,3,np.nan]
})
"For each distinct user, calculate the average time they spent on the website and their total number of visits before they interacted with a feature."
The data consists of four columns with the following definitions:
visitor_id is a string that identifies a unique given visitor
time_on_site is the time they spent on the website
site_visit is an incrementing counter of the times they visited the website.
feature_visit is an incrementing counter of the times they used a specific feature on the site. If a customer visited the site before they interacted with the feature, a NaN is produced. If they visited the site and did not interact with the new feature, a NaN is produced. For each time they visited the site and interacted with the feature, the counter is incremented by one.
visitor_id time_on_site site_visit feature_visit
a 3 1 NaN
a 5 2 NaN
a 6 3 1
a 4 4 NaN
a 5 5 2
a 3 6 3
b 7 1 1
b 6 2 2
b 7 3 3
b 8 4 4
c 1 1 NaN
c 2 2 1
c 2 3 2
c 1 4 3
c 2 5 NaN
The expected output should look like this:
id mean count
a 4 2
b NaN 0
c 1 1
Which was created based on the following logic:
For user a, the expected output is 4, which is the average time_on_site for site_visit 1 and 2, which occurred before the first feature interaction on site_visit 3.
For user b the average time should be NaN because they had no prior visits before their first interaction with the feature.
For user c, their average time is just 1, since they only had one visit before interacting with the new feature.
If a user never used the new feature, their mean and count should be NaN.
Thanks in advance for the help.
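Try this. For each visitor, find the first row with a feature interaction, then aggregate the rows before it (this assumes the rows of each visitor are in visit order):
def summarize(x):
    # Rows where the visitor interacted with the feature
    used = x[x['feature_visit'].notnull()]
    # Visitors who never used the feature get NaN for both values
    if used.empty:
        return pd.Series({'mean': float('nan'), 'count': float('nan')})
    # Index label of the first feature interaction
    index = used.index[0]
    return pd.Series({
        'mean': x[x.index < index]['time_on_site'].mean(),
        'count': x[x.index < index]['site_visit'].count()
    })
df.groupby('visitor_id').apply(summarize)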
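For larger data, a version without a Python-level function per group may be worth trying. The following is a sketch under the same assumption that rows are in visit order within each visitor. The names used, seen, before, visitors, result and never are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'visitor_id': ['a'] * 6 + ['b'] * 4 + ['c'] * 5,
    'time_on_site': [3, 5, 6, 4, 5, 3, 7, 6, 7, 8, 1, 2, 2, 1, 2],
    'site_visit': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    'feature_visit': [np.nan, np.nan, 1, np.nan, 2, 3, 1, 2, 3, 4,
                      np.nan, 1, 2, 3, np.nan],
})

used = df['feature_visit'].notna()
# 1 from the first feature interaction onwards, 0 before it, per visitor
seen = used.astype(int).groupby(df['visitor_id']).cummax()
before = df[seen == 0]

visitors = pd.Index(df['visitor_id'].unique(), name='visitor_id')
result = (before.groupby('visitor_id')['time_on_site']
                .agg(['mean', 'count'])
                .reindex(visitors))

# Visitors with no visit before their first interaction: count 0, mean NaN
result['count'] = result['count'].fillna(0)
# Visitors who never used the feature: NaN for both columns
never = ~used.groupby(df['visitor_id']).any().reindex(visitors)
result.loc[never, ['mean', 'count']] = np.nan
print(result)
```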
I have a dataframe as follow:
import pandas as pd
d = {'location1': [1, 2,3,8,6], 'location2':
[2,1,4,6,8]}
df = pd.DataFrame(data=d)
Each row of the dataframe df means there is a road between two locations. It looks like this:
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
The first row means there is a road between location 1 and location 2; however, the second row encodes the same information. The fourth and fifth rows also repeat each other. I am trying to remove these repeats by keeping only one row of each pair. Either row is okay.
For example, my expected output is
location1 location2
0 1 2
2 3 4
4 6 8
Is there an efficient way to do that? I have a large dataframe with lots of repeated rows.
Thanks a lot.
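In this sample, the rows you want happen to be every other row of the dataframe, so slicing with a step of 2 works here. Note that this relies on the row order of the sample; it does not check whether two rows actually describe the same road.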
import pandas as pd
d = {'location1': [1, 2,3,8,6], 'location2':
[2,1,4,6,8]}
df = pd.DataFrame(data=d)
print(df)
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
def Every_other_row(a):
    return a[::2]
Every_other_row(df)
location1 location2
0 1 2
2 3 4
4 6 8
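When the repeated rows are not guaranteed to fall on alternating positions, one option is to build an order-independent key for each row and drop duplicates on that key. The sketch below is an alternative technique, not the slicing approach above. The names key and deduped are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'location1': [1, 2, 3, 8, 6],
                   'location2': [2, 1, 4, 6, 8]})

# Sort the two IDs within each row so that (2, 1) and (1, 2) share one key
key = pd.DataFrame(
    np.sort(df[['location1', 'location2']].to_numpy(), axis=1),
    index=df.index,
)

# Keep the first row of every group of rows with the same key
deduped = df[~key.duplicated()]
print(deduped)
```

With the default keep='first', the first occurrence of each road is kept, so for the sample this retains rows 0, 2 and 3 rather than 0, 2 and 4. Passing keep='last' to duplicated keeps the later row of each pair instead.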