How to find how many times a given value repeats in a pandas Series if the Series contains lists of string values? - python-3.x

Example 1:
Suppose we have a data series record:
record
['ABC', 'GHI']
['ABC', 'XYZ']
['XYZ', 'PQR']
I want to calculate how many times each value repeats in the record series, like:
value  Count
'ABC'  2
'XYZ'  2
'GHI'  1
'PQR'  1
In the record series, 'ABC' and 'XYZ' each repeat 2 times; 'GHI' and 'PQR' each appear 1 time.
Example 2:
Below is the new dataframe.
teams
0 ['Australia', 'Sri Lanka']
1 ['Australia', 'Sri Lanka']
2 ['Australia', 'Sri Lanka']
3 ['Ireland', 'Hong Kong']
4 ['Zimbabwe', 'India']
... ...
1412 ['Pakistan', 'Sri Lanka']
1413 ['Bangladesh', 'India']
1414 ['United Arab Emirates', 'Netherlands']
1415 ['Sri Lanka', 'Australia']
1416 ['Sri Lanka', 'Australia']
Now if I apply
print(new_df.explode('teams').value_counts())
it gives me
teams
['England', 'Pakistan'] 29
['Australia', 'Pakistan'] 26
['England', 'Australia'] 25
['Australia', 'India'] 24
['England', 'West Indies'] 23
... ..
['Namibia', 'Sierra Leone'] 1
['Namibia', 'Scotland'] 1
['Namibia', 'Oman'] 1
['Mozambique', 'Rwanda'] 1
['Afghanistan', 'Bangladesh'] 1
Length: 399, dtype: int64
But I want
team       occurrence of team
India      ?
England    ?
Australia  ?
...        ...
I want the occurrence of each team from the dataframe.
How to perform this task?

Try explode and value_counts
On Series:
import pandas as pd
s = pd.Series({0: ['ABC', 'GHI'],
               1: ['ABC', 'XYZ'],
               2: ['XYZ', 'PQR']})
r = s.explode().value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
dtype: int64
On DataFrame:
import pandas as pd
df = pd.DataFrame({'record': {0: ['ABC', 'GHI'],
                              1: ['ABC', 'XYZ'],
                              2: ['XYZ', 'PQR']}})
r = df.explode('record')['record'].value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
Name: record, dtype: int64
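The same idea applies to the teams dataframe from Example 2 (assuming it is named new_df). Note that the output shown in the question counts whole pairs, which usually means the teams column holds strings that merely look like lists; if so, parse them first. A sketch under that assumption:
import ast
# If 'teams' holds strings like "['Australia', 'Sri Lanka']" instead of real
# lists, parse them into lists first (otherwise explode() has no effect).
new_df['teams'] = new_df['teams'].apply(
    lambda v: ast.literal_eval(v) if isinstance(v, str) else v)
# One row per team, then count occurrences of each team
print(new_df.explode('teams')['teams'].value_counts())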

Related

Pandas: Merging rows into one

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:
Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some other related approaches: agg, groupby on other fields, et cetera):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:
Name      c2
--------  --------------
Anne      Foo
Tom       Bar
Just a groupby and a sum will do.
df.groupby(['Name','Age']).sum().reset_index()
   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
Use this if the empty cells are NaN :
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add this line df.replace('', np.nan, inplace=True) before the code above.
# Output
   Name  Age Data_1 Data_2
0  Anne   20    NaN    Bar
1   Tom   10   Test    Foo
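Putting the two pieces together, a minimal runnable sketch assuming the empty cells are empty strings, as in the question's code:
import numpy as np
import pandas as pd
data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
# Empty strings are not NaN, so convert them before stacking
df = df.replace('', np.nan)
merged = (df.set_index(['Name', 'Age'])
            .stack()                      # drops the NaN cells
            .groupby(level=[0, 1, 2])
            .apply(''.join)               # one value per (Name, Age, column)
            .unstack()
            .reset_index())
print(merged)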

De-duplication with merge of data

I have a dataset with duplicates, triplicates and more, and I want to keep only one record for each unique id, merging the data. For example:
id  name  address  age  city
1   Alex  123,blv
1   Alex           13
3   Alex           24   Florida
1   Alex                Miami
Merging the data using the id field, the output should be:
id  name  address  age  city
1   Alex  123,blv  13   Miami
3   Alex           24   Florida
I've changed the code from this answer a bit.
Code to create the initial dataframe:
import pandas as pd
import numpy as np
d = {'id': [1, 1, 3, 1],
     'name': ["Alex", "Alex", "Alex", "Alex"],
     'address': ["123,blv", None, None, None],
     'age': [None, 13, 24, None],
     'city': [None, None, "Florida", "Miami"]}
df = pd.DataFrame(data=d, index=d["id"])
print(df)
Output:
id name address age city
1 1 Alex 123,blv NaN None
1 1 Alex None 13.0 None
3 3 Alex None 24.0 Florida
1 1 Alex None NaN Miami
Aggregation code:
def get_notnull(x):
    if x.notnull().any():
        return x[x.notnull()]
    else:
        return np.nan

aggregation_functions = {'name': 'first',
                         'address': get_notnull,
                         'age': get_notnull,
                         'city': get_notnull}
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)
Output:
name address age city
id
1 Alex 123,blv 13.0 Miami
3 Alex NaN 24.0 Florida
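Note that get_notnull returns a Series (all non-null values) rather than a scalar, which only aggregates cleanly when each group has at most one non-null value per column. A variant that always returns a scalar (the first non-null value) may be more robust — a sketch, not part of the original answer, assuming df is the original dataframe created above:
def first_notnull(x):
    # return the first non-null value in the group, or NaN if there is none
    non_null = x.dropna()
    return non_null.iloc[0] if len(non_null) else np.nan

aggregation_functions = {'name': 'first',
                         'address': first_notnull,
                         'age': first_notnull,
                         'city': first_notnull}
print(df.groupby(df['id']).aggregate(aggregation_functions))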
(
    df
    .reset_index(drop=True)  # give each record a unique index
    .groupby('id')           # group by 'id' (the key column is excluded from the aggregation)
    .agg(
        # return first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()           # get back the 'id' value for each record
)
PS: As an option:
df.replace([None, ''], pd.NA).groupby('id').first().reset_index()

Python pandas move cell value to another cell in same row

I have a dataFrame like this:
id  Description   Price      Unit
1   Test Only     1254       12
2   Data test     Fresher    4
3   Sample        3569       1
4   Sample Onces  Code test
5   Sample        245        2
I want to move non-integer values from the Price column left into the Description column, and the Price cell should then become NaN. There is no specific word to match: whenever the Price column holds a non-integer value, that string should move to the Description column.
I already tried pandas replace and concat but it doesn't work.
Desired output is like this:
id  Description  Price  Unit
1   Test Only    1254   12
2   Fresher             4
3   Sample       3569   1
4   Code test
5   Sample       245    2
This should work
import numpy as np
import pandas as pd
# data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Description': ['Test Only', 'Data test', 'Sample', 'Sample Onces', 'Sample'],
                   'Price': ['1254', 'Fresher', '3569', 'Code test', '245'],
                   'Unit': [12, 4, 1, np.nan, 2]})
# convert price column to numeric and coerce errors
price = pd.to_numeric(df.Price, errors='coerce')
# for rows where price is not numeric, replace description with these values
df.Description = df.Description.mask(price.isna(), df.Price)
# assign numeric price to price column
df.Price = price
df
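Running the snippet above should produce output along these lines:
   id Description   Price  Unit
0   1   Test Only  1254.0  12.0
1   2     Fresher     NaN   4.0
2   3      Sample  3569.0   1.0
3   4   Code test     NaN   NaN
4   5      Sample   245.0   2.0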
Use:
#convert valeus to numeric
price = pd.to_numeric(df['Price'], errors='coerce')
#test missing values
m = price.isna()
#shifted only matched rows
df.loc[m, ['Description','Price']] = df.loc[m, ['Description','Price']].shift(-1, axis=1)
print (df)
   id Description Price
0   1   Test Only  1254
1   2     Fresher   NaN
2   3      Sample  3569
3   4   Code test   NaN
4   5      Sample   245
If you need numeric values in the output Price column:
df = df.assign(Price=price)
print (df)
   id Description   Price
0   1   Test Only  1254.0
1   2     Fresher     NaN
2   3      Sample  3569.0
3   4   Code test     NaN
4   5      Sample   245.0

pandas help: map and match tab delimited strings in a column and print into new column

I have a dataframe data whose last column contains a bunch of strings and digits, and another dataframe info describing what those strings and digits mean. I want to map user input (items) against info, match them, print and count how many are present in the last column of data, and prioritize the dataframe data based on the number of matches.
import pandas as pd
# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
        'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
Desired output:
id    Name  MP-ID                   match          count
3344  nop   MP:08597|MP:001|MP:005  MP:001|MP:005  2
123   abc   MP:001|MP:0085|MP:0985  MP:001         1
456   def   MP:005|MP:0258          MP:005         1
789   hij   MP:025|MP:5890                         0
1122  klm   MP:0589|MP:02546                       0
Thank you for your help in advance.
First get all arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches with Series.str.findall and Series.str.join, and last use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
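One caveat with '|'.join(s): a short ID such as MP:001 would also match inside a longer one such as MP:0012 if that ever appeared in the data. If that can happen, anchor each ID so only whole IDs match — a sketch, reusing s from the answer above:
import re
# Require each ID to be bounded by '|' (or the string edges), so MP:001
# no longer matches inside MP:0012.
pattern = '|'.join(rf'(?<![\w:]){re.escape(x)}(?![\w:])' for x in s)
test_data['MP-ID match'] = test_data['MP-ID'].str.findall(pattern).str.join('|')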

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
# copy the data above and run the code below to reproduce the dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and take the mean of ratio over the rows that meet the following conditions: status is finished and start_date falls in 2019-09 or 2019-10
result_count: group by id and address and count the rows that meet the following conditions: status is either finished or failed, and start_date falls in 2019-09 or 2019-10
The desired output will look like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter rows whose start_date falls in 2019-09 or 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to use those to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some rows are excluded entirely (like id=1), add them back with DataFrame.join against all unique id and address pairs in df2, and last replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
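If you want result_count as an integer column like the desired output, you can cast after the fillna — a small addition, not part of the quoted answer:
df1['result_count'] = df1['result_count'].astype(int)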
Some helpers
def mean_ratio(idf):
    # filtering data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
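For reference, the final concat should produce roughly this (the values match the desired output after rounding):
                         mean_ratio  result_count
id address
1  7552 Atlantic Lane           NaN             0
2  888 Foster Street           1.32             1
3  5 Pawnee Avenue             1.25             1
4  916 W. Mill Pond St.        1.44             3
5  68 Henry Drive               NaN             2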
