De-duplication with merge of data - python-3.x

I have a dataset with duplicates, triplicates and more, and I want to keep only one record per unique id, merging the data from all of its rows. For example:
id  name  address  age  city
1   Alex  123,blv
1   Alex           13
3   Alex           24   Florida
1   Alex                Miami
Merging data using the id field:
Output:
id  name  address  age  city
1   Alex  123,blv  13   Miami
3   Alex           24   Florida

I've slightly changed the code from this answer.
Code to create the initial dataframe:
import pandas as pd
import numpy as np

d = {'id': [1, 1, 3, 1],
     'name': ["Alex", "Alex", "Alex", "Alex"],
     'address': ["123,blv", None, None, None],
     'age': [None, 13, 24, None],
     'city': [None, None, "Florida", "Miami"]
     }
df = pd.DataFrame(data=d, index=d["id"])
print(df)
Output:
   id  name  address   age     city
1   1  Alex  123,blv   NaN     None
1   1  Alex     None  13.0     None
3   3  Alex     None  24.0  Florida
1   1  Alex     None   NaN    Miami
Aggregation code:
def get_notnull(x):
    if x.notnull().any():
        return x[x.notnull()]
    else:
        return np.nan

aggregation_functions = {'name': 'first',
                         'address': get_notnull,
                         'age': get_notnull,
                         'city': get_notnull
                         }
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)
Output:
    name  address   age     city
id
1   Alex  123,blv  13.0    Miami
3   Alex      NaN  24.0  Florida
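Note that x[x.notnull()] returns a Series rather than a scalar, so the aggregation above only behaves as expected because each group here has at most one non-null value per column. A minimal sketch of a scalar-safe variant (my addition, not part of the original question):

def get_first_notnull(x):
    # return the first non-null value in the group, or NaN if there is none
    valid = x.dropna()
    return valid.iloc[0] if len(valid) else np.nan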

(
    df
    .reset_index(drop=True)  # set a unique index for each record
    .drop('id', axis=1)      # exclude the 'id' column from processing
    .groupby(df['id'])       # group by 'id'
    .agg(
        # return the first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()           # get back the 'id' value for each record
)
P.S. As an option:
df.replace([None, ''], pd.NA).groupby('id').first().reset_index()
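For reference, a sketch of running that one-liner on the original four-row frame (rebuilding it first, since df was overwritten by the groupby above; exact dtypes may vary by pandas version, and pd.NA requires pandas >= 1.0):

df = pd.DataFrame(data=d, index=d["id"])  # rebuild the original frame
out = df.replace([None, ''], pd.NA).groupby('id').first().reset_index()
print(out)
#    id  name  address   age     city
# 0   1  Alex  123,blv  13.0    Miami
# 1   3  Alex     <NA>  24.0  Florida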

Related

Pandas: Merging rows into one

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:
Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some related variants: agg, groupby on other fields, et cetera):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, 1, 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:
        c2
Name
Anne   Foo
Tom    Bar
Just a groupby and a sum will do.
df.groupby(['Name','Age']).sum().reset_index()
   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
Use this if the empty cells are NaN:
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add this line df.replace('', np.nan, inplace=True) before the code above.
# Output
   Name  Age Data_1 Data_2
0  Anne   20    NaN    Bar
1   Tom   10   Test    Foo
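Putting the two steps together, a self-contained sketch of that pipeline on the question's sample data (my assembly of the answer's pieces):

import numpy as np
import pandas as pd

data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df.replace('', np.nan, inplace=True)   # empty cells become NaN
merged = (df.set_index(['Name', 'Age'])
            .stack()                   # one row per non-NaN cell
            .groupby(level=[0, 1, 2])  # regroup per (Name, Age, column)
            .apply(''.join)            # concatenate the surviving strings
            .unstack()                 # back to one column per Data_*
            .reset_index()
          )
print(merged)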

How to move a data frame row to the next row in pandas

I am new to Python; my goal is to take each name row and shift the name onto the following rows.
import pandas as pd
import numpy as np

df = pd.DataFrame({"1": ['Alfred', 'car', 'bike', 'Alex', 'car'],
                   "2": [np.nan, 'Ford', 'Giant', np.nan, 'Toyota'],
                   "3": [pd.NaT, pd.Timestamp("2018-01-01"),
                         pd.Timestamp("2018-07-01"), np.nan, pd.Timestamp("2021-01-01")]})
        1       2          3
0  Alfred     NaN        NaT
1     car    Ford 2018-01-01
2    bike   Giant 2018-07-01
3    Alex     NaN        NaT
4     car  Toyota 2021-01-01
My desired result is as below:
df = pd.DataFrame({"transportation": ['car', 'bike', 'car'],
                   "Mark": ['Ford', 'Giant', 'Toyota'],
                   "BuyDate": [pd.Timestamp("2018-01-01"),
                               pd.Timestamp("2018-07-01"), pd.Timestamp("2021-01-01")],
                   "Name": ['Alfred', 'Alfred', 'Alex']
                   })
transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
I tried to search for a method but could not solve this. Thanks for looking at my post and helping.
Thanks mozway, jezrael and mcsoini for the help; it works, and I'm going to study those different methods.
Joseph Assaker: I have a question about your answer. When I run the code below, it shows an error. Am I missing something?
j = 0
for i in range(1, df.shape[0]):
    if df.loc[i][1] is np.nan:
        running_name = df.loc[i][0]
        continue
    new_df.loc[j] = list(df.loc[i]) + [running_name]
    j += 1
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14216/1012510729.py in <module>
4 running_name = df.loc[i][0]
5 continue
----> 6 new_df.loc[j] = list(df.loc[i]) + [running_name]
7 j += 1
NameError: name 'running_name' is not defined
The idea is to forward fill the name values, masked via the Mark column, into a Name column, and then filter the rows with the same mask:
df.columns = ["Transportation", "Mark", "BuyDate"]
m = df["Mark"].notna()
df["Name"] = df["Transportation"].mask(m).ffill()
df = df[m].reset_index(drop=True)
print(df)
Transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
You can do this using a helper column and then a forward fill:
# rename columns
df.columns = ["transportation", "Mark", "BuyDate"]
# assumption: the rows where "Mark" is NaN defines the name for the following rows
df["is_name"] = df["Mark"].isna()
# create a new column which is NaN everywhere except for the name rows
df["name"] = np.where(df.is_name, df["transportation"], np.nan)
# do a forward fill to extend the names to all rows
df["name"] = df["name"].fillna(method="ffill")
# filter by non-name rows and drop the temporary is_name column
df = df.loc[~df.is_name].drop("is_name", axis=1)
print(df)
Out:
transportation Mark BuyDate name
1 car Ford 2018-01-01 Alfred
2 bike Giant 2018-07-01 Alfred
4 car Toyota 2021-01-01 Alex
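A small version note (my addition): in pandas 2.1+, fillna(method="ffill") is deprecated; the equivalent spelling is:

df["name"] = df["name"].ffill()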
You could use this pipeline:
m = df.iloc[:, 1].notna()
(df.assign(Name=df.iloc[:, 0].mask(m).ffill())  # add new column
   .loc[m]  # keep only the rows with info
   # below: rework df to fit output
   .rename(columns={'1': 'transportation', '2': 'Mark', '3': 'BuyDate'})
   .reset_index(drop=True)
)
output:
transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
You can do this like so:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"1": ['Alfred', 'car', 'bike','Alex','car'],
... "2": [np.nan, 'Ford', 'Giant',np.nan,'Toyota'],
... "3": [pd.NaT, pd.Timestamp("2018-01-01"),
... pd.Timestamp("2018-07-01"),np.nan,pd.Timestamp("2021-01-01")]})
>>>
>>> df
1 2 3
0 Alfred NaN NaT
1 car Ford 2018-01-01
2 bike Giant 2018-07-01
3 Alex NaN NaT
4 car Toyota 2021-01-01
>>>
>>> new_df = pd.DataFrame(columns=['Transportation', 'Mark', 'BuyDate', 'Name'])
>>>
>>> j = 0
>>> for i in range(df.shape[0]):
... if df.loc[i][1] is np.nan:
... running_name = df.loc[i][0]
... continue
... new_df.loc[j] = list(df.loc[i]) + [running_name]
... j += 1
...
>>> new_df
Transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
>>>

How to find how many times a given value repeats in a pandas Series if the Series contains lists of string values?

Example 1:
Suppose we have a record data series:
record
['ABC', 'GHI']
['ABC', 'XYZ']
['XYZ', 'PQR']
I want to calculate how many times each value repeats in the record series, like:
value  Count
'ABC'  2
'XYZ'  2
'GHI'  1
'PQR'  1
In the record series, 'ABC' and 'XYZ' each appear 2 times; 'GHI' and 'PQR' each appear once.
Example 2:
Below is the new dataframe:
teams
0 ['Australia', 'Sri Lanka']
1 ['Australia', 'Sri Lanka']
2 ['Australia', 'Sri Lanka']
3 ['Ireland', 'Hong Kong']
4 ['Zimbabwe', 'India']
... ...
1412 ['Pakistan', 'Sri Lanka']
1413 ['Bangladesh', 'India']
1414 ['United Arab Emirates', 'Netherlands']
1415 ['Sri Lanka', 'Australia']
1416 ['Sri Lanka', 'Australia']
Now if I apply
print(new_df.explode('teams').value_counts())
it gives me
teams
['England', 'Pakistan'] 29
['Australia', 'Pakistan'] 26
['England', 'Australia'] 25
['Australia', 'India'] 24
['England', 'West Indies'] 23
... ..
['Namibia', 'Sierra Leone'] 1
['Namibia', 'Scotland'] 1
['Namibia', 'Oman'] 1
['Mozambique', 'Rwanda'] 1
['Afghanistan', 'Bangladesh'] 1
Length: 399, dtype: int64
But I want:
team       occurrence of team
India      ?
England    ?
Australia  ?
...        ...
I want the occurrence of each individual team from the dataframe. How can I perform this task?
Try explode and value_counts
On Series:
import pandas as pd

s = pd.Series({0: ['ABC', 'GHI'],
               1: ['ABC', 'XYZ'],
               2: ['XYZ', 'PQR']})
r = s.explode().value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
dtype: int64
On DataFrame:
import pandas as pd

df = pd.DataFrame({'record': {0: ['ABC', 'GHI'],
                              1: ['ABC', 'XYZ'],
                              2: ['XYZ', 'PQR']}})
r = df.explode('record')['record'].value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
Name: record, dtype: int64
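One likely reason the question's Example 2 printed whole pairs instead of individual teams: if the teams column holds strings that merely look like lists (e.g. loaded from CSV), explode has nothing to split. A hedged sketch of parsing them first, assuming the strings are valid Python literals:

import ast
import pandas as pd

new_df = pd.DataFrame({'teams': ["['Australia', 'Sri Lanka']",
                                 "['Ireland', 'Hong Kong']",
                                 "['Sri Lanka', 'Australia']"]})
# parse the string representation into real lists, then explode and count
new_df['teams'] = new_df['teams'].apply(ast.literal_eval)
print(new_df.explode('teams')['teams'].value_counts())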

Python - Transpose/Pivot a column based on a different column

I searched and indeed found a lot of similar questions, but none of them seemed to answer my case.
I have a pandas DataFrame which is a joined table consisting of products and the countries in which they are sold.
It's 3000 rows and 50 columns in size.
I'm uploading a photo (only part of the df) of the current situation and the expected result I want to achieve.
I want to pivot the 'Country name' column into columns grouped by the 'Product code name'. Please note that the new country columns are not limited to a certain number of countries (some products have 3, some 40).
Thank you!
Use .cumcount() to count the number of countries that a product has.
Then use .pivot() to get your dataframe in the right shape:
df = pd.DataFrame({
    'Country': ['NL', 'Poland', 'Spain', 'Sweden', 'China', 'Egypt'],
    'Product Code': ['123', '123', '115', '115', '117', '118'],
    'Product Name': ['X', 'X', 'Y', 'Y', 'Z', 'W'],
})
df['cumcount'] = df.groupby(['Product Code', 'Product Name'])['Country'].cumcount() + 1
df_pivot = df.pivot(
    index=['Product Code', 'Product Name'],
    columns='cumcount',
    values='Country',
).add_prefix('country_')
Resulting dataframe:
cumcount                  country_1 country_2
Product Code Product Name
115          Y                Spain    Sweden
117          Z                China       NaN
118          W                Egypt       NaN
123          X                   NL    Poland
Try this:
df_out = df.set_index(['Product code',
                       'Product name',
                       df.groupby('Product code').cumcount() + 1]).unstack()
df_out.columns = [f'Country_{j}' for _, j in df_out.columns]
df_out.reset_index()
Output:
Product code Product name Country_1 Country_2 Country_3
0 AAA115 Y Sweden China NaN
1 AAA117 Z Egypt Greece NaN
2 AAA118 W France Italy NaN
3 AAA123 X Netherlands Poland Spain
Details:
Reshape the dataframe with set_index and unstack, using cumcount to number each product's countries; the cumcount key is what makes every index/column pair unique, which unstack (like pivot) requires. Then flatten the MultiIndex header with a list comprehension.

pandas help: map and match pipe-delimited strings in a column and print into a new column

I have a dataframe data whose last column contains a bunch of strings and digits, and another dataframe info that describes what those strings and digits mean. I want to map the user input (items) against info, match them, print and count how many of them are present in the last column of data, and prioritize (sort) the dataframe data by the number of matches.
import pandas as pd

# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890',
                  'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)

# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
        'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
Desired output:
  id   Name  MP-ID                   match          count
3344   nop   MP:08597|MP:001|MP:005  MP:001|MP:005  2
 123   abc   MP:001|MP:0085|MP:0985  MP:001         1
 456   def   MP:005|MP:0258          MP:005         1
 789   hij   MP:025|MP:5890                         0
1122   klm   MP:0589|MP:02546                       0
Thank you for your help in advance
First get all the arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches via Series.str.findall with Series.str.join, and finally use Series.str.count with DataFrame.sort_values:
import sys

vals = sys.argv[1:]
# vals = ['apple', 'mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print(test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
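A caveat worth noting (my addition): str.findall treats the joined IDs as a regular expression, so a short ID can also match inside a longer one (for example, a hypothetical MP:0012 would contain MP:001). A hedged sketch that escapes each ID and anchors it between '|' separators:

import re

# require a whole field: no non-'|' character directly before or after the ID
pattern = '|'.join(rf'(?<![^|]){re.escape(i)}(?![^|])' for i in s)
test_data['MP-ID match'] = test_data['MP-ID'].str.findall(pattern).str.join('|')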
