Pandas: Merging rows into one - python-3.x

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:
Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some related variants: agg, grouping by other fields, and so on):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:
      c2
Name
Anne  Foo
Tom   Bar

Just a groupby and a sum will do.
df.groupby(['Name','Age']).sum().reset_index()
   Name  Age Data_1 Data_2
0  Anne   20           Bar
1   tom   10   Test    Foo
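As a rough sketch of the same idea, the variant below joins the strings of each group explicitly instead of relying on sum. It assumes the empty cells are empty strings (not NaN) and uses the table from the question as data; the name merged is only illustrative.

```python
import pandas as pd

# Sample data from the question; empty cells are empty strings.
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

# Group on the identifying columns and concatenate the strings in each group.
# Joining with '' makes the empty cells disappear.
merged = (
    df.groupby(['Name', 'Age'], as_index=False)
      .agg({'Data_1': ''.join, 'Data_2': ''.join})
)
print(merged)
```

Grouping on both Name and Age keeps Age out of the aggregation, so it is not summed along with the text columns.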

Use this if the empty cells are NaN:
(df.set_index(['Name', 'Age'])
   .stack()
   .groupby(level=[0, 1, 2])
   .apply(''.join)
   .unstack()
   .reset_index()
)
Otherwise, add the line df.replace('', np.nan, inplace=True) (with import numpy as np) before the code above.
# Output
   Name  Age Data_1 Data_2
0  Anne   20    NaN    Bar
1   Tom   10   Test    Foo
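A shorter alternative, offered as a sketch rather than as this answer's method: once the empty cells are NaN, GroupBy.first returns the first non-missing value per column in each group. The data below is the table from the question, with empty strings assumed for the blank cells.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells start out as empty strings.
data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

result = (
    df.replace('', np.nan)                     # mark blank cells as missing
      .groupby(['Name', 'Age'], as_index=False)
      .first()                                 # first non-missing value per column
)
print(result)
```

Unlike joining strings, this keeps only one value per column, so it fits the case where each group has at most one filled cell per column.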

Related

How to replace a tag %%article%% by letter a

I have this dataframe:
pd.DataFrame({'text': ['I have %%article%% car', '%%article%%fter dawn', 'D%%article%%t%%article%%Fr%%article%%me']})
I am trying to replace %%article%% by letter a to have as output:
pd.DataFrame({'text': ['I have a car', 'after dawn', 'DataFrame']})
I tried to create a dict ={'%%article%%':'a'} and then:
df['text'] = df['text'].map(dict)
But it doesn't work; it returns NaN.
When you pass a dict to Series.map, it does a table lookup, so only elements that are exactly equal to '%%article%%' are replaced by 'a'.
An example from the docs:
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0 cat
1 dog
2 NaN
3 rabbit
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0 kitten
1 puppy
2 NaN
3 NaN
An element with something like 'ccat' will not be replaced. Instead, you can use a function to replace them:
>>> df = pd.DataFrame({'text': ['I have %%article%% car', '%%article%%fter dawn', 'D%%article%%t%%article%%Fr%%article%%me']})
>>> df.text = df.text.map(lambda i: i.replace('%%article%%', 'a'))
>>> df
text
0 I have a car
1 after dawn
2 DataFrame
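But the better option is probably DataFrame.replace with regex=True (without it, only cells that equal the tag exactly are replaced):
>>> df.replace('%%article%%', 'a', regex=True)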
text
0 I have a car
1 after dawn
2 DataFrame
Use:
df['text'].str.replace('%%article%%', 'a')
Output:
0 I have a car
1 after dawn
2 DataFrame
Name: text, dtype: object
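To make the difference between the approaches concrete, here is a small sketch that runs all three on the question's data. The variable names (tag, mapped, replaced, replaced_df) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'text': ['I have %%article%% car',
                            '%%article%%fter dawn',
                            'D%%article%%t%%article%%Fr%%article%%me']})
tag = '%%article%%'

# Series.map with a dict looks up whole values, so nothing matches here.
mapped = df['text'].map({tag: 'a'})

# Series.str.replace works on substrings; regex=False treats the tag literally.
replaced = df['text'].str.replace(tag, 'a', regex=False)

# DataFrame.replace only does substring replacement when regex=True.
replaced_df = df.replace(tag, 'a', regex=True)

print(mapped.isna().all())
print(replaced.tolist())
print(replaced_df['text'].tolist())
```

Neither replacement is written back to df here; assign the result to df['text'] to keep it.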

De-duplication with merge of data

I have a dataset with duplicates, triplicates and more. I want to keep only one record per unique id, merging the data from its duplicates, for example:
id  name  address  age  city
1   Alex  123,blv
1   Alex           13
3   Alex           24   Florida
1   Alex                Miami
Merging the data on the id field should give this output:
id  name  address  age  city
1   Alex  123,blv  13   Miami
3   Alex           24   Florida
I've slightly changed the code from this answer.
Code to create the initial dataframe:
import pandas as pd
import numpy as np
d = {'id': [1, 1, 3, 1],
     'name': ["Alex", "Alex", "Alex", "Alex"],
     'address': ["123,blv", None, None, None],
     'age': [None, 13, 24, None],
     'city': [None, None, "Florida", "Miami"]
     }
df = pd.DataFrame(data=d, index=d["id"])
print(df)
Output:
   id  name  address   age     city
1   1  Alex  123,blv   NaN     None
1   1  Alex     None  13.0     None
3   3  Alex     None  24.0  Florida
1   1  Alex     None   NaN    Miami
Aggregation code:
def get_notnull(x):
    # Return the non-null value(s) of the group, or NaN if there are none
    if x.notnull().any():
        return x[x.notnull()]
    else:
        return np.nan
aggregation_functions = {'name': 'first',
                         'address': get_notnull,
                         'age': get_notnull,
                         'city': get_notnull
                         }
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)
Output:
    name  address   age     city
id
1   Alex  123,blv  13.0    Miami
3   Alex      NaN  24.0  Florida
(
    df
    .reset_index(drop=True)  # give each record a unique index
    .groupby('id')           # group by the 'id' column (it is excluded from the aggregation)
    .agg(
        # return the first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()           # get back the 'id' value for each record
)
P.S. As an alternative:
df.replace([None, ''], pd.NA).groupby('id').first().reset_index()
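For reference, here is the groupby/first idea as a self-contained sketch on the question's sample data. It assumes the frame is built with the default index (the question passes index=d["id"], which is not needed for this approach); the name deduped is illustrative.

```python
import pandas as pd

# Sample data from the question, built with the default RangeIndex.
df = pd.DataFrame({
    'id': [1, 1, 3, 1],
    'name': ['Alex', 'Alex', 'Alex', 'Alex'],
    'address': ['123,blv', None, None, None],
    'age': [None, 13, 24, None],
    'city': [None, None, 'Florida', 'Miami'],
})

# GroupBy.first skips missing values, so each column keeps the first
# non-missing entry found among the rows sharing an id.
deduped = df.groupby('id', as_index=False).first()
print(deduped)
```

If several rows of one id carry different non-missing values in the same column, only the first is kept, so this is a merge only when the duplicates do not conflict.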

pandas help: map and match tab delimited strings in a column and print into new column

I have a dataframe, data, whose last column contains delimited strings of IDs, and a second dataframe, info, that says what each ID means. I want to map the user's input (items) to IDs through info, find which of those IDs occur in the last column of data, print and count the matches for each row, and sort data by the number of matches.
import pandas as pd
# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
Desired output:
id    Name  MP-ID                   match          count
3344  nop   MP:08597|MP:001|MP:005  MP:001|MP:005  2
123   abc   MP:001|MP:0085|MP:0985  MP:001         1
456   def   MP:005|MP:0258          MP:005         1
789   hij   MP:025|MP:5890                         0
1122  klm   MP:0589|MP:02546                       0
Thank you for your help in advance
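First collect the command-line arguments into the variable vals. Then select the matching MP-ID values from test_info with Series.isin and DataFrame.loc, extract them from each row with Series.str.findall, and rejoin them with Series.str.join. Finally, count the matches with Series.str.count and sort with DataFrame.sort_values: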
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
     id Name                   MP-ID    MP-ID match  count
0  3344  nop  MP:08597|MP:001|MP:005  MP:001|MP:005      2
1   123  abc  MP:001|MP:0085|MP:0985         MP:001      1
2   456  def          MP:005|MP:0258         MP:005      1
3   789  hij          MP:025|MP:5890                     0
4  1122  klm        MP:0589|MP:02546                     0
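One caveat with the regex approach is that a pattern such as MP:005 would also match inside a longer ID like MP:0058. The sketch below is an alternative that compares whole tokens instead: it splits each cell on '|' and keeps the tokens found in a set of wanted IDs. The helper match_ids and the name wanted are hypothetical, and vals is hard-coded in place of sys.argv[1:].

```python
import pandas as pd

test_data = pd.DataFrame({
    'id': [123, 456, 789, 1122, 3344],
    'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
    'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890',
              'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005'],
})
test_info = pd.DataFrame({
    'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
    'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango'],
})

vals = ['apple', 'mango']  # stand-in for sys.argv[1:]

# IDs that correspond to the requested items.
wanted = set(test_info.loc[test_info['Item'].isin(vals), 'MP-ID'])

def match_ids(cell):
    """Return the '|'-separated tokens of a cell that are wanted IDs."""
    return [token for token in cell.split('|') if token in wanted]

matches = test_data['MP-ID'].apply(match_ids)
test_data['MP-ID match'] = matches.apply('|'.join)
test_data['count'] = matches.apply(len)

# A stable sort keeps the original row order among rows with equal counts.
test_data = test_data.sort_values('count', ascending=False,
                                  kind='stable', ignore_index=True)
print(test_data)
```

Counting the matched tokens directly also avoids depending on the literal 'MP' prefix when computing the count.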

Pandas Create DataFrame with ColumnNames from a list

Given the following list, whose elements are sub-lists, I need to create a pandas dataframe:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
The desired output is as follows, with the first element of each sub-list becoming a column name in the dataframe:
   tom  nick  juli
0   10    15    14
Is there a way by which this output can be achieved?
Best Regards.
Use dictionary comprehension and pass to DataFrame constructor:
print ({x[0]: x[1:] for x in data})
{'tom': [10], 'nick': [15], 'juli': [14]}
df = pd.DataFrame({x[0]: x[1:] for x in data})
print (df)
tom nick juli
0 10 15 14
You could also use dict + extended iterable unpacking:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
result = pd.DataFrame(dict((column, values) for column, *values in data))
print(result)
Output
tom nick juli
0 10 15 14
We can also do:
pd.DataFrame(data).set_index(0).T
0 tom nick juli
1 10 15 14
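The transposed result keeps the leftover labels 0 and 1 from the original frame. If that matters, a possible clean-up (a sketch, with result as an illustrative name) is to clear the column-axis name and reset the row index:

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]

result = (
    pd.DataFrame(data)
      .set_index(0)                 # names become the index
      .T                            # transpose: names become the columns
      .rename_axis(None, axis=1)    # drop the leftover column-axis name (0)
      .reset_index(drop=True)       # row label 1 -> 0
)
print(result)
```

After this, the frame has the same labels as the one built with the dictionary comprehension.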

How to sum columns in python based on column with not empty string

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3],
    'Sum over columns': [1, 10, 8, 5, 10]})
Hi everybody, could you please help me with the following issue:
I'm trying to sum over columns to get a sum of data1 and data2.
data1 should be included in the sum only if its string column (key1) is not NaN, and data2 only if its string column (key2) is not NaN. The result I want is shown in the 'Sum over columns' column. Thank you for your help!
Try using the .apply method of df on axis=1 and NumPy's array multiplication function to get your desired output. Current pandas aligns Series and DataFrames on their labels inside NumPy functions, so the operands are converted to plain arrays first:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3]})
df['Sum over columns'] = df.apply(lambda x: np.multiply(x[['data1', 'data2']].to_numpy(), x[['key1', 'key2']].notna().to_numpy()).sum(), axis=1)
Or:
df['Sum over columns'] = np.multiply(df[['data1', 'data2']].to_numpy(), df[['key1', 'key2']].notna().to_numpy()).sum(axis=1)
Either one of them should yield:
#   key1  data1 key2  data2  Sum over columns
# 0  NaN      2   ab      1                 1
# 1    a      5   aa      5                10
# 2    b      8  NaN      9                 8
# 3    b      5  NaN      6                 5
# 4    a      7  one      3                10
I hope this helps.
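A more explicit way to express the same rule, as a sketch: replace each data value by 0 wherever its key is missing, then add the two columns. This uses Series.where and the question's data; the name total is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3]})

# Keep each data value only where its key is present; otherwise count it as 0.
total = (
    df['data1'].where(df['key1'].notna(), 0)
    + df['data2'].where(df['key2'].notna(), 0)
)
df['Sum over columns'] = total
print(df)
```

Because each data column is paired with its key by name, this version does not depend on the order of the columns in the frame.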
