drop duplicated and concat pandas - python-3.x

I have a dataframe that looks like this:
'id': ["1", "2", "1", "3", "3", "4"],
'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
'code': ["CB25", "CD15", "CZ10", None, None, "AZ51"],
'col_example': ["22", None, "22", "55", "55", "121"],
'comments': ["bonjour", "bonjour", "bonjour", "hola", "Hello", None]
Result:
id date code col_example .... comments
0 1 2019 CB25/CZ10 22 .... bonjour (and not bonjour // bonjour)
1 2 2011 CD15 None .... bonjour
2 3 2017 None 55 .... hola // Hello
3 4 2018 AZ51 121 .... None
I want to keep a single row per id.
If two rows have the same id, I would like:
If one comment is None and the other is a str: keep only the comment that is not None (example: id = 1, keep the comment "bonjour").
If both comments are str: concatenate the two comments with " // " (example: id = 3, comments = "hola // Hello").
So far I have tried with sort_values and drop_duplicates, without success.
Thank you

I believe you need GroupBy.agg with a ' // '.join of the non-None comments (Series.dropna) and 'last' for the other columns, and finally DataFrame.replace to turn empty strings back into None:
df1 = (df.groupby('id')
         .agg({'date': 'last',
               'comments': lambda x: ' // '.join(x.dropna())})
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)
id date comments
0 1 2019 bonjour
1 2 2011 bonjour
2 3 2017 hola // Hello
3 4 2018 None
EDIT: To avoid dropping all the other columns you need to aggregate all of them; you can build the aggregation dictionary dynamically like this:
df = pd.DataFrame({'id': ["1", "2", "1", "3", "3", "4"],
                   'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
                   'code': ["CB25", "CD15", "CB25", None, None, "AZ51"],
                   'col_example': ["22", None, "22", "55", "55", "121"],
                   'comments': [None, "bonjour", "bonjour", "hola", "Hello", None]})
print (df)
id date code col_example comments
0 1 2017 CB25 22 None
1 2 2011 CD15 None bonjour
2 1 2019 CB25 22 bonjour
3 3 2013 None 55 hola
4 3 2017 None 55 Hello
5 4 2018 AZ51 121 None
d = dict.fromkeys(df.columns.difference(['id','comments']), 'last')
d['comments'] = lambda x: ' // '.join(x.dropna())
print (d)
{'code': 'last', 'col_example': 'last', 'date': 'last',
'comments': <function <lambda> at 0x000000000ECA99D8>}
df1 = (df.groupby('id')
         .agg(d)
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)
id code col_example date comments
0 1 CB25 22 2019 bonjour
1 2 CD15 None 2011 bonjour
2 3 None 55 2017 hola // Hello
3 4 AZ51 121 2018 None
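The expected result in the question also joins the distinct codes of an id with "/" (CB25/CZ10 for id 1 in the original data) and avoids repeating identical comments. A possible sketch for that case (an assumption beyond the answer above, reusing the df and the dictionary idea from the EDIT) is to join the unique non-null values instead of taking 'last':
# sketch: join unique non-null codes with "/" and unique comments with " // "
d = dict.fromkeys(df.columns.difference(['id', 'code', 'comments']), 'last')
d['code'] = lambda x: '/'.join(x.dropna().unique())
d['comments'] = lambda x: ' // '.join(x.dropna().unique())
df1 = (df.groupby('id')
         .agg(d)
         .replace({'code': {'': None}, 'comments': {'': None}})
         .reset_index())
print (df1)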

Related

changing values in data frame based on duplicates - python

I have quite a large data set of over 100k rows with many duplicates and some missing or faulty values. I'm trying to simplify the problem in the snippet below.
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', np.nan],
    'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
I'm trying to change values based on duplicate rows, so that if any three fields match, the fourth one should match as well. I should get an outcome like this:
result = {
    'BI Business Name': ['AAA', 'AAA', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', 'www#3'],
    'BI Telephone': ['999', '999', '666', '12345', '12345']
}
df = pd.DataFrame(result)
I have found an extremely long-winded method; here I'm showing just the part that changes the name.
df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone']
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode', 'BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = (reference_name['BId Postcode'] +
                                    reference_name['BI Website'] +
                                    reference_name['BI Telephone'])
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name, duplicate_ref]).drop_duplicates(keep=False)
reference_name

def replace_name(row):
    try:
        old_name = row['BI Business Name']
        reference = row['Phone_code_web']
        new_name = reference_name[reference_name['Phone_code_web'] == reference].iloc[0, 0]
        print(new_name)
        return new_name
    except Exception as e:
        return old_name

df['BI Business Name'] = df.apply(replace_name, axis=1)
df
Is there an easier way of doing this?
You can try this:
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', np.nan],
    'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
print(df)

def fill_gaps(_df, _x):
    # _df and _x are local variables that represent the dataframe and one of its rows, respectively
    # pd.isnull(_x) = list of Booleans indicating which columns have NaNs
    # _df.columns[pd.isnull(_x)] = list of columns whose value is a NaN
    for col in _df.columns[pd.isnull(_x)]:
        # len(set(y) & set(_x)) = size of the intersection of the row being considered (_x)
        # and each of the other rows in turn (y); the mask is True for rows where:
        # 1) y[col] is not null (e.g. for row 3 we need to replace (BI Telephone = NaN)
        #    with a non-NaN 'BI Telephone' value)
        # 2) and the intersection above has the required size of 3
        mask = _df.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
        # if the mask has at least one True value, take the value from the corresponding column
        # (if there are several candidates, take the first one)
        _x[col] = _df[mask][col].iloc[0] if any(mask) else _x[col]
    return _x

# apply the logic described above to each row in turn (x = each row)
df = df.apply(lambda x: fill_gaps(df, x), axis=1)
print(df)
Output:
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www#1 999
1 BBB NW1 8NZ www#1 999
2 CCC WC2N 4AA www#2 666
3 DDD CV7 9JY www#3 NaN
4 DDD CV7 9JY NaN 12345
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www#1 999
1 BBB NW1 8NZ www#1 999
2 CCC WC2N 4AA www#2 666
3 DDD CV7 9JY www#3 12345
4 DDD CV7 9JY www#3 12345
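The inner apply scans the whole frame once per missing cell, which is quadratic and may be slow on the full 100k rows. A possible refinement (a sketch, assuming 'BId Postcode' is never missing and can pre-filter the candidate rows; that holds in this sample but should be checked on the real data):
def fill_gaps_by_postcode(_df, _x):
    # only compare against rows that already share the postcode
    candidates = _df[_df['BId Postcode'] == _x['BId Postcode']]
    for col in _df.columns[pd.isnull(_x)]:
        mask = candidates.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
        _x[col] = candidates[mask][col].iloc[0] if mask.any() else _x[col]
    return _x

df = pd.DataFrame(sampleData)
df = df.apply(lambda x: fill_gaps_by_postcode(df, x), axis=1)
print(df)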

Pandas dataframe transpose with column name instead of index

I can't seem to figure out how to show the actual column names in the JSON after the dataframe has been transposed. Any thoughts, please?
from pandasql import *
import pandas as pd
pysqldf = lambda q: sqldf(q, globals())
q1 = """
SELECT
beef as beef, veal as veal, pork as pork, lamb_and_mutton as lamb
FROM
meat m
LIMIT 3;
"""
meat = load_meat()
df = pysqldf(q1)
#df = df.reset_index(drop=True)
#print(df.T.to_json(orient='records'))
df1 = df.T.reset_index(drop=True)
df1.columns = range(len(df1.columns))
print(df.T.to_json(orient='records'))
Output
[{"0":751.0,"1":713.0,"2":741.0},{"0":85.0,"1":77.0,"2":90.0},{"0":1280.0,"1":1169.0,"2":1128.0},{"0":89.0,"1":72.0,"2":75.0}]
Expected Output
[ { "0": "beef", "1": 751, "2": 713, "3": 741},{"0": "veal", "1": 85, "2": 77, "3": 90 },{"0": "pork", "1": 1280, "2": 1169, "3": 1128},{ "0": "lamb", "1": 89, "2": 72, "3": 75 }]
Try this:
Where df is:
beef veal pork lamb
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Use T, reset_index, and set_axis:
df.T.reset_index()\
.set_axis(range(len(df.columns)), axis=1, inplace=False)\
.to_json(orient='records')
Output:
'[{"0":"beef","1":0,"2":4,"3":8},{"0":"veal","1":1,"2":5,"3":9},{"0":"pork","1":2,"2":6,"3":10},{"0":"lamb","1":3,"2":7,"3":11}]'

Pandas - Fastest way indexing with 2 dataframes

I am developing software in Python 3 with the pandas library.
Time is very important, but memory not so much.
For better visualization I am using the names a and b with only a few values, although there are many more:
a -> 50000 rows
b -> 5000 rows
I need to select from dataframes a and b (using multiple conditions)
a = pd.DataFrame({
    'a1': ['x', 'y', 'z'],
    'a2': [1, 2, 3],
    'a3': [3.14, 2.73, -23.00],
    'a4': [pd.np.nan, pd.np.nan, pd.np.nan]
})
a
a1 a2 a3 a4
0 x 1 3.14 NaN
1 y 2 2.73 NaN
2 z 3 -23.00 NaN
b = pd.DataFrame({
    'b1': ['x', 'y', 'z', 'k', 'l'],
    'b2': [2018, 2019, 2020, 2015, 2012]
})
b
b1 b2
0 x 2018
1 y 2019
2 z 2020
3 k 2015
4 l 2012
So far my code is like this:
for index, row in a.iterrows():
    try:
        # create a key
        a1 = row["a1"]
        mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
        # check if exists
        if (len(mask.index) != 0):  # not empty
            a.loc[[index], ['a4']] = mask.iloc[0]['b2']
    except KeyError:  # not found
        pass
But as you can see, I'm using iterrows, which is very slow compared to other methods, and I'm changing values of the DataFrame I'm iterating over, which is not recommended.
Could you help me find a better way? The result should look like this:
a
a1 a2 a3 a4
0 x 1 3.14 2018
1 y 2 2.73 NaN
2 z 3 -23.00 2020
I tried things like the line below, but I didn't manage to make it work.
a.loc[ (a['a1'] == b['b1']) , 'a4'] = b.loc[b['b2'] != 2019]
*the real code has more conditions
Thanks!
EDIT
I benchmarked using iterrows, merge, and set_index/loc. Here is the code:
import timeit
import pandas as pd

def f_iterrows():
    for index, row in a.iterrows():
        try:
            # create a key
            a1 = row["a1"]
            a3 = row["a3"]
            mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
            # check if exists
            if len(mask.index) != 0:  # not empty
                a.loc[[index], ['a4']] = mask.iloc[0]['b2']
        except:  # not found
            pass

def f_merge():
    a.merge(b[b.b2 != 2019], left_on='a1', right_on='b1', how='left').drop(['a4', 'b1'], 1).rename(columns={'b2': 'a4'})

def f_lock():
    df1 = a.set_index('a1')
    df2 = b.set_index('b1')
    df1.loc[:, 'a4'] = df2.b2[df2.b2 != 2019]

# variables for testing
number_rows = 100
number_iter = 100

a = pd.DataFrame({
    'a1': ['x', 'y', 'z'] * number_rows,
    'a2': [1, 2, 3] * number_rows,
    'a3': [3.14, 2.73, -23.00] * number_rows,
    'a4': [pd.np.nan, pd.np.nan, pd.np.nan] * number_rows
})
b = pd.DataFrame({
    'b1': ['x', 'y', 'z', 'k', 'l'] * number_rows,
    'b2': [2018, 2019, 2020, 2015, 2012] * number_rows
})

print('For: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
print('Merge: %s s' % str(timeit.timeit(f_merge, number=number_iter)))
print('Loc: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
They all worked :) and the run times were:
For: 277.9994369489998 s
Loc: 274.04929955067564 s
Merge: 2.195712725706926 s
So far Merge is the fastest.
If another option appears I will update here, thanks again.
IIUC
a.merge(b[b.b2!=2019],left_on='a1',right_on='b1',how='left').drop(['a4','b1'],1).rename(columns={'b2':'a4'})
Out[263]:
a1 a2 a3 a4
0 x 1 3.14 2018.0
1 y 2 2.73 NaN
2 z 3 -23.00 2020.0
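If every a1 value matches at most one row of b after the filter (an assumption; it does not hold for the duplicated benchmark frames above), mapping a b1-indexed Series is another loop-free sketch:
# build a lookup Series keyed by b1 and map it onto a1; unmatched keys stay NaN
lookup = b.loc[b['b2'] != 2019].set_index('b1')['b2']
a['a4'] = a['a1'].map(lookup)
print(a)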

Convert list of Pandas Dataframe JSON objects

I have a DataFrame with one column where each cell in the column is a JSON object.
players
0 {"name": "tony", "age": 57}
1 {"name": "peter", "age": 46}
I want to convert this to a data frame as:
name age
tony 57
peter 46
Any ideas how I do this?
Note: the original JSON object looks like this...
{
    "players": [{
        "age": 57,
        "name": "tony"
    },
    {
        "age": 46,
        "name": "peter"
    }]
}
Use the DataFrame constructor if the values are dicts:
#print (type(df.loc[0, 'players']))
#<class 'str'>
#import ast
#df['players'] = df['players'].apply(ast.literal_eval)
print (type(df.loc[0, 'players']))
<class 'dict'>
df = pd.DataFrame(df['players'].values.tolist())
print (df)
age name
0 57 tony
1 46 peter
But it is better to use json_normalize on the JSON object, as suggested by #jpp:
json = {
    "players": [{
        "age": 57,
        "name": "tony"
    },
    {
        "age": 46,
        "name": "peter"
    }]
}
df = json_normalize(json, 'players')
print (df)
age name
0 57 tony
1 46 peter
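Note that the import location of json_normalize depends on the pandas version (a side note, not part of the original answer):
try:
    from pandas import json_normalize          # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions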
This can do it:
df = df['players'].apply(pd.Series)
However, it's slow:
In [20]: timeit df.players.apply(pd.Series)
1000 loops, best of 3: 824 us per loop
#jezrael's suggestion is faster:
In [24]: timeit pd.DataFrame(df.players.values.tolist())
1000 loops, best of 3: 387 us per loop

How to convert "object" to numeric values including NaN

I have this data for many years, and I need to plot an error graph for different years. 1993 was selected with
fm93 = fmm[(fmm.Year == 1993)]
then the fm93 data frame is:
Year moB m1 std1 co1 min1 max1 m2S std2S co2S min2S max2S
1993 1 296.42 18.91 31 262.4 336 -- -- -- -- --
1993 2 280.76 24.59 28 239.4 329.3 -- -- -- -- --
1993 3 271.41 19.16 31 236.4 304.8 285.80 20.09 20 251.6 319.7
1993 4 287.98 22.52 30 245.9 341 296.75 21.77 27 261.1 345.7
1993 5 287.05 30.79 30 229.2 335.7 300.06 27.64 24 249.5 351.8
1993 6 288.65 11.29 4 275.9 301.9 263.70 73.40 7 156.5 361
1993 7 280.11 36.01 12 237 363 302.67 26.39 22 262.9 377.1
1993 8 296.51 34.55 31 234.8 372.9 305.85 39.95 28 234.1 417.9
1993 9 321.31 34.54 25 263.8 396 309.01 42.52 29 205.9 403.2
1993 10 315.80 8.63 2 309.7 321.9 288.65 35.86 31 230.9 345.4
1993 11 288.26 24.07 30 231.4 322.8 297.99 23.81 28 238 336.5
1993 12 296.87 18.31 31 257.6 331.5 303.02 20.02 29 265.7 340.7
When I try to plot moB, m1 with error std1, this error appears:
ValueError: err must be [ scalar | N, Nx1 or 2xN array-like ]
That is because the values have dtype "object":
array([[1993, 1, '296.42', '18.91', '31', '262.4', '336', '--', '--', '--',
'--', '--'],
[1993, 2, '280.76', '24.59', '28', '239.4', '329.3', '--', '--',
'--', '--', '--'],
[1993, 3, '271.41', '19.16', '31', '236.4', '304.8', '285.80',
'20.09', '20', '251.6', '319.7'],
[1993, 4, '287.98', '22.52', '30', '245.9', '341', '296.75',
'21.77', '27', '261.1', '345.7'],
[1993, 5, '287.05', '30.79', '30', '229.2', '335.7', '300.06',
'27.64', '24', '249.5', '351.8'],
[1993, 6, '288.65', '11.29', '4', '275.9', '301.9', '263.70',
'73.40', '7', '156.5', '361'],
[1993, 7, '280.11', '36.01', '12', '237', '363', '302.67', '26.39',
'22', '262.9', '377.1'],
[1993, 8, '296.51', '34.55', '31', '234.8', '372.9', '305.85',
'39.95', '28', '234.1', '417.9'],
[1993, 9, '321.31', '34.54', '25', '263.8', '396', '309.01',
'42.52', '29', '205.9', '403.2'],
[1993, 10, '315.80', '8.63', '2', '309.7', '321.9', '288.65',
'35.86', '31', '230.9', '345.4'],
[1993, 11, '288.26', '24.07', '30', '231.4', '322.8', '297.99',
'23.81', '28', '238', '336.5'],
[1993, 12, '296.87', '18.31', '31', '257.6', '331.5', '303.02',
'20.02', '29', '265.7', '340.7']], dtype=object)
I tried to convert this data with
fm93_1 = fm93.astype('float64', raise_on_error=False)
but the problem remains. How can I convert the NaN values ('--'), or ignore them, so that I can plot my results?
Thanks in advance
First you should try to plot a sample of the data after the '--' values to see if a plot can be generated. You can try:
# this should plot from the fourth row onwards, omitting the leading rows that contain '--'
df.iloc[3:].plot()
Assuming '--' is the problem, you can replace it with np.nan. NaN values are not plotted.
df.replace('--', np.nan, inplace=True)
df.plot()
Another way is to select only the rows of df that do not contain '--':
mask = (df == '--').any(axis=1)  # True for rows that contain '--'
valid_indexes = df[~mask].index  # index of rows without '--'
df.loc[valid_indexes].plot()
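Another route (a sketch, assuming the frame is called fm93 as in the question): pd.to_numeric with errors='coerce' converts the object columns to floats and turns any '--' into NaN in one step, which is usually enough for an error-bar plot:
import matplotlib.pyplot as plt
import pandas as pd

fm93 = fm93.copy()  # avoid SettingWithCopyWarning, since fm93 is a slice of fmm
# coerce the columns needed for the plot; any '--' becomes NaN
cols = ['moB', 'm1', 'std1']
fm93[cols] = fm93[cols].apply(pd.to_numeric, errors='coerce')
# error-bar plot of m1 against moB, using std1 as the error column
fm93.plot(x='moB', y='m1', yerr='std1')
plt.show()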
