Merging lines that have duplicates and summing the last column - python-3.x

I have this input text file
1;2;29.02.2017;10.00-11.00;5;
1;2;29.02.2017;10.00-11.00;3;
1;3;02.02.2017;09.00-10.00;4;
1;3;03.02.2017;12.00-13.00;2;
1;3;28.02.2017;08.00-09.00;6;
1;3;29.02.2017;10.00-11.00;3;
1;3;29.02.2017;10.00-11.00;2;
1;3;29.02.2017;11.00-12.00;2;
1;3;29.02.2017;12.00-13.00;3;
10;11;28.02.2017;08.00-09.00;6;
10;11;28.02.2017;08.00-09.00;1;
10;12;02.02.2017;09.00-10.00;8;
10;12;28.02.2017;08.00-09.00;2;
10;12;28.02.2017;08.00-09.00;1;
values separated by ';' are as follows:
1- id_1(str), 2- id_2(str), 3- date(str), 4- time(str), 5- area(int)
As output, I need a text file containing only the input lines whose first four fields (id_1, id_2, date, time) are duplicated, merged into one line with the area values summed; lines without duplicates should be dropped, e.g.
1;2;29.02.2017;10.00-11.00;8; (the sum of 5 and 3 from the first two input lines)
1;3;29.02.2017;10.00-11.00;5;
10;11;28.02.2017;08.00-09.00;7;
10;12;28.02.2017;08.00-09.00;3;
What I have achieved so far is dropping the lines without duplicates, but to do that I had to remove the area column.
I used this:
seen = set()
for line1 in imp:
    line1_lower = line1.lower()
    if line1_lower in seen:
        print(line1)
    else:
        seen.add(line1_lower)
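For reference, a plain-Python sketch that keeps the area column and sums it per key (collections.Counter and the parsing details are assumptions on my part, not tested code; imp is the open input file as above):

from collections import Counter

sums = Counter()
counts = Counter()
for line in imp:
    parts = line.strip().strip(';').split(';')
    key = tuple(parts[:4])       # id_1, id_2, date, time
    sums[key] += int(parts[4])   # accumulate area
    counts[key] += 1

for key, total in sums.items():
    if counts[key] > 1:          # keep only keys that occurred more than once
        print(';'.join(key) + ';' + str(total) + ';')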

You can use read_csv first, with the names parameter to create column names (the CSV has no header):
import pandas as pd
from io import StringIO
temp=u"""1;2;29.02.2017;10.00-11.00;5;
1;2;29.02.2017;10.00-11.00;3;
1;3;02.02.2017;09.00-10.00;4;
1;3;03.02.2017;12.00-13.00;2;
1;3;28.02.2017;08.00-09.00;6;
1;3;29.02.2017;10.00-11.00;3;
1;3;29.02.2017;10.00-11.00;2;
1;3;29.02.2017;11.00-12.00;2;
1;3;29.02.2017;12.00-13.00;3;
10;11;28.02.2017;08.00-09.00;6;
10;11;28.02.2017;08.00-09.00;1;
10;12;02.02.2017;09.00-10.00;8;
10;12;28.02.2017;08.00-09.00;2;
10;12;28.02.2017;08.00-09.00;1;"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", names=['id_1','id_2','date','time','area','tmp'])
print (df)
id_1 id_2 date time area tmp
0 1 2 29.02.2017 10.00-11.00 5 NaN
1 1 2 29.02.2017 10.00-11.00 3 NaN
2 1 3 02.02.2017 09.00-10.00 4 NaN
3 1 3 03.02.2017 12.00-13.00 2 NaN
4 1 3 28.02.2017 08.00-09.00 6 NaN
5 1 3 29.02.2017 10.00-11.00 3 NaN
6 1 3 29.02.2017 10.00-11.00 2 NaN
7 1 3 29.02.2017 11.00-12.00 2 NaN
8 1 3 29.02.2017 12.00-13.00 3 NaN
9 10 11 28.02.2017 08.00-09.00 6 NaN
10 10 11 28.02.2017 08.00-09.00 1 NaN
11 10 12 02.02.2017 09.00-10.00 8 NaN
12 10 12 28.02.2017 08.00-09.00 2 NaN
13 10 12 28.02.2017 08.00-09.00 1 NaN
Then groupby and aggregate size and sum; last, use boolean indexing to drop the non-duplicates, keeping rows where size is greater than 1:
cols = ['id_1','id_2','date','time']
df = df.groupby(cols)['area'].agg(['size', 'sum'])
df = df[df['size'] > 1].drop('size',axis=1).reset_index()
print (df)
id_1 id_2 date time sum
0 1 2 29.02.2017 10.00-11.00 8
1 1 3 29.02.2017 10.00-11.00 5
2 10 11 28.02.2017 08.00-09.00 7
3 10 12 28.02.2017 08.00-09.00 3
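In newer pandas the same aggregation can be written with named aggregation; a sketch equivalent to the two lines above, starting again from the original df:
df = (df.groupby(cols)
        .agg(size=('area', 'size'), area=('area', 'sum'))
        .query('size > 1')
        .drop(columns='size')
        .reset_index())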
Another solution is to keep only the duplicated rows first, by boolean indexing with duplicated, and then aggregate sum:
cols = ['id_1','id_2','date','time']
mask = df.duplicated(cols, keep=False)
df = df[mask].groupby(cols, as_index=False)['area'].sum()
print (df)
id_1 id_2 date time area
0 1 2 29.02.2017 10.00-11.00 8
1 1 3 29.02.2017 10.00-11.00 5
2 10 11 28.02.2017 08.00-09.00 7
3 10 12 28.02.2017 08.00-09.00 3
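To write the result back to a text file (the question asks for one), to_csv should work; re-creating the empty trailing field restores the ';' line endings (the output filename is an assumption):
df['tmp'] = ''   # empty last column so each line ends with ';'
df.to_csv('output.txt', sep=';', index=False, header=False)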

Related

How to add value to specific index that is out of bounds

I have a list of lists
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
According to the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd
# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(list(zip(*[a] * 5)))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
For adding empty columns:
# add new columns, not empty but filled w/ NaN
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7, row index 4
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0
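As an aside, reindex can add several empty columns in one call, instead of the three assignments above (a sketch, applied to the original 5-column frame):
# columns 5, 6, 7 appear filled with NaN
df = df.reindex(columns=range(8))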

Replace and remove duplicates string elements from one column in Python

Given a small dataset as follows:
id room area room_vector
0 1 A-102 world 01 , 02, 03, 04
1 2 NaN 24 A; B; C
2 3 B309 NaN s01, s02 , s02
3 4 C·102 25 E2702-2703,E2702-2703
4 5 E_1089 hello 03,05,06
5 6 27 NaN 05-08,09,10-12, 05-08
6 7 27 NaN NaN
I need to manipulate the room_vector column with the following logic:
(1) remove white spaces and replace ; with ,;
(2) remove duplicates, keeping only the first occurrence, separated by ,.
For the first one, I've tried:
df['room_vector'] = df['room_vector'].str.replace([' ', ';'], '')
Out:
TypeError: unhashable type: 'list'
How could I get the expected result as follows:
id room area room_vector
0 1 A-102 world 01,02,03,04
1 2 NaN 24 A,B,C
2 3 B309 NaN s01,s02
3 4 C·102 25 E2702-2703
4 5 E_1089 hello 03,05,06
5 6 27 NaN 05-08,09,10-12
6 7 27 NaN NaN
Many thanks.
The idea is to remove whitespace, then split on , or ; with Series.str.split, and then remove duplicates while preserving the original order by creating a dictionary from the list elements and joining its keys. This is applied only to lists; any other value (e.g. NaN) is returned unchanged:
f = lambda x: ','.join(dict.fromkeys(x).keys()) if isinstance(x, list) else x
df['room_vector'] = df['room_vector'].str.replace(' ', '').str.split('[,;]').apply(f)
print(df)
id room area room_vector
0 1 A-102 world 01,02,03,04
1 2 NaN 24 A,B,C
2 3 B309 NaN s01,s02
3 4 C·102 25 E2702-2703
4 5 E_1089 hello 03,05,06
5 6 27 NaN 05-08,09,10-12
6 7 27 NaN NaN
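An equivalent variant, applied to the original column, does the cleanup per value with the re module instead of two .str passes (a sketch; NaN passes through via the isinstance check):
import re

f = lambda x: ','.join(dict.fromkeys(re.split('[,;]', x.replace(' ', '')))) if isinstance(x, str) else x
df['room_vector'] = df['room_vector'].apply(f)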

Concatenate 2 dataframes. I would like to combine duplicate columns

The following code can be used as an example of the problem I'm having:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df3=pd.concat([df1,df2], axis=1)
print(df3)
The result I get from this concatenation is:
B B
1 10 NaN
2 11 NaN
3 12 NaN
4 NaN 10
5 NaN 11
6 NaN 12
I would like to have:
B
1 10
2 11
3 12
4 10
5 11
6 12
I know that I can concatenate along axis=0. Unfortunately, that only solves the problem for this little example. The actual code I'm working with is more complex. Concatenating along axis=0 causes the index to be duplicated. I don't want that either.
EDIT:
People have asked me to give a more complex example to describe why simply removing 'axis=1' doesn't work. Here is a more complex example, first with axis=1 INCLUDED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2], axis=1)
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3], axis=1)
print(df4)
This gives me:
B B C
1 10 NaN 20
2 11 NaN 21
3 12 NaN 22
4 NaN 10 NaN
5 NaN 11 NaN
6 NaN 12 NaN
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Now here is an example with axis=1 REMOVED:
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3])
print(df4)
This gives me:
B C
A
1 10 NaN
2 11 NaN
3 12 NaN
4 10 NaN
5 11 NaN
6 12 NaN
1 NaN 20
2 NaN 21
3 NaN 22
I would like to have:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
Sorry it wasn't very clear. I hope this helps.
Here is a two step process, for the example provided after the 'EDIT' point. Start by creating the dictionaries:
import pandas as pd
dic = {'A':['1','2','3'], 'B':['10','11','12']}
dic2 = {'A':['4','5','6'], 'B':['10','11','12']}
dic3 = {'A':['1','2','3'], 'C':['20','21','22']}
Step 1: convert each dictionary to a data frame, with index 'A', and concatenate (along axis=0):
t = pd.concat([pd.DataFrame(dic).set_index('A'),
               pd.DataFrame(dic2).set_index('A'),
               pd.DataFrame(dic3).set_index('A')])
Step 2: concatenate non-null elements of col 'B' with non-null elements of col 'C' (for more than two columns you could put this in a list comprehension; see the sketch after the output below). Now we concatenate along axis=1:
result = pd.concat([
    t.loc[t['B'].notna(), 'B'],
    t.loc[t['C'].notna(), 'C'],
], axis=1)
print(result)
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
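As mentioned in step 2, this generalizes to any number of columns with a list comprehension (a sketch, reusing t from step 1):
result = pd.concat(
    [t.loc[t[c].notna(), c] for c in t.columns], axis=1
)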
Edited:
If two objects are concatenated along axis=1, their columns are placed side by side, so new columns are appended; with axis=0 (the default), rows are stacked, so the same column is extended with new values.
Refer to the solution below:
import pandas as pd
dic={'A':['1','2','3'], 'B':['10','11','12']}
df1=pd.DataFrame(dic)
df1.set_index('A', inplace=True)
dic2={'A':['4','5','6'], 'B':['10','11','12']}
df2=pd.DataFrame(dic2)
df2.set_index('A', inplace=True)
df=pd.concat([df1,df2])
dic3={'A':['1','2','3'], 'C':['20','21','22']}
df3=pd.DataFrame(dic3)
df3.set_index('A', inplace=True)
df4=pd.concat([df,df3],axis=1) #C is a new column here, so axis=1 is needed
print(df4)
Output:
B C
1 10 20
2 11 21
3 12 22
4 10 NaN
5 11 NaN
6 12 NaN
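A shorter alternative, if it suits the real data, is to stack all three frames along axis=0 and collapse the duplicated index afterwards; GroupBy.first keeps the first non-null value per column (a sketch):
df4 = pd.concat([df1, df2, df3]).groupby(level='A').first()
print(df4)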

identify common column values across two different sized dataframes in pandas

I have two dataframes of different row and column sizes. I want to compare the two and create new columns in df2 based on whether values exist in df1. First for an example (I think you can copy/paste this text into a .csv to import), df1 looks like this:
subject block target dist1 dist2 dist3
7 1 doorlock candleholder01 jar03 stroller
7 2 glassescase clownfish kangaroo ram
7 3 badger chocolatefonduedish hosenozzle toycar04
7 4 hyena crocodile pig toad
7 1 scooter cormorant lizard rockbass
df2 like this:
subject image
7 acorn
7 chainsaw
7 doorlock
7 stroller
7 bathtub
7 clownfish
7 bagtie
7 birdie
7 witchhat
7 crocodile
7 honeybee
7 electricitymeter
7 flowerwreath
7 jar03
7 camera02a
and what I'd like to achieve is this:
subject image present type block
7 acorn 0 NA NA
7 chainsaw 0 NA NA
7 doorlock 1 target 1
7 stroller 1 dist3 1
7 bathtub 0 NA NA
7 clownfish 1 dist1 2
7 bagtie 0 NA NA
7 birdie 0 NA NA
7 witchhat 0 NA NA
7 crocodile 1 dist1 4
7 honeybee 0 NA NA
7 electricitymeter 0 NA NA
7 flowerwreath 0 NA NA
7 jar03 1 dist2 1
7 camera02a 0 NA NA
Specifically, I would like to identify, from the 4 columns in df1 ('target', 'dist1', 'dist2', 'dist3'), which values exist in the 'image' column of df2, and then (1) generate a column (boolean or 0/1) in df2 indicating whether that value exists in df1, (2) generate a second column in df2 with the name of the column in which that item exists in df1 (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 with the df1 'block' value from which that item came from, if any.
I hope this is clear. I'd also like some ideas on how to handle the cases that don't match - should I code these as NAN or just empty strings? The thing is I will probably be groupby()'ing later, and I had some problems with groupby() when the df contained missing values..
You can do this by using melt on df1 followed by a merge:
df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')
subject image present type block
0 7 acorn 0 NaN NaN
1 7 chainsaw 0 NaN NaN
2 7 doorlock 1 target 1.0
3 7 stroller 1 dist3 1.0
4 7 bathtub 0 NaN NaN
5 7 clownfish 1 dist1 2.0
6 7 bagtie 0 NaN NaN
7 7 birdie 0 NaN NaN
8 7 witchhat 0 NaN NaN
9 7 crocodile 1 dist1 4.0
10 7 honeybee 0 NaN NaN
11 7 electricitymeter 0 NaN NaN
12 7 flowerwreath 0 NaN NaN
13 7 jar03 1 dist2 1.0
14 7 camera02a 0 NaN NaN
As for the missing values, I would keep them as NaN. pandas is pretty powerful in terms of working with missing data, so may as well take advantage of this.
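One more note on the float display: if 1.0 instead of 1 in block bothers you, pandas' nullable integer dtype keeps integers alongside missing values (a sketch):
out = pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')
out['block'] = out['block'].astype('Int64')   # nullable integer, shows <NA> for misses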

pandas moving aggregate string

import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), delimiter='\t')
I want to create a column show the cumulative state of column state, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC369
2 6 9 CC3699
2 7 9 CC36999
2 8 C CC36999C
Basically the cum concatenation of string columns. What is the best way to do it?
So long as the dtype is str, you can do the following:
In [17]:
df['result'] = df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we group by the 'id' column and then apply a lambda that returns the cumsum. Because the values are strings, this performs a cumulative concatenation and returns a Series with its index aligned to the original df, so you can add it as a column.
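A variant with the same result builds the running concatenation explicitly with itertools.accumulate inside transform (a sketch):
from itertools import accumulate

df['result'] = df.groupby('id')['state'].transform(
    lambda s: list(accumulate(s))   # 'C', 'C3', 'C36', ... within each group
)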
