Using groupby and filters on a dataframe - python-3.x

I have a dataframe with both string and integer values.
Attaching a sample data dictionary to understand the dataframe that I have:
data = {
'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
I need to extract data as follows:
The max value of col4
Grouped by col1
With rows where col3 is Y filtered out
With rows where col5 is more than 5 filtered out
So I tried a few things and ran into the following problems.
1- I used the following method to find the max value in the data, but I am not able to find the max value within each group.
print(dataframe['col4'].max()) #this worked to get one max value
print(dataframe.groupby('col1').max()) #this doesn't give what I need
The second one doesn't work for me because it returns the maximum of col2 as well. I need the result to keep the col2 value from the row that holds the max of col4 in each group.
2- I am not able to apply a filter on both col3 (str) and col5 (int) in one command. Is there any way to do that?
print(dataframe[dataframe['col3'] != 'Y' & dataframe['col5'] < 6]) #generates an error
The output that I am expecting is:
col1 col2 col3 col4 col5
0 A 10 X 45 3
3 B 10 X 56 4
6 C 10 X 87 4
10 D 20 X 43 4
#
# 78 is max in group A, but ignored as col5 is 6 (we need < 6)
# Similarly, 89 is max in group D, but ignored as col3 is Y.
I apologize if I am doing something wrong. I am quite new to this.
Thank you.

I'm not a Python developer, but in my opinion you are approaching this the wrong way.
You should have a list of structures instead of a structure of lists.
Then you can start working on such a list.
This is an example solution, so it could probably be done in a much smoother way:
data = {
'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}
newData = []
for i in range(len(data['col1'])):
    newData.append({'col1': data['col1'][i], 'col2': data['col2'][i], 'col3': data['col3'][i], 'col4': data['col4'][i], 'col5': data['col5'][i]})
withoutY = list(filter(lambda d: d['col3'] != 'Y', newData))
atMost5 = list(filter(lambda d: d['col5'] <= 5, withoutY))  # "not more than 5" means <= 5
values = set(map(lambda d: d['col1'], atMost5))
grouped = [[d1 for d1 in atMost5 if d1['col1'] == d2] for d2 in values]
result = []
for group in grouped:
    result.append(max(group, key=lambda g: g['col4']))
sortedResult = sorted(result, key=lambda r: r['col1'])
print(sortedResult)
result:
[
{'col1': 'A', 'col2': 10, 'col3': 'X', 'col4': 45, 'col5': 3},
{'col1': 'B', 'col2': 10, 'col3': 'X', 'col4': 56, 'col5': 4},
{'col1': 'C', 'col2': 10, 'col3': 'X', 'col4': 87, 'col5': 4},
{'col1': 'D', 'col2': 20, 'col3': 'X', 'col4': 43, 'col5': 4}
]

Ok, I didn't actually notice.
So I tried something like this:
# fd is the filtered data
fd = data.query('col3 != "Y"').query('col5 < 6')
# or fd = data[data.col3 != 'Y'][data.col5 < 6]
# m is the max of col4 grouped by col1
m = fd.groupby('col1')['col4'].max()
This will group by col1 and get the max of col4, but the result only has two columns (col1 and col4).
I am not sure exactly what you want to achieve.
If you want to keep the whole rows, here is the code:
result=fd[lambda x: x.col4 == m.get(x.col1).values]
Be careful, because you will not always get exactly one row per "col1" value.
For example, for the data
data = pd.DataFrame({
'col1': ['A','A','A','A','B','B','B','B','C','C','C','D','D','D'],
'col2': [20,10,20,30,10,20,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,45,23,78,45,56,12,34,87,54,43,89,43,12],
'col5': [1,3,4,6,1,4,3,2,4,3,5,3,4,6]})
Result will be:
col1 col2 col3 col4 col5
0 A 20 X 45 1
1 A 10 X 45 3
5 B 20 X 56 4
8 C 10 X 87 4
12 D 20 X 43 4
Additionally, if you want a normal index instead of labels like ..., 8, 12, you could use "where" instead of "query".
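If you prefer a single compact pipeline, here is just a sketch using transform (assuming the frame is called data, as in the question); it applies both filters in one command and keeps the full rows whose col4 equals the per-group maximum:
import pandas as pd

data = pd.DataFrame({
    'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
    'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
    'col5': [3,4,6,4,3,2,4,3,5,3,4,6]})
# both filters in one command: parentheses around each comparison are required
fd = data[(data['col3'] != 'Y') & (data['col5'] < 6)]
# keep the full row(s) where col4 equals the per-group maximum of col4
result = fd[fd['col4'] == fd.groupby('col1')['col4'].transform('max')]
print(result)
For the sample data this should print the four expected rows; as noted above, groups with tied maxima will keep more than one row.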

Related

pd dataframe from lists and dictionary using series

I have a few lists and a dictionary and would like to create a pd dataframe.
Could someone help me out? I seem to be missing something.
One simple example below:
dict={"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
Using Series I would do it like this:
df = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
and the lists end up in the df as expected.
For the dict I would do:
df = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
And I would expect this result:
col1  col2  col3  col4
1     x     a     1
2     y     b     3
3           c     text1
4
The problem is that, done like this, the first df gets overwritten by the second call to pd.DataFrame.
How would I do this to have only one df with 4 columns?
I know one way would be to split the dict into 2 separate lists and just use Series over 4 lists, but I would think there is a better way to go directly from 2 lists and 1 dict, as above, to one df with 4 columns.
Thanks for the help.
You can also use pd.concat to concatenate the two dataframes:
df1 = pd.DataFrame({'col1': pd.Series(l1), 'col2': pd.Series(l3)})
df2 = pd.DataFrame(list(dict.items()), columns=['col3', 'col4'])
df = pd.concat([df1, df2], axis=1)
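For the sample data above, this should give a single frame with all four columns, roughly like this (NaN fills in where the shorter inputs run out):
print(df)
   col1 col2 col3   col4
0     1    x    a      1
1     2    y    b      3
2     3  NaN    c  text1
3     4  NaN  NaN    NaN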
Why not build each column separately via dict.keys() and dict.values() instead of using dict.items()?
df = pd.DataFrame({
'col1': pd.Series(l1),
'col2': pd.Series(l3),
'col3': pd.Series(dict.keys()),
'col4': pd.Series(dict.values())
})
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
Alternatively:
column_values = [l1, l3, dict.keys(), dict.values()]
data = {f"col{i}": pd.Series(values) for i, values in enumerate(column_values)}
df = pd.DataFrame(data)
print(df)
col0 col1 col2 col3
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN
You can unpack the zipped values of the list generated from d.items() and pass them to itertools.zip_longest, which adds the missing values so everything matches the length of the longest list:
# dict is a Python built-in name, so use d for the variable
import numpy as np
import pandas as pd
from itertools import zip_longest

d = {"a": 1, "b": 3, "c": "text1"}
l1 = [1, 2, 3, 4]
l3 = ["x", "y"]
df = pd.DataFrame(zip_longest(l1, l3, *zip(*d.items()), fillvalue=np.nan),
                  columns=['col1', 'col2', 'col3', 'col4'])
print(df)
col1 col2 col3 col4
0 1 x a 1
1 2 y b 3
2 3 NaN c text1
3 4 NaN NaN NaN

Can I apply vectorization here? Or should I think about this differently?

To put it simply, I have rows of activity that happen in a given month of the year. I want to append additional rows of inactivity in between this activity, while resetting the month values into a sequence. For example, if I have months 2, 5, 7, I need to map these to 1, 4, 7, while my inactive months happen in 2, 3, 5, and 6. So, I would have to add four rows for this inactivity.

I've done this with dictionaries and for-loops, but I know this is not efficient, especially once I move to thousands of rows of data. Any suggestions on how to optimize this? Do I need to think about the data format differently? I've had a suggestion to make lists and then move them to the dataframe at the end, but I don't see a huge gain there. I don't know enough NumPy to figure out how to do this with vectorization, since that's super fast and it would be awesome to learn something new.

Below is my code with the steps I took:
df = pd.DataFrame({'col1': ['A','A', 'B','B','B','C','C'], 'col2': ['X','Y','X','Y','Z','Y','Y'], 'col3': [1, 8, 2, 5, 7, 6, 7]})
Output:
col1 col2 col3
0 A X 1
1 A Y 8
2 B X 2
3 B Y 5
4 B Z 7
5 C Y 6
6 C Y 7
I'm creating dictionaries to handle this in for-loops:
df1 = df.groupby('col1')['col3'].apply(list).to_dict()
df2 = df.groupby('col1')['col2'].apply(list).to_dict()
max_num = max(df.col3)
Output:
{'A': [1, 8], 'B': [2, 5, 7], 'C': [6, 7]}
{'A': ['X', 'Y'], 'B': ['X', 'Y', 'Z'], 'C': ['Y', 'Y']}
8
And now I'm adding those rows using my dictionaries by creating a new data frame:
df_new = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})
for key in df1.keys():
    k = 1
    if list(df1[key])[-1] - list(df1[key])[0] + 1 < max_num:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)], 'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)}, ignore_index=True)
            k += 1
        df_new = df_new.append({'col1': key, 'col2': 'E', 'col3': str(k)}, ignore_index=True)
    else:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)], 'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)}, ignore_index=True)
            k += 1
Output:
col1 col2 col3
0 A X 1
1 A N 2
2 A N 3
3 A N 4
4 A N 5
5 A N 6
6 A N 7
7 A Y 8
8 B X 1
9 B N 2
10 B N 3
11 B Y 4
12 B N 5
13 B Z 6
14 B E 7
15 C Y 1
16 C Y 2
17 C E 3
And then I pivot it into the form I want:
df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')
Output:
col3 1 2 3 4 5 6 7 8
col1
A X N N N N N N Y
B X N N Y N Z E NaN
C Y Y E NaN NaN NaN NaN NaN
Thanks for the help.
We can replace the steps of creating and using dictionaries by the statement below, which utilizes reindex to place the additional values N and E without explicit loops.
df_new = df.set_index('col3')\
           .groupby('col1')\
           .apply(lambda dg: dg.drop('col1', 1)
                               .reindex(range(dg.index.min(), dg.index.max()+1), fill_value='N')
                               .reindex(range(dg.index.min(), min(max_num, dg.index.max()+1)+1), fill_value='E')
                               .set_index(pd.RangeIndex(1, min(max_num, dg.index.max()-dg.index.min()+1+1)+1, name='col3'))
           )\
           .reset_index()
After this, you can apply your pivot statement as it is.
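As a quick check, reusing the pivot statement from the question on this df_new should reproduce the wide table shown above (only with integer column labels instead of strings, since col3 now comes from a RangeIndex):
df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')
print(df_pivot)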

How to check if a Dataframe contains a list or dictionary

I have a dataframe :
col1 col2 col3 col4
A 11 [{'id':2}] {"price": 0.0}
B 21 [{'id':3}] {"price": 2.0}
C 31 [{'id':4}] {"price": 3.0}
I want to find out which columns are of datatype 'list' or 'dictionary', and ideally store the result in another list.
How do I go about it?
When I use this:
data.applymap(type).apply(pd.value_counts)
the output is:
col1 col2 col3 col4
0 a 11 [{'id':2}] {"price": 0.0}
1 b 21 [{'id':3}] {"price": 2.0}
2 c 31 [{'id':4}] {"price": 3.0}
IIUC,
we can use apply and literal_eval from the ast standard library to build up a dictionary.
For performance reasons, let's work with only the first row of the dataframe, as apply is computationally quite heavy.
from ast import literal_eval

data_dict = {}
for col in df.columns:
    try:
        col_type = df[col].iloc[:1].apply(literal_eval).apply(type)[0]
        data_dict[col] = col_type
    except (ValueError, SyntaxError):
        data_dict[col] = 'unable to evaluate'
print(data_dict)
{'col1': 'unable to evaluate',
'col2': 'unable to evaluate',
'col3': list,
'col4': dict}
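If you then want the qualifying column names in a list, as asked, a small follow-up on the data_dict built above could be:
# columns whose first value parsed as a list or a dict
list_or_dict_cols = [col for col, t in data_dict.items() if t in (list, dict)]
print(list_or_dict_cols)
# ['col3', 'col4']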

How to select rows and columns that meet criteria from a list

Let's say I've got a pandas dataframe that looks like:
df1 = pd.DataFrame({"Item ID":["A", "B", "C", "D", "E"], "Value1":[1, 2, 3, 4, 0],
"Value2":[4, 5, 1, 8, 7], "Value3":[3, 8, 1, 2, 0],"Value4":[4, 5, 7, 9, 4]})
print(df1)
Item_ID Value1 Value2 Value3 Value4
0 A 1 4 3 4
1 B 2 5 8 5
2 C 3 1 1 7
3 D 4 8 2 9
4 E 0 7 0 4
Now I've got a second dataframe that looks like:
df2 = {"Item ID":["A", "C", "D"], "Value5":[4, 5, 7]}
print(df2)
Item_ID Value5
0 A 4
1 C 5
2 D 7
What I want to do is find where the Item IDs match between my two data frames, and then add the "Value5" column values to the intersection of those rows AND ONLY columns Value1 and Value2 from df1 (these columns could change every iteration, so they need to be contained in a variable).
My output should show:
4 added to Row A, columns "Value1" and "Value2"
5 added to Row C, columns "Value1" and "Value2"
7 added to Row D, columns "Value1" and "Value2"
Item_ID Value1 Value2 Value3 Value4
0 A 5 8 3 4
1 B 2 5 8 5
2 C 8 6 1 7
3 D 11 15 2 9
4 E 0 7 0 4
Of course my data is many thousands of rows long. I can do it using a for loop, but that takes way too long. I want to be able to vectorize this in some way. Any ideas?
This is what I ended up doing based on #sammywemmy's suggestions
# Take the column names to update and put them into a list
# (this should contain only the value columns being changed, e.g. ['Value1', 'Value2'])
names = df1.columns.tolist()
# Merge df1 and df2 based on 'Item_ID'
merged = df1.merge(df2, on='Item_ID', how='outer')
for i in range(len(names)):
    # using assign and **, we can bring in variable column names with assign,
    # then add our Value5 column
    merged = merged.assign(**{names[i]: lambda x: x[names[i]] + x.Value5})
# Only keep the columns up to and including 'Value4'
df1 = merged.loc[:, :'Value4']
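A fully vectorized sketch is also possible without the merge/assign loop. This is just an alternative under a few assumptions: it starts again from the original df1 and df2 from the question, uses 'Item ID' (as in the constructors above) as the key column, and keeps the target columns in list_of_cols:
list_of_cols = ['Value1', 'Value2']
# look up Value5 for each row of df1; rows with no match in df2 contribute 0
add = df1['Item ID'].map(df2.set_index('Item ID')['Value5']).fillna(0)
df1[list_of_cols] = df1[list_of_cols].add(add, axis=0)
print(df1)
Because map aligns on the 'Item ID' values, duplicate and missing IDs are handled without an explicit loop.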
Try this:
#set 'Item ID' as the index
df1 = df1.set_index('Item ID')
df2 = df2.set_index('Item ID')
#create list of columns that you are interested in
list_of_cols = ['Value1','Value2']
#create two separate dataframes
#unselected will not contain the columns you want to add
unselected = df1.drop(list_of_cols,axis=1)
#this will contain the columns you wish to add
selected = df1.filter(list_of_cols)
#reindex df2 so it has the same indices as df1
#then convert to a series
#fill the null values with 0
A = df2.reindex(index=selected.index,fill_value=0).loc[:,'Value5']
#add the series A to selected
selected = selected.add(A,axis='index')
#combine selected and unselected into one dataframe
result = pd.concat([unselected,selected],axis=1)
#this part is extra, to get your dataframe back to the way it was
#the assumption here is that the columns are Value1, Value2, and so on,
#so they sort by their trailing number
#if your columns are not actually Value1, Value2, etc.,
#then a different sorting has to be used
#alternatively, before the calculations,
#you could create a mapping of the columns to numbers;
#that gives you a sorting mechanism and lets you
#restore your dataframe after the calculations are complete
columns = sorted(result.columns,key = lambda x : x[-1])
#reindex back to the way it was
result = result.reindex(columns,axis='columns')
print(result)
Value1 Value2 Value3 Value4
Item ID
A 5 8 3 4
B 2 5 8 5
C 8 6 1 7
D 11 15 2 9
E 0 7 0 4
Alternative solution, using python's built-in dictionaries:
#create dictionaries
dict1 = (df1
         #create a temporary column
         #and set it as the index
         .assign(temp=df1['Item ID'])
         .set_index('temp')
         .to_dict('index')
         )
dict2 = (df2
         .assign(temp=df2['Item ID'])
         .set_index('temp')
         .to_dict('index')
         )
list_of_cols = ['Value1','Value2']
intersected_keys = dict1.keys() & dict2.keys()
key_value_pair = [(key,col) for key in intersected_keys
for col in list_of_cols ]
#check for keys that are in both dict1 and 2
#loop through dict 1 and add values from dict2
#can be optimized with a dict comprehension
#leaving as is for better clarity IMHO
for key, val in key_value_pair:
    dict1[key][val] = dict1[key][val] + dict2[key]['Value5']
#print(dict1)
{'A': {'Item ID': 'A', 'Value1': 5, 'Value2': 8, 'Value3': 3, 'Value4': 4},
'B': {'Item ID': 'B', 'Value1': 2, 'Value2': 5, 'Value3': 8, 'Value4': 5},
'C': {'Item ID': 'C', 'Value1': 8, 'Value2': 6, 'Value3': 1, 'Value4': 7},
'D': {'Item ID': 'D', 'Value1': 11, 'Value2': 15, 'Value3': 2, 'Value4': 9},
'E': {'Item ID': 'E', 'Value1': 0, 'Value2': 7, 'Value3': 0, 'Value4': 4}}
#create dataframe
pd.DataFrame.from_dict(dict1,orient='index').reset_index(drop=True)
Item ID Value1 Value2 Value3 Value4
0 A 5 8 3 4
1 B 2 5 8 5
2 C 8 6 1 7
3 D 11 15 2 9
4 E 0 7 0 4

fill a new column in a pandas dataframe from the value of another dataframe [duplicate]

This question already has an answer here:
Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe
(1 answer)
Closed 4 years ago.
I have two dataframes :
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1]})
col1 col2 col3
0 a c 1
1 b c 2
2 a d 3
3 a d 4
4 b c 5
5 h i 1
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'f'], 'col2': ['c', 'c', 'd', 'k'], 'col3': [12, 23, 45, 78]})
col1 col2 col3
0 a c 12
1 b c 23
2 a d 45
3 f k 78
and I'd like to build a new column in the first one according to the values of col1 and col2 looked up in the second one. That is, this new one:
pd.DataFrame(data={'col1': ['a', 'b', 'a', 'a', 'b', 'h'], 'col2': ['c', 'c', 'd', 'd', 'c', 'i'], 'col3': [1, 2, 3, 4, 5, 1], 'col4': [12, 23, 45, 45, 23, None]})
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
5 h i 1 NaN
How am I able to do that?
Thanks for your attention :)
Edit: it has been advised to look for the answer in this subject, Adding A Specific Column from a Pandas Dataframe to Another Pandas Dataframe, but it is not the same question.
Here, not only does the ID not exist as such, since it is split across col1 and col2, but above all, although the (col1, col2) pair is unique in the second dataframe, it is not unique in the first one. This is why I think that neither a merge nor a join can be the answer to this.
Edit2: In addition, (col1, col2) pairs of df1 may not be present in df2, in which case NaN is expected in col4, and (col1, col2) pairs of df2 may not be needed in df1. To illustrate these cases, I added some rows in both df1 and df2 to show how it could be in the worst-case scenario.
You could also use map, like:
In [130]: cols = ['col1', 'col2']
In [131]: df1['col4'] = df1.set_index(cols).index.map(df2.set_index(cols)['col3'])
In [132]: df1
Out[132]:
col1 col2 col3 col4
0 a c 1 12
1 b c 2 23
2 a d 3 45
3 a d 4 45
4 b c 5 23
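For completeness, a plain left merge also appears to work here despite the duplicated (col1, col2) pairs in df1, because those pairs are unique in df2 (a sketch based on the two frames above):
out = df1.merge(df2.rename(columns={'col3': 'col4'}), on=['col1', 'col2'], how='left')
print(out)
Rows of df1 with no match (like 'h', 'i') get NaN in col4, and unused pairs from df2 (like 'f', 'k') are simply dropped.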
