Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame - python-3.x

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
    Fruit  price
0   apple      1
1  orange      2
2   apple      1
3  banana      3
4  orange      2
5   apple      1
6  banana      3
What I want to do is check if Fruit == banana and if yes, I want the code to scan the preceding as well as the next n rows from the index position of the 'banana' row, for an instance where Fruit == apple. An example of the expected output is shown below taking n=2.
   Fruit  price
2  apple      1
5  apple      1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.

IIUC, you can use two boolean masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
# mark the banana rows
m1 = df['Fruit'].eq('banana')
# is the row within ±n rows of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2 & m3]
output:
   Fruit  price
2  apple      1
5  apple      1
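As a cross-check on the rolling mask, an equivalent non-vectorized loop over the same data (plain Python, so it is easy to verify by eye):

```python
import pandas as pd

jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana'],
            'price': [1, 2, 1, 3, 2, 1, 3]}
df = pd.DataFrame(jsonDict)
n = 2

# positions of the banana rows
banana_pos = [i for i, f in enumerate(df['Fruit']) if f == 'banana']
# apples within n rows (before or after) of any banana
hits = [i for i, f in enumerate(df['Fruit'])
        if f == 'apple' and any(abs(i - b) <= n for b in banana_pos)]
print(df.loc[hits])  # rows 2 and 5
```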

Related

Python pandas move cell value to another cell in same row

I have a dataFrame like this:
id  Description   Price      Unit
1   Test Only     1254       12
2   Data test     Fresher    4
3   Sample        3569       1
4   Sample Onces  Code test
5   Sample        245        2
If the Price column holds a non-integer value, I want to move that string left into the Description column and have Price become NaN. There is no specific word to match; the only rule is that any non-integer value in Price moves to Description.
I already tried pandas replace and concat, but it didn't work.
Desired output is like this:
id  Description  Price  Unit
1   Test Only    1254   12
2   Fresher             4
3   Sample       3569   1
4   Code test
5   Sample       245    2
This should work
# data
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Description': ['Test Only', 'Data test', 'Sample', 'Sample Onces', 'Sample'],
                   'Price': ['1254', 'Fresher', '3569', 'Code test', '245'],
                   'Unit': [12, 4, 1, np.nan, 2]})
# convert the Price column to numeric, coercing errors to NaN
price = pd.to_numeric(df.Price, errors='coerce')
# where Price is not numeric, replace Description with the Price string
df.Description = df.Description.mask(price.isna(), df.Price)
# assign the numeric prices back to the Price column
df.Price = price
df
Use:
# convert values to numeric
price = pd.to_numeric(df['Price'], errors='coerce')
# flag missing (non-numeric) values
m = price.isna()
# shift only the matched rows one column to the left
df.loc[m, ['Description','Price']] = df.loc[m, ['Description','Price']].shift(-1, axis=1)
print (df)
   id Description Price
0   1   Test Only  1254
1   2     Fresher   NaN
2   3      Sample  3569
3   4   Code test   NaN
4   5      Sample   245
If you need numeric values in the output Price column:
df = df.assign(Price=price)
print (df)
   id Description   Price
0   1   Test Only  1254.0
1   2     Fresher     NaN
2   3      Sample  3569.0
3   4   Code test     NaN
4   5      Sample   245.0
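If the float rendering of the coerced column (`1254.0`) is undesirable, pandas' nullable integer dtype keeps whole numbers while still allowing the missing values — a small sketch:

```python
import pandas as pd

# coerce, then cast to the nullable Int64 dtype: 1254, <NA>, 3569
price = pd.to_numeric(pd.Series(['1254', 'Fresher', '3569']), errors='coerce')
price_int = price.astype('Int64')
print(price_int)
```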

How do I check total amount of times a certain value occurred in a nested loop?

Question: Calculate the total number of apples bought on Monday and Wednesday only.
This is my code currently:
apple = 0
banana = 0
orange = 0
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [['apple', 'cherry', 'apple', 'orange'],
         ['orange', 'apple', 'apple', 'banana'],
         ['banana', 'apple', 'cherry', 'orange']]
for x in fruit:
    if 'apple' in x:
        if fruit.index(x) == 0 or fruit.index(x) == 2:
            apple + 1
print(apple)
For some reason, the current result which I am getting for printing apple is 0.
What is wrong with my code?
The problem in your code is that you compute `apple + 1` but never assign the result back to the variable, which is why it prints its initial value:
apple = 0
apple + 1
you need to do:
apple += 1
Also, fruit.index(x) always returns the index of the first occurrence of that element. For example:
fruit[1].index('apple')
returns the index of the first occurrence of 'apple' in fruit[1], which is 1.
According to your question, though, this approach is still not quite right: you were asked for the number of apples on Monday and Wednesday only, so it is simpler to address those days directly; otherwise Tuesday's apples could be counted as well. Below is the corrected solution:
apple = 0
banana = 0
orange = 0
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [['apple', 'cherry', 'apple', 'orange'],
         ['orange', 'apple', 'apple', 'banana'],
         ['banana', 'apple', 'cherry', 'orange']]
apple += fruit[0].count('apple')
apple += fruit[2].count('apple')
print(apple)
There are two issues with your code.
The first issue is:
if fruit.index(x) == 0 or fruit.index(x) == 2:
    apple + 1
apple + 1 does not do anything meaningful. If you want to increment apple, you need to write apple += 1. With that fix, apple ends up as 2.
The second issue is that you need the total number of apples, which is 3 and not 2: two apples were bought on Monday and one on Wednesday.
You can use collections.Counter for this
from collections import Counter

apple = 0
for x in fruit:
    if 'apple' in x:
        if fruit.index(x) == 0 or fruit.index(x) == 2:
            apple += Counter(x)['apple']
print(apple)
It should be apple += 1, not apple + 1.
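A variant that sidesteps the `fruit.index(x)` pitfall entirely is to track the day with `enumerate`, which stays correct even if two days happen to contain identical lists:

```python
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [['apple', 'cherry', 'apple', 'orange'],
         ['orange', 'apple', 'apple', 'banana'],
         ['banana', 'apple', 'cherry', 'orange']]

apple = 0
for day, basket in enumerate(fruit):  # day: 0 = Monday, 1 = Tuesday, 2 = Wednesday
    if day in (0, 2):                 # Monday or Wednesday only
        apple += basket.count('apple')
print(apple)  # 3
```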

pandas data frame efficiently remove duplicates and keep records largest int value

I have a data frame with two columns NAME and VALUE, where NAME contains duplicates and VALUE contains ints. I would like to efficiently drop duplicate records of column NAME while keeping the record with the largest VALUE. I figured out how to do it with two steps, sort and drop duplicates, but I am new to pandas and am curious if there is a more efficient way to achieve this, perhaps with the query function?
import pandas
import io
import json
input = """
KEY VALUE
apple 0
apple 1
apple 2
bannana 0
bannana 1
bannana 2
pear 0
pear 1
pear 2
pear 3
orange 0
orange 1
orange 2
orange 3
orange 4
"""
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df = df[['KEY','VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))
running this yields the expected output:
{
    "apple": 2,
    "bannana": 2,
    "pear": 3,
    "orange": 4
}
Is there a more efficient way to achieve this transformation with pandas?
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()
If your input needs to be a dictionary, just add to_dict() :
df.groupby('KEY')['VALUE'].max().to_dict()
Also you can try:
[*df.groupby('KEY',sort=False).last().to_dict().values()][0]
{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
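A related variant, in case the frame has more columns than KEY and VALUE and the whole winning row should survive: `idxmax` returns the index label of the per-group maximum, which `loc` can use directly. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'KEY': ['apple', 'apple', 'pear', 'pear', 'pear'],
                   'VALUE': [0, 2, 1, 3, 0],
                   'NOTE': ['a', 'b', 'c', 'd', 'e']})

# keep the full row holding the max VALUE within each KEY group
out = df.loc[df.groupby('KEY')['VALUE'].idxmax()]
print(out)
```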

Concatenate two rows based on the same value in the next row of a new column

I am creating a new column and trying to concatenate the rows where a column value is the same: the first row of a group should hold its own initial value, and the second row should hold the values of the first and second rows combined. I have been able to make it work where the column has two values; if the column has three or more values, only two values end up concatenated in the final row.
data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['length'] = df['Fruit'].str.len()
df['Fruit_color'] = df['Fruit'] + df['length'].map(lambda x: ' '*x)
df['same_fruit'] = np.where(df['Fruit'] != df['Fruit'].shift(1),
                            df['Fruit_color'],
                            df['Fruit_color'].shift(1) + " " + df['Fruit_color'])
Current output: (screenshot not included)
How do I get the expected output? Below is the output I am expecting: (screenshot not included)
Regards,
Ren.
Here is an answer:
In [1]:
import pandas as pd
data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['length'] = df['Fruit'].str.len()
df['Fruit_color'] = df['Fruit'] + ' ' + df['Color']
df.sort_values(by=['Fruit_color'], inplace=True)

## Get the maximum occurrence count of any fruit
maximum = df[['Fruit', 'Color']].groupby(['Fruit']).count().max().tolist()[0]

## Shift as many times as the highest occurrence
new_cols = []
for i in range(maximum):
    temporary_col = 'Fruit_' + str(i)
    df[temporary_col] = df['Fruit'].shift(i+1)
    new_col = 'new_col_' + str(i)
    df[new_col] = df['Fruit_color'].shift(i+1)
    df.loc[df[temporary_col] != df['Fruit'], new_col] = ''
    df.drop(columns=[temporary_col], axis=1, inplace=True)
    new_cols.append(new_col)

## Use the shifted columns to build `same_fruit`, then drop them
df['same_fruit'] = df['Fruit_color']
for col in new_cols:
    df['same_fruit'] = df['same_fruit'] + ' ' + df[col]
    df.drop(columns=[col], axis=1, inplace=True)
Out [1]:
        Fruit   Color  length       Fruit_color                             same_fruit
1       Apple   Green       5       Apple Green                            Apple Green
0       Apple     Red       5         Apple Red                  Apple Red Apple Green
3       Mango   Green       5       Mango Green                            Mango Green
4       Mango  Orange       5      Mango Orange               Mango Orange Mango Green
2       Mango  Yellow       5      Mango Yellow  Mango Yellow Mango Orange Mango Green
5  Watermelon   Green      10  Watermelon Green                       Watermelon Green
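The shift loop above can also be condensed into a grouped cumulative concatenation. This sketch follows the question's original row order (the answer above sorts first, so its strings come out in a different order); string `cumsum` on object dtype is compact, though not fast on large frames:

```python
import pandas as pd

data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['Fruit_color'] = df['Fruit'] + ' ' + df['Color']

# cumulative concatenation within each Fruit group
df['same_fruit'] = (df.groupby('Fruit')['Fruit_color']
                      .transform(lambda s: (s + ' ').cumsum().str.strip()))
print(df)
```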

Getting all rows where for column 'C' the entry is larger than the preceding element in column 'C'

How can I select all rows of a data frame where a condition involving the relationship between every two consecutive entries of a column is met? To give a specific example, let's say I have a DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4],
...                    'B': ['spam', 'ham', 'egg', 'foo'],
...                    'C': [4, 5, 3, 4]})
>>> df
   A     B  C
0  1  spam  4
1  2   ham  5
2  3   egg  3
3  4   foo  4
>>> df2 = df[ return every row of df where C[i] > C[i-1] ]
>>> df2
   A    B  C
1  2  ham  5
3  4  foo  4
There is plenty of great information about slicing and indexing in the pandas docs and here, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for is the rows of data where the value stored in C is no longer monotonically declining.
Any help is appreciated!
Use boolean indexing with compare by shifted column values:
print (df[df['C'] > df['C'].shift()])
   A    B  C
1  2  ham  5
3  4  foo  4
Detail:
print (df['C'] > df['C'].shift())
0    False
1     True
2    False
3     True
Name: C, dtype: bool
Equivalently, to find the rows that break the monotonic decline, you can compare the diff of the column:
print (df[df['C'].diff() > 0])
   A    B  C
1  2  ham  5
3  4  foo  4
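One detail worth knowing: `shift()` leaves NaN in the first position, and any comparison against NaN evaluates False, so row 0 can never be selected. `fill_value` makes that behavior explicit if you ever need something different — a sketch:

```python
import pandas as pd

c = pd.Series([4, 5, 3, 4])
mask = c > c.shift()  # first comparison is against NaN -> False
print(mask.tolist())  # [False, True, False, True]

# seed the shift with the first value instead of NaN; same mask here
mask2 = c > c.shift(fill_value=c.iloc[0])
```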
