Concatenate two rows based on the same value in the next row of a new column - python-3.x

I am creating a new column and trying to concatenate the rows where the column value is the same. The 1st row would have its initial value, the 2nd row would have the values of the 1st and 2nd rows, and so on. I have been able to make it work where the column has two values, but if the column has 3 or more values, only two values are concatenated in the final row.
import pandas as pd
import numpy as np

data={ 'Fruit':['Apple','Apple','Mango','Mango','Mango','Watermelon'],
       'Color':['Red','Green','Yellow','Green','Orange','Green']
}
df = pd.DataFrame(data)
df['length']=df['Fruit'].str.len()
df['Fruit_color']=df['Fruit']+df['length'].map(lambda x: ' '*x)
df['same_fruit']=np.where(df['Fruit']!=df['Fruit'].shift(1),df['Fruit_color'],df['Fruit_color'].shift(1)+" "+df['Fruit_color'])
The current output and the output I am expecting were attached as screenshots (not reproduced here). How do I get the expected output?
Regards,
Ren.

Here is an answer:
In [1]:
import pandas as pd
data={ 'Fruit':['Apple','Apple','Mango','Mango','Mango','Watermelon'],
'Color':['Red','Green','Yellow','Green','Orange','Green']
}
df = pd.DataFrame(data)
df['length']=df['Fruit'].str.len()
df['Fruit_color']=df['Fruit'] + ' ' + df['Color']
df.sort_values(by=['Fruit_color'], inplace=True)
## Get the maximum of fruit occurrence
maximum = df[['Fruit', 'Color']].groupby(['Fruit']).count().max().tolist()[0]
## Iter shift as many times as the highest occurrence
new_cols = []
for i in range(maximum):
    temporary_col = 'Fruit_' + str(i)
    df[temporary_col] = df['Fruit'].shift(i+1)
    new_col = 'new_col_' + str(i)
    df[new_col] = df['Fruit_color'].shift(i+1)
    df.loc[df[temporary_col] != df['Fruit'], new_col] = ''
    df.drop(columns=[temporary_col], inplace=True)
    new_cols.append(new_col)
## Use these shifted columns to create `same_fruit` and drop the helper columns
df['same_fruit'] = df['Fruit_color']
for col in new_cols:
    df['same_fruit'] = df['same_fruit'] + ' ' + df[col]
    df.drop(columns=[col], inplace=True)
Out [1]:
Fruit Color length Fruit_color same_fruit
1 Apple Green 5 Apple Green Apple Green
0 Apple Red 5 Apple Red Apple Red Apple Green
3 Mango Green 5 Mango Green Mango Green
4 Mango Orange 5 Mango Orange Mango Orange Mango Green
2 Mango Yellow 5 Mango Yellow Mango Yellow Mango Orange Mango Green
5 Watermelon Green 10 Watermelon Green Watermelon Green
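For reference, here is a more compact way to build the same running concatenation; this is my sketch using groupby rather than repeated shifts, not part of the answer above:

import pandas as pd

data = {'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
        'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green']}
df = pd.DataFrame(data)
df['Fruit_color'] = df['Fruit'] + ' ' + df['Color']
df.sort_values(by=['Fruit_color'], inplace=True)

# For each fruit, join the current row's Fruit_color with all earlier rows of
# the group, newest first, matching the order the shift-based answer produces.
df['same_fruit'] = df.groupby('Fruit')['Fruit_color'].transform(
    lambda s: [' '.join(s.tolist()[i::-1]) for i in range(len(s))]
)
print(df)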

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == banana and if yes, I want the code to scan the preceding as well as the next n rows from the index position of the 'banana' row, for an instance where Fruit == apple. An example of the expected output is shown below taking n=2.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
Fruit price
2 apple 1
5 apple 1
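One caveat, as my addition rather than part of the answer: some pandas versions refuse rolling aggregations over a boolean Series. If you hit a dtype error, cast the mask to int first:

# Defensive variant of m2: cast the boolean mask before rolling.
m2 = m1.astype(int).rolling(2*n+1, min_periods=1, center=True).max().eq(1)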

Group By - but sum one column, and show original columns

I have a 5 column df. I need to group by the common names in column A and sum columns B and D, but I need to keep the output that currently sits in columns C through E.
Every time I group by, it drops the columns not involved in the grouping.
I understand some columns will have 2 non-common rows for a common item in column A, and I need to display both of those values. Hopefully an example illustrates the problem better.
A       B   C       D  E
Apple   10  Green   1  X
Pear    15  Brown   2  Y
Pear    5   Yellow  3  Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I'd like to output:
A       B   C       D  E
Apple   10  Green   1  X
Pear    20  Brown   5  Y
            Yellow     Z
Banana  4   Yellow  4  P
Plum    2   Red     5  R
I can't seem to find the right combination within the groupby function.
df_save =df_orig.loc[:, ["A", "C", "E"]]
df_agg = df_orig.groupby("A").agg({"B": "sum", "D" : "sum"}).reset_index()
df_merged = df_save.merge(df_agg)
for c in ["B", "D"]:
    df_merged.loc[df_merged[c].duplicated(), c] = ''
A       C       E  B    D
Apple   Green   X  10   1
Pear    Brown   Y  155  23
Pear    Yellow  Z
Banana  Yellow  P  4    4
Plum    Red     R  2    5
The above is the output after the operations. I hope this works. Thanks
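A variant of the same idea, as my sketch rather than the answer above: compute the group sums with transform so the frame keeps its shape, then blank the repeats. Incidentally, the 155 and 23 above look like string concatenation ('15'+'5', '2'+'3'), which suggests B and D were read as strings; pd.to_numeric guards against that.

import pandas as pd

df = pd.DataFrame({
    'A': ['Apple', 'Pear', 'Pear', 'Banana', 'Plum'],
    'B': [10, 15, 5, 4, 2],
    'C': ['Green', 'Brown', 'Yellow', 'Yellow', 'Red'],
    'D': [1, 2, 3, 4, 5],
    'E': ['X', 'Y', 'Z', 'P', 'R'],
})

out = df.copy()
for c in ['B', 'D']:
    # Group totals, aligned back to the original rows.
    out[c] = pd.to_numeric(out[c]).groupby(df['A']).transform('sum')
    # Show each total only once per group; blank the repeats.
    out.loc[out.duplicated(['A', c]), c] = ''
print(out)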

Pandas Drop an Entire Column if All of the Values equal a Certain Value

Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file into placing into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace 'Member' with missing values first, then test whether at least one value is not missing using notna with any, and mark every other column True in the mask with Series.reindex:
import numpy as np

mask = (df[['c']].replace('Member',np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print (mask)
Another idea is to chain both masks with & for bitwise AND:
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
                 .any()
                 .reindex(df.columns, fill_value=True))
print (mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
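Put together end to end on the df_two data, as my worked example of the approach above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['dave', 'bill', 'sally'],
                   'b': ['blue', 'red', 'green'],
                   'c': [np.nan, np.nan, 'Member']})

mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print(df.loc[:, mask])  # column c is dropped: it holds only NaN/'Member'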
Here's an alternate approach to do this.
import pandas as pd
import numpy as np
c = ['a','b','c']
d = [
    ['dave', 'blue', np.nan],
    ['bill', 'red', np.nan],
    ['sally', 'green', 'Member'],
    ['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print (df1)
print (df2)
if df1.c.replace('Member', np.nan).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print (df1)
if df2.c.replace('Member', np.nan).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print (df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
My idea is simple; maybe it will help you. I want to make sure this is what you want: drop the whole column if that column contains only NaN or 'Member', else do nothing.
So I need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and then test whether everything is null (or something similar).
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':['dave','bill','sally','ian'],'B':['blue','red','green','org'],'C':[np.nan,np.nan,'Member','Paid']})
df2 = df.drop(index=[3])
print(df)
print(df2)
# df
col = pd.Series([np.nan if x=='Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')
# df2
col = pd.Series([np.nan if x=='Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')
print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green
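All three answers special-case column c by name. A generalized sketch (my addition, not from the answers above) that drops every column whose values are all NaN or 'Member':

# Drop any column consisting entirely of NaN and/or 'Member'.
to_drop = [col for col in df.columns
           if (df[col].isna() | df[col].eq('Member')).all()]
df = df.drop(columns=to_drop)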

pandas data frame: efficiently remove duplicates and keep record with largest int value

I have a data frame with two columns NAME, and VALUE, where NAME contains duplicates and VALUE contains INTs. I would like to efficiently drop duplicate records of column NAME while keeping the record with the largest VALUE. I figured out how to do it with two steps, sort and drop duplicates, but I am new to pandas and am curious if there is a more efficient way to achieve this with the query function?
import pandas
import io
import json
input = """
KEY VALUE
apple 0
apple 1
apple 2
bannana 0
bannana 1
bannana 2
pear 0
pear 1
pear 2
pear 3
orange 0
orange 1
orange 2
orange 3
orange 4
"""
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df = df[['KEY','VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))
running this yields the expected output:
{
"apple": 2,
"bannana": 2,
"pear": 3,
"orange": 4
}
Is there a more efficient way to achieve this transformation with pandas?
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()
If your input needs to be a dictionary, just add to_dict():
df.groupby('KEY')['VALUE'].max().to_dict()
Also you can try:
[*df.groupby('KEY',sort=False).last().to_dict().values()][0]
{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
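If you need to keep the whole winning row rather than just the maximum, here is a sketch using idxmax (my addition, assuming VALUE is numeric):

# Keep, for each KEY, the row holding the largest VALUE.
out = df.loc[df.groupby('KEY')['VALUE'].idxmax()]
print(out)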

Filter rows based on the count of unique values

I need to count the occurrences of each value in column A and filter out the values that appear fewer than, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts with df.value_counts() but am not able to figure out how to filter them.
I want to keep the values of column A that appear at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
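Another equivalent one-liner, as my addition, using groupby.filter to keep the groups with at least 2 rows:

# Keep rows whose column-A value occurs at least twice.
df.groupby('A').filter(lambda g: len(g) >= 2)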
