How can I extract numbers as well as strings from multiple rows in a DataFrame column? - python-3.x

DF1
index | Number
0     | [Number 1]
1     | [Number 2]
2     | [kg]
3     | []
4     | [kg, Number 3]
In the Number column of my dataframe, I need to extract the number if present, kg if the string contains kg, and NaN if there is no value. If a row has both a number and kg, I extract only the number.
Expected Output
index | Number
0     | 1
1     | 2
2     | kg
3     | NaN
4     | 3
I wrote a lambda function for this but I am getting an error:
NumorKG = lambda x: x.str.extract('(\d+)') if x.str.extract('(\d+)').isdigit() else 'kg' if x.str.find('kg') else "NaN"
DF1['Number']=DF1['Number'].apply(NumorKG)
The error that I am getting is:
AttributeError: 'str' object has no attribute 'str'

Use numpy.where to set the values:
# extract the numeric part to a Series
d = df['Number'].str.extract(r'(\d+)', expand=False)
# test if digit
mask1 = d.str.isdigit().fillna(False)
# test if value contains 'kg'
mask2 = df['Number'].str.contains('kg', na=False)
df['Number'] = np.where(mask1, d,
                        np.where(mask2 & ~mask1, 'kg', np.nan))
print (df)
  Number
0      1
1      2
2     kg
3    nan
4      3
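Note that np.where returns a NumPy object array here, so np.nan gets coerced to the string 'nan' when mixed with strings, which is why the output above shows nan rather than a real missing value. As a hedged sketch not part of the original answer, one way to keep a genuine NaN is to build the result with pandas methods instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number': ['Number 1', 'Number 2', 'kg', '', 'kg Number 3']})
# extract the digits where present
d = df['Number'].str.extract(r'(\d+)', expand=False)
mask_kg = df['Number'].str.contains('kg', na=False)
# prefer the extracted digit; fall back to 'kg'; empty strings become a real NaN
df['Number'] = d.fillna(df['Number'].where(~mask_kg, 'kg').replace('', np.nan))
print(df)
```

Row 3 then holds an actual NaN that pd.isna detects, rather than the string 'nan'.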
Your solution should be changed:
import re

def NumorKG(x):
    a = re.findall(r'(\d+)', x)
    if len(a) > 0:
        return a[0]
    elif 'kg' in x:
        return 'kg'
    else:
        return np.nan

df['Number'] = df['Number'].apply(NumorKG)
print (df)
  Number
0      1
1      2
2     kg
3    NaN
4      3
And your lambda function should be changed to:
NumorKG = lambda x: (re.findall(r'(\d+)', x)[0]
                     if len(re.findall(r'(\d+)', x)) > 0
                     else 'kg' if 'kg' in x
                     else np.nan)

Inside apply, each value is passed in as a scalar, so you can't use the .str accessor on it.
As you are dealing with only one column, there is no need for apply at all.
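A minimal illustration of that point: the function handed to Series.apply receives each element as a plain Python object, not a one-element Series, which is why x.str raises AttributeError:

```python
import pandas as pd

s = pd.Series(['Number 1', 'kg'])
# each x inside apply is a plain Python str, not a Series
kinds = s.apply(lambda x: type(x).__name__)
print(kinds.tolist())  # ['str', 'str']
```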
As an alternative to jezrael's answer (with a reproducible example), this is a possible solution:
DF1 = pd.DataFrame({'Number': [["Number 1"], ["Number 2"], ["kg"], [""], ["kg", "Number 3"]]})
DF1['Number'] = DF1.Number.str.join(sep=" ")
mask_digit = DF1.Number.str.extract(r'(\d+)', expand=False).str.isdigit().fillna(False)
mask_kg = DF1['Number'].str.contains('kg', na=False)
DF1.loc[mask_digit, 'Number'] = DF1.Number.str.extract(r'(\d+)', expand=False)
# apply the kg mask only where no digit was found, so rows with both keep the number
DF1.loc[mask_kg & ~mask_digit, 'Number'] = 'kg'
DF1.loc[~(mask_digit | mask_kg), 'Number'] = np.nan

Related

Pandas: Every value of cell list to lower case

I have a dataframe like this
# initialize list of lists
data = [[1, ['ABC', 'pqr']], [2, ['abc', 'XY']], [3, np.nan]]
# Create the pandas DataFrame
data = pd.DataFrame(data, columns = ['Name', 'Val'])
data
   Name         Val
0     1  [ABC, pqr]
1     2   [abc, XY]
2     3         NaN
I am trying to convert every value in the list to its lower case:
data['Val'] = data['Val'].apply(lambda x: np.nan if len(x) == 0 else [item.lower() for item in x])
data
However, I get this error:
TypeError: object of type 'float' has no len()
Expected final output
   Name         Val
0     1  [abc, pqr]
1     2   [abc, xy]
2     3         NaN
The first idea is to filter out rows with missing values and process only the rest:
m = data['Val'].notna()
data.loc[m, 'Val'] = data.loc[m, 'Val'].apply(lambda x: [item.lower() for item in x])
print (data)
   Name         Val
0     1  [abc, pqr]
1     2   [abc, xy]
2     3         NaN
Or you can process only the lists, filtered by isinstance:
f = lambda x: [item.lower() for item in x] if isinstance(x, list) else np.nan
data['Val'] = data['Val'].apply(f)
print (data)
   Name         Val
0     1  [abc, pqr]
1     2   [abc, xy]
2     3         NaN

Pandas Aggregate data other than a specific value in specific column

I have my data like this in a pandas dataframe:
df = pd.DataFrame({
    'ID': range(1, 8),
    'Type': list('XXYYZZZ'),
    'Value': [2, 3, 2, 9, 6, 1, 4]
})
The output that I want to generate is:
How can I generate these results using a pandas dataframe? I want to include all the Y values of the Type column, and do not want to aggregate them.
First filter values by boolean indexing, aggregate, append the filtered-out rows back, and finally sort:
mask = df['Type'] == 'Y'
# note: DataFrame.append was removed in pandas 2.0; use pd.concat there
df1 = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID':'first', 'Value':'sum'})
                .append(df[mask])
                .sort_values('ID'))
print (df1)
   ID Type  Value
0   1    X      5
2   3    Y      2
3   4    Y      9
1   5    Z     11
If you want the range 1 to the length of the data for the ID column:
mask = df['Type'] == 'Y'
df1 = (df[~mask].groupby('Type', as_index=False)
                .agg({'ID':'first', 'Value':'sum'})
                .append(df[mask])
                .sort_values('ID')
                .assign(ID = lambda x: np.arange(1, len(x) + 1)))
print (df1)
   ID Type  Value
0   1    X      5
2   2    Y      2
3   3    Y      9
1   4    Z     11
Another idea is to create a helper column with unique values only for the Y rows and aggregate by both columns:
mask = df['Type'] == 'Y'
df['g'] = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type','g'], as_index=False)
         .agg({'ID':'first', 'Value':'sum'})
         .drop('g', axis=1)[['ID','Type','Value']])
print (df1)
   ID Type  Value
0   1    X      5
1   3    Y      2
2   4    Y      9
3   5    Z     11
A similar alternative with a helper array g, so that drop is not necessary:
mask = df['Type'] == 'Y'
g = np.where(mask, mask.cumsum() + 1, 0)
df1 = (df.groupby(['Type', g], as_index=False)
         .agg({'ID':'first', 'Value':'sum'})[['ID','Type','Value']])

counting NaNs in 'for' loop in python

I am trying to iterate over the rows in df and count consecutive rows whose value is NaN or 0, restarting the count whenever the value changes away from NaN or 0. I would like to get something like this:
Value  Period
0      1
0      2
0      3
NaN    4
21     NaN
4      NaN
0      1
0      2
NaN    3
I wrote a function which takes a dataframe as an argument and returns it with an additional column that holds the count:
def calc_period(df):
    period_x = []
    sum_x = 0
    for i in range(1, df.shape[0]):
        if df.iloc[i,0] == np.nan or df.iloc[i,0] == 0:
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    period_x.append(sum_x)
    df['period_x'] = period_x
    return df
The function works well when the value is 0. But when the value is NaN the count is also NaN and I get the following result:
Value  Period
0      1
0      2
0      3
NaN    NaN
NaN    NaN
Here is a revised version of your code:
import pandas as pd
import numpy as np
import math

def is_nan_or_zero(val):
    return math.isnan(val) or val == 0

def calc_period(df):
    is_first_nan_or_zero = is_nan_or_zero(df.iloc[0, 0])
    period_x = [1 if is_first_nan_or_zero else np.nan]
    sum_x = 1 if is_first_nan_or_zero else 0
    for i in range(1, df.shape[0]):
        val = df.iloc[i, 0]
        if is_nan_or_zero(val):
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    df['period_x'] = period_x
    return df
There were 2 fixes:
Replacing df.iloc[i,0] == np.nan with math.isnan(val), because NaN never compares equal to anything.
Removing period_x.append(sum_x) at the end, and adding the first period value instead (since we start iterating from the second value).
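The first fix matters because NaN compares unequal to everything, including itself, so a == np.nan test is never True. A small demonstration:

```python
import math
import numpy as np

val = np.nan
print(val == np.nan)    # False - NaN never compares equal, even to itself
print(math.isnan(val))  # True
print(np.isnan(val))    # True - the NumPy equivalent
```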

How to use pandas df column value in if-else expression to calculate additional columns

I am trying to calculate additional metrics from existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind']=='N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind']=='Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition the MarketValue is calculated correctly, but for the elif condition it is not giving the correct value.
Can anyone point out what I am doing wrong in this code?
I think you need numpy.select; apply can be removed and the columns multiplied with mul:
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
Sample:
df = pd.DataFrame({'Sold_price':[7,8,9,4,2,3],
                   'SharesUnits':[1,3,5,7,1,0],
                   'CurrentPrice':[5,3,6,9,2,4],
                   'Sell_Ind':list('NNYYTT')})
#print (df)
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
print (df)
   CurrentPrice Sell_Ind  SharesUnits  Sold_price  MarketValue
0             5        N            1           7          5.0
1             3        N            3           8          9.0
2             6        Y            5           9         45.0
3             9        Y            7           4         28.0
4             2        T            1           2          0.0
5             4        T            0           3          0.0

Python/Pandas return column and row index of found string

I've searched previous answers relating to this, but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword in a sentence in a dataframe ('Timeframe'), where the full sentence is 'Timeframe for wave in ____', and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding a hard-coded index for dynamic reasons. Is there a way to return [34, 0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To check the index you need contains with boolean indexing, but then there are 3 possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
                            A
0  Timeframe for wave in ____
1                           a
2                           c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        # only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value Timeframe:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,'Timeframe'],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
print (df)
   A  B          C  D  E  F
0  a  4          7  1  5  a
1  b  5          8  3  3  a
2  c  4          9  5  6  a
3  d  5          4  7  9  b
4  e  5          2  1  2  b
5  f  4  Timeframe  0  4  b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
In case you have multiple columns to look into, you can use the following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask==True)
result = [find_result[0][0], find_result[1][0]]
Then the output for df and result would be:
>>> df
   0  1                          2  3
0  1  2                          3  4
1  a  b  Timeframe for wave in____  d
2  5  6                          7  8
>>> result
[1, 2]
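As a further hedged sketch, not from the original answers: DataFrame.stack gives the same (row, column) lookup without building the mask by hand, because it flattens the frame into a Series indexed by (row, column) pairs, so a single contains filter locates the cell:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   ["a", "b", "Timeframe for wave in____", "d"],
                   [5, 6, 7, 8]])
# stack flattens the frame into a Series with a (row, column) MultiIndex
s = df.stack()
hits = s[s.astype(str).str.contains("Timeframe")]
print(hits.index.tolist())  # [(1, 2)]
```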
