Replace string column values ending with specific substrings, conditional on another column, with Pandas - python-3.x

Given a dataset as follows:
   id  company name  value
0   1    Finl Corp.      7
1   2  Fund Tr Corp      6
2   3   Inc Invt Fd      5
3   4  Govt Fd Inc.      3
4   5   Trinity Inc      5
Or:
[{'id': 1, 'company name': 'Finl Corp.', 'value': 7},
{'id': 2, 'company name': 'Fund Tr Corp', 'value': 6},
{'id': 3, 'company name': 'Inc Invt Fd', 'value': 5},
{'id': 4, 'company name': 'Govt Fd Inc.', 'value': 3},
{'id': 5, 'company name': 'Trinity Inc', 'value': 5}]
I need to strip the suffix when the company name column's content ends with one of ['Corp.', 'Corp', 'Inc.', 'Inc'] and, at the same time, value is >= 5.
The expected result will be:
   id  company name  value
0   1          Finl      7
1   2       Fund Tr      6
2   3   Inc Invt Fd      5
3   4  Govt Fd Inc.      3
4   5       Trinity      5
How could I achieve that with Pandas and regex?
Trial code, which fails with TypeError: replace() missing 1 required positional argument: 'repl':
mask = (df1['value'] >= 5)
df1.loc[mask, 'company_name_concise']= df1.loc[mask, 'company name'].str.replace(r'\bCorp.|Corp|Inc.|Inc$', regex=True)

You can build the pattern from the list, adding \s* for the preceding space and $ to anchor at the end of the string (re.escape keeps the literal dots from acting as regex wildcards):
import re

mask = (df1['value'] >= 5)
L = ['Corp.', 'Corp', 'Inc.', 'Inc']
pat = '|'.join(rf'\s*{re.escape(x)}$' for x in L)
df1.loc[mask, 'company name'] = df1.loc[mask, 'company name'].str.replace(pat, '', regex=True)
print (df1)
   id  company name  value
0   1          Finl      7
1   2       Fund Tr      6
2   3   Inc Invt Fd      5
3   4  Govt Fd Inc.      3
4   5       Trinity      5

str.replace takes two arguments, the pattern and the replacement:
mask = (df1['value'] >= 5)
df1.loc[mask, 'company_name_concise']= df1.loc[mask, 'company name'].str.replace(r'\b(?:Corp\.?|Inc\.?)$', '', regex=True)
Note that the regex pattern you want here is:
\b word boundary
(?:
Corp\.? match Corp or Corp.
| OR
Inc\.? match Inc or Inc.
)
$ at the end of the company name
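As a quick sanity check outside pandas, the same pattern can be exercised with the re module directly; this is just an illustrative sketch, with an optional leading \s* added so the space before the suffix is stripped as well:
import re

pat = re.compile(r'\s*\b(?:Corp\.?|Inc\.?)$')
for s in ['Finl Corp.', 'Fund Tr Corp', 'Inc Invt Fd', 'Trinity Inc']:
    print(pat.sub('', s))  # Finl, Fund Tr, Inc Invt Fd (unchanged), Trinity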

Or, more briefly, you could modify the column in place and rely on index alignment when assigning:
df.loc[df['value'] >= 5, 'company name'] = df['company name'].str.replace(r'\s*\b(?:Corp\.?|Inc\.?)$', '', regex=True)
>>> df
   id  company name  value
0   1          Finl      7
1   2       Fund Tr      6
2   3   Inc Invt Fd      5
3   4  Govt Fd Inc.      3
4   5       Trinity      5
>>>
Or a solution with np.where:
>>> df['company name'] = np.where(df['value'] >= 5, df['company name'].str.replace(r'\s*\b(?:Corp\.?|Inc\.?)$', '', regex=True), df['company name'])
>>> df
   id  company name  value
0   1          Finl      7
1   2       Fund Tr      6
2   3   Inc Invt Fd      5
3   4  Govt Fd Inc.      3
4   5       Trinity      5
>>>

Related

Removing rows from DataFrame based on different conditions applied to subsets of the data

Here is the DataFrame I am working with. You can create it using this snippet:
my_dict = {'id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 3, 3],
           'category': ['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
           'value': [1, 12, 34, 12, 12, 34, 12, 35, 34, 45, 65, 55, 34, 25]}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and start removing id's manually by using the code:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But I don't want to do it manually since there are multiple categories. Is there a better way of doing it by considering all the categories at once and removing ids based on the conditions listed in a list/dictionary?
If you need to filter the MultiIndex Series s by the dictionary, use Index.get_level_values with Series.map and keep the groups whose counts match via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category  id
a         2     2
          3     2
b         1     3
          2     3
Name: value, dtype: int64
If you need to filter the original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
    id category  value
1    2        a     12
2    1        b     34
3    2        b     12
4    1        b     12
5    2        b     34
7    2        a     35
8    1        b     34
9    2        b     45
12   3        a     34
13   3        a     25
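A shorter, if slower, alternative is GroupBy.filter, which keeps only the groups whose row count matches the dictionary entry for their category. This is a sketch assuming the same x and d as above; filter calls a Python function per group, so the transform-based solution above scales better on many groups:
d = {'a': 2, 'b': 3}
# g.name is the (category, id) tuple of each group; keep groups whose size equals the required count
out = x.groupby(['category', 'id']).filter(lambda g: len(g) == d[g.name[0]])
print(out)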

Count positive, negative and zero values for multiple columns in Python

Given a dataset as follows:
[{'id': 1, 'ltp': 2, 'change': nan},
{'id': 2, 'ltp': 5, 'change': 1.5},
{'id': 3, 'ltp': 3, 'change': -0.4},
{'id': 4, 'ltp': 0, 'change': 2.0},
{'id': 5, 'ltp': 5, 'change': -0.444444},
{'id': 6, 'ltp': 16, 'change': 2.2}]
Or
   id  ltp    change
0   1    2       NaN
1   2    5  1.500000
2   3    3 -0.400000
3   4    0  2.000000
4   5    5 -0.444444
5   6   16  2.200000
I would like to count the number of positive, negative and zero values for the columns ltp and change; the result may look like this:
  columns  positive  negative  zero
0     ltp         5         0     1
1  change         3         2     0
How could I do that with Pandas or Numpy? Thanks.
Updated: what if I need to group by type and count following the logic above?
   id  ltp    change type
0   1    2       NaN    a
1   2    5  1.500000    a
2   3    3 -0.400000    a
3   4    0  2.000000    b
4   5    5 -0.444444    b
5   6   16  2.200000    b
The expected output:
  type columns  positive  negative  zero
0    a     ltp         3         0     0
1    a  change         1         1     0
2    b     ltp         2         0     1
3    b  change         2         1     0
Use np.sign on the selected columns first, then count values with value_counts, transpose, replace missing values, convert to int, rename the column names by dictionary and finally convert the index into a columns column:
d = {-1:'negative', 1:'positive', 0:'zero'}
df = (np.sign(df[['ltp','change']])
        .apply(pd.value_counts)
        .T
        .fillna(0)
        .astype(int)
        .rename(columns=d)
        .rename_axis('columns')
        .reset_index())
print (df)
  columns  negative  zero  positive
0     ltp         0     1         5
1  change         2     0         3
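If the exact column order from the question matters, the counts can simply be reindexed afterwards (a small follow-up, assuming the df produced above):
df = df[['columns', 'positive', 'negative', 'zero']]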
EDIT: Another solution for the added type column: reshape with DataFrame.melt, map the sign of each value through np.sign, and count the combinations with crosstab:
d = {-1:'negative', 1:'positive', 0:'zero'}
df1 = df.melt(id_vars='type', value_vars=['ltp','change'], var_name='columns')
df1['value'] = np.sign(df1['value']).map(d)
df1 = (pd.crosstab([df1['type'], df1['columns']], df1['value'])
         .rename_axis(columns=None)
         .reset_index())
print (df1)
  type columns  negative  positive  zero
0    a  change         1         1     0
1    a     ltp         0         3     0
2    b  change         1         2     0
3    b     ltp         0         2     1

How can I sum columns in a DataFrame by date in time series data

Here's the example:
df = pd.DataFrame({'Country': ['United States', 'China', 'Italy', 'spain'],
                   '2020-01-01': [0, 2, 1, 0],
                   '2020-01-02': [1, 0, 1, 2],
                   '2020-01-03': [0, 3, 2, 0]})
df
I want to sum the values of the columns by date so that each next column holds the accumulated value, which means 2020-01-02 gets a new value of (2020-01-01 + 2020-01-02), and so on.
Convert the Country column to the index with DataFrame.set_index and use DataFrame.cumsum across the date columns with axis=1:
df = df.set_index('Country').cumsum(axis=1)
print (df)
               2020-01-01  2020-01-02  2020-01-03
Country
United States           0           1           1
China                   2           2           5
Italy                   1           2           4
spain                   0           2           2
Or select all columns except the first with DataFrame.iloc before cumsum:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
print (df)
         Country  2020-01-01  2020-01-02  2020-01-03
0  United States           0           1           1
1          China           2           2           5
2          Italy           1           2           4
3          spain           0           2           2

Create new variable for grouped data using Python

I have a data frame like this:
d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
df = pandas.DataFrame(data=d)
What I want to do is create a new id variable. Whenever a name (say john) appears for the first time, this id will be equal to 1; for any other occurrence of the same name (john), this id variable will be 0. This will be done for all the other names in the data. How do I go about doing that?
The final output should look like this (id is 1 on the first occurrence of each name, 0 otherwise):
  Prod Qty  name  id
0  101   5  john   1
1  102   4  john   0
2  101   1  john   0
3  501   3   Tim   1
4  505   5   Tim   0
5  301   4   Tim   0
6  302   1   Bob   1
7  302   3   Bob   0
NOTE: If someone knows SAS, there you can sort your data by name and then use first.name, as in "if first.variable = 1 then id = 1". For the first occurrence of a name, first.name = 1; for any repeat occurrence of the same name, first.name = 0. I am trying to replicate the same in Python.
So far I have tried pandas groupby and first() functionality and also numpy.where(), but couldn't make any of that work. Any fresh perspective will be appreciated.
You can use cumcount:
s=df.groupby(['Prod','name']).cumcount().add(1)
df['counter']=s.mask(s.gt(1),0)
df
Out[1417]:
  Prod Qty  name  counter
0  101   5  john        1
1  102   4  john        1
2  101   1  john        0
3  501   3   Tim        1
4  505   5   Tim        1
5  301   4   Tim        1
6  302   1   Bob        1
7  302   3   Bob        0
Update:
s = df.groupby(['name']).cumcount().add(1).le(1).astype(int)
s
Out[1421]:
0    1
1    0
2    0
3    1
4    0
5    0
6    1
7    0
dtype: int32
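If that flag should live on the DataFrame rather than as a standalone Series, it can be assigned straight back (assuming the s computed just above):
df['counter'] = s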
Faster:
df.loc[df['name'].drop_duplicates().index, 'counter'] = 1
df.fillna(0)
Out[1430]:
  Prod Qty  name  counter
0  101   5  john      1.0
1  102   4  john      0.0
2  101   1  john      0.0
3  501   3   Tim      1.0
4  505   5   Tim      0.0
5  301   4   Tim      0.0
6  302   1   Bob      1.0
7  302   3   Bob      0.0
We can just work directly with your dictionary d and loop through to create a new entry.
import pandas

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}

names = set()  # store names that have appeared
id = []
for i in d['name']:
    if i in names:  # if it has appeared, add 0
        id.append(0)
    else:
        id.append(1)  # add 1 and note that it has appeared
        names.add(i)
d['id'] = id  # add the new entry to your dictionary
df = pandas.DataFrame(data=d)
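For completeness, the closest pandas analogue of SAS first.name is Series.duplicated, which marks repeat occurrences; this is a minimal sketch assuming the df built above, writing the flag into a counter column as the other answers do:
# True on the first occurrence of each name, False on repeats, then cast to 1/0
df['counter'] = (~df['name'].duplicated()).astype(int)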

Where does the leading whitespace come from in this RDD and how to avoid it?

string_integers.txt
a 1 2 3
b 4 5 6
c 7 8 9
sample.py
import re
pattern = re.compile("(^[a-z]+)\s")
txt = sc.textFile("string_integers.txt")
string_integers_separated = txt.map(lambda x: pattern.split(x))
print string_integers_separated.collect()
outcome
[[u'', u'a', u'1 2 3'], [u'', u'b', u'4 5 6'], [u'', u'c', u'7 8 9']]
expected outcome
[[u'a', u'1 2 3'], [u'b', u'4 5 6'], [u'c', u'7 8 9']]
You split on a pattern anchored at the beginning of the string, so the prefix will always be the empty string. You can, for example, use match instead:
pattern = re.compile("([a-z]+)\s+(.*$)")
pattern.match("a 1 2 3").groups()
# ('a', '1 2 3')
or a lookbehind:
pattern = re.compile("(?<=[a-z])\s")
pattern.split("a 1 2 3", maxsplit=1)
# ['a', '1 2 3']
or just split (in Python 2, use .split(None, 1)):
"a 1 2 3".split(maxsplit=1)
# ['a', '1 2 3']
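Wired back into the original script, any of these per-line fixes goes inside the map. A sketch using the match-based pattern, assuming the same sc (SparkContext) and input file as above:
import re

pattern = re.compile(r"([a-z]+)\s+(.*$)")
txt = sc.textFile("string_integers.txt")
# match instead of split, so no empty leading element is produced
string_integers_separated = txt.map(lambda x: list(pattern.match(x).groups()))
print(string_integers_separated.collect())
# [['a', '1 2 3'], ['b', '4 5 6'], ['c', '7 8 9']]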
