How to replace a tag %%article%% by letter a - python-3.x

I have this dataframe:
pd.DataFrame({'text': ['I have %%article%% car', '%%article%%fter dawn', 'D%%article%%t%%article%%Fr%%article%%me']})
I am trying to replace %%article%% by letter a to have as output:
pd.DataFrame({'text': ['I have a car', 'after dawn', 'DataFrame']})
I tried to create a dict ={'%%article%%':'a'} and then:
df['text'] = df['text'].map(dict)
But it's not working, it returns NaN

When passing a dict to Series.map, it uses table lookup so that only elements that exactly match '%%article%%' will be replaced by 'a'.
An example from the documentation:
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0 cat
1 dog
2 NaN
3 rabbit
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0 kitten
1 puppy
2 NaN
3 NaN
An element with something like 'ccat' will not be replaced. Instead, you can use a function to replace them:
>>> df = pd.DataFrame({'text': ['I have %%article%% car', '%%article%%fter dawn', 'D%%article%%t%%article%%Fr%%article%%me']})
>>> df.text = df.text.map(lambda i: i.replace('%%article%%', 'a'))
>>> df
text
0 I have a car
1 after dawn
2 DataFrame
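But the better option is probably DataFrame.replace with regex=True (without it, only cells that equal '%%article%%' exactly are replaced):
>>> df.replace('%%article%%', 'a', regex=True)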
text
0 I have a car
1 after dawn
2 DataFrame

Use:
df['text'].str.replace('%%article%%', 'a')
Output:
0 I have a car
1 after dawn
2 DataFrame
Name: text, dtype: object
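To make the difference between the approaches above concrete, here is a minimal sketch that runs them side by side on the question's data. The variable names (looked_up, replaced, whole_cell, substring) are only illustrative and not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({'text': ['I have %%article%% car',
                            '%%article%%fter dawn',
                            'D%%article%%t%%article%%Fr%%article%%me']})

# Dict lookup: no cell equals '%%article%%' exactly, so every row becomes NaN.
looked_up = df['text'].map({'%%article%%': 'a'})

# Substring replacement; regex=False treats the tag as a literal string.
replaced = df['text'].str.replace('%%article%%', 'a', regex=False)

# DataFrame.replace compares whole cells unless regex=True is passed.
whole_cell = df.replace('%%article%%', 'a')
substring = df.replace('%%article%%', 'a', regex=True)

print(replaced.tolist())
```

Only the str.replace call and the regex=True variant change the text; the plain df.replace call leaves the frame as it was because no cell consists of the tag alone.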

Related

Pandas: Merging rows into one

I have the following table:
Name  Age  Data_1  Data_2
Tom   10   Test
Tom   10           Foo
Anne  20           Bar
How can I merge these rows to get this output:
Name  Age  Data_1  Data_2
Tom   10   Test    Foo
Anne  20           Bar
I tried this code (and some other related (agg, groupby other fields, et cetera)):
import pandas as pd
data = [['tom', 10, 'Test', ''], ['tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])
df = df.groupby("Name").sum()
print(df)
But I only get something like this:
      c2
Name
Anne  Foo
Tom   Bar
Just a groupby and a sum will do.
df.groupby(['Name','Age']).sum().reset_index()
Name Age Data_1 Data_2
0 Anne 20 Bar
1 tom 10 Test Foo
Use this if the empty cells are NaN:
(df.set_index(['Name', 'Age'])
.stack()
.groupby(level=[0, 1, 2])
.apply(''.join)
.unstack()
.reset_index()
)
Otherwise, run df.replace('', np.nan, inplace=True) before the code above (this needs import numpy as np).
# Output
Name Age Data_1 Data_2
0 Anne 20 NaN Bar
1 Tom 10 Test Foo
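Another way to get the same result, shown here only as a sketch and not taken from the answers above, is to turn the empty strings into NaN and let GroupBy.first pick the first non-missing value in each column. It assumes each group has at most one real value per column; if a group had several, only the first would be kept.

```python
import numpy as np
import pandas as pd

data = [['Tom', 10, 'Test', ''], ['Tom', 10, '', 'Foo'], ['Anne', 20, '', 'Bar']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Data_1', 'Data_2'])

# Empty strings become NaN so that first() can skip them.
merged = (df.replace('', np.nan)
            .groupby(['Name', 'Age'], as_index=False, sort=False)
            .first())

print(merged)
```

Here sort=False keeps the groups in order of first appearance, so Tom stays ahead of Anne.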

pandas help: map and match tab delimited strings in a column and print into new column

I have a dataframe, data, whose last column contains delimited strings of IDs, and a second dataframe, info, that says what each ID means. I want to map the user's input (items) to IDs via info, find which of those IDs appear in the last column of data, print and count the matches, and sort data by the number of matches.
import pandas as pd
#data
data = {'id': [123, 456, 789, 1122, 3344],
'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
#info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
user input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First read all arguments into the variable vals, select the matching MP-ID values with Series.isin and DataFrame.loc, extract them with Series.str.findall and Series.str.join, then count them with Series.str.count and sort with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
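One caveat with the findall approach: the joined IDs act as a regular expression, so a short ID such as MP:001 would also match inside a longer one such as a hypothetical MP:0015. The following sketch avoids that by splitting each cell on '|' and comparing whole tokens. The names wanted and matches are illustrative, and vals is hard-coded here as a stand-in for sys.argv[1:].

```python
import pandas as pd

test_data = pd.DataFrame({
    'id': [123, 456, 789, 1122, 3344],
    'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
    'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890',
              'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005'],
})
test_info = pd.DataFrame({
    'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
    'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango'],
})

vals = ['apple', 'mango']  # stand-in for sys.argv[1:]

# IDs that correspond to the requested items.
wanted = set(test_info.loc[test_info['Item'].isin(vals), 'MP-ID'])

# Split every cell into tokens and keep only the exact matches.
matches = test_data['MP-ID'].str.split('|').map(
    lambda tokens: [tok for tok in tokens if tok in wanted])

test_data['MP-ID match'] = matches.str.join('|')
test_data['count'] = matches.str.len()
test_data = test_data.sort_values('count', ascending=False,
                                  kind='stable', ignore_index=True)

print(test_data)
```

Counting the list length instead of occurrences of 'MP' also keeps the count correct if an ID ever contains that substring twice.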

Pandas: Every value of cell list to lower case

I have a dataframe like this
# initialize list of lists
data = [[1, ['ABC', 'pqr']], [2, ['abc', 'XY']], [3, np.nan]]
# Create the pandas DataFrame
data = pd.DataFrame(data, columns = ['Name', 'Val'])
data
Name Val
0 1 [ABC, pqr]
1 2 [abc, XY]
2 3 NaN
I am trying to convert every value in each list to lower case
data['Val'] = data['Val'].apply(lambda x: np.nan if len(x) == 0 else [item.lower() for item in x])
data
However I get this error
TypeError: object of type 'float' has no len()
Expected final output
Name Val
0 1 [abc, pqr]
1 2 [abc, xy]
2 3 NaN
The first idea is to select the rows without missing values and process only those:
m = data['Val'].notna()
data.loc[m, 'Val'] = data.loc[m, 'Val'].apply(lambda x: [item.lower() for item in x])
print (data)
Name Val
0 1 [abc, pqr]
1 2 [abc, xy]
2 3 NaN
Or you can process only the values that are lists, checked with isinstance:
f = lambda x: [item.lower() for item in x] if isinstance(x, list) else np.nan
data['Val'] = data['Val'].apply(f)
print (data)
Name Val
0 1 [abc, pqr]
1 2 [abc, xy]
2 3 NaN
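The original error comes from the missing cell: np.nan is a float, and len() is not defined for floats. A small sketch of the isinstance idea as a named function follows; lower_items is a hypothetical helper name, and unlike the lambda above it returns non-list values unchanged instead of replacing them with NaN.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, ['ABC', 'pqr']], [2, ['abc', 'XY']], [3, np.nan]],
                    columns=['Name', 'Val'])

def lower_items(value):
    # Only lists are transformed; NaN (a float) and anything else pass through.
    if isinstance(value, list):
        return [item.lower() for item in value]
    return value

data['Val'] = data['Val'].map(lower_items)

print(data)
```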

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
'c': ['123', '100', '6', np.nan]})
print(df)
a b c
0 20 1.0 NaN
1 50 NaN 1.0
2 100 1.0 1.0
print(df_id)
b c
0 50 123
1 4954 100
2 93920 6
3 20 NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
'c': [np.nan, np.nan, 1]})
print(result)
a b c
0 20 1.0 NaN
1 50 NaN NaN # df_id['c'] did not contain '50'
2 100 NaN 1.0 # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b','c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                  .isin(df_id[letter].tolist()) else np.nan, axis = 1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, pandas version 0.20.1
You can solve your problem using this instead:
for letter in ['b', 'c']:  # enumerate dropped because the index is not used here; keep it if the rest of your code needs it
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df with axis=1, x represents one row of df, so x['a'] is a single element (here a str), not a Series.
isin is a method of Series and other list-like pandas objects, so calling it on a str raises the AttributeError. Instead, use in to check whether that element is in the list.
Hope that was helpful. If you have any questions please ask.
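A short sketch of how in and isin differ may help here; the variable names are illustrative. Note in particular that in applied directly to a Series tests the index labels, not the values, which is why the answer converts to a list first.

```python
import pandas as pd

ids = pd.Series(['50', '4954', '93920', '20'])
value = '50'

# Scalar membership test against the values.
in_list = value in ids.tolist()

# `in` on a Series looks at the index (0, 1, 2, 3), not at the values.
in_series = value in ids

# isin is the vectorized version: one boolean per element.
elementwise = pd.Series(['20', '50', '100']).isin(ids)

print(in_list, in_series, elementwise.tolist())
```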
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b','c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
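A shorter variant of the same masking idea, offered as a sketch and not as the answer's own method, uses Series.where, which keeps values where the condition is True and puts NaN elsewhere. No temporary column is needed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

for letter in ['b', 'c']:
    # True where the identifier in 'a' appears in the matching df_id column.
    keep = df['a'].isin(df_id[letter])
    df[letter] = df[letter].where(keep)

print(df)
```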
If you have a bigger Dataframe and performance is important to you, you can first build a mask df and then apply it to your dataframe.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
b c
0 True False
1 True False
2 False True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
a b c
0 20 1.0 NaN
1 50 NaN NaN
2 100 NaN 1.0

Replace empty list values in Pandas DataFrame with NaN

I know that similar questions have been asked before, but I literally tried every possible solution listed here and none of them worked.
I have a dataframe which consists of dates, strings, empty values, and empty list values. It is huge: 8 million rows.
I want to replace all of the empty list values with NaN, meaning only cells that contain [] and nothing else. Nothing seems to work.
I tried this:
df = df.apply(lambda y: np.nan if (type(y) == list and len(y) == 0) else y)
as advised in the similar question "replace empty list with NaN in pandas dataframe", but it doesn't change anything in my dataframe.
Any help would be appreciated.
Assuming the OP wants to convert both actual empty lists and the string '[]' to NaN, below is a solution.
Setup
#borrowed from piRSquared's answer.
df = pd.DataFrame([
[1, 'hello', np.nan, None, 3.14],
['2017-06-30', 2, 'a', 'b', []],
[pd.to_datetime('2016-08-14'), 'x', '[]', 'z', 'w']
])
df
Out[1062]:
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b []
2 2016-08-14 00:00:00 x [] z w
Solution:
#convert all elements to string first, and then compare with '[]'. Finally use mask function to mark '[]' as na
df.mask(df.applymap(str).eq('[]'))
Out[1063]:
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b NaN
2 2016-08-14 00:00:00 x NaN z w
I'm going to make the assumption that you want to mask actual empty lists.
pd.DataFrame.mask will turn cells that have corresponding True values to np.nan
I want to find actual list values. So I'll use df.applymap(type) to get the type in every cell and see if it is equal to list
I know that [] evaluates to False in a boolean context, so I'll use df.astype(bool) to see.
I'll end up masking those cells that are both list type and evaluate to False
Consider the dataframe df
df = pd.DataFrame([
[1, 'hello', np.nan, None, 3.14],
['2017-06-30', 2, 'a', 'b', []],
[pd.to_datetime('2016-08-14'), 'x', '[]', 'z', 'w']
])
df
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b []
2 2016-08-14 00:00:00 x [] z w
Solution
df.mask(df.applymap(type).eq(list) & ~df.astype(bool))
0 1 2 3 4
0 1 hello NaN None 3.14
1 2017-06-30 2 a b NaN
2 2016-08-14 00:00:00 x [] z w
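For reference, the attempt in the question has no effect because DataFrame.apply passes each whole column (a Series) to the function, so type(y) == list is never True. The sketch below applies the check per cell by mapping over each column; is_empty_list is a hypothetical helper name. It avoids applymap, which newer pandas versions (2.1 and later) deprecate in favor of DataFrame.map.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 'hello', np.nan, None, 3.14],
    ['2017-06-30', 2, 'a', 'b', []],
    [pd.to_datetime('2016-08-14'), 'x', '[]', 'z', 'w'],
])

def is_empty_list(value):
    # True only for real list objects with no elements, not for the string '[]'.
    return isinstance(value, list) and len(value) == 0

# apply hands over one column at a time; map then visits each cell.
empty = df.apply(lambda col: col.map(is_empty_list))
result = df.mask(empty)

print(result)
```

On a frame with millions of rows this still calls a Python function per cell, so restricting it to the columns that can actually contain lists is likely to save time.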
