Efficient and pythonic way to search through a long string - python-3.x

I have created some code that searches through a string and returns True if the string contains an emoji. The strings are found in a column of a pandas dataframe, and one can assume both the strings and the dataframe itself may be arbitrarily long. I then create a new column in my dataframe with these boolean results.
Here is my code:
import emoji

contains_emoji = []
for row in df['post_text']:
    emoji_found = False
    for char in row:
        if emoji.is_emoji(char):
            emoji_found = True
            break
    contains_emoji.append(emoji_found)
df['has_emoji'] = contains_emoji
In an effort to get slicker, I was wondering if anyone could recommend a faster, shorter, or more pythonic way of searching like this?

Use emoji.emoji_count():
import emoji
import pandas as pd

# Create example dataframe
df = pd.DataFrame({'post_text': ['🌍', '😂', 'text 😃', 'abc']})

# Create column based on emoji within text
df['has_emoji'] = df['post_text'].apply(lambda x: emoji.emoji_count(x) > 0)

# Print dataframe
print(df)
OUTPUT:
  post_text  has_emoji
0         🌍       True
1         😂       True
2    text 😃       True
3       abc      False

Why not just:
df["has_emoji"] = df.post_text.apply(emoji.emoji_count) > 0

You can use str.contains with a regex character range; note that this particular range covers only the Emoticons block (roughly U+1F600–U+1F64F), so it will miss emoji outside it:
df['has_emoji'] = df['post_text'].str.contains(r'[\U0001f600-\U0001f650]')
For reference here is a link to the source code for emoji.emoji_count(): https://github.com/carpedm20/emoji/blob/master/emoji/core.py
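If you need broader coverage without the emoji package, a sketch along these lines may help; the ranges below are an assumption covering a few common emoji blocks, not a complete definition of "emoji":
import pandas as pd

# A few common emoji blocks (misc symbols & pictographs, emoticons,
# transport, supplemental symbols); deliberately NOT exhaustive.
emoji_pat = (r'[\U0001F300-\U0001F5FF'
             r'\U0001F600-\U0001F64F'
             r'\U0001F680-\U0001F6FF'
             r'\U0001F900-\U0001F9FF]')

df = pd.DataFrame({'post_text': ['🌍', '😂', 'text 😃', 'abc']})
df['has_emoji'] = df['post_text'].str.contains(emoji_pat)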

Related

Replace items like A2 as AA in the dataframe

I have a list of items like "A2BCO6" and "ABC2O6". I want to replace them like this: A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. The number of items is much larger than shown here.
My dataframe looks like:
listAB:
  Finctional_Group
0        Ba2NbFeO6
1        Ba2ScIrO6
3         MnPb2WO6
I created a duplicate array and tried to replace the values this way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i, j in range(len(B)), range(len(C)):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce the correct output. The output is:
listAB:
  Finctional_Group
0       PbPbNbFeO6
1       PbPbScIrO6
3         MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
For simplicity I used the chemparse package, which seems to suit your needs.
As always, we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict mapping each element in a molecular formula to its frequency.
def parse_molecule(molecule: str) -> str:
    # initialize an empty string
    molecule_in_string = ""
    # iterate over all keys & values in the dict
    for key, value in chemparse.parse_formula(molecule).items():
        # append each element repeated by its count
        molecule_in_string += key * int(value)
    return molecule_in_string
molecule_in_string now contains the molecular formula without numbers. We just need to map this function over all elements in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
  Finctional_Group
0   BaBaNbFeOOOOOO
1   BaBaScIrOOOOOO
2    MnPbPbWOOOOOO
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
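As an aside, the original loop misbehaves because for i, j in range(len(B)), range(len(C)): iterates over a tuple of two ranges and unpacks each one, so i, j is (0, 1) on every pass and "Ba2" is replaced with C[1], i.e. "PbPb". If you would rather avoid a third-party dependency, here is a regex sketch (mine, not from the answer above) that expands the counts directly:
import re
import pandas as pd

def expand_counts(formula: str) -> str:
    # repeat each element symbol by its trailing count, e.g. "Ba2" -> "BaBa"
    return re.sub(r'([A-Z][a-z]?)(\d+)',
                  lambda m: m.group(1) * int(m.group(2)),
                  formula)

listAB = pd.DataFrame({"Finctional_Group": ["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]})
listAB["Finctional_Group"] = listAB["Finctional_Group"].map(expand_counts)
print(listAB)  # BaBaNbFeOOOOOO, BaBaScIrOOOOOO, MnPbPbWOOOOOO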

Python: Using Pandas and Regex to Clean Phone Numbers with Country Code

I'm attempting to use pandas to clean phone numbers so that it returns only the 10-digit phone number, removing the country code (if present) and any special characters.
Here's some sample code:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series.str.replace(r1, '', regex=True)
Returns
0    11921674056
1     1233454568
2     1233455678
3     1231231234
As you can see, this regex works well except for the country code, and unfortunately the system I'm loading into cannot accept the country code. What I'm struggling with is finding a regex that will strip the country code as well. All the regexes I've found will match the 10 digits I need, and in this case, using pandas, I need to not match them.
I could easily write a function and use .apply but I feel like there is likely a simple regex solution that I'm missing.
Thanks for any help!
I don't think regex is necessary here, which is nice because regex is a pain in the buns.
To amend your current solution:
phone_series = pandas.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
r1 = '[^0-9]+'
phone_series = phone_series.str.replace(r1, '', regex=True)
phone_series = phone_series.apply(lambda x: x[-10:])
My lazier solution:
>>> phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568', '1233455678', '(123) 123-1234'])
>>> p2 = phone_series.apply(lambda x: ''.join([i for i in x if str.isnumeric(i)])[-10:])
>>> p2
0    1921674056
1    1233454568
2    1233455678
3    1231231234
dtype: object
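If you do want a single regex pipeline, here is a sketch (mine, assuming US-style numbers where the only possible country code is a leading 1): strip the non-digits first, then drop the optional leading 1.
import pandas as pd

phone_series = pd.Series(['+1(192) 167-4056', '123-345-4568',
                          '1233455678', '(123) 123-1234'])

cleaned = (phone_series
           .str.replace(r'\D+', '', regex=True)             # keep digits only
           .str.replace(r'^1(?=\d{10}$)', '', regex=True))  # drop leading "1" country code
print(cleaned)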

How to remove a row when none of a column's string values match any item from a given list?

Please help me complete this piece of code, and let me know if any other detail is required.
Thanks in advance!
Given: a column 'PROD_NAME' from pandas dataframe of string type (e.g. Smiths Crinkle Cut Chips Chicken g), a list of certain words (['Chip', 'Chips' etc])
To do: if none of the words from the list is contained in the strings of the dataframe objects, we drop the whole row. Basically we're removing unnecessary products from a dataframe.
This is what the data looks like (screenshot not reproduced here):
Here's my code:
# create a function to keep only those products which have
# chip, chips, doritos, dorito, pringle, Pringles, Chps, chp in their name
def onlyChips(df, *cols):
    temp = []
    chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
    copy = cp.deepcopy(df)
    for col in [*cols]:
        for i in range(len(copy[col])):
            for item in chips:
                if item not in copy[col][i]:
                    flag = False
                else:
                    flag = True
                    break
            # drop only those strings with no match from the chips list (flag never became True)
            if not flag:
                # drop the whole row
    return <new created dataframe>
new = onlyChips(df_txn, 'PROD_NAME')
Filter the rows instead of deleting them: create a boolean mask by running str.contains on each column you need to search, check row-wise whether any column matches, and keep only the rows that do.
search_cols = ['PROD_NAME']
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
df = df[mask]
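A quick self-contained check of this approach (the sample rows below are made up for illustration):
import pandas as pd

chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
df = pd.DataFrame({'PROD_NAME': ['Smiths Crinkle Cut Chips Chicken 170g',
                                 'Old El Paso Salsa Dip Tomato Mild 300g']})

search_cols = ['PROD_NAME']
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
print(df[mask])  # only the Chips row survives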

Using pandas to manipulate number format

Just out of curiosity: I have a name list with phone numbers in a csv file, and I want to change these phone numbers from ########### (11 digits) to the format ###-####-####, adding two hyphens between the 3rd and 4th and the 7th and 8th digits.
Is this possible?
If it's a DataFrame you can use apply with a format string:
df
           num
0  09187543839
1  08745763412
df.num = df.num.apply(lambda x: "{}-{}-{}".format(x[:3], x[3:7], x[7:]))
df
             num
0  091-8754-3839
1  087-4576-3412
Yes, it is possible. Below is a code-snippet that accomplishes what you want:
phone = str(55512354567)
print(f'{phone[:3]}-{phone[3:7]}-{phone[7:]}')
You can adapt the above idea to your Pandas dataframe as shown below:
# Sample data
data_df = pd.DataFrame([[55512345678], [55587654321]], columns=['phone'])
# Create a string column
data_df['phone_str'] = data_df['phone'].map(lambda x: str(x))
# Convert the column values to the right format
data_df['phone_str'] = data_df['phone_str'].map(lambda x: f'{x[:3]}-{x[3:7]}-{x[7:]}')
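One caveat worth flagging: if pandas reads the phone column as integers, a leading zero (as in 09187543839) is lost before any formatting happens. Reading the column as strings avoids that; the file and column names below are illustrative, not from the question:
import pandas as pd

# keep phone numbers as text so leading zeros survive the CSV round-trip
df = pd.read_csv('contacts.csv', dtype={'num': str})
df['num'] = df['num'].apply(lambda x: f'{x[:3]}-{x[3:7]}-{x[7:]}')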
I may not be using pandas, but this could potentially work...
n = 3
n1 = 7
s = "12345678901"
l, m, r = s[:n], s[n:n1], s[n1:]
final = l + "-" + m + "-" + r
print(final)
Output:
123-4567-8901

Pandas: select rows where a value in a column does not start with a string

I have data where I need to filter out any rows that start with certain values (emphasis on the plural):
Below is the data exactly as it appears in the file data.xlsx:
Name           Remains
GESDSRPPZ0161  TRUE
RT6000996      TRUE
RT6000994      TRUE
RT6000467      TRUE
RT6000431      TRUE
MCOPSR0034     FALSE
MCOPSR0033     FALSE
I need to be able to return a dataframe where Name DOES NOT start with MCO, GE, etc.
import pandas as pd
import numpy as np
### data
file = r'C:\Users\user\Desktop\data.xlsx'
data = pd.read_excel(file, na_values = '')
data['name'] = data['name'].str.upper()
prefixes = ['IM%','JE%','GE%','GV%','CHE%','MCO%']
new_data = data.select(lambda x: x not in prefixes)
new_data.shape
The last call returns exactly the same dataset I started with.
I tried:
pandas select from Dataframe using startswith
but it excludes data if the string appears elsewhere (not only at the start):
df = df[df['Column Name'].isin(['Value']) == False]
The above answer would work if I knew the exact string in question; however, it changes (the common parts are MCOxxxxx, GVxxxxxx, GExxxxx...).
The very same happens with this one:
How to implement 'in' and 'not in' for Pandas dataframe
because the values I pass have to be exact. Is there any way to do this using the same logic as here (is there any equivalent of SQL wildcard characters?):
How do I select rows where a column value starts with a certain string?
Thanks for the help! Can we expand on the below, please?
@jezrael: although I've chosen the other solution for simplicity (and for my lack of understanding of your solution), I'd like to ask for a bit of explanation. What does '^' + '|^' do in this code, and how is it different from Wen's solution? How does it compare performance-wise to a for-loop construct, as opposed to an operation on a Series like map or apply? If I understand correctly, contains() is not bothered by location, whereas startswith() specifically looks at the beginning of the string. Does the ^ tell contains() to start at the beginning? And is | another special character for the method, or is it treated like a logical OR? I'd really like to learn this if you don't mind sharing. Thanks.
You can use startswith; the ~ in front negates the mask, turning "starts with" into "does not start with":
prefixes = ['IM','JE','GE','GV','CHE','MCO']
df[~df.Name.str.startswith(tuple(prefixes))]
Out[424]:
        Name  Remains
1  RT6000996     True
2  RT6000994     True
3  RT6000467     True
4  RT6000431     True
Use str.contains with ^ for start of string and filter by boolean indexing:
prefixes = ['IM','JE','GE','GV','CHE','MCO']
pat = '|'.join([r'^{}'.format(x) for x in prefixes])
df = df[~df['Name'].str.contains(pat)]
print(df)
        Name  Remains
1  RT6000996     True
2  RT6000994     True
3  RT6000467     True
4  RT6000431     True
Thanks, @Zero, for another solution:
df = df[~df['Name'].str.contains('^' + '|^'.join(prefixes))]
print(df)
        Name  Remains
1  RT6000996     True
2  RT6000994     True
3  RT6000467     True
4  RT6000431     True
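To answer the follow-up questions above: '^' + '|^'.join(prefixes) builds a single alternation in which every branch is anchored to the start of the string, so contains() effectively behaves like startswith() here. A small sketch to make that concrete:
prefixes = ['IM', 'JE', 'GE', 'GV', 'CHE', 'MCO']
pat = '^' + '|^'.join(prefixes)
print(pat)  # ^IM|^JE|^GE|^GV|^CHE|^MCO
# '^' anchors each alternative to the start of the string;
# '|' is regex alternation, i.e. a logical OR between the branches.
# Without the '^' anchors, contains() would also match mid-string.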
