I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code that is supposed to split the string after every three words:
import pandas as pd
import numpy as np
df1 = {
'State':['Arizona AZ asdf hello abc','Georgia GG asdfg hello def','Newyork NY asdfg hello ghi','Indiana IN asdfg hello jkl','Florida FL ASDFG hello mno']}
df1 = pd.DataFrame(df1,columns=['State'])
df1
def splitTextToTriplet(df):
    text = df['State'].str.split()
    n = 3
    grouped_words = [' '.join(str(text[i:i+n]) for i in range(0,len(text),n))]
    return grouped_words

splitTextToTriplet(df1)
Currently the output is as such:
['0 [Arizona, AZ, asdf, hello, abc]\n1 [Georgia, GG, asdfg, hello, def]\nName: State, dtype: object 2 [Newyork, NY, asdfg, hello, ghi]\n3 [Indiana, IN, asdfg, hello, jkl]\nName: State, dtype: object 4 [Florida, FL, ASDFG, hello, mno]\nName: State, dtype: object']
But I am actually expecting this output as a DataFrame with 5 rows and one column:
['Arizona AZ asdf', 'hello abc']
['Georgia GG asdfg', 'hello def']
['Newyork NY asdfg', 'hello ghi']
['Indiana IN asdfg', 'hello jkl']
['Florida FL ASDFG', 'hello mno']
How can I change the code so that it produces the expected output?
For efficiency, you can use a regex and str.extractall + groupby/agg:
(df1['State']
.str.extractall(r'((?:\w+\b\s*){1,3})')[0]
.groupby(level=0).agg(list)
)
output:
0 [Arizona AZ asdf , hello abc]
1 [Georgia GG asdfg , hello def]
2 [Newyork NY asdfg , hello ghi]
3 [Indiana IN asdfg , hello jkl]
4 [Florida FL ASDFG , hello mno]
regex:
(                 # start capturing group
  (?:\w+\b\s*)    # a word followed by optional whitespace
  {1,3}           # repeated one to three times
)                 # end capturing group
You can do:
def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    grouped_words = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
    return grouped_words

df1.apply(lambda row: splitTextToTriplet(row), axis=1)
which gives the following DataFrame as output:
                                    0
0    ['Arizona AZ asdf', 'hello abc']
1   ['Georgia GG asdfg', 'hello def']
2   ['Newyork NY asdfg', 'hello ghi']
3   ['Indiana IN asdfg', 'hello jkl']
4   ['Florida FL ASDFG', 'hello mno']
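If you need the result back as a one-column DataFrame, as described in the question, here is a minimal sketch building on the apply approach above (the column name 'State' for the output is just an assumption):
import pandas as pd

df1 = pd.DataFrame({'State': ['Arizona AZ asdf hello abc',
                              'Georgia GG asdfg hello def']})

def splitTextToTriplet(row):
    text = row['State'].split()
    n = 3
    return [' '.join(text[i:i+n]) for i in range(0, len(text), n)]

# apply returns a Series of lists; to_frame wraps it in a one-column DataFrame
out = df1.apply(splitTextToTriplet, axis=1).to_frame(name='State')
print(out)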
Related
I have a string as:
s=
"(2021-06-29T10:53:42.647Z) [Denis]: hi
(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
(2021-06-29T11:58:29.053Z) [Nicholas]:
(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##"
I want to extract the text from it. Expected output as:
comments=['hi','TA FOR SHOWING','how are you bane',' ','#END_REMOTE#','VAL 01JUL2021','##ENDED AT 08:07 GMT##']
What I have tried is:
comments=re.findall(r']:\s+(.*?)\n',s)
The regex works well, but I'm not able to get the blank text as ''.
You can exclude matching the ] in the capture group instead. If you also want to match the value on the last line, you can assert the end of the line with $ (together with re.MULTILINE) instead of matching a mandatory newline with \n.
Note that \s can match a newline, and the negated character class [^]]* can also match a newline.
]:\s+([^]]*)$
import re
regex = r"]:\s+([^]]*)$"
s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
"(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
"(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
"(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
print(re.findall(regex, s, re.MULTILINE))
Output
['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
If you don't want to cross lines:
]:[^\S\n]+([^]\n]*)$
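A minimal, self-contained sketch of that variant with a shortened sample:
import re

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]:     \n"
     "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

# [^\S\n]+ and [^]\n]* both exclude newlines, so a single match cannot span lines
print(re.findall(r']:[^\S\n]+([^]\n]*)$', s, re.MULTILINE))
# ['hi', '', '##ENDED AT 08:07 GMT##']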
You could capture everything after the colon into a list via capture group 1:
re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)
then loop over the list, assigning a space to all empty elements.
>>> import re
>>>
>>> s= """
... (2021-06-29T10:53:42.647Z) [Denis]: hi
... (2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
... (2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane
... (2021-06-29T11:58:29.053Z) [Nicholas]:
... (2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
... (2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
... (2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##
... """
>>>
>>> talk = [re.sub('^$', ' ', w) for w in re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)]
>>> print(talk)
['hi', 'TA FOR SHOWING', 'how are you bane', ' ', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
Is this what you want?
comments = re.findall(r']:\s(.*?)\n',s)
\s matches a single whitespace character, while \s+ matches one or more. If there is always exactly one space after the colon, \s+ can be \s.
With your shown samples, please try the following regex.
^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$
Explanation: a detailed explanation of the above regex.
^\(\d{4}-\d{2}-\d{2}    ##Match, from the start of the line, a ( followed by 4 digits, -, 2 digits, -, 2 digits.
T(?:\d{2}:){2}          ##Match T followed by a non-capturing group of 2 digits and a colon, repeated 2 times.
\d{2}\.\d{3}Z\)\s+      ##Match 2 digits, a dot, 3 digits, then Z and ), followed by whitespace.
\[[^]]*\]:\s+           ##Match a literal [ up to the first ], then ] and a colon, followed by whitespace.
([^)]*)$                ##1st capturing group: everything except ) up to the end of the line.
With Python 3.x:
import re
regex = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$"
varVal = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
"(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
"(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
"(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
"(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
print(re.findall(regex, varVal, re.MULTILINE))
Output will be as follows with samples shown by OP:
['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
Following is the test string, and we need to replace the '\xa0' in it with '':
'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
I was using the following line in Python to achieve this, but to no avail:
new_str = str.replace(r'\\xa', '')
but the output is the same:
'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
I think you are trying to replace the Unicode character '\xa0':
s = 'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
s = s.replace('\xa0', '')
print(s)
#'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-641034'
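If you would rather turn the non-breaking space into a regular space instead of deleting it, a small sketch using Unicode normalization also works (this is an additional option, not part of the answer above):
import unicodedata

s = 'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:-\xa0641034'
# NFKC normalization maps U+00A0 (the non-breaking space) to an ordinary space
print(unicodedata.normalize('NFKC', s))
#'FF 6 VRV AVENUE SUBRAMANIYAM PALAYAM PinCode:- 641034'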
Code example
from itertools import *
from collections import Counter
from tqdm import *

#for i in tqdm(Iterable):
for i in combinations_with_replacement(['1','2','3','4','5','6','7','8'], 8):
    b = ''.join(i)
    if b == '72637721':
        print(b)
When I try product I have:
for i in product(['1','2','3','4','5','7','6','8'], 8):
TypeError: 'int' object is not iterable
How can I get all combinations? (I believed it would work before testing; now I see what I did wrong.)
I read that combinations_with_replacement returns all of them, but as I can see, that's not true.
I use Python 3.8.
Output, as asked:
11111111 11111112 11111113 11111114 11111115 11111116 11111117
11111118 11111122 11111123 11111124 11111125 11111126 11111127
11111128 11111133 11111134 11111135 11111136 11111137 11111138
11111144 11111145 11111146 11111147 11111148 11111155 11111156
11111157 11111158 11111166 11111167 11111168 11111177 11111178
11111188 11111222 11111223 11111224 11111225 11111226 11111227
11111228 11111233 11111234 11111235 11111236 11111237 11111238
11111244 11111245 11111246 11111247 11111248 11111255 11111256
11111257 11111258 11111266 11111267 11111268 11111277 11111278
11111288
and what it gives at the end:
56666888 56668888 56688888 56888888 58888888 77777777 77777776
77777778 77777766 77777768 77777788 77777666 77777668 77777688
77777888 77776666 77776668 77776688 77776888 77778888 77766666
77766668 77766688 77766888 77768888 77788888 77666666 77666668
77666688 77666888 77668888 77688888 77888888 76666666 76666668
76666688 76666888 76668888 76688888 76888888 78888888 66666666
66666668 66666688 66666888 66668888 66688888 66888888 68888888
88888888
To think of it more clearly: it should count from 1111 1111 to 8888 8888, but for characters, which is why I tried to do it with permutations/combinations with repetition.
It misses some possible combinations of those symbols.
As an example of what I am trying to do: generate all permutations of possible variants of hex numbers, like from 0 to F, but not only for them; make this possible for any characters.
['1','2','3','4','5','6','7','8'] is only an example;
it could be ['a','b','x','c','d','g','r','8'], etc.
The solution is to use itertools.product instead of combinations_with_replacement:
from itertools import *

for i in product(['1','2','3','4','5','6','7','8'], repeat=8):
    b = ''.join(i)
    if b == '72637721':
        print(b)
For comparison:
itertools.product('ABCD', 'ABCD')                  -> AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD  # full Cartesian product: repeats (AA) and mirrored pairs (AB/BA)
itertools.permutations('ABCD', 2)                  -> AB AC AD BA BC BD CA CB CD DA DB DC              # no repeats, mirrored pairs kept
itertools.combinations_with_replacement('ABCD', 2) -> AA AB AC AD BB BC BD CC CD DD                    # repeats kept, no mirrored pairs
itertools.combinations('ABCD', 2)                  -> AB AC AD BC BD CD                                # no repeats, no mirrored pairs
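A small runnable sketch of the same comparison, using 2-element picks from 'ABCD' so the output stays short:
from itertools import product, permutations, combinations_with_replacement, combinations

pool = 'ABCD'
variants = [('product', product(pool, repeat=2)),
            ('permutations', permutations(pool, 2)),
            ('combinations_with_replacement', combinations_with_replacement(pool, 2)),
            ('combinations', combinations(pool, 2))]
for name, it in variants:
    # join each tuple back into a string so the four result sets are easy to compare
    print(name, [''.join(p) for p in it])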
Here's the updated code that will print all the combinations. It does not matter if your list has strings and numbers.
To ensure that the combination length matches the number of elements in your list, I recommend that you do:
comb_list = [1, 2, 3, 'a']
comb_len = len(comb_list)
and replace the line with:
comb = combinations_with_replacement(comb_list, comb_len)
from itertools import combinations_with_replacement

comb = combinations_with_replacement([1, 2, 3, 'a'], 4)
for i in list(comb):
    print(''.join([str(j) for j in i]))
This will result as follows:
1111
1112
1113
111a
1122
1123
112a
1133
113a
11aa
1222
1223
122a
1233
123a
12aa
1333
133a
13aa
1aaa
2222
2223
222a
2233
223a
22aa
2333
233a
23aa
2aaa
3333
333a
33aa
3aaa
aaaa
I don't know what you are trying to do. Here's an attempt to start a dialogue to get to the final answer:
samples = [1,2,3,4,5,'a','b']
len_samples = len(samples)
for elem in samples:
    print(str(elem)*len_samples)
The output of this will be as follows:
1111111
2222222
3333333
4444444
5555555
aaaaaaa
bbbbbbb
Is this what you want? If not, explain in your question what you expect as output.
Given a dataframe as follows:
firstname lastname email_address \
0 Doug Watson douglas.watson@dignityhealth.org
1 Nick Holekamp nick.holekamp@rankenjordan.org
2 Rob Schreiner rob.schriener@wellstar.org
3 Austin Phillips austin.phillips@precmed.com
4 Elise Geiger egeiger@puracap.com
5 Paul Urick purick@diplomatpharmacy.com
6 Michael Obringer michael.obringer@lashgroup.com
7 Craig Heneghan cheneghan@west-ward.com
8 Kathy Hirst kathleen.hirst@sunovion.com
9 Stefan Bluemmers stefan.bluemmers@grunenthal.com
companyname
0 Dignity Health
1 Ranken Jordan Pediatric Bridge Hospital
2 WellStar Health System
3 Precision Medical Products, Inc.
4 puracap.com
5 Diplomat Specialty Pharmacy
6 Lash Group
7 West-Ward Pharmaceuticals
8 Sunovion Pharmaceuticals
9 Grünenthal Group
How could I create possible email addresses using common email patterns as such: firstlast@example.com, first.last@example.com, f.last@example.com, lastF@example.com, first_last@example.com, firstL@example.com, etc.
df['email1'] = df.firstname.str.lower() + '.' + df.lastname.str.lower() + '@' + df.companyname.str.replace(r'\s+', '', regex=True).str.lower() + '.com'
print(df['email1'])
Out:
0 doug.watson@dignityhealth.com
1 nick.holekamp@rankenjordanpediatricbridgehospi... --->problematic
2 rob.schreiner@wellstarhealthsystem.com
3 austin.phillips@precisionmedicalproducts,inc..com --->problematic
4 elise.geiger@puracap.com.com --->problematic
...
9995 terry.hanley@kempersportsmanagement.com
9996 christine.marks@geocomp.com
9997 darryl.rickner@doe.com
9998 lalit.sharma@lovelylifestyle.com
9999 parul.dutt@infibeam.com
Some of them seem quite problematic; could anyone help solve this issue? Thanks a lot.
EDITED:
print(df) after applying @Sajith Herath's solution:
Out:
firstname lastname companyname \
0 Nick Holekamp Ranken ...
email
0 nick. ...
You can use a method to create permutations of the username with different separators, and define a max length to simplify the domain derived from the company name, as follows:
import pandas as pd
import random
import re

data = {"firstname": ["Nick"], "lastname": ["Holekamp"],
        "companyname": ["Ranken Jordan Pediatric Bridge Hospital"]}
df = pd.DataFrame(data=data)
max_char = 5
emails = []

def simplify_domain(text):
    if len(text) > max_char:
        # long company names are reduced to their initials
        text = ''.join([c for c in text if c.isupper()])
        return text.lower()
    return re.sub(r"\s+", "", text).lower()

def username_permutations(first_name, last_name):
    # define separators
    separators = [".", "_", "-"]
    # lower case
    combinations = list(map(lambda x: f"{first_name.lower()}{x}{last_name.lower()}", separators))
    # append a random number to the tail
    n = random.randint(1, 100)
    combinations.extend(list(map(lambda x: f"{x}{n}", combinations)))
    return combinations

for index, row in df.iterrows():
    usernames = username_permutations(row["firstname"], row["lastname"])
    email_permutations = list(map(lambda x: f"{x}@{simplify_domain(row['companyname'])}.com", usernames))
    emails.append(','.join(email_permutations))

df["email"] = emails
The final result will be nick.holekamp@rjpbh.com,nick_holekamp@rjpbh.com,nick-holekamp@rjpbh.com,nick.holekamp66@rjpbh.com,nick_holekamp66@rjpbh.com,nick-holekamp66@rjpbh.com
You can modify the simplify_domain method to validate the given string, for example by removing 'inc' or '.com' values.
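A minimal sketch of such a modification (the suffix list and the example call are assumptions for illustration, not part of the answer above):
import re

def simplify_domain(text, max_char=5):
    # drop common company suffixes and any existing '.com' before building the domain
    text = re.sub(r'\b(inc|llc|ltd|group)\b\.?', '', text, flags=re.IGNORECASE)
    text = text.replace('.com', '')
    if len(text.strip()) > max_char:
        return ''.join(c for c in text if c.isupper()).lower()
    return re.sub(r'\s+', '', text).lower()

print(simplify_domain('Precision Medical Products, Inc.'))  # pmp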
I have a bunch of addresses:
123 Main Street, PO Box 345, Chicago, IL 92921
1992 Super Way, Bakersfield, CA
234 Wonderland Lane, Attn: Daffy Duck, Orlando, FL 09922
How could I cut out the second string in there, when I do myStr.split(',') on each?
The idea is that I want to return:
123 Main Street, Chicago, IL 92921
1992 Super Way, CA
234 Wonderland Lane, Orlando, FL 09922
I could loop through each part, and build yet another string, skipping the second index, but was wondering if there's a better way to do so.
What I have now:
def filter_address(address):
    print("Filtering address on", address)
    updated_addr = ""
    indx = 0
    for section in address.split(","):
        if indx != 1:
            updated_addr = updated_addr + "," + section
        indx += 1
    updated_addr = updated_addr[1:]  # This is to remove the leading `,`
    return updated_addr

new_address = filter_address("123 Main Street, Chicago, IL 92921")
You could use del in Python and glue the components of the string back together with ", " after splitting them.
For example:
address = "123 Main Street, PO Box 345, Chicago, IL 92921".split(",")
del address[1]
pretty_address = ", ".join(address)
print(pretty_address) # Gives 123 Main Street, Chicago, IL 92921
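Wrapped into the filter_address function from the question, a minimal sketch could look like this:
def filter_address(address):
    parts = [p.strip() for p in address.split(",")]
    del parts[1]                      # drop the second component (the PO Box / Attn part)
    return ", ".join(parts)

print(filter_address("123 Main Street, PO Box 345, Chicago, IL 92921"))
# 123 Main Street, Chicago, IL 92921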