I'm trying to split a string into a list of strings. Right now I have to split whenever I see any of these characters: '.', ';', ':', '?', '!', '( )', '[ ]', '{ }' (keep in mind that I have to maintain whatever is inside the brackets).
To solve it I tried to write
print(re.split(r"\(([^)]*)\)|[.,;:?!]\s*", "Hello world,this is(example)"))
but as output I get:
['Hello world', None, 'this is', 'example', '']
Ignoring the empty string '' at the end, which I'll deal with later, how can I remove the None that appears in the middle of the list?
By the way, I can't iterate over the list a second time, because the program has to work with huge files and I have to make it as fast as possible.
Also, I don't necessarily have to use re.split, so anything that works will be just fine!
I'm still new at this so I'm sorry if something is incorrect.
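Not sure if this is fast enough, but you could replace every delimiter with a space and then split on whitespace (note that this also splits "Hello world" into two items):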
re.sub(r";|,|:|\(|\)|\[|\]|\?|\.|\{|\}|!", " ", "Hello world,this is(example)").split()
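If the words between delimiters need to stay together, another option is to split on the delimiters directly without a capturing group, so that re.split never emits None. This is only a minimal sketch: the names DELIMS and split_text are made up for illustration, and it assumes that brackets can be treated as plain delimiters, which keeps their contents as ordinary items. The list comprehension drops the empty strings in a single pass over the result.

```python
import re

# Hypothetical helper names; the delimiter set is taken from the question.
# The non-capturing group (?:...) means re.split adds no None entries, and
# the trailing + merges runs of delimiters such as "). " into one split.
DELIMS = re.compile(r"(?:[.,;:?!()\[\]{}]\s*)+")

def split_text(text):
    """Split on the delimiters and drop empty strings from the result."""
    return [part for part in DELIMS.split(text) if part]

print(split_text("Hello world,this is(example)"))
# ['Hello world', 'this is', 'example']
```

Compiling the pattern once and reusing it for every line should help when the input files are large.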
I am trying to split a target sentence into composite pieces for a later function using re.split() and the regex
(#?\w+)(\W+)
Ideally, this would split words and non-word characters in a generated list, preserving both as separate list items, with the exception of the "#" symbol which could precede a word. If there is an # symbol before a word, I want to keep it as a cohesive item in the split. My example is below.
My test sentence is as follows:
this is a test of proper nouns #Ryan
So the line of code is:
re.split(r'(#?\w+)(\W+)', "this is a test of proper nouns #Ryan")
The list that I want to generate would include "#Ryan" as a single item but, instead, it looks like this
['', 'this', ' ', '', 'is', ' ', '', 'a', ' ', '', 'test', ' ', '', 'of', ' ', '', 'proper', ' ', '', 'nouns', ' #', 'Ryan']
Since the first container has the # symbol, I would have thought that it would be evaluated first but that is apparently not the case. I have tried using lookaheads or removing # from the \W+ container to no avail.
https://regex101.com/r/LeezvP/1
With your shown samples, please try the following (written and tested in Python 3.8.5), assuming that you need to remove the empty items from your list. This gives output where # stays together with its word.
##First split the text/line here and save it to list named li.
li=re.split(r'(#?\w+)(?:\s+)', "this is a test of proper nouns #Ryan")
li
['', 'this', '', 'is', '', 'a', '', 'test', '', 'of', '', 'proper', '', 'nouns', '#Ryan']
##Use filter to remove nulls in list li.
list(filter(None, li))
['this', 'is', 'a', 'test', 'of', 'proper', 'nouns', '#Ryan']
Simple explanation: call re.split with one capturing group that matches an optional # followed by word characters, and one non-capturing group that matches one or more whitespace characters. This leaves empty strings in the list, so use filter to remove them.
NOTE: As per the OP's comments, the empty strings/spaces may be required; in that case, refer to the following code, which worked for the OP:
li=re.split(r'(#?\w+)(\s+|\W+)', "this is a test of proper nouns #Ryan")
You could also match using re.findall and use an alternation | matching the desired parts.
(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+
Explanation
(?: Non capture group
[^#\w\s]+ Match 1+ times any char except # word char or whitespace char
| Or
#(?!\w) Match # when not directly followed by a word char
)+ Close the group and match 1+ times
| Or
\s+ Match 1+ whitespace chars to keep them as a separate match in the result
| Or
#?\w+ Match # directly followed by 1+ word chars
Example
import re
pattern = r"(?:[^#\w\s]+|#(?!\w))+|\s+|#?\w+"
print(re.findall(pattern, "this is a test of proper nouns #Ryan"))
# Output
# ['this', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'of', ' ', 'proper', ' ', 'nouns', ' ', '#Ryan']
print(re.findall(pattern, "this #Ryan #$#test#123#4343##$%$test#1#$#$###1####"))
# Output
# ['this', ' ', '#Ryan', ' ', '#$', '#test', '#123', '#4343', '##$%$', 'test', '#1', '#$#$##', '#1', '####']
The regex, #?\w+|\b(?!$) should meet your requirement.
Explanation at regex101:
1st Alternative #?\w+
# matches the character # literally (case sensitive)
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative \b(?!$)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Negative Lookahead (?!$)
Assert that the Regex below does not match
$ asserts position at the end of a line
This is a follow up from a question I asked yesterday which I got brilliant responses for but now I have more problems :P
(How do I get python to detect a right brace, and put a space after that?)
Say I have this string that's in a txt document which I make Python read
!0->{100}!1o^{72}->{30}o^{72}->{30}o^{72}->{30}o^{72}->{30}o^{72}->{30}
I want to separate this conjoined string into individual components that can be indexed after detecting a certain symbol.
If it detects !0, it's considered as one index.
If it detects ->{100}, that is also considered as another part of the list.
It separates all of them into different parts until the computer prints out:
!0, ->{100}, !1, o^{72}, ->{30}
Starting from yesterday's code, I tried a plethora of things.
I tried this technique which separates anything with '}' perfectly but has a hard time separating !0
text = "(->{200}o^{90}->{200}o^{90}->{200}o^{90}!0->{200}!1o^{90})"  # this is an example string
my_string = ""
for character in text:
    my_string += character
    if character == "}":
        my_string += ","  # up until this point, Guimonte's code perfectly splits "}"
    elif character == "0":  # here is where I tried to get it to detect !0. it splits that, but places ',' on all zeroes
        my_string += ","
print(my_string)
The output:
(->{20,0,},o^{90,},->{20,0,},o^{90,},->{20,0,},o^{90,},!0,->{20,0,},!1o^{90,},)
I want the output to instead be:
(->{200}, o^{90}, ->{200}, o^{90}, ->{200}, o^{90}, !0, ->{200}, !1, o^{90})
It separates !0 but it also messes with the other symbols.
I'm starting to approach a checkmate scenario. Is there any way I can get it to split !0 and !1 as well as the right brace?
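One way to approach this, sketched under assumptions, is to describe each kind of token with a regular expression and let re.findall collect them in order, instead of inserting commas character by character. The names TOKEN and tokenize are made up for illustration, and the sketch assumes the only token shapes are ! followed by one digit, ->{digits} and o^{digits}, as in the examples above.

```python
import re

# Hypothetical names. One alternative per token shape from the examples:
#   !<digit>   ->{<digits>}   o^{<digits>}
TOKEN = re.compile(r"!\d|->\{\d+\}|o\^\{\d+\}")

def tokenize(text):
    """Return the tokens in order; characters that fit no shape are skipped."""
    return TOKEN.findall(text)

tokens = tokenize("(->{200}o^{90}->{200}o^{90}->{200}o^{90}!0->{200}!1o^{90})")
print("(" + ", ".join(tokens) + ")")
# (->{200}, o^{90}, ->{200}, o^{90}, ->{200}, o^{90}, !0, ->{200}, !1, o^{90})
```

Because the result is a list, each component can be indexed directly, for example tokens[6] is '!0'.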
I'm new to programming in Python, and I have made a script to move files from one location to another.
Now I want to have a logfile for it, but I can't find a way to format the text it puts in the logfile.
I have the following code:
#logging
log= 'Succesfully moved', x, 'to', moveto
logging.basicConfig(filename='\\\\fatboy.leleu.be\\iedereen\\Glenn\\insitecopy.log',filemode='a',level=logging.INFO,format='%(asctime)s %(message)s',datefmt='%d/%m/%Y ' ' %I:%M:%S %p')
logging.info(log)
The output is this:
14/12/2018 08:54:17 AM ('Succesfully moved', '2126756_landrover.pdf', 'to', '\\\\fatboy.leleu.be\\MPWorkflow\\Jobs\\2126756_test\\PDF Druk')
14/12/2018 08:54:17 AM ('Succesfully moved', '2126757_landrover - kopie.pdf', 'to', '\\\\fatboy.leleu.be\\MPWorkflow\\Jobs\\2126757_test2\\PDF Druk')
Now I want to remove the brackets, the apostrophes and the commas, but I don't know how.
The simplest way is to use logging.info(" ".join(log)), because your "log" variable is a tuple. But it will only work if log really is a tuple and contains only str elements.
Python shows tuples in that form as you can see in your log: opening round bracket, items (between apostrophes if item is a string), closing round bracket.
Please try the code below:
log= 'Succesfully moved ' + x + ' to ' + moveto
This
log= 'Succesfully moved', x, 'to', moveto
is creating a tuple. Try something like
log = 'Successfully moved {} to {}'.format(x, moveto)
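As a further option, the logging functions accept a format string followed by its arguments and do the %-style substitution themselves, so no tuple is ever built. The following is a minimal sketch: the file name, the target path and the logger name are hypothetical stand-ins, and an in-memory stream replaces the log file so that the example runs anywhere.

```python
import io
import logging

# Hypothetical stand-ins for the question's variables.
x = "2126756_landrover.pdf"
moveto = r"\\server\share\PDF Druk"

stream = io.StringIO()  # in-memory target instead of a real log file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(message)s", datefmt="%d/%m/%Y %I:%M:%S %p")
)

logger = logging.getLogger("move_log")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the record out of the root logger

# The message and its arguments are passed separately; logging merges them.
logger.info("Successfully moved %s to %s", x, moveto)

logged_line = stream.getvalue()
print(logged_line, end="")
```

The logged line then ends with the plain sentence, without brackets, apostrophes or commas.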
When I try to print data on several lines using Python 3, a single space gets added to the beginning of every line except the first one. For example:
[in] print('a','\n','b','\n','c')
the output will be:
a
 b
 c
but my desired output is:
a
b
c
So far I've only been able to do this with three print commands. Does anyone have any thoughts?
From the docs:
print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
Print objects to the text stream file, separated by sep and followed by end.
sep, end, file and flush, if present, must be given as keyword arguments.
Calling print('a', '\n', 'b') will print each of those three items with a space in between, which is what you are seeing.
You can change the separator argument to get what you want:
print('a', 'b', sep='\n')
Also see the format method.
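For illustration, here is a short sketch of two equivalent ways to get one item per line with no leading space. The names items and joined are made up for the example.

```python
items = ['a', 'b', 'c']  # hypothetical data

# Option 1: unpack the list and let print use a newline as the separator.
print(*items, sep='\n')

# Option 2: build the whole text first, then print it once.
joined = '\n'.join(items)
print(joined)
```

Both calls print a, b and c on separate lines, each starting in the first column.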
I have a text file something like this (suppose A and B are persons and below text is a conversation between them):
A: Hello
B: Hello
A: How are you?
B: I am good. Thanks and you?
I added this conversation into a list that returns below result:
[['A', 'Hello\n'], ['A', 'How are you?\n'], ['B', 'Hello\n'], ['B', 'I am good. Thanks and you?\n']]
I use these commands in a loop:
new_sentence = line.split(': ', 1)[1]
attendees_and_sentences[index].append(person)
attendees_and_sentences[index].append(new_sentence)
print(attendees_and_sentences) # with this command I get the above result
print(attendees_and_sentences[0][1]) # if I run this one, then I don't get "\n" in the sentence.
The problem is those "\n" characters on my result screen. How can I get rid of them?
Thank you.
You can use Python's rstrip function.
For example:
>>> 'my string\n'.rstrip()
'my string'
And if you want to trim the trailing newlines while preserving other whitespace, you can specify the characters to remove, like so:
>>> 'my string \n'.rstrip('\n')
'my string '
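Applied to the loop from the question, this could look like the sketch below. It assumes every line has the form "person: sentence"; the lines list is a hypothetical stand-in for the text file, and the pairs are stored in file order.

```python
# Hypothetical stand-in for the lines read from the text file.
lines = [
    "A: Hello\n",
    "B: Hello\n",
    "A: How are you?\n",
    "B: I am good. Thanks and you?\n",
]

attendees_and_sentences = []
for line in lines:
    # Strip the trailing newline first, then split once on ': '.
    person, sentence = line.rstrip('\n').split(': ', 1)
    attendees_and_sentences.append([person, sentence])

print(attendees_and_sentences)
# [['A', 'Hello'], ['B', 'Hello'], ['A', 'How are you?'], ['B', 'I am good. Thanks and you?']]
```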