How can I split text using pyparsing with a specific token? - python-3.x

PLEASE NOTE:
Splitting text into lines with pyparsing covers parsing a file with a single token at the end of each line (\n), which is fairly easy. My question is different: I am having a hard time separating the leading free text from the filters, that is, stopping the free-text match at the word that comes right before the first ':'.
On our API, user input looks like some free text port:45 title:welcome to our website, and after parsing I need two parts -> [some free text, port:45 title:welcome]
from pyparsing import *
token = "some free text port:45 title:welcome to our website"
t = Word(alphas, " "+alphanums) + Word(" "+alphas,":"+alphanums)
This does give me an error:
pyparsing.ParseException: Expected W:( ABC..., :ABC...), found ':' (at char 21), (line:1, col:22)
This happens because the first Word consumes everything up to some free text port, leaving :45 title:welcome to our website, and the second Word cannot start with ':'.
How can I get all data before port: in a separate group and port:.... in another group using pyparsing?

I know that the question is about pyparsing, but for this specific use I think a regex is more standard and simpler, whereas pyparsing is probably better suited to more complicated parsing problems.
Here is one possible working regex:
^(.+port\:\d+) (title:.+)$
And here the python code:
import re
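# raw string, so the backslashes reach the regex engine unchanged
pattern = r"^(.+port\:\d+) (title:.+)$"
token = "some free text port:45 title:welcome to our website"
m = re.match(pattern, token)
if m:
    # grp1 -> "some free text port:45", grp2 -> "title:welcome to our website"
    grp1, grp2 = m.group(1), m.group(2)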
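Note that the regex above keeps port:45 in the first group. If the goal is instead to separate the free text from everything starting at the first key:value filter, a minimal sketch could look like the following. The function name split_free_text is hypothetical, and it assumes that a filter key is a run of word characters immediately followed by a colon.

```python
import re

def split_free_text(query):
    """Split a query into (free_text, filters) at the first 'key:' token."""
    # Assumption: a filter key is a run of word characters followed by ':'.
    first_filter = re.search(r"\b\w+:", query)
    if first_filter is None:
        return query.strip(), ""
    cut = first_filter.start()
    return query[:cut].strip(), query[cut:].strip()

token = "some free text port:45 title:welcome to our website"
free_text, filters = split_free_text(token)
print(free_text)  # some free text
print(filters)    # port:45 title:welcome to our website
```

Deciding where the last filter value ends (welcome versus welcome to our website) needs an extra rule that the question does not state, so this sketch leaves the filter part unsplit.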

Adding " " as one of the valid characters in a Word pretty much always has this problem, and so is in general a pyparsing anti-pattern. Word does its character repetition matching inside its parse() method, so there is no way to add any kind of lookahead.
To get spaces in your expressions, you will probably need a OneOrMore, wrapped in originalTextFor, like this:
import pyparsing as pp
word = pp.Word(pp.printables, excludeChars=":")
non_tag = word + ~pp.FollowedBy(":")
# tagged value is two words with a ":"
tag = pp.Group(word + ":" + word)
# one or more non-tag words - use originalTextFor to get back
# a single string, including intervening white space
phrase = pp.originalTextFor(non_tag[1, ...])
parser = (phrase | tag)[...]
parser.runTests("""\
some free text port:45 title:welcome to our website
""")
Prints:
some free text port:45 title:welcome to our website
['some free text', ['port', ':', '45'], ['title', ':', 'welcome'], 'to our website']
[0]:
some free text
[1]:
['port', ':', '45']
[2]:
['title', ':', 'welcome']
[3]:
to our website

Related

Get number from string in Python

I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting [] list only.
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)  # url as defined in the question
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex string r'(\d+)', which means capture a run of one or more digits, search through the url. Then captured holds the first captured group, which is 1987.
If you don't want to use this, you can still use your .split() method, but this time pass / as the separator. For example, url.split('/')[-1] returns the last path segment, '1987'.
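The split-based alternative might look like the following minimal sketch. The name record_id is hypothetical, and the sketch assumes the number is always the last path segment of the URL.

```python
url = "www.mylocalurl.com/edit/1987"

# Split once from the right and keep whatever follows the last "/".
last_segment = url.rsplit("/", 1)[-1]

# Convert only if the segment is purely digits; otherwise report "no id".
record_id = int(last_segment) if last_segment.isdigit() else None
print(record_id)  # 1987
```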

Regex: Match x if not surrounded with y [duplicate]

I'm very new to programming and regex, so apologies if this has been asked before (I didn't find it, though).
I want to use Python to summarise word frequencies in a literal text. Let's assume the text is formatted like
Chapter 1
blah blah blah
Chapter 2
blah blah blah
....
Now I read the text as a string, and I want to use re.findall to get every word in this text, so my code is
wordlist = re.findall(r'\b\w+\b', text)
But the problem is that it matches all these Chapters in each chapter title, which I don't want to include in my stats. So I want to ignore what matches Chapter\s*\d+. What should I do?
Thanks in advance, guys.
Solutions
You could remove all Chapter+space+digits first :
wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
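If you want to use just one search, you can use a negative lookahead to match any word that is not the "Chapter" of a "Chapter X" title and that begins with a letter (so the chapter number is skipped too, along with any other token starting with a digit or underscore):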
wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
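The two approaches are not fully equivalent, which a small comparison can show. The sample text below is made up for illustration; it contains a number inside the body text.

```python
import re

text = "Chapter 1\nblah 42 blah\nChapter 2\nblah"

# Two-step: delete the chapter titles, then collect every word.
two_step = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*', '', text))

# Single pass: the lookahead rejects "Chapter" when a number follows,
# and [A-Za-z] rejects any token that starts with a digit.
lookahead = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b', text)

print(two_step)   # ['blah', '42', 'blah', 'blah']
print(lookahead)  # ['blah', 'blah', 'blah']
```

Which result is preferable depends on whether numbers in the body text should count as words.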
If performance is an issue, loading a huge string and parsing it with a Regex wouldn't be the correct method anyway. Just read the file line by line, toss any line that matches r'^Chapter\s*\d+' and parse each remaining line separately with r'\b\w+\b' :
import re
with open("huge_file.txt", "r") as f:
    lines = f.readlines()
wordlist = []
chapter = re.compile(r'^Chapter\s*\d+')
words = re.compile(r'\b\w+\b')
for line in lines:
    # skip chapter titles, collect the words of every other line
    if not chapter.match(line):
        wordlist.extend(words.findall(line))
print(len(wordlist))
Performance
I wrote a small ruby script to write a huge file :
all_dicts = Dir["/usr/share/dict/*"].map{|dict|
  File.readlines(dict)
}.flatten
File.open('huge_file.txt','w+') do |txt|
  newline=true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    if rand<0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    if rand<0.10
      txt.puts
      newline = true
    end
  end
end
The resulting file has more than 50 million words and is about 483MB big :
Chapter 154
schoolyard trashcan's holly's continuations
Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's
Chapter 142
spender's
vests
Ladoga
Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's
The two-step process took 12.2s to extract the wordlist on average, the lookahead method took 13.5s and Wiktor's answer also took 13.5s. The lookahead method I first wrote used re.IGNORECASE, and it took around 18s.
There's basically no difference in performance between all the Regexen methods when reading the whole file.
What surprised me though is that the readlines script took around 20.5s, and didn't use much less memory than the other scripts. If you have any idea how to improve the script, please comment!
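One idea, not benchmarked here, is to iterate over the file object lazily instead of calling readlines(), and to count words rather than accumulate them all in a list. The sketch below uses a hypothetical count_words helper and an in-memory io.StringIO as a stand-in for the real file; it only shows the structure and makes no claim about speed.

```python
import io
import re

CHAPTER = re.compile(r'^Chapter\s*\d+')
WORD = re.compile(r'\b\w+\b')

def count_words(lines):
    """Count words in an iterable of lines, skipping chapter titles."""
    total = 0
    for line in lines:  # a file object yields one line at a time
        if not CHAPTER.match(line):
            total += len(WORD.findall(line))
    return total

# Stand-in for: with open("huge_file.txt") as f: count_words(f)
sample = io.StringIO("Chapter 1\nblah blah blah\nChapter 2\nblah blah\n")
print(count_words(sample))  # 5
```

This avoids holding every line and every word in memory at once, which may help with memory use if only the count or a running frequency table is needed.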
Match what you do not need and capture what you need, and use this technique with re.findall that only returns captured values:
re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s)
Details:
\bChapter\s*\d+\b - Chapter as a whole word followed with 0+ whitespaces and 1+ digits
| - or
\b(\w+)\b - match and capture into Group 1 one or more word chars
To avoid getting empty values in the resulting list, filter it:
import re
s = "Chapter 1: Black brown fox 45"
# list(...) is needed on Python 3, where filter() returns an iterator
print(list(filter(None, re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s))))  # ['Black', 'brown', 'fox', '45']
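Since the question is about word frequencies, the captured words can be fed into collections.Counter. This is a minimal sketch on made-up sample text, using the same pattern as above.

```python
import re
from collections import Counter

text = "Chapter 1\nblah foo blah\nChapter 2\nfoo blah\n"

# Chapter titles match the first alternative and yield '' for group 1;
# the comprehension drops those empty strings.
matches = re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', text)
words = [w for w in matches if w]

freq = Counter(words)
print(freq.most_common())  # [('blah', 3), ('foo', 2)]
```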

How do I search for a substring in a string then find the character before the substring in python

I am making a small project in python that lets you make notes and then read them by using specific arguments. I attempted to write an if statement that checks whether the string has a comma in it. If it does, my python file should find the comma, then take the character right before that comma and turn it into an integer, so it can read out the notes the user created in a specific user-defined range.
If that didn't make sense then basically all I am saying is that I want to find out what line/bit of code is causing this to not work and return nothing even though notes.txt has content.
Here is what I have in my python file:
if "," not in no_cs: # no_cs is the string I am searching through
    user_out = int(no_cs[6:len(no_cs) - 1])
    notes = open("notes.txt", "r") # notes.txt is the file that stores all the notes the user makes
    notes_lines = notes.read().split("\n") # this is suppose to split all the notes into a list
    try:
        print(notes_lines[user_out])
    except IndexError:
        print("That line does not exist.")
    notes.close()
elif "," in no_cs:
    user_out_1 = int(no_cs.find(',') - 1)
    user_out_2 = int(no_cs.find(',') + 1)
    notes = open("notes.txt", "r")
    notes_lines = notes.read().split("\n")
    print(notes_lines[user_out_1:user_out_2]) # this is SUPPOSE to list all notes in a specific range but doesn't
    notes.close()
Now here is the notes.txt file:
note
note1
note2
note3
and lastly here is what I am getting in console when I attempt to run the program and type notes(0,2)
>>> notes(0,2)
jeffv : notes(0,2)
[]
A good way to do this is the str.partition() method. It splits a string at the first occurrence of a separator and returns a tuple of three parts: 0: the text before the separator, 1: the separator itself, 2: the text after the separator:
# The whole string we wish to search.. Let's use a
# Monty Python quote since we are using Python :)
# (note the space before the backslash, so the two words are not glued together)
whole_string = "We interrupt this program to annoy you and make things \
generally more irritating."
# Here is the first word we wish to split from the entire string
first_split = 'program'
# now we use partition to pick what comes after the first split word
substring_split = whole_string.partition(first_split)[2]
# now we use python to give us the first character after that first split word
first_character = str(substring_split)[0]
# since the above is a space, let's also show the second character so
# that it is less confusing :)
second_character = str(substring_split)[1]
# Output
print("Here is the whole string we wish to split: " + whole_string)
print("Here is the first split word we want to find: " + first_split)
print("Now here is the first word that occurred after our split word: " + substring_split)
print("The first character after the substring split is: " + first_character)
print("The second character after the substring split is: " + second_character)
output
Here is the whole string we wish to split: We interrupt this program to annoy you and make things generally more irritating.
Here is the first split word we want to find: program
Now here is the first word that occurred after our split word: to annoy you and make things generally more irritating.
The first character after the substring split is:
The second character after the substring split is: t
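Applied to the question's input, the empty list has a simple cause: no_cs.find(',') returns the position of the comma inside the string (7 for notes(0,2)), not the number next to it, so the slice becomes notes_lines[6:8], which is empty for a four-line file. A minimal sketch of one way to read the two numbers with partition follows. The helper name parse_range is hypothetical, and treating the end index as inclusive is an assumption about the intended behaviour.

```python
def parse_range(command):
    """Turn 'notes(0,2)' into (0, 2) and 'notes(3)' into (3, 3)."""
    # Keep only the text between the parentheses, e.g. "0,2".
    inner = command[command.index("(") + 1:command.rindex(")")]
    start_text, sep, end_text = inner.partition(",")
    start = int(start_text)
    # Without a comma, partition returns an empty separator.
    end = int(end_text) if sep else start
    return start, end

notes_lines = ["note", "note1", "note2", "note3"]
start, end = parse_range("notes(0,2)")
# Assumption: the end index is meant to be inclusive.
print(notes_lines[start:end + 1])  # ['note', 'note1', 'note2']
```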

How can I search a pattern and extract the value behind it

I am a newbie in python. I am trying to pull data (XXXX) out of a text with the pattern PDB:XXXX. The XXXX varies, but it is exactly what I want.
Since the data all contain PDB:, I use re.findall() to search and get this pattern. But this only gave me a list of PDB:. How can I get it to include the XXXX???
this is my code:
text = 'blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB'
etc.
r = re.findall("PDB:",text)
and the output gave me:
['PDB:', 'PDB:']
My desired output should be something like
['AAAA', 'BBBB']
You need to use """ to quote multi-line strings in Python. Also, to get a specific subset of the matched pattern, you need to use capture groups (the parentheses in my regular expression below).
import re
text = """blah...........
PDB:AAAA
blah...........
blah...........
PDB:BBBB"""
results = re.findall(r"PDB:(.*)", text)
print(results)  # ['AAAA', 'BBBB']
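Because .* captures everything up to the end of the line, any trailing text after the ID would be captured too. If the IDs consist only of word characters (an assumption; the question does not say), a tighter pattern is a possible variant. The sample text below is made up for illustration.

```python
import re

text = """see PDB:AAAA (chain A)
blah...........
PDB:BBBB, PDB:CCCC"""

# \w+ stops at the first character that is not a letter, digit or underscore.
ids = re.findall(r"PDB:(\w+)", text)
print(ids)  # ['AAAA', 'BBBB', 'CCCC']
```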

regex - Making all letters in a text lowercase using re.sub in python but exclude specific string?

I am writing a script to convert all uppercase letters in a text to lower case using regex, but excluding specific strings/characters such as "TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants", "#Media", "#Transcriber", "#Activities", "SBR", "#Comment" and so on.
The script I currently have is shown below. However, it does not produce the desired output. For instance, when I input "#Activities: SBR", the output given is "#Activities#activities: sbr#activities: sbrSBR". The intended output is "#Activities: SBR".
I am using Python 3.5.2
Can anyone help to provide some guidance? Thank you.
import os
from itertools import chain
import re
def lowercase_exclude_specific_string(line):
    line = line.strip()
    PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
    filtered_line = re.sub(PATTERN, line.lower(), line)
    return filtered_line
First, let's see why you're getting the wrong output.
For instance when I input "#Activities: SBR", the output given is
"#Activities#activities: sbr#activities: sbrSBR".
This is because your code
PATTERN = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'
filtered_line = re.sub(PATTERN, line.lower(), line)
is doing negated character class matching. Inside [^...] the | characters are literal, not alternation, so the pattern matches any single character that does not appear anywhere in the list, and each such character is replaced with line.lower() (which is "#activities: sbr").
The code will match ":" and " " (whitespace) and replace both with "#activities: sbr", giving you the result "#Activities#activities: sbr#activities: sbrSBR".
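This behaviour can be reproduced directly. The snippet below reuses the question's pattern and input and shows which characters the negated class actually matches.

```python
import re

line = "#Activities: SBR"
pattern = r'[^TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment]'

# Only characters absent from the bracketed set are matched.
matched = re.findall(pattern, line)
print(matched)  # [':', ' ']

# Each matched character is replaced by the whole lowercased line.
broken = re.sub(pattern, line.lower(), line)
print(broken)  # #Activities#activities: sbr#activities: sbrSBR
```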
Now to fix that code. Unfortunately, there is no direct way to negate words in a line and apply the substitution to the other words on that same line. Instead, you can first split the line into individual words, then apply re.sub to each word using your PATTERN. Also, instead of a negated character class, you should use a negative lookahead:
(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn’t match at the current position in the string.
Here's the code I got:
def lowercase_exclude_specific_string(line):
    line = line.strip()
    words = re.split(r"\s+", line)
    # matches a whole word only if it does not start with an excluded string
    PATTERN = r"^(?!TEA|CHI|I|#Begin|#Language|ENG|#Participants|#Media|#Transcriber|#Activities|SBR|#Comment).*$"
    result = []
    for word in words:
        lword = re.sub(PATTERN, word.lower(), word)
        result.append(lword)
    return " ".join(result)
The re.sub only matches a word that does not start with one of the strings in the PATTERN, and replaces it with its lowercase value. If the word starts with one of the excluded strings, it is not matched and re.sub returns it unchanged.
Each word is then stored in a list, then joined later to form the line back.
Samples:
print(lowercase_exclude_specific_string("#Activities: SBR"))
print(lowercase_exclude_specific_string("#Activities: SOME OTHER TEXT SBR"))
print(lowercase_exclude_specific_string("Begin ABCDEF #Media #Comment XXXX"))
print(lowercase_exclude_specific_string("#Begin AT THE BEGINNING."))
print(lowercase_exclude_specific_string("PLACE #Begin AT THE MIDDLE."))
print(lowercase_exclude_specific_string("I HOPe thIS heLPS."))
#Activities: SBR
#Activities: some other text SBR
begin abcdef #Media #Comment xxxx
#Begin at the beginning.
place #Begin at the middle.
I hope this helps.
EDIT:
As mentioned in the comments, there is apparently a tab between : and the next character. Since the code splits the line on \s+ and rejoins the words with single spaces, the tab is not preserved, but it can be restored by replacing the colon and the space after it with :\t in the final result.
return " ".join(result).replace(": ", ":\t")
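One limitation of the lookahead version is that it tests prefixes: a word such as IS starts with the excluded I, so it is left in uppercase. A possible variant, sketched below, compares whole tokens against a set and uses a replacement function so that the original whitespace (including tabs) is kept. The names KEEP and lowercase_except are hypothetical, and stripping trailing punctuation before the comparison is an assumption about how tokens like "#Activities:" should be treated.

```python
import re

# Tokens that must keep their original case (list taken from the question).
KEEP = {"TEA", "CHI", "I", "#Begin", "#Language", "ENG", "#Participants",
        "#Media", "#Transcriber", "#Activities", "SBR", "#Comment"}

def lowercase_except(line, keep=KEEP):
    """Lowercase every whitespace-separated token not listed in keep."""
    def fix(match):
        token = match.group(0)
        # Assumption: trailing punctuation is not part of the token name.
        core = token.rstrip(":.,;!?")
        return token if core in keep else token.lower()
    # \S+ matches the tokens only, so spaces and tabs stay untouched.
    return re.sub(r"\S+", fix, line)

print(lowercase_except("#Activities:\tSBR"))
print(lowercase_except("I HOPe thIS heLPS."))
print(lowercase_except("IS TEAM TEA"))
```

With this approach the tab after the colon does not need to be restored afterwards, because the line is never split and rejoined.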

Resources