Extract characters within certain symbols - python-3.x

I have extracted text from an HTML file, and have the whole thing in a string.
I am looking for a method to loop through the string, extract only the values that are within square brackets, and put those strings in a list.
I have looked into several questions, among them this one: Extract character before and after "/"
But I am having a hard time modifying it. Can someone help?
Solved!
Thank you for all your inputs, I will definitely look more into regex. I managed to do what I wanted in a pretty manual way (it may not be beautiful):
import html2text

# remove all HTML code and append to one string
html_string = ""
for i in html_file:
    html_string += str(html2text.html2text(i))

# this boolean is set while the current character is between [ and ]
add = False

# extract only values within [ and ], based on the add flag
clean_string = ""
for i in html_string:
    if i == '[':
        add = True
    if i == ']':
        add = False
        clean_string += str(i)
    if add == True:
        clean_string += str(i)

# split the string into a list on the '][' boundaries
clean_string_list = clean_string.split('][')
The HTML file I am looking to get as pure text (a dataframe later on) instead of HTML is my personal Facebook data that I have downloaded.

Try out this regex; given a string, it will place all text inside [ ] into a list.
import re

print(re.findall(r'\[(\w+)\]', 'spam[eggs][hello]'))
>>> ['eggs', 'hello']
Also, this is a great reference for building your own regexes:
https://regex101.com
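Note that \w+ matches only word characters, so bracketed text containing spaces or punctuation would be skipped. A minimal sketch with a more permissive pattern (the sample string here is my own, not from the original answer):

import re

# [^\]]+ matches any run of characters that are not a closing bracket
print(re.findall(r'\[([^\]]+)\]', 'spam[two words][hello!]'))
# ['two words', 'hello!']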
EDIT: If you have nested square brackets, here is a function that will handle that case.
import re

test = 'spam[eg[nested]gs][hello]'

def square_bracket_text(test_text, found):
    """Find text enclosed in square brackets within a string."""
    matches = re.findall(r'\[(\w+)\]', test_text)
    if matches:
        found.extend(matches)
        # remove the matched pieces, then recurse to find what they enclosed
        for word in found:
            test_text = test_text.replace('[' + word + ']', '')
        square_bracket_text(test_text, found)
    return found

match = []
print(square_bracket_text(test, match))
>>> ['nested', 'hello', 'eggs']
hope it helps!

You can also use re.finditer() for this; see the example below.
Suppose we have word characters inside the brackets; the regular expression is then \[\w+\].
If you wish, check it at https://rextester.com/XEMOU85362.
import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer(r"\[\w+\]", s)

l = list()  # or, l = []
for m in g:
    text = m.group(0)
    l.append(text[1:-1])  # strip the surrounding brackets

print(l)  # ['Programmer', 'Excellent']
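Alternatively, a capture group avoids the manual slicing; a small sketch of the same idea:

import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
# group(1) returns only the text captured inside the parentheses
l = [m.group(1) for m in re.finditer(r"\[(\w+)\]", s)]
print(l)  # ['Programmer', 'Excellent']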

Related

Get number from string in Python

I have a string, and I have to get only the digits from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting only an empty list [].
You can use a regex to get just the digits in a list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re

num = re.sub(r'\D', '', url)  # num == '1987'
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:

import re

url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]

If you do not know what regular expressions are, the code is basically saying: using the regex string defined as r'(\d+)', which means capture any digits, search through the url. Then captured holds the first captured group, which is 1987.
If you don't want to use this, then you can use your .split() method, but this time split on '/' as the separator, for example url.split('/').
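A minimal sketch of that split-based approach, assuming the number is always the last path segment:

url = "www.mylocalurl.com/edit/1987"
parts = url.split('/')   # ['www.mylocalurl.com', 'edit', '1987']
number = int(parts[-1])  # 1987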

How to split strings from .txt file into a list, sorted from A-Z without duplicates?

For instance, the .txt file includes 2 lines, with names separated by commas:
John, George, Tom
Mark, James, Tom,
Output should be:
[George, James, John, Mark, Tom]
The following will create the list and store each item as a string.
def test(path):
    filename = path
    with open(filename) as f:
        f = f.read()
    f_list = f.split('\n')
    # drop empty lines (iterate over a copy so removing is safe)
    for i in f_list[:]:
        if i == '':
            f_list.remove(i)
    res1 = []
    for i in f_list:
        res1.append(i.split(', '))
    res2 = []
    for i in res1:
        res2 += i
    res3 = [i.strip(',') for i in res2]
    # drop duplicates (again iterating over a copy)
    for i in res3[:]:
        if res3.count(i) != 1:
            res3.remove(i)
    res3.sort()
    return res3

print(test('location/of/file.txt'))
Output:
['George', 'James', 'John', 'Mark', 'Tom']
Your file opening is fine, although the 'r' is redundant since that's the default. You claim it's not, but it is. Read the documentation.
You have not described what task is, so I have no idea what's going on there. I will assume that it is correct.
Rather than populating a list and doing a membership test on every iteration - which is O(n^2) in time - can you think of a different data structure that guarantees uniqueness? Google will be your friend here. Once you discover this data structure, you will not have to perform membership checks at all. You seem to be struggling with this concept; the answer is a set.
The input data format is not rigorously defined. Separators may be commas or commas with trailing spaces, and may appear (or not) at the end of the line. Consider making an appropriate regular expression and using its splitting feature to split individual lines, though normal splitting and stripping may be easier to start.
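For illustration, a regex split for one such line might look like this (a sketch; the example code below sticks to plain splitting and stripping):

import re

line = 'Mark, James, Tom,\n'
# split on a comma followed by optional whitespace, dropping empties
words = [w for w in re.split(r',\s*', line.strip()) if w]
# ['Mark', 'James', 'Tom']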
In the following example code, I've:
ignored task since you've said that that's fine;
separated actual parsing of file content from parsing of in-memory content to demonstrate the function without a file;
used a set comprehension to store unique results of all split lines; and
passed a generator expression to sorted that drops empty strings.
from io import StringIO
from typing import TextIO, List


def parse(f: TextIO) -> List[str]:
    words = {
        word.strip()
        for line in f
        for word in line.split(',')
    }
    return sorted(
        word for word in words if word != ''
    )


def parse_file(filename: str) -> List[str]:
    with open(filename) as f:
        return parse(f)


def test():
    f = StringIO('John, George , Tom\nMark, James, Tom, ')
    words = parse(f)
    assert words == [
        'George', 'James', 'John', 'Mark', 'Tom',
    ]

    f = StringIO(' Han Solo, Boba Fet \n')
    words = parse(f)
    assert words == [
        'Boba Fet', 'Han Solo',
    ]


if __name__ == '__main__':
    test()
I came up with a very simple solution, if anyone needs it:

def unique_sorted_words(x):  # wrapped in a function so the return is valid; x is the open file
    lines = x.read().split()
    lines.sort()
    new_list = []
    [new_list.append(word) for word in lines if word not in new_list]
    return new_list
with open("text.txt", "r") as fl:
list_ = set()
for line in fl.readlines():
line = line.strip("\n")
line = line.split(",")
[list_.add(_) for _ in line if _ != '']
print(list_)
I think that you missed a comma after Tom in the first line.
You can avoid the use of a loop by using the split method:

content = file.read()
my_list = content.split(",")

To remove the duplicates in your list, you can convert it to a set:

my_list = list(set(my_list))

Then you can sort it using sorted. So the final code:

with open("file.txt", "r") as file:
    content = file.read()
    my_list = content.replace("\n", "").replace(" ", "").split(",")
    result = sorted(list(set(my_list)))

You can add a key to your sort function.
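For example, a key of str.lower makes the ordering case-insensitive (a sketch):

result = sorted(set(my_list), key=str.lower)  # 'apple' sorts next to 'Apple'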

python not removing punctuation

I have a text file. I want to remove punctuation and save it as a new file, but it is not removing anything. Any idea why?
code:
def punctuation(string):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")
    # Print string without punctuation
    print(string)

file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
punctuation(l)

with open('ir500_no_punc.txt', 'w') as file:
    file.write(file_no_punc)
Why is it not removing any punctuation?
def punctuation(string):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")
    # return string without punctuation
    return string

file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
file_no_punc = punctuation(file_no_punc)

with open('ir500_no_punc.txt', 'w') as file:
    file.write(file_no_punc)
Explanation:
I changed only punctuation(l) to file_no_punc = punctuation(file_no_punc), and print(string) to return string.
1) What is l in punctuation(l)?
2) You are calling punctuation() - which works correctly - but do not use its return value.
3) Because it is not currently returning a value, just printing it ;-)
Please note that I made only the minimal change to make it work. You might want to post it to our code review site, to see how it could be improved.
Also, I would recommend that you get a good IDE. In my opinion, you cannot beat PyCharm community edition. Learn how to use the debugger; it is your best friend. Set breakpoints, run the code; it will stop when it hits a breakpoint; you can then examine the values of your variables.
Taking out the file reading/writing, you could remove the punctuation from a string like this:

table = str.maketrans("", "", r"!()-[]{};:'\"\,<>./?@#$%^&*_~")
# or maybe even better:
# import string
# table = str.maketrans("", "", string.punctuation)

file_with_punc = r"abc!()-[]{};:'\"\,<>./?@#$%^&*_~def"
file_no_punc = file_with_punc.lower().translate(table)
# abcdef

where I use str.maketrans and str.translate.
Note that Python strings are immutable: there is no way to change a given string; every operation you perform on a string will return a new instance.
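A quick demonstration of that immutability:

s = "a,b,c"
t = s.replace(",", "")
print(s)  # 'a,b,c' - the original string is unchanged
print(t)  # 'abc'   - replace() returned a new string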

How can I remove all POS tags except for 'VBD' and 'VBN' from my CSV file?

I want to remove words tagged with the specific part-of-speech tags VBD and VBN from my CSV file. But, I'm getting the error "IndexError: list index out of range" after entering the following code:
for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])
My CSV file has 10 reviews from 10 people, and the column name is Comment.
Here is my full code:
df_Comment = pd.read_csv("myfile.csv")

def clean(text):
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    tagged = nltk.pos_tag(text)
    text = text.rstrip()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    stop_free = " ".join([i for i in text.lower().split() if ((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

text_clean = []
for text in df_Comment['Comment']:
    text_clean.append(clean(text).split())
print(text_clean)

POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)

words = []
for word in POS_tag_text_clean:
    if word[1] != 'VBD' and word[1] != 'VBN':
        words.append(word[0])
How can I fix the error?
It is a bit hard to understand your problem without an example and the corresponding outputs, but it might be this:
Assuming that text is a string, text_clean will be a list of lists of strings, where every string represents a word. After the part-of-speech tagging, POS_tag_text_clean will therefore be a list of lists of tuples, each tuple containing a word and its tag.
If I'm right, then your last loop actually loops over items from your dataframe instead of words, as the name of the variable suggests. If an item has only one word (which is not so unlikely, since you filter a lot in clean()), your call to word[1] will fail with an error similar to the one you report.
Instead, try this code:
words = []
for item in POS_tag_text_clean:
    words_in_item = []
    for word in item:
        if word[1] != 'VBD' and word[1] != 'VBN':
            words_in_item.append(word[0])
    words.append(words_in_item)
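Equivalently, as a nested list comprehension (a sketch of the same logic):

words = [
    [word for word, tag in item if tag not in ('VBD', 'VBN')]
    for item in POS_tag_text_clean
]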

String manipulations using Python Pandas

I have some name and ethnicity data, for example:
John Wick English
Black Widow French
I then do a bit of manipulation to transform the names as below:
John Wick -> john#wick??????????????????????????????????
Black Widow -> black#widow????????????????????????????????
I then proceed to create multiple variables, each containing a 3-character substring, through the for loop.
I also try to count the number of alphabetic characters using re.findall.
I have two questions:
1) Is the for loop efficient? Can I replace it with better code, even though it is working as is?
2) I can't get the code that counts the alphabetic characters to work. Any suggestions?
import pandas as pd
from pandas import DataFrame
import re

# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity

# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]

# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()

# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s\-]', '')  # Retain space and hyphen

# Replace empty space with "#"
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[\s]', '#')

# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print(nameLength.max())  # Longest name has 40 characters, including spaces and hyphens

# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")

# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]

# Count number of characters
frame3["name_len"] = len(re.findall('[a-zA-Z]', name))

# Test outputs
print(frame3)
1) Regarding the loop, I can't think of a better way than what you're already doing.
2) Try frame3["name_len"] = frame3["name"].map(lambda x : len(re.findall('[a-zA-Z]', x)))
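As an aside, a vectorized alternative to the map/lambda is pandas' .str.count(), which counts regex matches per row (a sketch):

frame3["name_len"] = frame3["name"].str.count(r'[a-zA-Z]')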
