String items in list: how to remove certain keywords?

I have a set of links that looks like the following:
links = ['http://www.website.com/category/subcategory/1',
'http://www.website.com/category/subcategory/2',
'http://www.website.com/category/subcategory/3',...]
I want to extract the 1, 2, 3, and so on from this list, and store the extracted data in subcategory_explicit. They're stored as str, and I'm having trouble getting at them with the following code:
subcategory_explicit = [cat.get('subcategory') for cat in links if cat.get('subcategory') is not None]
Do I have to change my data type from str to something else? What would be a better way to obtain and store the extracted values?

subcategory_explicit = [i[i.find('subcategory'):] for i in links if 'subcategory' in i]
This uses a substring via slicing, starting at the "s" in "subcategory" until the end of the string. By adding len('subcategory') to the value from find, you can exclude "subcategory" and get "/#" (where # is whatever number).
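As a minimal sketch of that last point, assuming every link carries its number directly after "subcategory/" (the names marker and start are only illustrative, and the marker here includes the trailing slash so that only the number is left):

```python
links = ['http://www.website.com/category/subcategory/1',
         'http://www.website.com/category/subcategory/2',
         'http://www.website.com/category/subcategory/3']

marker = 'subcategory/'  # illustrative name; includes the trailing slash

subcategory_explicit = []
for link in links:
    start = link.find(marker)
    if start != -1:  # find returns -1 when the marker is absent
        # skip past the marker itself so only the trailing part remains
        subcategory_explicit.append(link[start + len(marker):])

print(subcategory_explicit)  # ['1', '2', '3']
```

The values stay as str; wrap them in int(...) if numbers are needed.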

Try this (using re module):
import re
links = [
'http://www.website.com/category/subcategory/1',
'http://www.website.com/category/subcategory/2',
'http://www.website.com/category/subcategory/3']
d = "|".join(links)
# 'http://www.website.com/category/subcategory/1|http://www.website.com/category/subcategory/2|http://www.website.com/category/subcategory/3'
# capture the trailing number rather than the subcategory name
pattern = re.compile(r"/category/\w+/(?P<subcategory_id>\d+)", re.I)
subcategory_explicit = pattern.findall(d)
print(subcategory_explicit)  # ['1', '2', '3']

Related

How to insert variable length list into string

I have what I think is a basic question in Python:
I have a list that can be variable in length and I need to insert it into a string for later use.
Formatting is simple, I just need a comma between each name up to nameN and parenthesis surrounding the names.
List = ['name1', 'name2' .... 'nameN']
string = "Their Names are <(name1 ... nameN)> and they like candy."
Example:
List = ['tom', 'jerry', 'katie']
print(string)
Their Names are (tom, jerry, katie) and they like candy.
Any ideas on this? Thanks for the help!
# Create a comma-separated string with names
the_names = ', '.join(List) # 'tom, jerry, katie'
# Interpolate it into the "main" string
string = f"Their Names are ({the_names}) and they like candy."
There are numerous ways to achieve that.
You could use print + format + join, similar to the example from @ForceBru.
Using format would make it compatible with both Python2 and Python3.
names_list = ['tom', 'jerry', 'katie']
"""
Convert the list into a string with .join (in this case we are separating with commas)
"""
names_string = ', '.join(names_list)
# names_string == "tom, jerry, katie"
# Now add one string inside the other:
string = "Their Names are ({}) and they like candy.".format(names_string)
print(string)
>> Their Names are (tom, jerry, katie) and they like candy.

How to extract text between specific letters from a string in Python(3.9)?

How can I extract a value from a string in Python when the value sits between two known markers in the text?
e.g.
"Kahoot : ID:1234567 Name:RandomUSERNAME"
I want it to receive the 1234567 and the RandomUSERNAME in 2 different variables.
One approach I thought of is to take everything after "ID:" up to the next space, and everything after "Name:" up to the end of the text.
How do I code this?
If I haven't explained this clearly, tell me; I'm not sure how to ask or format this question. Sorry for any inconvenience!
If the text always follows the same format you could just split the string. Alternatively, you could use regular expressions using the re library.
Using split:
string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
string = string.split(" ")
id = string[2][3:]
name = string[3][5:]
print(id)
print(name)
Using re:
import re
string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
id = re.search(r'(?<=ID:).*?(?=\s)', string).group(0)
name = re.search(r'(?<=Name:).*', string).group(0)
print(id)
print(name)
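Both values can also be pulled out in a single pass with named groups. This is only a sketch, assuming the text always has "ID:" followed later by "Name:" as in the example; the group names user_id and name are made up for illustration:

```python
import re

text = "Kahoot : ID:1234567 Name:RandomUSERNAME"

# \S+ takes everything up to the next whitespace; .+ takes the rest of the line
match = re.search(r"ID:(?P<user_id>\S+)\s+Name:(?P<name>.+)", text)
if match:
    user_id = match.group("user_id")
    name = match.group("name")
    print(user_id)  # 1234567
    print(name)     # RandomUSERNAME
```

Checking match before calling group avoids an AttributeError when the text does not follow the expected format.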

Extract characters within certain symbols

I have extracted text from an HTML file, and have the whole thing in a string.
I am looking for a method to loop through the string, and extract only values that are within square brackets and put strings in a list.
I have looked into several questions, among them this one: Extract character before and after "/"
But I am having a hard time modifying it. Can someone help?
Solved!
Thank you for all your inputs, I will definitely look more into regex. I managed to do what i wanted in a pretty manual way (may not be beautiful):
#remove all html code and append to string
for i in html_file:
    html_string += str(html2text.html2text(i))
#set this boolean if current character is either [ or ]
add = False
#extract only values within [ or ], based on add = T/F
for i in html_string:
    if i == '[':
        add = True
    if i == ']':
        add = False
        clean_string += str(i)
    if add == True:
        clean_string += str(i)
#split string into list without square brackets
clean_string_list = clean_string.split('][')
The HTML file I am looking to get as pure text (dataframe later on) instead of HTML, is my personal Facebook data that i have downloaded.
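For comparison, the same character-by-character idea can append each bracketed value to a list directly, which avoids the final split. A rough sketch, assuming the brackets are never nested (extract_bracketed is a made-up helper name, not part of any library):

```python
def extract_bracketed(text):
    """Collect the text found between each '[' and the next ']'."""
    results = []
    current = None  # None means we are outside any brackets
    for ch in text:
        if ch == '[':
            current = []            # start a new capture
        elif ch == ']':
            if current is not None:
                results.append(''.join(current))
            current = None          # back outside
        elif current is not None:
            current.append(ch)
    return results

print(extract_bracketed('spam[eggs] and [hello world]'))  # ['eggs', 'hello world']
```

Unlike a \w+ pattern, this keeps spaces and punctuation that appear inside the brackets.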
Try out this regex, given a string it will place all text inside [ ] into a list.
import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
>>> ['eggs', 'hello']
Also this is a great reference for building your own regex.
https://regex101.com
EDIT: If you have nested square brackets here is a function that will handle that case.
import re
test ='spam[eg[nested]gs][hello]'
def square_bracket_text(test_text,found):
    """Find text enclosed in square brackets within a string"""
    matches = re.findall(r'\[(\w+)\]',test_text)
    if matches:
        found.extend(matches)
        for word in found:
            test_text = test_text.replace('[' + word + ']','')
        square_bracket_text(test_text,found)
    return found
match = []
print(square_bracket_text(test,match))
>>>['nested', 'hello', 'eggs']
hope it helps!
You can also use re.finditer() for this, see below example.
Suppose the brackets contain only word characters; the regular expression is then \[\w+\].
If you wish, check it at https://rextester.com/XEMOU85362.
import re
s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer(r"\[\w+\]", s)
l = list() # or, l = []
for m in g:
    text = m.group(0)
    l.append(text[1: -1])
print(l) # ['Programmer', 'Excellent']

Python: Trouble indexing a list from .split()

I'm currently working on a folder rename program that will crawl a directory, and rename specific words to their abbreviated version. These abbreviations are kept in a dictionary. When I try to replace mylist[mylist.index(w)] with the abbreviation, it replaces the entire list. The list shows 2 values, but it is treating them like a single index. Any help would be appreciated, as I am very new to Python.
My current test environment has the following:
c:\test\Accounting 2018
My expected result when this is completed, is c:\test\Acct 2018
import os
keyword_dict = {
'accounting': 'Acct',
'documents': 'Docs',
'document': 'Doc',
'invoice': 'Invc',
'invoices': 'Invcs',
'operations': 'Ops',
'administration': 'Admin',
'estimate': 'Est',
'regulations': 'Regs',
'work order': 'WO'
}
path = 'c:\\test'
def format_path():
    for kw in os.walk(path, topdown=False):
        #split the output to separate the '\'
        usable_path = kw[0].split('\\')
        #pull out the last folder name
        string1 = str(usable_path[-1])
        #Split this output based on ' '
        mylist = [string1.lower().split(" ")]
        #Iterate through the folders to find any values in dictionary
        for i in mylist:
            for w in i:
                if w in keyword_dict.keys():
                    mylist[i.index(w)] = keyword_dict.get(w)
                    print(mylist)
format_path()
When I use print(mylist) prior to the index replacement, I get ['accounting', '2018'], and print(mylist[0]) returns the same result.
After the index replacement, print(mylist) returns ['Acct'], and the '2018' is gone as well.
Why is it treating the list values as a single index?
I didn't test the following, but it should point you in the right direction. First, though, I'm not sure splitting on spaces is the way to go: "Accounting 2018" could just as well appear as accounting2018 or accounting_2018, so a regular expression would be more robust. Anyway, here is a slightly modified version of your code:
import os
keyword_dict = {
'accounting': 'Acct',
'documents': 'Docs',
'document': 'Doc',
'invoice': 'Invc',
'invoices': 'Invcs',
'operations': 'Ops',
'administration': 'Admin',
'estimate': 'Est',
'regulations': 'Regs',
'work order': 'WO'
}
path = 'c:\\test'
def format_path():
    for kw in os.walk(path, topdown=False):
        #split the output to separate the '\'
        usable_path = kw[0].split('\\')
        #pull out the last folder name
        string1 = str(usable_path[-1])
        #Split this output based on ' '
        mylist = string1.lower().split(" ") #Remove [] since you are creating a list within a list for no reason
        #Iterate through the folders to find any values in dictionary
        for i in range(0,len(mylist)):
            abbreviation=keyword_dict.get(mylist[i],'')
            if abbreviation!='': #abbreviation exists so overwrite it
                mylist[i]=abbreviation
        new_path=" ".join(mylist) #create new path (i.e. ['Acct', '2018'] ==> Acct 2018)
        usable_path[len(usable_path)-1]=new_path #replace the last item in the original path then rejoin the path
        print("\\".join(usable_path))
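To see why the original version collapsed the list, here is a small standalone sketch that involves no file system at all. The names words, nested and flat are illustrative only:

```python
words = "accounting 2018".split(" ")
nested = [words]              # what the question's code built: a list inside a list
print(len(nested))            # 1 -> the outer list has a single element
print(nested[0])              # ['accounting', '2018']

# index 0 is looked up in the inner list but assigned in the OUTER list
nested[words.index("accounting")] = "Acct"
print(nested)                 # ['Acct'] -> the inner list, and '2018', are gone

# with a flat list the same assignment replaces only one word
flat = "accounting 2018".split(" ")
flat[flat.index("accounting")] = "Acct"
print(flat)                   # ['Acct', '2018']
```

So the list was never treating two values as one index; the outer list only ever had one element, and that element was overwritten.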
What you need is:
import re, os
regex = "|".join(keyword_dict.keys())
repl = lambda x : keyword_dict.get(x.group().lower())
path = 'c:\\test'
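# flags must be passed by keyword: the fourth positional argument of re.sub is count
[re.sub(regex, repl, i[0], flags=re.I) for i in os.walk(path)]
Check that the above produces the names you expect (so far it works as expected for me) before you actually rename anything.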

Replacing spaces in lists

I'm creating a google searcher in python. Is there any way that I can replace a space in a list with a "+" for my url? This is my code so far:
q=input("Question=")
qlist=list(q)
#print(qlist)
Can I replace any spaces in my list with a plus, and then turn that back into a string?
Just want to add another line of thought there. Try the urllib library for parsing url strings.
Here's an example:
import urllib.parse
## Create an empty dictionary to hold values (for questions and answers).
data = dict()
## Sample input (named question so it does not shadow the built-in input)
question = 'This is my question'
### Data key can be 'Question'
data['Question='] = question
### We'll pass that dictionary through the urlencode method
url_values = urllib.parse.urlencode(data)
### And print results
print(url_values)
#-------------------------------------------------------------------------------------------------------
#Alternatively, you can set up the dictionary a little better if you only have a couple of key-value pairs
## Input
question = 'This is my question'
# Our dictionary; we can set the input value as the value to the Question key
data = {
'Question=': question
}
print(urllib.parse.urlencode(data))
Output:
'Question%3D=This+is+my+question'
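If only the "+"-separated text is needed, urllib.parse.quote_plus encodes a single value, and a key without the trailing "=" avoids the %3D in the output above. A short sketch; the key "q" and the base URL are assumptions made for illustration, not something taken from the question:

```python
from urllib.parse import quote_plus, urlencode

q = "This is my question"

# quote_plus encodes a single value: spaces become '+', unsafe characters become %XX
print(quote_plus(q))                 # This+is+my+question

# urlencode builds the whole "key=value" query string from a dict
query = urlencode({"q": q})
print(query)                         # q=This+is+my+question

# Assumed base URL, for illustration only
url = "https://www.google.com/search?" + query
print(url)
```

Compared with replacing spaces by hand, this also escapes characters such as & and ? that would otherwise break the URL.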
You can split the input on spaces and join the pieces with "+" to create one long string.
qlist = q.split(" ")
result = "+".join(qlist)
print("Output string: {}".format(result))
Look at the join and split operations in python.
q = 'dog cat'
list_info = q.split()
https://docs.python.org/3/library/stdtypes.html#str.split
q = ['dog', 'cat']
s_info = '+'.join(q)  # 'dog+cat'
https://docs.python.org/3/library/stdtypes.html#str.join
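To answer the question as literally asked, here is a small sketch that keeps the list of characters from list(q), swaps the spaces, and joins the result back into a string (the sample text is made up):

```python
q = "dog cat bird"

qlist = list(q)                                        # one list item per character
qlist = ['+' if ch == ' ' else ch for ch in qlist]     # swap each space for '+'
result = ''.join(qlist)                                # back to a single string
print(result)                                          # dog+cat+bird

# the same result without going through a list at all
print(q.replace(' ', '+'))                             # dog+cat+bird
```

Neither version escapes other special characters, so this is only suitable for simple input.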
