I'm currently working on a folder-rename program that crawls a directory and renames specific words to their abbreviated versions. These abbreviations are kept in a dictionary. When I try to replace mylist[mylist.index(w)] with the abbreviation, it replaces the entire list. The list shows 2 values, but it is treating them like a single index. Any help would be appreciated, as I am very new to Python.
My current test environment has the following:
c:\test\Accounting 2018
My expected result when this is completed is c:\test\Acct 2018
import os
keyword_dict = {
    'accounting': 'Acct',
    'documents': 'Docs',
    'document': 'Doc',
    'invoice': 'Invc',
    'invoices': 'Invcs',
    'operations': 'Ops',
    'administration': 'Admin',
    'estimate': 'Est',
    'regulations': 'Regs',
    'work order': 'WO'
}
path = 'c:\\test'
def format_path():
    for kw in os.walk(path, topdown=False):
        # split the output to separate the '\'
        usable_path = kw[0].split('\\')
        # pull out the last folder name
        string1 = str(usable_path[-1])
        # split this output based on ' '
        mylist = [string1.lower().split(" ")]
        # iterate through the folders to find any values in the dictionary
        for i in mylist:
            for w in i:
                if w in keyword_dict.keys():
                    mylist[i.index(w)] = keyword_dict.get(w)
                    print(mylist)
format_path()
When I use print(mylist) prior to the index replacement, I get ['accounting', '2018'], and print(mylist[0]) returns the same result.
After the index replacement, print(mylist) returns ['Acct'], and the '2018' is now gone as well.
Why is it treating the list values as a single index?
I didn't test the following, but it should point you in the right direction. First, though, I'm not sure splitting on spaces is the way to go: 'Accounting 2018' could just as well come up as 'accounting2018' or 'accounting_2018', so a regular expression would be more robust. Anyway, here is a slightly modified version of your code:
import os
keyword_dict = {
    'accounting': 'Acct',
    'documents': 'Docs',
    'document': 'Doc',
    'invoice': 'Invc',
    'invoices': 'Invcs',
    'operations': 'Ops',
    'administration': 'Admin',
    'estimate': 'Est',
    'regulations': 'Regs',
    'work order': 'WO'
}
path = 'c:\\test'
def format_path():
    for kw in os.walk(path, topdown=False):
        # split the output to separate the '\'
        usable_path = kw[0].split('\\')
        # pull out the last folder name
        string1 = str(usable_path[-1])
        # split this output based on ' '
        mylist = string1.lower().split(" ")  # no [] here - that was creating a list within a list for no reason
        # iterate through the words to find any values in the dictionary
        for i in range(len(mylist)):
            abbreviation = keyword_dict.get(mylist[i], '')
            if abbreviation != '':  # an abbreviation exists, so overwrite the word
                mylist[i] = abbreviation
        new_path = " ".join(mylist)  # create the new name (i.e. ['Acct', '2018'] ==> 'Acct 2018')
        usable_path[-1] = new_path  # replace the last item in the original path, then rejoin the path
        print("\\".join(usable_path))
What you need is:
import re, os

regex = "|".join(keyword_dict.keys())
repl = lambda x: keyword_dict.get(x.group().lower())
path = 'c:\\test'
[re.sub(regex, repl, i[0], flags=re.I) for i in os.walk(path)]

You need to ensure the above is working as expected (so far it is) before you actually rename anything.
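From there, a minimal sketch of the rename step itself might look like this (untested, and assuming the keyword_dict and path defined above; topdown=False makes os.walk yield child folders before their parents, so a rename never invalidates a path that still has to be visited):

import os, re

regex = "|".join(keyword_dict.keys())
repl = lambda m: keyword_dict.get(m.group().lower())

for dirpath, dirnames, filenames in os.walk(path, topdown=False):
    parent, folder = os.path.split(dirpath)
    # rewrite only the last component of the path
    new_folder = re.sub(regex, repl, folder, flags=re.I)
    if new_folder != folder:  # only touch folders that actually changed
        os.rename(dirpath, os.path.join(parent, new_folder))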
This is my first post, so if I miss something, let me know.
I'm doing a CS50 beginner python course, and I'm stuck with a problem.
Long story short, the problem is to open a csv file, and it looks like this:
name,house
"Abbott, Hannah",Hufflepuff
"Bell, Katie",Gryffindor
.....
So I would love to put it into a dictionary (which I did), but the problem now is that I'm supposed to split the "name" key in two.
Here is my code, but it doesn't work:
before = []
....
with open(sys.argv[1]) as file:
    reader = csv.reader(file)
    for name, house in reader:
        before.append({"name": name, "house": house})

# here i would love to split the key "name" in "last", "first"
for row in before[1:]:
    last, first = name.split(", ")
Any advice?
Thank you in advance.
After you have the dictionary with complete name, you can split the name as below:
before = [{"name": "Abbott, Hannah", "house": "Hufflepuff"}]

# Before split
print(before)

for item in before:
    # Go through each item in before and split the name
    last, first = item["name"].split(', ')
    # Add new keys for last and first name
    item["last"] = last
    item["first"] = first
    # Remove the full name entry
    item.pop("name")

# After split
print(before)
You can also do the split on the first pass, i.e. store the last and first names directly instead of the full name.
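A minimal sketch of that first-pass approach, assuming the same CSV layout as above (csv.DictReader consumes the header row for you, so no slicing is needed afterwards):

import csv
import sys

before = []
with open(sys.argv[1]) as file:
    reader = csv.DictReader(file)  # uses the header row (name,house) for keys
    for row in reader:
        last, first = row["name"].split(", ")
        before.append({"first": first, "last": last, "house": row["house"]})

print(before)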
So I've got this function here, and what it's supposed to do is create a file that I can write to. The second and third parameters are lists, while the first is just the name of the file I am going to create. In the function I made a for loop over all_students_list, which is a list where each element is itself a list holding a first name and a last name. all_courses_list is a list of all the courses in the school, and schedule is a list that another function returns, giving us the schedule of the student. Then I added the student name and the schedule together to write to the file. The problem is that it also prints [] square brackets. How can I get rid of them? I've already tried to do
.replace('[', '')
.replace(']', '')
But it doesn't work.
Here is my code.
def generate_student_schedules(filename, all_courses_list, all_students_list):
    with open(filename, 'w') as fileout:
        for one_student in all_students_list:
            schedule = get_schedule(all_courses_list)
            one_line = ''
            one_line += f'{one_student}'
            one_line += f'{schedule}\n'
            fileout.write(one_line)
If one_student is an actual list, then you can use " ".join(one_student), so overall:
def generate_student_schedules(filename, all_courses_list, all_students_list):
    with open(filename, 'w') as fileout:
        for one_student in all_students_list:
            schedule = get_schedule(all_courses_list)
            one_line = ''
            one_line += " ".join(one_student)
            one_line += f'{schedule}\n'
            fileout.write(one_line)
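Note that if get_schedule also returns a list, f'{schedule}' will still print its brackets; assuming its entries are strings, the same join trick applies there:

one_line += ", ".join(schedule) + '\n'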
When you print a list, Python's default is to print the brackets and items in the list. You have to build a single string of the components of the list and print that single string. Your format string can pull out individual items or use join across all the items if they are all strings:
>>> student = ['John','Smith']
>>> schedule = ['Class1','Class2']
>>> print(student,schedule)
['John', 'Smith'] ['Class1', 'Class2']
>>> line = f'{student[1]}, {student[0]}: {", ".join(schedule)}'
>>> print(line)
Smith, John: Class1, Class2
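Applied to the original function, that might look like the following sketch (assuming each one_student is a [first, last] list and get_schedule returns a list of strings):

def generate_student_schedules(filename, all_courses_list, all_students_list):
    with open(filename, 'w') as fileout:
        for one_student in all_students_list:
            schedule = get_schedule(all_courses_list)
            # build one plain string per student: "Last, First: Class1, Class2"
            line = f'{one_student[1]}, {one_student[0]}: {", ".join(schedule)}'
            fileout.write(line + '\n')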
So basically I have a folder of files I'm opening and reading into python.
I want to search these files and count the keywords in each file, to make a dataframe like the attached image.
I have managed to open and read these files into a list, but my problem is as follows:
Edit 1:
I decided to try and import the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error - even though in my variable explorer, it's defined as a dictionary.
import os

filenames = os.listdir('.')
file_dict = {}

for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
Output:
AttributeError: 'list' object has no attribute 'lower'
Pre-edit post:
1. Each list value doesn't retain the filename key. So I don't have the rows I need.
2. I can't conduct a search of keywords in the list anyway, because it is not tokenized. So I can't count the keywords per file.
Here's my code for opening the files, converting them to lowercase and storing them in a list.
How can I transform this into a dictionary that retains the filename as the key, with tokenized values? Additionally, is it better to somehow import the files and their contents into a dictionary directly? Can I still tokenize and lower-case everything?
import os
import nltk

# create list of filenames to loop over
filenames = os.listdir('.')

# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0

# read files from folder, convert to lower case
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read()
            # convert to lower-case value
            Lcase_content.append(content.lower())
            ## these two lines below don't work - index out of range error
            tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
            num = num + 1
You can compute the count of each token by using collections.Counter: it can take a list of strings and return a dictionary-like Counter with each token in its keys and the count of that token in its values. Since NLTK's word_tokenize takes a string and returns a list of tokens, to get a dictionary with tokens and their counts you can basically do this:

Counter(nltk.tokenize.word_tokenize(content))
Since you want your file names as the index (first column), make it a nested dictionary, with a file name as the key and another dictionary of tokens and counts as the value, which looks like this:
{'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}),
'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}
If you are familiar with Pandas, you can convert your dictionary to a Pandas dataframe. It will make your life much easier when working with any tsv/csv/excel file, since you can export the dataframe as a csv file. Make sure you apply .lower() to your file content, and include orient='index' so that the file names become your index.
import os
import nltk
from collections import Counter
import pandas as pd

result = dict()
filenames = os.listdir('.')

for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            result[filename] = Counter(nltk.tokenize.word_tokenize(content))

df = pd.DataFrame.from_dict(result, orient='index').fillna(0)
df['total words'] = df.sum(axis=1)
df.to_csv('words_count.csv', index=True)
Re: your first attempt: since items is a list (see [i.strip() for i in f.read().split(",")]), you can't apply .lower() to it directly.
Re: your second attempt: your tokenized list is empty, since it was initialized as tokenized = []. That's why tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]) with num = 0 gives you the index-out-of-range error.
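If you do want to keep the first attempt's dictionary of lists, here is a small sketch of a lower_dict that maps over each list instead of calling .lower() on the list itself:

def lower_dict(d):
    # lower-case every string inside each list value
    return {k: [s.lower() for s in v] for k, v in d.items()}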
I have extracted text from an HTML file, and have the whole thing in a string.
I am looking for a method to loop through the string, and extract only values that are within square brackets and put strings in a list.
I have looked into several questions, among them this one: Extract character before and after "/"
But I am having a hard time modifying it. Can someone help?
Solved!
Thank you for all your inputs, I will definitely look more into regex. I managed to do what I wanted in a pretty manual way (it may not be beautiful):
# remove all html code and append to string
for i in html_file:
    html_string += str(html2text.html2text(i))

# set this boolean if the current character is either [ or ]
add = False

# extract only values within [ or ], based on add = T/F
for i in html_string:
    if i == '[':
        add = True
    if i == ']':
        add = False
        clean_string += str(i)
    if add == True:
        clean_string += str(i)

# split the string into a list without square brackets
clean_string_list = clean_string.split('][')
The HTML file I am looking to get as pure text (dataframe later on) instead of HTML, is my personal Facebook data that i have downloaded.
Try out this regex; given a string, it will place all text inside [ ] into a list.
import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
>>> ['eggs', 'hello']
Also this is a great reference for building your own regex.
https://regex101.com
EDIT: If you have nested square brackets, here is a function that will handle that case.
import re

test = 'spam[eg[nested]gs][hello]'

def square_bracket_text(test_text, found):
    """Find text enclosed in square brackets within a string"""
    matches = re.findall(r'\[(\w+)\]', test_text)
    if matches:
        found.extend(matches)
        for word in found:
            test_text = test_text.replace('[' + word + ']', '')
        square_bracket_text(test_text, found)
    return found

match = []
print(square_bracket_text(test, match))
>>> ['nested', 'hello', 'eggs']
hope it helps!
You can also use re.finditer() for this; see the example below.
Let's suppose we have word characters inside the brackets, so the regular expression will be \[\w+\].
If you wish, check it at https://rextester.com/XEMOU85362.
import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer(r"\[\w+\]", s)

l = list()  # or, l = []
for m in g:
    text = m.group(0)
    l.append(text[1:-1])  # strip the surrounding brackets

print(l)  # ['Programmer', 'Excellent']
I have two wordlists, as per examples below:
wordlist 1:
code1
code2
code3
wordlist 2:
11
22
23
I want to combine them so that every number from wordlist 2 is appended to every line of wordlist 1.
Example of the output:
code111
code122
code123
code211
code222
code223
code311
.
.
Can you please help me with how to do it? Thanks!
You can run two nested for loops to iterate over both lists, and append the concatenated string to a new list.
Here is a little example:
## create lists using square brackets
wordlist1 = ['code1',  ## wrap something in quotes to make it a string
             'code2', 'code3']
wordlist2 = ['11', '22', '23']

## create a new empty list
concatenated_words = []

## first for loop: one iteration per item in wordlist1
for i in range(len(wordlist1)):
    ## word with index i of wordlist1 (square brackets for indexing)
    word1 = wordlist1[i]
    ## second for loop: one iteration per item in wordlist2
    for j in range(len(wordlist2)):
        word2 = wordlist2[j]
        ## append concatenated words to the initially empty list
        concatenated_words.append(word1 + word2)

## iterate over the list of concatenated words, and print each item
for k in range(len(concatenated_words)):
    print(concatenated_words[k])
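For what it's worth, the same pairing can be written more compactly with itertools.product, which yields one pair per combination of the two lists:

from itertools import product

wordlist1 = ['code1', 'code2', 'code3']
wordlist2 = ['11', '22', '23']

# product() pairs every word with every number, in order
for w1, w2 in product(wordlist1, wordlist2):
    print(w1 + w2)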
list1 = ["text1", "text2", "text3", "text4"]
list2 = [11, 22, 33, 44]

def iterativeConcatenation(list1, list2):
    result = []
    for i in range(len(list1)):
        for j in range(len(list2)):
            result = result + [str(list1[i]) + str(list2[j])]
    return result

print(iterativeConcatenation(list1, list2))
Have you figured it out? It depends on whether you want to type the names into each list yourself, or have the script automatically read the files and then append or extend a new text file. I am working on a little script at the moment, and here is a very quick and simple way; let's say you want all the text files in the same folder as your .py file:
import os

# this makes a list with all .txt files in the folder
list_names = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]

outfile = 'combined.txt'  # output file name (example - the original snippet never defined outfile)

for file_name in list_names:
    with open(os.getcwd() + "/" + file_name) as fh:
        words = fh.read().splitlines()
    with open(outfile, 'a') as fh2:
        for word in words:
            fh2.write(word + '\n')