How do I print out results on a separate line after converting them from a set to a string? - python-3.x

I am currently trying to compare two text files to see if they have any words in common.
The text files are as follows:
ENGLISH.TXT
circle
table
year
competition
FRENCH.TXT
bien
competition
merci
air
table
My current code gets them to print, and I've removed all the unnecessary curly brackets and so on, but I can't get them to print on separate lines.
List = open("english.txt").readlines()
List2 = open("french.txt").readlines()
anb = set(List) & set(List2)
anb = str(anb)
anb = (str(anb)[1:-1])
anb = anb.replace("'","")
anb = anb.replace(",","")
anb = anb.replace('\\n',"")
print(anb)
The output is expected to separate both results onto new lines.
Currently Happening:
Competition Table
Expected:
Competition
Table
Thanks in advance!
- Xphoon

Hi, I'd suggest trying two things as good practice:
1) Use "with" for opening files
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    pass  # your Python operations on the files go here
2) Use f-strings if you're on Python 3.6 or later:
print(f"Hello\nWorld!")
File read using "open()" vs "with open()"
This post explains very well why to use the "with" statement :)
In addition, if you want to print variables with f-strings, do it like this:
print(f"{variable[index]}\n{variable2[index2]}")
This should print the two values (for example, Hello and World!) on separate lines.
Here is one solution including converting between sets and lists:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]
anb = set(english_words) & set(french_words)
anb_list = [item for item in anb]
for item in anb_list:
    print(item)
Here is another solution by keeping the words in lists:
with open('english.txt', 'r') as englishfile, open('french.txt', 'r') as frenchfile:
    english_words = englishfile.readlines()
    english_words = [word.strip('\n') for word in english_words]
    french_words = frenchfile.readlines()
    french_words = [word.strip('\n') for word in french_words]
for english_word in english_words:
    for french_word in french_words:
        if english_word == french_word:
            print(english_word)
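Putting the pieces above together, a compact version might look like the sketch below. The helper name `common_words` is made up for illustration, and the two word lists are inlined with `io.StringIO` so the snippet runs without the files; with real files you would pass the objects returned by `open()` instead. Sets are unordered, so the result is sorted to give a stable output.

```python
from io import StringIO

def common_words(file_a, file_b):
    """Return the words that appear in both line-oriented files, sorted."""
    # strip() removes the trailing newline (and stray spaces) from each line
    words_a = {line.strip() for line in file_a if line.strip()}
    words_b = {line.strip() for line in file_b if line.strip()}
    return sorted(words_a & words_b)

# Stand-ins for english.txt and french.txt from the question
english = StringIO("circle\ntable\nyear\ncompetition\n")
french = StringIO("bien\ncompetition\nmerci\nair\ntable\n")

result = common_words(english, french)
# "\n".join puts each common word on its own line
print("\n".join(result))
```

Joining with "\n" avoids converting the set to a string and then stripping brackets, quotes and commas by hand.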

Related

How to split strings from .txt file into a list, sorted from A-Z without duplicates?

For instance, the .txt file includes 2 lines, separated by commas:
John, George, Tom
Mark, James, Tom,
Output should be:
[George, James, John, Mark, Tom]
The following will create the list and store each item as a string.
def test(path):
    filename = path
    with open(filename) as f:
        f = f.read()
    # Split into lines and drop empty ones (filtering avoids removing items while iterating)
    f_list = [line for line in f.split('\n') if line != '']
    res1 = []
    for i in f_list:
        res1.append(i.split(', '))
    res2 = []
    for i in res1:
        res2 += i
    res3 = [i.strip(',') for i in res2]
    # A set removes every duplicate; sorted() then returns an A-Z list
    res3 = sorted(set(res3))
    return res3
print(test('location/of/file.txt'))
Output:
['George', 'James', 'John', 'Mark', 'Tom']
Your file opening is fine, although the 'r' is redundant since that's the default. You claim it's not, but it is. Read the documentation.
You have not described what task is so I have no idea what's going on there. I will assume that it is correct.
Rather than populating a list and doing a membership test on every iteration - which is O(n^2) in time - can you think of a different data structure that guarantees uniqueness? Google will be your friend here. Once you discover this data structure, you will not have to perform membership checks at all. You seem to be struggling with this concept; the answer is a set.
The input data format is not rigorously defined. Separators may be commas or commas with trailing spaces, and may appear (or not) at the end of the line. Consider making an appropriate regular expression and using its splitting feature to split individual lines, though normal splitting and stripping may be easier to start.
In the following example code, I've:
ignored task since you've said that that's fine;
separated actual parsing of file content from parsing of in-memory content to demonstrate the function without a file;
used a set comprehension to store unique results of all split lines; and
passed sorted a generator expression that drops empty strings.
from io import StringIO
from typing import TextIO, List
def parse(f: TextIO) -> List[str]:
    words = {
        word.strip()
        for line in f
        for word in line.split(',')
    }
    return sorted(
        word for word in words if word != ''
    )
def parse_file(filename: str) -> List[str]:
    with open(filename) as f:
        return parse(f)
def test():
    f = StringIO('John, George , Tom\nMark, James, Tom, ')
    words = parse(f)
    assert words == [
        'George', 'James', 'John', 'Mark', 'Tom',
    ]
    f = StringIO(' Han Solo, Boba Fet \n')
    words = parse(f)
    assert words == [
        'Boba Fet', 'Han Solo',
    ]
if __name__ == '__main__':
    test()
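The regular-expression splitting suggested earlier in this answer could be sketched as follows. The pattern and the function name `parse_with_regex` are assumptions made for illustration: the separator is taken to be a comma or a newline with optional whitespace on either side, which matches the loosely defined input described above but may need adjusting for other data.

```python
import re

# Assumed separator: a comma or a newline, with optional surrounding whitespace
SEPARATOR = re.compile(r'\s*[,\n]\s*')

def parse_with_regex(text):
    """Split text on the separator, drop empties, return sorted unique names."""
    parts = SEPARATOR.split(text.strip())
    # The set removes duplicates; the filter drops the empty string
    # that a trailing comma leaves behind
    return sorted({part for part in parts if part})

names = parse_with_regex('John, George , Tom\nMark, James, Tom, ')
print(names)
```

Because the pattern only consumes whitespace that touches a comma or newline, names containing a space stay intact.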
I came up with a very simple solution if anyone will need:
def unique_sorted(x):  # hypothetical wrapper name; x is an open file object
    lines = x.read().split()
    lines.sort()
    new_list = []
    for word in lines:
        if word not in new_list:
            new_list.append(word)
    return new_list
with open("text.txt", "r") as fl:
    list_ = set()
    for line in fl.readlines():
        line = line.strip("\n")
        line = line.split(",")
        [list_.add(_) for _ in line if _ != '']
    print(list_)
I think that you missed a comma after Jim in the first line.
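You can avoid the loop by using the split method:
content=file.read()
my_list=content.split(",")
To remove the duplicates from your list, you can convert it to a set:
my_list=list(set(my_list))
Then you can sort it using sorted.
So the final code is:
with open("file.txt", "r") as file:
    content=file.read()
# Treat line breaks as separators too, so the last name on one line is not glued to the first name on the next
my_list=content.replace("\n",",").replace(" ", "").split(",")
# Drop the empty strings left by trailing commas, deduplicate, and sort
result=sorted(set(name for name in my_list if name))
You can also pass a key function to sorted to customize the ordering.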
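As a small illustration of the key argument to sorted, the sketch below compares the default ordering with a case-insensitive one. The list of names is invented for the example.

```python
names = ["tom", "George", "james", "John", "Mark"]

# Default ordering compares code points, so uppercase letters sort first
default_order = sorted(names)
# key=str.lower compares lowercased copies without changing the items
case_insensitive = sorted(names, key=str.lower)

print(default_order)
print(case_insensitive)
```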

How do I perform a regular expression on multiple .txt files in a folder (Python)?

I'm trying to open 32 .txt files, extract some text from them (using RegEx) and then save them as individual files again (later on in the project I'm hoping to collate them together). I've tested the RegEx on a single file and it seems to work:
import os
import re
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')
with open('1988.txt') as txtfile:
    text = txtfile.read()
#print(len(text)) #sentences in text
start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))
end = r'Load-Date:'
docs = re.findall(end, text)  # count the end markers before reporting them
print('Found the end of %s documents.' % len(docs))
regex = start+r'(.+?)'+end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to perform the exact same thing on all my .txt files in that folder, but I can't figure out how to. I've been playing around with For loops but with little success. Currently I have this:
import os
import re
finished_years_proper= os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))
if i.endswith(".txt"):
    with open(finished_years_proper + i, 'r') as all_years:
        for line in all_years:
            start = r'Body\n\n\n'
            docs = re.findall(start, all_years)
            end = r'Load-Date:'
            docs = re.findall(end, all_years)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, all_years, re.S)
However, I'm returning a type error:
File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After working Benedictanjw's suggestion into my code, this is what I ended up with:
all_years= []
for fyp in finished_years_proper: #fyp is each text file in folder
    with open(fyp, 'r') as year:
        for line in year: #line is each element in each text file in folder
            start = r'Body\n\n\n'
            docs = re.findall(start, line)
            end = r'Load-Date:'
            docs = re.findall(end, line)
            regex = start+r'(.+?)'+end
            articles = re.findall(regex, line, re.S)
            all_years.append(articles) #append strings to reflect RegEx
parsed_documents= all_years.append(articles)
print(parsed_documents) #returns None. Apparently this is okay.
Does the 'None' mean that the parsing of each file is successful (as in it emulates the result I had when I tested the RegEx on a single file)? And if so, how can I visualise my output without returning None. Many thanks in advance!!
The problem shows because finished_years_proper is a list and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate i with that list. I presume you had accidentally defined i elsewhere as a string. I guess you probably want to do something like:
all_years = []
for fyp in finished_years_proper:
    with open(fyp, 'r') as year:
        for line in year:
            ... # your regex search on year
            all_years.append(xxx)
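On the follow-up about None: list.append changes the list in place and returns None, so assigning its result to parsed_documents always gives None regardless of whether anything matched; printing all_years shows the collected results instead. Note also that a pattern containing several newlines cannot match when the file is read one line at a time, so the sketch below reads each file whole. The function name `parse_folder`, the temporary folder and the file contents are all invented for the demonstration; the pattern is the one from the question.

```python
import os
import re
import tempfile

# Same pattern as in the question: text between 'Body' + three newlines and 'Load-Date:'
ARTICLE = re.compile(r'Body\n\n\n(.+?)Load-Date:', re.S)

def parse_folder(folder):
    """Return {filename: [article bodies]} for every .txt file in folder."""
    parsed = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.txt'):
            continue
        # Join folder and file name instead of concatenating a list and a string
        with open(os.path.join(folder, name)) as year:
            text = year.read()  # whole file, so the pattern can span lines
        parsed[name] = ARTICLE.findall(text)
    return parsed

# Demo with a throwaway folder and invented file contents
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, '1988.txt'), 'w') as f:
        f.write('Title\nBody\n\n\nFirst article.\nLoad-Date: x\n')
    with open(os.path.join(folder, 'notes.md'), 'w') as f:
        f.write('ignored')
    parsed_documents = parse_folder(folder)

print(parsed_documents)
```

Keeping the results in a dictionary keyed by file name makes it straightforward to write each year back out to its own file later.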

using Python how to remove redundancy from rows of text file

Hello guys, I am using the RCV1 dataset. I want to remove duplicate words or tokens from the text file, but I am not sure how to do it. These are not duplicate rows; they are repeated words within articles. I am using Python; please help me with this. Please see the attached image to get an idea of the text file.
Assuming that the words in the text file are separated only by blank spaces (i.e., no attached commas or periods), the following code should work for you.
items = []
with open("data.txt") as f:
    for line in f:
        items += line.split()
newItemList = list(set(items))
If you would like to have the items as a single string:
newItemList = " ".join(list(set(items)))
If you want the order to be preserved as well, then do
newItemList = []
for item in items:
    if item not in newItemList:
        newItemList += [item]
newItemList = " ".join(newItemList)
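The `item not in newItemList` test scans the list on every iteration, which can get slow for long files. One alternative, sketched here under the assumption of Python 3.7 or later (where dictionaries keep insertion order), is dict.fromkeys. The helper name `unique_in_order` and the sample sentence are made up for illustration.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and (Python 3.7+) keep insertion order
    return list(dict.fromkeys(items))

# Invented sample line; the real data would come from line.split()
items = "the cat saw the other cat".split()
new_item_list = " ".join(unique_in_order(items))
print(new_item_list)
```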

Python 3.5, how to remove the brackets and quotes from an element when printing or sending the value to a function?

I am reading a list of states from a file into a list:
import csv
mystk = []
with open('state_list.txt') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        mystk.append(row)
After the read I am adding the values in to a list.
print(str(mystk[0]).strip())
i=0
while i < 10:
    strList = mystk[i]
    print('Print:',strList)
    i = i +1
The output of the above is :
Print: ['AL']
Print: ['AK']
Print: ['AZ']
Print: ['AR']
Print: ['CA']
Print: ['CO']
Print: ['CT']
Print: ['DE']
Print: ['FL']
Print: ['GA']
I am trying to achieve the following:
Print: AL
Print: AK
Print: AZ
Print: AR
Print: CA
Print: CO
Print: CT
Print: DE
Print: FL
Print: GA
I guess I could write a function or loop to strip out the ['?'] using regex or code like this:
i=0
while i < 10:
    strList = mystk[i]
    strList = str(strList).replace("['", "")
    strList = strList.replace("']", "")
    print(' ','Print:',strList)
    i = i +1
However, I was hoping there was an easier way than the code above. I am new to Python, though, and if this is the only way, then it works for me.
These are the recommendations I mentioned in my comment, plus some others:
import csv
def getTID(file='TID.csv', delim='\n'):
    result = []
    with open(file) as csvTID:
        readCSV = csv.reader(csvTID, delimiter=delim)
        for row in readCSV:
            result.append( row[0] )
    return result
stockList = getTID()
for x in stockList:
    print(x)
Here, using arguments gives the function more flexibility, and the default values retain the original behavior. That way you don't need to modify your code (or the name of the file) if you want to use the function with another file such as 'TID_2.csv'; in that case just call getTID('TID_2.csv'). And because the function doesn't touch any global variable, you can hold the data from two or more files in different variables if you need to, for example:
stockList1 = getTID('TID_1.csv')
stockList2 = getTID('TID_2.csv')
Every line of the CSV file is split on commas into a list. To get the string joined by commas again, use str.join:
sep = ", "
for row in mystk:
    print(' ', 'Print:', sep.join(row))
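The brackets and quotes in the original output appear because csv.reader yields each row as a list of strings, even when the file has a single column. A minimal sketch of taking the string out of each row is shown below; the `io.StringIO` object is a stand-in for state_list.txt with invented contents, so the snippet runs without the file.

```python
import csv
from io import StringIO

# Stand-in for state_list.txt: one state code per line
csvfile = StringIO("AL\nAK\nAZ\n")

states = []
for row in csv.reader(csvfile, delimiter=','):
    if row:                    # skip blank lines, which yield an empty list
        states.append(row[0])  # take the string out of the one-element list

for state in states:
    print('Print:', state)
```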
Guys, thank you all very much; learning this stuff is awesome. So many ways to do things. After reading your comments and understanding the concepts, I have written the following function to get the stock list I need:
import csv
stockList = []
def getTID():
    with open('TID.csv') as csvTID:
        readCSV = csv.reader(csvTID,delimiter='\n')
        for row in readCSV:
            stockList.append((row[0]))
getTID()
for x in stockList[:]: print(x)
This returns the list as expected: VOD.L, APPL, etc.

This code's output needs to be sent to a separate text file; why doesn't it work?

I have this piece of code to find the position of the first occurrence of each word and use those positions in place of the words.
I have tried this:
sentence = "ask not what you can do for your country ask what your country can do for you"
listsentence = sentence.split(" ")
d = {}
i = 0
values = []
for i, word in enumerate(sentence.split(" ")):
    if not word in d:
        d[word] = (i + 1)
    values += [d[word]]
print(values)
example = open('example.txt', 'wt')
example.write(str(values))
example.close()
How do I write this output to a separate text file that I can open in something like Notepad?
Actually, your code works: example.txt is created each time you run this program. You can check that this file exists in your directory.
If you want to open it right after closing it in your script, add:
import os
os.system("notepad example.txt")
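To confirm from within the script that the file was written, you can read it back. The sketch below is one way to do that under a few assumptions: it writes to a temporary folder rather than the current directory, stores the positions as space-separated numbers instead of the list's string form, and uses a with block so the file is closed automatically. The variable name `first_seen` is made up for the example.

```python
import os
import tempfile

sentence = "ask not what you can do for your country ask what your country can do for you"

first_seen = {}   # word -> 1-based position of its first occurrence
values = []
for i, word in enumerate(sentence.split(), start=1):
    first_seen.setdefault(word, i)
    values.append(first_seen[word])

# Written to a temporary folder here; any writable path works the same way
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as example:
    example.write(' '.join(str(v) for v in values))

# Read the file back to check what was actually stored
with open(path) as example:
    written = example.read()
print(written)
```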
