I am creating a .csv file by iterating through an input file. My current code for the specific column this question is about looks like this:
input_filename = sys.argv[1]
output_filename = sys.argv[2]
f = open(sys.argv[3]).read()
list.append(("A B", f[0:2], "numeric", "A B"))
For the 'f[0:2]' portion of the code, rather than having it append the first few characters of f as a whole file (which obviously makes it append the same first characters every time), I want it to append [0:2] of the next line in f each time the loop executes. I have tried:
list.append(("A B", f.line[0:2], "numeric", "A B"))
and other similar approaches, to no avail. I hope this question is clear - if not, I am happy to clarify. Any suggestions for putting this stipulation into this append line are appreciated!
Thank you!
It's a little hard for me to guess what you're trying to do here, but is this something like what you're looking for?
Contents of data.txt
abc
def
The code:
# I'm simply replacing your names so I can test this more easily
input_filename = 'input.txt'
output_filename = 'output.txt'
data_filename = 'data.txt'
transformed_data = []
with open(data_filename) as df:
    for line in df:
        # remove surrounding whitespace - assuming you want this
        line = line.strip()
        if line:  # make sure there are non-whitespace characters left
            transformed_data.append(("A B", line[0:2], "numeric", "A B"))

print(transformed_data)
# produces
# [('A B', 'ab', 'numeric', 'A B'), ('A B', 'de', 'numeric', 'A B')]
If you're working with .csv files, I highly recommend the csv library that comes with Python. Let it handle encoding and formatting for you.
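For instance, the rows collected above could be written out like this (a sketch; the output filename is my own):

```python
import csv

# The same kind of rows the loop above collects
rows = [("A B", "ab", "numeric", "A B"),
        ("A B", "de", "numeric", "A B")]

# newline='' lets the csv writer control line endings itself;
# the writer handles delimiters and any quoting for you.
with open("output.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)
```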
I'm trying to open up 32 .txt files, extract some text from them (using RegEx) and then save them as individual files again (later on in the project I'm hoping to collate them together). I've tested the RegEx on a single file and it seems to work:
import os
import re

os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation planning\Manual scrape\Finished years proper')

with open('1988.txt') as txtfile:
    text = txtfile.read()
#print(len(text)) #sentences in text

start = r'Body\n\n\n'
docs = re.findall(start, text)
print('Found the start of %s documents.' % len(docs))

end = r'Load-Date:'
docs = re.findall(end, text)
print('Found the end of %s documents.' % len(docs))

regex = start + r'(.+?)' + end
articles = re.findall(regex, text, re.S)
print('You have now parsed the 154 articles so only the body of content remains. All metadata has been removed.')
print('Here is an example of a parsed article:', articles[0])
Now I want to perform the exact same thing on all the .txt files in that folder, but I can't figure out how to. I've been playing around with for loops but with little success. Currently I have this:
import os
import re

finished_years_proper = os.listdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
os.chdir(r'C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Manual scrape\Finished years proper')
print('There are %s .txt files in this folder.' % len(finished_years_proper))

for i in finished_years_proper:
    if i.endswith(".txt"):
        with open(finished_years_proper + i, 'r') as all_years:
            for line in all_years:
                start = r'Body\n\n\n'
                docs = re.findall(start, all_years)
                end = r'Load-Date:'
                docs = re.findall(end, all_years)
                regex = start + r'(.+?)' + end
                articles = re.findall(regex, all_years, re.S)
However, I'm returning a type error:
File "C:\Users\garet\OneDrive - University of Exeter\Masters\Year Two\Dissertation\Method\Python\untitled1.py", line 15, in <module>
with open(finished_years_proper + i, 'r') as all_years:
TypeError: can only concatenate list (not "str") to list
I'm unsure how to proceed... I've seen on other forums that I should convert something into a string, but I'm not sure what to convert or even if this is the right way to proceed. Any help with this would be really appreciated!
After taking Benedictanjw's answer into my code, this is what I ended up with:
all_years = []
for fyp in finished_years_proper:  # fyp is each text file in folder
    with open(fyp, 'r') as year:
        for line in year:  # line is each element in each text file in folder
            start = r'Body\n\n\n'
            docs = re.findall(start, line)
            end = r'Load-Date:'
            docs = re.findall(end, line)
            regex = start + r'(.+?)' + end
            articles = re.findall(regex, line, re.S)
            all_years.append(articles)  # append strings to reflect RegEx

parsed_documents = all_years.append(articles)
print(parsed_documents)  # returns None. Apparently this is okay.
Does the 'None' mean that the parsing of each file succeeded (as in it emulates the result I had when I tested the RegEx on a single file)? And if so, how can I visualise my output without it returning None? Many thanks in advance!!
The problem shows up because finished_years_proper is a list, and in your line:
with open(finished_years_proper + i, 'r') as all_years:
you are trying to concatenate i with that list. I presume you had accidentally defined i elsewhere as a string. I guess you probably want to do something like:
all_years = []
for fyp in finished_years_proper:
    with open(fyp, 'r') as year:
        for line in year:
            ...  # your regex search on line
            all_years.append(xxx)
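One thing to watch: the pattern Body\n\n\n ... Load-Date: spans several lines, so the regex has to run over each file's entire contents rather than line by line. A fuller sketch along those lines (the parse_folder wrapper and its folder argument are my own naming):

```python
import os
import re

# re.S lets '.' match newlines, so one pattern can capture an article
# body that spans several lines.
PATTERN = re.compile(r'Body\n\n\n(.+?)Load-Date:', re.S)

def parse_folder(folder):
    """Return one list of article bodies per .txt file in folder."""
    all_years = []
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            with open(os.path.join(folder, name), 'r') as year:
                text = year.read()  # whole file, not line by line
                all_years.append(PATTERN.findall(text))
    return all_years
```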
I'm kind of on a time crunch, but this was one of my problems in my homework assignment. I am stuck, and I don't know what to do or how to proceed.
Our assignment was to open various text files, and for each of the text files we are supposed to add its words to a dictionary in which the key is the document number it came from and the value is the list of words.
For example, one text file would be:
1
Hello, how are you?
I am fine and you?
Each of the text files begins with a number corresponding to its title (for example, "document1.txt" begins with "1", "document2.txt" begins with "2", etc.)
My teacher gave us this coding to help with stripping the punctuation and the lines, but I am having a hard time figuring out where to implement it.
data = re.split("[ .,:;!?\s\b]+|[\r\n]+", line)
data = filter(None, data)
I don't really understand where the filter(None, data) stuff comes into play, because all it does is return something like <filter object at 0x...>, a representation of the object in memory.
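In Python 3, filter() returns a lazy filter object, which is why printing it only shows its type and memory address; wrap it in list() to see the contents. For example:

```python
import re

line = "Hello, how are you?"
data = re.split(r"[ .,:;!?\s\b]+|[\r\n]+", line)
print(data)                      # ['Hello', 'how', 'are', 'you', ''] -- note the empty string
data = list(filter(None, data))  # filter(None, ...) drops the empty strings
print(data)                      # ['Hello', 'how', 'are', 'you']
```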
Here's my code so far:
def invertFile(list_of_file_names):
    import re
    diction = {}
    emplist = []
    fordiction = []
    for x in list_of_file_names:
        afile = open(x, 'r')
        with afile as f:
            for line in f:
                savedSort = filterText(f)

def filterText(line):
    import re
    word_delimiters = [' ', ',', ';', ':', '.', '?', '!']
    data = re.split("[ .,:;!?\s\b]+|[\r\n]+", f)
    key, value = data[0], data[1:]
    diction[key] = value
How do I make it so each word is appended into a dictionary, where the key is the document it comes from and the values are the words in the document? Thank you.
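Not a complete answer to the assignment, but a minimal sketch of one way to wire the teacher's snippet in (the function and variable names are mine; the first line of each file is taken as the key):

```python
import re

def invert_files(list_of_file_names):
    """Map each document's number (its first line) to the list of
    words in that document."""
    diction = {}
    for name in list_of_file_names:
        with open(name) as f:
            key = f.readline().strip()  # e.g. "1" for document1.txt
            words = []
            for line in f:
                data = re.split(r"[ .,:;!?\s\b]+|[\r\n]+", line)
                words.extend(filter(None, data))  # drop empty strings
            diction[key] = words
    return diction
```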
I need help with concatenating two text files based on common strings.
My first txt file looks like this:
Hello abc
Wonders xyz
World abc
And my second txt file looks like this:
abc A
xyz B
abc C
I want my output file to be:
Hello abc A
Wonders xyz B
World abc C
My Code goes something like this:
a = open("file1", "r")
b = open("file2", "r")
c = open("output", "w")

for line in b:
    chk = line.split(" ")
    for line_new in a:
        chk_new = line_new.split(" ")
        if (chk_new[0] == chk[1]):
            c.write(chk[0])
            c.write(chk_new[0])
            c.write(chk_new[1])
But when I use this code, I get the output as:
Hello abc A
Wonders xyz B
Hello abc C
Line 3 mismatch occurs. What should I do to get it the correct way?
I'm afraid you are mistaken, your code does not produce the output you say it does.
Partly because a file can only be read once, the exception being if you move the read cursor back to the beginning of the file with file.seek(0).
Partly because the second element of a line in the first file ends with a newline character, thus you are comparing e.g. "abc" with "abc\n" etc. which will never be true.
Hence the output file will be completely empty.
So how do you solve the problem? Reading a file more than once seems overly complicated, don't do that. I suggest you do something along the lines of:
# open all the files simultaneously
with open('file1', 'r') as f1, open('file2', 'r') as f2, open('output', 'w') as outf:
    lines_left = True
    while lines_left:
        f1_line = f1.readline().rstrip()
        # check if there's more to read
        if len(f1_line) != 0:
            f1_line_tokens = f1_line.split(' ')
            # no need to strip the line from the second file
            f2_line_tokens = f2.readline().split(' ')
            if f1_line_tokens[1] == f2_line_tokens[0]:
                outf.write(f1_line + ' ' + f2_line_tokens[1])
        else:
            lines_left = False
I've tested it on your example input and it produces the correct output (where file1 is the first example file and file2 the second). If we talk about huge files (millions of lines), this version will be considerably faster than Aaron's. In other cases the performance difference will be negligible.
The open streams aren't safe and you can only read a file once. Do this:
aLines = []
bLines = []

with open("file1", "r") as a:
    for line in a:
        aLines.append(line.strip().split(" "))

with open("file2", "r") as b:
    for line in b:
        bLines.append(line.strip().split(" "))

bLines.reverse()

with open("output", "w") as c:
    for chk in aLines:
        chk_new = bLines.pop()
        if chk_new[0] == chk[1]:
            # join with spaces and restore the newline that strip() removed
            c.write(chk[0] + " " + chk_new[0] + " " + chk_new[1] + "\n")
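Since the two files are matched line by line, zip() over both file objects gives a shorter variant of either approach (a sketch; the function and parameter names are mine):

```python
def join_files(path_a, path_b, path_out):
    # zip() pairs line N of the first file with line N of the second,
    # so no file ever has to be read twice.
    with open(path_a) as a, open(path_b) as b, open(path_out, "w") as out:
        for line_a, line_b in zip(a, b):
            first, key_a = line_a.split()
            key_b, last = line_b.split()
            if key_a == key_b:  # lines are matched by position
                out.write("{} {} {}\n".format(first, key_a, last))
```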
I need to search for a name in a file, and in the line starting with that name I need to replace the fourth item in the list that is separated by commas. I began trying to program this with the following code, but I have not got it to work.
with open("SampleFile.txt", "r") as f:
    newline = []
    for word in f.readlines():
        newline.append(word.replace(str(String1), str(String2)))

with open("SampleFile.txt", "w") as f:
    for line in newline:
        f.writelines(line)
#this piece of code replaced every occurrence of String1 with String2

f = open("SampleFile.txt", "r")
for line in f:
    if line.startswith(Name):
        if line.contains(String1):
            newline = line.replace(str(String1), str(String2))
#this came up with a syntax error
You could give some dummy data, which would help people answer your question. I also suggest backing up your data: save the edited data to a new file, or copy the old file to a backup folder before working on it (think about using "from shutil import copyfile" and then "copyfile(src, dst)"). Otherwise, by making a mistake you could easily ruin your data without being able to easily restore it.
You can't replace the string with "newline = line.replace(str(String1), str(String2))"! Think about "strong" as your search term and a line like "Armstrong,Paul,strong,44" - if you replace "strong" with "weak" you would get "Armweak,Paul,weak,44".
I hope the following code helps you:
filename = "SampleFile.txt"
filename_new = filename.replace(".", "_new.")
search_term = "Smith"

with open(filename) as src, open(filename_new, 'w') as dst:
    for line in src:
        if line.startswith(search_term):
            items = line.split(",")
            items[4-1] = items[4-1].replace("old", "new")
            line = ",".join(items)
        dst.write(line)
If you work with a csv-file you should have a look at the csv module.
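The same edit with the csv module saves the manual split/join (a sketch wrapped in a function of my own naming, with the same assumed data):

```python
import csv

def replace_fourth_field(filename, search_term):
    """Copy filename to a *_new.* file, replacing "old" with "new" in
    the fourth field of rows whose first field equals search_term."""
    filename_new = filename.replace(".", "_new.")
    with open(filename, newline="") as src, \
         open(filename_new, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[0] == search_term:
                row[3] = row[3].replace("old", "new")  # fourth field
            writer.writerow(row)
    return filename_new
```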
PS My files contain the following data (the filenames are not in the files!):

SampleFile.txt:
Adams,George,m,old,34
Adams,Tracy,f,old,32
Smith,John,m,old,53
Man,Emily,w,old,44

SampleFile_new.txt:
Adams,George,m,old,34
Adams,Tracy,f,old,32
Smith,John,m,new,53
Man,Emily,w,old,44
The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys

print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value", "identity(%)", "score"))

with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing new lines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation like 2e-34
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all new lines, as far I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are lacking closing ()
2) Something like trueEvalue.rstrip() doesn't mutate the string; you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n then it is true that trueEvalue.rstrip() would be free of newlines. But the problem is that your values seem to carry the newline on the left, something like \n4e-43. If you simply use .strip() then newlines will be removed from either side.
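Points 2 and 3 in a quick demo (the example value is assumed from the output shown above):

```python
trueEvalue = "4e-43\n"
trueEvalue.rstrip()               # returns "4e-43", but the result is thrown away
assert trueEvalue == "4e-43\n"    # the original string is unchanged

trueEvalue = trueEvalue.rstrip()  # rebind the name to keep the change
assert trueEvalue == "4e-43"

# If the newline sits on the left instead, rstrip() won't touch it:
leading = "\n4e-43"
assert leading.rstrip() == "\n4e-43"
assert leading.strip() == "4e-43"  # strip() trims both ends
```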