Use Python to parse comma separated string with text delimiter coming from stdin - string

I have a csv file that is being fed to my Python script via stdin.
This is a comma separated file with quotations as text delimiter.
Here is an example line:
457,"Last,First",NYC
My script so far splits each line on commas, but how do I make it aware of the text-delimiter quotes?
My current script:
for line in sys.stdin:
    line = line.strip()
    line.split(',')
    print line
The code splits the name into two since it does not recognize the quotations enclosing that text field. I need the name to remain as a single element.
If it matters, the data is being fed through stdin within a hadoop-streaming program.
Thanks!

Well, you could do it more manually, with something like this:
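import sys

for line in sys.stdin:
    row = []
    enclosed = False
    word = ''
    # Walk the line one character at a time.
    for character in line.rstrip('\n'):
        if character == '"':
            enclosed = not enclosed  # toggle on every quote
        elif character == ',' and not enclosed:
            row.append(word)
            word = ''
        else:
            word += character
    row.append(word)  # the last field has no trailing comma
    print(row)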
I haven't tested it or thought about it for too long, but it seems to me it could work. Someone more fluent in Python syntax could probably find a better way to do the trick, though ;)

Attempting to answer my own question. If I read right, it may be possible to send a streaming input into csv reader like so:
import csv
import sys

for line in csv.reader(sys.stdin):
    print(line)
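To check the csv.reader idea without a Hadoop job, here is a minimal sketch in Python 3 that feeds the example line from the question through an in-memory stream instead of sys.stdin. The helper name parse_csv_lines is made up for illustration; csv.reader accepts any iterable of lines, so sys.stdin should work the same way.

```python
import csv
import io

def parse_csv_lines(stream):
    """Return each CSV record in `stream` as a list of fields."""
    # csv.reader treats double quotes as the text delimiter by default.
    return list(csv.reader(stream))

# Stand-in for sys.stdin, holding the example line from the question.
sample = io.StringIO('457,"Last,First",NYC\n')
rows = parse_csv_lines(sample)
print(rows)
```

The quoted name stays together as the single field Last,First, and the surrounding quotes are dropped.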

Related

How to read a text file and insert the data into next line on getting \n character

I have a text file where the data is comma-delimited with a literal \n character in between. I would like to start a new line right after each \n character.
text file sample:
'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,eric.knot\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,fisher.cold\n','i,am,a,software,prof\n',..
expected:
I need the output in the below form.
'what,is,your,name',
'my,name,is,david.hough',
'i,am,a,software,prof',
Tried:
file1 = open("test.text", "r")
Lines = file1.readlines()
for line in Lines:
    print(line)
result:
'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,eric.knot\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,fisher.cold\n','i,am,a,software,prof\n',..
Well, my comment does exactly what you asked: break lines at \n. Your data is structured quite weirdly, but if you want the expected result that badly, you can use a regex:
import re
file1 = open("test.text","r")
Lines = re.findall(r'\'.*?\',',file1.read().replace("\\n",""))
for line in Lines:
    print(line)
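The same regex can be tried on an in-memory string, which makes it easier to see what it does. This is only a sketch: the function name split_records is hypothetical, and the sample is a shortened copy of the data from the question, written as a raw string so that \n stays a literal backslash followed by n.

```python
import re

def split_records(text):
    """Remove literal backslash-n pairs, then collect each quoted record."""
    cleaned = text.replace("\\n", "")
    # Non-greedy match: from an opening quote to the next quote-comma pair.
    return re.findall(r"'.*?',", cleaned)

sample = r"'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n',"
records = split_records(sample)
for record in records:
    print(record)
```

Each record is printed on its own line, with the quotes and the trailing comma kept as in the expected output.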
Well, you don't need to push the data onto the next line manually. The \n does that work when you run the code.
I guess the problem is that you used quotes too often. Try using a single pair of quotes with \n after each sentence and, yes, without whitespace:
'what,is,your,name\nmy,name,is,david.hough\ni,am,a,software,prof'

extract words from a text file and print next line

sample input
in parsing a text file .txt = ["'blah.txt'", "'blah1.txt'", "'blah2.txt'" ]
the expected output in another text file out_path.txt
blah.txt
blah1.txt
blah2.txt
The code I tried is below; it just appends "[]" to the input file. I also tried a Perl one-liner to replace the double and single quotes.
read_out_fh = open('out_path.txt',"r")
for line in read_out_fh:
    for word in line.split():
        curr_line = re.findall(r'"(\[^"]*)"', '\n')
        print(curr_line)
This happens because the contents of a file are read as a string, not as a list, even if the text is formatted like a list. That is why you get [] from re.findall: in your code the pattern is also applied to '\n' rather than to the text you read. Note that for line in read_out_fh: gives you one line of text at a time, still as a string, which is why you are not getting the desired output. So I first wrote something to transform the string into a list. While doing that I also removed the "" and '' as you mentioned, then wrote the result to a new file, example.txt.
Note: change the file names to match your files.
read_out_fh = open('file.txt', "r")
for line in read_out_fh:
    # Drop the newline and the brackets, remove both kinds of quotes, then split.
    line = line.strip().strip("[]").replace('"', '').replace("'", '').split(", ")
    with open("example.txt", "w") as output:
        for word in line:
            #print(word)
            output.write(word + '\n')
example.txt(outputfile)
blah.txt
blah1.txt
blah2.txt
The code below works for the example you gave in the question:
# Content of textfile.txt:
asdasdasd=["'blah.txt'", "'blah1.txt'", "'blah2.txt'"]asdasdasd
# Code:
import re
read_in_fh = open('textfile.txt',"r")
write_out_fh = open('out_path.txt', "w")
for line in read_in_fh:
    find_list = re.findall(r'\[(".*?"*)\]', line)
    for element in find_list[0].split(","):
        element_formatted = element.replace('"','').replace("'","").strip()
        write_out_fh.write(element_formatted + "\n")
write_out_fh.close()
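A different technique from the two answers above, offered only as a sketch: since the bracketed part of the line is valid Python list syntax, ast.literal_eval can parse it instead of splitting on commas by hand. The function name extract_names is hypothetical, and the sample line is copied from the textfile.txt example above.

```python
import ast
import re

def extract_names(line):
    """Return the file names found in the first [...] list on the line."""
    match = re.search(r"\[.*?\]", line)
    if match is None:
        return []  # no bracketed list on this line
    # literal_eval safely turns the list literal into a real Python list.
    items = ast.literal_eval(match.group(0))
    # Each item still carries its inner single quotes, so strip them.
    return [item.strip("'\"") for item in items]

sample = 'asdasdasd=["\'blah.txt\'", "\'blah1.txt\'", "\'blah2.txt\'"]asdasdasd'
names = extract_names(sample)
print("\n".join(names))
```

This assumes the file names themselves contain no ] character; lines without a bracketed list simply yield an empty list instead of raising an IndexError.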

How to modify and print list items in python?

I am a beginner in Python working on a small piece of logic. I have a text file with HTML links in it, one per line. I have to read each line of the file and print each link with the same prefix and suffix,
so that the output looks like this.
<item>LINK1</item>
<item>LINK2</item>
<item>LINK3</item>
and so on.
I have tried this code, but something is wrong in my approach,
def file_read(fname):
    with open(fname) as f:
        #Content_list is the list that contains the read lines.
        content_list = f.readlines()
        for i in content_list:
            print(str("<item>") + i + str("</item>"))
file_read(r"C:\Users\mandy\Desktop\gd.txt")
In the output, the suffix was not where I expected. As I am a beginner, can anyone sort this out for me?
<item>www.google.com
</item>
<item>www.bing.com
</item>
I think that when you use .readlines() you also get the end-of-line character in i.
If I understand you correctly, you want to print
<item>www.google.com</item>
Then try stripping each line, see
https://www.journaldev.com/23625/python-trim-string-rstrip-lstrip-strip
print(str("<item>") + i.strip() + str("</item>"))
When you use the readlines() method, each line it returns still ends with the newline character from your file ("\n").
You can use the .strip() method, which removes whitespace such as spaces and newline characters from the beginning and end of each line, so the output is formatted correctly.
def file_read(fname):
    with open(fname) as f:
        #Content_list is the list that contains the read lines.
        content_list = f.readlines()
        for i in content_list:
            print(str("<item>") + i.strip() + str("</item>"))
file_read(r"C:\Users\mandy\Desktop\gd.txt")
I assume you wanted to print in the following way
<item>www.google.com</item>
When you use readlines() you get an extra '\n' at the end of each line. To avoid that, you can strip the string, and for printing you can use f-strings.
with open(fname) as f:
    lin = f.readlines()
for i in lin:
    print(f"<item>{i.strip()}</item>")
Another method:
with open('stacksource') as f:
    lin = f.read().splitlines()
for i in lin:
    print(f"<item>{i}</item>")
Here splitlines() splits the text into lines and returns them as a list, without the trailing newline characters.
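To try the strip-then-wrap approach without a file on disk, here is a small sketch that reads from an in-memory stream. The function name wrap_items is made up, and skipping blank lines is an extra assumption that the question does not ask for.

```python
import io

def wrap_items(lines):
    """Wrap every non-empty line in <item>...</item> tags."""
    wrapped = []
    for line in lines:
        link = line.strip()  # removes the trailing "\n" and stray spaces
        if link:  # assumption: blank lines should be skipped
            wrapped.append(f"<item>{link}</item>")
    return wrapped

# Stand-in for the opened text file; a file object iterates the same way.
sample = io.StringIO("www.google.com\nwww.bing.com\n")
items = wrap_items(sample)
print("\n".join(items))
```

With a real file, something like wrap_items(open(fname)) inside a with block should behave the same way.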

opening text file and change it in python

I have a big text file like this example:
example:
</Attributes>
FovCount,555
FovCounted,536
ScanID,1803C0555
BindingDensity,0.51
As you can see, some lines are empty, some are comma-separated, and others have a different format.
I would like to open the file and look for the lines that start with one of these 3 words: FovCount, FovCounted and BindingDensity. If a line starts with one of them, I want to get the number after the comma. From the numbers for FovCount and FovCounted I will compute criteria, and the final result is a list with 2 items: criteria and BD (the number after BindingDensity). I wrote the following function in Python, but it does not return what I want. Do you know how to fix it?
def QC(file):
    with open(file) as f:
        for line in f.split(","):
            if line.startswith("FovCount"):
                FC = line[1]
            elif line.startswith("FovCounted"):
                FCed = line[1]
                criteria = FC/FCed
            elif line.startswith("BindingDensity"):
                BD = line[1]
    return [criteria, BD]
You are splitting the file on commas (,). But lines aren't separated by a comma; they are separated by a newline character (\n). Note also that a file object has no split() method, so the text has to be read first.
Try changing f.split(",") to f.read().split("\n"), or use f.readlines(), which does much the same thing but keeps the trailing \n on each line.
You can then split each line into comma-separated segments using segments = line.split(",").
You can check if the first segment matches your text criteria: if segments[0] == "FovCounted", etc.
You can then obtain the value by getting the second segment: value = segments[1].
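Putting those steps together, a possible version of the function could look like the sketch below. It takes an iterable of lines so it can be tried without a file; the name qc and the WANTED tuple are made up for illustration, and converting the values with float() is an assumption, since the original code divides the raw strings. Matching the first segment exactly also avoids a second issue in the original: startswith("FovCount") is true for FovCounted lines too.

```python
WANTED = ("FovCount", "FovCounted", "BindingDensity")

def qc(lines):
    """Return [criteria, BD] from lines shaped like 'Key,value'."""
    values = {}
    for line in lines:
        segments = line.strip().split(",")
        # Exact match on the first segment, so FovCount != FovCounted.
        if len(segments) == 2 and segments[0] in WANTED:
            values[segments[0]] = float(segments[1])
    # Same ratio as in the question: FovCount divided by FovCounted.
    criteria = values["FovCount"] / values["FovCounted"]
    return [criteria, values["BindingDensity"]]

SAMPLE = """</Attributes>
FovCount,555
FovCounted,536
ScanID,1803C0555

BindingDensity,0.51
"""
result = qc(SAMPLE.splitlines())
print(result)
```

For a real file, passing the open file object (for example qc(f) inside a with open(file) as f: block) should work, because iterating over a file yields its lines.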

str.format places last variable first in print

The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys
print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value",
                                  "identity(%)", "score"))
with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing new lines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation like 2e-34
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all newlines, as far as I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are missing the closing (), so the method is never called.
2) Something like trueEvalue.rstrip() doesn't mutate the string; you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n, then it is true that trueEvalue.rstrip() would be free of newlines. But your values seem to be something like \n4e-43. If you simply use .strip(), newlines will be removed from both sides.
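Points 2 and 3 can be checked with a tiny sketch. The value below is invented for illustration, since the real BLAST output is not shown in the question.

```python
value = "\n4e-43\n"  # hypothetical value with a newline on both sides

value.rstrip()  # the result is discarded, so value is unchanged
unchanged = value

right_only = value.rstrip()  # strips the right-hand side only
both_sides = value.strip()   # strips both sides

print(repr(unchanged))
print(repr(right_only))
print(repr(both_sides))
```

Strings are immutable in Python, so every strip variant returns a new string that has to be assigned to a name to have any effect.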
