python3 file.readline EOF?

I am having trouble determining when I have reached the end of a file in Python with file.readline:
fi = open('myfile.txt', 'r')
line = fi.readline()
if line == EOF:  # or something similar
    dosomething()
Checking the result of a full read:
c = fi.read()
if c is None:
will not work, because then I will lose the data on the next line, and if a line only contains a newline I will miss an empty line.
I have looked at dozens of related posts, and they all just use loops that break when they are done. I am not looping, so this doesn't work for me. Also, my files are gigabytes in size with hundreds of thousands of lines; a script could spend days processing one. So I need to know how to tell when I am at the end of the file in Python 3. Any help is appreciated. Thank you!

I ran into this same exact problem. My specific issue was iteration over two files, where the shorter one was only supposed to read a line on specific reads of the longer file.
As some mentioned here, the natural Pythonic way to iterate line by line is to, well, just iterate. My solution to stick with this 'naturalness' was to use the iterator property of a file manually. Something like this:
with open('myfile') as lines:
    try:
        while True:  # Just to fake a lot of readlines and hit the end
            current = next(lines)
    except StopIteration:
        print('EOF!')
You can of course embellish this with your own IOWrapper class, but this was enough for me. Just replace all calls to readline with calls to next, and don't forget to catch the StopIteration.

The simplest way to check whether you've reached EOF with fi.readline() is to check the truthiness of the return value:
line = fi.readline()
if not line:
    dosomething()  # EOF reached
Reasoning
According to the official documentation:
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline.
and the only falsy string in Python is the empty string ('').
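To see the distinction in practice, here is a small sketch using an in-memory stream (io.StringIO, standing in for a real file handle) containing one normal line followed by one blank line:

```python
import io

# Stand-in for a file: one normal line, then a blank line.
fi = io.StringIO("first line\n\n")

print(repr(fi.readline()))  # 'first line\n' -> a normal line, truthy
print(repr(fi.readline()))  # '\n'           -> a blank line, still truthy
print(repr(fi.readline()))  # ''             -> empty string: EOF, falsy
```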

You can use the output of the tell() function to determine whether the last readline() changed the current position of the stream.
fi = open('myfile.txt', 'r')
pos = fi.tell()
while True:
    li = fi.readline()
    newpos = fi.tell()
    if newpos == pos:  # stream position hasn't changed -> EOF
        break
    else:
        pos = newpos
According to the Python Tutorial:
f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
...
In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from the f.tell(), or zero.
Since the values returned from tell() can be passed to seek(), they must be unique (even if we can't guarantee what they correspond to). Therefore, if the value of tell() before and after a readline() is unchanged, the stream position is unchanged and EOF has been reached (barring some other I/O exception, of course). Reading an empty line reads at least the newline and advances the stream position.
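The same pair also lets you peek ahead without consuming a line: record the position, attempt a read, and seek back. A minimal sketch (using io.StringIO as a stand-in; with a real text-mode file, only offsets previously returned by tell() are valid arguments to seek()):

```python
import io

fi = io.StringIO("only line\n")  # stand-in for a text-mode file

pos = fi.tell()
ahead = fi.readline()    # try to read one line ahead
fi.seek(pos)             # rewind to the recorded position

at_eof = (ahead == '')
print(at_eof)            # False: there is still a line to read
print(fi.readline())     # the peeked line is still available
```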

This is a demonstrative example using f.tell() and f.read() with a chunk of data.
Assuming my input.txt file contains:
hello
hi
hoo
foo
bar
Test:
with open('input.txt', 'r') as f:
    # Read chunks of data
    chunk = 4
    while True:
        line = f.read(chunk)
        if not line:
            line = "i've read Nothing"
            print("EOF reached. What i read when i reach EOF:", line)
            break
        else:
            print('Read: {} at position: {}'.format(line.replace('\n', ''), f.tell()))
Will output:
Read: hell at position: 4
Read: ohi at position: 9
Read: hoo at position: 14
Read: foo at position: 19
Read: bar at position: 24
EOF reached. What i read when i reach EOF: i've read Nothing

You can also use the two-argument form of iter(), which keeps calling fi.readline until it returns the sentinel value '' (the empty string that readline returns at EOF):
with open(FILE_PATH, 'r') as fi:
    for line in iter(fi.readline, ''):
        parse(line)

Related

Read unknown number of ints separated by spaces or ends of lines

I am trying to find a way to get this right. I found some bits that answer this question only partially, such as:
from sys import stdin
lines = stdin.read().splitlines()
but this only takes in ints separated by newlines, while
inp = list(map(int, input().split()))
only reads ints separated by spaces.
I am stuck on this and couldn't figure out how to combine the two. I am trying to learn how reading until EOF works.
import sys

numbers = []
for line in sys.stdin:
    numbers += [int(number) for number in line.split()]
print(numbers)
Keep in mind that you have to explicitly send EOF in the terminal for the loop to finish (Ctrl+D in bash); otherwise it will be stuck in the for loop forever.
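To check that the loop really accepts both separators, here is the same logic run against an in-memory stream standing in for sys.stdin (the input text is a made-up example):

```python
import io

# "1 2 3" on one line, then "4", then "5 6" -- mixed spaces and newlines.
fake_stdin = io.StringIO("1 2 3\n4\n5 6\n")

numbers = []
for line in fake_stdin:  # iterates line by line, just like sys.stdin
    numbers += [int(number) for number in line.split()]

print(numbers)  # [1, 2, 3, 4, 5, 6]
```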

Having to run readlines() followed by read(). Is there a more efficient approach?

This feels like a daft problem, but here goes.
I have a lot of data files to process. Each file has a variable number of lines of preamble before the main data. Processing requires that I find some values in the preamble and then read the main data into a pandas df.
From the preamble I need the number of lines, which I can do with:
with open(csvfile) as f:
    data = f.readlines()
    for num, line in enumerate(data, 0):
        if end_preamble in line:
            lines = num
I also need to find some values in the preamble that are needed to process the subsequent data. I can also do this:
with open(csvfile) as f:
    data = f.read()
    term1 = re.findall(...regex term...)
Both of these work, but read() doesn't give line numbers, since (as I understand it) it treats the text as a single string. Conversely, the list returned by readlines() can't be searched with a regex as one string (I think because it's not stored that way, but I may well be wrong).
I have a hack for now:
with open(csvfile) as f:
    data = f.read(250)
    lines = data.count('\n')
    term1 = re.findall(...)
This works because most of the time the preamble is less than 250 bytes long. But for a file with a very short or very long preamble it won't.
The files aren't huge, so I could use readlines() and also use read(), but reading the file twice seems like an inefficient way to accomplish what seems a relatively trivial task. Is there a more efficient method of combining the two needs?
Use readline() instead of readlines(). It lets you read any number of rows, but only the preamble of the file rather than the whole file:
with open(csvfile) as f:
    num = 0
    while end_preamble not in f.readline():
        num += 1
As a result, num is the number of the last row of the preamble.
EDIT.
If you want to open the file only once (a more error-prone way), you can do it like this:
import pandas

with open(csvfile, mode='rb') as f:
    preamble = b''
    line = f.readline()
    while end_preamble.encode('utf-8') not in line:
        preamble += line
        line = f.readline()
    preamble = preamble.decode('utf-8')
    data = pandas.read_table(f, ...)
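If binary mode isn't required, a single text-mode pass can collect the preamble for the regex search and count its lines at the same time, leaving the file positioned at the data. A sketch (the marker, file contents, and regex are made-up examples, with io.StringIO standing in for the open file):

```python
import io
import re

end_preamble = "#END"  # hypothetical preamble marker
# Stand-in for the file: two preamble lines, the marker, then the data.
f = io.StringIO("created: 2020\nterm1=42\n#END\na,b\n1,2\n")

preamble_lines = []
for line in f:  # single pass, line by line
    if end_preamble in line:
        break
    preamble_lines.append(line)

num = len(preamble_lines)            # number of preamble lines
preamble = ''.join(preamble_lines)   # one searchable string for the regex
term1 = re.findall(r'term1=(\d+)', preamble)

print(num, term1)  # 2 ['42']
# f is now positioned at the first data row, ready for the pandas reader
```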

Why is my string nil?

I made this simple program that reads characters until the Enter key is pressed:
var data: string
while true:
  var c = readChar(stdin) # read char
  case c
  of '\r': # if enter, stop
    break
  else: discard
  data.add(c) # add the read character to the string
echo data
But when it tries to echo data, it crashes
> ./program
hello
Traceback (most recent call last)
program.nim(11) program
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
This means data is nil. But every time I enter a character, it adds the character to data. Something goes wrong, but where?
data is initially nil when you declare it as var data: string. Use var data = "" instead to get an initialized string.
The stdin stream buffers all the characters until the Enter key is pressed, and only then submits them; I expected it to deliver characters as they were typed.
That means \r will never be matched; instead the program tries to add a character to data while data is nil, and that is what fails. I thought it failed at the echo statement.
To demonstrate, this code works:
var data = ""
while true:
  var c = readChar(stdin) # read char
  case c
  of '\e': # if escape, stop
    break
  else:
    data.add(c) # add the read character to the string
echo data

Handling piped io in Lua

I have searched a lot for this but can't find answers anywhere. I am trying to do something like the following:
cat somefile.txt | grep somepattern | ./script.lua
I haven't found a single resource on handling piped io in Lua and can't figure out how to do it. Is there a good, non-hackish way to tackle it? Preferably buffered for lower memory usage, but I'll settle for reading the whole file at once if that's the only alternative.
It would be really disappointing to have to write it into a temp file and then load it into the program.
Thanks in advance.
The standard library has io.stdin and io.stdout that you can use for input and output without having to resort to temporary files. You can also use io.read instead of someFile:read, and it will read from stdin by default.
http://www.lua.org/pil/21.1.html
Buffering is the responsibility of the operating system that is providing the pipes; you don't need to worry too much about it when writing your programs.
edit: Apparently when you mentioned buffering you were thinking about reading part of the file as opposed to loading the whole file into a string. io.read can take a numeric parameter to read up to a certain number of bytes from input, returning nil if no characters could be read.
local size = 2^13 -- good buffer size (8K)
while true do
  local block = io.read(size)
  if not block then break end
  io.write(block)
end
Another (simpler) alternative is the io.lines() iterator but without a filename inside the parentheses. Example:
for line in io.lines() do
  print(line)
end
UPDATE: To get a number of characters you can write a wrapper around this. Example:
function io.chars(n, filename)
  n = n or 1 -- default number of characters to read at a time
  local chars = ''
  local wrap, yield = coroutine.wrap, coroutine.yield
  return wrap(function()
    for line in io.lines(filename) do
      line = chars .. line .. '\n'
      while #line >= n do
        yield(line:sub(1, n))
        line = line:sub(n + 1)
      end
      chars = line
    end
    if chars ~= '' then yield(chars) end
  end)
end

for text in io.chars(30) do
  io.write(text)
end

Python 3- check if buffered out bytes form a valid char

I am porting some code from Python 2.7 to 3.4.2, and I am stuck on the bytes vs. str complication.
I read this third point in the wolf's answer:
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in binary mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
So, when I buffer-read a file (say, 1 byte each time) and the very first character happens to be a multi-byte UTF-8 sequence, how do I figure out how many more bytes to read? If I do not read the complete character, it will be skipped from processing, since the next read(x) reads x bytes relative to the last position (i.e. halfway through that character's byte sequence).
I tried the following approach:
import sys, os

def getBlocks(inputFile, chunk_size=1024):
    while True:
        try:
            data = inputFile.read(chunk_size)
            if data:
                yield data
            else:
                break
        except IOError as strerror:
            print(strerror)
            break

def isValid(someletter):
    try:
        someletter.decode('utf-8', 'strict')
        return True
    except UnicodeDecodeError:
        return False

def main(src):
    aLetter = bytearray()
    with open(src, 'rb') as f:
        for aBlock in getBlocks(f, 1):
            aLetter.extend(aBlock)
            if isValid(aLetter):
                # print("char is now a valid one")  # just for acknowledgement
                # do more, then start collecting the next character
                aLetter = bytearray()
Questions:
1. Am I doomed if I try fileHandle.seek(-ve_value_here, 1)?
2. Python must have something built in to deal with this; what is it?
3. How can I really test whether the program meets its purpose of ensuring complete characters are read (right now I only have simple English files)?
4. How can I determine the best chunk_size to make the program faster? I mean, reading 1024 bytes where the first 1023 bytes are 1-byte characters and the last one starts a multi-byte character leaves me with the only option of reading 1 byte each time.
Note: I can't rely on buffered reading, as I do not know the range of input file sizes in advance.
The answer to #2 will solve most of your issues: use an IncrementalDecoder, obtained via codecs.getincrementaldecoder. The decoder maintains state and only outputs fully decoded sequences:
#!python3
import codecs

byte_string = '\u5000\u5001\u5002'.encode('utf8')

# Get the UTF-8 incremental decoder.
decoder_factory = codecs.getincrementaldecoder('utf8')
decoder_instance = decoder_factory()

# Simple example: read two bytes at a time from the byte string.
result = ''
for i in range(0, len(byte_string), 2):
    chunk = byte_string[i:i+2]
    result += decoder_instance.decode(chunk)
    print('chunk={} state={} result={}'.format(chunk, decoder_instance.getstate(), ascii(result)))
result += decoder_instance.decode(b'', final=True)
print(ascii(result))
Output:
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result=''
chunk=b'\x80\xe5' state=(b'\xe5', 0) result='\u5000'
chunk=b'\x80\x81' state=(b'', 0) result='\u5000\u5001'
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result='\u5000\u5001'
chunk=b'\x82' state=(b'', 0) result='\u5000\u5001\u5002'
'\u5000\u5001\u5002'
Note that after the first two bytes are processed, the internal decoder state just buffers them and appends no characters to the result. The next two complete a character and leave one byte in the internal state. The final call, with no additional data and final=True, flushes the buffer; it raises an exception if an incomplete character is pending.
Now you can read your file in whatever chunk size you want, pass every chunk through the decoder, and be sure that you only ever get complete code points.
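As a sketch of that, here is a chunked read pushed through the incremental decoder, with an in-memory byte stream standing in for open(src, 'rb') and a deliberately awkward chunk size that splits characters mid-sequence:

```python
import codecs
import io

data = '我是美国人。'.encode('utf8')  # 6 characters, 3 bytes each in UTF-8
stream = io.BytesIO(data)            # stand-in for open(src, 'rb')
decoder = codecs.getincrementaldecoder('utf8')()

text = ''
while True:
    chunk = stream.read(5)           # 5 bytes: never a whole number of characters
    if not chunk:
        text += decoder.decode(b'', final=True)  # flush; raises on a dangling partial char
        break
    text += decoder.decode(chunk)    # only complete code points come out

print(text)  # 我是美国人。
```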
Note that with Python 3 you can simply open the file in text mode and declare the encoding. The chunks you read will then be Unicode code points, decoded by an IncrementalDecoder internally:
input.txt (saved as UTF-8 without BOM):
我是美国人。
Normal text.
code
with open('input.txt', encoding='utf8') as f:
    while True:
        data = f.read(2)  # reads 2 Unicode code points, not bytes
        if not data:
            break
        print(ascii(data))
Result:
'\u6211\u662f'
'\u7f8e\u56fd'
'\u4eba\u3002'
'\nN'
'or'
'ma'
'l '
'te'
'xt'
'.'
