How can I expand List capacity in Python? - python-3.x

read = open('700kLine.txt')
# use readline() to read the first line
line = read.readline()
aList = []
for line in read:
try:
num = int(line.strip())
aList.append(num)
except:
print ("Not a number in line " + line)
read.close()
print(aList)
There is 700k Line in that file (every single line has max 2 digits number)
I can only get ~280k Line in that file to in my aList.
So, How can I expand aList capacity 280k to 700k or more? (Is there a different solution for this case?)
Hello, I just solved that problem. Thanks for all your helps. That was an obvious buffer problem.
Solution is just increasing the size of buffer.
link is here
Increase output buffer when running or debugging in PyCharm

Please try this.
filename = '700kLine.txt'
with open(filename) as f:
data = f.readlines()
print(data)
print(type(data)) #stores the data in a list

Yes, you can.
Once a list is defined, you can add, edit or delete its elements. To add more elements at the end, use the append function:
MyList.append(data)
Where MyList is the name of the list and data is the element you want to add.

I tried to re-create your problem:
# creating 700kLine file
with open('700kLine.txt', 'w') as f:
for i in range(700000):
f.write(str(i+1) + '\n')
# creating list from file entries
aList = []
with open('700kLine.txt', 'r') as f:
for line in f:
num = int(line.strip())
aList.append(num)
# print(aList)
print(aList[:30])
Jupyter notebook throws an error while printing all 700K lines due to too much memory used. If you really want to print all 700k values, run the python script from terminal.

It could be that your computer ran out of memory processing the file? I have tried generating an infinite loop appending a single digit to the list and I ended up with 47 million-ish len(list) >> 47119572, the code I use to test as below.
I tried this code on an online REPL and it came to a significantly lower 'len(list)`.
list = []
while True:
try:
if len(list) > 0:
list.append(list[-1] + 1)
else:
list.append(1)
except MemoryError:
print("memory error, last count is: ", list[-1])
raise MemoryError
Maybe try saving bits of data read instead of reading the whole file at once?
Just my assumption.

Related

Check for non-floats in a csv file python3

I'm trying to read a csv file, and create a 2 dimensional list from the values stored inside.
However I'm running into trouble when I try to check whether or not the values stored can be converted into floats.
Here is the function I have written, which reads the file and creates a list.
def readfile(amount, name):
tempfile = open(name).readlines()[1:] #First value in line is never a float, hence the [1:]
rain_list = []
count = 0.0
for line in tempfile:
line = line.rstrip()
part = line.split(",")
try:
part = float(part)
except ValueError:
print("ERROR: invalid float in line: {}".format(line))
rain_list.append(part[amount])
count += 1
if count == 0:
print("ERROR in reading the file.")
tempfile.close()
return rain_list
It might be a little messy, since it's essentially a patchwork of different possible solutions I have tried.
The values it gets are the name of the file (name) and the amount of values it reads from the file (amount).
Has anyone got an idea why this does not work as I expect it to work?
part is a list of strings. To check & convert for all floats, you'd have to do:
part = [float(x) for x in part]
(wrapped in your exception block)
BTW you should use the csv module to read comma-separated files. It's built-in. Also using enumerate would allow to be able to print the line where the error occurs, not only the data:
reader = csv.reader(tempfile) # better: pass directly the file handle
# and use next(reader) to discard the title line
for lineno,line in enumerate(reader,2): # lineno starts at 2 because of title line
try:
line = [float(x) for x in line]
except ValueError:
print("ERROR: invalid float in line {}: {}".format(lineno,line))

Python 3.6.1: Code does not execute after a for loop

I've been learning Python and I wanted to write a script to count the number of characters in a text and calculate their relative frequencies. But first, I wanted to know the length of the file. My intention is that, while the script goes from line to line counting all the characters, it would print the current line and the total number of lines, so I could know how much it is going to take.
I executed a simple for loop to count the number of lines, and then another for loop to count the characters and put them in a dictionary. However, when I run the script with the first for loop, it stops early. It doesn't even go into the second for loop as far as I know. If I remove this loop, the rest of the code goes on fine. What is causing this?
Excuse my code. It's rudimentary, but I'm proud of it.
My code:
import string
fname = input ('Enter a file name: ')
try:
fhand = open(fname)
except:
print ('Cannot open file.')
quit()
#Problematic bit. If this part is present, the script ends abruptly.
#filelength = 0
#for lines in fhand:
# filelength = filelength + 1
counts = dict()
currentline = 1
for line in fhand:
if len(line) == 0: continue
line = line.translate(str.maketrans('','',string.punctuation))
line = line.translate(str.maketrans('','',string.digits))
line = line.translate(str.maketrans('','',string.whitespace))
line = line.translate(str.maketrans('','',""" '"’‘“” """))
line = line.lower()
index = 0
while index < len(line):
if line[index] not in counts:
counts[line[index]] = 1
else:
counts[line[index]] += 1
index += 1
print('Currently at line: ', currentline, 'of', filelength)
currentline += 1
listtosort = list()
totalcount = 0
for (char, number) in list(counts.items()):
listtosort.append((number,char))
totalcount = totalcount + number
listtosort.sort(reverse=True)
for (number, char) in listtosort:
frequency = number/totalcount*100
print ('Character: %s, count: %d, Frequency: %g' % (char, number, frequency))
It looks fine the way you are doing it, however to simulate your problem, I downloaded and saved a Guttenberg text book. It's a unicode issue. Two ways to resolve it. Open it as a binary file or add the encoding. As it's text, I'd go the utf-8 option.
I'd also suggest you code it differently, below is the basic structure that closes the file after opening it.
filename = "GutenbergBook.txt"
try:
#fhand = open(filename, 'rb')
#open read only and utf-8 encoding
fhand = open(filename, 'r', encoding = 'utf-8')
except IOError:
print("couldn't find the file")
else:
try:
for line in fhand:
#put your code here
print(line)
except:
print("Error reading the file")
finally:
fhand.close()
For the op, this is a specific occasion. However, for visitors, if your code below the for state does not execute, it is not a python built-in issue, most likely to be: an exception error handling in parent caller.
Your iteration is inside a function, which is called inside a try except block of caller, then if any error occur during the loop, it will get escaped.
This issue can be hard to find, especially when you dealing with intricate architecture.

IndexError: list index out of range, but list length OK

New to programming, looking for a deeper understanding on whats happening.
Goal: open a file and print the first 10 lines. (similar to head command)
Code:
with open('file') as f:
for i in range(0,10):
print([line.strip('\n') for line in f][i])
Result: prints first line fine, then returns the out of range error
File: Is a simple text file with 20 lines, no more than 50 chars per line
FYI - Removed range line and printed both type(list) and length(20). Printed specific indexes without issue (unless >1 in a row)
Able to get the desired result with different code, but trying to improve using with/as
You can actually iterate over a file. Which is what you should be doing here.
with open('file') as f:
for i, line in enumerate(file, start=1):
# Get out of the loop if we hit 10 lines
if i >= 10:
break
# Line already has a '\n' at the end
print(line, end='')
The reason that your code is failing is because of your list comprehension:
[line.strip('\n') for line in f]
The first time through your loop that consumes all of the lines in your file. Now your file has no more lines, so the next time through it creates a list of all the lines in your file and tries to get the [1]st element. But that doesn't exist because there are no lines at the end of your file.
If you wanted to keep your code mostly as-is you could do
lines = [line.rstrip('\n') for line in f]
for i in range(10):
print(lines[i])
But that's also silly, because you could just do
lines = f.readlines()
But that's also silly if you just want up to the 10th line, because you could do this:
with open('file') as f:
print('\n'.join(f.readlines()[:10]))
Some further explanation:
The shortest and worst way you could fix your code is by adding one line of code:
with open('file') as f:
for i in range(0,10):
f.seek(0) # Add this line
print([line.strip('\n') for line in f][i])
Now your code will work - but this is a horrible way to get your code to work. The reason that your code isn't working the way you expect in the first place is that files are consumable iterators. That means that when you read from them eventually you run out of things to read. Here's a simple example:
import io
file = io.StringIO('''
This is is a file
It has some lines
okay, only three.
'''.strip())
for line in file:
print(file.tell(), repr(line))
This outputs
18 'This is is a file\n'
36 'It has some lines\n'
53 'okay, only three.'
Now if you try to read from the file:
print(file.read())
You'll see that it doesn't output anything. That's because you've "consumed" the file. I mean obviously it's still on disk, but the iterator has reached the end of the file. But as shown, you can seek in the file.
print(file.tell())
file.seek(0)
print(file.tell())
print(file.read())
And you'll see your entire file printed. But what about those other positions?
file.seek(36)
print(file.read()) # => okay, only three.
As a side note, you can also specify how much to read:
file.seek(36)
print(file.read(4)) # => okay
print(file.tell()) # => 40
So when we read from a file or iterate over it we consume the iterator and get to the end of the file. Let's put your new tools to work and go back to your original code and explore what's happening.
with open('file') as f:
print(f.tell())
lines = [line.rstrip('\n') for line in f]
print(f.tell())
print(len([line for line in f]))
print(lines)
You'll see that you're at a different location in the file. And the second list comprehension produces an empty list. That's because when a list comprehension is evaluated it executes immediately. So when you do this:
for i in range(10):
print([line.strip('\n') for line in f][i])
What you're doing the first time, i = 0 and then the list comprehension reads to the end of the file. Now it takes the [0]th element of the list, or the first line in the file. But your file iterator is at the end of the file.
So now we get back to the beginning of the list and i = 1. Now we iterate to the end of the file, but we're already at the end so there are no lines to read, and we've got an empty list [] that we try to get the [0]th element of. But there's nothing there. So we get an IndexError.
List comprehensions can be useful, but when you're beginning it's usually much easier to write a for loop and then turn it into a list comprehension. So you might write something like this:
with open('file') as f:
for i, line in enumerate(file, start=10):
if i < 10:
print(line.rstrip())
Now, we shouldn't print inside a list comprehension, so instead we'll collect everything. We start out by putting what we want:
[line.rstrip()
Now add the for bit:
[line.rstrip() for i, line in enumerate(f)
And finally add the filter and our closing brace:
[line.rstrip() for i, line in enumerate(f) if i < 10]
For more on list comprehensions, this is a fantastic resource: http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/

Python IndexError: list index out of range large file

I have a very large file ~40GB and 674,877,098 lines I want to read and extract specific columns from. I can get about 3GB of data transferred then I get the following error.
Traceback (most recent call last):
File "C:\Users\Codes\Read_cat_write.py", line 44, in <module>
tid = int(columns[2])
IndexError: list index out of range
Sample of data that is being read in.
1,100000000,100000000,39,2.704006988169216e15,310057,0
2,100000001,100000000,38,2.650346740514816e15,303904,0.01
3,100000002,100000000,37,2.136985003098112e15,245039,0.03
4,100000003,100000000,36,2.29479163101184e15,263134,0.05
5,100000004,100000000,35,1.834645477916672e15,210371,0.06
6,100000005,100000000,34,1.814063860416512e15,208011,0.08
7,100000006,100000000,33,1.808883592986624e15,207417,0.1
8,100000007,100000000,32,1.806241248575488e15,207114,0.12
9,100000008,100000000,31,1.651783621410816e15,189403,0.14
10,100000009,100000000,30,1.634821184946176e15,187458,0.16
Code
from itertools import islice
F = r'C:\Users\Outfiles\comp_cat_raw.txt'
w = open(r'C:\Users\Outfiles\comp_cat_3col.txt','a')
def filesave(TID,M,R):
X = str(TID)
Y = str(M)
Z = str(R)
w.write(X)
w.write('\t')
w.write(Y)
w.write('\t')
w.write(Z)
w.write('\n')
N = 680000000
f = open(F) #Opens file
f.readline() # Strips Header
nlines = islice(f, N) #slices file to only read N lines
for line in nlines:
if line !='':
line = line.strip()
line = line.replace(',',' ') # Replace comma with space
columns = line.split() # Splits into column
tid = int(columns[2])
m = float(columns[4])
r = float(columns[6])
filesave(tid,m,r)
w.close()
I have looked at the file being read in at the point where the error occurs, but I don't see anything wrong with the file so I am at a loss as to the cause of this error.
Chances are, there is some line with maybe one single comma in there, or none, or an empty line, whatever. Probably just put a try-except statement around the statement and catch the index error, probably printing out the line in question, and you should be done. Besides that, there are some things in your code, that might be worth to improve.
Have a look at the csv module especially. It has some optimized C-code exactly for what you want to do, so it should be much faster. This answer shows mainly how to write the iteration with csv.
This whole slice construction seems to be superfluous. A simple for line in f: will do and is the most efficient way to handle this iteration.
Use line.split(',') directly, instead of replacing them first with spaces.
Use with open(F) as f: instead of calling close yourself. For this script it might make no difference, but this way you make sure, that you e.g. don't create open file handles in case of errors.

How do I write a Python program that computes the average from a .dat file?

I have this so far but I don't know how to write over the .dat file:
def main():
fname = input("Enter filename:")
infile = open(fname, "r")
data = infile.read()
print(data)
for line in infile.readlines():
score = int(line)
counts[score] = counts[score]+1
infile.close()
total=0
for c in enumerate(counts):
total = total + i*c
average = float(total)/float(sum(counts))
print(average)
main()
Here is my .dat file:
4
3
5
6
7
My statistics professor expects us to learn Python to compute the mean and standard deviation. All I need to know is how to do the mean and then I've got the rest figured out. I want to know how does Python write over each line in a .dat file. Could someone tell me how to fix this code? I've never done programming before.
To answer your question, as I understand it, in three parts:
How to read the file in
in your example you use
infile.read()
which reads the entire contents of the file into a string and takes you to the end of file. Therefore the following
infile.readlines()
will read nothing more. You should omit the first read().
How to compute the mean
There are many ways to do this in python - more or less elegant - and also I guess it depends on exactly what the problem is. But in the simplest case you can just sum and count the values as you go , then divide sum by count at the end to get the result:
infile = open("d.dat", "r")
total = 0.0
count = 0
for line in infile.readlines():
print ("reading in line: ",line)
try:
line_value = float(line)
total += line_value
count += 1
print ("value = ",line_value, "running total =",total, "valid lines read = ",count)
except:
pass #skipping non-numeric lines or characters
infile.close()
The try/except part is just in case you have lines or characters in the file that can't be turned into floats, these will be skipped.
How to write to the .dat file
Finally you seem to be asking how to write the result back out to the d.dat file. Not sure whether you really need to do this, it should be acceptable to just display the result as in the above code. However if you do need to write it back to the same file, just close it after reading from it, reopen it for writing (in 'append' mode so output goes to the end of the file), and output the result using write().
outfile = open("d.dat","a")
outfile.write("\naverage = final total / number of data points = " + str(total/count)+"\n")
outfile.close()
fname = input("Enter filename:")
infile = open(fname, "r")
data = infile.readline() #Reads first line
print(data)
data = infile.readline() #Reads second line
print(data)
You can put this in a loop.
Also, these values will come in as Strings convert them to floats using float(data) each time.
Also, the guys over at StackOverflow are not as bad at math as you think. This could have easily been answered there. (And maybe in a better fashion)

Resources