python: How to read a file and store each line using map function? - python-3.x

I'm trying to rewrite a program that I wrote, getting rid of all the for loops.
The original code reads a file with thousands of lines that are structured like:
Ex. 2 lines of a file:
As you can see, the first line starts with LPPD;LEMD and the second line starts with DAAE;LFML. I'm only interested in the very first and second element of each line.
The original code I wrote is:
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
import time
# START
# #time=0
start = time.time()
# Defining default program argument
if len(sys.argv)==1:
    fileName = "file.txt"
else:
    fileName = sys.argv[1]
takeOffAirport = []
landingAirport = []
# Reading file
lines = 0 # Counter for file lines
try:
    with open(fileName) as file:
        for line in file:
            words = line.split(';')
            # Relevant data, item1 and item2 from each file line
            origin = words[0]
            destination = words[1]
            # Populating lists
            landingAirport.append(destination)
            takeOffAirport.append(origin)
            lines += 1
except IOError:
    print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
                        Counter(landingAirport).items()):
    # 'AIRPORT_NAME':[num_takeOffs, num_landings]
    airports_dict[key].append(value)
# Sum key values and add it as another value
for key, value in airports_dict.items():
    #'AIRPORT_NAME':[num_totalMovements, num_takeOffs, num_landings]
    airports_dict[key] = [sum(value),value]
# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
key=lambda kv:kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)
# Print results
print("\nAIRPORT"+ "\t\t#TOTAL_MOVEMENTS"+ "\t#TAKEOFFS"+ "\t#LANDINGS")
for k in airports_dict:
    print(k,"\t\t", airports_dict[k][0],
          "\t\t\t", airports_dict[k][1][1],
          "\t\t", airports_dict[k][1][0])
# #time=1
end = time.time()- start
print("\nAlgorithm execution time: %0.5f" % end)
print("Total number of lines read in the file: %u\n" % lines)
airports_dict.clear()
takeOffAirport.clear()
landingAirport.clear()
My goal is to simplify the program using map, reduce and filter. So far I have sorted out the creation of the two independent lists, one holding the first element of each file line and the other holding the second element, by using:
# Creates two independent lists with the first and second element from each line
takeOff_Airport = list(map(lambda sub: (sub[0].split(';')[0]), lines))
landing_Airport = list(map(lambda sub: (sub[0].split(';')[1]), lines))
I was hoping to find a way to open the file through a map() function and achieve exactly the same result as the original code, so that I could pass each line to the maps defined above, takeOff_Airport and landing_Airport.

So if we have a file as such
line 1
line 2
line 3
line 4
and we do like this
open(file_name).read().split('\n')
we get this
['line 1', 'line 2', 'line 3', 'line 4', '']
Is this what you wanted?
Edit 1
I feel this is somewhat redundant, but since map applies a function to each element of an iterable, we have to put our file name in a list, and of course define our function:
def open_read(file_name):
    return open(file_name).read().split('\n')
print(list(map(open_read, ['test.txt'])))
This gets us
>>> [['line 1', 'line 2', 'line 3', 'line 4', '']]

So first off, calling split('\n') on each line is silly; the line is guaranteed to have at most one newline, at the end, and nothing after it, so you'd end up with a bunch of ['all of line', ''] lists. To avoid the empty string, just strip the newline. This won't leave each line wrapped in a list, but frankly, I can't imagine why you'd want a list of one-element lists containing a single string each.
So I'm just going to demonstrate using map+strip to get rid of the newlines, using operator.methodcaller to perform the strip on each line:
from operator import methodcaller
def readFile(fileName):
    try:
        with open(fileName) as file:
            return list(map(methodcaller('strip', '\n'), file))
    except IOError:
        print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
Sadly, since your file is context managed (a good thing, just inconvenient here), you do have to listify the result; map is lazy, and if you didn't listify before the return, the with statement would close the file, and pulling data from the map object would die with an exception.
To get around that, you can implement it as a trivial generator function, so the generator context keeps the file open until the generator is exhausted (or explicitly closed, or garbage collected):
def readFile(fileName):
    try:
        with open(fileName) as file:
            yield from map(methodcaller('strip', '\n'), file)
    except IOError:
        print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
yield from will introduce a tiny amount of overhead over directly iterating the map, but not much, and now you don't have to slurp the whole file if you don't want to; the caller can just iterate the result and get a stripped line on each iteration without pulling the whole file into memory. It does have the slight weakness that opening the file will be done lazily, so you won't see the exception (if there is any) until you begin iterating. This can be worked around, but it's not worth the trouble if you don't really need it.
I'd generally recommend the latter implementation as it gives the caller flexibility. If they want a list anyway, they just wrap the call in list and get the list result (with a tiny amount of overhead). If they don't, they can begin processing faster, and have much lower memory demands.
Mind you, this whole function is fairly odd; replacing IOErrors with prints and (implicitly) returning None is hostile to API consumers (they now have to check return values, and can't actually tell what went wrong). In real code, I'd probably just skip the function and insert:
with open(fileName) as file:
    for line in map(methodcaller('strip', '\n'), file):
        # do stuff with line (with newline pre-stripped)
inline in the caller; maybe define strip_newline = methodcaller('strip', '\n') globally to use a friendlier name. It's not that much code, and I can't imagine that this specific behavior is needed in that many independent parts of your file, and inlining it removes the concerns about when the file is opened and closed.
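To tie this back to the airport counting in the question, here is a minimal sketch of how the whole tally could be expressed with map and filter instead of explicit loops. The function name top_airports and the sample lines are hypothetical (the question's sample data is not shown, so only the first two ';'-separated fields are assumed), and this is one possible arrangement rather than the asker's intended design.

```python
from collections import Counter
from operator import itemgetter

def top_airports(lines, n=10):
    """Return up to n tuples (airport, total, takeoffs, landings)."""
    # filter drops blank lines; map splits each remaining line on ';'.
    fields = map(lambda s: s.rstrip('\n').split(';'), filter(str.strip, lines))
    # Keep only the first two fields (origin, destination) of every line.
    pairs = list(map(itemgetter(0, 1), fields))
    takeoffs = Counter(map(itemgetter(0), pairs))
    landings = Counter(map(itemgetter(1), pairs))
    totals = takeoffs + landings
    return list(map(lambda kv: (kv[0], kv[1], takeoffs[kv[0]], landings[kv[0]]),
                    totals.most_common(n)))

# Hypothetical sample; only the first two fields matter here.
sample = ["LPPD;LEMD;x\n", "DAAE;LFML;y\n", "\n", "LEMD;LPPD;z\n"]
print(top_airports(sample, 2))
```

Because the function accepts any iterable of lines, an open file object can be passed directly inside a with block, which keeps the question of when the file is opened and closed in the caller's hands.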

Related

Problem with reading text then put the text to the list and sort them in the proper way

Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
This is the question. My problem is that I cannot write code that gathers the right data: my code always gives me four separate lists, one for each row!
** This is my code**
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line=line.rstrip()
    line =line.split()
    if line in lst:
        print(true)
    else:
        lst.append(line)
print(lst)
*** the text is here, please copy and paste in text editor***
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
You are not checking the presence of individual words in the list, but rather the presence of the entire list of words in that line.
With some modifications, you can achieve what you are trying to do this way:
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
print(lst)
However, a few things I would like to point out looking at your code:
Why are you using rstrip() instead of strip()?
It is better to write lst = [] than lst = list(). It is shorter, faster and more Pythonic. A descriptive name such as words would also be clearer than lst, but do not call the variable list, because that would shadow the built-in type.
You should want to remove punctuation marks attached to words, eg: ,.: which do not get removed by split()
If you want a loop body to not do anything, use pass. Why are you printing true? Also, in Python, it's True and not true.
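Putting those points together, a minimal sketch of the full exercise (collect each word once, strip attached punctuation, sort at the end) might look like the following. The function name unique_sorted_words is hypothetical, and a set is swapped in for the list membership test used above; the sample lines are taken from the text in the question.

```python
import string

def unique_sorted_words(lines):
    """Collect each distinct word once, then return them sorted."""
    seen = set()
    for line in lines:
        # split() with no argument already discards surrounding whitespace.
        for word in line.split():
            # Remove leading/trailing punctuation such as , . : from the word.
            word = word.strip(string.punctuation)
            if word:
                seen.add(word)
    return sorted(seen)

sample = [
    "But soft what light through yonder window breaks\n",
    "It is the east and Juliet is the sun\n",
]
print(unique_sorted_words(sample))
```

Note that sorted() orders capitalised words before lowercase ones, since it compares strings by code point.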

Nested For loop over csv files

I have 2 .csv datasets from the same source. I was attempting to check if any of the items from the first dataset are still present in the second.
#!/usr/bin/python
import csv
import json
import click
@click.group()
def cli(*args, **kwargs):
    """Command line tool to compare and generate a report of item that still persists from one report to the next."""
    pass
@click.command(help='Compare the keysets and return a list of keys old keys still active in new keyset.')
@click.option('--inone', '-i', default='keys.csv', help='specify the file of the old keyset')
@click.option('--intwo', '-i2', default='keys2.csv', help='Specify the file of the new keyset')
@click.option('--output', '-o', default='results.json', help='--output, -o, Sets the name of the output.')
def compare(inone, intwo, output):
    csvfile = open(inone, 'r')
    csvfile2 = open(intwo, 'r')
    jsonfile = open(output, 'w')
    reader = csv.DictReader(csvfile)
    comparator = csv.DictReader(csvfile2)
    for line in comparator:
        for row in reader:
            if row == line:
                print('#', end='')
                json.dump(row, jsonfile)
                jsonfile.write('\n')
            print('|', end='')
        print('-', end='')
cli.add_command(compare)
if __name__ == '__main__':
    cli()
Say each CSV file has 20 items in it. The code currently iterates 40 times and then stops, whereas I was expecting it to iterate 400 times and create a report of the items remaining.
Everything but the iteration seems to be working. Does anyone have thoughts on a better approach?
Iterating 40 times sounds just about right - when you iterate through your DictReader, you're essentially iterating through the wrapped file lines, and once you're done iterating it doesn't magically reset to the beginning - the iterator is done.
That means that your code will start iterating over the first item in the comparator (1), then iterate over all items in the reader (20), then get the next line from the comparator(1), then it won't have anything left to iterate over in the reader so it will go to the next comparator line and so on until it loops over the remaining comparator lines (18) - resulting in total of 40 loops.
If you really want to iterate over all of the lines (and memory is not an issue), you can store them as lists and then you get a new iterator whenever you start a for..in loop, so:
reader = list(csv.DictReader(csvfile))
comparator = list(csv.DictReader(csvfile2))
Should give you an instant fix. Alternatively, you can reset your reader 'stream' after the inner loop with csvfile.seek(0).
That being said, if you're going to compare lines only, and you expect that not many lines will differ, you can load the first line in csv.reader() to get the 'header' and then forgo the csv.DictReader altogether by comparing the lines directly. Then when there is a change you can pop in the line into the csv.reader() to get it properly parsed and then just map it to the headers table to get the var names.
That should be significantly faster on large data sets, plus seeking through the file can give you the benefit of never having the need to store in memory more data than the current I/O buffer.
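The exhaustion behaviour described above can be reproduced without any files. The following is a small sketch using in-memory CSV text; the column names, the data and the function name count_inner_iterations are made up for illustration.

```python
import csv
import io

# Hypothetical in-memory data standing in for the two CSV files.
old_csv = "key,owner\nk1,alice\nk2,bob\n"
new_csv = "key,owner\nk3,carol\nk2,bob\n"

def count_inner_iterations(materialize):
    """Run the nested comparison; return (inner iterations, matching rows)."""
    reader = csv.DictReader(io.StringIO(old_csv))
    comparator = csv.DictReader(io.StringIO(new_csv))
    if materialize:
        # Lists can be iterated again and again; a DictReader cannot.
        reader, comparator = list(reader), list(comparator)
    iterations, matches = 0, []
    for line in comparator:
        for row in reader:
            iterations += 1
            if row == line:
                matches.append(row)
    return iterations, matches

print(count_inner_iterations(materialize=False))
print(count_inner_iterations(materialize=True))
```

With the lazy readers the inner loop only runs during the first outer iteration, so the k2 row is never compared against its counterpart; with lists every pair is compared.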

IndexError: list index out of range, but list length OK

New to programming, looking for a deeper understanding of what's happening.
Goal: open a file and print the first 10 lines. (similar to head command)
Code:
with open('file') as f:
    for i in range(0,10):
        print([line.strip('\n') for line in f][i])
Result: prints first line fine, then returns the out of range error
File: Is a simple text file with 20 lines, no more than 50 chars per line
FYI - Removed range line and printed both type(list) and length(20). Printed specific indexes without issue (unless >1 in a row)
Able to get the desired result with different code, but trying to improve using with/as
You can actually iterate over a file. Which is what you should be doing here.
with open('file') as f:
    for i, line in enumerate(f, start=1):
        # Get out of the loop once we are past the 10th line
        if i > 10:
            break
        # Line already has a '\n' at the end
        print(line, end='')
The reason that your code is failing is because of your list comprehension:
[line.strip('\n') for line in f]
The first time through your loop, that consumes all of the lines in your file. Now your file has no more lines, so the next time through it builds an empty list and tries to get the [1]st element. But that doesn't exist, because there were no lines left to read once the file had reached its end.
If you wanted to keep your code mostly as-is you could do
lines = [line.rstrip('\n') for line in f]
for i in range(10):
    print(lines[i])
But that's also silly, because you could just do
lines = f.readlines()
But that's also silly if you just want up to the 10th line, because you could do this:
with open('file') as f:
    # readlines() keeps each trailing '\n', so join on the empty string
    print(''.join(f.readlines()[:10]), end='')
Some further explanation:
The shortest and worst way you could fix your code is by adding one line of code:
with open('file') as f:
    for i in range(0,10):
        f.seek(0) # Add this line
        print([line.strip('\n') for line in f][i])
Now your code will work - but this is a horrible way to get your code to work. The reason that your code isn't working the way you expect in the first place is that files are consumable iterators. That means that when you read from them eventually you run out of things to read. Here's a simple example:
import io
file = io.StringIO('''
This is is a file
It has some lines
okay, only three.
'''.strip())
for line in file:
    print(file.tell(), repr(line))
This outputs
18 'This is is a file\n'
36 'It has some lines\n'
53 'okay, only three.'
Now if you try to read from the file:
print(file.read())
You'll see that it doesn't output anything. That's because you've "consumed" the file. I mean obviously it's still on disk, but the iterator has reached the end of the file. But as shown, you can seek in the file.
print(file.tell())
file.seek(0)
print(file.tell())
print(file.read())
And you'll see your entire file printed. But what about those other positions?
file.seek(36)
print(file.read()) # => okay, only three.
As a side note, you can also specify how much to read:
file.seek(36)
print(file.read(4)) # => okay
print(file.tell()) # => 40
So when we read from a file or iterate over it we consume the iterator and get to the end of the file. Let's put your new tools to work and go back to your original code and explore what's happening.
with open('file') as f:
    print(f.tell())
    lines = [line.rstrip('\n') for line in f]
    print(f.tell())
    print(len([line for line in f]))
    print(lines)
You'll see that you're at a different location in the file. And the second list comprehension produces an empty list. That's because when a list comprehension is evaluated it executes immediately. So when you do this:
for i in range(10):
    print([line.strip('\n') for line in f][i])
The first time through, i = 0, and the list comprehension reads to the end of the file. It then takes the [0]th element of the list, which is the first line in the file. But your file iterator is now at the end of the file.
So now we get back to the top of the loop and i = 1. We try to iterate to the end of the file, but we're already there, so there are no lines to read and we get an empty list [], from which we try to take the [1]st element. But there's nothing there. So we get an IndexError.
List comprehensions can be useful, but when you're beginning it's usually much easier to write a for loop and then turn it into a list comprehension. So you might write something like this:
with open('file') as f:
    for i, line in enumerate(f):
        if i < 10:
            print(line.rstrip())
Now, we shouldn't print inside a list comprehension, so instead we'll collect everything. We start out by putting what we want:
[line.rstrip()
Now add the for bit:
[line.rstrip() for i, line in enumerate(f)
And finally add the filter and our closing bracket:
[line.rstrip() for i, line in enumerate(f) if i < 10]
For more on list comprehensions, this is a fantastic resource: http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/
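One caveat about the comprehension above: the `if i < 10` filter still iterates over every line of the file and simply discards the later ones. A possible alternative, sketched here with a hypothetical helper named head and an in-memory stand-in for the 20-line file, is to cut the iteration short with itertools.islice.

```python
import io
from itertools import islice

def head(lines, n=10):
    """Return the first n lines, stripped of trailing whitespace."""
    # islice stops after n items, so the rest of the input is never read.
    return [line.rstrip() for line in islice(lines, n)]

# Stand-in for a 20-line text file.
f = io.StringIO("".join(f"line {i}\n" for i in range(1, 21)))
print(head(f, 10))
# Only the first 10 lines were consumed; the next one is still available.
print(next(f).rstrip())
```

If the file has fewer than n lines, head simply returns what is there instead of raising IndexError.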

How to print multiple lines from a file python

I'm trying to print several lines from a text file in Python. My current code is:
f = open("sample.txt", "r").readlines()[2 ,3]
print(f)
However, I'm getting this error message:
TypeError: list indices must be integers, not tuple
Is there any way of fixing this, or of printing multiple lines from a file without printing them out individually?
You are trying to pass a tuple to the [...] subscription operation; 2 ,3 is a tuple of two elements:
>>> 2 ,3
(2, 3)
You have a few options here:
Use slicing to take a sublist from all the lines. [2:4] slices from the 3rd line and includes the 4th line:
f = open("sample.txt", "r").readlines()[2:4]
Store the lines and print specific indices, one by one:
f = open("sample.txt", "r").readlines()
print(f[2].rstrip())
print(f[3].rstrip())
I used str.rstrip() to remove the newline that's still part of the line before printing.
Use itertools.islice() and use the file object as an iterable; this is the most efficient method as no lines need to be stored in memory for more than just the printing work:
from itertools import islice
with open("sample.txt", "r") as f:
    for line in islice(f, 2, 4):
        print(line.rstrip())
I also used the file object as a context manager to ensure it is closed again properly once the with block is done.
Assign the whole list of lines to a variable, and then print lines 2 and 3 separately.
with open("sample.txt", "r") as fin:
    lines = fin.readlines()
print(lines[2])
print(lines[3])
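If the wanted lines are not next to each other, slicing no longer fits. One possible approach, sketched below with a hypothetical helper pick_lines and in-memory sample text, is to enumerate the lines and keep those whose index is in a set of wanted indices.

```python
import io

def pick_lines(lines, wanted):
    """Return the lines whose zero-based index is in wanted, newline removed."""
    wanted = set(wanted)
    return [line.rstrip('\n') for i, line in enumerate(lines) if i in wanted]

# Stand-in for sample.txt.
sample = io.StringIO("zero\none\ntwo\nthree\nfour\n")
print('\n'.join(pick_lines(sample, [2, 3])))
```

The result is a single list, so it can be printed with one call instead of one print per line.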

Python IndexError: list index out of range large file

I have a very large file ~40GB and 674,877,098 lines I want to read and extract specific columns from. I can get about 3GB of data transferred then I get the following error.
Traceback (most recent call last):
File "C:\Users\Codes\Read_cat_write.py", line 44, in <module>
tid = int(columns[2])
IndexError: list index out of range
Sample of data that is being read in.
1,100000000,100000000,39,2.704006988169216e15,310057,0
2,100000001,100000000,38,2.650346740514816e15,303904,0.01
3,100000002,100000000,37,2.136985003098112e15,245039,0.03
4,100000003,100000000,36,2.29479163101184e15,263134,0.05
5,100000004,100000000,35,1.834645477916672e15,210371,0.06
6,100000005,100000000,34,1.814063860416512e15,208011,0.08
7,100000006,100000000,33,1.808883592986624e15,207417,0.1
8,100000007,100000000,32,1.806241248575488e15,207114,0.12
9,100000008,100000000,31,1.651783621410816e15,189403,0.14
10,100000009,100000000,30,1.634821184946176e15,187458,0.16
Code
from itertools import islice
F = r'C:\Users\Outfiles\comp_cat_raw.txt'
w = open(r'C:\Users\Outfiles\comp_cat_3col.txt','a')
def filesave(TID,M,R):
    X = str(TID)
    Y = str(M)
    Z = str(R)
    w.write(X)
    w.write('\t')
    w.write(Y)
    w.write('\t')
    w.write(Z)
    w.write('\n')
N = 680000000
f = open(F) #Opens file
f.readline() # Strips Header
nlines = islice(f, N) #slices file to only read N lines
for line in nlines:
    if line !='':
        line = line.strip()
        line = line.replace(',',' ') # Replace comma with space
        columns = line.split() # Splits into column
        tid = int(columns[2])
        m = float(columns[4])
        r = float(columns[6])
        filesave(tid,m,r)
w.close()
I have looked at the file being read in at the point where the error occurs, but I don't see anything wrong with the file so I am at a loss as to the cause of this error.
Chances are there is some line with only a single comma in it, or none at all, or an empty line. Put a try-except statement around the parsing code, catch the IndexError, and print the offending line; that should show you what is wrong. Besides that, there are some things in your code that might be worth improving.
Have a look at the csv module especially. It has some optimized C-code exactly for what you want to do, so it should be much faster. This answer shows mainly how to write the iteration with csv.
This whole slice construction seems to be superfluous. A simple for line in f: will do and is the most efficient way to handle this iteration.
Use line.split(',') directly, instead of replacing them first with spaces.
Use with open(F) as f: instead of calling close yourself. For this script it might make no difference, but this way you make sure that you don't, for example, leave file handles open in case of errors.
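The suggestions above could be combined roughly as follows. This is a sketch under assumptions: the function name extract_columns, the header names and the in-memory sample are made up, and the column positions 2, 4 and 6 are taken from the question's code. One detail worth knowing is that a blank line read from a file is '\n', not '', so the question's `if line != ''` check never filters anything out, and such a line ends up as an empty column list.

```python
import csv
import io

def extract_columns(infile, outfile, wanted=(2, 4, 6)):
    """Copy the wanted columns to outfile, tab-separated; return skipped rows."""
    reader = csv.reader(infile)
    next(reader, None)  # skip the header row
    writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')
    skipped = []
    for lineno, row in enumerate(reader, start=2):
        try:
            tid = int(row[wanted[0]])
            m = float(row[wanted[1]])
            r = float(row[wanted[2]])
        except (IndexError, ValueError):
            # Remember the malformed row instead of crashing mid-file.
            skipped.append((lineno, row))
            continue
        writer.writerow([tid, m, r])
    return skipped

# Hypothetical sample: header, one good row, a blank line, a short row.
source = io.StringIO(
    "a,b,c,d,e,f,g\n"
    "1,100000000,100000000,39,2.704006988169216e15,310057,0\n"
    "\n"
    "2,100000001\n"
)
target = io.StringIO()
print(extract_columns(source, target))
print(target.getvalue(), end='')
```

With real files, both would be opened in a with block (passing newline='' as the csv documentation recommends), and the returned list shows exactly which lines were malformed.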
