I am trying to loop over a large CSV file, writing all the lines but the variable names to a new file, while playing around with efficient ways of doing so. I'm using islice from itertools. Does anyone have any tips for a more efficient way than my code below?
from itertools import islice

var = len(csv)
with open("csv_file1.csv") as file1, open("trial1.csv", 'w') as file2:
    head1 = list(islice(file1, var))[0].split(",")
    while (var > 1):
        for line in head1:
            file2.write(str(head1))
        file2.write("\n")
        var = var - 1
        print(var)
file2.close()
Use the csv module, as suggested in the comments.
Wrap your incoming file into a generator, which is a good practice for dealing with any stream, including a csv file
import csv

def read_csv(filename):
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            yield row
After that, read_csv("csv_file1.csv") gives you a generator that you can use either in a for loop or with map/filter functions, depending on the logic of the row transformation.
I am trying to make the code below loop through all the PDFs in my directory, extract the text from these PDFs and print it in one block.
I am currently getting stuck in a forever while loop. Additionally, how can my code be modified to perform the same function using a for loop?
import glob
import PyPDF2

pdfs = glob.glob("/private/babik/*.pdf")
file_name = "Announcement"
index = 0
while index<=len(pdfs):
    pdfFileObj = open(str(pdfs[index]), 'rb')
# creating a pdf reader objecct
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
print(pdfReader.numPages)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()
index += 1
In Python, the spaces at the beginning of a line affect how the code is executed.
You have to fix the spacing of your code to get out of the forever loop, indenting all lines after while index<=len(pdfs): by four spaces (four spaces is the standard Python indentation). Note also that the condition should be index < len(pdfs), not <=, or the last iteration will run one past the end of the list.
Lines after the : of for, while, if, ... must be indented to indicate which lines belong to the for, while, if, ... block.
And if you don't need the index to access some other list in parallel with the one you loop over, always prefer a for loop to a while loop, as suggested in the answer by tdelaney.
Your while loop contains a single line pdfFileObj = open(str(pdfs[index]), 'rb') which does not increment index. Since index never changes, the while never terminates.
Python's for loop is a better way to process the items of a list. You could rewrite your code to get rid of index completely.
import glob
import PyPDF2

pdfs = glob.glob("/private/babik/*.pdf")
for pdf in pdfs:
    with open(pdf, 'rb') as pdfFileObj:
        # creating a pdf reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
        print(pdfReader.numPages)
        pageObj = pdfReader.getPage(0)
        print(pageObj.extractText())
You are not increasing the "index" in the while loop;
you should write

index = 0
while index < len(pdfs):
    pdfFileObj = open(str(pdfs[index]), 'rb')
    index = index + 1

(note the < rather than <=, so the last index stays inside the list), or alternatively you can use a for loop in this way, iterating directly over the pdfs list:

for pdf in pdfs:
    pdfFileObj = open(str(pdf), 'rb')
I am saving an array into a csv file, using this code:
def save_annotations(**kwargs):
    ann = request.get_json()
    print(ann)
    filename = ann[3].split('.')[0]
    run_id = ann[4]
    run_number = ann[4].split('/')[0]
    exp_id = ann[4].split('/')[1]
    ann_type = ann[2]
    if ann_type == 'wrongDetection':
        with open(f"/code/data/mlruns/{run_number}/{exp_id}/wrong_annotations_(unknown)_{run_id.replace('/', '_')}.csv", 'a') as w_ann:
            writer = csv.writer(w_ann, delimiter=',')
            writer.writerow(ann[0:2])
            w_ann.close()
    else:
        with open(f"/code/data/mlruns/{run_number}/{exp_id}/new_detections_(unknown)_{run_id.replace('/', '_')}.csv", 'a') as w_ann:
            writer = csv.writer(w_ann, delimiter=',')
            writer.writerow(ann[0:2])
            w_ann.close()
However, I don't want repeated rows in my csv file. I only want to write to csv if ann[0] and ann[1] are not in the csv already.
What would be the best approach to do this?
kind regards
One way to do this would be to collect the already existing values in a set, and check new values to see if they are in the set before processing. You would need a set for each csv file.
For example:
def build_set(filename):
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        # Skip header row, if necessary
        next(reader)
        return {tuple(row[0:2]) for row in reader}
Then in your function you could do:
if tuple(ann[0:2]) in set_for_this_file:
    return  # already present, nothing to write
set_for_this_file.add(tuple(ann[0:2]))
# write data to file
Building the sets would require reading through all the csv files each time the program is executed, which might be inefficient if the files were large and/or numerous.
A more efficient approach might be to store the data in a database table, with columns for ann[0], ann[1], ann_type, exp_id, run_number and run_id. Add a unique constraint over these columns and you would have the same functionality.
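A minimal sketch of that idea with the standard library's sqlite3 (the table and column names here are made up for illustration): a UNIQUE constraint plus INSERT OR IGNORE makes the database drop duplicates for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a real file path in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS annotations (
        ann0 TEXT, ann1 TEXT, ann_type TEXT,
        exp_id TEXT, run_number TEXT, run_id TEXT,
        UNIQUE (ann0, ann1, ann_type, exp_id, run_number, run_id)
    )
""")

def save_if_new(ann0, ann1, ann_type, exp_id, run_number, run_id):
    # INSERT OR IGNORE silently skips rows violating the UNIQUE constraint
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO annotations VALUES (?, ?, ?, ?, ?, ?)",
            (ann0, ann1, ann_type, exp_id, run_number, run_id),
        )
```

This avoids re-reading the CSV files on every run: the database enforces uniqueness persistently, and you can still export to CSV afterwards if needed.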
I have this in a csv file:
Titre,a,b,c,d,e
01,jean,paul,,
01,,,jack,
02,jeanne,jack,,
02,,,jean
and I want:
Titre,a,b,c,d,e
01,jean,paul,jack,
02,jeanne,jack,,jean
Can you help me?
In general, a good approach is to read the csv file and iterate through the rows using Python's CSV module.
CSV will create an iterator that will let you loop through your file like this:
import csv

with open('your filename.csv', 'r') as infile:
    reader = csv.reader(infile)
    for line in reader:
        for value in line:
            pass  # Do your thing
You're going to need to construct a new data set that has different properties. The requirements you described:
Ignore any empty cells
Any time you encounter a row that has a new index number, add a new row to your new data set
Any time you encounter a row that has an index number you've seen before, add it to the row that you already created (except for that index number value itself)
I'm not writing that part of the code for you because you need to learn and grow. It's a good task for a beginner.
Once you've constructed that data set, it will look like this:
example_processed_data = [["Titre","a","b","c","d","e"],
                          ["01","jean","paul","jack"],
                          ["02","jeanne","jack","","jean"]]
You can then create a CSV writer, and create your outfile by iterating over that data, similarly to how you iterated over the infile:
with open('outfile.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for line in example_processed_data:
        writer.writerow(line)

print("Done! Wrote", len(example_processed_data), "lines to outfile.csv.")
I am trying to read a CSV file into Python 3 using the unicodecsv library. Code follows:
with open('filename.csv', 'rb') as f:
    reader = unicodecsv.DictReader(f)
    Student_Data = list(reader)
But the order of the columns in the CSV file is not retained when I output any element from the Student_Data. The output contains any random order of the columns. Is there anything wrong with the code? How do I fix this?
As stated in the csv.DictReader documentation, the DictReader object returns each row as a dict, and on older Python versions plain dicts do not preserve insertion order (since Python 3.7 they do, and since Python 3.8 DictReader itself yields regular dicts in file order).
You can obtain the list of the fieldnames with:
reader.fieldnames
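For example (shown here with the standard csv module, whose DictReader has the same interface as unicodecsv's), fieldnames lets you put the values of each row back into file order:

```python
import csv
import io

# inline sample data standing in for the file
data = io.StringIO("name,age,city\nAda,36,London\n")
reader = csv.DictReader(data)
for row in reader:
    # rebuild the row's values in the original column order
    ordered = [row[field] for field in reader.fieldnames]
    print(ordered)  # ['Ada', '36', 'London']
```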
But if you only want to obtain a list of the field values, in original order, you can just use a normal reader:
with open('filename.csv', 'rb') as f:
    reader = unicodecsv.reader(f)
    Student_Data = list(reader)  # each row is a list of values in file order
I have an excel sheet like this:
I would like to output the data into an excel file like this:
Basically, for the common elements in columns 2, 3 and 4, I want to concatenate the values in the 5th column.
Please suggest how I could do this.
The easiest way to approach an issue like this is exporting the spreadsheet to CSV first, in order to help ensure that it imports correctly into Python.
Using a defaultdict, you can create a dictionary that has unique keys and then iterate through lines adding the final column's values to a list.
Finally you can write it back out to a CSV format:
from collections import defaultdict

results = defaultdict(list)

with open("in_file.csv") as f:
    header = f.readline()
    for line in f:
        cols = line.rstrip("\n").split(",")
        key = ",".join(cols[0:4])
        results[key].append(cols[4])

with open("out_file.csv", "w") as f:
    f.write(header)
    for k, v in results.items():
        line = '{},"{}"\n'.format(k, ", ".join(v))
        f.write(line)
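The grouping step can be checked without touching the filesystem; the sample rows below are invented for illustration:

```python
from collections import defaultdict

# invented sample lines: the first four columns form the key,
# the fifth column is collected per key
rows = [
    "a,b,c,d,1",
    "a,b,c,d,2",
    "x,y,z,w,9",
]

results = defaultdict(list)
for line in rows:
    cols = line.split(",")
    results[",".join(cols[0:4])].append(cols[4])

print(dict(results))  # {'a,b,c,d': ['1', '2'], 'x,y,z,w': ['9']}
```

Note that splitting on "," by hand assumes none of the cell values themselves contain commas; for real-world spreadsheets, csv.reader handles quoting for you.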