CSV file appears empty when it's not - python-3.x

def main():
driver = ts_functions.start_selenium()
# A good practice to access the elements in the csv is using a dict
keys = ['age', 'numeration', 'popularity', 'country', 'num_of_comments', 'time', 'sex', 'text']
while True:
# For a file to change you gotta close it first maybe?
with open('secrets.csv', 'a', encoding='utf-8', newline='') as sf:
secrets = ts_functions.get_secrets(driver)
new_secrets = []
writer = csv.DictWriter(sf, keys)
not_first_run = exists('last_secret.csv')
if not_first_run:
with open('last_secret.csv', 'r', encoding='utf-8', newline='') as rlf:
reader = csv.DictReader(rlf)
for row in reader:
only_secret = row #There should be 1 element but it's empty
for secret in secrets:
if not_first_run:
if only_secret == secret: #<---- Error IS HERE
break
new_secrets.append(secret)
new_secrets.reverse()
for secret in new_secrets:
writer.writerow(secret)
# I use 'w' here because I only want the last message of the last while loop
# so it's ok if it erases the previous content
with open('last_secret.csv', 'w', encoding='utf-8', newline='') as wlf:
writer = csv.DictWriter(wlf, keys)
# I write the only line in the file here but then it appears empty even though
# I can see the file being created and line being written
writer.writerow(new_secrets[-1])
sleep(10)
This is my main function, the two helper functions do their job properly, but if neccessary I can share them, my program tries to scrape data from a website that uploads messages live(like twitch comments) and then store the messages(secrets) in one file(secrets.csv), then I want to use the other csv file(last_secret.csv) to identify wheter a message is new or has already been scraped(I store the newest message there and compare them to the next batch of messages, when it finds one that is the same, it stops).
The problem is that the in the second run when I try to read from the latter file the reader object is empty and an error occurs.
I checked the helper functions and they are fine, the problem is with the CSVs I changed the 'last_secret.csv' mode from 'w' to 'a' and it didn't solve anything. I checked the file's contents in the same run(the 1st one) just after writing the line in it and it also appeared empty.
This is my firt question srry for the bad formatting :p

Related

CSV appears empty just after I write sth onto it

while True:
with open('secrets.csv', 'a', encoding='utf-8', newline='') as sf:
secrets = ts_functions.get_secrets(driver)
new_secrets = []
writer_1 = csv.DictWriter(sf, keys)
not_first_run = exists('last_secret.csv')
# THIS IF STATEMENT RUNS FROM THE 2ND ITERATION ONWARDS, NOT IN THE 1ST ONE
if not_first_run:
with open('last_secret.csv', 'r', encoding='utf-8', newline='') as rlf:
reader = csv.DictReader(rlf)
for row in reader:
# THERE SHOULD ONLY BE 1 ROW BUT THE READER APPEARS EMPTY
only_secret = row
for secret in secrets:
if not_first_run:
if only_secret == secret:
break
new_secrets.append(secret)
new_secrets.reverse()
for secret in new_secrets:
writer_1.writerow(secret)
with open('last_secret.csv', 'w', encoding='utf-8', newline='') as wlf:
writer_2 = csv.DictWriter(wlf, keys)
# HERE I WRITE ONE ROW(JUST ONE) IN THE 'last_secret.csv' FILE BUT WHEN THE SECOND
# ITERATION CHECKS THE IF STATEMENT THERE IS AN ERROR BECAUSE THE READER IS EMPTY
writer_2.writerow(new_secrets[-1])
sleep(10)
My program is about web scraping but I got that part covered(I can post the related functions too if needed) but I think the problem most likely is just how I am writing the 'last_secret.csv' file. When I open the file myself in Visual Studio there is the line Ii am expecting but when the reader object reads it It appears as empty and that causes an error because I didn't declare the only_secret variable(it's None).
As I said before I was expecting the reader object to read 1 row(the only one I write onto the file) but instead It reads nothingg as if the file where empty, when I check the file for myself the line is there but the reader says otherwise. I tried changing the write mode to append but It changed nothing.
Thx in advance

Python problems writing rows in CSV

I have this script that reads a CSV and saves the second column to a list, I'm trying to get it to write the contents of the list to a new CSV. The problem is every entry should have its own row but the new file sets everything into the same row.
I've tried moving the second with open code to within the first with open and I've tried adding a for loop to the second with open but no matter what I try I don't get the right results.
Here is the code:
import csv
col_store=[]
with open('test-data.csv', 'r') as rf:
reader = csv.reader(rf)
for row in reader:
col_store.append(row[1])
with open('meow.csv', 'wt') as f:
csv_writer = csv.writer(f)
csv_writer.writerows([col_store])
In your case if you have a column of single letters/numbers then Y.R answer will work.
To have a code that works in all cases, use this.
with open('meow.csv', 'wt') as f:
csv_writer = csv.writer(f)
csv_writer.writerows(([_] for _ in col_store))
From here it is mentioned that writerows expect an an iterable of row objects. Every row object should be an iterable of strings or numbers for Writer objects
The problem is that you are using 'writerows' treating 'col_store' as a list with one item.
The simplest approach to fixing this is calling
csv_writer.writerows(col_store)
# instead of
csv_writer.writerows([col_store])
However, this will lead to a probably unwanted result - having blank lines between the lines.
To solve this, use:
with open('meow.csv', 'wt', newline='') as f:
csv_writer = csv.writer(f)
csv_writer.writerows(col_store)
For more about this, see CSV file written with Python has blank lines between each row
Note: writerows expects 'an iterable of row objects' and 'row objects must be an interable of strings or numbers'.
(https://docs.python.org/3/library/csv.html)
Therefore, in the generic case (trying to write integers for examlpe), you should use Sam's solution.

python: How to read a file and store each line using map function?

I'm trying to reconvert a program that I wrote but getting rid of all for loops.
The original code reads a file with thousands of lines that are structured like:
Ex. 2 lines of a file:
As you can see, the first line starts with LPPD;LEMD and the second line starts with DAAE;LFML. I'm only interested in the very first and second element of each line.
The original code I wrote is:
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
import time
# START
# #time=0
start = time.time()
# Defining default program argument
if len(sys.argv)==1:
fileName = "file.txt"
else:
fileName = sys.argv[1]
takeOffAirport = []
landingAirport = []
# Reading file
lines = 0 # Counter for file lines
try:
with open(fileName) as file:
for line in file:
words = line.split(';')
# Relevant data, item1 and item2 from each file line
origin = words[0]
destination = words[1]
# Populating lists
landingAirport.append(destination)
takeOffAirport.append(origin)
lines += 1
except IOError:
print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
Counter(landingAirport).items()):
# 'AIRPOT_NAME':[num_takeOffs, num_landings]
airports_dict[key].append(value)
# Sum key values and add it as another value
for key, value in airports_dict.items():
#'AIRPOT_NAME':[num_totalMovements, num_takeOffs, num_landings]
airports_dict[key] = [sum(value),value]
# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
key=lambda kv:kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)
# Print results
print("\nAIRPORT"+ "\t\t#TOTAL_MOVEMENTS"+ "\t#TAKEOFFS"+ "\t#LANDINGS")
for k in airports_dict:
print(k,"\t\t", airports_dict[k][0],
"\t\t\t", airports_dict[k][1][1],
"\t\t", airports_dict[k][1][0])
# #time=1
end = time.time()- start
print("\nAlgorithm execution time: %0.5f" % end)
print("Total number of lines read in the file: %u\n" % lines)
airports_dict.clear
takeOffAirport.clear
landingAirport.clear
My goal is to simplify the program using map, reduce and filter. So far I have sorted teh creation of the two independent lists, one for each first element of each file line and another list with the second element of each file line by using:
# Creates two independent lists with the first and second element from each line
takeOff_Airport = list(map(lambda sub: (sub[0].split(';')[0]), lines))
landing_Airport = list(map(lambda sub: (sub[0].split(';')[1]), lines))
I was hoping to find the way to open the file and achieve the exact same result as the original code by been able to opemn the file thru a map() function, so I could pass each list to the above defined maps; takeOff_Airport and landing_Airport.
So if we have a file as such
line 1
line 2
line 3
line 4
and we do like this
open(file_name).read().split('\n')
we get this
['line 1', 'line 2', 'line 3', 'line 4', '']
Is this what you wanted?
Edit 1
I feel this is somewhat reduntant but since map applies a function to each element of an iterator we will have to have our file name in a list, and we ofcourse define our function
def open_read(file_name):
return open(file_name).read().split('\n')
print(list(map(open_read, ['test.txt'])))
This gets us
>>> [['line 1', 'line 2', 'line 3', 'line 4', '']]
So first off, calling split('\n') on each line is silly; the line is guaranteed to have at most one newline, at the end, and nothing after it, so you'd end up with a bunch of ['all of line', ''] lists. To avoid the empty string, just strip the newline. This won't leave each line wrapped in a list, but frankly, I can't imagine why you'd want a list of one-element lists containing a single string each.
So I'm just going to demonstrate using map+strip to get rid of the newlines, using operator.methodcaller to perform the strip on each line:
from operator import methodcaller
def readFile(fileName):
try:
with open(fileName) as file:
return list(map(methodcaller('strip', '\n'), file))
except IOError:
print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
Sadly, since your file is context managed (a good thing, just inconvenient here), you do have to listify the result; map is lazy, and if you didn't listify before the return, the with statement would close the file, and pulling data from the map object would die with an exception.
To get around that, you can implement it as a trivial generator function, so the generator context keeps the file open until the generator is exhausted (or explicitly closed, or garbage collected):
def readFile(fileName):
try:
with open(fileName) as file:
yield from map(methodcaller('strip', '\n'), file)
except IOError:
print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
yield from will introduce a tiny amount of overhead over directly iterating the map, but not much, and now you don't have to slurp the whole file if you don't want to; the caller can just iterate the result and get a split line on each iteration without pulling the whole file into memory. It does have the slight weakness that opening the file will be done lazily, so you won't see the exception (if there is any) until you begin iterating. This can be worked around, but it's not worth the trouble if you don't really need it.
I'd generally recommend the latter implementation as it gives the caller flexibility. If they want a list anyway, they just wrap the call in list and get the list result (with a tiny amount of overhead). If they don't, they can begin processing faster, and have much lower memory demands.
Mind you, this whole function is fairly odd; replacing IOErrors with prints and (implicitly) returning None is hostile to API consumers (they now have to check return values, and can't actually tell what went wrong). In real code, I'd probably just skip the function and insert:
with open(fileName) as file:
for line in map(methodcaller('strip', '\n'), file)):
# do stuff with line (with newline pre-stripped)
inline in the caller; maybe define split_by_newline = methodcaller('split', '\n') globally to use a friendlier name. It's not that much code, and I can't imagine that this specific behavior is needed in that many independent parts of your file, and inlining it removes the concerns about when the file is opened and closed.

Write into CSV only if row does not exist

I am saving an array into a csv file, using this code:
def save_annotations(**kwargs):
ann = request.get_json()
print(ann)
filename = ann[3].split('.')[0]
run_id = ann[4]
run_number = ann[4].split('/')[0]
exp_id = ann[4].split('/')[1]
ann_type = ann[2]
if ann_type == 'wrongDetection':
with open(f"/code/data/mlruns/{run_number}/{exp_id}/wrong_annotations_{filename}_{run_id.replace('/', '_')}.csv",'a') as w_ann:
writer = csv.writer(w_ann, delimiter=',')
writer.writerow(ann[0:2])
w_ann.close()
else:
with open(f"/code/data/mlruns/{run_number}/{exp_id}/new_detections_{filename}_{run_id.replace('/', '_')}.csv",'a') as w_ann:
writer = csv.writer(w_ann, delimiter=',')
writer.writerow(ann[0:2])
w_ann.close()
However, I don't want repeated rows in my csv file. I only want to write to csv if ann[0] and ann[1] are not in the csv already.
What would be the best approach to do this?
kind regards
One way to do this would be to collect the already existing values in a set, and check new values to see if they are in the set before processing. You would need a set for each csv file.
For example:
def build_set(filename):
with open(filename, 'r') as f:
reader = csv.reader(f)
# Skip header row, if necessary
next(reader)
return {tuple(row[0:2]) for row in reader}
Then in your function you could do:
if tuple(ann[0:2]) in set_for_this_file:
continue
set_for_this_file.add(tuple(ann[0:2]))
# write data to file
Building the sets would require reading through all the csv files each time the program is executed, which might be inefficient if the files were large and/or numerous.
A more efficient approach might be to store the data in a database table, with columns for ann[0], ann[1], anntype, exp_id, run_mumber and run_id. Add a unique constraint over these columns and you would have the same functionality.

Nested For loop over csv files

I have 2 .csv datasets from the same source. I was attempting to check if any of the items from the first dataset are still present in the second.
#!/usr/bin/python
import csv
import json
import click
#click.group()
def cli(*args, **kwargs):
"""Command line tool to compare and generate a report of item that still persists from one report to the next."""
pass
#click.command(help='Compare the keysets and return a list of keys old keys still active in new keyset.')
#click.option('--inone', '-i', default='keys.csv', help='specify the file of the old keyset')
#click.option('--intwo', '-i2', default='keys2.csv', help='Specify the file of the new keyset')
#click.option('--output', '-o', default='results.json', help='--output, -o, Sets the name of the output.')
def compare(inone, intwo, output):
csvfile = open(inone, 'r')
csvfile2 = open(intwo, 'r')
jsonfile = open(output, 'w')
reader = csv.DictReader(csvfile)
comparator = csv.DictReader(csvfile2)
for line in comparator:
for row in reader:
if row == line:
print('#', end='')
json.dump(row, jsonfile)
jsonfile.write('\n')
print('|', end='')
print('-', end='')
cli.add_command(compare)
if __name__ == '__main__':
cli()
say each csv files has 20 items in it. it will currently iterate 40 times and end when I was expecting it to iterate 400 times and create a report of items remaining.
Everything but the iteration seems to be working. anyone have thoughts on a better approach?
Iterating 40 times sounds just about right - when you iterate through your DictReader, you're essentially iterating through the wrapped file lines, and once you're done iterating it doesn't magically reset to the beginning - the iterator is done.
That means that your code will start iterating over the first item in the comparator (1), then iterate over all items in the reader (20), then get the next line from the comparator(1), then it won't have anything left to iterate over in the reader so it will go to the next comparator line and so on until it loops over the remaining comparator lines (18) - resulting in total of 40 loops.
If you really want to iterate over all of the lines (and memory is not an issue), you can store them as lists and then you get a new iterator whenever you start a for..in loop, so:
reader = list(csv.DictReader(csvfile))
comparator = list(csv.DictReader(csvfile2))
Should give you an instant fix. Alternatively, you can reset your reader 'steam' after the loop with csvfile.seek(0).
That being said, if you're going to compare lines only, and you expect that not many lines will differ, you can load the first line in csv.reader() to get the 'header' and then forgo the csv.DictReader altogether by comparing the lines directly. Then when there is a change you can pop in the line into the csv.reader() to get it properly parsed and then just map it to the headers table to get the var names.
That should be significantly faster on large data sets, plus seeking through the file can give you the benefit of never having the need to store in memory more data than the current I/O buffer.

Resources