How do I delete rows in one CSV based on another CSV - python-3.x

I am working with two CSV files; both contain only one column of data but are over 50,000 rows each. I need to compare the data from CSV1 against CSV2 and remove any data that appears in both files. I would like to print out the final list of data as a third CSV file if possible.
The CSV files contain usernames. I have tried running deduplication scripts, but I realize that this does not remove entries found in both CSV files, since it only removes repeated occurrences of a username within one file. This is what I have been working with so far, but I can already tell that it isn't going to give me the results I am looking for.
import csv

AD_AccountsCSV = open("AD_Accounts.csv", "r")
BA_AccountsCSV = open("BA_Accounts.csv", "r+")

def Remove(x, y):
    final_list = []
    for item in x:
        if item not in y:
            final_list.append(item)
    for i in y:
        if i not in x:
            final_list.append(i)
    print(final_list)
The way I wrote this code, the results are printed in the terminal after running the script, but I realize the output may be around 1,000 entries, which is why I would rather write it to a third CSV file.

# define the paths
fpath1 = "/path/to/file1.csv"
fpath2 = "/path/to/file2.csv"
fpath3 = "/path/to/your/file3.csv"

with open(fpath1) as f1, open(fpath2) as f2, open(fpath3, "w") as f3:
    l1 = f1.read().splitlines()
    l2 = f2.read().splitlines()
    not_in_both = [x for x in set(l1 + l2) if not (x in l1 and x in l2)]
    for x in not_in_both:
        print(x, file=f3)
The with open() as ... clause takes care of closing the files.
You can combine several file openings in one with statement.
Assuming that each line holds exactly one element, I used read().splitlines(), which strips the newline character from the end of each line (plain readlines() would keep it). Otherwise this step becomes more complicated.
List comprehensions make it easy to filter a list by a condition.
The default end='\n' in print() adds a newline at the end of each printed entry, so the output file gets one username per line.
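With more than 50,000 rows per file, the membership tests against plain lists (x in l1) make the filter quadratic. Here is a minimal sketch using sets instead; the file paths are placeholders and it assumes one username per line with no header row:

# minimal sketch, assuming one username per line and no header row
fpath1 = "/path/to/file1.csv"   # placeholder paths
fpath2 = "/path/to/file2.csv"
fpath3 = "/path/to/your/file3.csv"

with open(fpath1) as f1, open(fpath2) as f2, open(fpath3, "w") as f3:
    s1 = set(f1.read().splitlines())
    s2 = set(f2.read().splitlines())
    # symmetric difference: usernames that appear in exactly one of the two files
    for name in sorted(s1.symmetric_difference(s2)):
        print(name, file=f3)

Set lookups are constant time, so this stays fast even as the files grow.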
Or, staying closer to the way you did it:
For formatting code, please follow the official style guide, e.g.
https://www.python.org/dev/peps/pep-0008/
def select_exclusive_accounts(path_to_f1, path_to_f2, path_to_f3):
    # you had quite huge indentations - use 4 spaces!
    with open(path_to_f1) as f1, open(path_to_f2) as f2, \
            open(path_to_f3, "w") as f3:
        in_f1 = f1.read().splitlines()
        in_f2 = f2.read().splitlines()
        for item in in_f1:
            if item not in in_f2:
                f3.write(item + "\n")
        for i in in_f2:
            if i not in in_f1:
                f3.write(i + "\n")

select_exclusive_accounts("AD_Accounts.csv",
                          "BA_Accounts.csv",
                          "exclusive_accounts.csv")
Also, no imports are needed here, because only built-in Python features are used.

Related

Replacing "DoIt.py" script with flexible functions that match DFs on partial string matching of column names [Python3] [Pandas] [Merge]

I spent too much time trying to write a generic solution to a problem (below this). I ran into a couple issues, so I ended up writing a Do-It script, which is here:
# No imports necessary
# set file paths
annofh = "/Path/To/Annotation/File.tsv"
datafh = "/Path/To/Data/File.tsv"
mergedfh = "/Path/To/MergedOutput/File.tsv"

# Read all the annotation data into a dict:
annoD = {}
with open(annofh, 'r') as annoObj:
    h1 = annoObj.readline()
    for l in annoObj:
        l = l.strip().split('\t')
        k = l[0] + ':' + l[1] + ' ' + l[3] + ' ' + l[4]
        annoD[k] = l

keyset = set(annoD.keys())

with open(mergedfh, 'w') as oF:
    with open(datafh, 'r') as dataObj:
        h2 = dataObj.readline().strip()
        oF.write(h2 + '\t' + h1)  # write the header line to the output file
        # Read through the data to be annotated line-by-line:
        for l in dataObj:
            l = l.strip().split('\t')
            if "-" in l[13]:
                pos = l[13].split('-')
                l[13] = pos[0]
            key = l[12][3:] + ":" + l[13] + " " + l[15] + " " + l[16]
            if key in annoD.keys():
                l = l + annoD[key]
                oF.write('\t'.join(l) + '\n')
            else:
                oF.write('\t'.join(l) + '\n')
The function of DoIt.py (which functions correctly, above ^ ) is simple:
first read a file containing annotation information into a dictionary.
read through the data to be annotated line-by-line, and add annotation info. to the data by matching a string constructed by pasting together 4 columns.
As you can see, this script contains hard-coded index positions, which I obtained by writing a quick awk one-liner, finding the corresponding columns in both files, and then putting those indices into the Python script.
Here's the thing: I do this kind of task all the time. I want to write a robust solution that will enable me to automate this task, even if column names vary. My first goal is to use partial string matching, but eventually it would be nice to be even more robust.
I got part of the way to doing this, but at present the below solution is actually no better than the DoIt.py script...
# Across many projects, the correct column names vary.
# For example, the name might be "#CHROM" or "Chromosome" or "CHR" for the first DF, but "Chrom" for the second DF.
# In any case, if I conduct str.lower() then search for a substring, it should match any of the above options.
MasterColNamesList = ["chr", "pos", "ref", "alt"]

def selectFields(h, columnNames):
    # currently this will only fix lower-case/upper-case problems; need to fix it to catch
    # any kind of mapping issue, like a partial string match (e.g., chr should match #CHROM)
    indices = []
    h = list(map(str.lower, h))
    for fld in columnNames:
        if fld in h:
            indices.append(h.index(fld))
    # Now, this will work, but only if the field names are an exact match.
    return indices

def MergeDFsByCols(DF1, DF2, colnames):  # <-- Single set of colnames; no need to use indices
    pass
    # eventually, need to write the merge statement; I could paste the cols together into a string
    # and make that the index for both DFs, then match on the index, for example.

def mergeData(annoData, studyData, MasterColNamesList):
    import pandas as pd
    aDF = pd.read_csv(annoData, header=0, sep='\t')
    sDF = pd.read_csv(studyData, header=0, sep='\t')
    annoFieldIdx = selectFields(list(aDF.columns.values), columnNames1)  # currently columnNames1; should be MasterColNamesList
    dataFieldIdx = selectFields(list(sDF.columns.values), columnNames2)
    MergeDFsByCols(aDF, sDF, MasterColNamesList)
Now, although the above works, it is actually no more automated than the DoIt.py script, because columnNames1 and columnNames2 are specific to each file and still need to be found manually ...
What I want to be able to do is enter a list of generic strings that, if processed, will result in the correct columns being pulled from both files, then merge the pandas DFs on those columns.
Greatly appreciate your help.
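For illustration, here is a minimal sketch of the partial-string matching idea described above; the helper names match_columns and merge_on_generic_names are made up for this sketch, and it assumes tab-separated files where each generic name matches exactly one column per file:

import pandas as pd

def match_columns(df, generic_names):
    # map each matched real column name (e.g. "#CHROM") to its generic name (e.g. "chr")
    mapping = {}
    for generic in generic_names:
        for col in df.columns:
            if generic.lower() in col.lower():
                mapping[col] = generic
                break
    return mapping

def merge_on_generic_names(anno_path, data_path, generic_names):
    aDF = pd.read_csv(anno_path, sep='\t')
    sDF = pd.read_csv(data_path, sep='\t')
    # rename the matched columns to the shared generic names, then merge on them
    aDF = aDF.rename(columns=match_columns(aDF, generic_names))
    sDF = sDF.rename(columns=match_columns(sDF, generic_names))
    return sDF.merge(aDF, on=generic_names, how='left')

# merged = merge_on_generic_names(annofh, datafh, ["chr", "pos", "ref", "alt"])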

Writing each sublist in a list of lists to a separate CSV

I have a list of lists containing a varying number of strings in each sublist:
tq_list = [['The mysterious diary records the voice.', 'Italy is my favorite country', 'I am happy to take your donation', 'Any amount will be greatly appreciated.'], ['I am counting my calories, yet I really want dessert.', 'Cats are good pets, for they are clean and are not noisy.'], ['We have a lot of rain in June.']]
I would like to create a new CSV file for each sublist. All I have so far is a way to output each sublist as a row in the same CSV file using the following code:
name_list = ["sublist1","sublist2","sublist3"]
with open("{}.csv".format(*name_list), "w", newline="") as f:
writer = csv.writer(f)
for row in tq_list:
writer.writerow(row)
This creates a single CSV file named 'sublist1.csv'.
I've toyed around with the following code:
name_list = ["sublist1","sublist2","sublist3"]
for row in tq_list:
with open("{}.csv".format(*name_list), "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(row)
Which also only outputs a single CSV file named 'sublist1.csv', but with only the values from the last sublist. I feel like this is a step in the right direction, but obviously not quite there yet.
What the * in "{}.csv".format(*name_list) in your code actually does is this: It unpacks the elements in name_list to be passed into the function (in this case format). That means that format(*name_list) is equivalent to format("sublist1", "sublist2", "sublist3"). Since there is only one {} in your string, all arguments to format except "sublist1" are essentially discarded.
You might want to do something like this:
for index, row in enumerate(tq_list):
    with open("{}.csv".format(name_list[index]), "w", newline="") as f:
        ...
enumerate returns a counting index along with each element that it iterates over so that you can keep track of how many elements there have already been. That way you can write into a different file each time. You could also use zip, another handy function that you can look up in the Python documentation.
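For completeness, here is a minimal sketch of the zip variant, assuming name_list contains at least one name per sublist in tq_list:

import csv

# minimal sketch using zip, assuming one file name per sublist
for name, row in zip(name_list, tq_list):
    with open("{}.csv".format(name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(row)  # each sublist becomes the single row of its own CSV file

zip stops at the shorter of the two lists, so any extra names are simply ignored.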

Nested For loop over csv files

I have 2 .csv datasets from the same source. I was attempting to check if any of the items from the first dataset are still present in the second.
#!/usr/bin/python
import csv
import json
import click

@click.group()
def cli(*args, **kwargs):
    """Command line tool to compare and generate a report of items that still persist from one report to the next."""
    pass

@click.command(help='Compare the keysets and return a list of old keys still active in the new keyset.')
@click.option('--inone', '-i', default='keys.csv', help='Specify the file of the old keyset')
@click.option('--intwo', '-i2', default='keys2.csv', help='Specify the file of the new keyset')
@click.option('--output', '-o', default='results.json', help='--output, -o, Sets the name of the output.')
def compare(inone, intwo, output):
    csvfile = open(inone, 'r')
    csvfile2 = open(intwo, 'r')
    jsonfile = open(output, 'w')
    reader = csv.DictReader(csvfile)
    comparator = csv.DictReader(csvfile2)
    for line in comparator:
        for row in reader:
            if row == line:
                print('#', end='')
                json.dump(row, jsonfile)
                jsonfile.write('\n')
            print('|', end='')
        print('-', end='')

cli.add_command(compare)

if __name__ == '__main__':
    cli()
Say each CSV file has 20 items in it. It will currently iterate 40 times and end, when I was expecting it to iterate 400 times and create a report of the items remaining.
Everything but the iteration seems to be working. Does anyone have thoughts on a better approach?
Iterating 40 times sounds just about right - when you iterate through your DictReader, you're essentially iterating through the wrapped file lines, and once you're done iterating it doesn't magically reset to the beginning - the iterator is done.
That means that your code will start by taking the first item from the comparator (1), then iterate over all items in the reader (20), then get the next line from the comparator (1); at that point there is nothing left to iterate over in the reader, so it just consumes the remaining comparator lines (18) - resulting in a total of 40 loops.
If you really want to iterate over all of the lines (and memory is not an issue), you can store them as lists and then you get a new iterator whenever you start a for..in loop, so:
reader = list(csv.DictReader(csvfile))
comparator = list(csv.DictReader(csvfile2))
Should give you an instant fix. Alternatively, you can rewind your reader's underlying file stream after the loop with csvfile.seek(0).
That being said, if you're going to compare lines only, and you expect that not many lines will differ, you can load the first line with csv.reader() to get the header and then forgo the csv.DictReader altogether by comparing the raw lines directly. Then, when a line does match, you can pass it through csv.reader() to get it properly parsed and map it against the header to recover the field names.
That should be significantly faster on large data sets, plus seeking through the file can give you the benefit of never having the need to store in memory more data than the current I/O buffer.
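As a minimal sketch of the list() fix (the file names below are just the option defaults used as placeholders):

import csv
import json

# minimal sketch of the list() fix, using the default file names as placeholders
with open('keys.csv') as csvfile, open('keys2.csv') as csvfile2, \
        open('results.json', 'w') as jsonfile:
    reader = list(csv.DictReader(csvfile))      # materialized, so it can be re-iterated
    comparator = list(csv.DictReader(csvfile2))
    for line in comparator:
        for row in reader:                      # 20 x 20 = 400 comparisons
            if row == line:
                json.dump(row, jsonfile)
                jsonfile.write('\n')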

How to iterate through very large text files and merge lines only when a certain condition exists

The underlying purpose is for me to become a python expert (long run) and the immediate goal is as follows...
I want to merge two massive lists into one. The lists are very large text files comprised of millions of lines which would look like the following-
bigfile1                        bigfile2
(10,'red','blue','orange')      (10,'31','false','true')
(11,'black','blue','green')     (11,'88','true','true')
(12,'blue','blue','green')      random junk once in a while
(13,'red','blue','yellow')      (12,'3','false','false')
(14,'brown','red','red')        (15,'6','true','true')
Using Python, I would like to:
merge the lines from each list and write them to a new list if the "usernumbers" before the first comma are the same.
have the program complete before our sun runs out of hydrogen, which ruled out iterating every line against every line.
I then learned about for a, b in zip(...), but then I had a new problem. I want to merge the lines only when the first number in each line is the same, and within the lists some numbers are skipped in one list, there are duplicates, occasional garbled trash, and so on. So I can't just go line by line, and I can't figure out whether there is a way to advance only a or only b when using zip. I realize these files should really be in a database that you query, but this is an exercise for me to learn more Python.
I'm using Python 3.4 on Windows. If anyone has suggestions for completing the following, or for starting from scratch, I would greatly appreciate it!
I want the lines with the same usernumbers to be merged together in a new file. My current code follows:
list1 = open('bigfile1.txt', 'r', errors='ignore')
list2 = open('bigfile2.txt', 'r', errors='ignore')

for a, b in zip(list1, list2):
    c = (''.join(a.split("(")[1:])).rstrip()
    d = ''.join(c.split(",")[:1])
    e = (''.join(b.split("(")[1:])).rstrip()
    f = ''.join(e.split(",")[:1])
    if d == f:
        # FILE.write()
        print(a, b)
    elif d != f:
        pass  # ##### I'm STUCK!! #####

# FILE.close()
list1.close()
list2.close()
Note: adding a wrapper around an iterator will reduce performance. Sorry.
You can skip lines of the file by wrapping the file iterator; this just means writing your own generator with a conditional:
import itertools

def ignore_junk(file_iter):
    # yield only lines that look like data rows, e.g. "(10,'red',..."
    for line in file_iter:
        if line[0] == "(" and line[1:3].isdigit():
            yield line

pair_rows_iter = zip(ignore_junk(list1), ignore_junk(list2))
all_lines_iter = itertools.chain.from_iterable(pair_rows_iter)
new_file.writelines(all_lines_iter)
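The generator above only filters out junk lines; it does not by itself line up matching usernumbers when one file skips or repeats numbers. One way to handle that matching step, sketched here under the assumption that the whole of bigfile1.txt fits in memory as a dict and with merged.txt as a placeholder output name:

def usernumber(line):
    # "(10,'red','blue','orange')" -> "10"
    return line.split("(", 1)[1].split(",", 1)[0]

# index the first file by usernumber, then stream the second file against it
with open('bigfile1.txt', errors='ignore') as f1:
    by_key = {usernumber(line): line.rstrip() for line in ignore_junk(f1)}

with open('bigfile2.txt', errors='ignore') as f2, open('merged.txt', 'w') as out:
    for line in ignore_junk(f2):
        key = usernumber(line)
        if key in by_key:
            out.write(by_key[key] + " " + line.rstrip() + "\n")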

reading data from a file and storing them in a list of lists Python

I have a file data.txt containing the following lines:
I would like to extract the lines of this file into a list of lists: each line becomes a list, and all of them are contained within ListOfLines, which is a list of lists.
When there is no data in some cell I just want it to be -1.
I have tried this so far :
from random import randint

ListOfLines = []
with open("C:\\data.txt", 'r') as file:
    data = file.readlines()
    for line in data:
        y = line.split()
        ListOfLines.append(y)

with open("C:\\output.txt", 'a') as output:
    for x in range(0, 120):
        # 'item' represents one line
        for item in ListOfLines:
            item[2] = randint(1, 1000)
            for elem in item:
                output.write(str(elem))
                output.write(' ')
            output.write('\n')
        output.write('------------------------------------- \n')
How can I improve my program to contain less code and be faster ?
Thank you in advance :)
Well, sharing your sample data as an image doesn't make it easy to work with. Presented like this, I don't even bother, and I assume others do the same.
However, data = file.readlines() forces the content of the file into a list first, and then you iterate through that list. You could iterate over the file directly with for line in file:. That improves it a little.
You haven't mentioned what you want from the output part, which seems quite messy.
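For the reading part, here is a minimal sketch that iterates over the file directly and fills missing cells with -1; the expected column count of 4 is a placeholder, since the sample data was only shared as an image:

# minimal sketch, assuming whitespace-separated values and 4 expected columns
EXPECTED_COLUMNS = 4   # placeholder, adjust to the real data

ListOfLines = []
with open("C:\\data.txt") as f:
    for line in f:                    # iterate the file directly, no readlines()
        cells = line.split()
        # pad short lines with -1 so every row has the same length
        cells += [-1] * (EXPECTED_COLUMNS - len(cells))
        ListOfLines.append(cells)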
