I am having a little trouble finding an efficient way to compare two files in order to create a third file.
I'm using Python 3.6
The first file is a list of IP addresses that I want to delete. The second file contains all of the DNS records, some of which are associated with the IP addresses targeted for deletion.
If I find a matching DNS record in the second file, I want to add the entire line to a third file.
This is a sample of file 1:
IP
10.10.10.234
10.34.76.4
This is a sample of file 2:
DNS Record Type,DNS Record,DNS Response,View
PTR,10.10.10.234,testing.example.com,internal
A,testing.example.com,10.10.10.234,internal
A,dns.google.com,8.8.8.8,external
This is what I'm trying to do. It produces correct results; however, it is taking forever. There are ~2 million lines in file 2 and 150K lines in file 1.
def create_final_stale_ip_file():
    PD = set()
    with open(stale_file) as f1:
        reader1 = csv.DictReader(f1)
        for row1 in reader1:
            with open(prod_dns) as f2:
                reader2 = csv.DictReader(f2)
                for row2 in reader2:
                    if row2['DNS Record Type'] == 'A':
                        if row1['IP'] == row2['DNS Response']:
                            PD.update([row2['View']+'del,'+row2['DNS Record Type']+','+row2['DNS Record']+','+row2['DNS Response']])
                    if row2['DNS Record Type'] == 'PTR':
                        if row1['IP'] == row2['DNS Record']:
                            PD.update([row2['View']+'del,'+row2['DNS Record Type']+','+row2['DNS Response']+','+row2['DNS Record']])
    o1 = open(delete_file, 'a')
    for i in PD:
        o1.write(i + '\n')
    o1.close()
Thanks in advance!
You should read the whole IP file into a set first, and then check whether the IPs in the second file are found in that set, since checking if an element exists in a set is very fast:
def create_final_stale_ip_file():
    PD = set()
    # It's much prettier and easier to manage the strings in one place
    # and without using the + operator. Read about `str.format()`
    # to understand how these work. They will be used later in the code.
    A_string = '{View}del,{DNS Record Type},{DNS Record},{DNS Response}'
    PTR_string = '{View}del,{DNS Record Type},{DNS Response},{DNS Record}'
    # We can open and create readers for both files at once
    with open(stale_file) as f1, open(prod_dns) as f2:
        reader1, reader2 = csv.DictReader(f1), csv.DictReader(f2)
        # Read all IPs into a Python set -- membership checks are fast!
        ips = {row['IP'] for row in reader1}
        # Now go through every line and simply check if the IP
        # exists in the `ips` set we created above
        for row in reader2:
            if (row['DNS Record Type'] == 'A'
                    and row['DNS Response'] in ips):
                PD.add(A_string.format(**row))
            elif (row['DNS Record Type'] == 'PTR'
                    and row['DNS Record'] in ips):
                PD.add(PTR_string.format(**row))
    # Finally, write all the lines to the file using `writelines()`,
    # adding the newlines ourselves since `writelines()` does not.
    # Also, it's always better to use `with open()`.
    with open(delete_file, 'a') as f:
        f.writelines(line + '\n' for line in PD)
As you see, I also changed some minor stuff, like:
write to a file using writelines()
open the last file using with open() for safety
we're only adding one element to our set, so use PD.add() instead of PD.update()
use Python's awesome str.format() to create much cleaner string formatting
Last but not least, I would actually split this into multiple functions: one for reading the files, one for going through the parsed rows, and so on, with each function taking proper arguments instead of relying on global variable names like stale_file and prod_dns as you seem to be using. But that's up to you.
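For example, a rough sketch of that split might look like this (the function names are just illustrative, not a prescribed layout):
import csv

def read_stale_ips(path):
    # Load the IPs to delete into a set for fast membership tests
    with open(path) as f:
        return {row['IP'] for row in csv.DictReader(f)}

def build_delete_lines(path, ips):
    # Yield one formatted 'del' line for every matching A or PTR record
    a_fmt = '{View}del,{DNS Record Type},{DNS Record},{DNS Response}'
    ptr_fmt = '{View}del,{DNS Record Type},{DNS Response},{DNS Record}'
    with open(path) as f:
        for row in csv.DictReader(f):
            if row['DNS Record Type'] == 'A' and row['DNS Response'] in ips:
                yield a_fmt.format(**row)
            elif row['DNS Record Type'] == 'PTR' and row['DNS Record'] in ips:
                yield ptr_fmt.format(**row)

def write_delete_file(path, lines):
    with open(path, 'a') as f:
        f.writelines(line + '\n' for line in lines)

def create_final_stale_ip_file(stale_file, prod_dns, delete_file):
    ips = read_stale_ips(stale_file)
    write_delete_file(delete_file, set(build_delete_lines(prod_dns, ips)))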
You can do it using grep very easily:
grep -Ff file1 file2
This will give you the lines of file2 that contain any of the fixed strings listed in file1. From there it should be much easier to manipulate the text into the final form you need.
I spent too much time trying to write a generic solution to this problem (described below). I ran into a couple of issues, so I ended up writing a Do-It script instead, which is here:
# No imports necessary

# set file paths
annofh = "/Path/To/Annotation/File.tsv"
datafh = "/Path/To/Data/File.tsv"
mergedfh = "/Path/To/MergedOutput/File.tsv"

# Read all the annotation data into a dict:
annoD = {}
with open(annofh, 'r') as annoObj:
    h1 = annoObj.readline()
    for l in annoObj:
        l = l.strip().split('\t')
        k = l[0] + ':' + l[1] + ' ' + l[3] + ' ' + l[4]
        annoD[k] = l

keyset = set(annoD.keys())

with open(mergedfh, 'w') as oF:
    with open(datafh, 'r') as dataObj:
        h2 = dataObj.readline().strip()
        oF.write(h2 + '\t' + h1)  # write the header line to the output file
        # Read through the data to be annotated line-by-line:
        for l in dataObj:
            l = l.strip().split('\t')
            if "-" in l[13]:
                pos = l[13].split('-')
                l[13] = pos[0]
            key = l[12][3:] + ":" + l[13] + " " + l[15] + " " + l[16]
            if key in annoD:
                l = l + annoD[key]
                oF.write('\t'.join(l) + '\n')
            else:
                oF.write('\t'.join(l) + '\n')
The function of DoIt.py (which works correctly, above) is simple:
First, read a file containing annotation information into a dictionary.
Then read through the data to be annotated line by line, and add the annotation info to the data by matching on a string constructed by pasting together four columns.
As you can see, this script contains hard-coded index positions that I obtained by writing a quick awk one-liner to find the corresponding columns in both files, then putting those indices into the Python script.
Here's the thing: I do this kind of task all the time. I want to write a robust solution that will let me automate this task, even if the column names vary. My first goal is to use partial string matching, but eventually it would be nice to be even more robust.
I got part of the way to doing this, but at present the below solution is actually no better than the DoIt.py script...
# Across many projects, the correct column names vary.
# For example, the name might be "#CHROM" or "Chromosome" or "CHR" for the first DF, but "Chrom" for the second DF.
# In any case, if I conduct str.lower() then search for a substring, it should match any of the above options.
MasterColNamesList = ["chr", "pos", "ref", "alt"]

def selectFields(h, columnNames):
    ##### currently this will only fix lower/upper case problems. Need to fix it to catch any kind of
    ##### mapping issue, like a partial string match (e.g., "chr" should match "#CHROM")
    indices = []
    h = [name.lower() for name in h]  # build a list; map() returns an iterator in Python 3
    for fld in columnNames:
        if fld in h:
            indices.append(h.index(fld))
    #### Now, this will work, but only if the field names are an exact match.
    return indices

def MergeDFsByCols(DF1, DF2, colnames):  # <-- Single set of colnames; no need to use indices
    pass
    # eventually, need to write the merge statement; I could paste the cols together to a string
    # and make that the indices for both DFs, then match on the indices, for example.

def mergeData(annoData, studyData, MasterColNamesList):
    ####
    import pandas as pd
    aDF = pd.read_csv(annoData, header=0, sep='\t')
    sDF = pd.read_csv(studyData, header=0, sep='\t')
    ####
    annoFieldIdx = selectFields(list(aDF.columns.values), columnNames1)  # currently columnNames1; should be MasterColNamesList
    dataFieldIdx = selectFields(list(sDF.columns.values), columnNames2)
    ####
    MergeDFsByCols(aDF, sDF, MasterColNamesList)
Now, although the above works, it is actually no more automated than the DoIt.py script, because columnNames1 and columnNames2 are specific to each file and still need to be found manually...
What I want to be able to do is enter a list of generic strings that, if processed, will result in the correct columns being pulled from both files, then merge the pandas DFs on those columns.
Greatly appreciate your help.
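For reference, one minimal sketch of that idea with pandas might look like the following. The function names are illustrative, the substring matching is deliberately naive, and it ignores details that DoIt.py handles (stripping the "chr" prefix, splitting "start-end" positions):
import pandas as pd

MasterColNamesList = ["chr", "pos", "ref", "alt"]

def mapColumns(df, master_names):
    # Map each actual column header to the first generic name that it
    # contains as a (lowercased) substring, e.g. "#CHROM" -> "chr"
    mapping = {}
    for generic in master_names:
        for col in df.columns:
            if generic in col.lower() and col not in mapping:
                mapping[col] = generic
                break
    return mapping

def mergeOnMasterNames(annoData, studyData, master_names=MasterColNamesList):
    aDF = pd.read_csv(annoData, header=0, sep='\t')
    sDF = pd.read_csv(studyData, header=0, sep='\t')
    # Rename the matched columns to the shared generic names...
    aDF = aDF.rename(columns=mapColumns(aDF, master_names))
    sDF = sDF.rename(columns=mapColumns(sDF, master_names))
    # ...then merge on them; 'left' keeps every study row, annotated where possible
    return sDF.merge(aDF, how='left', on=master_names)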
I am working with two CSV files; both contain only one column of data, but are over 50,000 rows each. I need to compare the data from CSV1 against CSV2 and remove any data that appears in both of these files. I would like to write out the final list of data as a third CSV file if possible.
The CSV files contain usernames. I have tried running deduplication scripts, but realized that this does not remove entries found in both CSV files entirely, since it only removes duplicate copies of a username rather than removing the username from both files. This is what I have been working with, but I can already tell that it isn't going to give me the results I am looking for.
import csv

AD_AccountsCSV = open("AD_Accounts.csv", "r")
BA_AccountsCSV = open("BA_Accounts.csv", "r+")

def Remove(x, y):
    final_list = []
    for item in x:
        if item not in y:
            final_list.append(item)
    for i in y:
        if i not in x:
            final_list.append(i)
    print(final_list)
As written, this code prints the results to the terminal after running the script, but I realize that my output may be around 1,000 entries.
# define the paths
fpath1 = "/path/to/file1.csv"
fpath2 = "/path/to/file2.csv"
fpath3 = "/path/to/your/file3.csv"

with open(fpath1) as f1, open(fpath2) as f2, open(fpath3, "w") as f3:
    l1 = [line.strip() for line in f1]  # strip the trailing newlines
    l2 = [line.strip() for line in f2]
    not_in_both = [x for x in set(l1 + l2) if (x in l1) != (x in l2)]
    for x in not_in_both:
        print(x, file=f3)
The with open() as ... clause takes care of closing the files.
You can combine several file openings under one with.
Assuming that the elements in the files are the only thing on each line, I read each file with a list comprehension that strips the trailing newline character from every line. Otherwise this step becomes more complicated.
List comprehensions make it easy to filter a list by a condition.
The default end='\n' in print() adds a newline at the end of each printed line.
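Equivalently, the filtering can be expressed with Python's set symmetric-difference operator; a tiny sketch using the l1 and l2 lists from above:
# usernames that appear in exactly one of the two files
not_in_both = sorted(set(l1) ^ set(l2))
for x in not_in_both:
    print(x, file=f3)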
Following the approach you used:
For formatting code, please follow official style guides, e.g.
https://www.python.org/dev/peps/pep-0008/
def select_exclusive_accounts(path_to_f1, path_to_f2, path_to_f3):
    # you had quite huge indentations - use 4 spaces!
    with open(path_to_f1) as f1, open(path_to_f2) as f2, \
            open(path_to_f3, "w") as f3:
        in_f1 = f1.readlines()
        in_f2 = f2.readlines()
        for item in in_f1:
            if item not in in_f2:
                f3.write(item)
        for item in in_f2:
            if item not in in_f1:
                f3.write(item)

select_exclusive_accounts("AD_Accounts.csv",
                          "BA_Accounts.csv",
                          "exclusive_accounts.csv")
Also, no imports are needed here because these are built-in Python functions.
I have a list of texts (reviews_train) which I gathered from a text file (train.txt).
reviews_train = []
for line in open('C:\\Users\\Dell\\Desktop\\New Beginnings\\movie_data\\train.txt', 'r', encoding="utf8"):
    reviews_train.append(line.strip())
Suppose reviews_train = ["Nice movie","Bad film",....]
I have another result.csv file which looks like
company year
a 2000
b 2001
.
.
.
What I want to do is add another column text to the existing file to look something like this.
company year text
a 2000 Nice movie
b 2001 Bad film
.
.
.
The items of the list should get appended in the new column one after the other.
I am really new to Python. Can someone please tell me how to do it? Any help is really appreciated.
EDIT: My question is not just about adding another column to the .csv file. The column should have the texts from the list appended row by row.
EDIT: I used the solution given by @J_H but I get this error
Use zip():
import csv

def get_rows(infile='result.csv'):
    with open(infile) as fin:
        sheet = csv.reader(fin)
        for row in sheet:
            yield list(row)

def get_lines(infile=r'C:\Users\Dell\Desktop\New Beginnings\movie_data\train.txt'):
    return open(infile).readlines()

for row, line in zip(get_rows(), get_lines()):
    row.append(line)
    print(row)
With those three-element rows in hand, you could, for example, write them out with writerow().
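For example, a minimal sketch of that last step (the output filename is just a placeholder):
import csv

with open('merged.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row, line in zip(get_rows(), get_lines()):
        row.append(line.strip())  # drop the trailing newline from the text
        writer.writerow(row)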
EDIT
The open() in your question mentions 'r' and encoding='utf8',
which I suppressed since open() should default to using those.
Apparently you're not using the python3 mentioned in your tag,
or perhaps an ancient version.
PEP 529 and PEP 540 moved Python toward UTF-8 defaults starting with 3.6, although on Windows open() may still fall back to the ANSI code page unless you ask for UTF-8 explicitly.
If your host manages to default to something crazy like CP1252,
then you will certainly want to override that:
return open(infile, encoding='utf8').readlines()
I have 2 .csv datasets from the same source. I was attempting to check if any of the items from the first dataset are still present in the second.
#!/usr/bin/python
import csv
import json
import click

@click.group()
def cli(*args, **kwargs):
    """Command line tool to compare and generate a report of items that still persist from one report to the next."""
    pass

@click.command(help='Compare the keysets and return a list of old keys still active in the new keyset.')
@click.option('--inone', '-i', default='keys.csv', help='Specify the file of the old keyset')
@click.option('--intwo', '-i2', default='keys2.csv', help='Specify the file of the new keyset')
@click.option('--output', '-o', default='results.json', help='--output, -o, Sets the name of the output.')
def compare(inone, intwo, output):
    csvfile = open(inone, 'r')
    csvfile2 = open(intwo, 'r')
    jsonfile = open(output, 'w')

    reader = csv.DictReader(csvfile)
    comparator = csv.DictReader(csvfile2)

    for line in comparator:
        for row in reader:
            if row == line:
                print('#', end='')
                json.dump(row, jsonfile)
                jsonfile.write('\n')
            print('|', end='')
        print('-', end='')

cli.add_command(compare)

if __name__ == '__main__':
    cli()
Say each CSV file has 20 items in it. The code currently iterates 40 times and ends, when I was expecting it to iterate 400 times and create a report of the items remaining.
Everything but the iteration seems to be working. Does anyone have thoughts on a better approach?
Iterating 40 times sounds just about right - when you iterate through your DictReader, you're essentially iterating through the wrapped file lines, and once you're done iterating it doesn't magically reset to the beginning - the iterator is done.
That means your code iterates over the first item in the comparator (1), then over all items in the reader (20), then gets the next line from the comparator (1); after that there is nothing left to iterate over in the reader, so it just steps through the remaining comparator lines (18) - resulting in a total of 40 loops.
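A minimal, self-contained demonstration of that exhaustion:
import csv
import io

buffer = io.StringIO("key,value\n1,a\n2,b\n")
reader = csv.DictReader(buffer)
print(list(reader))  # two rows on the first pass
print(list(reader))  # [] -- the reader is exhausted and does not rewind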
If you really want to iterate over all of the lines (and memory is not an issue), you can store them as lists and then you get a new iterator whenever you start a for..in loop, so:
reader = list(csv.DictReader(csvfile))
comparator = list(csv.DictReader(csvfile2))
Should give you an instant fix. Alternatively, you can reset your reader's underlying stream after the inner loop with csvfile.seek(0).
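For example, a sketch using your variable names (the DictReader is rebuilt after seeking so the header line is consumed again rather than treated as data):
for line in comparator:
    csvfile.seek(0)                   # rewind the old-keys file...
    reader = csv.DictReader(csvfile)  # ...and rebuild the reader to re-consume the header
    for row in reader:
        if row == line:
            json.dump(row, jsonfile)
            jsonfile.write('\n')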
That being said, if you're going to compare lines only, and you expect that not many lines will differ, you can load the first line into csv.reader() to get the header and then forgo csv.DictReader altogether by comparing the raw lines directly. Then, when a line is of interest, you can feed it into csv.reader() to get it properly parsed and map the values to the header to get the field names.
That should be significantly faster on large data sets, plus streaming through the file this way means you never need to hold more data in memory than the current I/O buffer.
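A sketch of that raw-line approach, with illustrative names and assuming both files share the same header, could look like this:
import csv

def surviving_rows(old_path, new_path):
    # Yield the rows of new_path whose raw lines also appear in old_path,
    # parsing only those lines into dicts keyed by the header fields
    with open(old_path) as old_file, open(new_path) as new_file:
        header = next(csv.reader(old_file))  # parse just the header row
        next(new_file)                       # skip the new file's header line
        old_lines = set(old_file)            # raw lines, newlines included
        for line in new_file:
            if line in old_lines:            # this key persisted between reports
                values = next(csv.reader([line]))
                yield dict(zip(header, values))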
I'm having some troubles processing some input.
I am reading data from a log file and store the different values according to the name.
So my input string consists of ip, name, time and a data value.
A log line looks like this and it has \t spacing:
134.51.239.54 Steven 2015-01-01 06:09:01 5423
I'm reading in the values using this code:
loglines = file.splitlines()
data_fields = loglines[0]  # IP NAME DATE DATA
for logline in loglines[1:]:  # use a new name so the loglines list is not overwritten
    items = logline.split("\t")
    ip = items[0]
    name = items[1]
    date = items[2]
    data = items[3]
This works quite well, but I also need to extract all of the names into a list, and I haven't found a working solution.
When I use print(name) I get:
Steven
Max
Paul
I do need a list of the names like this:
['Steven', 'Max', 'Paul',...]
There is probably a simple solution and I haven't figured it out yet, but can anybody help?
Thanks
Just create an empty list and add the names as you loop through the file.
Also note that if that file is very large, file.splitlines() is probably not the best idea, as it reads the entire file into memory -- and then you basically copy all of that by doing loglines[1:]. Better use the file object itself as an iterator. And don't use file as a variable name, as it shadows the type.
with open("some_file.log") as the_file:
data_fields = next(the_file) # consumes first line
all_the_names = [] # this will hold the names
for line in the_file: # loops over the rest
items = line.split("\t")
ip, name, date, data = items # you can put all this in one line
all_the_names.append(name) # add the name to the list of names
Alternatively, you could use zip and map to put it all into one expression (using that loglines data), but you rather shouldn't do that... list(zip(*map(lambda s: s.split('\t'), loglines[1:])))[1] (the list() call is needed in Python 3, where zip() returns an iterator).