Delta words between two TXT files - python-3.x

I would like to count the delta of words between two files.
file_1.txt contains One file with some text and words.
file_2.txt contains One file with some text and additional words to be found.
The diff command on Unix systems gives the following output; difflib can give a similar output.
$ diff file_1.txt file_2.txt
1c1
< One file with some text and words.
---
> One file with some text and additional words to be found.
Is there an easy way to find the words added or removed between two files, or at least between two lines, as git diff --word-diff does?

First of all you need to read your files into strings with open(), where 'file_1.txt' is the path to your file and 'r' stands for "reading mode".
Do the same for the second file, and don't forget to close() your files when you're done!
Use the split() method to split the strings you have just read into lists of words (calling split() with no argument splits on any whitespace, so trailing newlines are handled too).
file_1 = open('file_1.txt', 'r')
text_1 = file_1.read().split()
file_1.close()
file_2 = open('file_2.txt', 'r')
text_2 = file_2.read().split()
file_2.close()
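As an aside, the same reads can be written with "with" blocks so the files are closed automatically, even on errors. A self-contained sketch that first creates the two sample files with tempfile, just so the snippet runs on its own:

```python
import tempfile
import os

# create two small sample files so the snippet is self-contained
dir_ = tempfile.mkdtemp()
path_1 = os.path.join(dir_, 'file_1.txt')
path_2 = os.path.join(dir_, 'file_2.txt')
with open(path_1, 'w') as f:
    f.write('One file with some text and words.')
with open(path_2, 'w') as f:
    f.write('One file with some text and additional words to be found.')

# `with` closes each file automatically when the block exits;
# split() with no argument splits on any whitespace, newlines included
with open(path_1) as f:
    text_1 = f.read().split()
with open(path_2) as f:
    text_2 = f.read().split()

print(len(text_1), len(text_2))  # 7 11
```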
Next you need to get the difference between the text_1 and text_2 lists.
There are many ways to do it.
1)
You can use the Counter class from the collections module.
Pass your lists to the constructor, then find the difference by subtracting in both directions, call the elements() method to get the elements, and list() to turn the result into a list.
from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
difference = list((text_count_1 - text_count_2).elements()) + list((text_count_2 - text_count_1).elements())
Here is a way to calculate the net delta of words.
from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
delta = len(list((text_count_2 - text_count_1).elements())) \
- len(list((text_count_1 - text_count_2).elements()))
print(delta)
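Applied to the question's two sample sentences (inlined here as strings so the snippet stands alone), the Counter approach gives a net delta of 4:

```python
from collections import Counter

text_1 = 'One file with some text and words.'.split()
text_2 = 'One file with some text and additional words to be found.'.split()

text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)

# words that appear more often in the second file, and vice versa
added = list((text_count_2 - text_count_1).elements())
removed = list((text_count_1 - text_count_2).elements())

delta = len(added) - len(removed)
print(sorted(added))
print(removed)
print(delta)  # 4
```

Note that 'words.' and 'words' count as different tokens here, because the trailing period is kept by split().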
2)
Use the Differ class from the difflib module. Pass both lists to the compare() method of a Differ instance, then iterate over the result with a for loop.
from difflib import Differ
difference = []
for d in Differ().compare(text_1, text_2):
    difference.append(d)
Then you can count the delta words like this.
from difflib import Differ
delta = 0
for d in Differ().compare(text_1, text_2):
    status = d[0]
    if status == "+":
        delta += 1
    elif status == "-":
        delta -= 1
print(delta)
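On the question's sample sentences, the Differ approach gives the same net delta as the Counter approach; a self-contained sketch (lines that Differ prefixes with '?' are intraline hints and are ignored here):

```python
from difflib import Differ

text_1 = 'One file with some text and words.'.split()
text_2 = 'One file with some text and additional words to be found.'.split()

delta = 0
for d in Differ().compare(text_1, text_2):
    # each result line starts with '+ ', '- ', '  ', or '? '
    if d.startswith('+ '):
        delta += 1
    elif d.startswith('- '):
        delta -= 1
print(delta)  # 4
```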
3)
You can write a difference function yourself. For example:
def get_diff(list_1, list_2):
    d = []
    for item in list_1:
        if item not in list_2:
            d.append(item)
    return d

difference = get_diff(text_1, text_2) + get_diff(text_2, text_1)
There are certainly other ways to do this, but I will limit myself to these three.
Once you have the difference list, you can format the output however you wish.

...and here is yet another way to do this with dict()
#!/usr/bin/python3
import sys

def loadfile(filename):
    h = dict()
    with open(filename) as f:
        for line in f:
            words = line.split(' ')
            for word in words:
                h[word.strip()] = 1
    return h

first = loadfile(sys.argv[1])
second = loadfile(sys.argv[2])
print("in both first and second")
for k in first.keys():
    if k and k in second:
        print(k)
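The dict above is really being used as a set of words; the same idea can be sketched more directly with set operations (file contents are inlined as strings here so the snippet is self-contained):

```python
def wordset(text):
    # mirror loadfile(): split each line on spaces and strip each word
    return {word.strip() for line in text.splitlines() for word in line.split(' ')}

first = wordset('One file with some text and words.')
second = wordset('One file with some text and additional words to be found.')

print(sorted(first & second))  # words in both files
print(sorted(second - first))  # words only in the second file
```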

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file, then write the word and write 'present' next to it.
If the word is not present, then write the word and write 'not_present' next to it.
So far I can do this fine by printing to the console output, as shown below:
import sys

filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'

# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)

# search and list
with open(compare) as f:
    searcher = f.read()
if not searcher:
    sys.exit("Could not read data :-(")

# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas dataframe, preferably with 2 columns, one for the word and the second for its state. I can't seem to get my head around doing this.
I am making several assumptions here to include:
Compare.txt is a text file consisting of a list of single words 1 word per line.
Source.txt is a free flowing text file, which includes multiple words per line and each word is separated by a space.
When comparing to determine whether a compare word is in source, it is found if, and only if, no punctuation marks (i.e. " ' , . ? etc.) are appended to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)

if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
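The core of the approach above is just building two parallel columns; a minimal sketch with the inputs inlined as lists (compare_words and source_words are stand-ins for the contents of the two files):

```python
compare_words = ['dog', 'cat', 'horse']                  # stands in for compare.txt
source_words = ['the', 'dog', 'chased', 'the', 'cat']    # stands in for source.txt

# build two parallel columns: word and its state
rslt = {'Name': [], 'State': []}
for w in compare_words:
    rslt['Name'].append(w)
    rslt['State'].append('Present' if w in source_words else 'Not Present')

print(rslt)
# feeding this dict to pandas yields the two-column table:
# df = pd.DataFrame(rslt)
```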

How to separate lines of data read from a textfile? Customers with their orders

I have this data in a text file. (Doesn't have the spacing I added for clarity)
I am using Python3:
orders = open('orders.txt', 'r')
lines = orders.readlines()
I need to loop through the lines variable, which contains all the lines of the data, and separate out the CO groups as I've spaced them.
CO lines are customers, and the lines below each CO line are the orders that customer placed.
A CO line tells us how many order lines follow: characters 7-9 (the slice [7:10]) of the CO string.
I illustrate this below.
CO77812002D10212020 <---(002)
125^LO917^11212020. <----line 1
235^IL993^11252020 <----line 2
CO77812002S10212020
125^LO917^11212020
235^IL993^11252020
CO95307005D06092019 <---(005)
194^AF977^06292019 <---line 1
72^L223^07142019 <---line 2
370^IL993^08022019 <---line 3
258^Y337^07072019 <---line 4
253^O261^06182019 <---line 5
CO30950003D06012019
139^LM485^06272019
113^N669^06192019
249^P530^07112019
CO37501001D05252020
479^IL993^06162020
I have thought of a brute force way of doing this but it won't work against much larger datasets.
Any help would be greatly appreciated!
You can use the fileinput module to "simultaneously" read and modify your file. In fact, the in-place functionality it offers to modify a file while parsing it is implemented through a second backup file. Specifically, as stated in the documentation:
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (...) by default, the extension is '.bak' and it is deleted when the output file is closed.
Therefore, you can format your file as specified this way:
import fileinput

with fileinput.input(files=['orders.txt'], inplace=True) as orders_file:
    for line in orders_file:
        if line[:2] == 'CO':  # Detect customer line
            orders_counter = 0
            num_of_orders = int(line[7:10])  # Extract number of orders
        else:
            orders_counter += 1
            # If the last order for a specific customer has been reached,
            # append a '\n' character to format it as desired
            if orders_counter == num_of_orders:
                line += '\n'
        # Since standard output is redirected to the file, print writes into the file
        print(line, end='')
Note: it's supposed that the file with the orders is formatted exactly in the way you specified:
CO...
(order_1)
(order_2)
...
(order_i)
CO...
(order_1)
...
This did what I was hoping to get done!
tot_customers = []
with open("orders.txt", "r") as a_file:
    customer = []
    for line in a_file:
        stripped_line = line.strip()
        if stripped_line[:2] == "CO":
            customer.append(stripped_line)
            print("customers: ", customer)
            orders_counter = 0
            num_of_orders = int(stripped_line[7:10])
        else:
            customer.append(stripped_line)
            orders_counter += 1
            if orders_counter == num_of_orders:
                tot_customers.append(customer)
                customer = []
                orders_counter = 0
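As a cross-check, the grouping can also be done without the count field at all, by starting a new group at each CO prefix. A small sketch (group_orders is an illustrative helper name, and the sample list is abridged from the question's data):

```python
def group_orders(lines):
    # collect each customer line together with the order lines that follow it
    customers = []
    for line in lines:
        if line.startswith('CO'):
            customers.append([line])        # start a new customer group
        else:
            customers[-1].append(line)      # attach order to the latest customer
    return customers

sample = [
    'CO77812002D10212020', '125^LO917^11212020', '235^IL993^11252020',
    'CO95307005D06092019', '194^AF977^06292019', '72^L223^07142019',
    '370^IL993^08022019', '258^Y337^07072019', '253^O261^06182019',
]

groups = group_orders(sample)
print(len(groups))         # 2 customers
print(len(groups[1]) - 1)  # 5 orders for the second customer
```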

Is there a way to pass variable as counter to list index in python?

Sorry if I am asking a very basic question, but I am new to Python and need help with the question below.
I am trying to write a file parser where I am counting the number of occurrences (modified programs) mentioned in the file.
I am then storing all the occurrences in an empty list and keeping a counter for each occurrence.
Up to here all is fine.
Now I am trying to create files based on the names captured in the list, and to store the lines that do not match in a separate file, but I am getting an "index out of range" error: when I pass el[count], it looks like count is being treated as a string rather than its value.
Can someone help?
import sys
import re

count = 1
j = 0
k = 0
el = []
f = open("change_programs.txt", 'w+')
data = open("oct-released_diff.txt", encoding='utf-8', errors='ignore')
for i in data:
    if len(i.strip()) > 0 and i.strip().startswith("diff --git"):
        count = count + 1
        el.append(i)
        fl = []
    else:
        filename = "%s.txt" % el[int(count)]
        h = open(filename, 'w+')
        fl.append(i)
        print(fl, file=h)
el = '\n'.join(el)
print(el, file=f)
print(filename)
data.close()
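Note the likely cause of the error: count starts at 1 and is incremented for every header, so after the first header it is already larger than the last valid index of el. The most recently appended element is at el[len(el) - 1], or simply el[-1]. A minimal sketch of that indexing pattern (the sample diff lines and the changes mapping are hypothetical, just to make the snippet runnable):

```python
lines = [
    "diff --git a/prog1.c b/prog1.c",
    "-old line",
    "+new line",
    "diff --git a/prog2.c b/prog2.c",
    "+added line",
]

el = []       # headers of modified programs seen so far
changes = {}  # hypothetical mapping: header -> its changed lines

for line in lines:
    if line.startswith("diff --git"):
        el.append(line)
        changes[el[-1]] = []
    else:
        # el[-1] is the most recent header; el[len(el) - 1] is equivalent
        changes[el[-1]].append(line)

print(len(el))          # 2
print(changes[el[0]])   # ['-old line', '+new line']
```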

How to Read Multiple Files in a Loop in Python and get count of matching words

I have two text files and two lists (FIRST_LIST, SCND_LIST). I want to find, for each file, the count of matching words from FIRST_LIST and SCND_LIST individually.
FIRST_LIST =
"accessorizes","accessorizing","accessorized","accessorize"
SCND_LIST=
"accessorize","accessorized","accessorizes","accessorizing"
text File1 contains:
This is a very good question, and you have received good answers which describe interesting topics accessorized accessorize.
text File2 contains:
is more applied,using accessorize accessorized,accessorizes,accessorizing
output
File1 first list count=2
File1 second list count=0
File2 first list count=0
File2 second list count=4
I have tried the code below to achieve this functionality, but I am not able to get the expected output.
Any help is appreciated.
import os
import glob
import re

files = []
for filename in glob.glob("*.txt"):
    files.append(filename)

# remove punctuation
def remove_punctuation(line):
    return re.sub(r'[^\w\s]', '', line)

two_files = []
for filename in files:
    for line in open(filename):
        # two_files.append(remove_punctuation(line))
        print(remove_punctuation(line), end='')
        two_files.append(remove_punctuation(line))

FIRST_LIST = "accessorizes", "accessorizing", "accessorized", "accessorize"
SCND_LIST = "accessorize", "accessorized", "accessorizes", "accessorizing"

c = []
for match in FIRST_LIST:
    if any(match in value for value in two_files):
        # c = match + 1
        print(match)
        c.append(match)
print(c)
len(c)

d = []
for match in SCND_LIST:
    if any(match in value for value in two_files):
        # c = match + 1
        print(match)
        d.append(match)
print(d)
len(d)
Using Counter and some list comprehensions is one of many different approaches to solve your problem.
I assume your sample output is wrong, since some words are part of both lists and both files but are not counted. In addition, I added a second line to the sample strings in order to show how this works with multi-line strings, which might be the typical contents of a given file.
io.StringIO objects emulate your files, but working with real files from your file system works exactly the same since both provide a file-like object or file-like interface:
import io
from collections import Counter

list_a = ["accessorizes", "accessorizing", "accessorized", "accessorize"]
list_b = ["accessorize", "accessorized", "accessorizes", "accessorizing"]

# added a second line to each string just for the sake of the example
file_contents_a = 'This is a very good question, and you have received good answers which describe interesting topics accessorized accessorize.\nThis is the second line in file a'
file_contents_b = 'is more applied,using accessorize accessorized,accessorizes,accessorizing\nThis is the second line in file b'

# using io.StringIO to simulate a file input (--> file-like object)
# you should use `with open(filename) as ...` for real file input
file_like_a = io.StringIO(file_contents_a)
file_like_b = io.StringIO(file_contents_b)

# read file contents and split lines into a list of strings
lines_of_file_a = file_like_a.read().splitlines()
lines_of_file_b = file_like_b.read().splitlines()

# iterate through all lines of each file (for file a here)
for line_number, line in enumerate(lines_of_file_a):
    words = line.replace('.', ' ').replace(',', ' ').split(' ')
    c = Counter(words)
    in_list_a = sum([v for k, v in c.items() if k in list_a])
    in_list_b = sum([v for k, v in c.items() if k in list_b])
    print("Line {}".format(line_number))
    print("- in list a {}".format(in_list_a))
    print("- in list b {}".format(in_list_b))

# iterate through all lines of each file (for file b here)
for line_number, line in enumerate(lines_of_file_b):
    words = line.replace('.', ' ').replace(',', ' ').split(' ')
    c = Counter(words)
    in_list_a = sum([v for k, v in c.items() if k in list_a])
    in_list_b = sum([v for k, v in c.items() if k in list_b])
    print("Line {}".format(line_number))
    print("- in list a {}".format(in_list_a))
    print("- in list b {}".format(in_list_b))

# actually, your two lists are the same
lists_are_equal = sorted(list_a) == sorted(list_b)
print(lists_are_equal)
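Since the asker's expected output is one count per file rather than per line, here is a minimal per-file sketch; count_matches is an illustrative helper, punctuation is stripped with the same regex idea as the question, and the sample texts are inlined from the question:

```python
import re
from collections import Counter

FIRST_LIST = ["accessorizes", "accessorizing", "accessorized", "accessorize"]

def count_matches(text, wordlist):
    # replace punctuation with spaces, split on whitespace, then total up list hits
    words = re.sub(r'[^\w\s]', ' ', text).split()
    counts = Counter(words)
    return sum(counts[w] for w in wordlist)

file1 = ("This is a very good question, and you have received good answers "
         "which describe interesting topics accessorized accessorize.")
file2 = "is more applied,using accessorize accessorized,accessorizes,accessorizing"

print(count_matches(file1, FIRST_LIST))  # 2
print(count_matches(file2, FIRST_LIST))  # 4
```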

Merge Two wordlists into one file

I have two wordlists, as per examples below:
wordlist 1 :
code1
code2
code3
wordlist 2 :
11
22
23
I want to take wordlist 2 and append every number in it to each line of wordlist 1.
Example of the output:
code111
code122
code123
code211
code222
code223
code311
.
.
Can you please help me with how to do it? Thanks!
You can run two nested for loops to iterate over both lists, and append the concatenated string to a new list.
Here is a little example:
## create lists using square brackets
wordlist1 = ['code1',  ## wrap something in quotes to make it a string
             'code2', 'code3']
wordlist2 = ['11', '22', '23']

## create a new empty list
concatenated_words = []

## first for loop: one iteration per item in wordlist1
for i in range(len(wordlist1)):
    ## word with index i of wordlist1 (square brackets for indexing)
    word1 = wordlist1[i]
    ## second for loop: one iteration per item in wordlist2
    for j in range(len(wordlist2)):
        word2 = wordlist2[j]
        ## append concatenated words to the initially empty list
        concatenated_words.append(word1 + word2)

## iterate over the list of concatenated words, and print each item
for k in range(len(concatenated_words)):
    print(concatenated_words[k])
list1 = ["text1", "text2", "text3", "text4"]
list2 = [11, 22, 33, 44]

def iterativeConcatenation(list1, list2):
    result = []
    for i in range(len(list1)):
        for j in range(len(list2)):
            result = result + [str(list1[i]) + str(list2[j])]
    return result
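For completeness, the same cross product can be written more compactly with itertools.product from the standard library; a small sketch using the sample values from the question:

```python
from itertools import product

wordlist1 = ['code1', 'code2', 'code3']
wordlist2 = ['11', '22', '23']

# product() yields every (word, number) pair, in nested-loop order
merged = [word + num for word, num in product(wordlist1, wordlist2)]
print(merged[:4])  # ['code111', 'code122', 'code123', 'code211']
```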
Have you figured it out? It depends on whether you want to type in the names of each list, or, for instance, have it automatically read the files and then append or extend a new text file. I am working on a little script at the moment, and here is a very quick and simple way, assuming you want all the text files in the same folder as your .py file:
import os

# this makes a list with all .txt files in the folder
list_names = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]

outfile = 'merged.txt'  # name of the output file collecting all words
for file_name in list_names:
    with open(os.getcwd() + "/" + file_name) as fh:
        words = fh.read().splitlines()
    with open(outfile, 'a') as fh2:
        for word in words:
            fh2.write(word + '\n')
