How to read multiple files in a loop in Python and get the count of matching words

I have two text files and two lists (FIRST_LIST, SCND_LIST). For each file, I want to count the words that match FIRST_LIST and SCND_LIST individually.
FIRST_LIST =
"accessorizes","accessorizing","accessorized","accessorize"
SCND_LIST=
"accessorize","accessorized","accessorizes","accessorizing"
text File1 contains:
This is a very good question, and you have received good answers which describe interesting topics accessorized accessorize.
text File2 contains:
is more applied,using accessorize accessorized,accessorizes,accessorizing
output
File1 first list count=2
File1 second list count=0
File2 first list count=0
File2 second list count=4
This is the code I have tried to achieve this, but I am not able to get the expected output.
Any help is appreciated.
import os
import glob
files=[]
for filename in glob.glob("*.txt"):
    files.append(filename)
# remove Punctuations
import re
def remove_punctuation(line):
    return re.sub(r'[^\w\s]', '', line)
two_files=[]
for filename in files:
    for line in open(filename):
        #two_files.append(remove_punctuation(line))
        print(remove_punctuation(line),end='')
        two_files.append(remove_punctuation(line))
FIRST_LIST = "accessorizes","accessorizing","accessorized","accessorize"
SCND_LIST="accessorize","accessorized","accessorizes","accessorizing"
c=[]
for match in FIRST_LIST:
    if any(match in value for value in two_files):
        #c=match+1
        print (match)
        c.append(match)
print(c)
len(c)
d=[]
for match in SCND_LIST:
    if any(match in value for value in two_files):
        #c=match+1
        print (match)
        d.append(match)
print(d)
len(d)

Using Counter and a list comprehension is one of many possible approaches to your problem.
I assume your sample output is wrong, since some words appear in both lists and both files but are not counted. In addition, I added a second line to each sample string to show how this works with multi-line strings, which is what a file typically contains.
The io.StringIO objects emulate your files; working with real files from your file system works exactly the same way, since both provide a file-like interface:
import io
from collections import Counter
list_a = ["accessorizes", "accessorizing", "accessorized", "accessorize"]
list_b = ["accessorize", "accessorized", "accessorizes", "accessorizing"]
# added a second line to each string just for the sake of demonstration
file_contents_a = 'This is a very good question, and you have received good answers which describe interesting topics accessorized accessorize.\nThis is the second line in file a'
file_contents_b = 'is more applied,using accessorize accessorized,accessorizes,accessorizing\nThis is the second line in file b'
# using io.StringIO to simulate a file input (--> file-like object)
# you should use `with open(filename) as ...` for real file input
file_like_a = io.StringIO(file_contents_a)
file_like_b = io.StringIO(file_contents_b)
# read file contents and split lines into a list of strings
lines_of_file_a = file_like_a.read().splitlines()
lines_of_file_b = file_like_b.read().splitlines()
# iterate through all lines of each file (for file a here)
for line_number, line in enumerate(lines_of_file_a):
    words = line.replace('.', ' ').replace(',', ' ').split(' ')
    c = Counter(words)
    in_list_a = sum([v for k,v in c.items() if k in list_a])
    in_list_b = sum([v for k,v in c.items() if k in list_b])
    print("Line {}".format(line_number))
    print("- in list a {}".format(in_list_a))
    print("- in list b {}".format(in_list_b))
# iterate through all lines of each file (for file b here)
for line_number, line in enumerate(lines_of_file_b):
    words = line.replace('.', ' ').replace(',', ' ').split(' ')
    c = Counter(words)
    in_list_a = sum([v for k,v in c.items() if k in list_a])
    in_list_b = sum([v for k,v in c.items() if k in list_b])
    print("Line {}".format(line_number))
    print("- in list a {}".format(in_list_a))
    print("- in list b {}".format(in_list_b))
# actually, your two lists are the same
lists_are_equal = sorted(list_a) == sorted(list_b)
print(lists_are_equal)
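The per-line loops above can also be folded into one helper that reports a total per file and per list, which is the shape of output the question asks for. The following is only a minimal sketch, not the answer's own code: the helper name `count_matches` and the `files` dictionary are made up for illustration, and the file contents are inlined as strings so the example runs without touching the file system.

```python
import re
from collections import Counter

FIRST_LIST = ["accessorizes", "accessorizing", "accessorized", "accessorize"]
SCND_LIST = ["accessorize", "accessorized", "accessorizes", "accessorizing"]

def count_matches(text, words):
    """Return how many whole-word tokens in `text` appear in `words`."""
    # \w+ splits on spaces and punctuation alike, so "accessorize." still matches
    tokens = Counter(re.findall(r"\w+", text.lower()))
    return sum(tokens[w] for w in set(words))

# Hypothetical stand-ins for the two files; with real files, read each one
# with `with open(name) as fh: text = fh.read()` instead.
files = {
    "File1": "This is a very good question, and you have received good answers "
             "which describe interesting topics accessorized accessorize.",
    "File2": "is more applied,using accessorize accessorized,accessorizes,accessorizing",
}

results = {}
for name, text in files.items():
    results[name] = (count_matches(text, FIRST_LIST), count_matches(text, SCND_LIST))
    print("{} first list count={}".format(name, results[name][0]))
    print("{} second list count={}".format(name, results[name][1]))
```

Because both lists hold the same four words, each file gets the same count for both lists here.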

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file then write the word and write 'present' next to it.
If the word is not present then write the word and write not_present next to it.
So far I can do this fine by printing to the console, as shown below:
import sys
filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'
# change to lower case
with open(filein,'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein,'w') as fopen:
    fopen.write(string)
# search and list
with open(compare) as f:
    searcher = f.read()
    if not searcher:
        sys.exit("Could not read data :-(")
#search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
The output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas dataframe, preferably with two columns: one for the word and one for its state. I can't seem to get my head around doing this.
I am making several assumptions here:
Compare.txt is a text file consisting of a list of single words, one word per line.
Source.txt is a free-flowing text file, which includes multiple words per line, each separated by a space.
A compare word counts as found in source if and only if no punctuation marks (i.e. " ' , . ? etc.) are attached to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict
compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)
def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist
cmpList = getCompareTxt(compare)
if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
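The question itself goes the other way round: each line of source.txt is one word, and that word is looked up in the text of compare.txt. A minimal sketch of that direction follows. The helper name `word_states` and the inlined sample strings are assumptions for illustration, and the lookup is a whole-word match rather than the substring test used in the question.

```python
def word_states(source_lines, compare_text):
    """Pair every non-empty source word with 'present' or 'not_present'."""
    known = set(compare_text.lower().split())
    rows = []
    for line in source_lines:
        word = line.strip().lower()
        if word:
            rows.append((word, "present" if word in known else "not_present"))
    return rows

# Hypothetical sample data standing in for source.txt and compare.txt.
source_lines = ["dog\n", "cat\n", "horse\n"]
compare_text = "the dog chased the cat"

rows = word_states(source_lines, compare_text)
for word, state in rows:
    print(word, state)
```

The list of tuples can then be handed to `pd.DataFrame(rows, columns=['word', 'state'])` to get the two-column dataframe.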

Merge two consecutive lines only if both start with the same prefix in Python, and write the rest of the text normally

Input
02000|42163,54|
03100|4|6070,00
03110|||6070,00|00|00|
00000|31751150201912001|01072000600074639|
02000|288465,76|
03100|11|9060,00
03110|||1299,00|00|
03110||||7761,00|00|
03100|29|14031,21
03110|||14031,21|00|
00000|31757328201912001|01072000601021393|
Code
prev = ''
with open('out.txt') as f:
    for line in f:
        if prev.startswith('03110') and line.startswith('03110'):
            print(prev.strip()+ '|03100|XX|PARCELA|' + line)
        prev = line
Hi, I have this code that checks whether two consecutive lines start with 03110 and prints the merged pair, but I want to change the code so that it also prints (or writes to a .txt file) the rest of the lines.
The output should look like this:
02000|42163,54|
03100|4|6070,00
03110|||6070,00|00|00|
00000|31751150201912001|01072000600074639|
02000|288465,76|
03100|11|9060,00
03110|||1299,00|00|3100|XX|PARCELA|03110||||7761,00|00|
03100|29|14031,21
03110|||14031,21|00|
00000|31757328201912001|01072000601021393|
I know that I'm getting only the two merged lines, because that is all print() is asked to output:
03110|||1299,00|00|3100|XX|PARCELA|03110||||7761,00|00|
But I don't know how to produce the desired output. Can anyone help me with my code?
# I assume the input is in a text file:
with open('myFile.txt', 'r') as my_file:
    splited_line = [line.rstrip().split('|') for line in my_file] # this will split every line as a separate list
new_list = []
last = len(splited_line) - 1
for i in range(len(splited_line)):
    # explicit bounds checks: index -1 would wrap around to the last line,
    # and index last+1 does not exist
    is_03110 = splited_line[i][0] == '03110'
    prev_is_03110 = i > 0 and splited_line[i-1][0] == '03110'
    next_is_03110 = i < last and splited_line[i+1][0] == '03110'
    if is_03110 and prev_is_03110: # if the current line and the previous line start with 03110
        first = '|'.join(splited_line[i-1])
        second = '|'.join(splited_line[i])
        newLine = first + "|03100|XX|PARCELA|"+ second
        new_list.append(newLine)
    elif is_03110 and next_is_03110: # skip here to avoid duplicates; the line is written as part of the merge on the next iteration
        pass
    else:
        line = '|'.join(splited_line[i])
        new_list.append(line)
# To write the new_list to text files
with open('new_file' , 'a') as f:
    for item in new_list:
        print(item)
        f.write(item + '\n')
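A variation that stays closer to the `prev` idea in the question is to hold one line back until its successor has been seen. This is a sketch under stated assumptions rather than a definitive implementation: the function name `merge_consecutive` and its parameters are invented for illustration, the input is inlined as a list instead of being read from a file, and a run of three matching lines is treated as one merged pair followed by a single line.

```python
def merge_consecutive(lines, prefix="03110", joiner="|03100|XX|PARCELA|"):
    """Merge each pair of adjacent lines that both start with `prefix`."""
    merged = []
    pending = None  # line held back until we have seen its successor
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None and pending.startswith(prefix) and line.startswith(prefix):
            merged.append(pending + joiner + line)
            pending = None  # both lines are consumed by the merge
        else:
            if pending is not None:
                merged.append(pending)
            pending = line
    if pending is not None:
        merged.append(pending)  # flush the last held-back line
    return merged

# Sample input from the question, inlined instead of read from a file.
sample = [
    "02000|42163,54|\n",
    "03100|4|6070,00\n",
    "03110|||6070,00|00|00|\n",
    "00000|31751150201912001|01072000600074639|\n",
    "02000|288465,76|\n",
    "03100|11|9060,00\n",
    "03110|||1299,00|00|\n",
    "03110||||7761,00|00|\n",
    "03100|29|14031,21\n",
    "03110|||14031,21|00|\n",
    "00000|31757328201912001|01072000601021393|\n",
]

result = merge_consecutive(sample)
print("\n".join(result))
```

With a real file, passing the open file object as `lines` works the same way, since it is iterated line by line.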

Skip lines with strange characters when I read a file

I am trying to read some '.txt' data files, and some of them contain strange random characters and even extra columns in random rows, as in the following example, where the second row is a well-formed row:
CTD 10/07/30 05:17:14.41 CTD 24.7813, 0.15752, 1.168, 0.7954, 1497.¸ 23.4848, 0.63042, 1.047, 3.5468, 1496.542
CTD 10/07/30 05:17:14.47 CTD 23.4846, 0.62156, 1.063, 3.4935, 1496.482
I read the description of np.loadtxt and I have not found a solution for my problem. Is there a systematic way to skip rows like these?
The code that I use to read the files is:
import numpy as np
from io import StringIO

#Function to read a datafile
def Read(filename):
    #Change delimiters for spaces
    s = open(filename).read().replace(':',' ')
    s = s.replace(',',' ')
    s = s.replace('/',' ')
    #Take the columns that we need
    data=np.loadtxt(StringIO(s),usecols=(4,5,6,8,9,10,11,12))
    return data
This works without the csv module used in the other answer: it reads the file line by line and keeps only the lines that are pure ASCII.
data = []
def isascii(s):
    return len(s) == len(s.encode())
with open("test.txt", "r") as fil:
    for line in fil:
        res = map(isascii, line)
        if all(res):
            data.append(line)
print(data)
You could use the csv module to read the file one line at a time and apply your desired filter.
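import csv
expected_length = 5  # comma-separated fields in a well-formed row of the example above; adjust for your data
def isascii(s):
    return len(s) == len(s.encode())
with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        if len(row)==expected_length and all((isascii(x) for x in row)):
            pass  # write row onto numpy array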
I got the ASCII check from this thread: How to check if a string in Python is in ASCII?
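Both answers above combine the same two checks: the line must be pure ASCII and it must have the expected number of columns. A minimal sketch that applies both before any numeric parsing is shown below. It assumes Python 3.7 or later for `str.isascii()`; the function name `keep_clean_rows` and the default of 13 fields (the width of the well-formed sample row once `:`, `,` and `/` are turned into spaces) are assumptions for illustration.

```python
def keep_clean_rows(lines, expected_fields=13):
    """Yield the fields of lines that are pure ASCII and have the expected width."""
    for line in lines:
        if not line.isascii():
            continue  # skip rows containing stray non-ASCII characters
        for sep in ":,/":
            line = line.replace(sep, " ")
        fields = line.split()
        if len(fields) == expected_fields:
            yield fields

# The two sample rows from the question; the first one is corrupted.
sample = [
    "CTD 10/07/30 05:17:14.41 CTD 24.7813, 0.15752, 1.168, 0.7954, 1497.¸ 23.4848, 0.63042, 1.047, 3.5468, 1496.542",
    "CTD 10/07/30 05:17:14.47 CTD 23.4846, 0.62156, 1.063, 3.4935, 1496.482",
]

rows = list(keep_clean_rows(sample))
# Same columns as usecols=(4,5,6,8,9,10,11,12) in the question.
values = [[float(r[i]) for i in (4, 5, 6, 8, 9, 10, 11, 12)] for r in rows]
print(values)
```

The resulting list of lists can be passed to `np.array(values)` if a NumPy array is needed.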

Merge Two wordlists into one file

I have two wordlists, as per examples below:
wordlist 1 :
code1
code2
code3
wordlist 2 :
11
22
23
I want to append every number in wordlist 2 to each word in wordlist 1, one combination per line.
Example of the output:
code111
code122
code123
code211
code222
code223
code311
.
.
Can you please help me with how to do it? Thanks!
You can run two nested for loops to iterate over both lists, and append the concatenated string to a new list.
Here is a little example:
## create lists using square brackets
wordlist1=['code1', ## wrap something in quotes to make it a string
           'code2','code3']
wordlist2=['11','22','23']
## create a new empty list
concatenated_words=[]
## first for loop: one iteration per item in wordlist1
for i in range(len(wordlist1)):
    ## word with index i of wordlist1 (square brackets for indexing)
    word1=wordlist1[i]
    ## second for loop: one iteration per item in wordlist2
    for j in range(len(wordlist2)):
        word2=wordlist2[j]
        ## append concatenated words to the initially empty list
        concatenated_words.append(word1+word2)
## iterate over the list of concatenated words, and print each item
for k in range(len(concatenated_words)):
    print(concatenated_words[k])
list1 = ["text1","text2","text3","text4"]
list2 = [11,22,33,44]
def iterativeConcatenation(list1, list2):
    result = []
    # outer loop over list1, inner loop over list2, so each index stays in range
    for i in range(len(list1)):
        for j in range(len(list2)):
            result = result + [str(list1[i])+str(list2[j])]
    return result
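The nested loops in the answers above compute a Cartesian product of the two lists, which the standard library also offers directly as `itertools.product`. A short sketch with the word lists from the question inlined (reading them from files is left out here):

```python
from itertools import product

wordlist1 = ["code1", "code2", "code3"]
wordlist2 = ["11", "22", "23"]

# product() yields every (word, number) pair, varying the second list fastest
combined = [word + number for word, number in product(wordlist1, wordlist2)]
for item in combined:
    print(item)
```

The order matches the expected output in the question: all numbers for `code1` first, then `code2`, and so on.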
Have you figured it out? It depends on whether you want to enter the names of each list by hand or have the script read the files automatically and then append them to a new text file. I am working on a little script at the moment, and here is a very quick and simple way, assuming you want all the text files in the same folder as your .py file:
import os
# hypothetical output name; it does not end in .txt, so it is not picked up as input on a later run
outfile = 'merged_wordlist.out'
#this makes a list with all .txt files in the folder.
list_names = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
for file_name in list_names:
    with open(os.getcwd() + "/" + file_name) as fh:
        words = fh.read().splitlines()
    with open(outfile, 'a') as fh2:
        for word in words:
            fh2.write(word + '\n')

Delta words between two TXT files

I would like to count the delta words between two files.
file_1.txt has the content: One file with some text and words.
file_2.txt has the content: One file with some text and additional words to be found.
The diff command on Unix systems gives the following information; difflib can give similar output.
$ diff file_1.txt file_2.txt
1c1
< One file with some text and words.
---
> One file with some text and additional words to be found.
Is there an easy way to find the words added or removed between two files, or at least between two lines, as git diff --word-diff does?
First of all, you need to read your files into strings with open(), where 'file_1.txt' is the path to your file and 'r' stands for "reading mode".
Do the same for the second file. And don't forget to close() your files when you're done!
Use the split(' ') function to split the strings you have just read into lists of words.
file_1 = open('file_1.txt', 'r')
text_1 = file_1.read().split(' ')
file_1.close()
file_2 = open('file_2.txt', 'r')
text_2 = file_2.read().split(' ')
file_2.close()
The next step is to get the difference between the text_1 and text_2 list variables (objects).
There are many ways to do it.
1)
You can use the Counter class from the collections library.
Pass your lists to the class's constructor, then find the difference by subtracting in both directions; call the elements() method to get the elements and list() to turn the result into a list.
from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
difference = list((text_count_1 - text_count_2).elements()) + list((text_count_2 - text_count_1).elements())
The delta (words added minus words removed) can then be calculated like this:
from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
delta = len(list((text_count_2 - text_count_1).elements())) \
- len(list((text_count_1 - text_count_2).elements()))
print(delta)
2)
Use the Differ class from the difflib library. Pass both lists to the compare() method of Differ and then iterate over the result with a for loop.
from difflib import Differ
difference = []
for d in Differ().compare(text_1, text_2):
    difference.append(d)
Then you can count the delta words like this.
from difflib import Differ
delta = 0
for d in Differ().compare(text_1, text_2):
    status = d[0]
    if status == "+":
        delta += 1
    elif status == "-":
        delta -= 1
print(delta)
3)
You can write the difference function yourself. For example:
def get_diff (list_1, list_2):
    d = []
    for item in list_1:
        if item not in list_2:
            d.append(item)
    return d
difference = get_diff(text_1, text_2) + get_diff(text_2, text_1)
There are other ways to do this, but I will limit myself to these three.
Once you have the difference list, you can format the output however you wish.
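One detail to keep in mind with split(' ') is that punctuation stays attached to the words, so "words." and "words" count as different words. The sketch below repeats the Counter approach but tokenizes with a regular expression instead. The function name `word_delta` is made up for illustration, and the two file contents from the question are inlined as strings.

```python
import re
from collections import Counter

def word_delta(text_a, text_b):
    """Return (added, removed) word lists going from text_a to text_b."""
    count_a = Counter(re.findall(r"\w+", text_a))
    count_b = Counter(re.findall(r"\w+", text_b))
    # Counter subtraction keeps only positive counts
    added = list((count_b - count_a).elements())
    removed = list((count_a - count_b).elements())
    return added, removed

text_1 = "One file with some text and words."
text_2 = "One file with some text and additional words to be found."

added, removed = word_delta(text_1, text_2)
print("added:", added)
print("removed:", removed)
print("delta:", len(added) - len(removed))
```

Unlike the membership test in the third approach, Counter subtraction also accounts for words that occur more than once.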
And here is yet another way to do this, with dict():
#!/usr/bin/env python3
import sys
def loadfile(filename):
    h=dict()
    with open(filename) as f:
        for line in f.readlines():
            words=line.split(' ')
            for word in words:
                h[word.strip()]=1
    return h
first=loadfile(sys.argv[1])
second=loadfile(sys.argv[2])
print("in both first and second")
for k in first.keys():
    if k and k in second.keys():
        print(k)
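Since the dictionary above is only used to record which words occur, a set expresses the same idea more directly, and the words present in both files are then a set intersection. A small sketch with the two sample texts inlined as strings (the helper name `load_words` is hypothetical):

```python
def load_words(text):
    """Return the set of distinct whitespace-separated words in `text`."""
    return set(text.split())

first = load_words("One file with some text and words.")
second = load_words("One file with some text and additional words to be found.")

common = first & second  # words present in both texts
print("in both first and second")
for word in sorted(common):
    print(word)
```

Note that "words." and "words" are not matched here, because punctuation is left attached to the word, as in the dict() version.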
