How do I write Python code that checks whether a user's input matches both a list of words and the contents of a file?
For example, something like this (not valid syntax):

list = ["banana", "apple"]
file = open("file_path", "r")
search_word = input("Search the word you want to search: ")
if search_word in file.read() and search_word in list:
    print("Search word is in the list and in the file")
else:
    print("Search word does not match")
If the search word is not in the list, there is no need to search the contents of the file, since the first condition already fails. Test that condition first, and only search the file if it passes.
For the file itself, a simple substring test tells you whether the word is found in the file contents:

if search_word in data:
    print("match")
However, words that contain the search word will also match (e.g. pineapple would match with apple).
You can use a regular expression to check whether the word appears anywhere in the file. The \b metacharacter matches at a word boundary (the beginning or end of a word), so for example apple won't match inside pineapple. The (?i) flag makes the search case-insensitive, so apple matches Apple, etc.
You can try something like this:

import re

word_list = ["banana", "apple"]  # avoid calling this 'list', which shadows the builtin
search_word = input("Search the word you want to search: ")

if search_word not in word_list:
    # if it is not in the list, there is no need to search the file
    found = False
else:
    with open("file_path", "r") as file:
        data = file.read()
    found = re.search(fr'(?i)\b{search_word}\b', data)

# now report whether there was a match
if found:
    print("Search word is in the list and in the file")
else:
    print("Search word does not match")
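One caveat: the search word is interpolated straight into the regular expression, so input containing regex metacharacters (., *, +, etc.) can change the pattern's meaning. A minimal sketch of the same whole-word test with re.escape added (the helper name is just for illustration):

```python
import re

def word_in_text(search_word, text):
    # re.escape neutralizes any regex metacharacters typed by the user,
    # so the pattern stays a literal whole-word match
    pattern = fr'(?i)\b{re.escape(search_word)}\b'
    return re.search(pattern, text) is not None

print(word_in_text("apple", "I bought an Apple today"))  # True
print(word_in_text("apple", "pineapple juice"))          # False
```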
Related
I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file, write the word with 'present' next to it.
If the word is not present, write the word with 'not_present' next to it.
So far I can do this by printing to the console output, as shown below:
import sys

filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'

# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)

# read the comparison data
with open(compare) as f:
    searcher = f.read()
if not searcher:
    sys.exit("Could not read data :-(")

# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas DataFrame, preferably with 2 columns: one for the word and the second for its state. I can't seem to get my head around doing this.
I am making several assumptions here to include:
Compare.txt is a text file consisting of a list of single words 1 word per line.
Source.txt is a free flowing text file, which includes multiple words per line and each word is separated by a space.
When comparing to determine whether a compare word is in source, it is found if, and only if, no punctuation marks (i.e. " ' , . ?, etc.) are appended to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:

import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)

if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    # print(items)  # uncomment to inspect the parsed source words
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
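As a more compact alternative, the DataFrame can also be built directly from a list comprehension; a sketch with in-memory lists standing in for the contents of compare.txt and source.txt (the data here is illustrative):

```python
import pandas as pd

# illustrative data standing in for compare.txt and source.txt
cmp_words = ["dog", "cat", "horse"]
src_words = {"dog", "cat", "mouse"}  # a set makes membership tests O(1)

df = pd.DataFrame({
    "Name": cmp_words,
    "State": ["Present" if w in src_words else "Not Present" for w in cmp_words],
})
print(df)
```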
I am trying to filter sentences from my pandas DataFrame, which has 50 million records, using keyword search: a sentence should match if any word in it starts with any of these keywords.
WordsToCheck=['hi','she', 'can']
text_string1="my name is handhit and cannary"
text_string2="she can play!"
If I do something like this:
if any(key in text_string1 for key in WordsToCheck):
    print(text_string1)
I get a false positive because "hi" occurs in the middle of "handhit".
How can I avoid such false positives in my result set?
Secondly, is there any faster way to do it in python? I am using apply function currently.
I am following this link so that my question is not a duplicate: How to check if a string contains an element from a list in Python
If the case is important, you can do something like this:

def any_word_starts_with_one_of(sentence, keywords):
    for kw in keywords:
        match_words = [word for word in sentence.split(" ") if word.startswith(kw)]
        if match_words:
            return kw
    return None

keywords = ["hi", "she", "can"]
sentences = ["Hi, this is the first sentence", "This is the second"]
for sentence in sentences:
    if any_word_starts_with_one_of(sentence, keywords):
        print(sentence)

If case is not important, replace the list comprehension with something like this:

match_words = [word for word in sentence.split(" ") if word.lower().startswith(kw.lower())]
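For the speed question, a vectorized alternative to apply is to join the keywords into one word-boundary regex and use Series.str.contains, which pandas evaluates in a single pass over the column; a sketch with illustrative data:

```python
import re
import pandas as pd

keywords = ["hi", "she", "can"]
# \b anchors each keyword to the start of a word, so "handhit" no longer
# triggers a match for "hi"; (?i) makes the match case-insensitive
pattern = r"(?i)\b(?:" + "|".join(map(re.escape, keywords)) + r")"

df = pd.DataFrame({"text": ["Hi, this is the first sentence",
                            "This is the second",
                            "my name is handhit"]})
mask = df["text"].str.contains(pattern, regex=True)
print(df[mask])
```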
I am trying to search for similar words in Python given a wildcard pattern. For example, in a text file similar to a dictionary, I could search for r?v?r? and the correct output would be words such as 'rover', 'raver', 'river'.
This is the code I have so far but it only works when I type in the full word and not the wildcard form.
import re

name = input("Enter the name of the words file:\n")
pattern = input("Enter a search pattern:\n")
textfile = open(name, 'r')
filetext = textfile.read()
textfile.close()
match = re.findall(pattern, filetext)
if match is True:
    print(match)
else:
    print("Sorry, matches for", pattern, "could not be found")
Use dots for blanks:

import re

name = input("Enter the name of the words file:\n")
pattern = input("Enter a search pattern:\n")
textfile = open(name, 'r')
filetext = textfile.read()
textfile.close()
match = re.findall('r.v.r', filetext)
if match:
    print(match)
else:
    print("Sorry, matches for", pattern, "could not be found")

Also, match is a list (re.findall never returns True), so test it with
if match: or if len(match) > 0:
whichever one suits your code.
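To honor the ? wildcards whatever the user types, the pattern can be translated into a regex programmatically; a sketch, where substituting \w for each ? and anchoring with \b are assumptions about the intended matching:

```python
import re

def wildcard_search(pattern, text):
    # each '?' becomes r'\w' (exactly one word character); the \b anchors
    # keep matches to whole words, so 'r?v?r' matches 'rover' but not 'rovers'
    regex = r'\b' + re.escape(pattern).replace(r'\?', r'\w') + r'\b'
    return re.findall(regex, text)

print(wildcard_search('r?v?r', 'rover raver river pepper rovers'))
# → ['rover', 'raver', 'river']
```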
The question is to:
Firstly, find the number of all words in a text file.
Secondly, delete the common words like a, an, and, to, in, at, but, ... (it is allowed to write a list of these words).
Thirdly, find the number of the remaining words (unique words) and make a list of them.
The file name should be used as the parameter of the function.
I have done the first part of the question
import re

file = open('text.txt', 'r', encoding='latin-1')
word_list = file.read().split()
for x in word_list:
    print(x)
res = len(word_list)
print('The number of words in the text: ' + str(res))

def uncommonWords(file):
    uncommonwords = list(file)
    for i in uncommonwords:
        i += 1
        print(i)
The code runs up to printing the number of words, and nothing appears after that.
You can do it like this:
# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words
print(list(unique_words))
First, you may want to get rid of punctuation: as shown in this answer, you could do

nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]

Then you could do

from collections import Counter
counts = Counter(filtered)

You can then access the list of unique words with list(counts.keys()), and you can choose to ignore the words you don't want with

[word for word in list(counts.keys()) if word not in common_words]

Hope this answers your question.
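Putting the pieces together, the whole exercise can be sketched as a single function that takes the file name as its parameter, as the assignment asks; the stop-word list and the tokenizing regex here are illustrative choices:

```python
import re

def word_stats(filename, common_words=("a", "an", "and", "to", "in", "at", "but")):
    # tokenize into lower-case alphanumeric words, dropping punctuation
    with open(filename, encoding="latin-1") as f:
        words = re.findall(r"[a-z0-9']+", f.read().lower())
    total = len(words)
    # unique words remaining after the common words are removed
    unique = sorted(set(w for w in words if w not in common_words))
    return total, unique

# usage:
# total, unique = word_stats("text.txt")
# print("Total words:", total)
# print("Unique words:", len(unique), unique)
```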
I am trying to use grep in Python to search for words in a text file. I tried something like this -
subprocess.call(['/bin/grep', str(word), "textFile.txt"])
This line prints all the output to the console. Also, it reports a match even when the word does not match exactly. For example, it matches even in this case -
xxxwordsxxx
def find_words(in_file, out_file):
    for word in in_file:
        word = word.rstrip()
        subprocess.call(["grep", "-w", word, "textFile.txt"])
edit
My in_file and textFile.txt are the same.
How do I implement a search for the exact word? If this is not a correct way, is there any other way I could do this search? (It is a huge text file and I have to find duplicates of all the words in the file)
Try using the -w parameter:

import subprocess

word = input("select word to filter: ")
subprocess.call(['/bin/grep', "-w", word, "textFile.txt"])  # str() is not needed
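If the larger goal is to find duplicated words in one huge file, a single pass in pure Python avoids spawning one grep process per word; a sketch, assuming words are whitespace-separated (the function name is illustrative):

```python
from collections import Counter

def duplicate_words(path):
    # one pass over the file: count every whitespace-separated word,
    # then keep only the words that appear more than once
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return {word: n for word, n in counts.items() if n > 1}
```

Reading line by line keeps memory bounded by the number of distinct words rather than the file size.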
You can use the .split() method to iterate over the individual words in the line. For example:

string = "My Name Is Josh"
substring = "Name"
for word in string.split():
    if substring == word:
        print("Match Found")