Haystack EdgeNgramField exact match - django-haystack

I am using Haystack with the Whoosh backend and have set up an EdgeNgramField. But when I search for the word "colored", the results also include matches for "one-colored".
class TagIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True)
    name = indexes.EdgeNgramField(model_attr='name')
    name_tr = indexes.EdgeNgramField(model_attr='name_tr')

    def get_model(self):
        return Tag
I have tried wrapping the search term in double quotes, but the results are the same.
How can I get exact matches? Do I need to write a custom search query?

Related

Regular expression match / break

I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.
What I want: To count the number of words in the 10-K only before the first instance of ‘</DOCUMENT>’
How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.
What is happening: No matter where I put my regex match break, the program counts all of the words in the entire submission. I get no error, but I cannot get the desired result.
How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, the count drops by about 750,000 words.
Current output
Note: Apparently I don't have enough SO reputation to embed a screenshot in my post; it defaults to a link.
What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe that the two functions may be performed over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only performs on the 10-K, but I have had no luck.
The code below is a snippet from a larger function. Its purpose is to (1) strip HTML tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.
The remaining code (not shown) extracts firm and report identifiers, (e.g., ‘file’ or ‘cik’) from the header section between tags ‘<SEC-HEADER>’ and ‘</SEC-HEADER>’. Using the same logic, when extracting header information, I use a regex match break logic and it works perfectly. I need help trying to understand why this same logic isn’t working when I try to count the number of words and how to correct my code. Any help is appreciated.
regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)

for line in f:
    def get_text_from_html(html: str):
        doc = lxml.html.fromstring(html)
        for table in doc.xpath('.//table'):  # optional: removes tables from HTML source code
            table.getparent().remove(table)
        for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
            for element in doc.findall(tag):
                if element.text:
                    element.text = element.text + "\n"
                else:
                    element.text = "\n"
        return doc.text_content()

    to_clean = f.read()
    clean = get_text_from_html(to_clean)
    #print(clean[:20000])

    def count_words(clean):
        words = re.findall(r"\b[a-zA-Z\'\-]+\b", clean)
        word_count = len(words)
        return word_count

    header_vars["words"] = count_words(clean)
    match = regex_end10k.search(line)  # This should do it, but it doesn't.
    if match:
        break
You don't need a regex. Just split your original string on the tag and count the words in the part before it (for the filings in the question, the tag to split on would be '</DOCUMENT>'). A simple example:
text = 'Text before <DOCUMENT> text after'
splited_text = text.split('<DOCUMENT>')
splited_text_before = splited_text[0]
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)
Output:
Text before
2
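As for why the break in the question never seems to help: it looks like f.read() inside the loop consumes the rest of the file on the first iteration, so the count is already taken over the whole submission before the regex is checked. A minimal sketch of the split-first approach, reusing the question's regexes, might look like this. The function name and the sample string are made up for illustration, and the simple tag-stripping regex is only a stand-in for the question's lxml-based get_text_from_html().

```python
import re

# Same end marker and word pattern as in the question.
END_10K = re.compile(r"</DOCUMENT>", re.IGNORECASE)
WORD = re.compile(r"\b[a-zA-Z'\-]+\b")
# Assumption: crude tag stripper standing in for get_text_from_html().
TAG = re.compile(r"<[^>]+>")


def count_words_before_end(raw: str) -> int:
    """Count words in the text that precedes the first </DOCUMENT>."""
    # Cut the submission at the first closing tag before doing anything else.
    head = END_10K.split(raw, maxsplit=1)[0]
    text = TAG.sub(" ", head)
    return len(WORD.findall(text))


# Hypothetical two-document submission.
sample = (
    "<DOCUMENT><p>annual report text</p></DOCUMENT>\n"
    "<DOCUMENT><p>exhibit one two three</p></DOCUMENT>"
)
print(count_words_before_end(sample))
```

With a real filing you would read the file once, pass the whole string to the function, and apply the HTML cleaning only to the part before the first closing tag.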

Find and replace text and wrap in "href"

I am trying to find specific words in a div (id="test") that start with "a04" (case-insensitive). I can find and replace the matched words, but I am unable to use the matched word correctly in an "href" link.
The following code correctly identifies my search criteria and works as expected, but I would like help because I do not know how to use the found word as the URL id.
var test = document.getElementById("test").innerHTML

function replacetxt(){
    var str_rep = document.getElementById("test").innerHTML.replace(/a04(\w)+/g,'TEST');
    var temp = str_rep;
    //alert(temp);
    document.getElementById("test").innerHTML = temp;
}
I would like to wrap the found word in an href, but I do not know how to use the found word as the URL id (url.com?id=found word).
Can someone help point out how to reference the found word, please?
Thanks
If you want to use your pattern with the capturing group, move the quantifier + inside the group; otherwise the group only holds the value of the last iteration.
\ba04(\w+)
\b - word boundary, to prevent the match from being part of a longer word
a04 - match literally
(\w+) - capture group 1, match a word character 1 or more times
Then you could use the first capturing group in the replacement by referring to it with $1.
If the string is a04word, you would capture word in group 1.
Your code might look like:
function replacetxt(){
    var elm = document.getElementById("test");
    if (elm) {
        // $1 is the text captured by group 1, i.e. the part after "a04"
        elm.innerHTML = elm.innerHTML.replace(/\ba04(\w+)/g,'<a href="url.com?id=$1">TEST</a>');
    }
}
replacetxt();
<div id="test">This is text a04word more text here</div>
Note that you don't have to create extra variables like var temp = str_rep;
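To make the capturing-group point concrete, here is a rough sketch of the same idea in Python rather than JavaScript. In Python's re module the group is written \1 and the whole match \g<0> in the replacement. The function name is made up, the url.com?id= target comes from the question, and the IGNORECASE flag is an assumption based on the "no case" remark.

```python
import re

# Quantifier outside the group: only the last repetition is kept.
last_only = re.search(r"a04(\w)+", "a04word").group(1)

# Quantifier inside the group: the whole suffix is captured.
PATTERN = re.compile(r"\ba04(\w+)", re.IGNORECASE)


def link_codes(html: str) -> str:
    """Wrap each a04... word in a link whose id is the captured suffix."""
    # \1 is capture group 1, \g<0> is the entire matched word.
    return PATTERN.sub(r'<a href="url.com?id=\1">\g<0></a>', html)


print(last_only)
print(link_codes("This is text a04word more text here"))
```

If the id should be the full word (a04word) rather than just the suffix, using \g<0> in the href as well would do that.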

Sitecore 7 content search Starts with function

I am working with Sitecore 7 content search.
var webIndex = ContentSearchManager.GetIndex("sitecore_web_index");
using (var context = webIndex.CreateSearchContext())
{
    var results = context.GetQueryable<SearchResultItem>().Where(i =>
        i.Content.Contains(mysearchterm));
}
Sitecore performs a contains operation on the content string. Content holds the whole content of the page, so this does not return the results I expect: for example, searching for "hr" also returns results that contain "through" in the content. I tried StartsWith, but that only matches the start of the whole content string. I tried Equals, but that matches the whole word. Is there any way to search the content for a word that starts with the search term?
Put '^' as the first character of the search phrase; it means "starts with". For example, to find all terms starting with "hr", add '^' to the search keyword like this: "^hr".
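To pin down the three behaviours the question describes, here is a small language-neutral illustration in Python. It is not Sitecore code and says nothing about how the index tokenizes content; the function names and sample strings are made up. It only shows the difference between a substring match, a match at the start of the whole string, and a match at the start of any word.

```python
import re


def contains(content: str, term: str) -> bool:
    # Substring match anywhere: "hr" is found inside "through".
    return term.lower() in content.lower()


def content_starts_with(content: str, term: str) -> bool:
    # Only true when the whole content string begins with the term.
    return content.lower().startswith(term.lower())


def word_starts_with(content: str, term: str) -> bool:
    # \b anchors the term at a word boundary, so it must begin a word.
    pattern = r"\b" + re.escape(term)
    return re.search(pattern, content, re.IGNORECASE) is not None


page_a = "We walk through the park"
page_b = "Please contact the HR team"

print(contains(page_a, "hr"), word_starts_with(page_a, "hr"))
print(content_starts_with(page_b, "hr"), word_starts_with(page_b, "hr"))
```

The third function is the behaviour the question is asking for: page_a is rejected even though it contains "hr", and page_b is accepted even though "HR" is not at the start of the content.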

Is there a way to prevent partial word matching using Sitecore Search and Lucene?

Is there a way when using Sitecore Search and Lucene to not match partial words? For example when searching for "Bos" I would like to NOT match the word "Boston". Is there a way to require the entire word to match? Here is a code snippet. I am using FieldQuery.
bool _foundHits = false;
_index = SearchManager.GetIndex("product_version_index");
using (IndexSearchContext _searchContext = _index.CreateSearchContext())
{
    QueryBase _query = new FieldQuery("title", txtProduct.Text.Trim());
    SearchHits _hits = _searchContext.Search(_query, 1000);
    ...
}
You may want to try something like this to build the query you want to run. It puts the + in (indicating a required term) and quotes the term, so it should exactly match what you're looking for, provided you pass in BooleanClause.Occur.MUST. It has worked for me.
protected BooleanQuery GetBooleanQuery(string fieldName, string term, BooleanClause.Occur occur)
{
    QueryParser parser = new QueryParser(fieldName, new StandardAnalyzer());
    BooleanQuery query = new BooleanQuery();
    query.Add(parser.Parse(term), occur);
    return query;
}
Essentially, your query ends up being parsed to +title:"Bos". You could also download Luke and play around with the query syntax in there; it's easier if you know what the syntax should be and then work backwards to see which query objects will generate it.
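If it helps to see the target syntax spelled out, here is a tiny Python sketch that only builds the query string, for example to paste into Luke. It does not call Lucene or Sitecore. The helper name is made up, and the escaping of backslashes and double quotes inside the phrase follows my reading of the Lucene query parser syntax. Whether a quoted phrase really excludes "Boston" still depends on how the field was analyzed at index time.

```python
def required_phrase(field: str, term: str) -> str:
    """Build a required, quoted term such as +title:"Bos"."""
    # Escape backslashes first, then double quotes, so the phrase stays well-formed.
    escaped = term.strip().replace("\\", "\\\\").replace('"', '\\"')
    return f'+{field}:"{escaped}"'


print(required_phrase("title", " Bos "))
```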
You have to place the query in double quotes for exact match results. Lucene supports many such operators and boolean parameters, which are documented here: http://lucene.apache.org/core/2_9_4/queryparsersyntax.html
It depends on the field type. If you have a memo or text field, then partial matching is applied. If you want exact matching, use a string field instead. You can find some details here: https://www.cmsbestpractices.com/bug-how-to-fix-solr-exact-string-matching-with-sitecore/ .

Django: How to get string from CharField object?

I have a Django model, which is essentially a list of words. It is very simple and is defined as follows:
class Word(models.Model):
    word = models.CharField(max_length=100)
I need the particular word as a string object. (I am replacing all the characters in the word with asterisks for a simple Hangman game.) However, I cannot seem to get the word as a string object.
I tried adding this method to the Word class, but that did not seem to work either.
    def __str__(self):
        return str(self.word)
I get an empty string object instead of the word.
What do I need to do to get the value of the CharField as a string object? Thanks for your help.
Edit: The weird thing for me is that it has no problem returning an integer. For example, if I were to do something like this:
word = Word(pk=2821)
print word.id
... it would print 2821, the id of that particular record.
Are you sure you're fetching the Word objects correctly? The line in your example word = Word(pk=2821) creates a new Word object with a pk of 2821 and a blank word field. If you've fetched an actual Word object from the database that has a value in its word field, then word.word should return a string. E.g.
>>> w1 = Word(pk=5, word='eggs')
>>> w1.word
'eggs'
>>> w1.save()
>>> w2 = Word.objects.get(pk=5)
>>> w2.word
u'eggs'
Can you also verify that the words are being correctly stored in your database by connecting to it with a DB client and looking at the output of:
SELECT word FROM yourappname_word LIMIT 20;
Like I said, word.word should work, so the problem might lie in how you're saving or fetching your Word objects.
