Numbers and special characters in search strings in Haystack+Whoosh - django-haystack

I'm trying to understand how Haystack search works.
I have a model Order with a field Order.no, where the order number is stored in the form 'ABC/2013/11/1', 'ABC/2013/11/2', ...
I want to implement autocomplete on this field using Haystack with the Whoosh backend (django-haystack 2.1.0, celery-haystack 0.7.2, Whoosh 2.5.5, Django 1.6). My search_index.py looks like this:
from celery_haystack.indexes import CelerySearchIndex
from haystack import indexes

class OrderIndex(CelerySearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = indexes.EdgeNgramField(model_attr='name')

    def get_model(self):
        return Order
When I try
SearchQuerySet().autocomplete(name_auto='ABC/2013')
I receive both ABC/2013/11/1 and ABC/2013/11/2, which is fine.
When I try
SearchQuerySet().autocomplete(name_auto='ABC/2013/11')
I still receive both ABC/2013/11/1 and ABC/2013/11/2, which is also fine,
but when I try
SearchQuerySet().autocomplete(name_auto='ABC/2013/11/1')
I still receive both ABC/2013/11/1 and ABC/2013/11/2, and I don't understand why.
I also noticed that when I change the number format for the whole project to '1/ABC/2013/10', ..., a query like
SearchQuerySet().autocomplete(name_auto='1/')
doesn't return any results, and a query like
SearchQuerySet().autocomplete(name_auto='1/ABC')
returns both '1/ABC/2013/10' and '2/ABC/2013/10'.
Maybe I'm missing something related to numbers and/or special characters in Haystack queries and search strings. Thanks for any help.

The reason is actually twofold. First, the forward slash ("/") is a reserved character in Whoosh, so it is ignored. Second, Whoosh also ignores single-character search terms.
So your query,
'ABC/2013/11/1'
once stripped of the slashes,
'ABC 2013 11 1'
and then of single characters,
'ABC 2013 11'
looks exactly as if you were searching for
'ABC/2013/11' -> 'ABC 2013 11'
Funny that the documentation seems to be mum about this strange behavior.
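The two rules described above can be simulated in plain Python to see why the queries collapse into the same terms. This is only a sketch of the behavior as the answer describes it, not Whoosh's actual analyzer; the function name effective_terms and the min_length parameter are hypothetical.

```python
def effective_terms(query, min_length=2):
    """Simulate the described behavior: split on "/" and drop short terms.

    Hypothetical helper for illustration only; it does not call Whoosh.
    """
    # Rule 1: the slash acts as a separator and is discarded.
    tokens = [token for token in query.split("/") if token]
    # Rule 2: terms shorter than min_length characters are ignored.
    return [token.lower() for token in tokens if len(token) >= min_length]


for query in ("ABC/2013/11", "ABC/2013/11/1", "1/", "1/ABC"):
    print(query, "->", effective_terms(query))
```

Under these assumptions 'ABC/2013/11/1' and 'ABC/2013/11' reduce to the same three terms, '1/' reduces to nothing (hence no results), and '1/ABC' reduces to 'abc' alone, which would match both '1/ABC/2013/10' and '2/ABC/2013/10'.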

Related

How to get a substring with Regex in Python

I am trying to formulate a regex to get the IDs from the two example strings below:
/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details
/drugs/2/drug-19906/magnesium-moxide-tablet/details
In the first case, I should get 19904-5106 and in the second case 19906.
So far I have tried several patterns; the closest I could get is [drugs/2/drug]-.*\d, but it returns g-19904-5106 and g-19906.
Could anyone help me get rid of the "g-"?
Thank you in advance.
When writing a regular expression, consider the patterns you see in the data so that you can align the expression with them. For example, if you know that your desired IDs always appear in something resembling ABCD-1234-5678, where 1234-5678 is the ID you want, then you can use that. If you also know that your IDs are always digits, then you can refine the search even more.
For your example, using a regex string like
.+?-(\d+(?:-\d+)*)
should do the trick. In a Python script that would look something like the following:
import re

match = re.search(r'.+?-(\d+(?:-\d+)*)', my_string)
if match:
    my_id = match.group(1)
The pattern may vary depending on the depth and complexity of your examples, but that works for both of the ones you provided.
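As a quick check, here is the same pattern applied to both example strings from the question. The helper name extract_id is made up for this sketch. The last line reproduces the original attempt to show where the stray "g" comes from: [drugs/2/drug] is a character class that matches a single character, not the literal text.

```python
import re

# Pattern from the answer: skip lazily to the first hyphen that is followed by digits.
ID_PATTERN = re.compile(r'.+?-(\d+(?:-\d+)*)')


def extract_id(path):
    """Return the numeric ID (possibly hyphenated), or None if nothing matches."""
    match = ID_PATTERN.search(path)
    return match.group(1) if match else None


paths = [
    '/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details',
    '/drugs/2/drug-19906/magnesium-moxide-tablet/details',
]
for path in paths:
    print(extract_id(path))  # 19904-5106, then 19906

# The original attempt: one character out of d, r, u, g, s, /, 2, then "-".
print(re.search(r'[drugs/2/drug]-.*\d', paths[0]).group())  # g-19904-5106
```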
This is the closest I could find: \d+|.\d+-.\d+

Python: getting the youtube author image from the video link

Hello, I am trying to scrape the author image URL from a given video link using the urllib module, but because the URL length varies, my match runs on into other properties such as the width and height:
https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj","width":400,"height"
instead of this author image link, which is what I want:
https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw=s400-c-k-c0x00ffffff-no-rj
The code I have so far:
import re
import urllib.request

def get_author(uri):
    html = urllib.request.urlopen(uri)
    author_image = re.findall(r'yt3.ggpht.com/(\S{99})', html.read().decode())
    return f"https://yt3.ggpht.com/{author_image[1]}"
Sorry for my bad English, thanks in advance =)
If you are not sure about the length of the match, do not hardcode the amount of chars to be matched. {99} is not going to work with arbitrary strings.
Besides, you want to match the string in a mark-up text and you need to be sure you only match until the delimiting char. If it is a " char, then match until that character.
Also, dots in regex are special and you need to escape them to match literal dots.
Besides, findall is used to match all occurrences; you can use re.search to get only the first one and free up some resources.
So, a fix could look like
import re
import urllib.request

def get_author(uri):
    html = urllib.request.urlopen(uri)
    # Escape the dots and match up to (not including) the closing quote
    author_image = re.search(r'yt3\.ggpht\.com/[^"]+', html.read().decode())
    if author_image:
        return f"https://{author_image.group()}"
    return None  # you will need to handle this condition in your later code
Here, author_image is the regex match data object, and if it matches, you need to prepend the match value (author_image.group()) with https:// and return the value, else, you need to return some default value to check later in the code (here, None).
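To try the pattern without a network request, it can be run against the fragment quoted in the question. The helper name extract_author_image and the SAMPLE constant are made up for this sketch; the regex is the one from the fix above.

```python
import re

# Fragment copied from the question: the URL is followed by a closing quote
# and further JSON properties.
SAMPLE = (
    'https://yt3.ggpht.com/ytc/AKedOLS-Bwwebj7zfYDDo43sYPxD8LN7q4Lq4EvqfyoDbw'
    '=s400-c-k-c0x00ffffff-no-rj","width":400,"height"'
)


def extract_author_image(text):
    """Return the first yt3.ggpht.com URL in text, or None if there is none."""
    # [^"]+ stops at the first double quote, whatever the URL length is.
    match = re.search(r'yt3\.ggpht\.com/[^"]+', text)
    return f"https://{match.group()}" if match else None


print(extract_author_image(SAMPLE))
```

Because the match ends at the delimiting quote rather than after a fixed number of characters, the trailing "width" and "height" properties are no longer captured.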

Way to find a number at the end of a string in Smalltalk

I have different commands my program is reading in (i.e., print, count, min, max, etc.). These words can also include a number at the end of them (i.e., print3, count1, min2, max6, etc.). I'm trying to figure out a way to extract the command and the number so that I can use both in my code.
I'm struggling to figure out a way to find the last element in the string in order to extract it, in Smalltalk.
You didn't say which incarnation of Smalltalk you use, so I will explain what I would do in Pharo, which is the one I'm familiar with.
As someone who has been playing with Pharo for a few months at most, I can tell you the sheer number of classes and methods available can feel overwhelming at first, but the environment actually makes it easy to find things. For example, when you know the exact input and output you want, but don't know whether a method already exists somewhere, or what it is called, the Finder lets you search by giving an example. You can open it from the World menu.
By default it looks for selectors (method names) matching your input terms.
But this default is not what we need right now, so you must change the option in the upper right box to "Examples", and type in the search field an example of the input, followed by the output you want, both separated by a ".". The input example I used was the string 'max6', followed by the desired result, the number 6. Pharo then gives me a list of methods that match.
To find what would return the text part, you can make a new search, changing the example output from the number 6 to the string 'max'.
Fortunately, there are several built-in methods matching the description of your problem.
There are more elegant ways, I suppose, but you can make use of the fact that String>>#asNumber only parses the part it can recognize. So you can do
'print31' reversed asNumber asString reversed asNumber
to give you 31. That only works if there actually is a number at the end.
This is one of those cases where we can presume the input data has a specific form, i.e., the only digits appear at the end of the string, and you want all of them. In that case it's not too hard to do, really, just:
numText := 'Kalahari78' select: [ :each | each isDigit ].
num := numText asInteger. "78"
To get the rest of the string without the digits, you can just use this:
'Kalahari78' withoutTrailingDigits. "Kalahari"
As some of the Pharo "OGs" pointed out, you can take a look at the String class (just type CMD-Return, type in String, hit Return) and you will find an amazing number of methods for all kinds of things. Usually you can get some ideas from those. But then there are times when you really just need an answer!

Questions regarding Python replace specific texts

I'm writing a script to scrape another website with Python, and I have run into a problem that I have not yet figured out how to solve.
Say I have set this particular string to be replaced with something else.
word_replace_1 = 'dv'
namelist = soup.title.string.replace(word_replace_1,'11dv')
The script works fine when the titles are dv234, dv123, etc.
The output will be 11dv234, 11dv123.
However, if the titles are dv234 mixed with dvab123, then even though I did not set dvab to be replaced with anything, the script replaces it with 11dvab123. What should I do here?
Also, if the title is a combination of letters, numbers and Korean characters, say DAV123ㄱㄴㄷ,
how exactly should I make it output only DAV123, and add a - between the letters and the numbers?
Python - making a function that would add "-" between letters
This gives me the idea of adding - between all characters, but is there a way to add - only between a letter and a number?
The only way I can think of at the moment is creating a table of replacements, for example something like this
word_replace_3 = 'a1'
word_replace_4 = 'a2'
.......
and then printing them out as
namelist3 = soup.title.string.replace(word_replace_3,'a-1').replace(word_replace_4,'a-2')
This is just slow and inefficient. What would be the best way to solve this?
Thanks.
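One possible approach to both parts of the question is to replace the chained str.replace calls with regular expressions. The sketch below assumes that 'dv' should only be rewritten when it starts a word and is followed directly by a digit, and that the wanted part of a title is a leading run of ASCII letters followed by digits. The function names prefix_dv and hyphenate are made up for illustration.

```python
import re


def prefix_dv(title):
    """Turn dv234 into 11dv234 but leave dvab123 alone."""
    # \b: 'dv' must start a word; (?=\d): it must be followed by a digit.
    return re.sub(r'\bdv(?=\d)', '11dv', title)


def hyphenate(title):
    """Keep the leading letters+digits of the title and join them with '-'."""
    # Anything after the digits (e.g. Korean characters) is dropped.
    match = re.match(r'([A-Za-z]+)(\d+)', title)
    return f"{match.group(1)}-{match.group(2)}" if match else title


print(prefix_dv('dv234'))
print(prefix_dv('dvab123'))
print(hyphenate('DAV123ㄱㄴㄷ'))
```

A single pattern covers every letter and digit combination, so no table of a1, a2, ... replacements is needed.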

In Sphinx Search, how do I add "hashtag" to the charset_table?

I would like people to be able to search #photography as well as photography. Those should be treated as two different words in Sphinx. By default, #photography maps to photography, and I can't search for hashtags.
I read on this page that you can add the hash tag to the charset_table to accomplish this. I am completely clueless on how to do that. I don't know unicode, and I don't know what my charset_table should be.
Can someone tell me what my charset_table should be? Thanks.
# charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
Note: I plan on using real-time index. (not sure if this makes a difference)
It's U+0023 according to the Unicode table. So the final config should be like
charset_table = 0..9, A..Z->a..z, _, a..z, U+23, U+410..U+42F->U+430..U+44F, U+430..U+44F
Don't forget about the charset_type variable. AFAIK, this example charset_table is for utf-8. Besides this, you should remove U+23 from the blend_chars variable to allow Sphinx to index it as a legitimate character.
Good day.
I think this is more of a workaround for your problem, but:
It's a bad idea to pass the user's query straight to the search function.
Before calling the search function in the Sphinx engine, you should do some processing on the user's string.
For example, you can check the user's string for special characters and remove them from the query. After that, you can call the search function with the processed query.
Good luck.
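A minimal sketch of the preprocessing step suggested above, written in Python. It assumes that word characters and whitespace are safe to pass on and that '#' should be kept so hashtags stay searchable; the function name sanitize_query and the keep parameter are hypothetical, and nothing here calls the Sphinx API.

```python
import re


def sanitize_query(raw, keep="#"):
    """Strip special characters from a user query before it is sent to search."""
    # Replace everything except word characters, whitespace and the characters
    # listed in keep with a space, then collapse runs of whitespace.
    pattern = rf"[^\w\s{re.escape(keep)}]"
    cleaned = re.sub(pattern, " ", raw)
    return " ".join(cleaned.split())


print(sanitize_query("  #photography!! (sunset) @home "))
```

The cleaned string is what would then be handed to the search call.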
