Django haystack, how to match parts of words? - django-haystack

I'm using Haystack 1.2.7 + Whoosh 2.4.0 with Django 1.4 (Python 2.7).
Example: the search query "sear" should match items containing "search", "sear", "searching", etc.
My settings:
HAYSTACK_SITECONF = 'verticalsoftware.search.search_sites'
HAYSTACK_SEARCH_ENGINE = 'whoosh'
HAYSTACK_WHOOSH_PATH = 'C:/whoosh/prodeo_index'
HAYSTACK_INCLUDE_SPELLING = True
Search index:

import datetime
from haystack import indexes  # plus the Gallery model import


class GalleryIndex(indexes.SearchIndex):
    text = indexes.CharField(document=True, use_template=True)
    content_auto = indexes.NgramField(model_attr='title')

    def index_queryset(self):
        """Used when the entire index for model is updated."""
        return Gallery.objects.filter(date_added__lte=datetime.datetime.now())
I also tried EdgeNgramField and/or RealTimeSearchIndex.
Custom URLconf:

from django.conf.urls.defaults import *
from verticalsoftware.search.views import SearchWithRequest

urlpatterns = patterns('haystack.views',
    url(r'^$', SearchWithRequest(), name='haystack_search'),
)
Custom view:

import operator

from haystack.views import SearchView
from haystack.query import SearchQuerySet, SQ


class SearchWithRequest(SearchView):
    __name__ = 'SearchWithRequest'

    def build_form(self, form_kwargs=None):
        if form_kwargs is None:
            form_kwargs = {}
        if self.searchqueryset is None:
            sqs = SearchQuerySet().filter(reduce(
                operator.or_,
                [SQ(text=word.strip()) for word in self.request.GET.get("q").split(' ')]
            ))
            form_kwargs['searchqueryset'] = sqs
        return super(SearchWithRequest, self).build_form(form_kwargs)
For sqs I've tried everything imaginable, using filter and autocomplete as shown in the docs and in every relevant forum post I could find. Using __startswith and __contains in combination with my content_auto or text field didn't help at all: the latter would not match anything, while the former only matched a single character or the complete string.
The variant pasted above at least has the benefit of returning results for strings with spaces, but each word still has to fully match the corresponding database entry, hence this post.
Any help will be IMMENSELY appreciated.

Late to the party, but I suggest changing your main document field (text) to an EdgeNgramField or NgramField. Otherwise the index cannot match word fragments; only complete-word matching is possible with a CharField.
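A minimal sketch of that change, reusing the GalleryIndex from the question (only the field types differ; NgramField works the same way here):

from haystack import indexes

class GalleryIndex(indexes.SearchIndex):
    # EdgeNgramField indexes prefixes of each token ("sea", "sear", "searc", ...),
    # so a partial query like "sear" can match "search" and "searching"
    text = indexes.EdgeNgramField(document=True, use_template=True)
    content_auto = indexes.EdgeNgramField(model_attr='title')

Remember to rebuild the index afterwards (./manage.py rebuild_index), since changing field types only takes effect on reindexing.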
Also, playing in the Django shell is sometimes useful when debugging Haystack:

./manage.py shell

>>> from haystack.query import SearchQuerySet
>>> s = SearchQuerySet()
>>> s.auto_query('sear')
>>> s.auto_query('sear').count()
...

Related

Indexing and searching in documents using Pylucene

I would like to index some documents and then search them for specific terms and retrieve the positions of those terms in the documents. I have had little success, as all the examples are in Java and, more importantly, they use an older version of Lucene that differs considerably from the current version.
This is my snippet that creates the index:

import pandas as pd
import operator
import lucene
from java.io import StringReader
from java.io import File
from org.apache.lucene.analysis.en import EnglishAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import DirectoryReader, PostingsEnum, IndexOptions, IndexWriter, IndexWriterConfig
from org.apache.lucene.store import FSDirectory, ByteBuffersDirectory
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.util import Version, BytesRefIterator

# Init
if not lucene.getVMEnv():
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])

directory = ByteBuffersDirectory()
iconfig = IndexWriterConfig(EnglishAnalyzer())
iwriter = IndexWriter(directory, iconfig)

ft = FieldType()
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
ft.setStored(True)
ft.setTokenized(True)
ft.setStoreTermVectors(True)
ft.setStoreTermVectorOffsets(True)
ft.setStoreTermVectorPositions(True)

ts = ["this bernhard is the text to be index text",
      "this claudia is the text to be indexed"]
for t in ts:
    doc = Document()
    doc.add(Field("content", t, ft))
    iwriter.addDocument(doc)
iwriter.commit()
iwriter.close()
This is the part of the code where I try to read the index back to extract the positions of a term:
from org.apache.lucene.analysis.standard import StandardAnalyzer  # missing from the original snippet
from org.apache.lucene.search.similarities import BM25Similarity  # missing from the original snippet

analyzer = StandardAnalyzer()
reader = DirectoryReader.open(directory)
searcher = IndexSearcher(DirectoryReader.open(directory))
searcher.setSimilarity(BM25Similarity(1.2, 0.75))

# note: neither sample document contains "world", so this query finds no hits;
# try an indexed term such as "text" instead
query = QueryParser('content', analyzer).parse("world")
scoreDocs = searcher.search(query, 10).scoreDocs  # search() returns a TopDocs object containing scoreDocs and totalHits
# each ScoreDoc contains a docId and a score
print('total hits:', searcher.search(query, 10).totalHits)
print("%s total matching documents" % (len(scoreDocs)))

for scoreDoc in scoreDocs:
    print(scoreDoc)
    fields = reader.getTermVectors(scoreDoc.doc)
    print('fields:', fields.terms('content'))
    fieldsIter = fields.iterator()
    terms = reader.getTermVector(scoreDoc.doc, "content")
    termsIter = terms.iterator()
    print('terms.position:', terms.hasPositions())
However, it is incomplete and I do not know how to complete the code. Any help is appreciated.
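One way to finish that loop might look like the sketch below (untested; it uses the standard Lucene TermsEnum/PostingsEnum API and assumes a matching scoreDoc as above). A term vector behaves like a single-document inverted index, so one nextDoc() call positions each PostingsEnum:

terms = reader.getTermVector(scoreDoc.doc, "content")
termsIter = terms.iterator()
term = termsIter.next()  # BytesRef, or None when the enum is exhausted
while term is not None:
    print('term:', term.utf8ToString())
    postings = termsIter.postings(None, PostingsEnum.ALL)
    postings.nextDoc()  # advance to the (only) document in the term vector
    for _ in range(postings.freq()):
        pos = postings.nextPosition()
        print('  position:', pos, 'offsets:', postings.startOffset(), '-', postings.endOffset())
    term = termsIter.next()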

Extract item for each spider in scrapy project

I have over a dozen spiders in a scrapy project, with a variety of items being extracted from different sources. I mostly have to copy the same regex code over and over again in each spider, for example:
item['element'] = re.findall('my_regex', response.text)
I use this regex to get the same element, which is defined in the scrapy items. Is there a way to avoid the copying? Where do I put this in the project so that I don't have to copy it into each spider, only adding the parts that are different?
My project structure is the default one.
Any help is appreciated, thanks in advance.
So if I understand your question correctly, you want to use the same regular expression across multiple spiders.
You can do this:
create a Python module called something like regex_to_use
inside that module, place your regular expressions.
Example:
# regex_to_use.py
regex_one = 'test'
You can access this expression in your spiders:
# spider.py
import regex_to_use
import re as regex
find_string = regex.search(regex_to_use.regex_one, ' this is a test')
print(find_string)
# output
<re.Match object; span=(11, 15), match='test'>
You could also do something like this in your regex_to_use module
# regex_to_use.py
import re as regex

class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self):
        find_xyx = regex.search('test', self._text)
        return find_xyx
and you would call it this way in your spiders:
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text()
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you have multiple regular expressions you could do something like this:
# regex_to_use.py
import re as regex

class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self, regex_to_use):
        # 'regex_one' must be the pattern 'test' for the output shown below
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
        find_xyx = regex.search(expression, self._text)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text('regex_one')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
You can also use a staticmethod in the class CustomRegularExpressions
# regex_to_use.py
import re as regex

class CustomRegularExpressions:
    @staticmethod
    def search_text(regex_to_use, text_to_search):
        # 'regex_one' must be the pattern 'test' for the output shown below
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
        find_xyx = regex.search(expression, text_to_search)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
# find_word would be replaced with item['element']
# this is a test would be replaced with response.text
find_word = CustomRegularExpressions.search_text('regex_one', 'this is a test')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you use a docstring in the function search_text(), you can document the regular expressions available in the Python dictionary. For instance:
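A small sketch of that idea (the docstring text is illustrative):

def search_text(self, regex_to_use):
    """
    :param regex_to_use: a key from the regular_expressions dictionary,
        e.g. 'regex_one' or 'regex_two'
    """
    ...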
Showing how all this works...
This is a Python project that I wrote and published. Take a look at the folder utilities. In that folder I have functions that I can use throughout my code without having to copy and paste the same code over and over.
There is a lot of common data that is useful across multiple spiders, like regexes or even XPath expressions.
It's a good idea to isolate them.
You can use a structure like this:
/project
    /site_data
        handle_responses.py
        ...
    /spiders
        your_spider.py
        ...
Isolate functionalities with a common purpose.
# handle_responses.py
# imports ...
from re import search

def get_specific_common_data(text: str):
    # it's probably a good idea to handle predictable errors here (`try except`)
    return search('your_regex', text)
And just use that functionality where it's needed:
# your_spider.py
# imports ...
import scrapy
from site_data.handle_responses import get_specific_common_data

class YourSpider(scrapy.Spider):
    # ... previous code

    def your_method(self, response):
        # ... previous code
        item['element'] = get_specific_common_data(response.text)
Try to keep it simple and do what you need to solve your problem.
I can copy the regex into multiple spiders instead of importing an object from other .py files. I understand the use case, but here I don't want to add anything to any of the spiders and still want the element in the results.
There are some good answers here, but they don't really solve the problem, so after searching for days I have come to this solution. I hope it's useful for others looking for a similar answer.
# middlewares.py
import re

from yourproject.items import YourItem

# find the process_spider_output method in the generated middleware class and
# add your element there; note that the original snippet omitted the yields,
# but a spider middleware must return an iterable of items/requests
def process_spider_output(self, response, result, spider):
    for i in result:
        yield i  # pass through whatever the spider itself yielded
    item = YourItem()
    item['element'] = re.findall('my_regex', response.text)
    yield item
Now uncomment the spider middleware in settings.py:
# settings.py
SPIDER_MIDDLEWARES = {
    'yourproject.middlewares.YoursprojectMiddleware': 543,
}
For each spider you will get the element in the result data. I am still searching for a better solution, and I will update the answer, because this approach slows the spider down.

(Groovy) Finding all Confluence spaces with a certain user group in it

As the title states, I want to iterate through my Confluence system and find all spaces that a certain user group is in.
I am able to find a user group in a single space with the code below, but I cannot seem to find an answer for how to do this with ALL spaces.
import com.atlassian.confluence.spaces.SpaceManager
import com.atlassian.sal.api.component.ComponentLocator
import com.atlassian.confluence.security.SpacePermissionManager
import com.atlassian.confluence.security.SpacePermission
import com.atlassian.user.GroupManager
import com.atlassian.confluence.core.ContentPermissionManager
import com.atlassian.confluence.internal.security.SpacePermissionContext
def spaceManager = ComponentLocator.getComponent(SpaceManager)
def spacePermissionManager = ComponentLocator.getComponent(SpacePermissionManager)
def groupManager = ComponentLocator.getComponent(GroupManager)
def targetSpace = spaceManager.getSpace("NameOfSpace")
def targetGroup = groupManager.getGroup("UserGroup")
if (spacePermissionManager.getGroupsWithPermissions(targetSpace).contains(targetGroup)) {
    //do something (in my case, remove User Group)
}
I tried it with "def allSpaces = spaceManager.getAllSpaces()" and substituted it into the getGroupsWithPermissions() method with no success.
Thanks!
Have you tried SpacePermissionManager#getAllPermissionsForGroup? I haven't tested this with Groovy, but at least in Java it returns a list of the space permissions attached to the user group.
Not all SpacePermissions will be attached to actual spaces, so you will most likely need to loop through the list and filter for results where getSpace() is not null, as in the sketch below.
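A rough Groovy sketch of that approach (untested; it assumes the getAllPermissionsForGroup method mentioned above and the "UserGroup" name from the question):

import com.atlassian.sal.api.component.ComponentLocator
import com.atlassian.confluence.security.SpacePermissionManager

def spacePermissionManager = ComponentLocator.getComponent(SpacePermissionManager)

// collect the distinct, non-null spaces referenced by the group's permissions
def permissions = spacePermissionManager.getAllPermissionsForGroup("UserGroup")
def spacesWithGroup = permissions*.space.findAll { it != null }.unique()

spacesWithGroup.each { space ->
    // do something with each space (in your case, remove the user group)
}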

Twitter API, Searching with dollar signs

This code opens a Twitter listener, and the search terms are in the variable upgrades_str. Some searches work and some don't. I added AMZN to the upgrades list just to be sure there's a frequently used term, since this uses an open Twitter stream rather than searching existing tweets.
Below, I think we only need to review numbers 2 and 4.
I'm using Python 3.5.2 :: Anaconda 4.0.0 (64-bit) on Windows 10.
Variable searches
Searching with: upgrades_str: ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] = returns tweets such as 'i'm tired of people'
Searching with: upgrades_str: ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] = returns tweets such as 'Chicago to south Florida. Hiphop lives'. This search is the one I wish worked.
Explicit searches
Searching by replacing the variable 'upgrades_str' with the explicit string: ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] = returns 'After being walked in on twice, I have finally figured out how to lock the door here in Sweden'. This one at least has the search term 'door'.
Searching by replacing the variable 'upgrades_str' with the explicit string: ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] = returns '$AMZN $WFM $KR $REG $KIM: Amazon’s Whole Foods buy puts shopping centers at risk as real'. So the explicit call works, but not the identical variable.
Explicitly searching for ['$AMZN'] = returns a good tweet: 'FANG setting up really good for next week! Added $googl jun23 970c avg at 4.36. $FB $AMZN'.
Explicitly searching for ['cool'] returns 'I can’t believe I got such a cool Pillow!'
import tweepy
import dataset
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError
import json

db = dataset.connect('sqlite:///tweets.db')


class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        if status.retweeted:
            return
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text)
        sent = blob.sentiment
        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)
        table = db['tweets']
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                polarity=sent.polarity,
                subjectivity=sent.subjectivity,
            ))
        except ProgrammingError as err:
            print(err)

    def on_error(self, status_code):
        if status_code == 420:
            return False


access_token = 'token'
access_token_secret = 'tokensecret'
consumer_key = 'consumerkey'
consumer_secret = 'consumersecret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=upgrades_str, languages=['en'])
Here's the answer, in case someone has this problem in the future: "Note that punctuation is not considered to be part of a #hashtag or @mention, so a track term containing punctuation will not match either #hashtags or @mentions." From: https://dev.twitter.com/streaming/overview/request-parameters#track
And for multiple terms, the string, which was converted from a list, needs to be changed to the form ['term1,term2']. Just strip out the apostrophes and spaces:
upgrades_str = re.sub('[\' \[\]]', '', upgrades_str)
upgrades_str = '[\''+format(upgrades_str)+'\']'
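If upgrades_str still exists as a real Python list somewhere upstream, a simpler sketch (assuming the list from the question) is to pass it to filter() directly, since tweepy joins the track terms with commas for the streaming API:

upgrades = ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR']
stream.filter(track=upgrades, languages=['en'])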

Alternatives to string.replace() method that allows for multiple sub-string search and replace

Python newbie here. I'm trying to search a video API which, for some reason, won't allow me to search video titles with certain characters in them, such as : or |.
Currently, I have a function which calls the video API library and searches by title, which looks like this:
def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    bugFixVidName = vidName.replace(":", "")
    search_url = 'http://cdn-api.ooyala.com/v2/syndications/49882e719/feed?pcode=1xeGMxOt7GBjZPp2'.format(bugFixVidName)  # this URL is altered to protect privacy for this post
Is there an alternative to .replace() (or a way to use it that I'm missing) that would let me search for more than one sub-string at the same time?
Take a look at the Python re module, specifically at the method re.sub().
Here's an example for your case:

import re

def videoNameExists(vidName):
    vidName = vidName.encode("utf-8")
    # bugFixVidName = vidName.replace(":", "")
    bugFixVidName = re.sub(r'[:|]', "", vidName)
    search_url = 'http://cdn-api.ooyala.com/v2/syndications/49882e719/feed?pcode=1xeGMxOt7GBjZPp2'.format(bugFixVidName)  # this URL is altered to protect privacy for this post
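If you'd rather avoid a regular expression, a small alternative sketch (Python 3, assuming vidName is a str and skipping the encode step) deletes several characters in one pass with str.translate:

# remove ':' and '|' in a single pass, no regex needed
bugFixVidName = vidName.translate(str.maketrans('', '', ':|'))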
