Django-haystack not updating index

Using django-haystack 2.0.0 and xapian-haystack 2.0.0; I migrated all code from 1.1.5 as described in the docs.
Now my search_indexes.py looks like:
from haystack import indexes
from app.models import Post

class PostIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return Post

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(visible=True)
But when I run rebuild_index, it says:
Are you sure you wish to continue? [y/N] y
Removing all documents from your index because you said so. All
documents removed.
With verbosity:
Skipping '<class 'django.contrib.auth.models.Permission'>' - no index.
Skipping '<class 'django.contrib.auth.models.Group'>' - no index.
...
Skipping '<class 'app.models.Post'>' - no index.
So, I don't know why haystack doesn't index this model.

You have to actually add fields to your index, so under "text" add:
    post1 = indexes.CharField(model_attr='postfield1', null=True)
And then in your post_text.txt index template (object here is the Post instance, so reference the model attribute):
{{ object.postfield1 }}
Now it should work.

You have not added the fields you want to search on in your search_indexes.py file. You need something like:
class PostIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    data = indexes.CharField(model_attr='data', null=True)

    def get_model(self):
        return Post

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(visible=True)
and then create a directory structure like templates/search/indexes/<app_name>/post_text.txt. Then run ./manage.py rebuild_index.
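For reference, a minimal post_text.txt template just renders the model attributes you want searchable; assuming the data attribute from the index above, it could be as simple as:

{{ object.data }}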


How to get/import the Scrapy item list from items.py to pipelines.py?

In my items.py:
class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()
In my pipelines.py:
import sqlite3
from scrapy.conf import settings

con = None

class DbPipeline(object):
    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()

    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()
        ...

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using:
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item.keys())))
        ...
How can I get the item field names (like "AdId", etc.) from items.py before process_item() (in pipelines.py) is executed?
I use scrapy runspider myspider.py for execution.
I already tried adding "item" and/or "spider" as arguments, like def setupDBCon(self, item), but that didn't work and resulted in:
TypeError: setupDBCon() missing 1 required positional argument: 'item'
UPDATE: 2018-10-08
Result (A):
Partially following the solution from @granitosaurus, I found that I can get the item keys as a list by:
Adding (a): from adbot.items import NewAdsItem to my main spider code.
Adding (b): ikeys = NewAdsItem.fields.keys() within the spider class above.
I could then access the keys from my pipelines.py via:
def open_spider(self, spider):
    self.ikeys = list(spider.ikeys)
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys))
    #self.createDbTable(ikeys)
However, there were 2 problems with this method:
I was not able to get the ikeys list into createDbTable(). (I kept getting errors about missing arguments here and there.)
The ikeys list (as retrieved) was rearranged and did not keep the order of the items as they appear in items.py, which partially defeated the purpose. I still don't understand why they are out of order, when all the docs say that Python 3 should preserve the order of dicts. At the same time, when using process_item() and getting the items via item.keys(), their order remains intact.
Result (B):
At the end of the day, it turned out to be too laborious and complicated to fix (A), so I just imported the relevant items.py class into my pipelines.py and used the item list as a global variable, like this:
def createDbTable(self):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in createDbTable: \t%s" % ",".join(self.ikeys))
...
In this case I just decided to accept that the list obtained seems to be alphabetically sorted, and worked around the issue by just changing the key names. (Cheating!)
This is disappointing, because the code is ugly and contorted.
Any better suggestions would be much appreciated.
Scrapy pipelines have 3 connected methods:
process_item(self, item, spider)
This method is called for every item pipeline component.
process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception. Dropped items are no longer processed by further pipeline components.
open_spider(self, spider)
This method is called when the spider is opened.
close_spider(self, spider)
This method is called when the spider is closed.
https://doc.scrapy.org/en/latest/topics/item-pipeline.html
So you can only get access to the item in the process_item method.
If you want to get the item class, however, you can attach it to the spider class:
class MySpider(Spider):
    item_cls = MyItem

class MyPipeline:
    def open_spider(self, spider):
        fields = spider.item_cls.fields
        # fields is a dictionary of key: default value
        self.setup_table(fields)
Alternatively, you can lazy-load it during the process_item method itself:
class MyPipeline:
    item = None

    def process_item(self, item, spider):
        if not self.item:
            self.item = item
            self.setup_table(item)
        return item
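For completeness, a minimal sketch of what a setup_table helper might do with those fields, assuming a SQLite backend and a hypothetical ads table and database file (neither is part of the original answer):

import sqlite3

class SqlitePipeline:
    def open_spider(self, spider):
        self.con = sqlite3.connect("ads.db")  # hypothetical database file
        self.cur = self.con.cursor()
        self.setup_table(spider.item_cls.fields)

    def setup_table(self, fields):
        # One TEXT column per declared item field.
        cols = ", ".join("{} TEXT".format(name) for name in fields)
        self.cur.execute("CREATE TABLE IF NOT EXISTS ads ({})".format(cols))

    def close_spider(self, spider):
        self.con.commit()
        self.con.close()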

csv file process in Python

I am working with CSV data as follows:
ticker,exchange_country,company_name,price,exchange_rate,shares_outstanding,net_income
1,HK,CK HUTCHISON HOLDINGS LTD,1.404816984,7.757949829,3859.677979,31633
2,HK,CLP HOLDINGS LTD,1.312602194,7.757949829,2526.450928,16319
3,HK,HONG KONG & CHINA GAS CO LTD,0.234939214,7.757949829,12717.04199,7546.200195
11,HK,HANG SENG BANK LTD,2.198193203,7.757949829,1911.843018,15451
I have a StockStatRecord class:
class StockStatRecord:
    def __init__(self, stock_load):
        self.name = stock_load[0]
        self.company_name = stock_load[2]
        self.exchange_country = stock_load[1]
        self.price = stock_load[3]
        self.exchange_rate = stock_load[4]
        self.shares_outstanding = stock_load[5]
        self.net_income = stock_load[6]
How am I supposed to create another class that extracts the data from that CSV, parses it, creates a new record, and returns the record created? This class also needs to validate the rows while reading: validation should fail for any row that is missing a piece of information, has an empty name (symbol or player name), or has a number (int or float) that cannot be parsed (watch out for division by zero).
There are several ways of doing this: either rolling your own code, or using a Python module made for verifying data schemas, like Colander, or the extended CSV reader in Pandas (as Zwinck posted in the comment above).
What is usually not needed is a separate class to check values: you can do that on the same class, or have a base class that implements the data-validation mechanisms and then just add extra information on each field in the actual data classes. And finally, if you need to process data and return an object, there is no need for a class: in Python you can have functions independent of classes, so there is no need to hammer every piece of code into a class.
One simple approach is to (1) use Python's csv.DictReader instead of csv.reader to read the rows; that way each piece of data is already bound to its column name, as a dict, instead of a list where you have to manually track the column numbers. Then (2) set a property for each column that needs validation, so that fields are validated on assignment, and (3) write an __init__ method that simply assigns all fields to their respective attributes:
class StockStatRecord:
    def __init__(self, row):
        for key, value in row.items():
            setattr(self, key, value)

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        if not value:  # example verification for empty name
            raise ValueError
        self._name = value

    # continue for other fields

import csv

reader = csv.DictReader(open("mydatafile.csv"))
all_records = []
for row in reader:
    try:
        all_records.append(StockStatRecord(row))
    except ValueError:
        print("Some error at record: {}".format(row))

Filling ChoiceBlock with snippet data

I have a snippet for country codes and I want to define localized country names on the root pages for each localized site.
The snippet looks like this:
@register_snippet
class Country(models.Model):
    iso_code = models.CharField(max_length=2, unique=True)

    panels = [
        FieldPanel('iso_code'),
    ]

    def get_iso_codes():
        try:
            countries = Country.objects.all()
            result = []
            for country in countries:
                result.append((country.iso_code, country.iso_code))
            return result
        except Country.DoesNotExist:
            return []
Now I want to call the function get_iso_codes when creating a ChoiceBlock and fill the choices from the snippet. The block looks like this:
class CountryLocalizedBlock(blocks.StructBlock):
    iso_code = blocks.ChoiceBlock(choices=Country.get_iso_codes(), unique=True)
    localized_name = blocks.CharBlock(required=True)
However, when calling manage.py makemigrations I get the following error:
psycopg2.ProgrammingError: relation "home_country" does not exist
LINE 1: ..."."iso_code", "home_country"."sample_number" FROM "home_coun...
I can bypass this by commenting out Country.objects.all(), running makemigrations, and then adding the line back afterwards, but I would prefer a solution that does not require this workaround. (It also fails when I run manage.py collectstatic when building before deployment, and I don't know how to work around that, so I'm stuck.)
I found a solution based on "Wagtail, how do I populate the choices in a ChoiceBlock from a different model?".
The Country class remains untouched (except that the get_iso_codes method is now superfluous). I've just extended ChooserBlock and use Country as my target_model:
class CountryChooserBlock(blocks.ChooserBlock):
    target_model = Country
    widget = forms.Select

    def value_for_form(self, value):
        if isinstance(value, self.target_model):
            return value.pk
        else:
            return value
And used the CountryChooserBlock instead of the ChoiceBlock:
class CountryLocalizedBlock(blocks.StructBlock):
    iso_code = CountryChooserBlock(unique=True)
    localized_name = blocks.CharBlock(required=True)
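As a side note, the migration-time query happens because choices=Country.get_iso_codes() is evaluated at import time. If you prefer to keep a ChoiceBlock, passing the callable itself should defer the query until the form is rendered; Django form fields accept callables for choices, and I believe Wagtail's ChoiceBlock does too, but treat this as an untested sketch:

class CountryLocalizedBlock(blocks.StructBlock):
    # No parentheses: the callable is only invoked when the choices are needed.
    iso_code = blocks.ChoiceBlock(choices=Country.get_iso_codes)
    localized_name = blocks.CharBlock(required=True)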

Mongoengine Link to Existing Collection

I'm working with Flask/Mongoengine-MongoDB for my latest web application.
I'm familiar with Pymongo, but I'm new to object-document mappers like Mongoengine.
I have a database and collection set up already, and I basically just want to query it and return the corresponding object. Here's a look at my models.py...
from app import db

# ----------------------------------------
# Taking steps towards a working backend.
# ----------------------------------------

class Property(db.Document):
    # Document variables.
    total_annual_rates = db.IntField()
    land_value = db.IntField()
    land_area = db.IntField()
    assessment_number = db.StringField(max_length=255, required=True)
    address = db.StringField(max_length=255, required=True)
    current_capital_value = db.IntField()
    valuation_as_at_date = db.StringField(max_length=255, required=True)
    legal_description = db.StringField(max_length=255, required=True)
    capital_value = db.IntField()
    annual_value = db.StringField(max_length=255, required=True)
    certificate_of_title_number = db.StringField(max_length=255, required=True)

    def __repr__(self):
        return self.address

    def get_property_from_db(self, query_string):
        if not query_string:
            raise ValueError()
        # Ultra-simple search for the moment.
        properties_found = Property.objects(address=query_string)
        return properties_found[0]
The error I get is as follows: IndexError: no such item for Cursor instance
This makes complete sense, since the object isn't pointing at any collection. Despite trawling through the docs for a while, I still have no idea how to do this.
Do any of you know how I could appropriately link up my Property class to my already extant database and collection?
Linking a class to an existing collection can be accomplished using meta:
class Person(db.DynamicDocument):
    # Meta variables.
    meta = {
        'collection': 'properties'
    }

    # Document variables.
    name = db.StringField()
    age = db.IntField()
Then, when using the class object, one can actually make use of this functionality as might be expected with MongoEngine:
desired_documents = Person.objects(name="John Smith")
john = desired_documents[0]
Or something similar :) Hope this helps!
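Applied to the Property class from the question, that would presumably look like:

class Property(db.Document):
    meta = {
        'collection': 'properties'  # name of the existing collection
    }
    # ... document fields as before ...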
I was googling this same question and I noticed that the answer has changed since the previous answer was posted.
According to the latest MongoEngine guide:
If you need to change the name of the collection (e.g. to use MongoEngine with an existing database), then create a class dictionary attribute called meta on your document, and set collection to the name of the collection that you want your document class to use:
class Page(Document):
    meta = {'collection': 'cmsPage'}
The quoted code did the trick and I could use my data instantly.

Django haystack, how to match parts of words?

I'm using haystack 1.2.7 + whoosh 2.4.0 in Django 1.4 (Python 2.7).
Example: Search query "sear" should match items containing "search" and "sear" and "searching" (etc).
my settings:
HAYSTACK_SITECONF = 'verticalsoftware.search.search_sites'
HAYSTACK_SEARCH_ENGINE = 'whoosh'
HAYSTACK_WHOOSH_PATH = 'C:/whoosh/prodeo_index'
HAYSTACK_INCLUDE_SPELLING = True
search index:
class GalleryIndex(SearchIndex):
    text = indexes.CharField(document=True, use_template=True)
    content_auto = indexes.NgramField(model_attr='title')

    def index_queryset(self):
        """Used when the entire index for model is updated."""
        return Gallery.objects.filter(date_added__lte=datetime.datetime.now())
I also tried EdgeNgramField and/or RealTimeSearchIndex.
custom urlCONF:
from django.conf.urls.defaults import *
from verticalsoftware.search.views import SearchWithRequest

urlpatterns = patterns('haystack.views',
    url(r'^$', SearchWithRequest(), name='haystack_search'),
)
custom view:
from haystack.views import SearchView
import operator
from haystack.query import SearchQuerySet, SQ

class SearchWithRequest(SearchView):
    __name__ = 'SearchWithRequest'

    def build_form(self, form_kwargs=None):
        if form_kwargs is None:
            form_kwargs = {}
        if self.searchqueryset is None:
            sqs = SearchQuerySet().filter(reduce(operator.__or__, [SQ(text=word.strip()) for word in self.request.GET.get("q").split(' ')]))
            form_kwargs['searchqueryset'] = sqs
        return super(SearchWithRequest, self).build_form(form_kwargs)
For sqs I've tried everything imaginable, using filter and autocomplete as seen in the docs and in every relevant forum post I could find; using __startswith and __contains in combination with my content_auto or text field didn't help at all (the latter would not match anything, while the former only matched 1 character or the complete string).
The variant pasted above at least has the benefit of returning results for strings with spaces (but each word still has to fully match the corresponding database entry, hence this post).
any help will be IMMENSELY appreciated
Late to the party, but I suggest changing your main document field (text) to an EdgeNgramField or NgramField; otherwise the index cannot match word fragments. Only complete word matching is possible with a CharField.
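For the index above, that change would look something like this (an untested sketch; remember to run ./manage.py rebuild_index afterwards, since the field type affects how the index is built):

class GalleryIndex(SearchIndex):
    # EdgeNgramField indexes prefixes of each word, so "sear" matches "search".
    text = indexes.EdgeNgramField(document=True, use_template=True)
    content_auto = indexes.EdgeNgramField(model_attr='title')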
Also, playing in the Django shell is sometimes useful when debugging haystack:
./manage.py shell
from haystack.query import SearchQuerySet
s = SearchQuerySet()
s.auto_query('sear')
s.auto_query('sear').count()
...
