How to get/import the Scrapy item list from items.py to pipelines.py? - python-3.x

In my items.py:
class NewAdsItem(Item):
    AdId = Field()
    DateR = Field()
    AdURL = Field()
In my pipelines.py:
import sqlite3
from scrapy.conf import settings
con = None
class DbPipeline(object):
    def __init__(self):
        self.setupDBCon()
        self.createTables()
    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()
    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()
        ...
    def process_item(self, item, spider):
        self.storeInDb(item)
        return item
    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using:
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item.keys())) )
        ...
How can I get the item list names (like "AdId" etc) from items.py, before process_item() (in pipelines.py) is executed?
I use scrapy runspider myspider.py for execution.
I already tried to add "item" and/or "spider" like this def setupDBCon(self, item), but that didn't work, and resulted in:
TypeError: setupDBCon() missing 1 required positional argument: 'item'
UPDATE: 2018-10-08
Result (A):
Partially following the solution from @granitosaurus, I found that I can get the item keys as a list by:
Adding (a): from adbot.items import NewAdsItem to my main spider code.
Adding (b): ikeys = NewAdsItem.fields.keys() inside the spider class.
I could then access the keys from my pipelines.py via:
def open_spider(self, spider):
    self.ikeys = list(spider.ikeys)
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys) )
    #self.createDbTable(ikeys)
However, there were 2 problems with this method:
1) I was not able to pass the ikeys list into createDbTable(). (I kept getting errors about missing arguments here and there.)
2) The ikeys list (as retrieved) was rearranged and did not keep the order of the items as they appear in items.py, which partially defeated the purpose. I still don't understand why these are out of order, when all the docs say that Python 3 should keep the order of dicts and lists. At the same time, when using process_item() and getting the items via item.keys(), their order remains intact.
Result (B):
At the end of the day, it turned out to be too laborious and complicated to fix (A), so I just imported the relevant items.py class into my pipelines.py and used the item list as a global variable, like this:
def createDbTable(self):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in createDbTable: \t%s" % ",".join(self.ikeys) )
    ...
In this case I just decided to accept that the list obtained seems to be alphabetically sorted, and worked around the issue by changing the key names. (Cheating!)
This is disappointing, because the code is ugly and contorted.
Any better suggestions would be much appreciated.

Scrapy pipelines have 3 connected methods:
process_item(self, item, spider)
This method is called for every item pipeline component.
process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception. Dropped items are no longer processed by further pipeline components.
open_spider(self, spider)
This method is called when the spider is opened.
close_spider(self, spider)
This method is called when the spider is closed.
https://doc.scrapy.org/en/latest/topics/item-pipeline.html
So the item itself is only available inside the process_item method.
If you want the item class earlier, however, you can attach it to the spider class:
class MySpider(Spider):
    item_cls = MyItem
class MyPipeline:
    def open_spider(self, spider):
        fields = spider.item_cls.fields
        # fields is a dictionary mapping each field name to its Field object
        self.setup_table(fields)
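As a rough sketch of what setup_table() might do with those field names, the snippet below builds a CREATE TABLE statement in open_spider(). It deliberately avoids importing Scrapy so that it runs on its own: FakeItem and FakeSpider are stand-ins for a real Item subclass and spider, and the table name, the TEXT column type and the in-memory database are all assumptions made for illustration.

```python
import sqlite3


class FakeItem:
    # Stand-in for a scrapy.Item subclass; a real Item exposes a similar
    # `fields` mapping of field name -> Field object.
    fields = {"AdId": {}, "DateR": {}, "AdURL": {}}


class FakeSpider:
    # Stand-in for a spider that exposes its item class to pipelines.
    item_cls = FakeItem


class SqlitePipeline:
    table = "ads"  # hypothetical table name

    def open_spider(self, spider):
        self.con = sqlite3.connect(":memory:")  # assumption: in-memory DB
        self.setup_table(spider.item_cls.fields)

    def setup_table(self, fields):
        # Column names come from our own item class, not from user input,
        # so interpolating them into the statement is acceptable here.
        columns = ", ".join('"{}" TEXT'.format(name) for name in fields)
        self.con.execute(
            'CREATE TABLE IF NOT EXISTS "{}" ({})'.format(self.table, columns)
        )

    def close_spider(self, spider):
        self.con.close()


pipeline = SqlitePipeline()
pipeline.open_spider(FakeSpider())
# PRAGMA table_info returns one row per column; index 1 is the column name.
created = [row[1] for row in pipeline.con.execute("PRAGMA table_info(ads)")]
print(created)
```

Because the table is created in open_spider(), it exists before the first call to process_item(), which is what the question asks for.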
Alternatively, you can lazy-load during the process_item method itself:
class MyPipeline:
    item = None
    def process_item(self, item, spider):
        if not self.item:
            self.item = item
            self.setup_table(item)
        return item
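A slightly fuller sketch of the lazy approach, assuming plain dicts as items and a hypothetical setup_table() that only records the column names (a real pipeline would create the table there):

```python
class LazyPipeline:
    def __init__(self):
        self.columns = None  # filled in when the first item arrives

    def setup_table(self, item):
        # Hypothetical helper: a real pipeline would create the table here.
        self.columns = list(item.keys())

    def process_item(self, item, spider):
        if self.columns is None:
            self.setup_table(item)
        return item  # always hand the item on to the next pipeline component


pipeline = LazyPipeline()
first = {"AdId": 1, "DateR": "2018-10-08", "AdURL": "http://example.com/1"}
returned = pipeline.process_item(first, spider=None)
print(pipeline.columns)
```

One caveat with this approach: an item only reports the fields that were actually populated, so the first item may not reveal every declared field.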

Related

Trying to figure out how to pass variables from one class to another in python while calling a class from a dictionary

So I am getting used to working with OOP in Python. It has been a bumpy road, but so far things seem to be working. I have, however, hit a snag that I cannot figure out. Here is the premise.
I call a class and pass two variables to it, a report and a location. From there, I need to take the location variable, pass it to a database and get a list of filters it is supposed to run through; this is done through a dictionary call. Finally, once that dictionary call happens, I need to take the report and run it through the filters. Here is the code I have.
class Filters(object):
    def __init__(self, report, location):
        self.report = report
        self.location = location
    def get_location(self):
        return self.location
    def run(self):
        cursor = con.cursor()
        filters = cursor.execute(filterqry).fetchall()
        for i in filters:
            f = ReportFilters.fd.get(i[0])
            f.run()
        cursor.close()
class Filter1(Filters):
    def __init__(self):
        self.f1 = None
        ''' here is where i tried super() and Filters.__init__.() etc.... but couldn't make it work'''
    def run(self):
        '''Here is where i want to run the filters but as of now i am trying to print out the
        location and the report to see if it gets the variables.'''
        print(Filters.get_location())
class ReportFilters(Filters):
    fd = {
        'filter_1': Filter1(),
        'filter_2': Filter2(),
        'filter_3': Filter3()
    }
My errors come from the dictionary call: when I try to call it, it asks for the report and location variables.
I hope this is clear enough for you to help out with. As always, it is duly appreciated.
DamnGroundHog
The call to the parent class should be made inside the __init__ function: Filter1.__init__() should accept 'self', 'report' and 'location' and pass them on to Filters.__init__(), so that the parent class can store those variables.
If the error is in the Filter1 object, i.e. when you call its run method you do not see a location or report passed in from the parent class, that is because you did not supply them when you instantiated those objects in ReportFilters.fd.
It should be:
class ReportFilters(Filters):
    fd = {
        'filter_1': Filter1(report1, location1),
        'filter_2': Filter2(report2, location2),
        'filter_3': Filter3(report3, location3)
    }
class Filter1(Filters):
    def __init__(self, report, location):
        Filters.__init__(self, report, location)
        self.f1 = None
    def run(self):
        print(self.get_location())
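A minimal, self-contained sketch of one way to wire this together: the dictionary stores the filter classes rather than instances, so each filter can be constructed with the report and location once they are known. The database lookup is replaced by a hard-coded list of filter names, and Filter2, FILTER_CLASSES, run_filters and the sample values are invented for illustration.

```python
class Filters:
    def __init__(self, report, location):
        self.report = report
        self.location = location

    def get_location(self):
        return self.location


class Filter1(Filters):
    def __init__(self, report, location):
        super().__init__(report, location)  # let the parent store both values
        self.f1 = None

    def run(self):
        return "filter_1 ran on {} at {}".format(self.report, self.get_location())


class Filter2(Filters):  # hypothetical second filter, inherits __init__ as-is
    def run(self):
        return "filter_2 ran on {} at {}".format(self.report, self.get_location())


# Registry of classes, not instances: nothing is constructed at import time.
FILTER_CLASSES = {"filter_1": Filter1, "filter_2": Filter2}


def run_filters(report, location, names):
    # `names` stands in for the rows the database query would return.
    return [FILTER_CLASSES[name](report, location).run() for name in names]


results = run_filters("weekly report", "Berlin", ["filter_1", "filter_2"])
print(results)
```

Storing classes in the dictionary avoids the original error, because no constructor is called until the report and location are available.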

Python / Attributes between methods of a class

I'm new to Python and I'm trying to get my head around how attributes are managed between methods of a class.
In the following example, I'm trying to modify a list in the method "regex" and use it afterwards in another method "printsc".
The "regex" part works without issues, but the attribute "self.mylist" is not updated so when I call "printsc" the result is "None".
import re
class MyClass():
    def __init__(self):
        self.mylist = None
    def regex(self, items):
        self.mylist = []
        for item in items:
            if re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", item):
                self.mylist.append("IP:" + item)
            else:
                self.mylist.append("DNS:" + item)
        return self.mylist
    def printsc(self):
        print(self.mylist)
items = ['192.168.0.1', 'hostname1', '10.0.1.15', 'server.local.fr']
MyClass().regex(items)
MyClass().printsc()
What am I missing ? What is the best way to achieve this goal ?
Thank you for your answers!
When you call MyClass(), it returns a new object, and you are calling your methods on that object. Since you do it twice, a new object is created each time, so regex and printsc are called on different objects.
What you should do is:
myObj = MyClass()
myObj.regex(items)
myObj.printsc()
The problem is that when you do:
MyClass().regex(items)
MyClass().printsc()
You are creating 2 separate instances of MyClass, each of which will have a different .mylist attribute.
Either mylist is an instance attribute, and then this will work:
instance = MyClass()
instance.regex(items)
instance.printsc()
Or, if you want to share .mylist across instances, it should be a class attribute:
class MyClass():
    class_list = None
    def __init__(self):
        pass
    def regex(self, items):
        cls = self.__class__
        if cls.class_list is None:
            cls.class_list = []
        for item in items:
            if re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", item):
                cls.class_list.append("IP:" + item)
            else:
                cls.class_list.append("DNS:" + item)
        return cls.class_list
    def printsc(self):
        # Going through `.__class__.` is actually optional for
        # reading an attribute - if it is not in the instance
        # Python will fetch it from the class instead.
        # i.e., the line below would work with `self.class_list`
        print(self.__class__.class_list)
This way, the list persists across different instances of the class, as you try to do in your example.
You should create an object of the class:
a = MyClass()
a.regex(items)
a.printsc()
>>> ['IP:192.168.0.1', 'DNS:hostname1', 'IP:10.0.1.15', 'DNS:server.local.fr']
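The difference between the two kinds of attribute discussed in these answers can be seen in a small standalone sketch. The Holder class and its attribute names are made up for the demonstration and are not part of the question's code.

```python
class Holder:
    shared = []              # class attribute: one list for every instance

    def __init__(self):
        self.own = []        # instance attribute: a fresh list per object

    def add(self, value):
        self.own.append(value)
        Holder.shared.append(value)


a = Holder()
b = Holder()
a.add("x")

print(a is b)        # False: each call to Holder() builds a new object
print(a.own, b.own)  # ['x'] []
print(b.shared)      # ['x']
```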

Load inconsistent data in pymongo

I am working with pymongo and want to ensure that saved data can still be loaded even if additional data elements have been added to the schema.
I have used this for classes that don't need to have the information processed before assigning it to class attributes:
class MyClass(object):
    def __init__(self, instance_id):
        #set default values
        self.database_id = instance_id
        self.myvar = 0
        #load values from database
        self.__load()
    def __load(self):
        data_dict = Collection.find_one({"_id":self.database_id})
        for key, attribute in data_dict.items():
            self.__setattr__(key,attribute)
However, in classes where I have to process the data from the database, this doesn't work:
class Example(object):
    def __init__(self, name):
        self.name = name
        self.database_id = None
        self.member_dict = {}
        self.load()
    def load(self):
        data_dict = Collection.find_one({"name":self.name})
        self.database_id = data_dict["_id"]
        for element in data_dict["element_list"]:
            self.process_element(element)
        for member_name, member_info in data_dict["member_class_dict"].items():
            self.member_dict[member_name] = MemberClass(member_info)
    def process_element(self, element):
        print("Do Stuff")
Two example use cases I have are:
1) A list of strings that are used to set flags; this is done by calling a function with the string as the argument. (def process_element above)
2) A dictionary of dictionaries which is used to create a collection of instances of a class. (MemberClass(member_info) above)
I tried creating properties to handle this but found that __setattr__ doesn't look for properties.
I know I could redefine __setattr__ to look for specific names but it is my understanding that this would slow down all set interactions with the class and I would prefer to avoid that.
I also know I could use a bunch of try/excepts to catch the errors but this would end up making the code very bulky.
I don't mind the load function being slowed down a bit for this but very much want to avoid anything that will slow down the class outside of loading.
So the solution that I came up with is to keep the idea of special-casing certain names, but to handle those special cases in the load function instead of in __setattr__.
def load(self):
    data_dict = Collection.find_one({"name":self.name})
    for key, attribute in data_dict.items():
        if key == "_id":
            self.database_id = attribute
        elif key == "element_list":
            for element in attribute:
                self.process_element(element)
        elif key == "member_class_dict":
            for member_name, member_info in attribute.items():
                self.member_dict[member_name] = MemberClass(member_info)
        else:
            self.__setattr__(key,attribute)
This provides all of the functionality of overriding the __setattr__ method without slowing down any future calls to __setattr__ outside of loading the class.
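The same idea can also be written with a dictionary of handlers instead of an if/elif chain, which keeps load() short as the number of special keys grows. This is only a sketch: the document is a hard-coded dict standing in for the result of find_one(), and MemberClass, the flags set and the handler method names are simplified placeholders.

```python
class MemberClass:
    # Placeholder for the real member class.
    def __init__(self, info):
        self.info = info


class Example:
    def __init__(self, document):
        self.database_id = None
        self.flags = set()
        self.member_dict = {}
        # Keys that need processing map to bound handler methods.
        self._handlers = {
            "_id": self._load_id,
            "element_list": self._load_elements,
            "member_class_dict": self._load_members,
        }
        self.load(document)

    def _load_id(self, value):
        self.database_id = value

    def _load_elements(self, elements):
        self.flags.update(elements)

    def _load_members(self, members):
        for name, info in members.items():
            self.member_dict[name] = MemberClass(info)

    def load(self, document):
        for key, value in document.items():
            handler = self._handlers.get(key)
            if handler is not None:
                handler(value)
            else:
                setattr(self, key, value)  # unknown keys become plain attributes


doc = {
    "_id": 42,
    "element_list": ["visible", "locked"],
    "member_class_dict": {"alice": {"level": 3}},
    "added_later": "new schema field",
}
example = Example(doc)
print(example.database_id, sorted(example.flags), example.added_later)
```

As in the if/elif version, the extra lookup cost is paid only inside load(); ordinary attribute assignment on the class is untouched.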

Iterating through class variables in python

Please correct my code
PS - I'm fairly new to Python
class Contact:
    def __init__(self,cid, email):
        self.cid=cid
        self.email=email
def ind(contacts):
    index={}
    #Code here
    return index
contacts = [Contact(1,'a'),
            Contact(2,'b'),
            Contact(3,'c'),
            Contact(4,'a')]
print(ind(contacts))
Need the output to be like -
{'a':[1,4], 'b':2, 'c':3}
The following methods create list values like:
{'a':[1,4], 'b':[2], 'c':[3]}
I can't imagine why this wouldn't be fine, but I've added a method at the end that gets your specific output.
This doesn't guarantee the order of the emails on Python versions before 3.7 (plain dicts preserve insertion order from 3.7 onwards):
def ind(contracts):
    index={}
    for contract in contracts:
        index.setdefault(contract.email, []).append(contract.cid)
    return index
To maintain order (e.g. start with 'a'), add from collections import OrderedDict to the top of your file and then the method is:
def ind(contracts):
    index = OrderedDict()
    for contract in contracts:
        index.setdefault(contract.email, []).append(contract.cid)
    return index
The printout of index will look different, but it acts the same as a normal dict object (just with ordering).
Exact output (with ordering):
def ind(contracts):
    index = OrderedDict()
    for contract in contracts:
        if contract.email in index:
            value = index[contract.email]
            if not isinstance(value, list):
                index[contract.email] = [value]
            index[contract.email].append(contract.cid)
        else:
            index[contract.email] = contract.cid
    return index
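For comparison, here is a compact sketch that first groups the ids with collections.defaultdict and then unwraps single-element lists to match the requested output. It assumes Python 3.7 or later, where plain dicts keep insertion order; the Contact class is repeated so the snippet runs on its own.

```python
from collections import defaultdict


class Contact:
    def __init__(self, cid, email):
        self.cid = cid
        self.email = email


def ind(contacts):
    grouped = defaultdict(list)
    for contact in contacts:
        grouped[contact.email].append(contact.cid)
    # Unwrap lists that hold a single id so the result matches the question.
    return {email: ids if len(ids) > 1 else ids[0] for email, ids in grouped.items()}


contacts = [Contact(1, 'a'), Contact(2, 'b'), Contact(3, 'c'), Contact(4, 'a')]
index = ind(contacts)
print(index)  # {'a': [1, 4], 'b': 2, 'c': 3}
```

Mixing lists and bare integers as values makes the result harder to consume, which is why the answers above suggest keeping lists throughout unless the exact output is required.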

Dynamically add methods to a class in Python 3.0

I'm trying to write a Database Abstraction Layer in Python which lets you construct SQL statements using chained function calls such as:
results = (db.search("book")
    .author("J. K. Rowling")
    .price("<40.00")
    .title("Harry")
    .execute())
but I am running into problems when I try to dynamically add the required methods to the db class.
Here are the important parts of my code:
import inspect
def myName():
    return inspect.stack()[1][3]
class Search():
    def __init__(self, family):
        self.family = family
        self.options = ['price', 'name', 'author', 'genre']
        #self.options is generated based on family, but this is an example
        for opt in self.options:
            self.__dict__[opt] = self.__Set__
        self.conditions = {}
    def __Set__(self, value):
        self.conditions[myName()] = value
        return self
    def execute(self):
        return self.conditions
However, when I run an example such as:
print(db.search("book").price(">4.00").execute())
it outputs:
{'__Set__': 'harry'}
Am I going about this the wrong way? Is there a better way to get the name of the function being called or to somehow make a 'hard copy' of the function?
You can simply add the search functions (methods) after the class is created:
class Search: # The class does not include the search methods, at first
    def __init__(self):
        self.conditions = {}
def make_set_condition(option): # Factory function that generates a "condition setter" for "option"
    def set_cond(self, value):
        self.conditions[option] = value
        return self
    return set_cond
for option in ('price', 'name'): # The class is extended with additional condition setters
    setattr(Search, option, make_set_condition(option))
>>> Search().name("Nice name").price('$3').conditions # Example
{'price': '$3', 'name': 'Nice name'}
PS: This class has an __init__() method that does not have the family parameter (the condition setters are dynamically added at runtime, but are added to the class, not to each instance separately). If Search objects with different condition setters need to be created, then the following variation on the above method works (the __init__() method has a family parameter):
import types
class Search: # The class does not include the search methods, at first
    def __init__(self, family):
        self.conditions = {}
        for option in family: # The class is extended with additional condition setters
            # The new 'option' attributes must be methods, not regular functions:
            setattr(self, option, types.MethodType(make_set_condition(option), self))
def make_set_condition(option): # Factory function that generates a "condition setter" for "option"
    def set_cond(self, value):
        self.conditions[option] = value
        return self
    return set_cond
>>> o0 = Search(('price', 'name')) # Example
>>> o0.name("Nice name").price('$3').conditions
{'price': '$3', 'name': 'Nice name'}
>>> dir(o0) # Each Search object has its own condition setters (here: name and price)
['__doc__', '__init__', '__module__', 'conditions', 'name', 'price']
>>> o1 = Search(('director', 'style'))
>>> o1.director("Louis L").conditions # New method name
{'director': 'Louis L'}
>>> dir(o1) # Each Search object has its own condition setters (here: director and style)
['__doc__', '__init__', '__module__', 'conditions', 'director', 'style']
Reference: http://docs.python.org/howto/descriptor.html#functions-and-methods
If you really need search methods that know about the name of the attribute they are stored in, you can simply set it in make_set_condition() with
set_cond.__name__ = option # Sets the function name
(just before the return set_cond). Before doing this, the method Search.price has the following name:
>>> Search.price
<function set_cond at 0x107f832f8>
after setting its __name__ attribute, you get a different name:
>>> Search.price
<function price at 0x107f83490>
Setting the method name this way makes possible error messages involving the method easier to understand.
Firstly, you are not adding anything to the class, you are adding it to the instance.
Secondly, you don't need to access __dict__ directly. The self.__dict__[opt] = self.__Set__ is better done with setattr(self, opt, self.__Set__).
Thirdly, don't use __xxx__ as attribute names. Those are reserved for Python-internal use.
Fourthly, as you noticed, Python is not easily fooled. The internal name of the method you call is still __Set__, even though you access it under a different name. :-) The name is set when you define the method as a part of the def statement.
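The fourth point can be checked directly: a function's name is recorded when the def statement runs, so storing the bound method under another attribute name does not change what inspect reports. A minimal sketch (the Demo class, the _set method and the helper name are made up for the demonstration):

```python
import inspect


def current_function_name():
    # Name of the calling function, as recorded when it was defined.
    return inspect.stack()[1].function


class Demo:
    def _set(self, value):
        return current_function_name()


d = Demo()
d.price = d._set            # alias the bound method under another name
result = d.price("<40.00")
print(result)               # _set
```

This is why the question's conditions dictionary ends up keyed by the method's defined name rather than by 'price'.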
You probably want to create and set the option methods with a metaclass. You also might want to actually create those methods instead of trying to use one method for all of them. If you really want to use only one, __getattr__ is the way, but it can be a bit fiddly and I generally recommend against it. Lambdas or other dynamically generated methods are probably better.
Here is some working code to get you started (not the whole program you were trying to write, but something that shows how the parts can fit together):
class Assign:
    def __init__(self, searchobj, key):
        self.searchobj = searchobj
        self.key = key
    def __call__(self, value):
        self.searchobj.conditions[self.key] = value
        return self.searchobj
class Book():
    def __init__(self, family):
        self.family = family
        self.options = ['price', 'name', 'author', 'genre']
        self.conditions = {}
    def __getattr__(self, key):
        if key in self.options:
            return Assign(self, key)
        raise RuntimeError('There is no option for: %s' % key)
    def execute(self):
        # XXX do something with the conditions.
        return self.conditions
b = Book('book')
print(b.price(">4.00").author('J. K. Rowling').execute())

Resources