Creating a custom component in spaCy - python-3.x

I am trying to create a spaCy pipeline component that returns Spans of meaningful text (my corpus comprises PDF documents that contain a lot of garbage I am not interested in - tables, headers, etc.).
More specifically, I am trying to create a function that:
takes a doc object as an argument
iterates over the doc tokens
when certain rules are met, yields a Span object
Note I would also be happy with returning a list([span_obj1, span_obj2]).
What is the best way to do something like this? I am a bit confused about the difference between a pipeline component and an extension attribute.
So far I have tried:
nlp = English()
Doc.set_extension('chunks', method=iQ_chunker)
####
raw_text = get_test_doc()
doc = nlp(raw_text)
print(type(doc._.chunks))
>>> <class 'functools.partial'>
iQ_chunker is a method that does what I explained above and returns a list of Span objects.
This is not the result I expect, as the function I pass in as method returns a list.

I imagine you're getting a functools partial back because you are accessing chunks as an attribute, despite having passed it in as an argument for method. If you want spaCy to intervene and call the method for you when you access something as an attribute, it needs to be
Doc.set_extension('chunks', getter=iQ_chunker)
Please see the Doc documentation for more details.
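If you do want to keep the method-style registration, note that a method extension has to be called, not just read: accessing doc._.chunks gives you the bound partial, and calling it runs your function with the doc as its first argument. A minimal sketch, assuming iQ_chunker takes just the doc and returns a list of Spans (the chunking rule here is only a placeholder):

import spacy
from spacy.tokens import Doc, Span

def iQ_chunker(doc):
    # placeholder rule: return the whole doc as a single Span
    return [Span(doc, 0, len(doc))]

nlp = spacy.blank("en")
Doc.set_extension("chunks", method=iQ_chunker, force=True)

doc = nlp("Some raw text pulled out of a PDF.")
print(doc._.chunks())  # note the parentheses: method extensions are invoked
# >>> [Some raw text pulled out of a PDF.]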
However, if you are planning to compute this attribute for every single document, I think you should make it part of your pipeline instead. Here is some simple sample code that does it both ways.
import spacy
from spacy.tokens import Doc

def chunk_getter(doc):
    # the getter is called when we access _.extension_1,
    # so the computation is done at access time
    # also, because this is a getter,
    # we need to return the actual result of the computation
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]
    return [first_half, second_half]

def write_chunks(doc):
    # this pipeline component is called as part of the spacy pipeline,
    # so the computation is done at parse time
    # because this is a pipeline component,
    # we need to set our attribute value on the doc (which must be registered)
    # and then return the doc itself
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]
    doc._.extension_2 = [first_half, second_half]
    return doc

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
Doc.set_extension("extension_1", getter=chunk_getter)
Doc.set_extension("extension_2", default=[])
nlp.add_pipe(write_chunks)

test_doc = nlp('I love spaCy')
print(test_doc._.extension_1)
print(test_doc._.extension_2)
This just prints [I, love spaCy] twice because they are two ways of doing the same thing, but I think making it part of your pipeline with nlp.add_pipe is the better way to do it if you expect to need this output on every document you parse.
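One caveat that is not in the original answer: passing a bare function to nlp.add_pipe is spaCy v2.x behaviour. On spaCy v3, components are registered by name first and then added by that name; a rough sketch of the equivalent, reusing the write_chunks logic and the extension_2 extension from above:

from spacy.language import Language

@Language.component("write_chunks")
def write_chunks(doc):
    # same logic as above: store the halves on the registered extension
    doc._.extension_2 = [doc[:len(doc)//2], doc[len(doc)//2:]]
    return doc

nlp.add_pipe("write_chunks", last=True)  # v3: add by registered name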

Related

How to inspect mapped tasks' inputs from reduce tasks in Prefect

I'm exploring Prefect's map-reduce capability as a powerful idiom for writing massively-parallel, robust importers of external data.
As an example - very similar to the X-Files tutorial - consider this snippet:
@task
def retrieve_episode_ids():
    api_connection = APIConnection(prefect.context.my_config)
    return api_connection.get_episode_ids()

@task(max_retries=2, retry_delay=datetime.timedelta(seconds=3))
def download_episode(episode_id):
    api_connection = APIConnection(prefect.context.my_config)
    return api_connection.get_episode(episode_id)

@task(trigger=all_finished)
def persist_episodes(episodes):
    db_connection = DBConnection(prefect.context.my_config)
    ...store all episodes by their ID with a success/failure flag...

with Flow("import_episodes") as flow:
    episode_ids = retrieve_episode_ids()
    episodes = download_episode.map(episode_ids)
    persist_episodes(episodes)
The peculiarity of my flow, compared with the simple X-Files tutorial, is that I would like to persist results for all the episodes that I have requested, even for the failed ones. Imagine that I'll be writing episodes to a database table as the episode ID decorated with an is_success flag. Moreover, I'd like to write all episodes with a single task instance, in order to be able to perform a bulk insert - as opposed to inserting each episode one by one - hence my persist_episodes task being a reduce task.
The trouble I'm having is in being able to gather the episode ID for the failed downloads from that reduce task, so that I can store the failed information in the table under the appropriate episode ID. I could of course rewrite the download_episode task with a try/catch and always return an episode ID even in the case of failure, but then I'd lose the automatic retry/failure functionality which is a good deal of the appeal of Prefect.
Is there a way for a reduce task to infer the argument(s) of a failed mapped task? Or, could I write this differently to achieve what I need, while still keeping the same level of clarity as in my example?
Mapping over a list preserves the order. This is a property you can use to link inputs with errors. Check the code below; I'll add more explanation after.
from prefect import Flow, task
import prefect

@task
def retrieve_episode_ids():
    return [1, 2, 3, 4, 5]

@task
def download_episode(episode_id):
    if episode_id == 5:
        return ValueError()
    return episode_id

@task()
def persist_episodes(episode_ids, episodes):
    # Note the last element here will be the ValueError
    prefect.context.logger.info(episodes)
    # We change that ValueError into a "fail" message
    episodes = ["fail" if isinstance(x, BaseException) else x for x in episodes]
    # Note the last element here will be the "fail"
    prefect.context.logger.info(episodes)
    result = {}
    for i, episode_id in enumerate(episode_ids):
        result[episode_id] = episodes[i]
    # Check final results
    prefect.context.logger.info(result)
    return

with Flow("import_episodes") as flow:
    episode_ids = retrieve_episode_ids()
    episodes = download_episode.map(episode_ids)
    persist_episodes(episode_ids, episodes)

flow.run()
The handling largely happens in the persist_episodes task. Just pass the list of inputs in again and then we can match the inputs with the failed tasks. I added some handling around identifying errors and replacing them with what you want. Does that answer the question?
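One detail from the original flow worth keeping (my note, not part of the answer above): if download_episode actually raises instead of returning the exception, the reduce task only runs when it is given a trigger that tolerates failed upstream tasks, as in the question's code. A minimal sketch:

from prefect import task
from prefect.triggers import all_finished

@task(trigger=all_finished)
def persist_episodes(episode_ids, episodes):
    # all_finished lets this reduce task run even if some mapped
    # download_episode runs failed; the failed entries then typically
    # arrive as exception objects, which the isinstance(..., BaseException)
    # check above turns into "fail" markers
    ...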
Always happy to chat more. You can reach out in the Prefect Slack or Discourse as well.

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

I have always resisted using asyncio within my code, but using it might help with some performance issues that I'm having.
Here is my scenario:
An end user provides a list of news sites to scrape
Each element is passed to an Article Class
A valid article is passed to an Extraction Class
The Extraction Class passes data to a NewsExtraction Class
90% of the time this flow is flawless, but on occasion one of the 12 functions in the NewsExtraction Class fails to extract data that exists in the HTML provided. It seems that my code is "stepping on itself," which causes the data element not to be parsed. When I rerun the code, all the elements are parsed correctly.
The NewsExtraction Class has this function get_article_data_elements, which is called from the Extraction Class.
The function get_article_data_elements calls these methods:
published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()
Each of these data elements is used to populate a Python dictionary, which is eventually passed back to the end user.
I have been trying to add asyncio code to the NewsExtraction Class, but I kept getting this error message:
RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited
I have spent the last 3 days trying to figure this issue out. I have looked at dozens of questions on Stack Overflow on this error RuntimeWarning: coroutine never awaited. I have also looked at numerous articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction Class, which is called from the Extraction Class.
Can someone provide me some pointers to solve my issue?
class NewsExtraction(object):
    """
    This class is used to extract common data elements from a news article
    on xyz
    """
    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup

    truncated...

    async def _extract_article_published_date(self):
        """
        This function is designed to extract the publish date for the article being parsed.
        :return: date article was originally published
        :rtype: string
        """
        json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
        if json_date_published is not None:
            if len(json_date_published) != 0:
                return json_date_published
            else:
                return None
        elif json_date_published is None:
            if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
                date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
                if len(date_published) != 0:
                    return date_published.text
                else:
                    logger.info('The HTML tag to extract the publish date for the following article was not found.')
                    logger.info(f'Article URL -- {self._url}')
                    return None

    truncated...

    async def get_article_data_elements(self):
        """
        This function is designed to extract all the common data elements from a
        news article on xyz.
        :return: dictionary of data elements related to the article
        :rtype: dict
        """
        article_data_elements = {}

        # I have tried this:
        published_date = self._extract_article_published_date().__await__()

        # and this
        published_date = self.task(self._extract_article_published_date())
        await published_date

    truncated...
I have also tried to use:
if __name__ == "__main__":
    asyncio.run(NewsExtraction.get_article_data_elements())
    # asyncio.run(self.get_article_data_elements())
I'm really banging my head on the wall with using asyncio in my news extraction code.
If this question is off base, I will be happy to delete it and keep reading about how to use asyncio correctly.
Can someone provide me some pointers to solve my issue?
Thanks in advance for any guidance on using asyncio
You are defining _extract_article_published_date and get_article_data_elements as coroutines, and these coroutines must be awaited in your code to get the result of their execution in an asynchronous way.
You can do this by creating an instance of NewsExtraction and calling these methods with the keyword await in front; the await hands execution over to other tasks in the event loop until the awaited task completes. Note that there are no threads or processes involved in this task execution: execution is handed over only while the coroutine is not using CPU time (awaiting I/O operations or sleeping).
if __name__ == '__main__':
    extractor = NewsExtraction(...)
    # this creates the event loop and runs the coroutine
    asyncio.run(extractor.get_article_data_elements())
Inside your _extract_article_published_date you must also await any coroutines that perform requests over the network; if you are using a library for the scraping, make sure it uses async/await behind the scenes, otherwise you won't get a real performance benefit from asyncio.
async def get_article_data_elements(self):
    article_data_elements = {}
    # note here that the instance is self
    published_date = await self._extract_article_published_date()
    truncated...
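Since the whole point of making the twelve _extract_* methods coroutines is to run them concurrently, awaiting them one after another gives up most of the benefit. A sketch of the method body only (my addition, reusing the method names from the question; not a complete runnable class), gathering them in one go:

import asyncio

async def get_article_data_elements(self):
    # run the extraction coroutines concurrently;
    # gather() returns their results in the order they were passed in
    published_date, modified_date, title = await asyncio.gather(
        self._extract_article_published_date(),
        self._extract_article_modified_date(),
        self._extract_article_title(),
        # ... the remaining _extract_* coroutines ...
    )
    article_data_elements = {
        'published_date': published_date,
        'modified_date': modified_date,
        'title': title,
        # ...
    }
    return article_data_elements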
You must dive into the asyncio documentation to get a better understanding of these features of Python 3.7+.

Scrapy crawl saved links out of csv or array

import scrapy

class LinkSpider(scrapy.Spider):
    name = "articlelink"
    allow_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    BASE_URL = 'https://www.topart-online.com/de/'

    # scraping cards of specific category
    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            yield {
                'links': a.xpath('@href').get()
            }
        next_page_url = response.xpath('//a[@class="page-link"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
This is my spider, which crawls all pages of that category and saves all the product links into a CSV file when I run "scrapy crawl articlelink -o filename.csv" on my server.
Now I have to crawl all the links in my CSV file for specific information that isn't contained in the product-link card.
How do I start?
So I'm glad you've gone and had a look at how to scrape the next pages.
Now with regard to the productean number, this is where yielding a plain dictionary becomes cumbersome. In almost all Scrapy scripts you will be better off using Items. They are suited to grabbing elements from different pages, which is exactly what you want to do.
Scrapy extracts data from HTML, and the mechanism it suggests for holding that data is what it calls Items. Scrapy accepts a few different ways of putting data into some form of object. You can use a bog-standard dictionary, but for data that requires modification, or for anything other than a very clean and structured data set from the website, you should use Items at the least. An Item provides a dictionary-like object.
To use the Items mechanism in our spider script, we instantiate the Item class to create an item object. We then populate that item with the data we want, and in your particular case we share this item across functions to continue adding data from a different page.
In addition to this we have to declare the item field names; they are the keys of the item dictionary. We do this through items.py, located in the project folder.
Code Example
items.py
import scrapy

class TopartItem(scrapy.Item):
    title = scrapy.Field()
    links = scrapy.Field()
    ItemSKU = scrapy.Field()
    Delivery_Status = scrapy.Field()
    ItemEAN = scrapy.Field()
spider script
import scrapy
from ..items import TopartItem

class LinkSpider(scrapy.Spider):
    name = "link"
    allow_domains = ['topart-online.com']
    start_urls = ['https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg=1']
    custom_settings = {'FEED_EXPORT_FIELDS': ['title', 'links', 'ItemSKU', 'ItemEAN', 'Delivery_Status']}

    def parse(self, response):
        card = response.xpath('//a[@class="clearfix productlink"]')
        for a in card:
            items = TopartItem()
            link = a.xpath('@href')
            items['title'] = a.xpath('.//div[@class="sn_p01_desc h4 col-12 pl-0 pl-sm-3 pull-left"]/text()').get().strip()
            items['links'] = link.get()
            items['ItemSKU'] = a.xpath('.//span[@class="sn_p01_pno"]/text()').get().strip()
            items['Delivery_Status'] = a.xpath('.//div[@class="availabilitydeliverytime"]/text()').get().strip().replace('/', '')
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})
        last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
        last_page_number = int(last_pagination_link.split('=')[-1])
        for i in range(2, last_page_number + 1):
            url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
            yield response.follow(url=url, callback=self.parse)

    def parse_item(self, response):
        items = response.meta['items']
        items['ItemEAN'] = response.xpath('//div[@class="productean"]/text()').get().strip()
        yield items
Explanation
First, within items.py, we create a class called TopartItem. Because we inherit from scrapy.Item, we can create Field objects for our item. For any field we want, we give it a name and create the field object with scrapy.Field().
Within the spider script we have to import this TopartItem class. from ..items is a relative import, which means we take something from items.py in the parent directory of the spider script.
Now the code will look slightly familiar. First, to create our specific item dictionary called items within the for loop, we use items = TopartItem().
Adding to the items dictionary is similar to any other Python dictionary; the keys of the items dictionary are the fields we created in items.py.
The variable link is the link to the specific page. We then grab the data we want, which you've seen before.
As we populate our items dictionary, we still need to grab the productean number from the individual pages. We do this by following the link; the callback is the function we want to send the HTML from the individual page to.
meta={'items': items} is the way we transfer our items dictionary to the function parse_item. We create a meta dictionary with a key called items, whose value is the items dictionary we just created.
We then create the function parse_item. To get access to that items dictionary we must go through response.meta, which holds the meta dictionary we created when we made the request in the previous function. response.meta['items'] is how we access our items dictionary; we again call it items as before.
Now we can populate the items dictionary, which already has data in it from the previous function, and add the productean number to it. We then finally yield that items dictionary to tell Scrapy we are done adding data for extraction from this particular card.
To summarise the workflow: in the parse function we have a loop, and for every loop iteration we follow a link; we extract the first four pieces of data, then make Scrapy follow the specific link and add the 5th piece of data before moving on to the next card in the original HTML document.
Additional Information
Note: try to use get() if you want to grab only one piece of data, and getall() for more than one piece of data, instead of extract_first() and extract(). If you look at the Scrapy docs, they recommend this. get() is a bit more concise, and with extract() you were never sure whether you'd get a string or a list as the extracted data; getall() will always give you a list.
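For illustration (the HTML here is made up, not from the site above), the difference looks like this with a standalone Selector:

from scrapy.selector import Selector

html = '<h1>Some product</h1><a href="/de/page1">1</a><a href="/de/page2">2</a>'
sel = Selector(text=html)

# get() returns the first match as a string (or None if nothing matches)
print(sel.xpath('//h1/text()').get())    # 'Some product'

# getall() always returns a list of all matches
print(sel.xpath('//a/@href').getall())   # ['/de/page1', '/de/page2']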
I recommend that you look up other examples of Items in other Scrapy scripts by searching GitHub or other websites. Once you understand the workflow, I would also recommend reading the Items page of the docs carefully. It's clear but not example-friendly; I think it becomes more understandable once you've created scripts with Items a few times.
Updated next page links
I've replaced the code you had for next_page with a more robust way of getting all the data.
last_pagination_link = response.xpath('//a[@class="page-link"]/@href')[-1].get()
last_page_number = int(last_pagination_link.split('=')[-1])
for i in range(2, last_page_number + 1):
    url = f'https://www.topart-online.com/de/Blattzweige-Blatt-und-Bluetenzweige/l-KAT282?seg={i}'
    yield response.follow(url=url, callback=self.parse)
Here we create a variable for the last pagination link and grab the page number from it. I did this in case there are categories with more than 3 pages.
We then run a for loop and create the URL for each iteration. The URL contains what's called an f-string, f'...', which allows us to plant a variable in the string, so the loop can insert a number (or anything else) into that URL. The first iteration plants the number 2 into the URL, which gives us the link to the 2nd page, then 3, and so on. The range goes up to last_page_number + 1 because range() stops just before its end value, so the + 1 is what makes the last page included. We then follow each of those page URLs with the same parse callback.

How to modify the signature of a function dynamically

I am writing a framework in Python. When a user declares a function, they do:
def foo(row, fetch=stuff, query=otherStuff)
def bar(row, query=stuff)
def bar2(row)
When the backend sees a query=value parameter, it executes the function with the query argument filled in according to value. This way the function has access, in its scope, to the result of something computed by the backend.
Currently I build the arguments each time by checking whether query, fetch and the other items are None, and launching the function with a set of args that exactly matches what the user asked for; otherwise I get a "got an unexpected keyword argument" error. This is the code in the backend:
# fetch and query are things computed by the backend
if fetch == None and query == None:
    userfunction(row)
elif fetch == None:
    userfunction(row, query=query)
elif query == None:
    userfunction(row, fetch=fetch)
else:
    userfunction(row, fetch=fetch, query=query)
This is not good; for each additional "service" the backend offers, I need to write all the combinations with the previous ones.
Instead of that, I would like to take the function beforehand and manually add a named parameter to it, before executing it, removing all the unnecessary code that does these checks. Then the user would just use the things they really wanted.
I don't want the user to have to modify the function by adding parameters they don't want (nor do I want them to specify a kwarg every time).
So I would like an example, if this is doable, of a function addNamedVar(name, function) that adds the variable name to the function function.
I want to do it that way because the user's functions are called a lot of times; the alternative would be, for example, to build a dict of the function's named parameters (with inspect) on every call and then use **dict. I would really like to just modify the function once to avoid any kind of overhead.
This is indeed doable with the AST, and that's what I am going to do, because that solution suits my use case better. However, you could do what I asked more simply with a function-cloning approach like the code snippet below. Note that this code returns the same function with different default values. You can use this code as an example to do whatever you want.
This works for Python 3.
import inspect
import types

def copyTransform(f, name, **args):
    signature = inspect.signature(f)
    params = list(signature.parameters)
    numberOfParam = len(params)
    numberOfDefault = len(f.__defaults__)
    # parameters with defaults start at this index
    firstDefaultIndex = numberOfParam - numberOfDefault
    listTuple = list(f.__defaults__)
    for key, val in args.items():
        toChangeIndex = params.index(key, firstDefaultIndex)
        listTuple[toChangeIndex - firstDefaultIndex] = val
    newTuple = tuple(listTuple)
    oldCode = f.__code__
    newCode = types.CodeType(
        oldCode.co_argcount,        # integer
        oldCode.co_kwonlyargcount,  # integer
        oldCode.co_nlocals,         # integer
        oldCode.co_stacksize,       # integer
        oldCode.co_flags,           # integer
        oldCode.co_code,            # bytes
        oldCode.co_consts,          # tuple
        oldCode.co_names,           # tuple
        oldCode.co_varnames,        # tuple
        oldCode.co_filename,        # string
        name,                       # string
        oldCode.co_firstlineno,     # integer
        oldCode.co_lnotab,          # bytes
        oldCode.co_freevars,        # tuple
        oldCode.co_cellvars         # tuple
    )
    newFunction = types.FunctionType(newCode, f.__globals__, name, newTuple, f.__closure__)
    newFunction.__qualname__ = name  # also needed for serialization
    return newFunction
You need to do that weird stuff with the names if you want to pickle your cloned function.
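A quick usage sketch (my own example, not from the answer above), cloning a function so that its query default becomes a backend-computed value. Note that the types.CodeType call above matches the Python 3.7 constructor; on 3.8+ the constructor also takes co_posonlyargcount, and oldCode.replace(co_name=name) is the simpler route.

def bar(row, query="default"):
    return f"{row}: {query}"

bar_clone = copyTransform(bar, "bar_clone", query="backend-computed value")

print(bar("r1"))        # r1: default
print(bar_clone("r1"))  # r1: backend-computed value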

Overriding of CorpusView.read_block() not taken into account

I want to process a bunch of text files using NLTK, splitting them on a particular keyword. I am therefore trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):
    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView
However my knowledge of inheritance is rusty, and it seems my overriding is not taken into account. The output of
corpus = CustomCorpusReader("/path/to/files/", ".*")
print(corpus.words())
is identical to the output of
corpus = PlaintextCorpusReader("/path/to/files", ".*")
print(corpus.words())
I guess I'm missing something obvious, but what?
The documentation actually suggests two ways of defining a custom corpus view:
Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.
Subclass StreamBackedCorpusView, and override the read_block() method.
It also suggests the first way is easier, and indeed I managed to get it working as follows:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.api import *

class CustomCorpusReader(PlaintextCorpusReader):
    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )
corpus = CustomCorpusReader("/path/to/files/", ".*")
print(corpus.custom())
