How do you search for particular features? - featuretools

The last time I tried featuretools, I was searching for a particular feature that I expected to be generated. When you have more than 30 features, it is quite time-consuming to find a specific one.
Does the feature_names object (the second return value of the dfs method) have a method to search for text patterns (e.g. a regex)?
feature_names is a list of "featuretools.feature_base.feature_base.IdentityFeature" objects.
P.S.: The return objects are not described in the featuretools API documentation.

Deep Feature Synthesis returns feature objects. If you call FeatureBase.get_name() on one of those objects, it will return the name as a string. You can use this to implement whatever selection logic you'd like. For example, here is the code to build a list of all feature objects whose name contains "amount":
import featuretools as ft

# Build the demo entity set and generate feature definitions only
es = ft.demo.load_mock_customer(return_entityset=True)
fl = ft.dfs(target_entity="customers", entityset=es, features_only=True)

# Keep the feature objects whose name contains "amount"
keep = []
for feature in fl:
    if "amount" in feature.get_name():
        keep.append(feature)
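If you specifically want regex matching, a minimal sketch is to apply plain Python re to the names returned by get_name(); the pattern below is just an illustration, not part of the featuretools API:

import re

# Hypothetical pattern: keep features whose name mentions "amount" or "total"
pattern = re.compile(r"amount|total", re.IGNORECASE)

matched = [f for f in fl if pattern.search(f.get_name())]
for f in matched:
    print(f.get_name())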

Related

How to merge entities in spaCy via rules

I want to use some of the entities in spaCy 3's en_core_web_lg, but replace some of what it labeled as 'ORG' with 'ANALYTIC', since it treats the 3-character codes I want to use, such as 'P&L' and 'VaR', as organizations. The model has DATE entities, which I'm fine to preserve. I've read all the docs, and it seems like I should be able to use the EntityRuler with the syntax below, but I'm not getting anywhere. I have been through the training 2-3 times now, read all the Usage and API docs, and I just don't see any examples of working code. I get all sorts of different error messages, like that I need a decorator, or similar. Lord, is it really that hard?
my code:
from spacy.matcher import Matcher
from spacy.tokens import Span

analytics = [
    [{'LOWER': 'risk'}],
    [{'LOWER': 'pnl'}],
    [{'LOWER': 'p&l'}],
    [{'LOWER': 'return'}],
    [{'LOWER': 'returns'}]
]

matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", analytics)
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label "ANALYTIC"
    span = Span(doc, start, end, label="ANALYTIC")
    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)
This of course crashes when my new 'ANALYTIC' entity span collides with the existing 'ORG' one. But I have no idea how to either merge these offline and put them back, or create my own custom pipeline using rules. This is the suggested code from the EntityRuler docs. No clue.
# Construction via add_pipe
ruler = nlp.add_pipe("entity_ruler")
# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
So when you say it "crashes", what's happening is that you have conflicting spans. For doc.ents specifically, each token can only be in at most one span. In your case you can fix this by modifying this line:
doc.ents = list(doc.ents) + [span]
Here you've included both the old span (that you don't want) and the new span. If you get doc.ents without the old span this will work.
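For example, a minimal sketch (reusing span and doc from the question's loop, nothing else changed) would drop any existing entity that overlaps the new span before reassigning doc.ents:

# Keep only the existing entities that do not overlap the new span
non_overlapping = [ent for ent in doc.ents
                   if ent.end <= span.start or ent.start >= span.end]
doc.ents = non_overlapping + [span]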
There are also other ways to do this. Here I'll use a simplified example where you always want to change items of length 3, but you can modify this to use your list of specific words or something else.
You can directly modify entity labels, like this:
for ent in doc.ents:
    if len(ent.text) == 3:
        ent.label_ = "CHECK"
    print(ent.label_, ent, sep="\t")
If you want to use the EntityRuler it would look like this:
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
patterns = [
    {"label": "ANALYTIC",
     "pattern": [{"ENT_TYPE": "ORG", "LENGTH": 3}]},
]
ruler.add_patterns(patterns)

text = "P&L reported amazing returns this year."
doc = nlp(text)

for ent in doc.ents:
    print(ent.label_, ent, sep="\t")
One more thing: you don't say which version of spaCy you're using. I'm using spaCy v3 here; the way pipes are added changed a bit in v3.

How to add stop words to TfidfVectorizer?

I am trying to add stop words to my stop_words list; however, the code I am using doesn't seem to be working:
Creating stop words list:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords.extend(CustomListofWordstoExclude)
Here I am converting the text to a dtm (document term matrix) with tfidf weighting:
vect = TfidfVectorizer(stop_words = 'english', min_df=150, token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
But when I do this, I get this error:
FutureWarning: Pass input=None as keyword args. From version 0.25 passing these as positional arguments will result in an error
warnings.warn("Pass {} as keyword args. From version 0.25 "
What does this mean? Is there an easier way to add stopwords?
I'm unable to reproduce the warning. However, note that a warning such as this does not mean that your code did not run as intended. It means that in future releases of the package it may not work as intended. So if you try the same thing next year with updated packages, it may not work.
With respect to your question about using stop words, there are two changes that need to be made for your code to work as you expect.
list.extend() extends the list in place, but it doesn't return the list. To see this, you can check type(stopwords1), which gives NoneType. To define a new variable and add the custom word list to stopwords in one line, you can just use the built-in + operator for lists:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords + CustomListofWordstoExclude
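As a quick sanity check of the in-place behaviour (a throwaway example, not your real stopword list):

words = ['a', 'the']
result = words.extend(['rt'])
print(result)  # None: extend() mutates the list and returns nothing
print(words)   # ['a', 'the', 'rt']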
To actually use stopwords1 as your new stopwords list when performing the TF-IDF vectorization, you need to pass stop_words=stopwords1:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(stop_words=stopwords1,  # passed stopwords1 here
                       min_df=150,
                       token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape

In spaCy, is it possible to get the corresponding rule ID for a match in matches?

In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0', for example). During parsing, I use the on_match callback to handle each match. Is there a way to retrieve the rule used to find the match directly in the callback?
Here is my sample code.
txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
"de cacahuète, c'est un pilier de ma nourriture "
"quotidienne.")
nlp = spacy.load('fr')
def on_match(matcher, doc, id, matches):
span = doc[matches[id][1]:matches[id][2]]
print(span)
# find a way to get the corresponding rule without fuzz
matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])
doc = nlp(txt)
matches = matcher(doc)
In this case, matches returns:
[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]
12071893341338447867 is a unique ID based on class-1_0. I cannot find the original rule name, even if I do some introspection in matcher._patterns.
It would be great if someone can help me.
Thank you very much.
Yes – you can simply look up the ID in the StringStore of your vocabulary, available via nlp.vocab.strings or doc.vocab.strings. Going via the Doc is pretty convenient here, because you can do so within your on_match callback:
def on_match(matcher, doc, match_id, matches):
    string_id = doc.vocab.strings[match_id]
For efficiency, spaCy encodes all strings to integers and keeps a reference to the mapping in the StringStore lookup table. In spaCy v2.0, the integers are hash values, so they'll always match across models and vocabularies. For more details on this, see this section in the docs.
Of course, if your classes and IDs are kinda cryptic anyways, the other answer suggesting integer IDs will work fine, too. Just keep in mind that those integer IDs you choose will likely also be mapped to some random string in the StringStore (like a word, or a part-of-speech tag or something). This usually doesn't matter if you're not looking them up and resolving them to strings somewhere – but if you do, the output may be confusing. For example, if your matcher rule ID is 99 and you're calling doc.vocab.strings[99], this will return 'VERB'.
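The same lookup also works outside the callback, on the list of tuples the matcher returns; a small sketch reusing the question's matcher and doc:

for match_id, start, end in matcher(doc):
    rule_id = nlp.vocab.strings[match_id]  # e.g. 'class-1_0'
    span = doc[start:end]
    print(rule_id, '->', span.text)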
While writing my question, as often happens, I found the solution.
It's dead simple: instead of using a unicode rule ID like class-1_0, simply use an integer. The identifier will be preserved throughout the process.
matcher.add(1, on_match, [{'LEMMA': 'pilier'}])
The matches then come back with that integer ID:
[(1, 16, 17),]

Adding documents to gensim model

I have a class wrapping the various objects required for calculating LSI similarity:
from gensim import corpora, models, similarities

class SimilarityFiles:
    def __init__(self, file_name, tokenized_corpus, stoplist=None):
        if stoplist is None:
            self.filtered_corpus = tokenized_corpus
        else:
            self.filtered_corpus = []
            for convo in tokenized_corpus:
                self.filtered_corpus.append([token for token in convo if token not in stoplist])

        self.dictionary = corpora.Dictionary(self.filtered_corpus)
        self.corpus = [self.dictionary.doc2bow(text) for text in self.filtered_corpus]
        self.lsi = models.LsiModel(self.corpus, id2word=self.dictionary, num_topics=100)
        self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
I now want to add a function to the class to allow adding documents to the corpus and updating the model accordingly.
I've found dictionary.add_documents, and model.add_documents, but there are two things that aren't clear to me:
1. When you originally create the LSI model, one of the parameters the function receives is id2word=dictionary. When updating the model, how do you tell it to use the updated dictionary? Is that actually unnecessary, or will it make a difference?
2. How do I update the index? From the documentation it looks like I can add documents to the index if I use the Similarity class rather than the MatrixSimilarity class, but I don't see such functionality for MatrixSimilarity. If I understood correctly, MatrixSimilarity is better when the input corpus contains dense vectors (which it does, because I'm using the LSI model). Do I have to change it to Similarity just so that I can update the index? Or, conversely, what's the cost of creating this index? If it's insignificant, should I just create a new index with my updated corpus, as follows:
Code:
self.dictionary.add_documents(new_docs) # new_docs is already after filtering stop words
new_corpus = [self.dictionary.doc2bow(text) for text in new_docs]
self.lsi.add_documents(new_corpus)
self.index = similarities.MatrixSimilarity(self.lsi[self.corpus])
Thanks. :)
Well, it seems that it doesn't update the dictionary: it just adds new documents, not new features, so you should take a different approach.
I had the same problem, and I found this issue on the gensim GitHub helpful.
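If the goal is simply to keep extending the index with new LSI vectors, as the question itself suggests, a minimal sketch using similarities.Similarity instead of MatrixSimilarity might look like the following; the 'lsi_index' prefix is an arbitrary shard path, and lsi, corpus and new_corpus stand in for the attributes of the class above:

from gensim import similarities

# Similarity shards the index to disk and, unlike MatrixSimilarity,
# supports add_documents(); num_features must match the LSI topic count.
index = similarities.Similarity('lsi_index', lsi[corpus], num_features=100)

# Later, for new documents already converted to bag-of-words vectors:
index.add_documents(lsi[new_corpus])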

Using SciKit's kMeans to Cluster One's Own Documents

The SciKit site offers this k-means demo, and I'd like to use as much of it as possible to cluster some of my own documents, since I'm new to both machine learning and SciKit. The problem is getting my documents in a form that fits their demonstration.
Here is the "problem area" from SciKit's example:
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
As can be seen, in the example, the authors use/"fetch" a data set named "20newsgroups," the call for which (according to this page; see the second paragraph of 7.7) "returns a list of the raw text files that can be fed to text feature extractors." I am not relying on a list of "text files" -- as can be seen in my code below -- but I can place my "documents" in whatever form is necessary.
How can I use the SciKit example without having to place my "documents" in text files? Or is it standard practice only to cluster documents from text files rather than the database on which the documents live? It's simply not clear from the demo/documentation what in the example is completely superfluous, used because it made the authors' lives easier, and what isn't. Or at least it's not clear to me.
if cursor.rowcount > 0:  # don't bother doing anything if we don't get anything from the database
    data = cursor.fetchall()
    for row in data:
        temp_string = row[0]+" "+row[1]+" "+row[3]+" "+row[4]  # currently skipping the event_url: row[2]
        page = BeautifulSoup((''.join(temp_string)))
        pagetwo = str(page)
        clean_text = nltk.clean_html(pagetwo)
        tokens = nltk.word_tokenize(clean_text)
        fin_doc = "" + "\n"
        for word in tokens:
            fin_word = stemmer.stem(word).lower()
            if fin_word not in stopwords and len(fin_word) > 2:
                fin_doc += fin_word + " "
        documents.append(fin_doc)
The documents are just a list of strings, one string for each document, iirc.
The documentation is a bit unclear on this one. fetch_20newsgroups downloads the dataset as files, but the representation in the code is the content of the files, not the files themselves.
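In other words, you can feed your documents list (the strings built in the question's loop) straight into a vectorizer and then into KMeans, much like the demo does; a minimal sketch, with true_k chosen by hand since there are no dataset.target labels to count:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# documents is the list of preprocessed strings from the question's loop
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 5  # assumption: pick the number of clusters yourself
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
print(km.labels_[:10])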
