I would like to calculate association rules from a text field from a dataset such as the one below using Python:
ID fav_breakfast
1 I like to eat eggs and bacon for breakfast.
2 Bacon, bacon, bacon!
3 I love pancakes, but only if they have extra syrup!
4 Waffles and bacon. Eggs too!
5 Eggs, potatoes, and pancakes. No meat for me!
Please note that Orange 2.7 is not an option as I am using the current version of Python (3.6), so Orange 3 is fair game; however, I cannot seem to figure out how this module works with data in this format.
The first step, in my mind, would be to convert the above into a sparse matrix, such as the (truncated) one shown below:
Next, we would want to remove stop words (i.e., I, to, and, for, etc.), normalize upper/lower case, strip numbers and punctuation, and account for variants such as potato, potatoes, potatos, etc. (via lemmatization).
Once this sparse matrix is in place, the next step would be to calculate association rules among all of the words/strings in the sparse matrix. I have done this in R using the arules package; however, I haven't been able to identify an "arules equivalent" for Python.
The final solution that I envision would include a list of left-hand and right-hand side arguments along with the support, confidence, and lift of the rules in descending order with the highest lift rules at the top and lowest lift rules at the bottom (again, easy enough to obtain in R with arules).
In addition, I would like to have the ability to restrict the right-hand side to "bacon", again showing the support, confidence, and lift of the rules in descending order, with the highest-lift rules involving "bacon" at the top and the lowest-lift rules at the bottom.
Using Orange3-Associate will likely be the route to go here; however, I cannot find any good examples on the web. Thanks for your help in advance!
Is this what you had in mind? Orange should be able to pass outputs from one add-on and use them as inputs in another.
[EDIT]
I was able to reconstruct the case in code, but it is far less sexy:
import numpy as np
from orangecontrib.text import Corpus, preprocess, vectorization
from orangecontrib.associate.fpgrowth import *
data = Corpus.from_file("deerwester")
p = preprocess.Preprocessor()
preproc_corpus = p(data)
v = vectorization.bagofwords.BoWPreprocessTransform(p, "Count", preproc_corpus)
N = 30
X = np.random.random((N, 50)) > .9
itemsets = dict(frequent_itemsets(X, .1))
rules = association_rules(itemsets, .6)
list(rules_stats(rules, itemsets, N))
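If the Orange pipeline feels heavy, an alternative worth mentioning (not part of the original answer) is the mlxtend package, which is probably the closest analogue to R's arules in Python. Below is a rough sketch of the pipeline the question describes, using scikit-learn's CountVectorizer for the sparse matrix and stop-word removal (lemmatization omitted); the breakfast strings, thresholds, and the "bacon" filter are just the question's example:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from mlxtend.frequent_patterns import apriori, association_rules

docs = ["I like to eat eggs and bacon for breakfast.",
        "Bacon, bacon, bacon!",
        "I love pancakes, but only if they have extra syrup!",
        "Waffles and bacon. Eggs too!",
        "Eggs, potatoes, and pancakes. No meat for me!"]

# binary term-document matrix with English stop words removed
vec = CountVectorizer(stop_words="english", binary=True)
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray().astype(bool), columns=vec.get_feature_names_out())

# frequent itemsets and rules, analogous to arules
itemsets = apriori(df, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
rules = rules.sort_values("lift", ascending=False)

# restrict the right-hand side to "bacon"
bacon_rules = rules[rules["consequents"] == frozenset({"bacon"})]
print(bacon_rules[["antecedents", "consequents", "support", "confidence", "lift"]])

The binary=True option gives presence/absence per document, which is what association rule mining expects: each document acts as one transaction.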
I'm working on the heart attack analysis on Kaggle in Python.
I am a beginner and I'm trying to figure out whether it's still necessary to one-hot encode or label encode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak, and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise ST segment
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but it makes no sense to me to do so. Would using only StandardScaler be enough?
When you apply StandardScaler, the columns end up with values in the same range (zero mean, unit variance). That helps the model keep its weights bounded, and gradient descent will not shoot off while converging, so the model converges faster.
Independently, to decide between ordinal values and one-hot encoding, consider whether the distance between the column's values is meaningful, i.e., whether the categories have a natural order. If they do, choose ordinal values. If you know the hierarchy of the category, you can assign the ordinal values manually; otherwise, you can use LabelEncoder. It seems like the heart attack data is already given with ordinal values manually assigned; for example, higher chest pain = 4.
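To make the split concrete, here is a minimal sketch (not from the original answer) of how scaling and encoding could be combined with scikit-learn's ColumnTransformer; which columns count as continuous versus nominal is an assumption based on the question's column descriptions:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# assumed split, based on the question's descriptions
continuous = ["age", "thalach", "oldpeak"]   # real-valued, worth scaling
nominal = ["cp", "slope", "ca", "thal"]      # one-hot if you doubt the ordering
# sex, exang and target are already 0/1 and can pass through unchanged

preprocess = ColumnTransformer(
    [("num", StandardScaler(), continuous),
     ("cat", OneHotEncoder(handle_unknown="ignore"), nominal)],
    remainder="passthrough",
)

# X = preprocess.fit_transform(df.drop(columns="target"))  # df = the Kaggle heart data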
Also, it is important to refer to notebooks that perform better. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score
I am completely new to the field of Bayesian networks. For my project, I need to check all the possible d-separation conditions existing in a 7-node DAG, and for that I am looking for some good Python code.
My knowledge of programming is limited (a bit of numerical analysis and data structures), but I understand d-separation, e-separation and other concepts in a DAG quite well.
It would be really very helpful if someone could point out where to look for such specific code. Please note that I want Python code that checks for all the conditional independences following from d-separation in a 7-node DAG.
I would be happier with an algorithm that checks whether each path is blocked or not, rather than one built on semi-graphoid axioms.
I don't know exactly where should I look or to whom should I ask, so any help would be greatly appreciated.
I guess you understand that what you are asking for is a very large list, even if we only consider d-separation between just 2 variables (conditioned on a set of nodes).
Anyway, you can do that quite easily with pyAgrum (https://agrum.org)
import itertools
import pyAgrum as gum
# create a BN
bn = gum.fastBN("A->B<-C->D->E->F;B->E<-G")

# print the independency model by testing d-separations

# how to iterate over each subset of an iterable
def powerset(iterable):
    """
    powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
    """
    xs = list(iterable)
    # note we return an iterator rather than a list
    return itertools.chain.from_iterable(itertools.combinations(xs, n)
                                         for n in range(len(xs) + 1))

# testing every d-separation
for i in bn.names():
    for j in bn.names() - {i}:
        for k in powerset(bn.names() - {i, j}):
            if bn.isIndependent(i, j, k):
                print(f"{i} indep {j} given {k}")
And the result (in a notebook):
I was trying my hand at sentiment analysis in Python 3, and was using the TF-IDF vectorizer with the bag-of-words model to vectorize a document.
So, to anyone who is familiar with that, it is quite evident that the resulting matrix representation is sparse.
Here is a snippet of my code. Firstly, the documents.
tweets = [('Once you get inside you will be impressed with the place.',1),('I got home to see the driest damn wings ever!',0),('An extensive menu provides lots of options for breakfast.',1),('The flair bartenders are absolutely amazing!',1),('My first visit to Hiro was a delight!',1),('Poor service, the waiter made me feel like I was stupid every time he came to the table.',0),('Loved this place.',1),('This restaurant has great food',1),
('Honeslty it did not taste THAT fresh :(',0),('Would not go back.',0),
('I was shocked because no signs indicate cash only.',0),
('Waitress was a little slow in service.',0),
('did not like at all',0),('The food, amazing.',1),
('The burger is good beef, cooked just right.',1),
('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.',0),
('The cocktails are all handmade and delicious.',1),('This restaurant has terrible food',0),
('Both of the egg rolls were fantastic.',1),('The WORST EXPERIENCE EVER.',0),
('My friend loved the salmon tartar.',1),('Which are small and not worth the price.',0),
('This is the place where I first had pho and it was amazing!!',1),
('Horrible - do not waste your time and money.',0),('Seriously flavorful delights, folks.',1),
('I loved the bacon wrapped dates.',1),('I dressed up to be treated so rudely!',0),
('We literally sat there for 20 minutes with no one asking to take our order.',0),
('you can watch them preparing the delicious food! :)',1),('In the summer, you can dine in a charming outdoor patio - so very delightful.',1)]
X_train, y_train = zip(*tweets)
And the following code to vectorize the documents.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)
print(vectorized)
When I print vectorized, it does not output a normal matrix. Instead, this:
If I'm not wrong, this must be a sparse matrix representation. However, I am not able to comprehend its format, and what each term means.
Also, there are 30 documents, so that explains the 0-29 in the first column. If that's the trend, then I'm guessing the second column is the index of the words, and the last value is its tf-idf? It just struck me while I was typing my question, but kindly correct me if I'm wrong.
Could anyone with experience in this help me understand it better?
Yes. Technically, each line shows a (row, column) tuple giving the document index and the term index, and the number after it is the tf-idf value stored at that position. So it is basically showing the positions and values of the nonzero entries.
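If it helps to see the same information in a conventional layout, here is a small sketch that densifies the matrix and labels the columns (assuming a reasonably recent scikit-learn; older versions call the vocabulary accessor get_feature_names() instead):

import pandas as pd

# rows = documents, columns = vocabulary terms, cells = tf-idf weights
dense = pd.DataFrame(vectorized.toarray(),
                     columns=tfidfvec.get_feature_names_out())
print(dense.shape)                                        # (30, vocabulary size)
print(dense.iloc[0].sort_values(ascending=False).head())  # top terms of document 0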
I'm looking for a library easily usable from C++, Python or F# which can distinguish well-formed English sentences from "word salad". I tried the Stanford Parser and unfortunately, it parsed this:
Some plants have with done stems animals with exercise that to predict?
without a complaint. I'm not looking for something very sophisticated, able to handle all possible corner cases. I only need to filter out an obvious nonsense.
Here is something I just stumbled upon:
A general-purpose sentence-level nonsense detector, by a Stanford student named Ian Tenney.
Here is the code from the project, undocumented but available on GitHub.
If you want to develop your own solution based on this, I think you should pay attention to the 4th group of features used, i.e., the language model, under section 3, "Features and preprocessing".
It might not suffice, but I think getting a probability score for each subsequence of length n is a good start. 3-grams like "plants have with", "have with done", "done stems animals", "stems animals with" and "that to predict" seem rather improbable, which could lead to a "nonsense" label on the whole sentence.
This method has the advantage of relying on a learned model rather than on a set of hand-made rules, which afaik is your other option. Many people would point you to Chapter 8 of NLTK's manual, but I think that developing your own context-free grammar for general English is asking a bit much.
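As a rough illustration of the n-gram idea (not from the linked project), here is a minimal sketch using NLTK to extract the trigrams one would score with a language model; the scoring itself is omitted:

from nltk.util import ngrams

sentence = "Some plants have with done stems animals with exercise that to predict"
trigrams = list(ngrams(sentence.lower().split(), 3))

# each trigram would then be scored by a language model;
# a run of very low-probability trigrams suggests word salad
print(trigrams[:4])
# [('some', 'plants', 'have'), ('plants', 'have', 'with'),
#  ('have', 'with', 'done'), ('with', 'done', 'stems')]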
The paper was useful, but goes into too much depth for solving this problem. Here is the author's basic approach, heuristically:
Baseline sentence heuristic: first letter is capitalized, and line ends with one of .?! (1 feature).
Number of characters, words, punctuation, digits, and named entities (from Stanford CoreNLP NER tagger), and versions normalized by text length (10 features).
Part-of-speech distributional tags: (# / # words) for each Penn Treebank tag (45 features).
Indicators for the part-of-speech tag of the first and last token in the text (45x2 = 90 features).
Language model raw score (s_lm = log p(text)) and normalized score (s_lm / # words) (2 features).
However, after a lot of searching, I found that the GitHub repo only includes the tests and visualizations; the raw training and test data are not there. Here is his function for calculating these features:
(note: this uses pandas dataframes as df)
import re

def make_basic_features(df):
    """Compute basic features."""

    df['f_nchars'] = df['__TEXT__'].map(len)
    df['f_nwords'] = df['word'].map(len)

    punct_counter = lambda s: sum(1 for c in s
                                  if (not c.isalnum())
                                  and c not in [" ", "\t"])
    df['f_npunct'] = df['__TEXT__'].map(punct_counter)
    df['f_rpunct'] = df['f_npunct'] / df['f_nchars']

    df['f_ndigit'] = df['__TEXT__'].map(lambda s: sum(1 for c in s
                                                      if c.isdigit()))
    df['f_rdigit'] = df['f_ndigit'] / df['f_nchars']

    upper_counter = lambda s: sum(1 for c in s if c.isupper())
    df['f_nupper'] = df['__TEXT__'].map(upper_counter)
    df['f_rupper'] = df['f_nupper'] / df['f_nchars']

    df['f_nner'] = df['ner'].map(lambda ts: sum(1 for t in ts
                                                if t != 'O'))
    df['f_rner'] = df['f_nner'] / df['f_nwords']

    # Check standard sentence pattern:
    # if starts with capital, ends with .?!
    def check_sentence_pattern(s):
        ss = s.strip(r"""`"'""").strip()
        return ss[0].isupper() and (ss[-1] in '.?!')
    df['f_sentence_pattern'] = df['__TEXT__'].map(check_sentence_pattern)

    # Normalize any LM features
    # by dividing logscore by number of words
    lm_cols = {c: re.sub("_lmscore_", "_lmscore_norm_", c)
               for c in df.columns if c.startswith("f_lmscore")}
    for c, cnew in lm_cols.items():
        df[cnew] = df[c] / df['f_nwords']

    return df
So I guess that's a function you can use in this case. For the minimalist version:
import pandas as pd

raw = ["This is is a well-formed sentence",
       "but this ain't a good sent",
       "just a fragment"]
df = pd.DataFrame([{"__TEXT__": i, "word": i.split(), "ner": []} for i in raw])
The function seems to want the text, a list of the words, and the named entities recognized (NER) using the Stanford CoreNLP library, which is written in Java. You can pass in nothing (an empty list []) for the NER column and the function will calculate everything else. You'll get back a dataframe (like a matrix) with all the features of the sentences, which you can then use to decide what to call "well formed" by the rules given.
Also, you don't HAVE to use pandas here. A list of dictionaries will also work. But the original code used pandas.
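For completeness, a short usage sketch: applying the function above to the minimalist dataframe and inspecting a few of the resulting feature columns (the column choice here is just illustrative):

features = make_basic_features(df)
print(features[["f_nchars", "f_nwords", "f_npunct", "f_sentence_pattern"]])
# "This is is a well-formed sentence" fails f_sentence_pattern only because it
# lacks terminal punctuation; the fragment also fails on capitalization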
Because this example involved a lot of steps, I've created a gist where I run through an example up to the point of producing a clean list of sentences and a dirty list of not-well-formed sentences:
my gist: https://gist.github.com/marcmaxson/4ccca7bacc72eb6bb6479caf4081cefb
This replaces the Stanford CoreNLP Java library with spaCy, a newer and easier-to-use Python library that fills in the missing metadata, such as sentiment, named entities, and parts of speech used to determine if a sentence is well-formed. This runs under Python 3.6, but could work under 2.7; all the libraries are backwards compatible.
I'm working on a game where I need to find the biggest weight for a specific sentence.
Suppose I have the sentence "the quick brown fox" and assume only single words with their defined weight: "the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8
In this case the problem is trivial, as the solution consists in adding each words' weight.
Now assume we also add double words, so besides the above words, we also have "the quick" -> 5, "quick brown" -> 10, "brown fox" -> 1
I'd like to know which combination of single and double words provides the biggest weight; in this case it would be "the", "quick brown", "fox".
My question is: besides the obvious brute-force approach, is there any other possible way to obtain a solution? Needless to say, I'm looking for some optimal way to achieve this for larger sentences.
Thank you.
You can look at Integer Linear Programming (ILP) libraries like lp_solve. In this case, you will want to maximize the score, and your objective function will contain the weights. Then you can subject it to constraints, like not being able to have "quick brown" and "brown" at the same time.
For word alignment, this was used in this paper; your problem is much simpler, but you can browse through the paper to get an idea of how ILP was used. There are probably other algorithms besides ILP that can solve this optimally, but ILP can solve it optimally and efficiently for small problems.
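The answer mentions lp_solve; as a concrete (and hedged) illustration of the same formulation, here is a small sketch using the PuLP library instead, with a binary variable per candidate phrase and a covering constraint per word position. The weights are the ones from the question:

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

words = ["the", "quick", "brown", "fox"]
# candidate phrases as (text, start index, length) -> weight
candidates = {("the", 0, 1): 10, ("quick", 1, 1): 5, ("brown", 2, 1): 3,
              ("fox", 3, 1): 8, ("the quick", 0, 2): 5,
              ("quick brown", 1, 2): 10, ("brown fox", 2, 2): 1}

prob = LpProblem("max_weight_segmentation", LpMaximize)
x = {c: LpVariable(f"x_{i}", cat=LpBinary) for i, c in enumerate(candidates)}

# objective: total weight of the chosen phrases
prob += lpSum(w * x[c] for c, w in candidates.items())

# each word position must be covered by exactly one chosen phrase
for pos in range(len(words)):
    prob += lpSum(x[(p, s, l)] for (p, s, l) in candidates
                  if s <= pos < s + l) == 1

prob.solve()
chosen = [p for (p, s, l), var in x.items() if var.value() == 1]
print(chosen)  # expected: 'the', 'quick brown', 'fox' (total weight 28)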
"the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8
Say for the above individual words, I take an array [10, 5, 3, 8] for words 0, 1, 2, 3.
Traverse through the list and check whether the sum of two individual scores is less than the combined (two-word) score. For example:
10 + 5 > 5, so "the" + "quick" beats "the quick";
5 + 3 < 10, so "quick brown" beats "quick" + "brown" (mark this);
and so on.
While marking the combined solutions, mark them along continuous ranges (inclusive of both ends). For example, if the word scores are words = [1,2,5,3,1,4,6,2,6,8] and the combined (two-word) scores are [4,6,9,7,8,2,9,1,2], the marked ranges are [0,1], [2,5], [6,7].
The pseudocode is given below:

traverse from 0 to word_length - 1:
    if number not in a marked range:
        add word[number] to the overall sum
    else:
        if length of range == 1:
            add combined_word_score[lower_end_number]
        else if length of range == 2:
            add combined_word_score[lower_end_number + next number]
        else if length of range > 2 and is an odd number:
            add max(alternate_score starting at lower_end_number,
                    word[lower_end] + word[higher_end]
                        + alternate_score starting at next_number)
        else if length of range > 2 and is an even number:
            add max(alternate_score starting at lower_end_number + word[higher_end],
                    word[lower_end] + alternate_score starting at next_number)
This feels like a dynamic programming question.
I can imagine the k words of the sentence placed beside each other with a light bulb in between each pair of words (i.e., k-1 light bulbs in total). If a light bulb is switched on, that means the words adjoining it are part of a single phrase, and if it's off, they are not. So any configuration of these light bulbs indicates a possible combination of weights. Of course, many configurations are not even possible because we don't have scores for the phrases they require. So k-1 light bulbs mean there are at most 2^(k-1) possible answers for us to go through.
Rather than brute forcing it, we can recognize that there are parts of each computation that we can reuse for other computations, so for (The)(quick)(brown fox ... lazy dog) and (The quick)(brown fox ... lazy dog), we can compute the maximum score for (brown fox ... lazy dog) only once, memoize it and re-use it without doing any extra work the next time we see it.
Before we even start, we should get rid of the light bulbs that can have only one possible value (suppose we did not have the phrase 'brown fox' or any bigger phrase containing it; then the light bulb between 'brown' and 'fox' would always have to be turned off). Each removed bulb halves the solution space.
If w1, w2, w3, ... are the words, then the bulbs would be w1w2, w2w3, w3w4, etc. So
Optimal(w1w2 w2w3 w3w4 ...) = max(Optimal(w2w3 w3w4 ...) given w1w2 is on, Optimal(w2w3 w3w4 ...) given w1w2 is off)
(Caveat: if we reach a state with no possible solution, we just return MIN_INT and things should work out.)
We can solve the problem like this, but we can probably save even more time if we're clever about the order in which we approach the bulbs. Maybe attacking the center bulbs first might help; I am not sure about this part.
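To make the recurrence concrete, here is a minimal memoized sketch of the same idea in Python (phrase weights taken from the question). best(i) is the maximum weight achievable for the suffix starting at word i, which is equivalent to deciding the light bulbs left to right:

from functools import lru_cache

# phrase weights from the question, keyed by word tuples
weights = {("the",): 10, ("quick",): 5, ("brown",): 3, ("fox",): 8,
           ("the", "quick"): 5, ("quick", "brown"): 10, ("brown", "fox"): 1}
sentence = ("the", "quick", "brown", "fox")

@lru_cache(maxsize=None)
def best(i):
    """Maximum total weight for sentence[i:]."""
    if i == len(sentence):
        return 0
    scores = []
    # try every phrase length that has a defined weight at position i
    for length in (1, 2):
        phrase = sentence[i:i + length]
        if phrase in weights:
            scores.append(weights[phrase] + best(i + length))
    return max(scores) if scores else float("-inf")  # no valid segmentation

print(best(0))  # 28, i.e. "the" + "quick brown" + "fox"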