Is there any way to include punctuation in keras tokenizer?
I would like to have a transformation...
FROM
Tomorrow will be cold.
TO
Index-tomorrow, Index-will,...,Index-point
How can I achieve that?
This is possible if you do some pre-processing on the text.
First, you want to make sure that the punctuation is not filtered out by the Tokenizer. You can see from the documentation that the Tokenizer takes a filters argument on initialization. You can replace the default value with the set of characters you would like to filter out, and exclude the ones you want to keep in your index.
The second part is making sure that the punctuation is recognized as its own token. If you tokenize the example sentence, the result would treat "cold." as a token instead of "cold" and ".". What you need is a separator between the word and the punctuation. A naive approach is to replace the punctuation in the text with a space followed by the punctuation.
The following code does what you ask:
from keras.preprocessing.text import Tokenizer
t = Tokenizer(filters='!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n') # the default filters, without the .
text = "Tomorrow will be cold."
text = text.replace(".", " .")
t.fit_on_texts([text])
print(t.word_index)
-> prints: {'will': 2, 'be': 3, 'cold': 4, 'tomorrow': 1, '.': 5}
The replace logic can be done in a smarter way (e.g. with a regex if you want to capture all punctuation), but you get the gist.
A general solution, inspired by the one proposed by lmartens, uses a regular expression to replace a set of punctuation marks. Here is the code:
from keras.preprocessing.text import Tokenizer
import re
to_exclude = '"#$%&()*+-/<=>@[\\]^_`{|}~\t\n'
to_tokenize = '.,:;!?'
t = Tokenizer(filters=to_exclude) # the default filters, minus the punctuation we want to keep
text = "Tomorrow, will be. cold?"
text = re.sub(r'(['+to_tokenize+'])', r' \1 ', text)
t.fit_on_texts([text])
print(t.word_index) # {'tomorrow': 1, ',': 2, 'will': 3, 'be': 4, '.': 5, 'cold': 6, '?': 7}
Related
I am looking to automate my social-media hashtags in Zapier, dependent on the post title.
Input: High School English As A Second Language Teacher
Output: #High #School #English #Second #Language #Teacher
I found the regex (I think), which is \b(\w), to select the first letter of each word. However, this may not work in Python. I would also need exceptions, to remove words like "A", "As", "The" etc.
While this is possible, it becomes very tricky and error prone once there's any punctuation or other characters. Nevertheless, here's a simple first pass:
import re
title = input_data['title']
# 'High School English As A Second Language Teacher'
words = re.findall(r'\w{3,}', title)
# ['High', 'School', 'English', 'Second', 'Language', 'Teacher']
result = ' '.join(['#' + word for word in words])
# '#High #School #English #Second #Language #Teacher'
return {'result': result}
That finds all words that are 3 or more characters, adds a # to each, and joins them all into a big string. You can play with that regex here.
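If you also want to drop longer filler words such as "The" or "And", one option is to check each word against an explicit exclusion list. A rough sketch of that variant (the word list below is only an example, adjust it to your needs):
import re

EXCLUDED = {'a', 'as', 'the', 'and', 'of'}  # example exclusions, extend as needed

title = 'High School English As A Second Language Teacher'
words = re.findall(r'\w+', title)
result = ' '.join('#' + w for w in words if w.lower() not in EXCLUDED)
# '#High #School #English #Second #Language #Teacher'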
I am new to coding. I am trying to create a dictionary from a large body of text, and I would also like the most frequent word to be shown.
For example, if I had a block of text such as:
text = '''George Gordon Noel Byron was born, with a clubbed right foot, in London on January 22, 1788. He was the son of Catherine Gordon of Gight, an impoverished Scots heiress, and Captain John (“Mad Jack”) Byron, a fortune-hunting widower with a daughter, Augusta. The profligate captain squandered his wife’s inheritance, was absent for the birth of his only son, and eventually decamped for France as an exile from English creditors, where he died in 1791 at 36.'''
I know the steps I would like the code to take. I want words that are the same but capitalised to be counted together, so "Hi" and "hi" would count as Hi = 2.
I am trying to get the code to loop through the text and create a dictionary showing how many times each word appears. My final goal is to then have the code state which word appears most frequently.
I don't know how to approach such a large amount of text; the examples I have seen are for a much smaller number of words.
I have tried to remove whitespace and also create a loop, but I am stuck and unsure if I am going the right way about coding this problem.
a.replace(" ", "")
# this gave <built-in method replace of str object at 0x000001A49AD8DAE0>, I have no idea what this means!
print(a.replace)  # this is what I tried to write to remove white spaces
I am unsure of how to create the dictionary.
To count the word frequency would I do something like:
frequency = {}
for value in my_dict.values():
    if value in frequency:
        frequency[value] = frequency[value] + 1
    else:
        frequency[value] = 1
What I was expecting to get was a dictionary that lists each word shown with a numerical value showing how often it appears in the text.
Then I wanted to have the code show the word that occurs the most.
This may be too simple for your requirements, but you could do this to create a dictionary of each word and its number of repetitions in the text.
text = "..." # text here.
frequency = {}
for word in text.split(" "):
    if word not in frequency.keys():
        frequency[word] = 1
    else:
        frequency[word] += 1
print(frequency)
This only splits the text at each ' ' and counts the number of occurrences of each word.
If you want to get only the words, you may have to remove the ',' and other characters which you do not wish to have in your dictionary.
To remove characters such as ',', do:
text = text.replace(",", "")
Hope this helps and happy coding.
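Since you also wanted the most frequent word, once you have that dictionary you can get it with max(); a small sketch building on the frequency dict above:
most_frequent = max(frequency, key=frequency.get)
print(most_frequent, frequency[most_frequent])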
First, to remove all non-alphabet characters aside from the apostrophe ('), we can use a regex.
After that, we go through the list of words and count them with a dictionary.
import re
d = {}
text = text.split(" ")  # turns it into a list
text = [re.findall("[a-zA-Z']", text[i]) for i in range(len(text))]
# each word is split into characters, but non-alphabet/apostrophe characters are removed
text = ["".join(text[i]) for i in range(len(text))]
# puts each word back together
# there may be a better way to do the two steps above. If so, please tell.
for word in text:
    if word in d.keys():
        d[word] += 1
    else:
        d[word] = 1
d.pop("", None)
# tokens with no letters or apostrophes (e.g. "36.") end up as empty strings, so drop that key
You can use a regex and Counter from collections:
import re
from collections import Counter
text = "This cat is not a cat, even if it looks like a cat"
# Extract words with regex, ignoring symbols and space
words = re.compile(r"\b\w+\b").findall(text.lower())
count = Counter(words)
# {'cat': 3, 'a': 2, 'this': 1, 'is': 1, 'not': 1, 'even': 1, 'if': 1, 'it': 1, 'looks': 1, 'like': 1}
# To get the most frequent
most_frequent = max(count, key=lambda k: count[k])
# 'cat'
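As a side note, Counter also provides a helper for exactly this:
most_frequent_word, n = count.most_common(1)[0]
# ('cat', 3)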
I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenization, for both the encoder and the decoder.
The output is a stream of tokens from a seq2seq model. I want to detokenize the text to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spaCy to reverse the tokenization done by the rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
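For example, if you still have the original Doc, that whitespace information lets you rebuild the exact string; a minimal sketch (assuming the en_core_web_sm model is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This doesn't work.")
rebuilt = "".join(tok.text + tok.whitespace_ for tok in doc)
assert rebuilt == doc.text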
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
    if token.i == 0:
        return False
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False

def has_space(token):
    return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
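A minimal sketch of that joining rule, assuming each token comes as a (text, preceding-space affinity, following-space affinity) triple as described above (the triples below are just illustrations):
def detokenize(triples):
    out = []
    for i, (text, pre, _post) in enumerate(triples):
        # insert a space only if this token wants a preceding space
        # and the previous token wants a following space
        if i > 0 and pre and triples[i - 1][2]:
            out.append(" ")
        out.append(text)
    return "".join(out)

print(detokenize([("hello", 1, 0), (".", 0, 1)]))      # hello.
print(detokenize([("hello", 1, 1), ("world", 1, 1)]))  # hello world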
TL;DR
I've written some code that attempts to do it; the snippet is below.
Another approach, with a computational complexity of O(n^2), would be to use a function I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string


class detokenizer:
    """ This class is an attempt to detokenize a spaCy tokenized sentence """
    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get a list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get the detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of lengths of the spaCy tokenized words is equal
        to the length of the joined and then spaCy tokenized words...
        In other words, we only join if the join is reversible.
        e.g.:
        for the text ["The", "man", "."]
        we would join "man" with "."
        but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
In this approach you may easily merge "do" and "nt", as well as strip the space between a dot "." and the preceding word.
This method is not perfect, as there are multiple possible combinations of sentences that lead to specific spaCy tokenization.
I am not sure if there is a method to fully detokenize a sentence when all you have is spaCy separated text, but this is the best I've got.
After hours of searching on Google, only a few answers came up, with this very Stack Overflow question open in 3 of my Chrome tabs ;), and basically all they said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)
I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.
from nltk.corpus import stopwords
import nltk
import json
with open("/Users/titus/Desktop/trumptweets.json", 'r', encoding='utf8') as f:
    data = json.loads(f.readline())

tweets = []
for sentence in data:
    tokens = nltk.wordpunct_tokenize(sentence['text'])
    type(tokens)
    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha() and w not in
             stopwords.words('english') and w is not 'the']
    s = " "
    useful_sentence = s.join(words)
    tweets.append(useful_sentence)
print(tweets)
I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they've gone through the loop, the word "the" still occurs.
Question: How is it possible that there are still occurrences of "the" and "I" in the tweets? How should I fix this?
Beware of the processing order.
Here are two test strings for you:
THIS THE REMAINS.
this the is removed
Because "THE" is not "the". You lowercase after filtering, but you should lowercase first and then filter.
The bad news for you: k-means works horribly badly on noisy short texts like tweets, because it is sensitive to noise and TF-IDF vectors need fairly long texts to be reliable. So verify your results carefully; they are probably not as good as they may seem in the first flush of enthusiasm.
Have you tried lowercasing w in the check?
words = [w.lower() for w in text if w.isalpha() and w.lower() not in
         stopwords.words('english') and w.lower() != 'the']
is (and is not) is the identity (reference) check. It compares whether two variable names point to the same object in memory. Typically this is only used to compare with None, or in some other special cases.
In your case, use the != operator or the negation of == to compare with the string "the".
See also: Is there a difference between `==` and `is` in Python?
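Putting both fixes together (lowercase before filtering, and a membership test instead of an identity check), the comprehension could look like this; a sketch, not tested on your data, and note that 'the' and 'i' are already in NLTK's English stopword list:
stop = set(stopwords.words('english'))
words = [w.lower() for w in text if w.isalpha() and w.lower() not in stop]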
Right now, I have an amino acid string.
The amino acid mutation column looks like this: A59M, T133G, K2*, G1927?, and sometimes ? only.
So, I tried to use re to separate the one column into three columns and to remove the entries that are ? only, but keep ones like G1927?.
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w+)(\d+)(\S+)$',AA_mut)
But I got
(A5,9,M; T13,3,M;....)
Please give me some advice.
Thanks
\w matches letters and digits in Perl. It looks to me like it's doing the same thing in Python.
You might try being more explicit. Is that a single capital letter at the front? If so, maybe you want something like
^([A-Z])(\d+)(\D+)$
In perl:
print join ("<>", m/^([A-Z])(\d+)(\D+)$/) while <DATA>;
__DATA__
A59M
T133G
K2*
G1927?
?
prints
A<>59<>M
T<>133<>G
K<>2<>*
G<>1927<>?
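For reference, a sketch of the same regex used from Python (the data list is just the examples from the question):
import re

data = ["A59M", "T133G", "K2*", "G1927?", "?"]
pattern = re.compile(r'^([A-Z])(\d+)(\D+)$')

for s in data:
    m = pattern.match(s)
    if m:                  # the lone "?" does not match and is skipped
        print(m.groups())  # ('A', '59', 'M'), ('T', '133', 'G'), ...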
Assuming you have:
data = ["A59M", "T133G", "K2*", "G1927?", "?"]
You can extract it using:
out = [(s[0], s[1:-1], s[-1]) for s in data if len(s) > 2]
This gives me:
out == [('A', '59', 'M'), ('T', '133', 'G'),
('K', '2', '*'), ('G', '1927', '?')]
import re
AA_mut = AA_mut.replace('p.','')
m = re.search(r'^(\w)(\d+)(\S+)$',AA_mut)
I used this one to solve my problem. The original \w+ is greedy and leaves only one digit for \d+ and one letter for \S+. Once I removed the +, it matches only the first letter and leaves the other parts to the remaining groups.