Using NLTK, how to search for concepts in a text - python-3.x

I'm novice to both Python and NLTK. So, I'm trying to see the representation of some concepts in text using NLTK. I have a CSV file which looks like this image
And I want to see how frequent, e.g., Freedom, Courage, and all other concepts are. I also want to know how to make sure the code looks for bi and trigrams. However, the code I have below only allows me to look for a single list of words in a text (Preps.txt like this ).
The output I expect is something like:
Concept = Frequency in text, i.e., Freedom = 10, Courage = 20
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/Muhsa/Myfolder/Concepts' #this is where the texts I want to study are located
Concepts= PlaintextCorpusReader(corpus_root, '.*')
Concepts.fileids()
for fileid in Concepts.fileids():
text3 = Concepts.words(fileid)
from nltk import word_tokenize
from nltk import FreqDist
text3 = Concepts.words(fileid)
preps = open('preps.txt', encoding="utf-8")
rawpreps = preps.read() #preps refer to the file that has the list of words
tokens = word_tokenize(rawpreps)
texty = nltk.Text(tokens)
fdist = nltk.FreqDist(w.lower() for w in text3)
for m in texty:
print(m + ':', fdist[m], end=' ')

I reorganised your code a little bit. I assumed you had 1 file per concept words, and that 'preps.txt' only contained the courage words but not the others.
I hope it is easy to understand.
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import word_tokenize
from nltk import FreqDist
# Load the courage vocabulary
with open('preps.txt', encoding="utf-8") as file:
content = file.read() #preps refer to the file that has the list of words
courage_words = content.split('\n') # This is a list of words
# load freedom and development words in the same fashion
# Load the corpus
corpus_root = '/Users/Muhsa/Myfolder/Concepts' #this is where the texts I want to study are located
corpus = PlaintextCorpusReader(corpus_root, '.*')
# Count the number of word in the whole corpus that are also in the courage vocabulry
courage_freq = len([w for w in corpus.words() if w in courage_words])
print('Corpus contains {} courage words'.format(courage_freq))
# For each file in the corpus
for file_id in corpus.fileids():
# Count the number of word in the file that are also in courage word
file_freq = len([w for w in corpus.words(file_id) if w in courage_words])
print(file_id, file_freq)
Or better
# Load concept vocabulary in different files, in a python dictionary
concept_voc = {}
for file_path in ['courage.txt', 'freedom.txt', 'development.txt']:
concept_name = file_path.replace('.txt', '')
with open(file_path) as f:
voc = f.read().split('\n')
concept_voc[concept_name] = voc
# Load concept vocabulary in a csv file, each column is one vocabulary, the first line is the "name"
df = pd.read_csv('to_dict.csv')
convept_voc = df.to_dict('columns')
# concept_voc['courage'] returns the list of courage words
# And then for each concept compute the frequency as before
for concept in concept_voc:
voc = concept_voc[concept]
corpus_freq = len([w for w in corpus.words() if w in voc])
print(concept, '=', corpus_freq)

Related

How to iterate on keras Dataset and edit content

I am working on this movie classification problem
https://www.tensorflow.org/tutorials/keras/text_classification
In this example text files(12500 files with movie revies) are read and a batched dataset is prepared like below
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
'aclImdb/train',
batch_size=batch_size,
validation_split=0.2,
subset='training',
seed=seed)
at the time of standardization
def custom_standardization(input_data):
lowercase = tf.strings.lower(input_data)
stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
#I WANT TO REMOVE STOP WORDS HERE, CAN I DO
return tf.strings.regex_replace(stripped_html,'[%s]' % re.escape(string.punctuation),'')
Problem: I understand that I have got training dataset with labels in variable 'raw_train_ds'. Now I want to iterate over this dataset and remove stop words from the movie review text and store back to same variable, I tried to do it in function 'custom_standardization' but it gives type error,
I also tried to use tf.strings.as_strings but it returns error
InvalidArgumentError: Value for attr 'T' of string is not in the list of allowed values: int8, int16, int32, int64
can someone please help on it OR simply please help how to remove stopwords from the batch dataset
It looks like right now TensorFlow does not have built in support for stop words removal, just basic standardization (lowercase & punctuation stripping). The TextVectorization used in the tutorial supports a custom standardization callback, but I couldn't find any stop words examples.
Since the tutorial downloads the imdb dataset and reads the text files from disc you can just do standardization manually with python before reading them. This will modify the text files themselves, but then you can read in the files normally using tf.keras.preprocessing.text_dataset_from_directory, and the entries will already have the stop words removed.
#!/usr/bin/env python3
import pathlib
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
def cleanup_text_files_in_folder(folder_name):
text_files = []
for file_path in pathlib.Path(folder_name).glob('*.txt'):
text_files.append(str(file_path))
print(f'Found {len(text_files)} files in {folder_name}')
# Give some kind of status
i = 0
for text_file in text_files:
replace_file_contents(text_file)
i += 1
if i % 1000 == 0:
print("No of files processed =", i)
return text_files
def replace_file_contents(input_file):
"""
This will read in the contents of the text file, process it (clean up, remove stop words)
and overwrite the new 'processed' output to that same file
"""
with open(input_file, 'r') as file:
file_data = file.read()
file_data = process_text_adv(file_data)
with open(input_file, 'w') as file:
file.write(file_data)
def process_text_adv(text):
# review without HTML tags
text = BeautifulSoup(text, features="html.parser").get_text()
# review without punctuation and numbers
text = re.sub(r'[^\w\s]','',text, re.UNICODE)
# lowercase
text = text.lower()
# simple split
text = text.split()
swords = set(stopwords.words("english")) # conversion into set for fast searching
text = [w for w in text if w not in swords]
# joining of splitted paragraph by spaces and return
return " ".join(text)
if __name__ == "__main__":
# Download & untar dataset beforehand, then running this would modify the text files
# in place. Back up the originals if that's a concern.
cleanup_text_files_in_folder('aclImdb/train/pos/')
cleanup_text_files_in_folder('aclImdb/train/neg/')
cleanup_text_files_in_folder('aclImdb/test/pos/')
cleanup_text_files_in_folder('aclImdb/test/neg/')

Having issues computing the average of compound sentiment values for each text file in a folder

# below is the sentiment analysis code written for sentence-level analysis
import glob
import os
import nltk.data
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize
# Next, VADER is initialized so I can use it within the Python script
sid = SentimentIntensityAnalyzer()
# I will also initialize the 'english.pickle' function and give it a short
name
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#Each of the text file is listed from the folder speeches
files = glob.glob(os.path.join(os.getcwd(), 'cnn_articles', '*.txt'))
text = []
#iterate over the list getting each file
for file in files:
#open the file and then call .read() to get the text
with open(file) as f:
text.append(f.read())
text_str = "\n".join(text)
# This breaks up the paragraph into a list of strings.
sentences = tokenizer.tokenize(text_str )
sent = 0.0
count = 0
# Iterating through the list of sentences and extracting the compound scores
for sentence in sentences:
count +=1
scores = sid.polarity_scores(sentence)
sent += scores['compound'] #Adding up the overall compound sentiment
# print(sent, file=open('cnn_compound.txt', 'a'))
if count != 0:
sent = float(sent / count)
print(sent, file=open('cnn_compound.txt', 'a'))
With these lines of code, I have been able to get the average of all the compound sentiment values for all the text files. What I really want is the
average compound sentiment value for each text file, such that if I have 10
text files in the folder, I will have 10 floating point values representing
each of the text file. So that I can plot these values against each other.
Kindly assist me as I am very new to Python.
# below is the sentiment analysis code written for sentence-level analysis
import os, string, glob, pandas as pd, numpy as np
import nltk.data
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize
# Next, VADER is initialized so I can use it within the Python
script
sid = SentimentIntensityAnalyzer()
exclude = set(string.punctuation)
# I will also initialize the 'english.pickle' function and give
it a short
name
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#Each of the text file is listed from the folder speeches
files = glob.glob(os.path.join(os.getcwd(), 'cnn_articles',
'*.txt'))
text = []
sent = 0.0
count = 0
cnt = 0
#iterate over the list getting each file
for file in files:
f = open(file).read().split('.')
cnt +=1
count = (len(f))
for sentence in f:
if sentence not in exclude:
scores = sid.polarity_scores(sentence)
print(scores)
break
sent += scores['compound']
average = round((sent/count), 4)
t = [cnt, average]
text.append(t)
break
df = pd.DataFrame(text, columns=['Article Number', 'Average
Value'])
#
#df.to_csv(r'Result.txt', header=True, index=None, sep='"\t\"
+"\t\"', mode='w')
df.to_csv('cnn_result.csv', index=None)

Python (NLTK) - more efficient way to extract noun phrases?

I've got a machine learning task involving a large amount of text data. I want to identify, and extract, noun-phrases in the training text so I can use them for feature construction later on in the pipeline.
I've extracted the type of noun-phrases I wanted from text but I'm fairly new to NLTK, so I approached this problem in a way where I can break down each step in list comprehensions like you can see below.
But my real question is, am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?
import nltk
import pandas as pd
myData = pd.read_excel("\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label == 'NP')] for tree in phrases]
flatten the list of lists of lists of tuples that we've ended up with, into
just a list of lists of tuples
leaves = [tupls for sublists in leaves for tupls in sublists]
Join the extracted terms into one bigram
nounphrases = [unigram[0][1]+' '+unigram[1][0] in leaves]
Take a look at Why is my NLTK function slow when processing the DataFrame?, there's no need to iterate through all rows multiple times if you don't need intermediate steps.
With ne_chunk and solution from
NLTK Named Entity recognition to a Python list and
How can I extract GPE(location) using NLTK ne_chunk?
[code]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
def get_continuous_chunks(text, chunk_func=ne_chunk):
chunked = chunk_func(pos_tag(word_tokenize(text)))
continuous_chunk = []
current_chunk = []
for subtree in chunked:
if type(subtree) == Tree:
current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
elif current_chunk:
named_entity = " ".join(current_chunk)
if named_entity not in continuous_chunk:
continuous_chunk.append(named_entity)
current_chunk = []
else:
continue
return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks((sent)))
[out]:
0 [New York]
1 [Washington, Bruce Wayne]
Name: text, dtype: object
To use the custom RegexpParser :
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import pandas as pd
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)
def get_continuous_chunks(text, chunk_func=ne_chunk):
chunked = chunk_func(pos_tag(word_tokenize(text)))
continuous_chunk = []
current_chunk = []
for subtree in chunked:
if type(subtree) == Tree:
current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
elif current_chunk:
named_entity = " ".join(current_chunk)
if named_entity not in continuous_chunk:
continuous_chunk.append(named_entity)
current_chunk = []
else:
continue
return continuous_chunk
df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.',
'Another bar foo Washington DC thingy with Bruce Wayne.']})
df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
[out]:
0 [bar sentence, New York city]
1 [bar foo Washington DC thingy, Bruce Wayne]
Name: text, dtype: object
I suggest referring to this prior thread:
Extracting all Nouns from a text file using nltk
They suggest using TextBlob as the easiest way to achieve this (if not the one that is most efficient in terms of processing) and the discussion there addresses your question.
from textblob import TextBlob
txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
blob = TextBlob(txt)
print(blob.noun_phrases)
The above methods didn't give me the required results. Following is the function that I would suggest
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree
import re
def get_noun_phrases(text):
pos = pos_tag(word_tokenize(text))
count = 0
half_chunk = ""
for word, tag in pos:
if re.match(r"NN.*", tag):
count+=1
if count>=1:
half_chunk = half_chunk + word + " "
else:
half_chunk = half_chunk+"---"
count = 0
half_chunk = re.sub(r"-+","?",half_chunk).split("?")
half_chunk = [x.strip() for x in half_chunk if x!=""]
return half_chunk
The Constituent-Treelib library, which can be installed via: pip install constituent-treelib does excatly what you are looking for in few lines of code. In order to extract noun (or any other) phrases, perform the following steps.
from constituent_treelib import ConstituentTree
# First, we have to provide a sentence that should be parsed
sentence = "I've got a machine learning task involving a large amount of text data."
# Then, we define the language that should be considered with respect to the underlying models
language = ConstituentTree.Language.English
# You can also specify the desired model for the language ("Small" is selected by default)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Next, we must create the neccesary NLP pipeline.
# If you wish, you can instruct the library to download and install the models automatically
nlp = ConstituentTree.create_pipeline(language, spacy_model_size) # , download_models=True
# Now, we can instantiate a ConstituentTree object and pass it the sentence and the NLP pipeline
tree = ConstituentTree(sentence, nlp)
# Finally, we can extract the phrases
tree.extract_all_phrases()
Result...
{'S': ["I 've got a machine learning task involving a large amount of text data ."],
'PP': ['of text data'],
'VP': ["'ve got a machine learning task involving a large amount of text data",
'got a machine learning task involving a large amount of text data',
'involving a large amount of text data'],
'NML': ['machine learning'],
'NP': ['a machine learning task involving a large amount of text data',
'a machine learning task',
'a large amount of text data',
'a large amount',
'text data']}
If you only want the noun phrases, just pick them out with tree.extract_all_phrases()['NP']
['a machine learning task involving a large amount of text data',
'a machine learning task',
'a large amount of text data',
'a large amount',
'text data']

Extracting words from pdf using python 3?

we are extracting words from resumes in a pdf format.
ONE WAY OF DOING IT!
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open('resume1.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Output:
2
Mostrecentversionalwaysavailableat
nlp.stanford.edu/
˘
rkarthik/cv.html
KarthikRaghunathan
Mobile:
+1-650-384-5782
Email:
kr
f
csDOTstanfordDOTedu
g
Homepage:
nlp.stanford.edu/
˘
rkarthik
ResearchInterests
Intelligence,NaturalLanguageProcessing,Human-RobotInteraction
EducationStanfordUniversity
,California2008onwards
MasterofScienceinComputerScienceCurrentGPA:3.91/4.00
NationalInstituteofTechnology(NIT)
,Calicut,India2004-2008
BachelorofTechnologyinComputerScienceandEngineeringCGPA:9.14/10.00
SoftwareSkills
ProgrammingLanguages
:C,C
++
,Perl,Java,C
#
,MATLAB,Lisp,SQL,MDX,Intelx86
assembly
Speech/NLP/AITools
:HMMToolkit(HTK),CMUSphinxAutomaticSpeechRecogni-
tionSystem,FestivalSpeechSynthesisSystem,VoiceXML,BerkeleyAligner,Giza++,Moses
StatisticalMachineTranslationToolkit,RobotOperatingSystem(ROS)
OtherTools
:L
A
T
E
X,LEX,YACC,Vim,Eclipse,MicrosoftVisualStudio,MicrosoftSQLServer
ManagementStudio,TestNGJavaTestingPlatform,SVN
OperatingSystems
:Linux,Windows,DOS
WorkExperienceMicrosoftCorporationSoftwareDevelopmentEngineerIntern
Redmond,WAJune2009-Sept2009
WorkedwiththeRevenue&RelevanceTeamatMicrosoftadCenteronthe
adCenterMarket-
placeScorecard
project,aimedatdevelopingastandardreliablesetofmetricsthatmeasure
thecompany'sperformanceintheonlineadvertisingmarketplaceandaidinmakinginformed
decisionstomaximizethemarketplacevalue.Alsoinitiatedtheonastatisticallearning
modelthatectivelypredictschangesintheadvertisers'biddingbehaviorwithtime.
StanfordNaturalLanguageProcessingGroupGraduateResearchAssistant
StanfordUniversity,CASept2008onwards
WorkingonStanford'sstatisticalmachinetranslation(SMT)system(aspartoftheDARPA
GALEProgram)undertheguidanceofProf.ChristopherManning.LedStanford'sfor
theGALEPhase3Chinese-EnglishMTevaluationaspartoftheIBM-Rosettateam.
MicrosoftResearch(MSR)LabIndiaResearchIntern
Bangalore,IndiaApr2007-Jul2007
Investigatedthetoleranceofstatisticalmachinetranslationsystemstonoiseinthetraining
corpus,particularlythekindofnoisethataccompaniesautomaticextractionofparallelcorpora
fromcomparablecorpora.AlsoworkedonthedesignofanonlinegameforNLPdataacquisition.
InternationalInstituteofInformationTechnology(IIIT)SummerIntern
Hyderabad,IndiaApr2006-Jun2006
Workedontherapidprototypingofrestricteddomainspokendialogsystems(SDS)forIndian
languages.Developedthe
IIITReceptionist
,aSDSinTamil,TeluguandEnglishlanguages,
whichfunctionedasanautomaticreceptionistforIIIT.
CourseProjectsNormalizationoftextinSMSmessagesusinganSMTsystem
Apr2009-Jun2009
Developedasystemforconvertingtextspeak(languageusedinSMScommunication)toproper
EnglishusingtheMosesstatisticalmachinetranslationsystem.
STAIRspokendialogproject
Jan2009-Apr2009
DevelopedaspokendialoginterfacetotheStanfordAIRobot(STAIR)forgivinginstructions
forfetchingtasks,undertheguidanceofProf.DanJurafskyandProf.AndrewNg.
The words are not extracted as keywords or words and this hapazard thing appears.
Another way of doing it.
import PyPDF2
#import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
#write a for-loop to open many files -- leave a comment if you'd #like to
learn how
filename = 'sample.pdf'
#open allows you to read the file
pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all #the
pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText()
#This if statement exists to check if the above library returned #words.
It's done because PyPDF2 cannot read scanned files.
if text != "":
text = text
#If the above returns as False, we run the OCR library textract to #convert
scanned/image based PDF files into text
else:
text = textract.process(fileurl, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived #from our
PDF file.
# Now, we will clean our text variable, and return it as a list of keywords.
print(text)
#The word_tokenize() function will break our text phrases into #individual
words
tokens = word_tokenize(text)
#print(tokens)
#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',',' ']
#We initialize the stopwords variable which is a list of words like #"The",
"I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
#We create a list comprehension which only returns a list of words #that are
NOT IN stop_words and NOT IN punctuations.
keywords = []
keywords = [word for word in tokens if not word in stop_words and not word
in string.punctuation]
print(keywords)
Output:
For this the output is the same but textract module cannot be found.
Question: Can anyone correct the code or give a new one to help with the work?

Print only topic name using LDA with python

I need to print only the topic word (only one word). But it contains some number, But I can not get only the topic name like "Happy". My String word is "Happy", why it shows "Happi"
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import string
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
fr = open('Happy DespicableMe.txt','r')
doc_a = fr.read()
fr.close()
doc_set = [doc_a]
texts = []
for i in doc_set:
raw = i.lower()
tokens = tokenizer.tokenize(raw)
stopped_tokens = [i for i in tokens if not i in en_stop]
stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word = dictionary, passes=20)
rafa = ldamodel.show_topics(num_topics=1, num_words=1, log=False , formatted=False)
print(rafa)
It only shows [(0, '0.142*"happi"')]. But I want to print only the word.
You are plagued by a misunderstanding:
Stemming extracts the stem of a word through a series of transformation rules stripping off common suffixes and prefixes. Indeed, the resulting stem is not necessarily an actual English word. The purpose use of stemming is to normalize words for comparison. E.g.
stem_word('happy') == stem_word('happier')
What you need is a Lemmatizer (e.g. nltk.stem.wordnet) to lookup lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
After you have install the corpus/wordnet you can use it like this:
from nltk.corpus import wordnet
syns = wordnet.synsets("happier")
print(syns[0].lemmas()[0].name())
Output:
happy

Resources