Spacy - Convert Token type into list - python-3.x

I have a few elements that I got after performing an operation in spaCy, and they have the following type.
Input:
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
I want all elements in the list to have type str when I iterate over them.
Expected output:
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'str'>
<class 'str'>
<class 'str'>
Please suggest an efficient way to do this.

A spaCy Token has an attribute called text.
Here's a complete example:
import spacy
nlp = spacy.load('en_core_web_sm')
t = (u"India Australia Brazil")
li = nlp(t)
for i in li:
print(i.text)
Or, if you want the list of tokens as a list of strings:
list_of_strings = [i.text for i in li]

Thanks for the solution and for sharing your knowledge. It works very well for converting a spaCy Doc/Span to a string or list of strings for further string operations.
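For reference, the same attribute exists one level up: Doc and Span objects also expose .text (a minimal sketch, assuming the en_core_web_sm model loaded in the answer above):
doc = nlp(u"India Australia Brazil")
span = doc[0:2]          # a Span covering the first two tokens
print(type(doc.text))    # <class 'str'> -- the whole document's text
print(type(span.text))   # <class 'str'> -- e.g. 'India Australia'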
You can also use this:
for i in li:
    print(str(i))

Related

Pandas set_index() seems to change the types for some rows to <class 'pandas.core.series.Series'>

I'm observing an unexpected behavior of the Pandas set_index() function.
In order to make my results reproducible I provide my DataFrame as a pickle file df_test.pkl.
df_test = pd.read_pickle('./df_test.pkl')
time id avg
0 1554985690182 117455392 4.06300000
1 1554985690288 117455393 0.95800000
2 1554985690641 117455394 2.38400000
...
Now, when I iterate over the rows and print the type of each "id" value I get <class 'numpy.int64'> for all cells.
for i in df_test.index:
    print(type(df_test.at[i,'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
...
Now I set the index to the "time" column and everything looks fine.
df_test = df_test.set_index(keys='time', drop=True)
id avg
time
1554985690182 117455392 4.06300000
1554985690288 117455393 0.95800000
1554985690641 117455394 2.38400000
...
But when I iterate again over the rows and print the type of each "id" value I get <class 'pandas.core.series.Series'> for some cells.
for i in df_test.index:
    print(type(df_test.at[i,'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
...
Does anyone know what is going on here?
UPDATE:
I have removed the "id_type" column from the df_test DataFrame, because it was not helpful. Thanks to #Let'stry for making me aware!
I think I found the answer myself.
There were duplicate timestamps in the "time" column. With duplicate values in the index, a label lookup such as df_test.at[i,'id'] matches several rows and comes back as a Series instead of a scalar, which is exactly the output above.
By the way, I found this issue by using the argument verify_integrity=True in the set_index() function, which raises an error when the new index would contain duplicates. So I recommend using that argument to avoid this kind of trouble.
df_test = df_test.set_index(keys='time', drop=True, verify_integrity=True)
Everything works fine now after I've removed the duplicate rows.
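For anyone hitting the same thing, a minimal sketch that reproduces the effect with made-up values (not the original df_test.pkl):
import pandas as pd

# Two rows deliberately share the same timestamp.
df = pd.DataFrame({
    'time': [1554985690182, 1554985690288, 1554985690288],
    'id':   [117455392, 117455393, 117455394],
})
df = df.set_index(keys='time', drop=True)

for i in df.index:
    # A unique label yields a scalar; the duplicated label matches two rows,
    # so the lookup comes back as a Series (exact behavior may vary by
    # pandas version).
    print(type(df.at[i, 'id']))

# verify_integrity=True raises a ValueError up front instead of silently
# creating a duplicated index:
# df.set_index(keys='time', drop=True, verify_integrity=True)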

How to preserve the datatype 'list' of a data frame while reading from csv or writing to csv

I want to preserve the datatype of a DataFrame column that contains lists when writing it to a CSV file, so that when I read it back, the values are still in list format.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
this is an example
df = pd.DataFrame({'name': ['jack','johnny','stokes'],
                   'letters': [['j','k'],['j','y'],['s','s']]})
print(type(df['letters'][0]))
df
<class 'list'>
name letters
0 jack [j, k]
1 johnny [j, y]
2 stokes [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>
You can use the ast module to turn the strings back into lists:
import ast
df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
In [1]: print(type(df2['letters'][0]))
Out[1]: <class 'list'>
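A round-trip sketch of the whole thing, using read_csv's converters argument to apply the parser per column while reading (same hypothetical namesss.csv as above):
import ast
import pandas as pd

df = pd.DataFrame({'name': ['jack', 'johnny', 'stokes'],
                   'letters': [['j', 'k'], ['j', 'y'], ['s', 's']]})
df.to_csv('namesss.csv', index=False)

# CSV stores each list as its string repr, e.g. "['j', 'k']";
# ast.literal_eval safely parses that back into a Python list.
df2 = pd.read_csv('namesss.csv', converters={'letters': ast.literal_eval})
print(type(df2['letters'][0]))  # <class 'list'>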

Remove stopwords list from list in Python (Natural Language Processing)

I have been trying to remove stopwords using Python 3, but my code does not seem to work. I want to know how to remove stopwords from the list below. The example structure is:
from nltk.corpus import stopwords
word_split1 = [['amazon','brand','- ','solimo','premium','almonds',',','250g','by','solimo'],
               ['hersheys','cocoa','powder',',','225g','by','hersheys'],
               ['jbl','t450bt','extra','bass','wireless','on-ear','headphones','with','mic','white','by','jbl','and']]
I am trying to remove stop words with the code below, and I would appreciate it if anyone can help me rectify the issue:
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in word_split1:
    if i not in stop_words:
        filtered_words.append(i)
I get this error:
Traceback (most recent call last):
File "<ipython-input-451-747407cf6734>", line 3, in <module>
if i not in stop_words:
TypeError: unhashable type: 'list'
You have a list of lists.
Try:
word_split1 = [['amazon','brand','- ','solimo','premium','almonds',',','250g','by','solimo'],
               ['hersheys','cocoa','powder',',','225g','by','hersheys'],
               ['jbl','t450bt','extra','bass','wireless','on-ear','headphones','with','mic','white','by','jbl','and']]
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in word_split1:
    for j in i:
        if j not in stop_words:
            filtered_words.append(j)
or flatten your list.
Ex:
from itertools import chain
word_split1 = [['amazon','brand','- ','solimo','premium','almonds',',','250g','by','solimo'],
               ['hersheys','cocoa','powder',',','225g','by','hersheys'],
               ['jbl','t450bt','extra','bass','wireless','on-ear','headphones','with','mic','white','by','jbl','and']]
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in chain.from_iterable(word_split1):
    if i not in stop_words:
        filtered_words.append(i)
or
filtered_words = [i for i in chain.from_iterable(word_split1) if i not in stop_words]
The list is nested (a list of lists), and you're trying to hash a list. Flatten it to a single list first, and then your code will work fine:
word_split1 = [j for x in word_split1 for j in x]
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in word_split1:
    if i not in stop_words:
        filtered_words.append(i)
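If you'd rather keep the per-document structure instead of flattening, a nested comprehension works too (a sketch reusing the stop_words set from above):
filtered_per_doc = [[word for word in doc if word not in stop_words]
                    for doc in word_split1]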

Type Error when Lemmatizing words using NLTK

I have parsed 30 Excel files and created a pandas dataframe. I have tokenized the words, taken out stop words, and made bigrams. However, when I try to lemmatize, it gives me this error: TypeError: unhashable type: 'list'
Here's my code:
# Use simple pre-process to clean up data and tokenize
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

# Define function for removing stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# Define function for bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Define function for lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

# Lemmatize words
data_lemmatized = get_lemma(data_words_bigrams)
This is exactly where I get the error. How should I adjust my code to resolve this issue? Thank you in advance.
As suggested, here are the first few lines of the dataframe; the df.head() output was attached as a screenshot ("dataframe snap").
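The traceback comes from passing the whole list of lists to lemmatize(), which expects a single word. A sketch of one likely fix, applying the lemmatizer per token (assuming data_words_bigrams is a list of token lists, as built above):
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # build once; creating it per word is wasteful

def lemmatize_docs(docs):
    # Apply lemmatize() to each token, preserving the per-document structure.
    return [[lemmatizer.lemmatize(word) for word in doc] for doc in docs]

data_lemmatized = lemmatize_docs(data_words_bigrams)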

Parsing html tags with Python

I have been given a URL, and I want to extract the contents of the <BODY> tag from it.
I'm using Python 3. I came across sgmllib, but it is not available for Python 3.
Can someone please guide me with this? Can I use HTMLParser for this?
Here is what I tried:
import urllib.request
f = urllib.request.urlopen("URL")
s = f.read()
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)

parser = MyHTMLParser()
parser.feed(s)
This gives me the error: TypeError: Can't convert 'bytes' object to str implicitly
To fix the TypeError change line #3 to
s = str(f.read())
The web page you're getting is being returned in the form of bytes, and you need to change the bytes into a string to feed them to the parser.
If you take a look at your s variable, its type is bytes.
>>> type(s)
<class 'bytes'>
and if you take a look at HTMLParser.feed, it requires a string as an argument. So do:
>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)
or do x = str(s).
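To actually pull out only the contents of the <BODY> tag, as the question asks, the parser can track whether the tag is currently open. A sketch (not from the original answers; "URL" is a placeholder as in the question):
import urllib.request
from html.parser import HTMLParser

class BodyParser(HTMLParser):
    # Collects text that appears between <body> and </body>.
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body':  # HTMLParser lower-cases tag names
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)

with urllib.request.urlopen("URL") as f:
    html = f.read().decode('utf-8')  # decode the bytes before feeding the parser

parser = BodyParser()
parser.feed(html)
print(''.join(parser.chunks))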
