Spacy - Convert Token type into list - python-3.x

I have a few elements that I got after performing an operation in spaCy, and they have the following type.
Input:
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
I want all elements in the list to have type str when I iterate over them.
Expected output:
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'str'>
<class 'str'>
<class 'str'>
Please suggest an efficient way to do this.

A spaCy Token has an attribute called text.
Here's a complete example:
import spacy
nlp = spacy.load('en_core_web_sm')
t = (u"India Australia Brazil")
li = nlp(t)
for i in li:
print(i.text)
Or, if you want the list of tokens as a list of strings:
list_of_strings = [i.text for i in li]

Thanks for the solution and for sharing your knowledge. It works very well for converting a spaCy Doc/Span to a string or list of strings for further string operations.
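For reference, the same attribute exists one level up: Doc and Span objects also expose .text (a minimal sketch, assuming the en_core_web_sm model loaded in the answer above):
doc = nlp(u"India Australia Brazil")
span = doc[0:2]          # a Span covering the first two tokens
print(type(doc.text))    # <class 'str'> -- the whole document's text
print(type(span.text))   # <class 'str'> -- e.g. 'India Australia'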
You can also use this:
for i in li:
    print(str(i))

Related

Pandas set_index() seems to change the types for some rows to <class 'pandas.core.series.Series'>

I'm observing an unexpected behavior of the Pandas set_index() function.
In order to make my results reproducible I provide my DataFrame as a pickle file df_test.pkl.
df_test = pd.read_pickle('./df_test.pkl')
time id avg
0 1554985690182 117455392 4.06300000
1 1554985690288 117455393 0.95800000
2 1554985690641 117455394 2.38400000
...
Now, when I iterate over the rows and print the type of each "id" value I get <class 'numpy.int64'> for all cells.
for i in df_test.index:
    print(type(df_test.at[i,'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
...
Now I set the index to the "time" column and everything looks fine.
df_test = df_test.set_index(keys='time', drop=True)
id avg
time
1554985690182 117455392 4.06300000
1554985690288 117455393 0.95800000
1554985690641 117455394 2.38400000
...
But when I iterate again over the rows and print the type of each "id" value I get <class 'pandas.core.series.Series'> for some cells.
for i in df_test.index:
    print(type(df_test.at[i,'id']))
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
...
Does anyone know what is going on here?
UPDATE:
I have removed the "id_type" column from the df_test DataFrame, because it was not helpful. Thanks to #Let'stry for making me aware!
I think I found the answer myself.
There were duplicate timestamps in the "time" column. With duplicate values in the index, a label lookup such as df_test.at[i,'id'] matches several rows and comes back as a Series instead of a scalar, which is exactly the output above.
By the way, I found this issue by using the argument verify_integrity=True in the set_index() function, which raises an error when the new index would contain duplicates. So I recommend using that argument to avoid this kind of trouble.
df_test = df_test.set_index(keys='time', drop=True, verify_integrity=True)
Everything works fine now after I've removed the duplicate rows.
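For anyone hitting the same thing, a minimal sketch that reproduces the effect with made-up values (not the original df_test.pkl):
import pandas as pd

# Two rows deliberately share the same timestamp.
df = pd.DataFrame({
    'time': [1554985690182, 1554985690288, 1554985690288],
    'id':   [117455392, 117455393, 117455394],
})
df = df.set_index(keys='time', drop=True)

for i in df.index:
    # A unique label yields a scalar; the duplicated label matches two rows,
    # so the lookup comes back as a Series (exact behavior may vary by
    # pandas version).
    print(type(df.at[i, 'id']))

# verify_integrity=True raises a ValueError up front instead of silently
# creating a duplicated index:
# df.set_index(keys='time', drop=True, verify_integrity=True)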

How to preserve the datatype 'list' of a data frame while reading from csv or writing to csv

I want to preserve the datatype of a DataFrame column that contains lists when writing it to a CSV file, so that when I read it back, the values are still in list format.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
this is an example
df = pd.DataFrame({'name': ['jack','johnny','stokes'],
                   'letters': [['j','k'],['j','y'],['s','s']]})
print(type(df['letters'][0]))
df
<class 'list'>
name letters
0 jack [j, k]
1 johnny [j, y]
2 stokes [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>
You can use the ast module to turn the strings back into lists:
import ast
df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
In [1]: print(type(df2['letters'][0]))
Out[1]: <class 'list'>
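A round-trip sketch of the whole thing, using read_csv's converters argument to apply the parser per column while reading (same hypothetical namesss.csv as above):
import ast
import pandas as pd

df = pd.DataFrame({'name': ['jack', 'johnny', 'stokes'],
                   'letters': [['j', 'k'], ['j', 'y'], ['s', 's']]})
df.to_csv('namesss.csv', index=False)

# CSV stores each list as its string repr, e.g. "['j', 'k']";
# ast.literal_eval safely parses that back into a Python list.
df2 = pd.read_csv('namesss.csv', converters={'letters': ast.literal_eval})
print(type(df2['letters'][0]))  # <class 'list'>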

Remove stopwords list from list in Python (Natural Language Processing)

I have been trying to remove stopwords using Python 3, but my code does not seem to work. I want to know how to remove stopwords from the list below. The example structure is:
from nltk.corpus import stopwords
word_split1 = [['amazon','brand','- ','solimo','premium','almonds',',','250g','by','solimo'],
               ['hersheys','cocoa','powder',',','225g','by','hersheys'],
               ['jbl','t450bt','extra','bass','wireless','on-ear','headphones','with','mic','white','by','jbl','and']]
I am trying to remove stop words with the code below, and I would appreciate it if anyone can help me rectify the issue:
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in word_split1:
    if i not in stop_words:
        filtered_words.append(i)
I get this error:
Traceback (most recent call last):
File "<ipython-input-451-747407cf6734>", line 3, in <module>
if i not in stop_words:
TypeError: unhashable type: 'list'
You have a list of lists.
Try:
word_split1 = [['amazon','brand','- ','solimo','premium','almonds',',','250g','by','solimo'],
               ['hersheys','cocoa','powder',',','225g','by','hersheys'],
               ['jbl','t450bt','extra','bass','wireless','on-ear','headphones','with','mic','white','by','jbl','and']]
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in word_split1:
    for j in i:
        if j not in stop_words:
            filtered_words.append(j)
or flatten your list.
Ex:
from itertools import chain
word_split1 = [['amazon','brand','- ','solimo','premium','almonds',',','250g','by','solimo'],
               ['hersheys','cocoa','powder',',','225g','by','hersheys'],
               ['jbl','t450bt','extra','bass','wireless','on-ear','headphones','with','mic','white','by','jbl','and']]
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in chain.from_iterable(word_split1):
    if i not in stop_words:
        filtered_words.append(i)
or
filtered_words = [i for i in chain.from_iterable(word_split1) if i not in stop_words]
The list is nested (a list of lists), and you're trying to hash a list. Flatten it to a single list first, and then your code will work fine:
word_split1 = [j for x in word_split1 for j in x]
stop_words = set(stopwords.words('english'))
filtered_words = []
for i in word_split1:
    if i not in stop_words:
        filtered_words.append(i)
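If you'd rather keep the per-document structure instead of flattening, a nested comprehension works too (a sketch reusing the stop_words set from above):
filtered_per_doc = [[word for word in doc if word not in stop_words]
                    for doc in word_split1]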

Type Error when Lemmatizing words using NLTK

I have parsed 30 Excel files and created a pandas dataframe. I have tokenized the words, taken out stop words, and made bigrams. However, when I try to lemmatize, it gives me this error: TypeError: unhashable type: 'list'
Here's my code:
# Use simple pre-process to clean up data and tokenize
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

# Define function for removing stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# Define function for bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Define function for lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

# Lemmatize words
data_lemmatized = get_lemma(data_words_bigrams)
This is exactly where I get the error. How should I adjust my code to resolve this issue? Thank you in advance.
As suggested, here are the first few lines of the dataframe; the df.head() output was attached as a screenshot ("dataframe snap").
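The traceback comes from passing the whole list of lists to lemmatize(), which expects a single word. A sketch of one likely fix, applying the lemmatizer per token (assuming data_words_bigrams is a list of token lists, as built above):
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # build once; creating it per word is wasteful

def lemmatize_docs(docs):
    # Apply lemmatize() to each token, preserving the per-document structure.
    return [[lemmatizer.lemmatize(word) for word in doc] for doc in docs]

data_lemmatized = lemmatize_docs(data_words_bigrams)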

Parsing html tags with Python

I have been given a URL, and I want to extract the contents of the <BODY> tag from it.
I'm using Python 3. I came across sgmllib, but it is not available for Python 3.
Can someone please guide me with this? Can I use HTMLParser for this?
Here is what I tried:
import urllib.request
f = urllib.request.urlopen("URL")
s = f.read()
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)

parser = MyHTMLParser()
parser.feed(s)
This gives me the error: TypeError: Can't convert 'bytes' object to str implicitly
To fix the TypeError change line #3 to
s = str(f.read())
The web page you're getting is being returned in the form of bytes, and you need to change the bytes into a string to feed them to the parser.
If you take a look at your s variable, its type is bytes.
>>> type(s)
<class 'bytes'>
and if you take a look at HTMLParser.feed, it requires a string as an argument. So do:
>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)
or do x = str(s).
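To actually pull out only the contents of the <BODY> tag, as the question asks, the parser can track whether the tag is currently open. A sketch (not from the original answers; "URL" is a placeholder as in the question):
import urllib.request
from html.parser import HTMLParser

class BodyParser(HTMLParser):
    # Collects text that appears between <body> and </body>.
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body':  # HTMLParser lower-cases tag names
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)

with urllib.request.urlopen("URL") as f:
    html = f.read().decode('utf-8')  # decode the bytes before feeding the parser

parser = BodyParser()
parser.feed(html)
print(''.join(parser.chunks))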
