Preparing large txt file for gensim FastText unsupervised model - python-3.x

When I attempt to run FastText using gensim in Python, the best I can get is a most_similar result in which every entry is a single character. (I'm on a Windows machine, which I've heard affects the result.)
I have all of my data stored either in a csv file in which I've already tokenized each sentence, or in the original txt file I started with. When I try to use the csv file, I end up with the single-character result.
Here's the code I'm using to process my csv file (I'm analyzing how sports articles discuss white vs. nonwhite NFL quarterbacks differently; this is the code for my NonWhite results csv file):
from gensim.models import FastText
from gensim.test.utils import get_tmpfile, datapath
from gensim import utils  # needed for utils.open below
import os

embedding_size = 200
window_size = 10
min_word = 5
down_sampling = 1e-2

if os.path.isfile(modelpath):
    model1 = FastText.load(modelpath)
else:
    class NWIter():
        def __iter__(self):
            path = datapath(csvpath)
            with utils.open(path, 'r') as fin:
                for line in fin:
                    yield line

    model1 = FastText(vector_size=embedding_size, window=window_size,
                      min_count=min_word, sample=down_sampling, workers=4)
    model1.build_vocab(corpus_iterable=NWIter())
    exs1 = model1.corpus_count
    model1.train(corpus_iterable=NWIter(), total_examples=exs1, epochs=50)
    model1.save(modelpath)
The cleaned CSV data looked like this, with each row representing a sentence that had been cleaned (stopwords removed, tokenized, and lemmatized).
When that didn't work, I attempted to bring in the raw text but got lots of UTF-8 encoding errors with unrecognizable characters. I attempted to work around this issue, finally getting to a point where it tried to read in the raw text file, only for the single-character results to come back.
So the issue persists whether I use my csv file or the txt file. I'd prefer to stick with the csv since I've already processed the information; how can I bring that data in without Python (or gensim) treating individual characters as the unit of analysis?
Edit:
Here are the results I get when I run:
print('NonWhite: ',model1.wv.most_similar('smart', topn=10))
NonWhite: [('d', 0.36853086948394775), ('q', 0.326141357421875), ('s', 0.3181183338165283), ('M', 0.27458563446998596), ('g', 0.2703150510787964), ('o', 0.215525820851326), ('x', 0.2153075635433197), ('j', 0.21472081542015076), ('f', 0.20139966905117035), ('a', 0.18369245529174805)]

The gensim FastText model (like gensim's other models in the Word2Vec family) needs each individual text as a list of string tokens, not a plain string.
If you pass texts as plain strings, Python treats each string as a sequence of characters, so to the model they look like lists of single-character tokens. Hence, the only 'words' the model sees are single characters, including the individual spaces.
If the format of your file is such that each line is already a space-delimited text, you could simply change your yield line to:
yield line.split()
If instead it's truly a CSV, and your desired training texts are in only one column of the CSV, you should pick out that field and properly break it into a list-of-string-tokens.
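For example, a minimal sketch of the iterator reworked for that case, assuming the cleaned sentence sits in the first column of the csv (adjust the column index, or use csv.DictReader, to match your file):
import csv

class NWIter():
    def __iter__(self):
        with open(csvpath, 'r', encoding='utf-8', newline='') as fin:
            for row in csv.reader(fin):
                # row[0] is assumed to hold the cleaned, space-separated sentence;
                # .split() turns it into the list of string tokens gensim expects
                yield row[0].split()
Each text the model sees is then a list of whole-word tokens, so most_similar() returns words rather than single characters.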

Related

Getting the vocabulary of Stanford's glove model

Does anyone know if I can get all the vocabulary for the GloVe model?
I'm looking to do the same thing this guy does with BERT in this video [at 15:40]: https://www.youtube.com/watch?v=zJW57aCBCTk&ab_channel=ChrisMcCormickAI
The GloVe vectors and their vocabulary are simply distributed as (space-separated column) text files. On a Unix-derived OS, you can get the vocabulary with a command like:
cut -f 1 -d ' ' glove.6B.50d.txt
If you'd like to do it in Python, the following works. The only trick is that the files use no quoting. Rather, the GloVe files simply use space as a delimiter and space is not allowed inside tokens.
import csv

vocab = set()
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    g300 = csv.reader(f, delimiter=" ", quoting=csv.QUOTE_NONE, escapechar=None)
    for row in g300:
        vocab.add(row[0])
print(vocab)
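Printing the whole set dumps the entire vocabulary (400,000 tokens for the glove.6B files) to the console; a gentler sanity check might be:
print(len(vocab))          # vocabulary size
print("king" in vocab)     # membership test for an example word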

Why does page 323 from Automate the Boring Stuff generate an int of 21?

I'm going back through the book "Automate the Boring Stuff" (which has been a great book, btw) as I need to brush up on CSV parsing for a project, and I'm trying to understand why each output is generated. Why does this code from page 323 create an output of '21', when it's four words, 16 characters, and three commas? Not to mention that I'm entering strings and it outputs numbers.
#%%
import csv
outputFile = open('output.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])
First I thought it was the number of characters, but that adds up to 16. Then I thought maybe each word gets an extra character, plus one at the beginning and end of the CSV file? That could technically explain it, but nothing explicit is stated in the book; it's treated as if it were obvious. I'm not seeing any reference to what gets added or how that number is produced.
There's probably a simple explanation, but I don't understand why it's 21.
I've tried breakpoint() and pdb, but I'm still learning how to use those; the breakdown I get doesn't seem to contain anything that answers it. No counting or summation that I can see.
The docs state that csvwriter.writerow() returns "the return value of the call to the write method of the underlying file object".
In your example
outputFile = open('output.csv', 'w', newline='')
is your underlying file object which you then hand to csv.writer().
If we look a bit deeper we can find the type of outputFile with print(type(outputFile)).
<class '_io.TextIOWrapper'>
While the docs don't explicitly define the write method for TextIOWrapper, they do state that it inherits from TextIOBase, which defines its write() method as "Write the string s to the stream and return the number of characters written."
If we look at the string written to the file:
spam,eggs,bacon,ham
the visible text is 19 characters, and the csv writer also appends its '\r\n' line terminator, for 21 characters in total, which is exactly the number writerow() returns.
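If you want to convince yourself of the count, a minimal check (the file name is just the one from the question):
import csv

with open('output.csv', 'w', newline='') as outputFile:
    outputWriter = csv.writer(outputFile)
    returned = outputWriter.writerow(['spam', 'eggs', 'bacon', 'ham'])

print(returned)                          # 21
print(len('spam,eggs,bacon,ham\r\n'))    # 21 = 19 visible characters + '\r\n'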

Is there a way to fix CSV reading in the Russian language in Azure ML Studio?

I have a large csv file containing some text in the Russian language. When I upload it to Azure ML Studio as a dataset, it appears like "����". What can I do to fix that problem?
I tried changing the encoding of my text to UTF-8 and KOI8-R.
There is no code, but I can share part of the dataset for you to try.
One workaround may be to zip your csv, attach it to an Execute Python Script module as the Script Bundle, and read it yourself with pandas so you can control the encoding. Your Python script in this case should look something like:
# coding: utf-8
# The script MUST contain a function named azureml_main
# which is the entry point for this module.

# imports up here can be used to
import pandas as pd

# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1=None, dataframe2=None):
    russian_ds = pd.read_csv('./Script Bundle/your_russian_dataset.csv', encoding='utf-8')
    # your logic goes here
    return russian_ds
It worked for me with French datasets, so hopefully you will find it useful.
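For completeness, a minimal sketch of building that zip locally before uploading it as the Script Bundle (file names are just placeholders):
import zipfile

# bundle the csv so it can be attached to the Execute Python Script module
with zipfile.ZipFile('russian_dataset.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('your_russian_dataset.csv')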

np.save is converting floats to weird characters

I am attempting to append results to an ongoing csv file. Each result comes out as a NumPy ndarray:
[IN]: print(savearray)
[OUT]: [[ 0.55219001 0.39838119]]
Initially I tried
np.savetxt('flux_ratios.csv', savearray,delimiter=",")
But this overwrites the old data every time I save, so instead I am attempting to append the data like this:
f = open('flux_ratios.csv', 'ab')
np.save(f, 'a', savearray)
f.close()
This is (in a sense) appending; however, the numerical data is being saved as weird, unreadable characters in the output file.
I have no idea why or how this is happening so any help would be greatly appreciated!
First off, np.save writes NumPy's binary .npy format, whereas np.savetxt writes text; mixing binary output into a file you then open as a CSV is why you see the odd characters. Note also that np.save has no mode argument, so in np.save(f, 'a', savearray) the string 'a' is actually passed as the array to be saved.
You could just change np.save(f, 'a', savearray) to np.savetxt(f, savearray, delimiter=','); since f is opened with 'ab', each call appends a new row of text.
Otherwise you could also consider pandas.DataFrame.to_csv with mode='a'.
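A minimal sketch of both appending options, reusing the array and filename from the question:
import numpy as np
import pandas as pd

savearray = np.array([[0.55219001, 0.39838119]])

# option 1: np.savetxt on a file opened in append mode writes one text row per call
with open('flux_ratios.csv', 'ab') as f:
    np.savetxt(f, savearray, delimiter=',')

# option 2: pandas in append mode (header/index suppressed to keep plain rows)
pd.DataFrame(savearray).to_csv('flux_ratios.csv', mode='a', header=False, index=False)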

Decoding/Encoding using sklearn load_files

I'm following the tutorial here
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb
to learn about machine learning and text.
In my case, I'm using tweets I downloaded, with positive and negative tweets in the exact same directory structure they are using (trying to learn sentiment analysis).
Here in the iPython Notebook I load my data just like they do:
tweets_train = load_files('Path to my training Tweets')
And then I try to fit them with CountVectorizer
vect = CountVectorizer().fit(text_train)
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte
Is this because my Tweets have all sorts of non-standard text in them? I didn't do any cleanup of my Tweets (I assume there are libraries that help with that in order to make a bag-of-words model work?)
EDIT:
Code I use with Twython to download the tweets:
from twython import Twython

def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user,
                count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet id's
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()
You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding.
If this means nothing to you, make sure you understand the basics of Unicode and text encoding, e.g. with the official Python Unicode HOWTO.
First, you need to find out what encoding was used to store the tweets on disk.
When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. Check this, for example, in an interactive session:
>>> f = open('/tmp/foo', 'a')
>>> f
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
Here you can see that in my local environment the default encoding is set to UTF-8. You can also directly inspect the default encoding with
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
There are other ways to find out what encoding was used for the files.
For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
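If you'd rather stay in Python, a crude but often useful check is to try a few candidate encodings on one of the tweet files and see which ones decode without errors (a sketch; the path is a placeholder):
candidates = ['utf-8', 'cp1252', 'latin-1']
raw = open('path to one tweet file', 'rb').read()
for enc in candidates:
    try:
        raw.decode(enc)
        print(enc, 'decodes cleanly')
    except UnicodeDecodeError as err:
        print(enc, 'fails:', err)
# note: latin-1 accepts any byte sequence, so a clean decode there
# proves nothing on its own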
Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:
tweets_train = load_files('path to tweets', encoding='latin-1')
... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.
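Alternatively, since the download script is under your control, the cleaner long-term fix is to pin the encoding when writing the tweet files and pass that same encoding to load_files. A minimal sketch (user and text are the variables from the download loop in the question):
from sklearn.datasets import load_files

# inside the inner loop of get_tweets(): write with an explicit encoding
with open(user, "a", encoding="utf-8") as text_file:
    text_file.write(text)

# later, load the corpus with the matching encoding
tweets_train = load_files('Path to my training Tweets', encoding='utf-8')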
