How do I read data in .txt format using the Keras framework? - keras

I need to train a fully-connected neural network classifier using a dataset in .txt format as training data. How can I read the dataset using the Keras framework?
I have to use fewer than 40 parameters, and only the first 100 rows of the dataset.

To read .txt data into Keras you can use tf.keras.preprocessing.text.Tokenizer; see the documentation here:
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
You still have to open the file and read in the data with the Python function open. You can then use the tokenizer to convert the data into a numerical representation.
Here's an example:
from tensorflow.keras.preprocessing.text import Tokenizer

# Read the file line by line so that each line becomes one sample;
# fit_on_texts expects a list of texts, not a single string.
with open('dataset.txt') as f:
    data = f.read().splitlines()

# Tip: you can also specify characters to be ignored via the `filters` argument.
tokenizer = Tokenizer(num_words=40)
tokenizer.fit_on_texts(data)

# Here we use a binary representation; I recommend looking into the
# representation that fits your training data needs.
data = tokenizer.texts_to_matrix(data, mode='binary')
train_data = data[:100]

Related

How to build a customized dataset from MNIST in pytorch

I am importing the MNIST dataset as train_data_MNIST = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True) and I am trying to make a smaller dataset from MNIST, let's say the first 10,000 images and corresponding labels. I know this can be handled with torch.utils.data.Subset. But what I want is a torchvision.datasets object (if I directly apply torch.utils.data.Subset to the train_data_MNIST listed above, the result is an object of the torch.utils.data.Subset class).
Is there any possible way such that I can use a fraction of the original MNIST dataset to create a new dataset (not subset)?
Thanks in advance.
What about modifying data and targets directly? For example:
dataset = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True)
# Truncate the underlying tensors in place; `dataset` remains a torchvision.datasets.MNIST object.
dataset.data = dataset.data[:10000]
dataset.targets = dataset.targets[:10000]
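As a quick sanity check (hypothetical usage; path and transforms are assumed to be defined as in the question):
# The truncated object still behaves like a regular torchvision MNIST dataset.
print(len(dataset))        # 10000
image, label = dataset[0]  # returns a (transformed image, target) pair as usual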

How to preprocess data for training a character level RNN

I am trying to train an RNN model which classifies the origin of names. The data looks like the attached image. I understand I have to first map the labels to integers. I am using the following code to do that:
from sklearn.preprocessing import LabelEncoder

# `origins` holds the column of origin labels from the data.
encoder = LabelEncoder()
origin_label = encoder.fit_transform(origins)
I am having troubles figuring out what the next steps would be. I am using Keras to build this model. Thank you very much for your help.
[Image: "Data Format"]
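As a minimal sketch of a possible next step (assuming the names are plain strings; the names list and maxlen below are hypothetical), the names themselves can be tokenized at the character level with Keras' Tokenizer and padded to a fixed length:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

names = ['Smith', 'Tanaka', 'Rossi']  # hypothetical example names

# char_level=True tokenizes each character instead of each word.
char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts(names)
sequences = char_tokenizer.texts_to_sequences(names)

# Pad all names to the same length so they can be batched together
# with the integer labels in origin_label above.
X = pad_sequences(sequences, maxlen=20)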

Extracting fixed vectors from BioBERT without using terminal command?

If we want to use weights from a pretrained BioBERT model, we can execute the following command (here wrapped in Python's os.system) after downloading all the required BioBERT files.
import os

os.system('python3 extract_features.py \
    --input_file=trial.txt \
    --vocab_file=vocab.txt \
    --bert_config_file=bert_config.json \
    --init_checkpoint=biobert_model.ckpt \
    --output_file=output.json')
The above command reads a single file containing the text and writes the extracted vectors to another file, so it cannot easily be scaled to very large datasets containing thousands of sentences/paragraphs.
Is there a way to extract these features on the fly (using an embedding layer), as can be done for word2vec vectors in PyTorch or TF1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way it could be done with TF2.0 unless someone generates TF2.0 compatible checkpoint files.
I will be grateful for any hint or help.
You can get the contextual embeddings on the fly, but the total time spent on getting the embeddings will always be the same. There are two options for how to do it: 1. import BioBERT into the Transformers package and use it in PyTorch (which I would do), or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch, but also works with TensorFlow. It does not include BioBERT out of the box, so you need to convert it from the TensorFlow format yourself. There is a convert_tf_checkpoint_to_pytorch.py script that does that. People had some issues with this script and BioBERT (these seem to be resolved).
After you convert the model, you can load it like this.
import torch
from transformers import BertTokenizer, BertModel

# Load tokenizer and model from the directory with the converted checkpoint
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')

# Call the model in a standard PyTorch way; the input must be a tensor,
# not a plain list. In older Transformers versions the model returns a tuple
# whose first element holds the contextual embeddings (last hidden states).
input_ids = torch.tensor([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
embeddings = model(input_ids)
2. Use the BioBERT codebase directly
You can get the embeddings on the fly using essentially the code in extract_features.py. On lines 346-382, they initialize the model. You get the embeddings by calling estimator.predict(...).
For that, you need to format the input. First, format the string (using the code on lines 326-337) and then call convert_examples_to_features on it.
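For illustration, a rough sketch of that input formatting, assuming the InputExample class and convert_examples_to_features function as they appear in the reference BERT extract_features.py (exact signatures may differ across BioBERT versions):
# Assumes extract_features.py and tokenization.py from the BERT/BioBERT
# codebase are on the path; signatures follow google-research/bert.
import tokenization
from extract_features import InputExample, convert_examples_to_features

tokenizer = tokenization.FullTokenizer(vocab_file='vocab.txt', do_lower_case=False)
examples = [InputExample(unique_id=0, text_a='Cool biomedical sentence.', text_b=None)]
features = convert_examples_to_features(examples=examples, seq_length=128, tokenizer=tokenizer)
# `features` can then be fed to estimator.predict(...) exactly as the
# original script does on lines 346-382.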

Loading Training Images using Keras

To train a model using Keras, should I load all the images I have into an array to create something like
x_train, y_train
Or is there a better way to read the images on the fly while training? I am not looking for the ImageDataGenerator class, since my output is an array of points, not classes based on directory names.
I managed to get my data CSV file to contain the array of points and the image file name in 9 columns as follows:
x1 x2 ... x8 Image_file_name
You can use this data with ImageDataGenerator. You incorrectly assume that it needs folders for classes, but that only applies to flow_from_directory. The flow_from_dataframe method allows you to load data from a Pandas dataframe, for example:
from keras.preprocessing.image import ImageDataGenerator
import pandas as pd

idg = ImageDataGenerator(...)
df = pd.read_csv('your_data.csv')
generator = idg.flow_from_dataframe(df, directory='image folder',
                                    x_col='filename_column',
                                    y_col=['col1', 'col2', ..., 'coln'],
                                    class_mode='other')
This generator will read rows from the dataframe, load the image in directory whose filename is given by x_col, and use the corresponding row to build the targets, which in this case will be a numpy array of the values of the columns in y_col. More information about this method can be found in the Keras documentation.
Loading the entire dataset into memory as an array is not a great idea, because the memory consumption could go out of control, so you should use a generator. ImageDataGenerator and flow_from_dataframe are a great way of loading images in Keras. Since you don't want to use ImageDataGenerator (can you mention why?), you can create your own generator function that loads chunks of images into memory, as sketched below. If you load your data with a generator, make sure you use the fit_generator and predict_generator functions.
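A minimal sketch of such a custom generator, assuming the 9-column CSV layout from the question (targets in columns x1..x8, filenames in Image_file_name; the directory, image size, and batch size below are hypothetical choices):
import numpy as np
import pandas as pd
from keras.preprocessing.image import load_img, img_to_array

def image_generator(df, directory, batch_size=32):
    # Loop forever so fit_generator can draw as many batches as it needs.
    while True:
        for start in range(0, len(df), batch_size):
            batch = df.iloc[start:start + batch_size]
            images = np.array([
                img_to_array(load_img(directory + '/' + fname, target_size=(224, 224)))
                for fname in batch['Image_file_name']
            ])
            targets = batch[['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']].values
            yield images / 255.0, targets  # scale pixels to [0, 1]

# `model` is assumed to be a compiled Keras model accepting these image shapes.
df = pd.read_csv('your_data.csv')
model.fit_generator(image_generator(df, 'image folder'), steps_per_epoch=len(df) // 32, epochs=10)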
To load unlabeled data you can do the following hack:
datagen = ImageDataGenerator()
test_data = datagen.flow_from_directory('.', classes=['directory_where_images_are_stored'])
For more information check out link [1].
[1] https://kylewbanks.com/blog/loading-unlabeled-images-with-imagedatagenerator-flowfromdirectory-keras

ValueError: could not convert string to float in pandas

My code is:
import pandas as pd
data = pd.read_table('train.tsv')
X=data.Phrase
Y=data.Sentiment
from sklearn import cross_validation
X_train,X_test,Y_train,Y_test=cross_validation.train_test_split(X,Y,test_size=0.2,random_state=0)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,Y)
I get the error: ValueError: could not convert string to float
What changes can I make so that my code works?
You can't pass text data into MultinomialNB of scikit-learn, as stated in its documentation.
None of the algorithms in scikit-learn work directly on text data. You need to do some preprocessing to get the desired output. You'll first need to extract features from the text data, using techniques like a bag-of-words representation or tokenization (see the sketch below). Have a look at this link for a better understanding.
You also might want to look at using NLTK for use cases such as yours.
ValueError when using Multinomial Naive Bayes classifier
You probably should preprocess your data as shown in the answer above.
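For illustration, a minimal sketch of that preprocessing with scikit-learn's CountVectorizer (a bag-of-words feature extractor), reusing X and Y from the question; note that train_test_split lives in sklearn.model_selection in current scikit-learn versions:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Turn the raw phrases into a sparse bag-of-words count matrix.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(X_counts, Y, test_size=0.2, random_state=0)

clf = MultinomialNB()
clf.fit(X_train, Y_train)
print(clf.score(X_test, Y_test))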
