How to read numerical data from CSV in PyTorch?

I'm new to PyTorch; I'm trying to implement a model I developed in TensorFlow and compare the results. The model is an autoencoder. The input data is a CSV file containing n samples, each with m features (an n×m numerical matrix). The targets (the labels) are in another CSV file with the same format as the input file. I've been looking online but couldn't find good documentation on reading non-image data from a CSV file with multiple labels. Any idea how I can read my data and iterate over it during training?
Thank you

Might you be looking for something like TabularDataset?
class torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
Defines a Dataset of columns stored in CSV, TSV, or JSON format.
It will take a path to a CSV file and build a dataset from it. You also need to specify the names of the columns, which will then become the data fields.
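A minimal sketch with the (legacy) torchtext API; the file name data.csv and the column names are assumptions:

    import torch
    from torchtext.data import Field, TabularDataset

    # One Field per CSV column; sequential=False and use_vocab=False
    # because the values are plain numbers, not text to tokenize.
    num = Field(sequential=False, use_vocab=False, dtype=torch.float)
    dataset = TabularDataset(
        path='data.csv', format='csv',
        fields=[('feature', num), ('target', num)],
        skip_header=True)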
In general, all implementations of torch.utils.data.Dataset for specific types of data live outside of core PyTorch, in the torchvision, torchtext, and torchaudio libraries.
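That said, for a purely numerical matrix a plain custom torch.utils.data.Dataset is often simpler. A minimal sketch, assuming hypothetical header-less files inputs.csv and targets.csv:

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, Dataset

    class CSVDataset(Dataset):
        def __init__(self, input_path, target_path):
            # Load both n*m matrices as float32 tensors.
            self.x = torch.from_numpy(
                np.loadtxt(input_path, delimiter=',', dtype=np.float32))
            self.y = torch.from_numpy(
                np.loadtxt(target_path, delimiter=',', dtype=np.float32))

        def __len__(self):
            return len(self.x)

        def __getitem__(self, idx):
            return self.x[idx], self.y[idx]

    loader = DataLoader(CSVDataset('inputs.csv', 'targets.csv'),
                        batch_size=32, shuffle=True)
    for inputs, targets in loader:
        pass  # your autoencoder training step goes here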

Related

How to predict a masked word in a given sentence

FitBERT is a useful package, but I have a small doubt about using BERT for masked-word prediction: I trained a BERT model on a custom corpus using Google's scripts (create_pretraining_data.py, run_pretraining.py, extract_features.py, etc.). As a result I got a vocab file, a .tfrecord file, a .json file, and checkpoint files.
Now how do I use those files with your package to predict a masked word in a given sentence?
From the tensorflow documentation:
A TFRecord file stores your data as a sequence of binary strings. This means you need to specify the structure of your data before you write it to the file. Tensorflow provides two components for this purpose: tf.train.Example and tf.train.SequenceExample. You have to store each sample of your data in one of these structures, then serialize it and use a tf.python_io.TFRecordWriter to write it to disk.
This document, along with the tensorflow documentation, explains quite well how to use those file types.
To use FitBERT directly through the library, you can follow the examples on the project's GitHub.
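For reference, basic FitBERT usage looks roughly like this (the sentence and candidate words below are illustrative, not from the question):

    from fitbert import FitBert

    # FitBert loads a pretrained BERT by default; a custom model can be
    # passed in if you have your own weights.
    fb = FitBert()

    masked_string = "Why Bert, you're looking ***mask*** today!"
    options = ['buff', 'handsome', 'strong']

    # rank() returns the candidate words ordered by how well they fit.
    print(fb.rank(masked_string, options=options))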

Shuffling training data before splitting using Keras DataGenerator

My model is pretty obviously overfitting, and I keep seeing everywhere that I should try shuffling my data before splitting it. I currently use an ImageDataGenerator to do my data processing and splitting, and I learned that shuffle=True doesn't actually do what I thought it did (or possibly anything). So my question is: how should I load in and split this data? I have image files in a train folder, and a .csv file with the file name in one column and the label in the other. This is my first attempt at any machine learning stuff, so I'm sorry if this is a dumb question.
If I understand your code correctly, you are loading dataframe=df as input for your training/validation set and dataframe=test_df for your test set. shuffle=True will shuffle the loaded samples within the specified dataframe.
So if you load from different sources, you are shuffling after splitting.
To shuffle before splitting, you need to either:
- shuffle the images between directories before loading, or
- load everything with ImageDataGenerator (shuffle=True), split it with array operations, and manually set y_col and batch_size for your test set, or
- drop the separate directories for your files altogether, load your .csv as a Pandas DataFrame, shuffle and split the rows, and then use those partial dataframes as input for your ImageDataGenerators.
Personally, I would choose the last option; a rough sketch follows.
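A minimal sketch of that last option, assuming the hypothetical file train.csv with columns 'filename' and 'label', and images under a train/ directory:

    import pandas as pd
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    df = pd.read_csv('train.csv')
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle first

    split = int(0.8 * len(df))  # then split 80/20
    train_df, val_df = df.iloc[:split], df.iloc[split:]

    datagen = ImageDataGenerator(rescale=1.0 / 255)
    train_gen = datagen.flow_from_dataframe(
        train_df, directory='train', x_col='filename', y_col='label',
        target_size=(224, 224), batch_size=32)
    val_gen = datagen.flow_from_dataframe(
        val_df, directory='train', x_col='filename', y_col='label',
        target_size=(224, 224), batch_size=32, shuffle=False)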

How to read in .txt file (corpus) into torchtext in pytorch?

I only see data.Dataset for example datasets and data.TabularDataset for CSV, JSON, and TSV files.
https://github.com/pytorch/text#data
https://torchtext.readthedocs.io/en/latest/data.html#dataset
It still works if I read it in as a TabularDataset, like this:
    test_file = data.TabularDataset(path=input_filepath, format='csv', fields=[('text', data.Field())])
But my dataset is not tabular, so I wanted to check to see if there was a better option.
I would suggest writing a quick script to read your corpus and dump it to JSON (there are plenty of examples out there), then using that JSON with torchtext. You'll want some sort of structure to your data to get the most out of torchtext (think batches/iterable datasets).
If you are lost on how to iterate through a dataset, check out my other answer here.
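A minimal sketch of such a conversion script (corpus.txt and corpus.json are hypothetical file names; one example per line is assumed):

    import json

    # Convert a plain-text corpus, one example per line, into the
    # JSON-lines format that TabularDataset can consume.
    with open('corpus.txt') as src, open('corpus.json', 'w') as dst:
        for line in src:
            line = line.strip()
            if line:
                dst.write(json.dumps({'text': line}) + '\n')

The result can then be loaded with data.TabularDataset(path='corpus.json', format='json', fields={'text': ('text', data.Field())}).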

LIBSVM with large data samples

I am currently looking to use libsvm (or an alternative, if one is suggested; opencv also looks like a viable option) to train an SVM. My training data sets are rather large: around 50 binary 128MB files. It appears that to use libsvm I must convert the data to the proper format; however, I was wondering whether it is possible to train on the raw binary data itself? Thanks in advance.
No, you cannot use your raw binary (image) data for training or testing.
In order to use libsvm, you have to convert your binary data files into this format.
See this stackoverflow post for the details of the libsvm data format.
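For reference, each line of a libsvm training file has the form <label> <index>:<value> ..., with 1-based, ascending feature indices. A minimal conversion sketch, assuming you have already decoded your binary files into a dense feature matrix and a label vector (to_libsvm is a hypothetical helper):

    import numpy as np

    def to_libsvm(features, labels, out_path):
        # One "<label> <index>:<value> ..." line per sample; indices are
        # 1-based, and zero-valued features may be omitted (sparse format).
        with open(out_path, 'w') as f:
            for x, y in zip(features, labels):
                pairs = ' '.join(f'{i + 1}:{v}' for i, v in enumerate(x) if v != 0)
                f.write(f'{int(y)} {pairs}\n')

    # Hypothetical usage with random data standing in for the real files:
    X = np.random.rand(10, 128)
    y = np.random.randint(0, 2, size=10)
    to_libsvm(X, y, 'train.libsvm')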

Text Classification using Naive Bayes

Do guide me along if I am not posting in the right section.
I have some text files for my training data, which are unformatted Word documents. They all contain only ASCII characters.
I would like to train a model on the text files using data mining methods.
The text files contain about 300 words each on average.
Is there any software recommended for me to start with?
My initial idea is to use all the words in one of the files as training data and the remaining files as test data, in order to perform cross-fold validation.
However, the tools I have, such as Weka, do not seem to satisfy my needs, as converting to CSV files does not seem feasible in my case because the text files are stored separately.
I am trying to perform cross-validation in such a way that all the words in the training data are treated as features.
You need to use Weka's StringToWordVector filter and convert your text files to ARFF files. After that you can use Weka's classification algorithms. Watch the following video to learn the basics.
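If you would rather script this than use the Weka GUI, the same idea (bag-of-words features, Naive Bayes, k-fold cross-validation) can be sketched in Python with scikit-learn; this is an alternative to Weka, not part of the answer above, and the data/<label>/<file>.txt layout is an assumption:

    from pathlib import Path
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical layout: data/<label>/<doc>.txt, one folder per class.
    docs, labels = [], []
    for path in Path('data').glob('*/*.txt'):
        docs.append(path.read_text())
        labels.append(path.parent.name)

    # CountVectorizer makes every word a feature, as the question asks.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    scores = cross_val_score(model, docs, labels, cv=5)  # 5-fold CV
    print('mean accuracy:', scores.mean())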
