external dataset learning in python for machine learning - python-3.x

Hi, I want to classify a dataset using a Naive Bayes classifier. For that I want to use an external dataset which I downloaded from Google. This dataset contains two folders, one for positive reviews and one for negative reviews. Each folder contains 1000 .txt files. How do I import these files into my code as a training dataset in Python? I am new to machine learning, so I have very little idea about this. Please help me out.

You can use os.listdir, from https://docs.python.org/2/library/os.html, e.g.:
import os
fileList = os.listdir('train_directory')
for file in fileList:
    # read the content of each file and add it to your dataset
    with open(os.path.join('train_directory', file)) as f:
        content = f.read()
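Since your data is split into a positive and a negative folder, a common approach is to read every .txt file from each folder into a list of texts with a matching list of labels, then feed those to a vectorizer and a Naive Bayes classifier. Below is a minimal sketch assuming scikit-learn is available; the folder names 'pos' and 'neg' are placeholders for your actual download paths.
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def read_folder(folder, label):
    texts, labels = [], []
    for name in os.listdir(folder):
        # each .txt file becomes one training example
        with open(os.path.join(folder, name), encoding='utf-8', errors='ignore') as f:
            texts.append(f.read())
        labels.append(label)
    return texts, labels

# 'pos' and 'neg' are placeholder folder names; use your actual paths
pos_texts, pos_labels = read_folder('pos', 1)
neg_texts, neg_labels = read_folder('neg', 0)
texts = pos_texts + neg_texts
y = pos_labels + neg_labels

vectorizer = CountVectorizer()        # bag-of-words features
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, y)       # Naive Bayes classifier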

Related

Splitting an image dataset

I have an image dataset (a folder of jpg images). I would like to split it randomly: 70% for training and 30% for testing.
So I wrote this simple script:
from sklearn.model_selection import train_test_split
path = ".\dataset"
output_split=train_test_split(path,path,test_size=0.2)
But I don't find anything in a folder called "output_split".
So where is the output of the split (train and test) stored?
I recommend using tf.data datasets in your project. You can refer to this link for the documentation:
https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory
And this link for an example of how the function is used:
https://keras.io/api/preprocessing/image/
Please remember to use the same seed for both the training and the testing/validation subsets.
Please note that no changes will happen to the files on your local disk.
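A minimal sketch of how that function is typically called, assuming your images live under a ./dataset directory with one subfolder per class (the image size, batch size, and seed are placeholder values):
import tensorflow as tf

# 70/30 split: the same seed and validation_split must be used in both calls
train_ds = tf.keras.utils.image_dataset_from_directory(
    "./dataset",
    validation_split=0.3,
    subset="training",
    seed=123,
    image_size=(224, 224),
    batch_size=32,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "./dataset",
    validation_split=0.3,
    subset="validation",
    seed=123,
    image_size=(224, 224),
    batch_size=32,
)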

How To Import The MNIST Dataset From Local Directory Using PyTorch

I am writing code for the well-known MNIST database of handwritten digits in PyTorch. I downloaded the training and test datasets (from the main website), including the label files. The files come in the format t10k-images-idx3-ubyte.gz and, after extraction, t10k-images-idx3-ubyte. My dataset folder looks like this:
MNIST
  Data
    train-images-idx3-ubyte.gz
    train-labels-idx1-ubyte.gz
    t10k-images-idx3-ubyte.gz
    t10k-labels-idx1-ubyte.gz
Now, I wrote the following code to load the data:
import torch
import torchvision

def load_dataset():
    data_path = "/home/MNIST/Data/"
    xy_trainPT = torchvision.datasets.ImageFolder(
        root=data_path, transform=torchvision.transforms.ToTensor()
    )
    train_loader = torch.utils.data.DataLoader(
        xy_trainPT, batch_size=64, num_workers=0, shuffle=True
    )
    return train_loader
My code raises the error: Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp
How can I solve this problem? I also want to check that my images are loaded correctly (just a figure containing the first 5 images from the dataset).
Read this: Extract images from .idx3-ubyte file or GZIP via Python
Update
You can import the data like this:
xy_trainPT = torchvision.datasets.MNIST(
    root="~/Handwritten_Deep_L/",
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
)
Now, here is what happens with download=True: first, your code checks whether the root directory (the path you gave) already contains the dataset.
If not, the dataset is downloaded from the web.
If the path already contains the dataset, your code works with the existing data and does not download it from the internet.
You can check this yourself: first give a path without any dataset (the data will be downloaded from the internet), then give another path which already contains the dataset (the data will not be downloaded).
Welcome to Stack Overflow!
The MNIST dataset is not stored as images, but in a binary format (as indicated by the ubyte extension). Therefore, ImageFolder is not the type of dataset you want. Instead, you will need to use the MNIST dataset class. It can even download the data for you if you have not done so already :)
This is a dataset class, so just instantiate it with the proper root path, then pass it as the parameter of your dataloader and everything should work just fine.
If you want to check the images, just index the dataset (its __getitem__ method) or iterate over the dataloader, and save the result as a png file (you may need to convert the tensor to a numpy array first).
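A minimal sketch for the second part of the question, plotting the first 5 images with matplotlib (assuming the xy_trainPT MNIST dataset from the snippet above and that matplotlib is installed):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 5)
for i, ax in enumerate(axes):
    image, label = xy_trainPT[i]  # image is a 1x28x28 tensor after ToTensor()
    ax.imshow(image.squeeze().numpy(), cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
plt.show()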

How to load large multi file parquet files for tensorflow/pytorch

I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives an out-of-memory error.
I have also tried using petastorm, but make_reader() doesn't work because the dataset isn't in the petastorm format.
with make_batch_reader('dir') as reader:
    dataset = make_petastorm_dataset(reader)
When I used make_batch_reader() and then make_petastorm_dataset(reader), it gave a "zip not iterable" error or something along those lines.
I am not sure how to load the files into Python for ML training.
Some quick help would be greatly appreciated.
Thanks
Zash
For pyarrow, you can list the directory with Python, iterate over *.parquet files, open each one as pq.ParquetFile, and read it one row group at a time. This will alleviate the memory pressure, but won't be super fast without parallelization.
For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful; but you can inspect the stack trace and investigate where in petastorm code it originates from.
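A minimal sketch of the row-group approach described above, assuming the files live under ./dir (the glob pattern and what you do with each batch are placeholders):
import glob
import pyarrow.parquet as pq

for path in glob.glob('dir/*.parquet'):
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        # only one row group is held in memory at a time
        batch = pf.read_row_group(i).to_pandas()
        # ... feed `batch` to your training loop here ...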
You can load the entire dataset using dask with the code below.
You can also load only chunks of the data whenever needed, by computing only those rows via the index (assuming you have a suitable index).
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

@delayed
def load_chunk(pth):
    x = ParquetFile(pth).to_pandas()
    # drop any unwanted columns to save space (placeholder column list)
    x = x.drop('[unwanted_columns_to_save_space]', axis=1)
    return x

files = glob.glob('./your_path/*.parquet')
ddf = dd.from_delayed([load_chunk(f) for f in files])
df = ddf.compute()

Is there a way to fix CSV reading in Russian in Azure ML Studio?

I have a large csv file containing some text in Russian. When I upload it to Azure ML Studio as a dataset, it appears like "����". What can I do to fix that problem?
I tried changing the encoding of my text to UTF-8 and KOI8-R.
There is no code, but I can share part of the dataset for you to try.
One workaround may be zipping your csv and reading it with a Python script module. Your Python script in that case would look something like:
# coding: utf-8
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# Imports up here can be used to import packages.
import pandas as pd

# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1=None, dataframe2=None):
    russian_ds = pd.read_csv('./Script Bundle/your_russian_dataset.csv', encoding='utf-8')
    # your logic goes here
    return russian_ds
It worked for me with French datasets, so hopefully you will find it useful.
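For the zipping step itself, a minimal sketch using the standard library (the file names are placeholders; the resulting zip is what you attach to the Script Bundle input of the Execute Python Script module):
import zipfile

with zipfile.ZipFile('script_bundle.zip', 'w') as zf:
    zf.write('your_russian_dataset.csv')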

Tensorflow object detection API tfrecord

I'm new to the TensorFlow TFRecord format, so I'm studying the Tensorflow object detection API code:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md
but I can't find the code that loads the tfrecord files.
I think they use the .config file to load the tfrecords, because I found this in the config file:
tf_record_input_reader {
  input_path: "/path/to/train_dataset.record-?????-of-00010"
}
Can anyone help?
Have you converted your dataset to TFRecord format yet?
If so, you should have a path which contains your training dataset sharded to a few record files, with the format
<path_to_training_data>/<train_dataset>.record-?????-of-xxxxx
Where <path_to_training_data> is the above-mentioned path to your training dataset, <train_dataset> is the file name you gave each file, xxxxx is the number of record files created (e.g. 00010), and ????? should be left as is; it acts as a wildcard matching all the record shards.
Once you've replaced <path_to_training_data>, <train_dataset> and xxxxx to the correct values of your dataset, TF OD API should handle everything else (finding all records, reading them, etc.).
Note there's usually tf_record_input_reader for both training dataset and eval dataset (validation/test), and each should have the corresponding above-mentioned values (path, dataset name, number of files).
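If you do want to load the records yourself outside the OD API (for inspection, for example), here is a minimal sketch with tf.data; the glob pattern is a placeholder for your own shard names:
import tensorflow as tf

# match all shards of the training set
files = tf.data.Dataset.list_files("/path/to/train_dataset.record-*")
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)

for raw_record in dataset.take(1):
    # each element is a serialized tf.train.Example
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)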
