Using download_data() and untar_data() in fastai library - python-3.x

I downloaded the Fashion MNIST dataset from Kaggle using the download_data() function in the fastai library.
downloaded_data = download_data("https://www.kaggle.com/zalando-research/fashionmnist/download")
output -
PosixPath('/root/.fastai/data/download.tgz')
download_data saves it as a .tgz file, so I then called untar_data().
path = untar_data('/root/.fastai/data/download.tgz')
output -
PosixPath('/root/.fastai/data/download.tgz')
This did not extract the .tgz file. How do I use this dataset in the fastai library?

In the fastai library, download_data returns a pathlib.PosixPath that points to the downloaded file; it does not extract anything. You need to use another library to unpack the archive.
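As a rough sketch of that extraction step, Python's standard tarfile module can unpack the archive. The helper name extract_archive and the choice of destination folder are assumptions made for illustration, not part of fastai. The tarfile.is_tarfile check guards against the case where the downloaded file is not actually a tar archive.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir=None):
    """Extract a tar archive (.tgz / .tar.gz) and return the destination folder."""
    archive_path = Path(archive_path)
    # Default destination (an assumption): a folder next to the archive,
    # named after it with the last suffix removed.
    dest_dir = Path(dest_dir) if dest_dir else archive_path.with_suffix("")
    if not tarfile.is_tarfile(archive_path):
        raise ValueError(f"{archive_path} is not a tar archive")
    dest_dir.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip, bz2, ...) by itself.
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

With the path from the question, this would be called as extract_archive('/root/.fastai/data/download.tgz').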
If you just need the MNIST data from fast ai, here's an easier way:
from fastai import datasets
import gzip, pickle
MNIST_URL='http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz')
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your PDF to images first.
By "very massive PDF" I assume you mean a PDF with lots of pages. This is not an issue. You can use the pdf2image library (see its documentation). Its convert_from_path function has an output_folder argument that lets you specify the folder where all the generated images will be saved. The documentation describes it as:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
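For a PDF with many pages, a middle ground is to convert and OCR a few pages at a time. The sketch below is only one way to do it: the function names page_batches and ocr_pdf_in_batches and the default batch size are made up for illustration, and it assumes pdf2image (with poppler) and pytesseract are installed.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=200):
    """Run OCR over a PDF a few pages at a time; returns one string per page."""
    # Third-party imports are kept local so page_batches works without them.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    texts = []
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages of the current batch are rendered and held in memory.
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        texts.extend(pytesseract.image_to_string(page) for page in pages)
    return texts
```

Only one batch of rendered pages is alive at any time, so memory use depends on batch_size and dpi rather than on the total page count.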
For example, my code (Python 3.9, macOS Big Sur):
import cv2
import pytesseract
from PIL import Image
from fonctions_images import *  # the author's own helper module (presumably provides del_lines)
from pdf2image import convert_from_path
path = '/Users/yves/documents_1/'
fichier = path + 'TOUTOU.pdf'
# Render every page at 500 dpi, in grayscale
images = convert_from_path(fichier, 500, transparent=True, grayscale=True, poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
for image in images:
    image.save(path + "image.png", format="png")
    img = cv2.imread(path + "image.png")  # load the page image into memory
    img = del_lines(path, img)  # remove the lines
    img = cv2.imread(path + "img_final_bin_1.png")
    # custom_config is not defined in this snippet; it is assumed to be set elsewhere in the author's code
    d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra', config=custom_config, output_type='data.frame')

How To Import The MNIST Dataset From Local Directory Using PyTorch

I am writing code for a well-known problem, the MNIST database of handwritten digits, in PyTorch. I downloaded the training and test datasets (from the main website), including the labels. The files are named like t10k-images-idx3-ubyte.gz and, after extraction, t10k-images-idx3-ubyte. My dataset folder looks like this:
MNIST
    Data
        train-images-idx3-ubyte.gz
        train-labels-idx1-ubyte.gz
        t10k-images-idx3-ubyte.gz
        t10k-labels-idx1-ubyte.gz
I wrote the following code to load the data:
import torch
import torchvision
def load_dataset():
    data_path = "/home/MNIST/Data/"
    xy_trainPT = torchvision.datasets.ImageFolder(
        root=data_path, transform=torchvision.transforms.ToTensor()
    )
    train_loader = torch.utils.data.DataLoader(
        xy_trainPT, batch_size=64, num_workers=0, shuffle=True
    )
    return train_loader
My code fails with the message: Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif,.tiff,.webp
How can I solve this problem? I also want to check that my images are loaded (just a figure containing the first 5 images from the dataset).
Read this: "Extract images from .idx3-ubyte file or GZIP via Python".
Update
You can import the data like this:
xy_trainPT = torchvision.datasets.MNIST(
    root="~/Handwritten_Deep_L/",
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
)
With download=True, your code first checks whether the root directory (the path you gave) already contains the dataset.
If it does not, the dataset is downloaded from the web.
If it does, your code uses the existing dataset and nothing is downloaded.
You can check this yourself: first give a path without any dataset (the data will be downloaded), then give a path that already contains the dataset (nothing will be downloaded).
Welcome to Stack Overflow!
The MNIST dataset is not stored as images, but in a binary format (as indicated by the ubyte extension). Therefore, ImageFolder is not the type of dataset you want. Instead, you will need to use the MNIST dataset class. It can even download the data if you have not done so already :)
This is a dataset class, so just instantiate it with the proper root path, then pass it to your DataLoader and everything should work just fine.
If you want to check the images, take a sample from the dataset (for example xy_trainPT[0]) or a batch from the loader (next(iter(train_loader))), and save the result as a PNG file (you may need to convert the tensor to a NumPy array first).
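To look at the files that are already in the local folder without going through torchvision, the IDX format can also be parsed by hand. The sketch below uses only the standard library; the function names are made up for illustration. It relies on the IDX layout: big-endian 32-bit header fields, magic number 2051 for image files and 2049 for label files, followed by one unsigned byte per pixel or label.

```python
import gzip
import struct


def _open_idx(path):
    # The MNIST files are distributed gzipped; accept both forms.
    return gzip.open(path, "rb") if str(path).endswith(".gz") else open(path, "rb")


def read_idx_images(path, limit=None):
    """Return up to `limit` images; each image is a list of rows of 0-255 ints."""
    with _open_idx(path) as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"unexpected magic number {magic} for an image file")
        if limit is not None:
            count = min(count, limit)
        data = f.read(count * rows * cols)
    size = rows * cols
    return [
        [list(data[i * size + r * cols: i * size + (r + 1) * cols]) for r in range(rows)]
        for i in range(count)
    ]


def read_idx_labels(path, limit=None):
    """Return up to `limit` labels as a list of ints."""
    with _open_idx(path) as f:
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"unexpected magic number {magic} for a label file")
        if limit is not None:
            count = min(count, limit)
        return list(f.read(count))
```

For the first five training images from the question's folder, this would be read_idx_images("/home/MNIST/Data/train-images-idx3-ubyte.gz", limit=5); the nested lists can then be handed to any plotting library.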

how to print tensorflow graph to svg or png file?

I know that I can use TensorBoard to visualize graphs. But in some cases I may not want to open TensorBoard. Is there a way to directly generate a PNG or SVG file to visualize a TensorFlow graph? Thanks.
import tensorflow as tf
g = tf.Graph()
with g.as_default():
    x = tf.constant(3.)
import pprint
pprint.pprint(tf.get_default_graph())

How can we visualize clustered multidimensional data?

I have a dataset with 100+ dimensions and I used a precomputed correlation matrix as the distance metric.
from sklearn import metrics
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
# With affinity='precomputed', fit() expects a square matrix of similarities
af = AffinityPropagation(affinity='precomputed').fit(my_distanceMetric_as_correlationMatrix)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
Now I can see the data in different clusters but I would like to visualize these clusters. So I request your support.
You can download the .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#cvxopt
(ctrl-f for scikit-learn and choose the appropriate version.)
Place the downloaded file in your current working directory, and install using
pip install filename
In my case the filename is scikit_learn-0.18.1-cp27-cp27m-win_amd64.whl

external dataset learning in python for machine learning

Hi, I want to classify a dataset using a Naive Bayes classifier. For that I want to use an external dataset that I downloaded from Google. The dataset contains two folders, one for positive reviews and one for negative reviews. Each folder contains 1000 .txt files. How do I import these files into my code as a training dataset in Python? I am new to machine learning, so I have very little idea about this. Please help me out.
You can use os.listdir, from (https://docs.python.org/2/library/os.html), e.g.:
import os
fileList = os.listdir('train_directory')
dataset = []
for file in fileList:
    # add content of file to dataset.
    with open(os.path.join('train_directory', file)) as f:
        dataset.append(f.read())
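For the layout described in the question (one folder per class, each full of .txt files), the same idea can be extended to return the texts together with their labels. This is only a sketch: the function name load_reviews and the folder names "pos" and "neg" are assumptions, since the question does not give the actual names.

```python
from pathlib import Path


def load_reviews(root_dir, labels=("pos", "neg")):
    """Return (texts, targets) read from root_dir/<label>/*.txt."""
    texts, targets = [], []
    for label in labels:
        # sorted() keeps the order of the files reproducible between runs.
        for txt_file in sorted((Path(root_dir) / label).glob("*.txt")):
            # errors="ignore" skips bytes that are not valid UTF-8.
            texts.append(txt_file.read_text(encoding="utf-8", errors="ignore"))
            targets.append(label)
    return texts, targets
```

The two lists line up index by index, so they can be passed to whichever feature extractor and Naive Bayes implementation is being used.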

Resources