gdal cannot read raster file in dbfs - databricks

I'm trying some simple code to read a TIFF file using GDAL in Databricks, but the output is NoneType. Is it a DBFS limitation?
from osgeo import gdal
filepath = "dbfs:/FileStore/tables/myraster.tif"
raster = gdal.Open(filepath)
I get the below error:
RuntimeError: dbfs:/FileStore/tables/myraster.tif: No such file or directory
I have tested reading the file as an image and it works fine:
dfRaw = spark.read.format("image").load(filepath)

Because GDAL uses the local file API rather than Spark, it can't work with dbfs:/ URLs. Instead, replace dbfs: with /dbfs if you're on a full Databricks workspace:
from osgeo import gdal
filepath = "/dbfs/FileStore/tables/myraster.tif"
raster = gdal.Open(filepath)
Or, on Community Edition, copy the file to the driver's local disk first with dbutils.fs.cp, as sketched below.
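A minimal sketch of that workaround, assuming the same file as above and an arbitrary /tmp destination (dbutils is available by default in Databricks notebooks):
from osgeo import gdal

# copy from DBFS to the driver's local disk, then open via the local file API;
# /tmp/myraster.tif is an arbitrary destination chosen for this sketch
dbutils.fs.cp("dbfs:/FileStore/tables/myraster.tif", "file:/tmp/myraster.tif")
raster = gdal.Open("/tmp/myraster.tif")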

Related

How to import library from a given path in Azure Databricks?

I want to import a standard library (of a given version) in a Databricks notebook job. I do not want to install the library every time a job cluster is created for this job. Instead, I want to install the library in a DBFS location and import it directly from there (by changing sys.path or something similar).
This works locally:
I installed a library in a given path using:
pip install --target=customLocation library==major.minor
Append the custom location to the sys.path variable:
sys.path.insert(0, 'customLocation')
When I import the library and check its location, I get the expected response:
import library
print(library.__file__)
#Output - customLocation\library\__init__.py
However, the exact same sequence does not work in Databricks:
I installed the library in a DBFS location:
%pip install --target='/dbfs/customFolder/' numpy==1.19.4
Appended the folder to the sys.path variable:
sys.path.insert(0, '/dbfs/customFolder/')
Then checked the numpy version and file location:
import numpy
print(numpy.__version__)
print(numpy.__file__)
#Output - 1.21.4 (Databricks Runtime 10.4 default numpy)
#         /dbfs/customFolder/numpy/__init__.py
The customFolder holds version 1.19.4, and the imported numpy reports that location, yet the version number does not match. Why? How exactly do imports work in Databricks to produce this behaviour?
I also tried importing with importlib, following https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly, and the result remains the same:
import importlib.util
import sys
spec = importlib.util.spec_from_file_location('numpy', '/dbfs/customFolder/numpy/__init__.py')
module = importlib.util.module_from_spec(spec)
sys.modules['numpy'] = module
spec.loader.exec_module(module)
import numpy
print(numpy.__version__)
print(numpy.__file__)
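One thing worth ruling out, offered as a hedged diagnostic sketch rather than a definitive answer: Python never re-imports a module that is already in sys.modules, so a sys.path change made after the runtime has imported numpy has no effect until the cached entry is dropped.
import sys

print('numpy' in sys.modules)  # True means the runtime already imported numpy

# drop any cached numpy modules so the next import honours sys.path again
for name in list(sys.modules):
    if name == 'numpy' or name.startswith('numpy.'):
        del sys.modules[name]

sys.path.insert(0, '/dbfs/customFolder/')
import numpy
print(numpy.__version__, numpy.__file__)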
Related:
How to install Python Libraries in a DBFS Location and access them through Job Clusters (without installing it for every job)?

How to convert a CSV file to Parquet with csv2parquet in Python, without using Spark or Pandas

I am new to Python. My current requirement is to convert a CSV file to Parquet format using the csv2parquet package.
I referred to https://pypi.org/project/csv2parquet/ but did not get much clarification. Can anyone help me?
Thanks in advance.
After resolving some issues, I used this code to convert a simple CSV file to Parquet format. It works for me.
First, install the csv2parquet Python package on your system:
pip install csv2parquet
Sample CSV file:
employees_detail.csv
Python Code:
from subprocess import run

# invoke the csv2parquet command-line tool on the source CSV
command = 'csv2parquet "C:\\Users\\Dhandapani Sudhakar\\Desktop\\employees_detail.csv"'
run(command)
After the conversion, the resulting Parquet file is:
employees_detail.parquet
Or you can execute the command directly in the command prompt:
(csvtoparquetenv) A:\POCS\Project_Envs\csvtoparquetenv>csv2parquet "C:\Users\Dhandapani Sudhakar\Desktop\employees_detail.csv"
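If you would rather stay in Python than shell out, a minimal alternative sketch uses pyarrow directly (the library csv2parquet builds on); the file names are taken from the example above:
import pyarrow.csv as pv
import pyarrow.parquet as pq

# read the CSV into an Arrow table, then write it back out as Parquet
table = pv.read_csv('employees_detail.csv')
pq.write_table(table, 'employees_detail.parquet')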

How to access files downloaded from kaggle into a Colaboratory notebook?

I am having some difficulty manipulating multiple files downloaded to the /content directory in a Colaboratory notebook. So far, I have successfully downloaded and extracted a Kaggle dataset with the following code:
!kaggle datasets download -d iarunava/cell-images-for-detecting-malaria -p /content
!unzip cell-images-for-detecting-malaria.zip
I was also able to use Pillow to import a single file from the dataset into my Colaboratory session (I obtained the filename from the output produced during the extraction):
from PIL import Image
img = Image.open('cell_images/Uninfected/C96P57ThinF_IMG_20150824_105445_cell_139.png')
How can I access multiple extracted files from /content without knowing their names in advance?
Thank you!
After some further experimentation, I found that the Python os module works in Colab notebooks just as it does on a local computer. For example, in a Colab notebook the command
os.getcwd()
returns '/content' as an output.
Also, the command os.listdir() returns the names of all the files I downloaded and extracted.
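Putting the two together, a small sketch (assuming the extraction layout from the question; img_dir is just a local name introduced here) that opens every PNG without knowing the names in advance:
import os
from PIL import Image

img_dir = 'cell_images/Uninfected'
for name in os.listdir(img_dir):  # filenames discovered at runtime
    if name.endswith('.png'):
        img = Image.open(os.path.join(img_dir, name))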
You can also use glob: glob.glob(pattern) matches all files that fit the pattern. For example, the code below collects the paths of all the .png files in img_dir.
import glob
import os
import numpy as np

png = glob.glob(os.path.join(img_dir, '*.png'))  # every .png path under img_dir
png = np.array(png)
png will then contain an array of filenames.
In your case you can use:
png = glob.glob('cell_images/Uninfected/*.png')
png = np.array(png)

Trying to import a .csv file in pandas using Python, getting a UnicodeDecodeError

I am trying to import a .csv file into pandas but I am getting a Unicode error. I am running a Windows PC.
I am using the following command:
medals = pd.read_csv('C:\\Users\\Username\\Downloads\\data\\olympicmedals.csv')
What am I missing here? This is just to import a .csv file into my Jupyter notebook.
Try this:
medals = pd.read_csv(r"C:\Users\Username\Downloads\data\olympicmedals.csv")
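If fixing the path string alone does not help, a UnicodeDecodeError usually means the file is not UTF-8. A hedged sketch passing an explicit encoding; cp1252 is only a common guess for Windows-produced CSVs, and the right codec depends on how the file was saved:
import pandas as pd

# cp1252 (Windows-1252) is an assumption; swap in whatever encoding the file uses
medals = pd.read_csv(r"C:\Users\Username\Downloads\data\olympicmedals.csv",
                     encoding="cp1252")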

Change executable option after Python's unzipping

I have a small problem with a file's executable permission after using extractall.
This is my part of code:
import os
import zipfile

archive = zipfile.ZipFile(path_to_local_folder + file, 'r')
archive.extractall(path_to_local_folder)  # note: extractall does not preserve permission bits
os.remove(path_to_local_folder + file)
My zip archive contains a file that should keep its executable permission. Of course I could explicitly mark the file as executable after extraction, but I don't think that's a good idea.
P.S. I use macOS and Python 3.5.
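For reference, ZipFile.extractall does not restore Unix permission bits. A hedged sketch that extracts entry by entry and reapplies the mode stored in each entry's external_attr (the high 16 bits hold the Unix mode when the archive was built on a Unix-like system):
import os
import zipfile

archive_path = path_to_local_folder + file  # same variables as in the question
with zipfile.ZipFile(archive_path, 'r') as archive:
    for info in archive.infolist():
        extracted = archive.extract(info, path_to_local_folder)
        mode = info.external_attr >> 16  # Unix permissions, if present
        if mode:
            os.chmod(extracted, mode)
os.remove(archive_path)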
