Convert Databricks notebooks (.dbc) into standard .py files - databricks

I often get sent Databricks notebooks from various sources to move around, look at, or refactor. Because they come from different tenancies, I can't log into the original environment. They are usually sent as .dbc files, and I can convert them by opening a new Databricks environment, importing them, and re-saving them as .py files. Is there a way to do this from the command line, like nbconvert for Jupyter?
It's a bit of a pain to import a whole host of files and then re-export them to Python just for the sake of reading code.
Source control is not always an option due to permissions.

Import the .dbc in your Databricks workspace, for example in the Shared directory.
Then, as suggested by Carlos, install the Databricks CLI on your local computer and set it up.
pip install databricks-cli
databricks configure --token
and run the following to export the notebooks as .py files into your local folder:
mkdir export_notebooks
cd export_notebooks
databricks workspace export_dir /Shared ./
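For reading code without any workspace at all, a .dbc can sometimes be unpacked locally. The sketch below rests on an assumption that the answer above does not state and that I have not seen formally documented: that a .dbc file is a zip archive whose entries are JSON documents with a commands list, where each item holds the cell text under command and its order under position. The function name dbc_to_sources is hypothetical.

```python
import io
import json
import zipfile

# Separator Databricks uses between cells in its .py source format.
CELL_SEPARATOR = "\n\n# COMMAND ----------\n\n"


def dbc_to_sources(dbc_bytes):
    """Return {archive_entry_name: source_text} for each notebook in a .dbc.

    ASSUMPTION: the archive is a zip of JSON notebooks shaped like
    {"commands": [{"command": "...", "position": 1.0}, ...]}.
    Entries that do not match this shape are skipped.
    """
    sources = {}
    with zipfile.ZipFile(io.BytesIO(dbc_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # directory entry
            try:
                notebook = json.loads(archive.read(name).decode("utf-8"))
            except (UnicodeDecodeError, ValueError):
                continue  # not a JSON notebook
            if not isinstance(notebook, dict) or "commands" not in notebook:
                continue
            cells = sorted(notebook["commands"], key=lambda c: c.get("position", 0))
            body = CELL_SEPARATOR.join(c.get("command", "") for c in cells)
            sources[name] = "# Databricks notebook source\n" + body + "\n"
    return sources
```

To use it, read the file with open("notebooks.dbc", "rb").read(), pass the bytes to dbc_to_sources, and write each returned string to a .py file of your choosing. Magic cells (%sql, %md) come out as plain text here rather than as # MAGIC comments.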

Related

Databricks cli - dbfs commands to copy files

I'm working on the deployment of the Purview ADB Lineage Solution Accelerator. In step 3 of the "Install OpenLineage on Your Databricks Cluster" section, the author asks you to run the following in PowerShell to upload the init script and jar to DBFS using the Databricks CLI:
dbfs mkdirs dbfs:/databricks/openlineage
dbfs cp --overwrite ./openlineage-spark-*.jar dbfs:/databricks/openlineage/
dbfs cp --overwrite ./open-lineage-init-script.sh dbfs:/databricks/openlineage/open-lineage-init-script.sh
Question: Do I understand the above code correctly, as follows? If not, I would like to know exactly what the code does before running it.
1. The first line creates a folder openlineage in the root directory of DBFS.
2. It's assumed that you are running the PowerShell commands from the location where the .jar and open-lineage-init-script.sh are located.
3. The second and third lines copy the .jar and .sh files from your local directory to dbfs:/databricks/openlineage/ in the Databricks DBFS.
1. dbfs mkdirs is the equivalent of UNIX mkdir -p, i.e. under the DBFS root it creates a folder named databricks, and inside it another folder named openlineage, and it does not complain if these directories already exist.
2 and 3. Yes. Paths not prefixed with dbfs:/ refer to your local filesystem. Note that you can copy from DBFS to local or vice versa, or between two DBFS locations, just not between two local paths.
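To make the two rules in this answer concrete, here is a small local illustration in Python. It does not call the Databricks CLI; os.makedirs with exist_ok=True stands in for the mkdir -p behaviour, and the helper names is_dbfs and copy_direction are hypothetical, written only to mirror the prefix rule described above.

```python
import os
import tempfile

DBFS_PREFIX = "dbfs:/"


def is_dbfs(path):
    """A path is treated as DBFS only if it carries the dbfs:/ prefix."""
    return path.startswith(DBFS_PREFIX)


def side(path):
    return "dbfs" if is_dbfs(path) else "local"


def copy_direction(src, dst):
    """Describe a copy, rejecting the local-to-local case the answer rules out."""
    if not is_dbfs(src) and not is_dbfs(dst):
        raise ValueError("at least one side must be a dbfs:/ path")
    return side(src) + " -> " + side(dst)


# Local analogue of `dbfs mkdirs dbfs:/databricks/openlineage`:
# both levels are created, and repeating the call is not an error.
base = tempfile.mkdtemp()
target = os.path.join(base, "databricks", "openlineage")
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)

direction = copy_direction(
    "./open-lineage-init-script.sh",
    "dbfs:/databricks/openlineage/open-lineage-init-script.sh",
)
```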

Can't Access /dbfs/FileStore using shell commands in databricks runtime version 7

In databricks runtime version 6.6 I am able to successfully run a shell command like the following:
%sh ls /dbfs/FileStore/tables
However, in runtime version 7, this no longer works. Is there any way to directly access /dbfs/FileStore in runtime version 7? I need to run commands to unzip a parquet zip file in /dbfs/FileStore/tables. This used to work in version 6.6 but databricks new "upgrade" breaks this simple core functionality.
Not sure if this matters but I am using the community edition of databricks.
When you run %sh ls /dbfs/FileStore/tables on Databricks Runtime 7, the command fails because the folder /dbfs/FileStore does not exist in DBFS by default.
Try uploading some files to /dbfs/FileStore/tables.
Then run %sh ls /dbfs/FileStore/tables again; you should now see results, because the upload created the /dbfs/FileStore/tables folder.
The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
You can work around this limitation by working with files on the driver node and uploading or downloading them with the dbutils.fs.cp command (docs). Your code will look like this:
# write a file to the local filesystem using Python I/O APIs
...
# upload file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/dbfs_file.txt')
and reading from DBFS will look like this:
# copy the file from DBFS to the local filesystem
dbutils.fs.cp('dbfs:/tmp/dbfs_file.txt', 'file:/tmp/local-path')
# read the file locally
...
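Applied to the unzip task from the question, the same copy, process, copy-back pattern might look like the sketch below. The function name unzip_via_driver and its parameters are hypothetical; dbutils is passed in as an argument so the function can be defined outside a notebook, and the only dbutils call used is dbutils.fs.cp with an optional third recurse argument.

```python
import os
import tempfile
import zipfile


def unzip_via_driver(dbutils, dbfs_zip, dbfs_out_dir):
    """Unzip a DBFS archive by staging it on the driver's local disk.

    dbfs_zip and dbfs_out_dir are dbfs:/ paths (hypothetical parameter names).
    Returns the local directory the archive was extracted into.
    """
    work_dir = tempfile.mkdtemp()
    local_zip = os.path.join(work_dir, "archive.zip")
    local_out = os.path.join(work_dir, "extracted")

    # DBFS -> driver: the file: prefix marks the driver's local filesystem.
    dbutils.fs.cp(dbfs_zip, "file:" + local_zip)

    # Ordinary Python I/O works on the local copy.
    with zipfile.ZipFile(local_zip) as archive:
        archive.extractall(local_out)

    # Driver -> DBFS: third argument True copies the directory recursively.
    dbutils.fs.cp("file:" + local_out, dbfs_out_dir, True)
    return local_out
```

In a notebook this would be called as unzip_via_driver(dbutils, "dbfs:/FileStore/tables/data.zip", "dbfs:/FileStore/tables/data"), where both paths are placeholders.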
I know this question is a year old, but I wanted to share other posts that I found helpful in case someone has the same question.
I found the comments on this similar question helpful: How to access DBFS from shell?. Those comments also reference Not able to cat dbfs file in databricks community edition cluster. FileNotFoundError: [Errno 2] No such file or directory:, which I found helpful as well.
I learned that ls /dbfs/FileStore/tables is not possible in Community Edition because DBFS itself is not mounted on the nodes and the feature is disabled.

Pyspark list files by file types in a directory

I want to list files by file type in a directory. The directory contains .csv, .pdf, and other file types, and I want to list all the .csv files.
I am using the following command
dbutils.fs.ls("/mnt/test-output/*.csv")
I am expecting to get the list of all csv files in that directory.
I am getting the following error in databricks
java.io.FileNotFoundException: No such file or directory: /test-output/*.csv
Try using a shell cell with %sh. You can access DBFS and the mnt directory from there, too.
%sh
ls /dbfs/mnt/*.csv
Should get you a result like
/dbfs/mnt/temp.csv
%fs is a shortcut to dbutils and its access to the file system. dbutils doesn't support all unix shell functions and syntax, so that's probably the issue you ran into. Notice also how when running the %sh cell we access DBFS with /dbfs/.
I think you're mixing up DBFS with the local file system. Where is /mnt/test-output/*.csv?
If you're trying to read from DBFS then it will work.
Can you try running dbutils.fs.ls("/") to ensure that /mnt exists in DBFS?
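Since dbutils.fs.ls does not expand wildcards, another option is to list the directory without the pattern and filter the result in Python. A minimal sketch follows; filter_by_suffix is a hypothetical helper, and the FileInfo namedtuple is only a stand-in for the objects dbutils.fs.ls returns (which expose path and name attributes), so the example runs outside a notebook.

```python
from collections import namedtuple


def filter_by_suffix(entries, suffix):
    """Return the paths of listing entries whose name ends with suffix."""
    suffix = suffix.lower()
    return [e.path for e in entries if e.name.lower().endswith(suffix)]


# Stand-in for the result of dbutils.fs.ls("/mnt/test-output").
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])
listing = [
    FileInfo("dbfs:/mnt/test-output/a.csv", "a.csv", 10),
    FileInfo("dbfs:/mnt/test-output/b.pdf", "b.pdf", 20),
    FileInfo("dbfs:/mnt/test-output/C.CSV", "C.CSV", 30),
]
csv_files = filter_by_suffix(listing, ".csv")
```

In a notebook the call would be filter_by_suffix(dbutils.fs.ls("/mnt/test-output"), ".csv").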

Unable to import Pandas in AWS Lambda

I am new to AWS Lambda and I want to run code on Lambda for a machine learning API. In summary, I want to run two functions on Lambda: one that reads some CSV files into a pandas DataFrame and searches it, and another that runs some pickled machine learning models in response to requests from a Flask application. To do this, I need to import pandas, joblib, and possibly scikit-learn, in builds that are compatible with Amazon Linux. I am using a Windows machine.
In general, I am using Lambda layers, uploaded as zip files. Since Lambda has a pre-built layer with SciPy and NumPy, I will not bundle them myself; if I did, I would exceed Lambda's layer size limit anyway.
To be more specific, I have done the following:
Downloaded and extracted linux-compatible versions of the libraries listed above. For example: From this link I have downloaded "pandas-0.25.0-cp35-cp35m-manylinux1_x86_64.whl" and unzipped to a folder.
The unzipped libraries are in the following directory:
lambda_layers\python\lib\python3.7\site-packages
They are zipped into a file and uploaded onto S3 Bucket for creating a layer.
I imported the packages:
import json
import boto3
import pandas as pd
I got the following error from Lambda:
{
"errorMessage": "Unable to import module 'lambda_function': C extension: No module named 'pandas._libs.tslibs.conversion' not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.",
"errorType": "Runtime.ImportModuleError"
}
The folder structure should follow the standard layout. You can also use Docker to create the zipped, Linux-compatible library and upload it as an AWS Lambda layer. Below are the tested commands to create the zipped library for an AWS Lambda layer:
Create and navigate to a directory:
$ mkdir aws1
$ cd aws1
Write the commands below into a Dockerfile and exit with CTRL + D:
$ cat > Dockerfile
FROM amazonlinux:2017.03
RUN yum -y install git \
python36 \
python36-pip \
zip \
&& yum clean all
RUN python3 -m pip install --upgrade pip \
&& python3 -m pip install boto3
You can use any name for the image:
$ docker build -t pythn1/lambda .
Run the image:
$ docker run --rm -it -v ${PWD}:/var/task pythn1/lambda:latest bash
List the packages you want to zip in requirements.txt and exit with CTRL + D:
$ cat > requirements.txt
pandas
scikit-learn
You can try using the correct folder structure (/python/lib/python3.6/site-packages/) here, but I have not tested it yet:
$ pip install -r requirements.txt -t /usr/lib/python3.6/dist-packages/
Navigate to the directory below:
$ cd /var/task
Create a zip file :
$ zip -r ./layers.zip /usr/lib/python3.6/dist-packages/
You should now see a layers.zip file in the aws1 folder. If you use the correct folder structure when installing, the steps below are not required. With the folder structure I used, however, the following commands are required:
Unzip layers.zip.
Exit Docker or open a new terminal and navigate to the folder where you unzipped the file. The unzipped files will be in the folder structure usr/lib/python3.6/dist-packages/.
Copy these files to the correct folder structure:
$ mkdir -p ./python/lib/python3.6/site-packages
$ cp -r ./usr/lib/python3.6/dist-packages/. ./python/lib/python3.6/site-packages/
Zip them again :
$ zip -r ./lib_python.zip ./python
Upload the zip file as a layer and add that layer to your Lambda function. Also, make sure you select the right runtime when creating the layer.
According to this document (https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html#configuration-layers-path), you should zip the python\lib\python3.7\site-packages\pandas folder (and those of the other dependencies) for your Python layer.
Make sure you add the layer to your function and follow the documentation for the right permissions.
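If zipping by hand on Windows is error-prone, the layout can also be produced programmatically. The sketch below is one possible way to do it, not something from the answers above: build_layer_zip is a hypothetical helper that takes a folder of already Linux-compatible packages and writes them into a zip under the python/lib/pythonX.Y/site-packages prefix, using forward slashes regardless of the host OS.

```python
import os
import zipfile


def build_layer_zip(packages_dir, zip_path, python_version="3.7"):
    """Zip the contents of packages_dir under the Lambda layer prefix.

    packages_dir: folder holding e.g. pandas/, joblib/ (hypothetical input).
    zip_path: output file; keep it outside packages_dir so it is not
    added to itself.
    """
    prefix = "python/lib/python{}/site-packages".format(python_version)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(packages_dir):
            for filename in files:
                full_path = os.path.join(root, filename)
                relative = os.path.relpath(full_path, packages_dir)
                # Zip entries always use "/" as the separator.
                archive.write(full_path, prefix + "/" + relative.replace(os.sep, "/"))
    return zip_path
```

This only fixes the directory layout. The packages themselves still have to be built for Linux and for the same Python version as the Lambda runtime.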
I appreciate the answers that were given; I am just posting my own answer (which I found after a whole day of looking) here for reference.
I followed this guide and also this guide.
In summary, the steps I took are:
1. Connect to my Amazon EC2 instance (running Linux) through ssh. I wanted to deploy an application on Beanstalk, so it was already up for me anyway.
2. Follow the steps in the first guide to install Python 3.7.
3. Follow the steps in the second guide to install the libraries. One of the key notes is not to install with pip install -t, since that leads to the libraries and their C extensions not being built.
4. Zip the directory found in python\lib\python3.7\site-packages\ as mentioned by the answers here (although I did follow the directory guide in my first attempts).
5. Get the file from the EC2 instance through FileZilla.
6. Follow the Lambda layers guide and it is done.
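One detail in the question may be worth checking before rebuilding anything: the downloaded wheel is named pandas-0.25.0-cp35-cp35m-manylinux1_x86_64.whl, while the layer path targets python3.7. A wheel's file name encodes the interpreter it was compiled for, so a mismatch like this could produce the same "C extension ... not built" error. The sketch below parses that tag; the function names are hypothetical, and the check is deliberately naive (it ignores abi3 and pure-Python wheels).

```python
def wheel_python_tag(wheel_filename):
    """Return the python tag from a wheel file name.

    Wheel names follow name-version[-build]-python-abi-platform.whl,
    so the python tag is always third from the end.
    """
    if not wheel_filename.endswith(".whl"):
        raise ValueError("not a wheel file name: " + wheel_filename)
    parts = wheel_filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel file name: " + wheel_filename)
    return parts[-3]


def matches_runtime(wheel_filename, runtime_version):
    """True if the wheel's CPython tag equals the runtime, e.g. '3.7' -> 'cp37'.

    Naive on purpose: only exact cpXY matches count.
    """
    expected = "cp" + runtime_version.replace(".", "")
    return expected in wheel_python_tag(wheel_filename).split(".")


question_wheel = "pandas-0.25.0-cp35-cp35m-manylinux1_x86_64.whl"
ok_for_37 = matches_runtime(question_wheel, "3.7")
```

Here ok_for_37 is False, because the tag is cp35 rather than cp37.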

Google Cloud Platform API for Python and AWS Lambda Incompatibility: Cannot import name 'cygrpc'

I am trying to use Google Cloud Platform (specifically, the Vision API) for Python with AWS Lambda. Thus, I have to create a deployment package for my dependencies. However, when I try to create this deployment package, I get several compilation errors, regardless of the version of Python (3.6 or 2.7). Considering the version 3.6, I get the issue "Cannot import name 'cygrpc'". For 2.7, I get some unknown error with the .path file. I am following the AWS Lambda Deployment Package instructions here. They recommend two options, and both do not work / result in the same issue. Is GCP just not compatible with AWS Lambda for some reason? What's the deal?
Neither Python 3.6 nor 2.7 work for me.
NOTE: I am posting this question here to answer it myself because it took me quite a while to find a solution, and I would like to share my solution.
TL;DR: You cannot build the deployment package on your Mac or whatever PC you use. You have to build it on the same OS/"setup" that AWS Lambda uses to run your code. To do this, you have to use EC2.
I will provide here an answer on how to get Google Cloud Vision working on AWS Lambda for Python 2.7. This answer is potentially extendable to other APIs and other programming languages on AWS Lambda.
My journey to a solution began with this initial posting on GitHub by others who had the same issue. One solution someone posted was:
I had the same issue "cannot import name 'cygrpc'" while running the lambda. Solved it with pip install google-cloud-vision in the AMI amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 instance and exported the lib/python3.6/site-packages to aws lambda. Thank you @tseaver
This is partially correct, unless I read it wrong, but regardless it led me on the right path. You will have to use EC2. Here are the steps I took:
1. Set up an EC2 instance by going to EC2 on Amazon. Do a quick read about AWS EC2 if you have not already. Set one up for amzn-ami-hvm-2018.03.0.20180811-x86_64-gp2 or something along those lines (i.e. the most up-to-date one).
2. Get your EC2 .pem file. Go to your Terminal. cd into the folder where your .pem file is. ssh into your instance using
ssh -i "your-file-name-here.pem" ec2-user@ec2-ip-address-here.compute-1.amazonaws.com
3. Create the following folders on your instance using mkdir: google-cloud-vision, protobuf, google-api-python-client, httplib2, uritemplate, google-auth-httplib2.
4. On your EC2 instance, cd into google-cloud-vision. Run the command:
pip install google-cloud-vision -t .
Note: If you get "bash: pip: command not found", then enter "sudo easy_install pip" (source).
5. Repeat step 4 with the following packages, cd'ing into the respective folder each time: protobuf, google-api-python-client, httplib2, uritemplate, google-auth-httplib2.
6. Copy each folder to your computer. You can do this using the scp command. In a Terminal window on your own machine (not on your EC2 instance, and not the window you used to access it), run the command below. The example is for the "google-cloud-vision" folder, but repeat it for every folder:
sudo scp -r -i your-pem-file-name.pem ec2-user@ec2-ip-address-here.compute-1.amazonaws.com:~/google-cloud-vision ~/Documents/your-local-directory/
7. Stop your EC2 instance from the AWS console so you don't get overcharged.
8. For your deployment package, you will need a single folder containing all your modules and your Python scripts. To begin combining all of the modules, create an empty folder titled "modules." Copy and paste all of the contents of the "google-cloud-vision" folder into the "modules" folder. Now place only the folder titled "protobuf" from the "protobuf" (sic) main folder in the "Google" folder of the "modules" folder. Also from the "protobuf" main folder, paste the Protobuf .pth file and the -info folder into the Google folder.
9. For each module after protobuf, copy and paste into the "modules" folder the folder titled with the module name, the .pth file, and the "-info" folder.
10. You now have all of your modules properly combined (almost). To finish the combination, remove these two files from your "modules" folder: googleapis_common_protos-1.5.3-nspkg.pth and google_cloud_vision-0.34.0-py3.6-nspkg.pth. Copy and paste everything in the "modules" folder into your deployment package folder. Also, if you're using GCP, paste in your .json credentials file as well.
11. Finally, put your Python scripts in this folder, zip the contents (not the folder), upload to S3, paste the link into your AWS Lambda function, and get going!
If something here doesn't work as described, please forgive me and either message me or feel free to edit my answer. Hope this helps.
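The manual copy-and-paste in the combining steps could also be scripted. The sketch below is a simplification rather than an exact reproduction of those steps: it merges every downloaded package folder wholesale into one target folder and skips the two .pth files the answer says to remove. merge_package_dirs is a hypothetical name, dirs_exist_ok requires Python 3.8 or newer on the machine doing the merge, and I have not verified that the merged result imports cleanly on Lambda.

```python
import os
import shutil

# The two files the answer above says to remove from the "modules" folder.
SKIPPED_FILES = (
    "googleapis_common_protos-1.5.3-nspkg.pth",
    "google_cloud_vision-0.34.0-py3.6-nspkg.pth",
)


def merge_package_dirs(source_dirs, target_dir, skip=SKIPPED_FILES):
    """Merge several `pip install -t` folders into one target folder.

    Shared directories such as google/ are merged rather than replaced;
    files with the same relative path are overwritten by later sources.
    """
    os.makedirs(target_dir, exist_ok=True)
    ignore = shutil.ignore_patterns(*skip)
    for source in source_dirs:
        shutil.copytree(source, target_dir, ignore=ignore, dirs_exist_ok=True)
    return target_dir
```

A call might look like merge_package_dirs(["google-cloud-vision", "protobuf", "httplib2"], "modules"), with the folder names taken from the steps above.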
Building off the answer from @Josh Wolff (thanks a lot, btw!), this can be streamlined a bit by using a Docker image for Lambdas that Amazon makes available.
You can either bundle the libraries with your project source or, as I did below in a Makefile script, upload it as an AWS layer.
layer:
	set -e ;\
	docker run -v "$(PWD)/src":/var/task "lambci/lambda:build-python3.6" /bin/sh -c "rm -R python; pip install -r requirements.txt -t python/lib/python3.6/site-packages/; exit" ;\
	pushd src ;\
	zip -r my_lambda_layer.zip python > /dev/null ;\
	rm -R python ;\
	aws lambda publish-layer-version --layer-name my_lambda_layer --description "Lambda layer" --zip-file fileb://my_lambda_layer.zip --compatible-runtimes "python3.6" ;\
	rm my_lambda_layer.zip ;\
	popd ;
The above script will:
Pull the Docker image if you don't have it yet (above uses Python 3.6)
Delete the python directory (only useful when running a second time)
Install all requirements to the python directory, created in your project's /src directory
ZIP the python directory
Upload the AWS layer
Delete the python directory and zip file
Make sure your requirements.txt file includes the modules listed above by Josh: google-cloud-vision, protobuf, google-api-python-client, httplib2, uritemplate, google-auth-httplib2
There's a fast solution that doesn't require much coding.
Cloud9 runs on an Amazon Linux AMI, so using pip in its virtual environment should make this work.
I created a Lambda from the Cloud9 UI and, from the console, activated the venv for the EC2 machine. I then installed google-cloud-speech with pip. That was enough to fix the issue.
I was facing the same error using the google-ads API.
{
"errorMessage": "Unable to import module 'lambda_function': cannot import name 'cygrpc' from 'grpc._cython' (/var/task/grpc/_cython/__init__.py)",
"errorType": "Runtime.ImportModuleError",
"stackTrace": []
}
My Lambda runtime was Python 3.9 and the architecture was x86_64.
If somebody encounters a similar ImportModuleError, see my answer here: Cannot import name 'cygrpc' from 'grpc._cython' - Google Ads API

Resources