Can't Access /dbfs/FileStore using shell commands in databricks runtime version 7 - databricks

In databricks runtime version 6.6 I am able to successfully run a shell command like the following:
%sh ls /dbfs/FileStore/tables
However, in runtime version 7 this no longer works. Is there any way to directly access /dbfs/FileStore in runtime version 7? I need to run commands to unzip a parquet zip file in /dbfs/FileStore/tables. This used to work in version 6.6, but Databricks' new "upgrade" breaks this simple core functionality.
Not sure if this matters, but I am using the Community Edition of Databricks.

When you run %sh ls /dbfs/FileStore/tables, you can't access /dbfs/FileStore because, by default, the folder '/dbfs/FileStore' does not exist in DBFS.
Try uploading some files to '/dbfs/FileStore/tables'.
Now run the same command again, %sh ls /dbfs/FileStore/tables, and you will see results because data has been uploaded into the /dbfs/FileStore/tables folder.
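For example, a minimal sketch (the file name example.txt and its contents are just assumptions) that creates a file there from a Python cell and then lists the folder:
# write a small sample file into DBFS (name and contents are only examples)
dbutils.fs.put("/FileStore/tables/example.txt", "hello", True)  # True = overwrite
# list the folder again; it should now show the file
display(dbutils.fs.ls("/FileStore/tables"))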

The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
You can work around this limitation by working with files on the driver node and uploading or downloading them with the dbutils.fs.cp command (docs). Your code would look like the following:
# write a file to the local filesystem using Python I/O APIs
...
# upload file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/dbfs_file.txt')
and reading from DBFS looks like this:
# copy file from DBFS to local file_system
dbutils.fs.cp('dbfs:/tmp/dbfs_file.txt', 'file:/tmp/local-path')
# read the file locally
...
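Applied to the original unzip use case, a rough sketch of the same pattern (the zip name data.zip and the target paths are assumptions) could look like this:
import zipfile

# copy the zip from DBFS to the driver's local disk (names are examples)
dbutils.fs.cp('dbfs:/FileStore/tables/data.zip', 'file:/tmp/data.zip')

# unzip locally with Python's standard library
with zipfile.ZipFile('/tmp/data.zip') as zf:
    zf.extractall('/tmp/data')

# copy the extracted files back into DBFS
dbutils.fs.cp('file:/tmp/data', 'dbfs:/FileStore/tables/data', recurse=True)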

I know this question is a year old, but I wanted to share other posts that I found helpful in case someone has the same question.
I found the comments in this similar question to be helpful: How to access DBFS from shell?. The comments in that post also reference Not able to cat dbfs file in databricks community edition cluster. FileNotFoundError: [Errno 2] No such file or directory:, which I found helpful as well.
I learned that in Community Edition, ls /dbfs/FileStore/tables is not possible because DBFS itself is not mounted on the nodes and the feature is disabled.
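In other words, on Community Edition you have to go through dbutils (or the %fs magic, which wraps it) instead of the /dbfs path; a listing like the following cell should still work there:
%fs ls /FileStore/tables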

Related

Convert Databricks notebooks (.dbc) into standard .py files

I often get sent Databricks notebooks from various sources to move around, look at, or refactor. Due to different tenancies I can't log into the actual environment. These are usually sent as .dbc files, and I can convert them by opening up a new Databricks environment and re-saving them as .py files. I was wondering if there is a way to do this from the command line, like nbconvert for Jupyter?
It's a little bit of a pain to import a whole host of files and then re-convert them to Python just for the sake of reading code.
Source control is not always an option due to permissions.
Import the .dbc in your Databricks workspace, for example in the Shared directory.
Then, as suggested by Carlos, install the Databricks CLI on your local computer and set it up.
pip install databricks-cli
databricks configure --token
and run the following to export the notebooks as .py files into your local folder:
mkdir export_notebooks
cd export_notebooks
databricks workspace export_dir /Shared ./
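If you would rather script this from Python than the CLI, a rough sketch against the Databricks Workspace REST API does the same per-notebook export (the workspace URL, token, and notebook path below are placeholders):
import base64
import requests

HOST = "https://<your-workspace-url>"      # placeholder
TOKEN = "<personal-access-token>"          # placeholder
headers = {"Authorization": "Bearer " + TOKEN}

# export a single notebook as plain source (a .py file for Python notebooks)
resp = requests.get(
    HOST + "/api/2.0/workspace/export",
    headers=headers,
    params={"path": "/Shared/my_notebook", "format": "SOURCE"},
)
resp.raise_for_status()
with open("my_notebook.py", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))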

Pyspark: list files by file type in a directory

I want to list files by file type in a directory. The directory has .csv, .pdf, and other file types, and I want to list all the .csv files.
I am using the following command
dbutils.fs.ls("/mnt/test-output/*.csv")
I am expecting to get the list of all csv files in that directory.
I am getting the following error in databricks
java.io.FileNotFoundException: No such file or directory: /test-output/*.csv
Try using a shell cell with %sh. You can access DBFS and the mnt directory from there, too.
%sh
ls /dbfs/mnt/*.csv
Should get you a result like
/dbfs/mnt/temp.csv
%fs is a shortcut to dbutils and its access to the file system. dbutils doesn't support all Unix shell functions and syntax, so that's probably the issue you ran into. Notice also that when running the %sh cell we access DBFS via the /dbfs/ prefix.
I think you're mixing up DBFS with the local file system. Where is /mnt/test-output/*.csv?
If you're trying to read from DBFS then it will work.
Can you try running dbutils.fs.ls("/") to ensure that /mnt exists in DBFS?
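Since dbutils.fs.ls does not expand wildcards, another option is to list the directory and filter in Python (the path below is the one from the question and assumes the mount exists):
# list the mounted directory and keep only the .csv entries
files = dbutils.fs.ls("/mnt/test-output/")
csv_files = [f.path for f in files if f.path.endswith(".csv")]
for p in csv_files:
    print(p)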

Cloudera Quick Start VM lacks Spark 2.0 or greater

In order to test and learn Spark functions, developers need the latest version of Spark. APIs and methods from before version 2.0 are obsolete and no longer work in newer versions, which forces developers to install Spark manually and wastes a considerable amount of development time.
How do I use a later version of Spark on the Quickstart VM?
So that no one else wastes the setup time I did, here is the solution.
SPARK 2.2 Installation Setup on Cloudera VM
Step 1: Download a QuickStart VM from the link:
Prefer the VMware platform as it is easy to use; in any case, all the options are viable.
The entire tar file is around 5.4 GB. You need to provide a business email address, as personal email addresses are not accepted.
Step 2: The virtual environment requires around 8 GB of RAM; please allocate sufficient memory to avoid performance glitches.
Step 3: Open the terminal and switch to the root user:
su root
password: cloudera
Step 4: Cloudera provides Java version 1.7.0_67, which is old and does not meet our needs. To avoid Java-related exceptions, install Java with the following commands:
Downloading Java:
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
Switch to the /usr/java/ directory with the command "cd /usr/java/".
Copy the downloaded Java tar file to the /usr/java/ directory.
Untar it with "tar -zxvf jdk-8u131-linux-x64.tar.gz".
Open the profile file with the command "vi ~/.bash_profile".
Export JAVA_HOME to the new Java directory:
export JAVA_HOME=/usr/java/jdk1.8.0_131
Save and Exit.
To apply the above change, run the following command in the shell:
source ~/.bash_profile
The Cloudera VM provides Spark 1.6 by default. However, the 1.6 APIs are old and do not match production environments, so we need to download and manually install Spark 2.2.
Switch to the /opt/ directory with the command:
cd /opt/
Download Spark with the command:
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Untar the Spark tarball with the following command:
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
We need to define some environment variables as default settings.
Open the Spark environment file with the following command:
vi /opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh
Paste the following configurations in the file:
SPARK_MASTER_IP=192.168.50.1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
SPARK_WORKER_MEMORY=512m
SPARK_DAEMON_MEMORY=512m
Save and exit
Start Spark with the following command:
/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-all.sh
Export SPARK_HOME:
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7/
Change the permissions of the directory:
chmod 777 -R /tmp/hive
Try spark-shell; it should work.
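As a quick sanity check from Python rather than spark-shell (this assumes the findspark package is installed, e.g. with pip install findspark):
import findspark
findspark.init("/opt/spark-2.2.0-bin-hadoop2.7")  # point at the new install

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
print(spark.version)            # expect 2.2.0
print(spark.range(10).count())  # expect 10
spark.stop()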

Virtuoso Upgrade

I have a Virtuoso version 06.01.3127 installed on Ubuntu 14.04.05 LTS version (Ubuntu-server).
I would like to upgrade my Virtuoso to at least version 7.2.4.2+, which includes the GeoSpatial features that I need.
I have looked at the info provided in the following link, Virtuoso: Upgrading from Release 6.x to Release 7.x, but I have not been able to follow these steps.
To start with, the second step says to "Check the size of the .trx file, typically found alongside the .db and .ini files".
I can only find the odbc.ini and virtuoso.ini files, which are inside /virtuoso-opensource-6.1 folder.
Am I looking in the wrong place?
Does anyone have any guidance in this matter?
Thanks in advance
OpenLink Software (producer of Virtuoso, and my employer) does not force the location of any file -- so we cannot tell you exactly where to look on your host.
virtuoso.db is the default database storage file; your local file might be any *.db. This file must be present in a mounted filesystem, and should be fully identified (with full filepath) within the active *.ini file (default being virtuoso.ini).
You might have multiple virtuoso.ini and/or virtuoso.db files in different locations in your filesystem. You might try using some Linux commands, like --
find / -name virtuoso.db -ls
find / -name virtuoso.ini -ls
find / -name '*.db' -ls
find / -name '*.ini' -ls
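Once you have located the active virtuoso.ini, the storage and transaction-log paths are normally declared in its [Database] section. A small Python sketch to print them (assuming the standard DatabaseFile and TransactionFile keys and that the file parses with configparser; the ini path below is the one from the question) could be:
import configparser

# print where the active configuration keeps its .db and .trx files
cfg = configparser.ConfigParser(strict=False, interpolation=None)
cfg.read("/virtuoso-opensource-6.1/virtuoso.ini")  # adjust to your ini path
db = cfg["Database"]
print("DatabaseFile:   ", db.get("DatabaseFile"))
print("TransactionFile:", db.get("TransactionFile"))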
Installing the binary components is done by following the instructions for installation...
You can get advice from a lot of experienced Virtuoso Users on the mailing list...

How to download files from an Azure File share via a SAS (Shared Access Signature) onto a Linux machine

I have a SAS generated for a file share (with read and list privileges, no write privileges).
My SAS looks like the following:
https://test.file.core.windows.net/testf1?[some_token_here]
I used AzCopy to download the files through the above SAS onto a Windows virtual machine; however, AzCopy is not present on Linux.
How do I download the files using the above SAS onto my Linux virtual machine? (I run Ubuntu 14.04, but I'd prefer an answer that runs on most Linux distros.) I'd prefer a single line of code to carry out the task. I tried working with the Azure CLI but was unable to find any success.
PS: I am very new to Azure.
If you are using the latest Azure CLI, you might want to try:
azure storage file upload [options] [source] [share] [path]
azure storage file download [options] [share] [path] [destination]
Run azure storage file to see more help information.
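If the CLI route doesn't work out, a plain HTTPS GET of the file URL plus the SAS token is usually enough for a share with read/list permissions. A minimal Python sketch (the file name myfile.parquet is a placeholder, and the token stays as in the question):
import requests

# URL pattern: https://<account>.file.core.windows.net/<share>/<path-to-file>?<sas-token>
url = "https://test.file.core.windows.net/testf1/myfile.parquet?[some_token_here]"

resp = requests.get(url, stream=True)
resp.raise_for_status()
with open("myfile.parquet", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)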
