Read data from mount in Databricks (using Autoloader)

I am using Azure Blob Storage to store data and feed this data to Autoloader through a mount. I am looking for a way to have Autoloader pick up new files from any of the mounted containers. Let's say I have these folders in my mount:
mnt/
├─ blob_container_1
├─ blob_container_2
When I use .load('/mnt/'), no new files are detected, but when I point at a folder individually, e.g. .load('/mnt/blob_container_1'), it works fine.
I want to load files from both mount paths using Autoloader (running continuously).

You can use the path for providing prefix patterns, for example:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", <format>) \
    .schema(schema) \
    .load("<base_path>/*/files")
For example, if you would like to parse only png files within a directory that contains files with different suffixes, you can do:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "binaryFile") \
    .option("pathGlobFilter", "*.png") \
    .load(<base_path>)
Refer – https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#filtering-directories-or-files-using-glob-patterns
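Applied to the layout in the question, a minimal sketch (assuming both containers share the same file format and schema; the format below is only an example) could glob over the container folders directly:
df = (spark.readStream.format("cloudFiles")
      # assumption: JSON files; substitute your actual format
      .option("cloudFiles.format", "json")
      # schema defined elsewhere, as in the snippets above
      .schema(schema)
      # one glob that matches both /mnt/blob_container_1 and /mnt/blob_container_2
      .load("/mnt/blob_container_*"))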

Related

How to avoid PySpark partitioned output resulting in extra empty files with the same name as the partition (subfolder name)?

Here is my desired output directory:
parent_folder/
    partition=1/
        partition_inner=1/
            file1.avro
            .....
    partition=2/
        partition_inner=2/
            file1.avro
            .....
Instead, what I get is:
parent_folder/
    partition=1/
        partition_inner=1 (empty file)
        partition_inner=1/
            file1.avro
            .....
    partition=2/
        partition_inner=2 (empty file)
        partition_inner=2/
            file1.avro
            .....
    partition=1 (empty file)
    partition=2 (empty file)
Here is my PySpark write code:
mydf.write.format("com.databricks.spark.avro") \
    .partitionBy("partition", "partition_inner") \
    .mode("overwrite") \
    .save("wasbs://my-container@my-storage.blob.core.windows.net/myoutput")
What are the extra files that have the same name as the subdirectory for, and how do I get rid of them?
I am using Azure Data Factory with Blob Storage.

Extract gz files using HDFS/Spark

I have large gzip files stored in an HDFS location:
- /dataset1/sample.tar.gz --> contains 1.csv & 2.csv .... and so on
I would like to extract:
/dataset1/extracted/1.csv
/dataset1/extracted/2.csv
/dataset1/extracted/3.csv
.........................
.........................
/dataset1/extracted/1000.csv
Is there any HDFS command that can be used to extract the tar.gz file (without copying it to the local machine), or can this be done with Python/Scala Spark?
I tried using Spark, but Spark cannot parallelize reading a gzip file, and the file is very large (around 50 GB).
I want to split the gzip contents and use the pieces for Spark aggregations.
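There is no built-in HDFS command that unpacks a tar.gz in place, but one possible approach (a rough sketch, not a tested solution) is to stream the archive through a single machine without persisting it to local disk, using hdfs dfs -cat and Python's tarfile in stream mode, then push each member back into HDFS with hdfs dfs -put -; the paths below are the ones from the question:
import subprocess
import tarfile

SRC = "/dataset1/sample.tar.gz"      # archive already in HDFS (from the question)
DEST = "/dataset1/extracted"         # target HDFS directory (from the question)

# Stream the archive out of HDFS; nothing is written to local disk.
cat = subprocess.Popen(["hdfs", "dfs", "-cat", SRC], stdout=subprocess.PIPE)

# "r|gz" opens the tar as a non-seekable stream, which is what a pipe requires.
with tarfile.open(fileobj=cat.stdout, mode="r|gz") as archive:
    for member in archive:
        if not member.isfile():
            continue
        extracted = archive.extractfile(member)
        # "hdfs dfs -put -" reads the file content from stdin.
        put = subprocess.Popen(
            ["hdfs", "dfs", "-put", "-", f"{DEST}/{member.name}"],
            stdin=subprocess.PIPE,
        )
        while True:
            chunk = extracted.read(1024 * 1024)
            if not chunk:
                break
            put.stdin.write(chunk)
        put.stdin.close()
        put.wait()

cat.wait()
The extraction itself is still single-threaded (gzip cannot be split), but once the individual CSVs sit in /dataset1/extracted/ Spark can read them in parallel.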

Pyspark: Load a tar.gz file into a dataframe and filter by filename

I have a tar.gz file that contains multiple files. The hierarchy looks as below. My intention is to read the tar.gz file and filter out the contents of b.tsv, as it is static metadata, while all the other files are actual records.
gzfile.tar.gz
|- a.tsv
|- b.tsv
|- thousand more files.
With a PySpark load, I'm able to load the file into a dataframe. I used the command:
from pyspark.sql import SparkSession

spark = SparkSession.\
    builder.\
    appName("Loading Gzip Files").\
    getOrCreate()

input = spark.read.load('/Users/jeevs/git/data/gzfile.tar.gz',
                        format='com.databricks.spark.csv',
                        sep='\t')
With the intention of filtering, I added the filename:
from pyspark.sql.functions import input_file_name
input.withColumn("filename", input_file_name())
Which now generates the data like so:
|_c0 |_c1 |filename |
|b.tsv0000666000076500001440035235677713575350214013124 0ustar netsaintusers1|Lynx 2.7.1|file:///Users/jeevs/git/data/gzfile.tar.gz|
|2|Lynx 2.7|file:///Users/jeevs/git/data/gzfile.tar.gz|
Of course, the file field is populating with the tar.gz file, making that approach useless.
A more irritating problem is that _c0 is getting populated with filename + garbage + first-row values.
At this point, I'm wondering if the file read itself is getting weird because it is a tar.gz file. In the v1 of this processing (Spark 0.9), we had another step that loaded the data from S3 onto an EC2 box, extracted it, and wrote it back to S3. I'm trying to get rid of those steps.
Thanks in advance!
Databricks does not support iterating over *.tar.gz files directly. In order to process the files, they have to be unpacked into a temporary location. Databricks supports bash, which can do the job.
%sh find $source -name "*.tar.gz" -exec tar -xvzf {} -C $destination \;
The command above unpacks every file with the extension *.tar.gz under the source path into the destination location.
If the path is passed via dbutils.widgets, or is static in a %scala or %pyspark cell, it must be exposed to the %sh cell as an environment variable. This can be done in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
os.environ['destination'] = '/dbfs/mnt/dl/raw/extracted/'  # example path; $destination is referenced by the %sh cell above
Use the following to load a file, assuming the content is in a *.csv file:
DF = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .option("mode", "DROPMALFORMED") \
    .load('/mnt/dl/raw/source/sample.csv')
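To get back to the original goal of dropping b.tsv, a minimal sketch (assuming the archive was unpacked into the hypothetical /mnt/dl/raw/extracted/ folder used above and the files are tab-separated without headers) could filter on input_file_name after the extraction step:
from pyspark.sql.functions import input_file_name

# Read every extracted .tsv file and tag each row with the file it came from.
records = (spark.read.format('csv')
           .option('sep', '\t')
           .load('/mnt/dl/raw/extracted/*.tsv')
           .withColumn('filename', input_file_name()))

# Drop the rows that came from the static metadata file b.tsv.
records = records.filter(~records.filename.endswith('b.tsv'))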

Found nothing in _spark_metadata

I am trying to read CSV files from a specific folder and write the same contents to another CSV file in a different location on the local PC, for learning purposes. I can read the file and show the contents on the console. However, when I write it to another CSV file in the specified output directory, I only get a folder named "_spark_metadata" which contains nothing inside.
I paste the whole code here step by step:
Creating the Spark session:
spark = SparkSession \
    .builder \
    .appName('csv01') \
    .master('local[*]') \
    .getOrCreate()

spark.conf.set("spark.sql.streaming.checkpointLocation", <String path to checkpoint location directory>)
userSchema = StructType().add("name", "string").add("age", "integer")
Read from the CSV file:
df = spark \
    .readStream \
    .schema(userSchema) \
    .option("sep", ",") \
    .csv(<String path to local input directory containing CSV file>)
Write to the CSV file:
df.writeStream \
    .format("csv") \
    .option("path", <String path to local output directory containing CSV file>) \
    .start()
In "String path to local output directory containing CSV file" I only get a folder _spark_metadata which contains no CSV file.
Any help on this is highly appreciated
You don't use readStream to read static data; you use it to read from a directory into which files are continuously added.
Here you only need spark.read.csv.
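For completeness, a minimal batch sketch (reusing the placeholder paths and userSchema from the question):
df = spark.read \
    .schema(userSchema) \
    .option("sep", ",") \
    .csv(<String path to local input directory containing CSV file>)

df.write \
    .mode("overwrite") \
    .csv(<String path to local output directory containing CSV file>)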

Spark 2.0: How to list or remove dirs and files in s3

Is there any way to list files and dirs, remove files and dirs, check whether a dir exists, etc., directly from the Spark 2.0 shell?
I am able to use the Python os library, but it only 'sees' local dirs, not S3.
I have also found this, but I cannot make it work:
http://bigdatatech.taleia.software/2015/12/21/check-if-exists-a-amazon-s3-path-from-apache-spark/
Thanks
You can use s3cmd (http://s3tools.org/s3cmd-howto); in order to use it inside Python, you'll need to call it via os.system or the subprocess module.
List your buckets with s3cmd ls:
~$ s3cmd ls
2007-01-19 01:41 s3://logix.cz-test
Upload a file into the bucket
~$ s3cmd put addressbook.xml s3://logix.cz-test/addrbook.xml
File 'addressbook.xml' stored as s3://logix.cz-test/addrbook.xml (123456 bytes)
Another option is the tinys3 lib: https://www.smore.com/labs/tinys3/
Another option is simples3: http://sendapatch.se/projects/simples3/
s = S3Bucket(bucket, access_key=access_key, secret_key=secret_key)
print s
<S3Bucket ... at 'https://s3.amazonaws.com/...'>
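For example, a minimal sketch of calling s3cmd from Python via the subprocess module (assuming s3cmd is installed and configured on the machine running the shell; the bucket and keys below are placeholders):
import subprocess

# List objects under a prefix.
result = subprocess.run(
    ["s3cmd", "ls", "s3://my-bucket/my-prefix/"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

# Remove a single object.
subprocess.run(["s3cmd", "del", "s3://my-bucket/my-prefix/old-file.csv"], check=True)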
