Databricks/Spark read custom metadata from Parquet file - azure

I created a Parquet file with custom metadata at file level:
Now I'm trying to read that metadata back from the Parquet file in (Azure) Databricks, but when I run the following code I don't get any of the metadata that is present there.
storageaccount = 'zzzzzz'
containername = 'yyyyy'
access_key = 'xxxx'
spark.conf.set(f'fs.azure.account.key.{storageaccount}.blob.core.windows.net', access_key)
path = f"wasbs://{containername}#{storageaccount}.blob.core.windows.net/generated_example_10m.parquet"
data = spark.read.format('parquet').load(path)
data.printSchema()

I tried to reproduce the same thing in my environment and got the expected output.
Please follow the code below and use select("*", "_metadata"):
path = "wasbs://<container>#<storage_account_name>.blob.core.windows.net/<file_path>.parquet"
data = spark.read.format('parquet').load(path).select("*", "_metadata")
display(data)
or
specify your schema and load the path with .select("*", "_metadata"):
df = spark.read \
.format("parquet") \
.schema(schema) \
.load(path) \
.select("*", "_metadata")
display(df)
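For completeness: the _metadata column surfaces file attributes (path, name, size, modification time) rather than the custom key/value pairs stored in the Parquet footer. If it is the footer metadata you are after, a minimal sketch using pyarrow could look like this (assumption: the file is reachable through a driver-local path such as a /dbfs mount; the path below is hypothetical):
import pyarrow.parquet as pq

# Hypothetical DBFS mount path to the same file; adjust to your environment.
local_path = "/dbfs/mnt/yyyyy/generated_example_10m.parquet"

# read_metadata only reads the footer; .metadata holds the key/value pairs as bytes.
footer = pq.read_metadata(local_path)
custom_metadata = footer.metadata or {}
print({k.decode(): v.decode(errors="replace") for k, v in custom_metadata.items()})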

Related

Spark Streaming Read Multiple Files with Dynamic Schema

I have a Spark Streaming application that reads multiple paths in a bucket.
Every path has a csv with a specific schema. How can I set the schema according to the path Spark is reading?
Example:
# bucket structure: bucket_name/table_1/year/month/day/file.csv
schemas = {"table_1":"id INT, name STRING, status STRING", "table_2":"col1 STRING, col2 STRING"}
df_changes = spark.readStream\
.format("csv")\
.option("delimiter","|")\
.option("Header",True)\
.option("multiLine",True)\
.option('ignoreLeadingWhiteSpace',True)\
.option('ignoreTrailingWhiteSpace',True)\
.option("escape", "\"")\
.load(f"s3a://bucket_name/*/*/*/*/*/*.csv")\
.withColumn("file_path", input_file_name())\
.withColumn("raw_timestamp", current_timestamp())
def append_data(df, batchId):
    file_path = df.first()['file_path']
    lista = file_path.split('/')[:-4]
    full_path = '/'.join(lista).replace('landing', 'raw') + '/cdc'
    windowSpec = Window.partitionBy().orderBy(lit(None))
    df.withColumn("row_number", row_number().over(windowSpec)).write.format("delta").mode('append').save(full_path)
df_changes.writeStream \
.foreachBatch(append_data) \
.option("checkpointLocation", "/checkpoint/")\
.start()
Is it possible to set the schema dynamically during the Spark load?
Something like: .load(f"s3a://bucket_name/*/*/*/*/*/*.csv", schemas=schemas[path-from-spark-splited])
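Not a definitive answer, but a common workaround is to start one stream per table prefix and give each its own schema. A minimal sketch, under the assumption that each table lives in its own top-level folder (as in the bucket structure above) and that append_data is reused as defined:
from pyspark.sql.functions import input_file_name, current_timestamp

# One stream per table, each with the DDL schema string from the `schemas` dict above.
for table, ddl in schemas.items():
    (spark.readStream
        .format("csv")
        .option("delimiter", "|")
        .option("header", True)
        .option("multiLine", True)
        .schema(ddl)  # a DDL string is accepted as a schema
        .load(f"s3a://bucket_name/{table}/*/*/*/*.csv")
        .withColumn("file_path", input_file_name())
        .withColumn("raw_timestamp", current_timestamp())
        .writeStream
        .foreachBatch(append_data)
        .option("checkpointLocation", f"/checkpoint/{table}/")
        .start())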

How to query from Cloud SQL with PySpark?

I'm setting up a dataproc job to query some tables from BigQuery, but while I am able to retrieve data from BigQuery, using the same syntax does not work for retrieving data from an External Connection within my BigQuery project.
More specifically, I'm using the query below to retrieve event data from the analytics of my project:
PROJECT = ... # my project name
NUMBER = ... # my project's analytics number
DATE = ... # day of the events in the format YYYYMMDD
analytics_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('table', f'{PROJECT}.analytics_{NUMBER}.events_{DATE}') \
.load()
While the query above works perfectly, I am unable to query an external connection of my project. I'd like to be able to do something like:
DB_NAME = ... # my database name, considering that my Connection ID is
# projects/<PROJECT_NAME>/locations/us-central1/connections/<DB_NAME>
my_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('table', f'{PROJECT}.{DB_NAME}.my_table') \
.load()
Or even like this:
query = 'SELECT * FROM my_table'
my_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('query', query) \
.load()
How can I retrieve this data?
Thanks in advance :)
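Not a definitive answer, but a sketch of the direction usually taken: the spark-bigquery connector's 'query' option needs viewsEnabled and a materializationDataset to be set, and a Cloud SQL connection is typically reached through BigQuery's EXTERNAL_QUERY. The dataset name below is a placeholder, and the options should be verified against the connector version you use:
# Sketch only: assumes your connector version supports the 'query' option.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<some_existing_dataset>")  # placeholder

query = f"""
SELECT *
FROM EXTERNAL_QUERY(
    'projects/{PROJECT}/locations/us-central1/connections/{DB_NAME}',
    'SELECT * FROM my_table'
)
"""

my_table = spark.read \
    .format('com.google.cloud.spark.bigquery') \
    .option('query', query) \
    .load()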

How to include partitioned column in pyspark dataframe read method

I am writing Avro files from a Parquet file. I have read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST") \
.load("/Users/rashmik/flight-time.parquet")
Writing data
I have written the file in Avro format as below:
dfParquetRePartitioned.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from datasink/avro directory. I am using the below code to read from datasink/avro
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load("datasink/avro/OP_CARRIER=AA")
It reads data successfully, but as expected the OP_CARRIER column is not available in the dfAvro dataframe, since it is a partition column of the first job. Now my requirement is to include the OP_CARRIER field in the second dataframe as well, i.e. in dfAvro. Could somebody help me with this?
I am referring to the Spark documentation, but I am not able to locate the relevant information. Any pointer will be very helpful.
You can replicate the same column value under a different alias:
dfParquetRePartitioned.withColumn("OP_CARRIER_1", dfParquetRePartitioned["OP_CARRIER"]) \
.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
This would give you what you wanted, but under a different alias.
Alternatively, you can do it while reading. If the location is dynamic, you can easily append the column:
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load(path).withColumn(newcol[0], lit(newcol[1]))
If the value is static, it is even easier to add it during the read.
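Another option, if you would rather not touch the data at all, is Spark's partition discovery basePath option: point basePath at the dataset root and load the specific partition directory, and the partition column is kept. A sketch (worth verifying against your Spark/Avro versions):
dfAvro = spark.read.format("avro") \
    .option("mode", "FAILFAST") \
    .option("basePath", "datasink/avro") \
    .load("datasink/avro/OP_CARRIER=AA")
# OP_CARRIER stays in the dataframe because partition discovery is anchored at basePath.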

pyspark parse filename on load

I'm quite new to spark and there is one thing that I don't understand: how to manipulate column content.
I have a set of csv files as follows:
Each dsX is a table, and I would like to load the data at once for each table.
So far no problems:
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*")
But one piece of information is missing: the client_id, and this client id is the first part of the csv name: clientId_table_category.csv
So I tried to do this:
def extract_path(patht):
print(patht)
return patht
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*") \
.withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))
But the print returns:
Column<b'input_file_name()'>
And I can't do much with this.
I'm quite stuck here, how do you manipulate data in this configuration?
Another solution for me is to load each csv one by one and parse the clientId from the file name manually, but I was wondering if there wouldn't be a more powerful solution with spark.
You are going a little too far:
df = spark.read.csv(
table+"/*",
header=True,
sep='\\'
).withColumn("clientId", fn.input_file_name())
this will create a column with the full path. Then you just need some extra string manipulation, which is easy with a UDF. You can also do it with built-in functions, but it is trickier (see the sketch after this answer).
from pyspark.sql import functions as fn
from pyspark.sql.types import StringType

@fn.udf(StringType())
def get_id(in_string):
    return in_string.split("/")[-1].split("_")[0]

df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
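For reference, the "trickier" built-in route mentioned above can be done without a UDF, for example with substring_index; a sketch assuming the file name always starts with the clientId followed by an underscore:
df = df.withColumn(
    "clientId",
    fn.substring_index(fn.substring_index(fn.input_file_name(), "/", -1), "_", 1)
)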

How to read xlsx or xls files as spark dataframe

Can anyone let me know how we can read xlsx or xls files as a Spark dataframe without converting them first?
I have already tried reading with pandas and then converting to a Spark dataframe, but I got the error below.
Error:
Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
Code:
import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
I'll try to give a general version, updated as of April 2021, based on the answers of @matkurek and @Peter Pan.
SPARK
You should install on your databricks cluster the following 2 libraries:
Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Then, you will be able to read your excel as follows:
sparkDF = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
PANDAS
You should install on your databricks cluster the following 2 libraries:
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: openpyxl
Then, you will be able to read your excel as follows:
import pandas as pd
pandasDF = pd.read_excel(io = filePath, engine='openpyxl', sheet_name = 'NameOfYourExcelSheet')
Note that you will have two different objects, in the first scenario a Spark Dataframe, in the second a Pandas Dataframe.
As mentioned by @matkurek, you can read it from Excel directly. Indeed, this is better practice than involving pandas, since otherwise the benefit of Spark no longer exists.
You can run the same code sample as defined above, but you need to add the required class to the configuration of your SparkSession.
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
.getOrCreate()
Then, you can read your excel file.
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load("your_file"))
There is no data from your Excel file shown in your post, but I reproduced the same issue as yours.
Here is the data of my sample Excel file test.xlsx:
You can see there are different data types in my column B: a double value 2.2 and a string value C.
So if I run the code below,
import pandas
df = pandas.read_excel('test.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
it will return the same error as yours:
TypeError: field B: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
If we inspect the dtypes of the df columns via df.dtypes, we can see that the dtype of column B is object; the spark.createDataFrame function cannot infer the real data type for column B from the data. So the fix is to pass a schema to help the data type inference for column B, as in the code below.
from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([StructField("A", DoubleType(), True), StructField("B", StringType(), True)])
sdf = spark.createDataFrame(df, schema=schema)
This forces column B to StringType and resolves the data type conflict.
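Alternatively, instead of passing a schema, you could make the mixed column homogeneous on the pandas side before handing it to Spark; a sketch, assuming you are happy with column B ending up as a string:
import pandas

df = pandas.read_excel('test.xlsx', sheet_name='Sheet1')
df['B'] = df['B'].astype(str)  # cast the mixed-type column to string
sdf = spark.createDataFrame(df)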
You can read the Excel file through Spark's read function. That requires a Spark plugin; to install it on Databricks, go to:
clusters > your cluster > libraries > install new > select Maven and in 'Coordinates' paste com.crealytics:spark-excel_2.12:0.13.5
After that, this is how you can read the file:
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
Just open the xlsx or xlsm file in pandas, and after that convert it to Spark:
import pandas as pd
df = pd.read_excel('file.xlsx', engine='openpyxl')
df = spark_session.createDataFrame(df.astype(str))
The configuration and code below work for me to read an Excel file into a PySpark dataframe. Prerequisites before executing the Python code:
Install Maven library on your databricks cluster.
Maven library name & version: com.crealytics:spark-excel_2.12:0.13.5
Databricks Runtime: 9.0 (includes Apache Spark 3.1.2, Scala 2.12)
Execute the code below in your Python notebook to load the Excel file into a PySpark dataframe:
sheetAddress = "'<enter sheetname>'!A1"
filePath = "<enter excel file full path>"
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("dataAddress", sheetAddress) \
.option("treatEmptyValuesAsNulls", "false") \
.option("inferSchema", "true") \
.load(filePath)
Steps to read .xls / .xlsx files from Azure Blob storage into a Spark DF
You can read the Excel files located in Azure Blob storage into a PySpark dataframe with the help of a library called spark-excel (also referred to as com.crealytics.spark.excel).
Install the library either using the UI or the Databricks CLI (Cluster settings page > Libraries > Install new option; make sure to choose Maven).
Once the library is installed, you need proper credentials to access Azure Blob storage. You can provide the access key in Cluster settings page > Advanced options > Spark config.
Example:
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net <access key>
Note: If you're the cluster owner, you can provide the access key as a secret instead of plain text, as mentioned in the docs.
Restart the cluster. You can use the code below to read the Excel files located in blob storage:
filePath = "wasbs://<container-name>#<storage-account>.blob.core.windows.net/MyFile1.xls"
DF = spark.read.format("excel").option("header", "true").option("inferSchema", "true").load(filePath)
display(DF)
PS: spark.read.format("excel") is the V2 approach, while spark.read.format("com.crealytics.spark.excel") is V1; you can read more here.
