Fetch dbfs files as a stream dataframe in Databricks

I have a problem where I need to create an external table in Databricks for each CSV file that lands in an ADLS Gen2 storage account.
I thought about a solution where I would get a streaming dataframe from the dbutils.fs.ls() output and then call a function that creates a table inside foreachBatch().
I have the function ready, but I can't figure out a way to stream directory information into a streaming dataframe. Does anyone have an idea of how this could be achieved?

Kindly check the code block below.
package com.sparkbyexamples.spark.streaming

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SparkStreamingFromDirectory {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Schema of the incoming files
    val schema = StructType(
      List(
        StructField("Zipcode", IntegerType, true)
      )
    )

    // Stream every file that lands in the directory
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")

    df.printSchema()

    val groupDF = df.select("Zipcode")
      .groupBy("Zipcode").count()
    groupDF.printSchema()

    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
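For reference, a PySpark sketch of the same idea applied to the original question (the landing path, checkpoint location, and create_external_table helper below are placeholders, not part of the original post): Structured Streaming's file source already does the directory listing, so instead of streaming the dbutils.fs.ls() output you can stream the CSV directory itself and create one table per newly arrived file inside foreachBatch():
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema and landing path - adjust to the real CSV layout and ADLS location
schema = StructType([StructField("Zipcode", IntegerType(), True)])
landing_path = "abfss://container@account.dfs.core.windows.net/landing/"

def create_external_table(file_path):
    # Placeholder for the table-creation function the question says is already written
    pass

def process_batch(batch_df, batch_id):
    # One table per newly arrived CSV file in this micro-batch
    for row in batch_df.select("source_file").distinct().collect():
        create_external_table(row["source_file"])

stream_df = (spark.readStream
             .schema(schema)
             .option("header", "true")
             .csv(landing_path)
             .withColumn("source_file", input_file_name()))

(stream_df.writeStream
 .foreachBatch(process_batch)
 .option("checkpointLocation", "/tmp/checkpoints/csv_tables")
 .start())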

Related

How to call avro SchemaConverters in Pyspark

Although PySpark has Avro support, it does not have the SchemaConverters method. I may be able to use Py4J to accomplish this, but I have never used a Java package within Python.
This is the code I am using:
# Import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def _test():
    # Create SparkSession
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("sparvro") \
        .getOrCreate()
    sc = spark.sparkContext
    avroSchema = sc._jvm.org.apache.spark.sql.avro.SchemaConverters.toAvroType(
        StructType([StructField("firstname", StringType(), True)]))

if __name__ == "__main__":
    _test()
However, I keep getting this error:
AttributeError: 'StructField' object has no attribute '_get_object_id'
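The Python StructType cannot be handed to the JVM directly, which is what the _get_object_id error points at. One possible workaround, shown as a sketch rather than an official API path (it rebuilds the schema as a JVM DataType from its JSON form and assumes the spark-avro package is on the classpath; all arguments to toAvroType are passed explicitly because Py4J cannot use Scala default arguments):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").appName("sparvro").getOrCreate()
sc = spark.sparkContext

py_schema = StructType([StructField("firstname", StringType(), True)])

# Rebuild the schema on the JVM side from its JSON representation
jvm_schema = sc._jvm.org.apache.spark.sql.types.DataType.fromJson(py_schema.json())

# Then call SchemaConverters on the JVM object, spelling out all arguments
avro_type = sc._jvm.org.apache.spark.sql.avro.SchemaConverters.toAvroType(
    jvm_schema, False, "topLevelRecord", "")
print(avro_type.toString())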

Creating pyspark's spark context py4j java gateway object

I am trying to convert a Java DataFrame to a PySpark DataFrame. To do this, I create a DataFrame (or Dataset of Row) in a Java process and start a py4j.GatewayServer on the Java side. Then, on the Python side, I create a py4j.java_gateway.JavaGateway() client object and pass it to PySpark's SparkContext constructor to link it to the JVM process that is already running. But I am getting this error:
File: "path_to_virtual_environment/lib/site-packages/pyspark/conf.py", line 120, in __init__
self._jconf = _jvm.SparkConf(loadDefaults)
TypeError: 'JavaPackage' object is not callable
Can someone please help?
Below is the code I am using:
Java code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import py4j.GatewayServer;

public class TestJavaToPythonTransfer {
    Dataset<Row> df1;

    public TestJavaToPythonTransfer() {
        SparkSession spark = SparkSession.builder()
            .appName("test1")
            .config("spark.master", "local")
            .getOrCreate();
        df1 = spark.read().json("path/to/local/json_file");
    }

    public Dataset<Row> getDf() {
        return df1;
    }

    public static void main(String[] args) {
        GatewayServer gatewayServer = new GatewayServer(new TestJavaToPythonTransfer());
        gatewayServer.start();
        System.out.println("Gateway server started");
    }
}
Python code:
from pyspark.sql import SQLContext, DataFrame
from pyspark import SparkContext, SparkConf
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()
conf = SparkConf().set('spark.io.encryption.enabled', 'true')
py_sc = SparkContext(gateway=gateway, conf=conf)
j_df = gateway.getDf()
py_df = DataFrame(j_df, SQLContext(py_sc))
print('print dataframe content')
print(py_df.collect())
Command to run the Python code:
python path_to_python_file.py
I also tried doing this:
$SPARK_HOME/bin/spark-submit --master local path_to_python_file.py
Here the code does not throw any error, but it also does not print anything to the terminal. Do I need to set some Spark conf for this?
P.S. - apologies in advance for any typos in the code or the error stack, since I could not copy them directly from my firm's IDE.
There is a missing call to entry_point before calling getDf()
So, try this:
app = gateway.entry_point
j_df = app.getDf()
Additionally, I have created a working copy using Python and Scala (hope you don't mind) below. It shows how, on the Scala side, the py4j gateway is started with a Spark session and a sample DataFrame, and how, on the Python side, I accessed that DataFrame and converted it to a Python List[Tuple] before converting it back into a DataFrame for a Spark session on the Python side:
Python:
from py4j.java_gateway import JavaGateway
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StructField

if __name__ == '__main__':
    gateway = JavaGateway()
    spark_app = gateway.entry_point
    df = spark_app.df()

    # Note: the "apply" method here comes from Scala's companion object to access elements of an array
    df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]

    spark = (SparkSession
             .builder
             .appName("My PySpark App")
             .getOrCreate())

    schema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", IntegerType(), True)])

    df = spark.createDataFrame(df_to_list_tuple, schema)
    df.show()
Scala:
import java.nio.file.{Path, Paths}
import org.apache.spark.sql.SparkSession
import py4j.GatewayServer

object SparkApp {
  val myFile: Path = Paths.get(System.getProperty("user.home") + "/dev/sample_data/games.csv")

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("My app")
    .getOrCreate()

  val df = spark
    .read
    .option("header", "True")
    .csv(myFile.toString)
    .collect()
}

object Py4JServerApp extends App {
  val server = new GatewayServer(SparkApp)
  server.start()

  print("Started and running...")
}
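A note on running this pair (assuming the defaults shown above): start the Scala Py4JServerApp first, with Spark and py4j on its classpath, so the gateway is listening on py4j's default port before JavaGateway() connects; the Python script can then be run with plain python as long as pyspark is installed, since the SparkSession it builds is a separate local one and does not need spark-submit.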

How to get Avro data from Confluent Schema Registry in String format from kafka in pyspark?

I am reading data from Kafka in Spark (Structured Streaming), but the data arriving in Spark from Kafka is not in string format.
Spark: 2.3.4
Kafka Data format:
{"Patient_ID":316,"Name":"Richa","MobileNo":{"long":7049123177},"BDate":{"int":740},"Gender":"female"}
Here is the code for kafka to spark structured streaming:
# spark-submit --jars kafka-clients-0.10.0.1.jar --packages org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.4,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 /home/kinjalpatel/kafka_sppark.py
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import json
from pyspark.sql.functions import from_json, col, struct
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from pyspark.sql.column import Column, _to_java_column
sc = SparkContext()
sc.setLogLevel("ERROR")
spark = SparkSession(sc)
schema_registry_client = CachedSchemaRegistryClient(
url='http://localhost:8081')
serializer = MessageSerializer(schema_registry_client)
df = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "mysql-01-Patient") \
.option("partition.assignment.strategy", "range") \
.option("valueConverter", "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter") \
.load()
df.printSchema()
mta_stream=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(timestamp AS STRING)", "CAST(timestampType AS STRING)")
mta_stream.printSchema()
qry = mta_stream.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()
This is the output I get:
+----+--------------------+----------------+---------+------+--------------------+-------------+
| key| value| topic|partition|offset| timestamp|timestampType|
+----+--------------------+----------------+---------+------+--------------------+-------------+
|null|�
Richa���...|mysql-01-Patient| 0| 160|2019-12-27 11:56:...| 0|
+----+--------------------+----------------+---------+------+--------------------+-------------+
How to get value column in string format?
From the Spark documentation:
import org.apache.spark.sql.avro._
// `from_avro` requires Avro schema in JSON string format.
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc" )))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
val output = df
.select(from_avro('value, jsonFormatSchema) as 'user)
.where("user.favorite_color == \"red\"")
.select(to_avro($"user.name") as 'value)
val query = output
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic2")
.start()
From the Databricks documentation:
import org.apache.spark.sql.avro._
import org.apache.avro.SchemaBuilder
// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
// The schema of the resulting DataFrame is: <key: string, value: int>
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t")
.load()
.select(
from_avro($"key", SchemaBuilder.builder().stringType()).as("key"),
from_avro($"value", SchemaBuilder.builder().intType()).as("value"))
For reading Avro messages from a Kafka topic and parsing them in PySpark Structured Streaming, there are no direct libraries. But we can read and parse the Avro messages by writing a small wrapper and calling that function as a UDF in your PySpark streaming code, as sketched after the link below.
Please refer to:
Reading avro messages from Kafka in spark streaming/structured streaming
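A minimal sketch of such a wrapper, reusing the confluent_kafka MessageSerializer the question already imports (the registry URL is the question's localhost:8081; the serializer is created lazily inside the function so the UDF closure stays picklable, and decoding row by row in a Python UDF is meant to be illustrative rather than fast):
import json
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

def make_avro_decoder(registry_url):
    state = {}

    def decode(raw_bytes):
        if raw_bytes is None:
            return None
        if "serializer" not in state:
            # Created on first use on each executor
            state["serializer"] = MessageSerializer(CachedSchemaRegistryClient(url=registry_url))
        # decode_message handles the Confluent wire-format header and returns a Python dict
        record = state["serializer"].decode_message(bytes(raw_bytes))
        return json.dumps(record)

    return udf(decode, StringType())

decode_value = make_avro_decoder("http://localhost:8081")

# df is the streaming DataFrame from spark.readStream.format("kafka") above
json_stream = df.select(decode_value(col("value")).alias("value_json"))
query = json_stream.writeStream.outputMode("append").format("console").start()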

Convert csv files to parquet on s3 using Spark structured streaming

I'm trying to create a Spark application that will read my csv files from s3, convert them to parquet files and write the results back to s3.
I have 8 new csv files every minute, compressed with gzip (~60MB each gzip file); each row has ~200 columns and ~99% of the rows are on the same date (my partition column).
The cluster has 3 workers with 10 cores and 20 GB of memory each.
Here is my code:
val spark = SparkSession
.builder()
.appName("Csv2Parquet")
.config("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("fs.s3a.access.key", "MY ACESS KEY")
.config("fs.s3a.secret.key", "MY SECRET")
.config("spark.executor.memory", "15G")
.config("spark.driver.memory", "5G")
.getOrCreate()
import spark.implicits._
val schema= StructType(Array(
StructField("myDate", DateType, nullable=false),
StructField("myTimestamp", TimestampType, nullable=true),
...
...
...
StructField("myColumn200", StringType, nullable=true)
))
val df = spark.readStream
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "false")
.option("mode", "DROPMALFORMED")
.option("delimiter","\t")
.load("s3a://my-bucket/raw-data/*.gz")
.withColumn("myPartitionDate", $"myDate")
val query = df.repartition($"myPartitionDate").writeStream
  .option("checkpointLocation", "/shared/checkpoints/csv2parquet")
  .trigger(Trigger.ProcessingTime(60000))
  .format("parquet")
  .option("path", "s3a://my-bucket/parquet-data")
  .partitionBy("myPartitionDate")
  .start()
query.awaitTermination()
The problem is that only one task is responsible for writing the "main" partition (the one that includes 99% of the events) to S3, and it takes ~4 minutes to handle this task. How can I improve it?
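One commonly suggested tweak, shown here as a PySpark sketch for consistency with the other examples (df stands for the equivalent streaming DataFrame, and the salt bucket count is a placeholder to tune): repartitioning on myPartitionDate alone hashes on the date, so the dominant date collapses into a single task. Adding a salt column to the repartition spreads that date over several tasks, each writing its own file into the same myPartitionDate= directory:
from pyspark.sql import functions as F

write_parallelism = 30  # illustrative; tune to the cluster's total cores

salted = df.withColumn("salt", (F.rand() * write_parallelism).cast("int"))

query = (salted
         .repartition("myPartitionDate", "salt")  # spread the dominant date over many tasks
         .drop("salt")
         .writeStream
         .option("checkpointLocation", "/shared/checkpoints/csv2parquet")
         .trigger(processingTime="60 seconds")
         .format("parquet")
         .option("path", "s3a://my-bucket/parquet-data")
         .partitionBy("myPartitionDate")
         .start())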

Spark 2.0 - Databricks xml reader Input path does not exist

I am trying to use the Databricks XML file reader API.
Sample code:
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Java Spark SQL basic example")
.config("spark.sql.warehouse.dir", "file:///C:/TestData")
.getOrCreate();
//val sqlContext = new SQLContext(sc)
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load("books.xml")
df.show()
If I give the file path directly, it looks for some warehouse directory, so I set the spark.sql.warehouse.dir option, but now it throws "Input path does not exist".
It is actually looking under the project root directory. Why is it looking there?
Finally it's working. We need to specify the warehouse directory as well as pass the absolute file path in the load method. I am not sure what the warehouse directory is used for.
The main point is that we don't need to give C:, as mentioned in the other Stack Overflow answer.
working code:
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Java Spark SQL basic example")
.config("spark.sql.warehouse.dir", "file:///TestData/")
.getOrCreate();
//val sqlContext = new SQLContext(sc)
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load("file:///TestData/books.xml")
df.show()
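One more thing worth noting, as an assumption about the setup since the post does not show how the job is launched: the com.databricks.spark.xml format is not bundled with Spark, so the spark-xml package has to be on the classpath, for example via spark-shell --packages com.databricks:spark-xml_2.11:0.4.1 (version given only as an example); otherwise the read fails with "Failed to find data source" before any path resolution happens.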
