Removing Blank fields from Spark Dataframe - apache-spark

I use Spark Structured Streaming to consume a Kafka topic that carries several types of messages (each type has a different schema). I defined a single schema that contains all fields across the different message types.
How can I filter out the empty fields from the DataFrame for each row, or alternatively read the DataFrame from Kafka with a dynamic schema?
val inputDS = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "overview")
  .load()

val schemaa: StructType = StructType(
  Array(
    StructField("title", StringType, true),
    StructField("url", StringType, true),
    StructField("content", StringType, true),
    StructField("collect_time", StringType, true),
    StructField("time", StringType, true),
    StructField("user_head", StringType, true),
    StructField("image", StringType, true)
  )
)

inputDS.withColumn("value", from_json($"value".cast(StringType), schemaa))
  //.filter() // todo filter empty field
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
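If you keep the single superset schema, one way to handle the per-type empty fields after parsing is either to filter on the columns that matter, or to serialize each parsed row back to JSON with to_json, which omits null fields by default (governed by spark.sql.jsonGenerator.ignoreNullFields in Spark 3.x), so every output record only carries the fields that were actually present for its message type. A rough sketch reusing inputDS and schemaa from above; the blank-string check is an assumption about what "empty" means in your data:

import org.apache.spark.sql.functions.{col, from_json, to_json, trim, length}
import org.apache.spark.sql.types.StringType

val parsed = inputDS
  .select(from_json(col("value").cast(StringType), schemaa).as("msg"))

// (a) drop rows whose "title" field is null or blank
val withTitle = parsed
  .filter(col("msg.title").isNotNull && length(trim(col("msg.title"))) > 0)

// (b) re-serialize each row; to_json drops null fields, so each record
//     keeps only the fields populated for its message type
val compact = parsed.select(to_json(col("msg")).as("value"))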

Related

Spark Streaming: write source records and calculate aggregates in single pass

I am reading from a Kinesis stream:
import pyspark.sql.types as T
import pyspark.sql.functions as F

schema = T.StructType(
    [
        T.StructField("data", T.StringType(), True),
        T.StructField("metadata", T.StructType([
            T.StructField("timestamp", T.TimestampType(), True),
            T.StructField("record-type", T.StringType(), True),
            T.StructField("operation", T.StringType(), True),
            T.StructField("partition-key-type", T.StringType(), True),
            T.StructField("partition-key-value", T.StringType(), True),
            T.StructField("schema-name", T.StringType(), True),
            T.StructField("table-name", T.StringType(), True)
        ]), True)
    ]
)

kinesis = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "data-platform-aws-dms-poc-dms-sample")
    .option("region", "us-east-1")
    .option("initialPosition", "TRIM_HORIZON")
    .load()
    .selectExpr("partitionKey", "CAST(data AS STRING) AS data", "stream", "shardId", "sequenceNumber", "approximateArrivalTimestamp")
    .withColumn("data", F.from_json("data", schema))
    .selectExpr("partitionKey", "data.data", "data.metadata.*", "stream", "shardId", "sequenceNumber", "approximateArrivalTimestamp")
)
And writing the output to Delta Lake:
checkpoint_path = "s3://route-databricks-sandbox-data/checkpoint/sandbox.route.dms_sample"

query = (
    kinesis
    .writeStream
    .format("delta")
    .queryName("dms_sample_output")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("schema-name", "table-name")
    .toTable("sandbox.public.dms_sample")
)
I want to perform an aggregation over the data as it flows through my pipeline without having to re-read the data from the stream. The aggregation will be a calculation over each Kinesis shardId so it will be small and fit easily in memory.
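One pattern that avoids re-reading the stream is foreachBatch: each micro-batch can be written to the Delta table and aggregated in the same pass. A rough sketch of the idea, written in Scala here (PySpark exposes the same foreachBatch API), assuming the kinesis DataFrame and checkpoint_path from above; the per-shard count and the sandbox.public.dms_sample_shard_counts table are placeholders for whatever calculation and sink you actually need:

import org.apache.spark.sql.DataFrame

val query = kinesis.writeStream
  .queryName("dms_sample_output")
  .option("checkpointLocation", checkpoint_path)
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.persist()                       // reuse the micro-batch for both writes
    // 1) the raw records, partitioned as before
    batch.write.format("delta")
      .mode("append")
      .partitionBy("schema-name", "table-name")
      .saveAsTable("sandbox.public.dms_sample")
    // 2) a small per-shard aggregate computed from the same micro-batch
    batch.groupBy("shardId").count()
      .write.format("delta")
      .mode("append")
      .saveAsTable("sandbox.public.dms_sample_shard_counts")
    batch.unpersist()
    ()
  }
  .start()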

Unable to read data from kafka topic

I'm a beginner with Kafka. I'm trying to write a Spark application to read data from a Kafka topic I created; the topic (topic1) is up and running.
Is there any problem with the code below?
val kafka_bootstrap_servers = "localhost:9092"

val users_df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers)
  .option("subscribe", kafka_topic_name)
  .load()

val users_df_1 = users_df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")

val user_schema = StructType(
  List(
    StructField("RecordNumber", IntegerType, true),
    StructField("Zipcode", StringType, true),
    StructField("ZipCodeType", StringType, true),
    StructField("City", StringType, true),
    StructField("State", StringType, true),
    StructField("LocationType", StringType, true),
    StructField("Lat", StringType, true),
    StructField("Long", StringType, true),
    StructField("Xaxis", StringType, true),
    StructField("Yaxis", StringType, true),
    StructField("Zaxis", StringType, true),
    StructField("WorldRegion", StringType, true),
    StructField("Country", StringType, true),
    StructField("LocationText", StringType, true),
    StructField("Location", StringType, true),
    StructField("Decommisioned", StringType, true)
  )
)

// parse the JSON string in the Kafka value column, then flatten the struct
val users_df_2 = users_df_1.select(from_json(col("value"), user_schema).as("user_detail"))
val users_df_3 = users_df_2.select("user_detail.*")

users_df_3.printSchema()
users_df_3.show(numRows = 10, truncate = false)

spark.stop()
println("Apache spark application completed.")
}
}
Sample JSON data below:
{"RecordNumber":76511,"Zipcode":27007,"ZipCodeType":"STANDARD","City":"ASH HILL","State":"NC","LocationType":"NOT ACCEPTABLE","Lat":36.4,"Long":-80.56,"Xaxis":0.13,"Yaxis":-0.79,"Zaxis":0.59,"WorldRegion":"NA","Country":"US","LocationText":"Ash Hill, NC","Location":"NA-US-NC-ASH HILL","Decommisioned":false,"TaxReturnsFiled":842,"EstimatedPopulation":1666,"TotalWages":28876493}
Error message below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at streamingApp$.main(streamingApp.scala:25)
at streamingApp.main(streamingApp.scala)
I need help reading data from the Kafka topic.
Please follow the Structured Streaming + Kafka integration guide:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
You might be missing the artifact "spark-sql-kafka-0-10_2.12".
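If you build with sbt, that would be roughly the following dependency (the 3.0.1 version is only an example; match it to your Spark version, and the _2.12 suffix to your Scala version):

libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"

Alternatively, supply it at launch time, e.g. spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...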

Issue writing records into MySQL from a Spark Structured Streaming DataFrame

I am using the code below to write a Spark Structured Streaming DataFrame into a MySQL DB. Below are the Kafka topic JSON data format and the MySQL table schema; the column names and types match exactly.
But I am unable to see any records written to the MySQL table. The table is empty, with zero records. Please suggest.
Kafka topic data format:
ssingh#RENLTP2N073:/mnt/d/confluent-6.0.0/bin$ ./kafka-console-consumer --topic sarvtopic --from-beginning --bootstrap-server localhost:9092
{"id":1,"firstname":"James ","middlename":"","lastname":"Smith","dob_year":2018,"dob_month":1,"gender":"M","salary":3000}
{"id":2,"firstname":"Michael ","middlename":"Rose","lastname":"","dob_year":2010,"dob_month":3,"gender":"M","salary":4000}
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("SSKafka") \
    .getOrCreate()

dsraw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sarvtopic") \
    .option("startingOffsets", "earliest") \
    .load()

ds = dsraw.selectExpr("CAST(value AS STRING)")
dsraw.printSchema()

from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import *

custom_schema = StructType([
    StructField("id", LongType(), True),
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("dob_year", StringType(), True),
    StructField("dob_month", LongType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", LongType(), True),
])

Person_details_df2 = ds \
    .select(from_json(col("value"), custom_schema).alias("Person_details"))
Person_details_df3 = Person_details_df2.select("Person_details.*")

from pyspark.sql import DataFrameWriter

def foreach_batch_function(df, epoch_id):
    Person_details_df3.write.jdbc(url='jdbc:mysql://172.16.23.27:30038/securedb', driver='com.mysql.jdbc.Driver', dbtable="sparkkafka", user='root', password='root$1234')
    pass

query = Person_details_df3.writeStream.trigger(processingTime='20 seconds').outputMode("append").foreachBatch(foreach_batch_function).start()

query
Out[14]: <pyspark.sql.streaming.StreamingQuery at 0x1fb25503b08>
MySQL table schema:
create table sparkkafka(
id int,
firstname VARCHAR(40) NOT NULL,
middlename VARCHAR(40) NOT NULL,
lastname VARCHAR(40) NOT NULL,
dob_year int(40) NOT NULL,
dob_month int(40) NOT NULL,
gender VARCHAR(40) NOT NULL,
salary int(40) NOT NULL,
PRIMARY KEY (id)
);
I presume Person_details_df3 is your streaming DataFrame and your Spark version is 2.4.0 or above.
To use the foreachBatch API, write it as below:
db_target_properties = {"user": "xxxx", "password": "yyyyy"}

def foreach_batch_function(df, epoch_id):
    df.write.jdbc(url='jdbc:mysql://172.16.23.27:30038/securedb', table="sparkkafka", properties=db_target_properties)
    pass

query = Person_details_df3.writeStream.outputMode("append").foreachBatch(foreach_batch_function).start()
query.awaitTermination()

How does the Databricks Delta Lake `mergeSchema` option handle differing data types?

What does the Databricks Delta Lake mergeSchema option do if a pre-existing column is appended with a different data type?
For example, given a Delta Lake table with schema foo INT, bar INT, what would happen when trying to write-append new data with schema foo INT, bar DOUBLE when specifying the option mergeSchema = true?
The write fails. (as of Delta Lake 0.5.0 on Databricks 6.3)
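For illustration, a minimal sketch of that scenario (the table path and sample values are hypothetical). Appending bar as DOUBLE to a table where bar is INT is rejected even with mergeSchema, since mergeSchema only covers added columns and a few safe upcasts; changing an existing column's type means rewriting the table with overwriteSchema instead:

import spark.implicits._

val path = "/tmp/delta/merge_schema_demo"      // hypothetical location

// existing table: foo INT, bar INT
Seq((1, 2)).toDF("foo", "bar")
  .write.format("delta").save(path)

// append with bar DOUBLE and mergeSchema = true: throws an AnalysisException
Seq((3, 4.5)).toDF("foo", "bar")
  .write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save(path)

// changing an existing column's type requires overwriting the schema
Seq((3, 4.5)).toDF("foo", "bar")
  .write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .save(path)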

When reading a CSV is there an option to start on row 2 or below?

I'm reading a bunch of CSV files into a dataframe using the sample code below.
val df = spark.read.format("csv")
  .option("sep","|")
  .option("inferSchema","true")
  .option("header","false")
  .load("mnt/rawdata/corp/ABC*.gz")
I'm hoping there is a way to start on row 2 or below, because row 1 contains some basic metadata about these files. That first row has only 4 pipe characters, so Spark thinks the file has 4 columns, while the actual data has over 100 columns.
I tried playing with inferSchema and header but I couldn't get anything to work.
If the first line of the CSV doesn't match the actual column count and names, you may need to define your schema by hand and then try this combination:
val df = spark.read.format("csv")
  .option("sep","|")
  .option("inferSchema","false")
  .option("header","true")
  .schema(mySchema)
  .option("enforceSchema","true")
  .load(...
Full list of CSV options.
Note that for Spark 2.3 and above you can use a shorthand, SQL/DDL-style notation for the schema definition -- a simple string such as "column1 type1, column2 type2, ...", as shown in the sketch below.
If, however, your header spans more than one line, you will probably be forced to ignore all "errors" by adding .option("mode","DROPMALFORMED").
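For example, the DDL-string shorthand looks like this (the column names and types here are placeholders for your real schema):

val df = spark.read.format("csv")
  .option("sep", "|")
  .option("header", "true")
  .schema("field1 STRING, field2 STRING, field3 INT")
  .option("enforceSchema", "true")
  .load("mnt/rawdata/corp/ABC*.gz")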
You are right! You need to define a custom schema! I ended up going with this.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.functions.input_file_name

val customSchema = StructType(Array(
  StructField("field1", StringType, true),
  StructField("field2", StringType, true),
  StructField("field3", StringType, true),
  StructField("field4", StringType, true),
  StructField("field5", StringType, true),
  StructField("field6", StringType, true),
  StructField("field7", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("sep", "|")
  .schema(customSchema)
  .load("mnt/rawdata/corp/ABC*.gz")
  .withColumn("file_name", input_file_name())

Just name 'field1', 'field2', etc., after your actual field names. The 'ABC*.gz' part does a wildcard match for files beginning with a specific prefix, with '*' matching any combination of characters up to the '.gz' extension, which indicates a gzip-compressed file. Yours may differ, of course, so adjust the pattern to your needs.
