I'm a beginner with Kafka. I'm trying to code a Spark application to read data from a Kafka topic I created. The Kafka topic topic1 is up and running.
Is there any problem with the code provided below?
val kafka_bootstrap_servers = "localhost:9092"
val users_df = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", kafka_bootstrap_servers)
.option("subscribe", kafka_topic_name)
.load()
val users_df_1 = users_df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
val user_schema = StructType(
List(
StructField("RecordNumber", IntegerType, true),
StructField("Zipcode", StringType, true),
StructField("ZipCodeType", StringType, true),
StructField("City", StringType, true),
StructField("State", StringType, true),
StructField("LocationType", StringType, true),
StructField("Lat", StringType, true),
StructField("Long", StringType, true),
StructField("Xaxis", StringType, true),
StructField("Yaxis", StringType, true),
StructField("Zaxis", StringType, true),
StructField("WorldRegion", StringType, true),
StructField("Country", StringType, true),
StructField("LocationText", StringType, true),
StructField("Location", StringType, true),
StructField("Decommisioned", StringType, true)
)
)
val users_df_2 = users_df_1.select(from_json(col("RecordNumber"), user_schema)
.as("user_detail"), col("Zipcode"))
val users_df_3 = users_df_2.select(col = "user_detail.*", "Zipcode")
users_df_3.printSchema()
users_df_3.show(numRows = 10, truncate = false)
spark.stop()
println("Apache spark application completed.")
}
}
JSON data sample below:
{"RecordNumber":76511,"Zipcode":27007,"ZipCodeType":"STANDARD","City":"ASH HILL","State":"NC","LocationType":"NOT ACCEPTABLE","Lat":36.4,"Long":-80.56,"Xaxis":0.13,"Yaxis":-0.79,"Zaxis":0.59,"WorldRegion":"NA","Country":"US","LocationText":"Ash Hill, NC","Location":"NA-US-NC-ASH HILL","Decommisioned":false,"TaxReturnsFiled":842,"EstimatedPopulation":1666,"TotalWages":28876493}
Error message below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at streamingApp$.main(streamingApp.scala:25)
at streamingApp.main(streamingApp.scala)
I need help reading data from the Kafka topic.
Please follow the guide for Spark Structured Streaming + Kafka integration:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
You might be missing the artifact "spark-sql-kafka-0-10_2.12" on your classpath.
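For example, if you build with sbt, you could declare the connector as a dependency (a minimal sketch, assuming Scala 2.12 and a Spark 3.x build; the version number is a placeholder that should match your own Spark version):
// build.sbt -- %% appends the Scala binary version (_2.12 here)
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.3.0"
Alternatively, the same artifact can be passed at launch time with spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0.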
Related
I am using the code below to write a Spark Streaming dataframe into a MySQL DB. Below are the Kafka topic JSON data format and the MySQL table schema. The column names and types are identical.
But I am unable to see any records written to the MySQL table. The table is empty, with zero records. Please suggest.
Kafka topic data format
ssingh#RENLTP2N073:/mnt/d/confluent-6.0.0/bin$ ./kafka-console-consumer --topic sarvtopic --from-beginning --bootstrap-server localhost:9092
{"id":1,"firstname":"James ","middlename":"","lastname":"Smith","dob_year":2018,"dob_month":1,"gender":"M","salary":3000}
{"id":2,"firstname":"Michael ","middlename":"Rose","lastname":"","dob_year":2010,"dob_month":3,"gender":"M","salary":4000}
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("SSKafka") \
.getOrCreate()
dsraw = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "sarvtopic") \
.option("startingOffsets", "earliest") \
.load()
ds = dsraw.selectExpr("CAST(value AS STRING)")
dsraw.printSchema()
from pyspark.sql.types import StructField, StructType, StringType,LongType
from pyspark.sql.functions import *
custom_schema = StructType([
StructField("id", LongType(), True),
StructField("firstname", StringType(), True),
StructField("middlename", StringType(), True),
StructField("lastname", StringType(), True),
StructField("dob_year", StringType(), True),
StructField("dob_month", LongType(), True),
StructField("gender", StringType(), True),
StructField("salary", LongType(), True),
])
Person_details_df2 = ds\
.select(from_json(col("value"), custom_schema).alias("Person_details"))
Person_details_df3 = Person_details_df2.select("Person_details.*")
from pyspark.sql import DataFrameWriter
def foreach_batch_function(df, epoch_id):
Person_details_df3.write.jdbc(url='jdbc:mysql://172.16.23.27:30038/securedb', driver='com.mysql.jdbc.Driver', dbtable="sparkkafka", user='root',password='root$1234')
pass
query = Person_details_df3.writeStream.trigger(processingTime='20 seconds').outputMode("append").foreachBatch(foreach_batch_function).start()
query
Out[14]: <pyspark.sql.streaming.StreamingQuery at 0x1fb25503b08>
MySQL table schema:
create table sparkkafka(
id int,
firstname VARCHAR(40) NOT NULL,
middlename VARCHAR(40) NOT NULL,
lastname VARCHAR(40) NOT NULL,
dob_year int(40) NOT NULL,
dob_month int(40) NOT NULL,
gender VARCHAR(40) NOT NULL,
salary int(40) NOT NULL,
PRIMARY KEY (id)
);
I presume Person_details_df3 is your streaming dataframe and your Spark version is 2.4.0 or above.
To use the foreachBatch API, write it as below:
db_target_properties = {"user":"xxxx", "password":"yyyyy"}
def foreach_batch_function(df, epoch_id):
df.write.jdbc(url='jdbc:mysql://172.16.23.27:30038/securedb', table="sparkkafka", properties=db_target_properties)
pass
query = Person_details_df3.writeStream.outputMode("append").foreachBatch(foreach_batch_function).start()
query.awaitTermination()
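One more thing worth double-checking: the MySQL JDBC driver jar (e.g. mysql-connector-java) also has to be on the Spark classpath, for instance via spark-submit --jars or --packages, otherwise the batch write fails because the driver class cannot be loaded.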
I use Spark Structured Streaming to consume a Kafka topic that has several types of messages (a different schema for each type). I define a schema that has all the fields for the different kinds of messages.
How can I filter out empty fields from the dataframe for each row, or how can I read a dataframe from Kafka with a dynamic schema?
val inputDS = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "overview")
.load()
val schemaa: StructType = StructType(
Array(
StructField("title", StringType, true),
StructField("url", StringType, true),
StructField("content", StringType, true),
StructField("collect_time", StringType, true),
StructField("time", StringType, true),
StructField("user_head", StringType, true),
StructField("image", StringType, true)
)
)
inputDS.withColumn("value", from_json($"value".cast(StringType), schemaa))
//.filter() // todo filter empty field
.writeStream
.format("console")
.start()
.awaitTermination()
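If the goal is simply to drop the rows where a given message type did not populate a field, one option for the filter() placeholder above is to test the nested columns produced by from_json, which come back as null when a field is absent from the JSON. A minimal sketch, using title as an example field and assuming the same imports as the snippet above:
// keep only messages that actually carry a non-empty title
val parsed = inputDS
  .withColumn("value", from_json($"value".cast(StringType), schemaa))
  .filter($"value.title".isNotNull && $"value.title" =!= "")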
What does the Databricks Delta Lake mergeSchema option do if a pre-existing column is appended with a different data type?
For example, given a Delta Lake table with schema foo INT, bar INT, what would happen when trying to write-append new data with schema foo INT, bar DOUBLE when specifying the option mergeSchema = true?
The write fails. (as of Delta Lake 0.5.0 on Databricks 6.3)
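For reference, this is the kind of append that fails; a minimal sketch, where newData is a hypothetical dataframe with schema foo INT, bar DOUBLE and /tmp/delta/events is a hypothetical table path:
// mergeSchema lets the append add brand-new columns, but bar already exists as INT,
// so writing it as DOUBLE is rejected with a schema mismatch error
newData.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/tmp/delta/events")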
I think this is what you are looking for.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.functions.input_file_name
val customSchema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", StringType, true),
StructField("field3", StringType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true),
StructField("field6", StringType, true),
StructField("field7", StringType, true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("sep", "|")
.schema(customSchema)
.load("mnt/rawdata/corp/ABC*.gz")
.withColumn("file_name", input_file_name())
Just replace 'field1', 'field2', etc., with your actual field names. Also, 'ABC*.gz' does a wildcard search for files beginning with a specific string, like 'ABC'; the '*' character matches any combination of characters up to the '.gz' extension, which indicates a gzipped file. Your naming convention may differ, of course, so just change that pattern to meet your specific needs.
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd =
sc.textFile("/user/scvappdf/hdpcf/SIT/df/meta/1576373/SDM_Data/RDM/rdm_all_ctry_cd_ver1.txt")
val header = rdd.first()
val rdd1 = rdd.filter(x => x != header)
val rowRDD = rdd1.map(_.split("\\|")).map(p => Row(p(0).trim,
p(1).trim,p(2).trim,p(3).trim,p(4).trim,p(5).trim,p(6).trim,p(7).trim,p(8).trim,p(9).trim,
p(10).trim,p(11).trim,p(12).trim,p(13).trim,p(14).trim,p(15).trim,p(16).trim,p(17).trim,
p(18).trim,p(19).trim,p(20).trim,p(21).trim,p(22).trim,p(23).trim,p(24).trim,p(25).trim,p(26).trim))
val innerStruct = StructType(
  StructField("rowid", StringType, true) ::
  StructField("s_startdt", StringType, false) ::
  StructField("s_starttime", StringType, false) ::
  StructField("s_enddt", StringType, false) ::
  StructField("s_endtime", StringType, false) ::
  StructField("s_deleted_flag", StringType, false) ::
  StructField("ctry_cd", StringType, false) ::
  StructField("ctry_cd_desc", StringType, false) ::
  StructField("intrnl_ctry_rgn_cd", StringType, false) ::
  StructField("intrnl_ctry_rgn_desc", StringType, false) ::
  StructField("iso_ind", StringType, false) ::
  StructField("iso_alp2_ctry_cd", StringType, false) ::
  StructField("iso_alp3_ctry_cd", StringType, false) ::
  StructField("iso_num_cd", StringType, false) ::
  StructField("npc_ind", StringType, false) ::
  StructField("cal_type_cd", StringType, false) ::
  StructField("hemisphere_cd", StringType, false) ::
  StructField("hemisphere_desc", StringType, false) ::
  StructField("ref_tbl_id", StringType, false) ::
  StructField("eff_strt_dt", StringType, false) ::
  StructField("eff_end_dt", StringType, false) ::
  StructField("ent_st", StringType, false) ::
  StructField("vers", StringType, false) ::
  StructField("crtd_tmst", StringType, false) ::
  StructField("mod_tmst", StringType, false) ::
  StructField("src_id", StringType, false) ::
  StructField("ods", StringType, false) :: Nil)
val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, innerStruct)
peopleSchemaRDD.write.format("orc").mode("overwrite")
.save("/sit/regenv/hdata/DATAFABRIC/reg_tb_prd_dfab/stg_rdm_all_ctry_cd_ver1/ods=2019-12-06")
While saving the dataframe peopleSchemaRDD in ORC format I am getting the error "java.lang.ArrayIndexOutOfBoundsException: 3",
caused by org.apache.spark.SparkException: Task failed while writing rows.
There is no problem with the path, as the folder is getting created, but no data is arriving. I am using Spark version 1.6.3.
There are some issues with SQLContext; try to use the HiveContext instead. Use the configuration below to resolve that:
spark.sql.orc.impl=native
The above is provided from the perspective of the spark-submit command.
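If you would rather set this in code than on the command line, the same option can go on the session builder; a sketch, noting that spark.sql.orc.impl is only recognised from Spark 2.3 onwards:
import org.apache.spark.sql.SparkSession

// Spark 2.3+: use the native ORC writer instead of the Hive-based one
val spark = SparkSession.builder()
  .appName("OrcWrite")
  .config("spark.sql.orc.impl", "native")
  .enableHiveSupport()
  .getOrCreate()
On the command line the equivalent is spark-submit --conf spark.sql.orc.impl=native.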
Syntax for creating a schema in PySpark.
data.csv
id,name
1,sam
2,smith
val schema = new StructType().add("id", IntegerType).add("name", StringType)
val ds = spark.read.schema(schema).option("header", "true").csv("data.csv")
ds.show
Define a StructType with StructField(name, dataType, nullable=True).
You can import the datatypes from pyspark.sql.types:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType,BooleanType
schema = StructType([
StructField("col_a", StringType(), True),
StructField("col_b", IntegerType(), True),
StructField("col_c", FloatType(), True),
StructField("col_d", BooleanType(), True)
])
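The schema can then be applied the same way as in the Scala snippet from the question, e.g. spark.read.schema(schema).option("header", "true").csv("data.csv").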