Azure batch telemetry data ingestion to database - Azure

I have to develop an IoT solution. For now, the telemetry data arrives as batch data that combines the readings of around 300 devices at once, at a particular interval; it is not live. What is the best way to design this? How the batch data is sent is also something we can define.

Hummm, I just answered another question similar to this one. Using the code below, we can merge data from multiple files, all with a similar name, into a data frame and push the whole thing into SQL Server. This is Scala, so it needs to be run in your Azure Databricks environment.
// merge files with similar names into a single DataFrame
val DF = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/corp/ABC*.gz")
DF.count()

// rename the columns of the DataFrame
val newNames = Seq("ID", "FName", "LName", "Address", "ZipCode", "file_name")
val dfRenamed = DF.toDF(newNames: _*)
dfRenamed.printSchema
// push the renamed DataFrame to SQL Server
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url" -> "my_sql_server.database.windows.net",
  "databaseName" -> "my_db_name",
  "dbTable" -> "dbo.my_table",
  "user" -> "xxxxx",
  "password" -> "xxxxx",
  "connectTimeout" -> "5", // seconds
  "queryTimeout" -> "5" // seconds
))

import org.apache.spark.sql.SaveMode
dfRenamed.write.mode(SaveMode.Append).sqlDB(config)
The code above will read every line of every file. If the headers are in the first line, this works great. If the headers are NOT in the first line, use the code below to create a specific schema and, again, read every line of every file.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.functions.input_file_name

val customSchema = StructType(Array(
  StructField("field1", StringType, true),
  StructField("field2", StringType, true),
  StructField("field3", StringType, true),
  StructField("field4", StringType, true),
  StructField("field5", StringType, true),
  StructField("field6", StringType, true),
  StructField("field7", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("sep", "|")
  .schema(customSchema)
  .load("mnt/rawdata/corp/ABC*.gz")
  .withColumn("file_name", input_file_name())
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val bulkCopyConfig = Config(Map(
  "url" -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "user" -> "username",
  "password" -> "*********",
  "dbTable" -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout" -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig) // use this form if no column metadata is specified
// df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata) // use this form when a BulkCopyMetadata is supplied
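The bulkCopyMetadata object referenced in the commented-out call is never defined in the snippet. A rough sketch of how it could be built with the library's BulkCopyMetadata class follows; the column ordinals, names, JDBC types, and sizes are placeholders for illustration, not values from the original:
// hypothetical column metadata -- adjust the ordinals, names, types, and sizes to your table
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "field1", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "field2", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(3, "field3", java.sql.Types.NVARCHAR, 128, 0)

df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)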

Related

Spark SQL: adding a column comment with withComment does not work

I want to add comments (remarks) to the DataFrame columns and then write a Hive table, but it does not work: the comments are not added to the table.
I tried Spark 2.4 and Spark 3, and it does not work in either. Lower versions seem to work, and I don't know why. I tried reading the source code but found nothing. If you know why, please tell me, thank you.
The code is as follows:
val personRDD: RDD[Row] = GetTestRDD.map((line: String) => {
  val arr: Array[String] = line.split(" ")
  Row(arr(0).toInt, arr(1), arr(2).toInt)
})
val schema: StructType = StructType(List(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))
val frame: DataFrame = sparkSession.createDataFrame(personRDD, schema)
println("Original schema info:")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
// processing after adding the comments
val commentMap: Map[String, String] = Map("id" -> "unique identifier", "name" -> "name", "age" -> "age")
val newSchema: Seq[StructField] = frame.schema.map((s: StructField) => {
  println(commentMap(s.name))
  s.withComment(commentMap(s.name))
})
sparkSession.createDataFrame(frame.rdd, StructType(newSchema)).repartition(10)
println("Schema info after processing:")
frame.schema.foreach((s: StructField) => println(s.name, s.metadata))
The output:
Original schema info:
(id,{})
(name,{})
(age,{})
Schema info after processing:
(id,{})
(name,{})
(age,{})
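A side note on the snippet itself (which may or may not be related to the Hive behaviour): withComment returns new StructField objects, and the DataFrame built from StructType(newSchema) on the second-to-last line is never assigned, so the final println still inspects the original frame, whose schema is unchanged. A minimal check against the rebuilt schema, assuming the code above, would be:
// inspect the schema of the rebuilt DataFrame instead of the original `frame`
val commentedFrame = sparkSession.createDataFrame(frame.rdd, StructType(newSchema)).repartition(10)
commentedFrame.schema.foreach((s: StructField) => println(s.name, s.metadata))
// the comments added via withComment live in each field's metadata,
// e.g. (id,{"comment":"unique identifier"})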

Apache Spark Data Generator Function on Databricks to Send Data To Azure Event Hubs

I created a question on how to use the Databricks dummy data generator here: Apache Spark Data Generator Function on Databricks Not working
Everything is working fine. However, I would like to take it to the next level and send the dummy data to Azure Event Hubs.
I attempted to do this myself with the following code:
import dbldatagen as dg
from pyspark.sql.types import IntegerType, StringType, FloatType
import json
from pyspark.sql.types import StructType, StructField, IntegerType, DecimalType, StringType, TimestampType
from pyspark.sql.functions import *

delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]

# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
    .withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
    .withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
    .withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
    .withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
    .withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
    .withColumn("reason", "string", values=delay_reasons, random=True)
)

df_flight_data = flightdata_defn.build(withStreaming=True, options={'rowsPerSecond': 100})

streamingDelays = (
    df_flight_data
    .groupBy(
        df_flight_data.flightNumber,
        df_flight_data.airline,
        df_flight_data.original_departure,
        df_flight_data.delay_minutes,
        df_flight_data.delayed_departure,
        df_flight_data.reason,
        window(df_flight_data.original_departure, "1 hour")
    )
    .count()
)

writeConnectionString = event_hub_connection_string
checkpointLocation = "///checkpoint.txt"

ehWriteConf = {
    'eventhubs.connectionString': writeConnectionString
}

# Write body data from a DataFrame to EventHubs.
ds = streamingDelays \
    .writeStream.format("eventhubs") \
    .options(**ehWriteConf) \
    .outputMode("complete") \
    .option("checkpointLocation", checkpointLocation).start()
However, the stream starts but abruptly stops and fails with the following error:
org.apache.spark.sql.AnalysisException: Required attribute 'body' not found
Any thoughts on what could be causing this error?
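For context on the error itself (not part of the original question): the eventhubs sink only sends the contents of a column named body, which is exactly the attribute the exception reports as missing, so the aggregated columns need to be serialized into one. A minimal sketch of that shape, written in Scala to match the rest of this page (PySpark has the same to_json and struct functions), and assuming Scala equivalents of streamingDelays, ehWriteConf, and checkpointLocation:
import org.apache.spark.sql.functions.{col, struct, to_json}

// serialize every column of the aggregate into a single JSON string column named "body",
// which is the attribute the "eventhubs" sink reports as missing above
val eventHubPayload = streamingDelays
  .select(to_json(struct(streamingDelays.columns.map(col): _*)).alias("body"))

eventHubPayload.writeStream
  .format("eventhubs")
  .options(ehWriteConf)   // Map("eventhubs.connectionString" -> ...)
  .outputMode("complete")
  .option("checkpointLocation", checkpointLocation)
  .start()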

Spark - How to convert text file into a multiple column schema DataFrame/Dataset

I am trying to read a text file and convert it into a DataFrame:
val inputDf: DataFrame = spark.read.text(filePath.get.concat("/").concat(fileName.get))
  .map((row) => row.toString().split(","))
  .map(attributes => {
    Row(attributes(0), attributes(1), attributes(2), attributes(3), attributes(4))
  }).as[Row]
When I do inputDf.printSchema, I get a single column:
root
|-- value: binary (nullable = true)
How can I convert this text file into a multiple-column schema DataFrame/Dataset?
Solved:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val inputSchema: StructType = StructType(
  List(
    StructField("1", StringType, true),
    StructField("2", StringType, true),
    StructField("3", StringType, true),
    StructField("4", StringType, true),
    StructField("5", StringType, true)
  )
)
val encoder = RowEncoder(inputSchema)
val inputDf: DataFrame = spark.read.text(filePath.get.concat("/").concat(fileName.get))
  .map((row) => row.toString().split(","))
  .map(attributes => {
    Row(attributes(0), attributes(1), attributes(2), attributes(3), "BUY")
  })(encoder)
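As an aside (not part of the original solution), the same multi-column result can usually be obtained by letting Spark's CSV reader apply the schema directly, which avoids the manual split and the RowEncoder; a minimal sketch assuming the file is plain comma-separated text:
// read the comma-separated text straight into the schema defined above
val csvDf: DataFrame = spark.read
  .schema(inputSchema)
  .option("sep", ",")
  .option("header", "false")
  .csv(filePath.get.concat("/").concat(fileName.get))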

How to programmatically generate a StructType with StringType for all the fields in Spark?

I have n fields (around 200-300), and I want every field in the StructType to be StringType only. Is there any simple way to do it, like the one below?
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
The code below is what I tried:
val schema = new StructType()
  .add("field1", StringType)
  .add("field2", StringType)
  .add("field3", StringType)
val express: ExpressionEncoder[Row] = RowEncoder.apply(schema)
You can use pattern matching:
import org.apache.spark.sql.types._

val df = Seq(
  (1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
  (2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")

val newSchema = df.schema.fields.map {
  case StructField(name, _: DecimalType, nullable, _) =>
    StructField(name, DoubleType, nullable)
  case field => field
}
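To get StringType for every field, as the question asks, the same mapping idea can be taken one step further; a minimal sketch (the cast of the data itself is an addition, not shown in the answer above):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// a schema in which every field keeps its name and nullability but becomes StringType
val allStringSchema = StructType(df.schema.fields.map(f => StructField(f.name, StringType, f.nullable)))

// cast the data itself so it matches that schema
val allStringDf = df.select(df.columns.map(c => col(c).cast(StringType)): _*)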

Read Excel in Spark Error: InputStream of class ZipArchiveInputStream is not implementing InputStreamStatistics

I am trying to read Excel files from COS via Spark, like this:
def readExcelData(filePath: String, spark: SparkSession): DataFrame =
  spark.read
    .format("com.crealytics.spark.excel")
    .option("path", filePath)
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "False")
    .option("addColorColumns", "False")
    .load()

def readAllFiles: DataFrame = {
  import spark.implicits._
  val objLst: Seq[String] = ??? // contains the list of the file paths
  val schema = StructType(
    StructField("col1", StringType, true) ::
      StructField("col2", StringType, true) ::
      StructField("col3", StringType, true) ::
      StructField("col4", StringType, true) :: Nil
  )
  var initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  for (file <- objLst) {
    initialDF = initialDF.union(
      readExcelData(file, spark).select($"col1", $"col2", $"col3", $"col4"))
  }
  initialDF
}
In this code, I am creating an empty DataFrame first, then reading all the Excel files (by iterating over the file paths) and merging the data via a union operation.
It is throwing an error like this:
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
The spark-excel version is 0.10.2.
Try removing the .show() from your original statement and converting to a DataFrame first:
def readExcel(file: String): DataFrame = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "False")
  .option("addColorColumns", "False")
  .load(file)

val data = readExcel("path to your excel file")
data.show()
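As a follow-up on the merging step (a sketch, not taken from the answer above): once each file reads cleanly, the empty-DataFrame-plus-var loop from the question can be collapsed into a map over the file list followed by a union reduce, reusing the objLst and readExcelData from the question:
// read every Excel file, keep the four columns of interest, and union them all
val allData: DataFrame = objLst
  .map(file => readExcelData(file, spark).select("col1", "col2", "col3", "col4"))
  .reduce(_ union _)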
