I have a JSON file:
{
  "name": "John",
  "age": 30,
  "car": "testing"
}
and I have this code in Databricks:
struct2 = StructType([StructField("name", StringType(), True), \
StructField("age", IntegerType(), True, None), \
StructField("car", StringType(), True, None)])
df2 = spark.readStream.schema(struct2).format('json') \
.load("abfss://i**********.dfs.core.windows.net/streamjson/")
In the next step I start the writeStream to another folder:
df2.select("name","age","car").writeStream.format('json')\
.option("checkpointLocation", "abfss://****#*****.dfs.core.windows.net/outputstream/jsoncheckpoint3") \
.start("abfss://***#******.dfs.core.windows.net/streamjsonoutput/")
I put new files there, and when I check the files in streamjsonoutput, they look like the following:
Can anyone point out what I did wrong?
You need to add .option("multiLine", "true") to your spark.readStream, because by default Spark expects the JSON file to consist of individual JSON objects, one per line, not objects spanning multiple lines (see the docs):
df2 = spark.readStream.schema(struct2).option("multiLine", "true") \
.json("abfss://...")
I am fetching pyspark stream data:
spark = SparkSession \
.builder \
.getOrCreate()
raw_stream = spark \
.readStream \
.option("endpoint", conf.get('config', 'endpoint')) \
.option("app.name", conf.get('config', 'app_name')) \
.option("app.secret", conf.get('config', 'app_key')) \
.option("dc", conf.get('config', 'dc')) \
.option("source.topic", conf.get('config', 'topic')) \
.option("group.name", conf.get('config', 'group')) \
.option("source.value.binaryType", 'false') \
.load()
raw_stream_str = raw_stream \
.selectExpr("CAST(value AS STRING)")
value_batch = raw_stream_str \
.writeStream \
.queryName("value_query") \
.format("memory") \
.start()
spark.sql("select * from value_query").show()
The output is as below:
+--------------------+
| value|
+--------------------+
|{"message":"DGc6K...|
+--------------------+
The whole content of value looks like this:
[Row(value='{"message": "xxx", "source": "xxx", "instance": "xxx", "metrics": {"metric1": "xxx", "metric2": "xxx", ...}}')]
where metrics is a dictionary of key-value pairs. I want to extract the metrics content so I end up with something like this:
+-------+-------+-------------------+
|metric1|metric2| metric3|
+-------+-------+-------------------+
| "abc"| 12345|01/01/2022 00:00:00|
+-------+-------+-------------------+
I am able to achieve it by treating the value as a list of strings:
raw_stream_str.selectExpr("value", "split(value,',')[3] as message").drop("value")
raw_stream_str.selectExpr("metrics","split(value,',')[0] as metric1"...).drop("metrics")
Is there a more efficient (Spark) way of doing it in terms of distributed computing? Maybe by applying/mapping some function to every row of the stream output DataFrame with the help of json.loads, so I can exploit its key-value nature?
As suggested, you can use from_json to parse the data into the required schema, and after that you can do df.select("metrics.*") to get the required DataFrame.
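A minimal sketch of that approach, assuming the value layout shown above (the field names and types are illustrative and should be adjusted to the real payload):
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType

# Illustrative schema matching the value string shown above.
value_schema = StructType() \
    .add("message", StringType()) \
    .add("source", StringType()) \
    .add("instance", StringType()) \
    .add("metrics", StructType()
         .add("metric1", StringType())
         .add("metric2", StringType())
         .add("metric3", StringType()))

# Parse the JSON string and flatten the metrics struct into one column per key.
parsed = raw_stream_str \
    .select(from_json(col("value"), value_schema).alias("data")) \
    .select("data.metrics.*")

# parsed can then be written out with writeStream as in the question.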
I have some values in Azure Key Vault (AKV).
A simple initial Google search gave me:
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")
from kafka import KafkaConsumer
consumer = KafkaConsumer('TOPIC',
bootstrap_servers = 'SERVER:PORT',
enable_auto_commit = False,
auto_offset_reset = 'earliest',
consumer_timeout_ms = 2000,
security_protocol = 'SASL_SSL',
sasl_mechanism = 'PLAIN',
sasl_plain_username = username,
sasl_plain_password = pwd)
This works once when the cell in Databricks runs; however, after a single run it finishes, it no longer listens for Kafka messages, and the cluster shuts down after the configured idle time (in my case 30 minutes).
So it doesn't solve my problem.
My next Google search turned up this Databricks blog post (Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2):
from pyspark.sql.types import *
from pyspark.sql.functions import from_json
from pyspark.sql.functions import *
schema = StructType() \
.add("EventHeader", StructType() \
.add("UUID", StringType()) \
.add("APPLICATION_ID", StringType())
.add("FORMAT", StringType())) \
.add("EmissionReportMessage", StructType() \
.add("reportId", StringType()) \
.add("startDate", StringType()) \
.add("endDate", StringType()) \
.add("unitOfMeasure", StringType()) \
.add("reportLanguage", StringType()) \
.add("companies", ArrayType(StructType([StructField("ccid", StringType(), True)]))))
parsed_kafka = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "SERVER:PORT") \
.option("subscribe", "TOPIC") \
.option("startingOffsets", "earliest") \
.load()\
.select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))
There are some issues:
Where should I put my GenID or user/pass info?
When I run the display command, it runs, but it never stops and never shows the result.
however, after a single run it is finished, and it is not listening to Kafka messages anymore
Given that you have enable_auto_commit = False, it should continue to work on following runs. But this isn't using Spark...
Where should I put my GenID or user/pass info
You would add SASL/SSL properties into option() parameters.
For example, for the SASL PLAIN mechanism:
option("kafka.sasl.jaas.config",
'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, password))
See related question
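Putting that together with the structured-streaming reader from the question, the SASL settings go in as extra kafka.-prefixed options next to the bootstrap servers (a sketch; SERVER:PORT and TOPIC are placeholders, username/pwd are the secrets pulled from AKV above, and schema, from_json and col come from the question's snippet):
# Sketch: the question's Kafka reader with SASL credentials added as kafka.* options.
parsed_kafka = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config",
            'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, pwd)) \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))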
it will never stop
Because you are running a streaming query (starting with readStream) rather than a batch read.
it will never show the result
You'll need to use parsed_kafka.writeStream.format("console").start(), for example, somewhere (assuming you want to keep readStream rather than a batch read with display()).
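A minimal sketch of that console sink (output mode and trigger left at their defaults):
query = parsed_kafka.writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Keep the streaming query alive so results keep printing to the driver log.
query.awaitTermination()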
I'm using Databricks and trying to read in a CSV file like this:
df = (spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path_to_my_file)
)
and I'm getting the error:
AnalysisException: 'Unable to infer schema for CSV. It must be specified manually.;'
I've checked that my file is not empty, and I've also tried to specify the schema myself like this:
schema = "datetime timestamp, id STRING, zone_id STRING, name INT, time INT, a INT"
df = (spark.read
.option("header", "true")
.schema(schema)
.csv(path_to_my_file)
)
But when I try to view it using display(df), it just gives me the output below; I'm totally lost and don't know what to do.
df.show() and df.printSchema() give the following:
It looks like the data are not being read into the DataFrame.
error snapshot:
Note: this is an incomplete answer, as there isn't enough information about what your file looks like to understand why inferSchema did not work. I've posted it as an answer because it is too long for a comment.
That said, to programmatically specify a schema, you would need to define it using StructType().
Using your example of
datetime timestamp, id STRING, zone_id STRING, name INT, time INT, mod_a INT
it would look something like this:
# Import data types
from pyspark.sql.types import *
schema = StructType(
[StructField('datetime', TimestampType(), True),
StructField('id', StringType(), True),
StructField('zone_id', StringType(), True),
StructField('name', IntegerType(), True),
StructField('time', IntegerType(), True),
StructField('mod_a', IntegerType(), True)
]
)
Note how df.printSchema() showed that all of the columns were of datatype string.
I discovered that the problem was caused by the filename.
Perhaps Databricks is unable to read files whose names begin with '_' (an underscore).
I had the same problem, and when I uploaded the file without the leading underscore, I was able to process it.
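This lines up with Spark treating files whose names start with "_" or "." as hidden/metadata files and skipping them. A simple workaround is to copy the file to a name without the leading underscore before reading it (the paths here are hypothetical, and schema is the StructType defined above):
# Copy the underscore-prefixed file to a plain name, then read that copy.
dbutils.fs.cp("dbfs:/mnt/data/_my_file.csv", "dbfs:/mnt/data/my_file.csv")

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("dbfs:/mnt/data/my_file.csv"))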
I want to retrieve the last 50 records inserted into Elasticsearch to find out their average for an Anomaly detection project.
This is how I am retrieving data from ES. However, it is fetching all of the data, not just the last 50 records. Is there any way to do that?
edf = spark \
.read \
.format("org.elasticsearch.spark.sql") \
.option("es.read.metadata", "false") \
.option("es.nodes.wan.only","true") \
.option("es.port","9200")\
.option("es.net.ssl","false")\
.option("es.nodes", "http://localhost") \
.load("anomaly_detection/data")
from pyspark.sql.functions import expr

# GroupBy based on the `sender` column
df3 = edf.groupBy("sender") \
    .agg(expr("avg(amount)").alias("avg_amount"))
Here the read is fetching all of the rows; how can I get only the last 50 rows into the DataFrame?
Input data schema format:
|sender|receiver|amount|
You can also supply a query while reading the data, for example:
query='{"query": {"match_all": {}}, "size": 50, "sort": [{"_timestamp": {"order": "desc"}}]}'
and pass it as
edf = spark \
.read \
.format("org.elasticsearch.spark.sql") \
.option("es.read.metadata", "false") \
.option("es.nodes.wan.only","true") \
.option("es.port","9200")\
.option("es.net.ssl","false")\
.option("es.nodes", "http://localhost") \
.option("query", query)
.load("anomaly_detection/data")
I need to read a dataset into a DataFrame, then write the data to Delta Lake. But I get the following exception:
AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to `dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/` using Databricks Delta, but there is no\ntransaction log present. Check the upstream job to make sure that it is writing\nusing format("delta") and that you are trying to write to the table base path.\n\nTo disable this check, SET spark.databricks.delta.formatCheck.enabled=false\nTo learn more about Delta, see https://docs.azuredatabricks.net/delta/index.html\n;
Here is the code preceding the exception:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType
inputSchema = StructType([
StructField("InvoiceNo", IntegerType(), True),
StructField("StockCode", StringType(), True),
StructField("Description", StringType(), True),
StructField("Quantity", IntegerType(), True),
StructField("InvoiceDate", StringType(), True),
StructField("UnitPrice", DoubleType(), True),
StructField("CustomerID", IntegerType(), True),
StructField("Country", StringType(), True)
])
rawDataDF = (spark.read
.option("header", "true")
.schema(inputSchema)
.csv(inputPath)
)
# write to Delta Lake
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)
This error message is telling you that there is already data at the destination path (in this case dbfs:/user/class#azuredatabrickstraining.onmicrosoft.com/delta/customer-data/), and that the data there is not in the Delta format (i.e. there is no transaction log). You can either choose a new path (which, based on the comments above, it seems like you did) or delete that directory and try again.
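If you go the delete-and-retry route, a minimal sketch using dbutils (DataPath and rawDataDF are the variables from the code above):
# Remove the existing non-Delta data at the target path, then retry the Delta write.
dbutils.fs.rm(DataPath, True)
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)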
I found this Question with this search: "You are trying to write to *** using Databricks Delta, but there is no transaction log present."
In case someone searches for the same:
For me the solution was to explicitly code
.write.format("parquet")
because
.format("delta")
is the default in Databricks Runtime 8.0 and above, and I need "parquet" for legacy reasons.
You can also get this error if you try to read the data in a format that is not supported by spark.read (or if you do not specify the format).
The file format should be specified as one of the supported formats: csv, txt, json, parquet, or avro.
dataframe = spark.read.format('csv').load(path)