I'm trying to read a CSV file in Azure Databricks and it gives me the following error.
df2 = (spark.read.format('csv')
    .option("delimiter", ",")
    .option("quote", '"')
    # .option("quoteAll", '"')
    .option("escape", '"')
    .option("header", "false")
    .option("path", '/mnt/d365/' + absolute + '/' + table_name + "/*.csv")
    .option("mode", "failfast")
    # .option("mode", "dropmalformed")
    # .option("mode", "permissive")
    .option("lineSep", "\r\n")
    .option("multiLine", "true")
    # .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .load()
)
display(df2)
"Caused by: MalformedCSVException: Malformed CSV record".
I have found some ways to bypass this problem, but I don't want to use .option("mode", "dropmalformed") or .option("mode", "permissive").
How am I supposed to discover which characters are the problem, identify where they are coming from, and fix them at the source?
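One way to locate them (a diagnostic sketch only, not the final read; it assumes the same schema, absolute, and table_name used above): temporarily re-read the files in PERMISSIVE mode with a corrupt-record column, so the rejected lines are kept and can be inspected.
from pyspark.sql.types import StructType, StructField, StringType

# Same schema as above, plus a column that will hold any unparsable line.
debug_schema = StructType(schema.fields + [StructField("_corrupt_record", StringType(), True)])

debug_df = (spark.read.format('csv')
    .option("delimiter", ",")
    .option("quote", '"')
    .option("escape", '"')
    .option("header", "false")
    .option("lineSep", "\r\n")
    .option("multiLine", "true")
    .option("mode", "permissive")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(debug_schema)
    .load('/mnt/d365/' + absolute + '/' + table_name + "/*.csv"))

# Cache first: Spark rejects queries that reference only the internal
# corrupt-record column of a raw CSV scan.
debug_df.cache()
bad_rows = debug_df.filter(debug_df["_corrupt_record"].isNotNull())
display(bad_rows.select("_corrupt_record"))
Once the raw offending lines are visible, the culprit is usually an unescaped quote or an embedded line break coming from the source system, which then tells you which upstream export to fix.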
When I use local Spark on Windows as shown below, it works and I can see the df.count() output.
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, col

spark = SparkSession \
    .builder \
    .appName("Structured Streaming") \
    .master("local[*]") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()
flower_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
flower_schema_string = "sepal_length DOUBLE, sepal_width DOUBLE, petal_length DOUBLE, petal_width DOUBLE, species STRING"
flower_df2 = flower_df1.select(from_csv(col("value"), flower_schema_string).alias("flower"), "timestamp").select("flower.*", "timestamp")
flower_df2.createOrReplaceTempView("flower_find")
song_find_text = spark.sql("SELECT * FROM flower_find")
flower_agg_write_stream = song_find_text \
    .writeStream \
    .option("truncate", "false") \
    .format("memory") \
    .outputMode("update") \
    .queryName("testedTable") \
    .start()
while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)
But when I use Spark on my VirtualBox Ubuntu machine, I NEVER see any data.
Below are the modifications I made when using Ubuntu's Spark:
Changed the SparkSession's master URL to "spark://192.168.15.2:7077".
Inserted flower_agg_write_stream.awaitTermination() above the "while True:" loop.
Did I do something wrong?
UPDATE:
When I run the modified code, the following appears in the log:
...
org.apache.spark.sql.AnalysisException: Table or view not found: testedTable;
...
Unfortunately, I have already tried createOrReplaceGlobalTempView(), but that doesn't work either.
Below is my first program working with Kafka and PySpark. The code seems to run without exceptions, but the output of my query is empty.
I initialize Spark and Kafka, subscribe to the topic "quickstart-events" in the Kafka setup, and produce messages for this topic from the terminal. But when I run this code, it gives me empty DataFrames.
How do I resolve this?
Code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession, DataFrame
from pyspark.sql.types import StructType, ArrayType, StructField, IntegerType, StringType, DoubleType
spark = SparkSession.builder \
    .appName("Spark-Kafka-Integration") \
    .master("local[2]") \
    .getOrCreate()

dsraw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "quickstart-events") \
    .load()
ds = dsraw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print(type(ds))
rawQuery = dsraw \
    .writeStream \
    .queryName("query1") \
    .format("memory") \
    .start()
raw = spark.sql("select * from query1")
raw.show() # empty output
rawQuery = ds \
    .writeStream \
    .queryName("query2") \
    .format("memory") \
    .start()
raw = spark.sql("select * from query2")
raw.show() # empty output
print("complete")
Output:
+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
+---+-----+-----+---------+------+---------+-------------+
+---+-----+
|key|value|
+---+-----+
+---+-----+
If you are just learning and experimenting with Kafka and Spark Structured Streaming, then this is fine.
Just use:
while True:
    time.sleep(5)
    print("queryresult")
    raw.show()  # it will start printing the result
instead of:
raw.show()  # it runs only once, which is why it is not printing the result
Do NOT use this for production code.
It is better to write it like this:
spark = SparkSession.builder \
    .appName("Spark-Kafka-Integration") \
    .master("local[2]") \
    .getOrCreate()

dsraw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "quickstart-events") \
    .load()
ds = dsraw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
rawQuery = ds \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .start()
rawQuery.awaitTermination()
It will automatically print the results to the console.
I am using Spark v2.4.0. I am reading two separate streams from Kafka and applying different transformations to each of them. Now I want to persist both streaming DataFrames, but only one of them gets persisted; the other does not seem to work at the same time. I would be highly grateful for any help.
Below is my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import from_json, col, to_date
# Created a SparkSession here, as it is an entry point to underlying Spark functionality
spark = SparkSession.builder \
    .master('spark://yash-tech:7077') \
    .appName('Streaming') \
    .getOrCreate()

# Defined a schema for our data being streamed from kafka
schema = StructType([
    StructField("messageId", StringType(), True),
    StructField("type", StringType(), True),
    StructField("userId", StringType(), True),
    StructField('data', StringType(), True),
    StructField("timestamp", StringType(), True),
])

profileDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", 'test') \
    .option("startingOffsets", "latest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("value"))

# Using readStream on SparkSession to load a streaming Dataset from Kafka
clickStreamDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", 'test_new') \
    .option("startingOffsets", "latest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("value"))
# Selecting every column from the DF
clickStreamDFToPersist = clickStreamDF.select("value.*")
profileDFToPersist = profileDF.select("value.*")
# Added a new column containing date(yyyy-MM-dd) parsed from timestamp column for day wise partitioning
clickStreamDFToPersist = clickStreamDFToPersist.withColumn(
    "date", to_date(col("timestamp"), "yyyy-MM-dd"))

# Writing data on local disk as json files, partitioned by userId.
clickStream_writing_sink = clickStreamDFToPersist.repartition(1) \
    .writeStream \
    .partitionBy('userId', 'date') \
    .format("json") \
    .option("path", "/home/spark/data/") \
    .outputMode("append") \
    .option("checkpointLocation", "/home/spark/event_checkpoint/") \
    .trigger(processingTime='20 seconds') \
    .start()

profile_writing_sink = profileDFToPersist.repartition(1) \
    .writeStream \
    .partitionBy('userId') \
    .format("json") \
    .option("path", "/home/spark/data/") \
    .outputMode("append") \
    .option("checkpointLocation", "/home/spark/profile_checkpoint/") \
    .trigger(processingTime='30 seconds') \
    .start()
clickStream_writing_sink.awaitTermination()
profile_writing_sink.awaitTermination()
NOTE:
I want both writeStreams to write to the same path.
If I give different data paths to the two writeStreams, the code works, but the data gets persisted in different locations. Is there a way to persist both streams to the same location? Or, since the location is the same for both, can I do both transformations and persist the data using a single stream?
In one stream I am partitioning only by userId, and in the other I am partitioning by userId + date.
Hi, since the same path is provided as the sink directory for both queries, the outputs overwrite each other.
You cannot change the "part" prefix while using any of the standard output formats.
It could be possible if you override recordWriter().
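For the single-stream idea the question itself raises, here is a minimal sketch. It assumes that subscribing to both topics in one reader is acceptable (the Kafka source takes a comma-separated topic list), that both payloads share the schema defined above, and that partitioning everything by userId and date is fine; the combined checkpoint path is hypothetical.
from pyspark.sql.functions import from_json, col, to_date

# One reader subscribed to both topics.
combinedDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test,test_new") \
    .option("startingOffsets", "latest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("value")) \
    .select("value.*") \
    .withColumn("date", to_date(col("timestamp"), "yyyy-MM-dd"))

# Single sink, single checkpoint, one partitioning scheme for everything.
combined_sink = combinedDF.repartition(1) \
    .writeStream \
    .partitionBy("userId", "date") \
    .format("json") \
    .option("path", "/home/spark/data/") \
    .option("checkpointLocation", "/home/spark/combined_checkpoint/") \
    .outputMode("append") \
    .trigger(processingTime="20 seconds") \
    .start()

combined_sink.awaitTermination()
If the two topics genuinely need different partitioning, the usual pattern is two queries with two different output paths, since each file sink keeps its own _spark_metadata log inside its output directory.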
I get data from Kafka using PySpark Structured Streaming, and the result is a DataFrame. When I transform the DataFrame to an RDD, it goes wrong:
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 36, in <module>
df = df.rdd.map(lambda x: x.value.split(" ")).toDF()
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 91, in rdd
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
The working version of the code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()

df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
df = df.withColumn("s", F.split(df['value'], " "))
df = df.withColumn('e', F.explode(df['s']))
# df = df.rdd.map(lambda x: x.value.split(" ")).toDF()

q = df.writeStream \
    .format("console") \
    .trigger(processingTime='30 seconds') \
    .start()
q.awaitTermination()
This is the failing version of the code:
spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()

df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
# df = df.withColumn("s", F.split(df['value'], " "))
# df = df.withColumn('e', F.explode(df['s']))
df = df.rdd.map(lambda x: x.value.split(" ")).toDF()

q = df.writeStream \
    .format("console") \
    .trigger(processingTime='30 seconds') \
    .start()
q.awaitTermination()
Why can't the DataFrame be converted to an RDD, and what should I do when I want to transform a DataFrame to an RDD in PySpark Structured Streaming?
If your Spark version is 2.4.0 or above, then you can use the alternative below to work with each row of your DataFrame.
query = df.writeStream.foreach(<customized method that works on each row of the DataFrame rather than an RDD>).outputMode("update").start()
query.awaitTermination()
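For concreteness, a minimal sketch of that foreach() pattern; process_row is a hypothetical placeholder for whatever per-row logic you need:
# Hypothetical per-row handler: receives each row as a pyspark.sql.Row.
def process_row(row):
    words = row.value.split(" ")
    print(words)

query = df.writeStream \
    .foreach(process_row) \
    .outputMode("update") \
    .start()
query.awaitTermination()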
This RDD aspect is simply NOT supported. RDDs are legacy, and Spark Structured Streaming is DataFrame/Dataset based: a common abstraction whether streaming or batch.
To perform specific actions over your DataFrame fields you can use UDFs, or you can even create your own custom Spark Transformers. But some DataFrame operations, such as converting to an RDD, are not supported.
Structured Streaming runs on the Spark SQL engine. Converting a streaming DataFrame or Dataset to an RDD is not supported.
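As a concrete illustration of the UDF route mentioned above, here is a sketch that reuses the df from the question after the CAST step:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

# Per-value logic expressed as a column-level UDF instead of an RDD map.
split_words = udf(lambda v: v.split(" ") if v is not None else [], ArrayType(StringType()))

df = df.withColumn("words", split_words(col("value")))

q = df.writeStream \
    .format("console") \
    .trigger(processingTime='30 seconds') \
    .start()
q.awaitTermination()
For a plain split, the built-in F.split/F.explode used in the working version is cheaper; a Python UDF is only worth it when there is no built-in equivalent.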
I write JSON to a Kafka topic and read JSON from that Kafka topic. I can subscribe to the topic and write it to the console line by line, but I need to sink/write it to a CSV file, and I can't get that to work. It writes the CSV once but does not append.
You can see my code below.
Thank you!
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as func
spark = SparkSession.builder \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0') \
    .appName('kafka_stream_test') \
    .getOrCreate()

ordersSchema = StructType() \
    .add("a", StringType()) \
    .add("b", StringType()) \
    .add("c", StringType()) \
    .add("d", StringType()) \
    .add("e", StringType()) \
    .add("f", StringType())
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "product-views") \
    .load()

df_query = df \
    .selectExpr("cast(value as string)") \
    .select(func.from_json(func.col("value").cast("string"), ordersSchema).alias("parsed")) \
    .select("parsed.a", "parsed.b", "parsed.c", "parsed.d", "parsed.e", "parsed.f")

df = df_query \
    .writeStream \
    .format("csv") \
    .trigger(processingTime="5 seconds") \
    .option("path", "/var/kafka_stream_test_out/") \
    .option("checkpointLocation", "/user/kafka_stream_test_out/chk") \
    .start()

df.awaitTermination()
Yes, it is because you need this extra option, .option("format", "append"):
aa = df_query \
    .writeStream \
    .format("csv") \
    .option("format", "append") \
    .trigger(processingTime="5 seconds") \
    .option("path", "/var/kafka_stream_test_out/") \
    .option("checkpointLocation", "/user/kafka_stream_test_out/chk") \
    .outputMode("append") \
    .start()
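To confirm that new part files really are appended on each trigger, the sink directory can be read back as a normal batch DataFrame; a sketch, assuming the ordersSchema defined in the question:
# Batch read of the streaming sink's output; the count should grow as new
# micro-batches are written.
out = spark.read \
    .schema(ordersSchema) \
    .csv("/var/kafka_stream_test_out/")
print(out.count())
out.show()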