Reading gz.parquet file - apache-spark

Hello, I need to read the data from gz.parquet files but don't know how. I tried with Impala, but I get the same result as parquet-tools cat, i.e. without the table structure.
P.S.: Any suggestions to improve the Spark code are most welcome.
I have the following gz.parquet files as the result of a data pipeline (Twitter => Flume => Kafka => Spark Streaming => Hive/gz.parquet files). For the Flume agent I am using agent1.sources.twitter-data.type = org.apache.flume.source.twitter.TwitterSource
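For reference, a minimal sketch of what such a Flume agent configuration might look like; only the source type line is from my actual setup, while the channel name and the credential placeholders are illustrative:
# sketch of the Flume agent config -- only the source type is from the actual setup
agent1.sources = twitter-data
agent1.channels = mem-channel
agent1.sources.twitter-data.type = org.apache.flume.source.twitter.TwitterSource
agent1.sources.twitter-data.consumerKey = <your consumer key>
agent1.sources.twitter-data.consumerSecret = <your consumer secret>
agent1.sources.twitter-data.accessToken = <your access token>
agent1.sources.twitter-data.accessTokenSecret = <your access token secret>
agent1.sources.twitter-data.channels = mem-channel
agent1.channels.mem-channel.type = memory
# the Kafka sink wiring (agent1.sinks.*) is omitted here; its property names depend on the Flume version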
The Spark code dequeues the data from Kafka and stores it in Hive as follows:
val sparkConf = new SparkConf().setAppName("KafkaTweet2Hive")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) // new org.apache.spark.sql.SQLContext(sc)

// Create a direct Kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

// Get the data (tweets) from Kafka
val tweets = messages.map(_._2)

// Append the tweets to Hive
tweets.foreachRDD { rdd =>
  val hiveContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val tweetsDF = rdd.toDF()
  tweetsDF.write.mode("append").saveAsTable("tweet")
}
When I run the Spark Streaming app, it stores the data as gz.parquet files in HDFS under the /user/hive/warehouse directory as follows:
[root@quickstart /]# hdfs dfs -ls /user/hive/warehouse/tweets
Found 469 items
-rw-r--r-- 1 root supergroup 0 2016-03-30 08:36 /user/hive/warehouse/tweets/_SUCCESS
-rw-r--r-- 1 root supergroup 241 2016-03-30 08:36 /user/hive/warehouse/tweets/_common_metadata
-rw-r--r-- 1 root supergroup 35750 2016-03-30 08:36 /user/hive/warehouse/tweets/_metadata
-rw-r--r-- 1 root supergroup 23518 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-0133fcd1-f529-4dd1-9371-36bf5c3e5df3.gz.parquet
-rw-r--r-- 1 root supergroup 9552 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-02c44f98-bfc3-47e3-a8e7-62486a1a45e7.gz.parquet
-rw-r--r-- 1 root supergroup 19228 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-0321ce99-9d2b-4c52-82ab-a9ed5f7d5036.gz.parquet
-rw-r--r-- 1 root supergroup 241 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-03415df3-c719-4a3a-90c6-462c43cfef54.gz.parquet
The schema from the _metadata file is as follows:
[root@quickstart /]# parquet-tools meta hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/_metadata
creator: parquet-mr version 1.5.0-cdh5.5.0 (build ${buildNumber})
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"tweet","type":"string","nullable":true,"metadata":{}}]}
file schema: root
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tweet: OPTIONAL BINARY O:UTF8 R:0 D:1
Furthermore, if I load the data into a DataFrame in Spark, I get the output of `df.show` as follows:
+--------------------+
| tweet|
+--------------------+
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|ڕObjavro.sch...|
|��Objavro.sc...|
|ֲObjavro.sch...|
|��Objavro.sc...|
|��Objavro.sc...|
|֕Objavro.sch...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
+--------------------+
only showing top 20 rows
However, I would like to see the tweets as plain text. For reference, this is how I read the data into the DataFrame shown above:

sqlContext.read.parquet("/user/hive/warehouse/tweets").show

Related

Spark write.parquet runs on executors but read.parquet runs on driver

I'm running Spark 3.2.0 on Kubernetes. The driver is running in a pod. The executor pods are all configured to attach to the same shared PV. I'm generating data, saving it to the shared PV, and then trying to reload the data. Saving the data seems to work as expected, but loading does not:
(the Spark code here is based on this repo: https://github.com/bigstepinc/SparkBench/)
# cat /tmp/spark.properties
spark.driver.port=7078
spark.master=k8s\://https\://10.10.1.2\:6443
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/var/data
spark.app.name=spark-testing
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
spark.submit.deployMode=cluster
spark.driver.host=spark-driver-svc.default.svc
spark.driver.blockManager.port=7079
spark.app.id=spark-3834e87e5d1241dc8834c53d3f170281
spark.kubernetes.container.image=xxx
spark.kubernetes.memoryOverheadFactor=0.4
spark.kubernetes.submitInDriver=true
spark.kubernetes.driver.pod.name=spark-driver
spark.executor.instances=3
# /opt/spark/bin/spark-shell --properties-file /tmp/spark.properties --deploy-mode client \
    --conf spark.driver.bindAddress=<this pod's address>
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.getDefaultSession.get
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2b720a2c
scala> val chars = 'A' to 'Z'
chars: scala.collection.immutable.NumericRange.Inclusive[Char] = NumericRange A to Z
scala> val randValue = udf( (rowId:Long) => {
| val rnd = new scala.util.Random(rowId)
| (1 to 100).map( i => chars(rnd.nextInt(chars.length))).mkString
| })
randValue: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3296/0x000000084135f040@27cac84b,StringType,List(Some(class[value[0]: bigint])),Some(class[value[0]: string]),None,true,true)
scala> val df=spark.range(1000).toDF("rowId").repartition(3)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [rowId: bigint]
scala> val df2=df.withColumn("value", randValue(df("rowId")))
df2: org.apache.spark.sql.DataFrame = [rowId: bigint, value: string]
scala> df2.write.parquet("/var/data/testing2")
At this point the parquet files are on the PV. From a pod with the PV attached at /var/data:
# find /var/data/testing2/ -name "*.parquet" -exec ls -lh {} \;
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_20220128155613744592892460434050_0003_m_000001/part-00001-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_202201281556136846007947886921010_0003_m_000000/part-00000-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_202201281556132279227184912186575_0003_m_000002/part-00002-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
But now if I try loading the data again I get an error:
scala> val df3 = spark.read.parquet("/var/data/testing2")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
On the driver pod (which is NOT attached to the shared PV), /var/data/testing2 is empty which I think is what causes that error.
My question is: why does write.parquet run on the executors but read.parquet apparently runs on the driver? Won't this be a problem if I have a dataset much larger than the memory available on the driver?
Solved: Attaching the PV to the driver pod fixed the issue. With the PV attached to the driver, the parquet files were written to /var/data/testing2/ instead of /var/data/testing2/_temporary..., and there was a _SUCCESS file in /var/data/testing2. So I suspect that even though the parquet files were being generated, the data-generation step wasn't actually completing as I thought it was.
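For reference, a sketch of the extra properties that mount the same PVC on the driver, in the same format as the spark.properties file above; the volume name data and the claim name spark-pvc are taken from the executor settings, everything else is assumed:
# hypothetical driver-side additions mirroring the executor volume settings
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/var/data
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false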

Spark read a parquet file(56MB) , get a df with two or more partitions

Parquet file (parquet.block.size was not set when this Parquet file was created):
56.1 M 2021-06-17 10:32 /tmp/test/part-00002-ec3a9caa-a70e-4efe-8c5b-3f706f010610.c000.snappy.parquet
Reading the Parquet file through Spark:
1. spark-shell without spark.default.parallelism set
scala> spark.read.parquet("/tmp/test/part-00002-ec3a9caa-a70e-4efe-8c5b-3f706f010610.c000.snappy.parquet")
res1: org.apache.spark.sql.DataFrame = [$id: bigint, $event: string ... 216 more fields]
scala> res1.rdd.getNumPartitions
res2: Int = 2
2. With spark.default.parallelism=25
scala> spark.read.parquet("/tmp/test/part-00002-ec3a9caa-a70e-4efe-8c5b-3f706f010610.c000.snappy.parquet").rdd.getNumPartitions
res0: Int = 15
Q: What is the relation between the number of DataFrame partitions and spark.default.parallelism?
Spark version: 2.4.0.7.1.1.0-565
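For context, a rough sketch of how Spark 2.x file-based sources appear to size their read partitions (the maxSplitBytes calculation); the 128 MB and 4 MB defaults for spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes are assumed, not taken from this job:
// Sketch of the split-size calculation (assumed defaults; not output from the job above)
val fileBytes          = 56L * 1024 * 1024   // ~56 MB file
val maxPartitionBytes  = 128L * 1024 * 1024  // spark.sql.files.maxPartitionBytes default
val openCostInBytes    = 4L * 1024 * 1024    // spark.sql.files.openCostInBytes default
val defaultParallelism = 25                  // spark.default.parallelism in case 2

val bytesPerCore  = fileBytes / defaultParallelism
val maxSplitBytes = Math.min(maxPartitionBytes, Math.max(openCostInBytes, bytesPerCore))
// maxSplitBytes works out to ~4 MB here, and 56 MB / 4 MB gives roughly the 15 partitions observed;
// with a smaller defaultParallelism, bytesPerCore grows, splits get bigger, and fewer partitions result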

How to select Case Class Object as DataFrame in Kafka-Spark Structured Streaming

I have a case class:
case class clickStream(userid:String, adId :String, timestamp:String)
an instance of which I wish to send with the KafkaProducer as:
val record = new ProducerRecord[String, clickStream](
  "clicktream",
  "data",
  clickStream(Random.shuffle(userIdList).head, Random.shuffle(adList).head, new Date().toString).toString
)
producer.send(record)
which sends the record as a string to the topic, as expected:
clickStream(user5,ad2,Sat Jul 18 20:48:53 IST 2020)
However, the problem is at the consumer end:
val clickStreamDF = spark.readStream
.format("kafka")
.options(kafkaMap)
.option("subscribe","clicktream")
.load()
clickStreamDF
.select($"value".as("string"))
.as[clickStream] //trying to leverage DataSet APIs conversion
.writeStream
.outputMode(OutputMode.Append())
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
Apparently using the .as[clickStream] API does not work; the exception is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`userid`' given input columns: [value];
This is what the [value] column contains:
Batch: 2
-------------------------------------------
+----------------------------------------------------+
|value |
+----------------------------------------------------+
|clickStream(user3,ad11,Sat Jul 18 20:59:35 IST 2020)|
+----------------------------------------------------+
I tried using a custom serializer as value.serializer and value.deserializer, but I ran into a different issue, a ClassNotFoundException, in my directory structure.
I have 3 questions:
How does Kafka use the custom deserializer class here to parse the object?
I do not fully understand the concept of Encoders and how they can be used in this case.
What will be the best approach to send/receive custom case class objects with Kafka?
Since you are passing the clickStream object data to Kafka as a string, Spark will read that same string back. In Spark you have to parse it and extract the required fields from clickStream(user3,ad11,Sat Jul 18 20:59:35 IST 2020).
Check below code.
clickStreamDF
  .select(split(regexp_extract($"value","\\(([^)]+)\\)",1),"\\,").as("value"))
  .select($"value"(0).as("userid"),$"value"(1).as("adId"),$"value"(2).as("timestamp"))
  .as[clickStream] // extract all fields from the value string first, then use .as[clickStream]; this line may not even be needed once the data is parsed into the required shape
  .writeStream
  .outputMode(OutputMode.Append())
  .format("console")
  .option("truncate","false")
  .start()
  .awaitTermination()
A sample showing how to parse the clickStream string data:
scala> df.show(false)
+---------------------------------------------------+
|value |
+---------------------------------------------------+
|clickStream(user5,ad2,Sat Jul 18 20:48:53 IST 2020)|
+---------------------------------------------------+
scala> df
.select(split(regexp_extract($"value","\\(([^)]+)\\)",1),"\\,").as("value"))
.select($"value"(0).as("userid"),$"value"(1).as("adId"),$"value"(2).as("timestamp"))
.as[clickStream]
.show(false)
+------+----+----------------------------+
|userid|adId|timestamp |
+------+----+----------------------------+
|user5 |ad2 |Sat Jul 18 20:48:53 IST 2020|
+------+----+----------------------------+
What will be the best approach to send/receive Custom Case Class Objects with Kafka?
Try converting your case class to JSON, Avro, or CSV, then send that message to Kafka and read the same message back with Spark; a JSON-based sketch is shown below.
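A minimal sketch of the JSON variant, assuming a streaming read from the same clicktream topic; the toJson helper, the schema, and the broker placeholder are illustrative additions, not part of the original code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}

case class clickStream(userid: String, adId: String, timestamp: String)

// Producer side: send the case class as JSON text instead of its toString form (illustrative helper)
def toJson(c: clickStream): String =
  s"""{"userid":"${c.userid}","adId":"${c.adId}","timestamp":"${c.timestamp}"}"""
// producer.send(new ProducerRecord[String, String]("clicktream", "data", toJson(someClick)))

// Consumer side: parse the JSON back and map it onto the case class via its Encoder
val spark = SparkSession.builder.appName("clickstream-json-sketch").getOrCreate()
import spark.implicits._

val clickSchema = new StructType()
  .add("userid", StringType)
  .add("adId", StringType)
  .add("timestamp", StringType)

val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<broker:port>") // placeholder
  .option("subscribe", "clicktream")
  .load()
  .select(from_json($"value".cast("string"), clickSchema).as("data"))
  .select("data.*")
  .as[clickStream]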

How to read huge number of .gz S3 files into RDD?

aws s3api list-objects-v2 --bucket cw-milenko-tests | grep 'tick_c'
The output shows:
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-50-22.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-52-59.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-55-08.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-57-30.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-59-59.json.gz",
"Key": "Json_gzips/tick_calculated_4_2020-05-27T09-14-28.json.gz",
"Key": "Json_gzips/tick_calculated_4_2020-05-27T11-35-38.json.gz",
With wc -l
aws s3api list-objects-v2 --bucket cw-milenko-tests | grep 'tick_c' | wc -l
457
I can read one file into a DataFrame:
val path ="tick_calculated_2_2020-05-27T00-01-21.json"
scala> val tick1DF = spark.read.json(path)
tick1DF: org.apache.spark.sql.DataFrame = [aml_barcode_canc: string, aml_barcode_payoff: string ... 70 more fields]
I was surprised to see negative votes.
What I want to know is how to load all 457 files into an RDD. I saw this SO question.
Is it possible at all? What are the limitations?
This is what I have tried so far:
val rdd1 = sc.textFile("s3://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
If I go with s3a:
val rdd1 = sc.textFile("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
rdd1: org.apache.spark.rdd.RDD[String] = s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz MapPartitionsRDD[3] at textFile at <console>:27
That doesn't work either.
Trying to inspect my RDD:
scala> rdd1.take(1)
java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
The FileSystem was not recognized.
My GOAL:
s3://json.gz -> rdd -> parquet
Try this:
/**
 * /Json_gzips
 *  |- spark-test-data1.json.gz
 *     --------------------
 *     {"id":1,"name":"abc1"}
 *     {"id":2,"name":"abc2"}
 *     {"id":3,"name":"abc3"}
 *  |- spark-test-data2.json.gz
 *     --------------------
 *     {"id":1,"name":"abc1"}
 *     {"id":2,"name":"abc2"}
 *     {"id":3,"name":"abc3"}
 */
val path = getClass.getResource("/Json_gzips").getPath
// path to the root directory which contains all the .gz files
spark.read.json(path).show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
You can convert this df to an RDD if required:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# Read all the files under the Json_gzips key in S3
df = spark.read.json("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
df.show()
rdd = df.rdd  # convert it to an RDD
Use s3a instead of s3 (why s3a over s3?).
Also add a dependency on hadoop-aws 2.7.3 and the matching AWS SDK, i.e. the AWS S3 supporting JARs; a sketch of that setup follows.
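A rough sketch of that setup, assuming spark-shell; the package versions and placeholder credentials are illustrative, and the hadoop-aws version must match your Hadoop build:
# spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4
// point the s3a filesystem at your credentials (values are placeholders)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

// read all the gzipped JSON files at once, then write them back out as Parquet (output path is illustrative)
val df = spark.read.json("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
df.write.parquet("s3a://cw-milenko-tests/parquet_out/")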

what is the use of _spark_metadata directory

I am trying to get my head around how streaming works in Spark.
I have a file in a /data/flight-data/csv/ directory. It has the following data:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
I wanted to test what would happen if I read the file as a stream instead of as a batch. I first created a DataFrame using read:
scala> val dataDF = spark.read.option("inferSchema","true").option("header","true").csv("data/flight-data/csv/2015-summary.csv");
dataDF: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
Then I took the schema from it and created a new streaming DataFrame:
scala> val staticSchema = dataDF.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.readStream.schema(staticSchema).option("header","true").csv("data/flight-data/csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
Then I started the stream. The path for the checkpoint and output (I suppose) is the /home/manu/test directory, which is initially empty.
scala> dataStream.writeStream.option("checkpointLocation","home/manu/test").start("/home/manu/test");
res5: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5c7df5f1
The return value of start is a StreamingQuery, which I read is "a handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe."
I notice that the directory now contains a _spark_metadata directory, but there is nothing else.
Question 1 - What is the _spark_metadata directory? I notice it is empty. What is it used for?
Question 2 - I don't see anything else happening. Is it because I am not running any query on the DataFrame dataStream (or shall I say that the query isn't doing anything useful)?
