PySpark cannot parse metadata from OpenStack

I'm trying to read a JSON file stored on my OVH object storage (OpenStack).
I set everything up:
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
and also the Hadoop conf:
sc=spark.sparkContext
hadoopConf=sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.swift.impl","org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
hadoopConf.set("fs.swift.service.auth.endpoint.prefix","/AUTH_")
hadoopConf.set("fs.swift.service.abc.http.port","443")
hadoopConf.set("fs.swift.service.abc.auth.url","https://auth.cloud.ovh.net/v2.0/tokens")
hadoopConf.set("fs.swift.service.abc.tenant","MYTENANT")
hadoopConf.set("fs.swift.service.abc.region","MYREG")
hadoopConf.set("fs.swift.service.abc.useApikey","false")
hadoopConf.set("fs.swift.service.abc.username","MYUSER")
hadoopConf.set("fs.swift.service.abc.password","MYPASS")
and then
spark.read.json("swift://mycontainer.abc/yyy.json")
throws the error
org.apache.hadoop.fs.swift.exceptions.SwiftException: Failed to parse Last-Modified: Tue, 21 Apr 2020 20:12:43 GMT
at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.getObjectMetadata(SwiftNativeFileSystemStore.java:237)
at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.getObjectMetadata(SwiftNativeFileSystemStore.java:182)
at ...
Caused by: java.text.ParseException: Unparseable date: "Tue, 21 Apr 2020 20:12:43 GMT"
It seems it is not able to parse the metadata date "Tue, 21 Apr 2020 20:12:43 GMT".
I cannot figure out how to solve this problem.
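One commonly reported cause of this exact stack trace is the JVM's default locale: the stack trace shows a java.text.ParseException, which suggests the Swift connector parses the Last-Modified header with a locale-sensitive date format, so under a non-English locale the English day and month names ("Tue", "Apr") fail to parse. A possible workaround, assuming that is the cause here, is to force an English locale on both the driver and the executors, for example:
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Duser.language=en -Duser.country=US" \
  --conf "spark.executor.extraJavaOptions=-Duser.language=en -Duser.country=US" \
  ...
If the job already runs with an English locale, this is not the culprit and the metadata parsing would need to be debugged in hadoop-openstack itself.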

Related

Want YARN to edit Spark's property

I am running Spark on YARN. In spark/conf/core-site.xml there are properties like:
...
<property>
<name>...</name>
<value>...</value>
</property>
<property>
<name>my.target.keyA</name>
<value>MY_TARGET_VALUE_AAA</value>
</property>
...
I want to change MY_TARGET_VALUE_AAA to MY_TARGET_VALUE_BBB without editing any configs on the client side; instead, YARN itself should edit it, according to rules we have already deployed, after Spark submits the app to YARN.
So I rewrote the YARN NodeManager code in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java:
@Override
public int launchContainer(ContainerStartContext ctx) throws IOException, ConfigurationException {
  ....
  List<String> localDirs = ctx.getLocalDirs();
  List<String> logDirs = ctx.getLogDirs();
  LOG.info("--------------- change start --------------------------------");
  Map<Path, List<String>> localizedResources = ctx.getLocalizedResources();
  Set<Path> paths = localizedResources.keySet();
  for (Path path : paths) {
    String pathName = path.toUri().getPath();
    if (StringUtils.isNotEmpty(pathName) && pathName.contains("__spark_conf__.zip")) {
      File[] configFiles = new File(pathName).listFiles();
      for (File configFile : configFiles) {
        if (configFile.getName().equalsIgnoreCase("__spark_hadoop_conf__.xml")) {
          // backup the xml
          // add MY_TARGET_VALUE_BBB to the xml
        }
      }
    }
  }
  LOG.info("--------------- change end --------------------------------");
  FsPermission dirPerm = new FsPermission(APPDIR_PERM);
  .....
}
Then, on the NodeManager:
[yarn#test-nm]$ ll /data01/yarn/nm-local-dir/usercache/MY_USER/filecache/159/__spark_conf__.zip
total 612
drwx------ 2 yarn yarn 4096 Jan 13 17:15 __hadoop_conf__
-r-x------ 1 yarn yarn 2604 Jan 13 17:14 log4j.properties
-r-x------ 1 yarn yarn 8337 Jan 13 17:14 metrics.properties
-r-x------ 1 yarn yarn 4881 Jan 13 17:14 __spark_conf__.properties
-r-x------ 1 yarn yarn 298985 Jan 13 17:14 __spark_hadoop_conf___bak.xml # original xml
-rw-rw-r-- 1 yarn yarn 297313 Jan 13 17:15 __spark_hadoop_conf__.xml # new xml
But my newly added MY_TARGET_VALUE_BBB doesn't take effect.
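For reference, a minimal sketch of what the two placeholder comments in the loop could do, using Hadoop's Configuration API to back up the localized XML and overwrite one property (shown in Scala for brevity; the same calls work from the Java executor code). This only illustrates the rewrite itself and does not explain why the overridden value is not picked up:
import java.io.{File, FileOutputStream}
import java.nio.file.{Files, StandardCopyOption}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// hypothetical helper: back up __spark_hadoop_conf__.xml, then rewrite a single property in place
def overrideProperty(confFile: File, key: String, newValue: String): Unit = {
  val backup = new File(confFile.getParent, "__spark_hadoop_conf___bak.xml")
  Files.copy(confFile.toPath, backup.toPath, StandardCopyOption.REPLACE_EXISTING)

  val conf = new Configuration(false)          // don't pull in default resources
  conf.addResource(new Path(confFile.toURI))   // load only the localized xml
  conf.set(key, newValue)                      // triggers loading, then overrides the key

  val out = new FileOutputStream(confFile)     // truncate and rewrite the original file
  try conf.writeXml(out) finally out.close()
}

// overrideProperty(configFile, "my.target.keyA", "MY_TARGET_VALUE_BBB")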

Spark write.parquet runs on executors but read.parquet runs on driver

I'm running Spark 3.2.0 on Kubernetes. The driver is running in a pod. The executor pods are all configured to attach to the same shared PV. I'm generating data, saving it to the shared PV, and then trying to reload it. Saving the data seems to work as expected, but loading does not:
(the Spark code here is based on this repo: https://github.com/bigstepinc/SparkBench/)
# cat /tmp/spark.properties
spark.driver.port=7078
spark.master=k8s\://https\://10.10.1.2\:6443
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/var/data
spark.app.name=spark-testing
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
spark.submit.deployMode=cluster
spark.driver.host=spark-driver-svc.default.svc
spark.driver.blockManager.port=7079
spark.app.id=spark-3834e87e5d1241dc8834c53d3f170281
spark.kubernetes.container.image=xxx
spark.kubernetes.memoryOverheadFactor=0.4
spark.kubernetes.submitInDriver=true
spark.kubernetes.driver.pod.name=spark-driver
spark.executor.instances=3
# /opt/spark/bin/spark-shell --properties-file /tmp/spark.properties --deploy-mode client
--conf spark.driver.bindAddress=<this pod's address>
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.getDefaultSession.get
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2b720a2c
scala> val chars = 'A' to 'Z'
chars: scala.collection.immutable.NumericRange.Inclusive[Char] = NumericRange A to Z
scala> val randValue = udf( (rowId:Long) => {
| val rnd = new scala.util.Random(rowId)
| (1 to 100).map( i => chars(rnd.nextInt(chars.length))).mkString
| })
randValue: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3296/0x000000084135f040@27cac84b,StringType,List(Some(class[value[0]: bigint])),Some(class[value[0]: string]),None,true,true)
scala> val df=spark.range(1000).toDF("rowId").repartition(3)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [rowId: bigint]
scala> val df2=df.withColumn("value", randValue(df("rowId")))
df2: org.apache.spark.sql.DataFrame = [rowId: bigint, value: string]
scala> df2.write.parquet("/var/data/testing2")
At this point the parquet files are on the PV. From a pod with the PV attached at /var/data:
# find /var/data/testing2/ -name "*.parquet" -exec ls -lh {} \;
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_20220128155613744592892460434050_0003_m_000001/part-00001-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_202201281556136846007947886921010_0003_m_000000/part-00000-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_202201281556132279227184912186575_0003_m_000002/part-00002-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
But now if I try loading the data again I get an error:
scala> val df3 = spark.read.parquet("/var/data/testing2")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
On the driver pod (which is NOT attached to the shared PV), /var/data/testing2 is empty, which I think is what causes that error.
My question is: why does write.parquet run on the executors but read.parquet apparently runs on the driver? Won't this be a problem if I have a dataset much larger than the memory available on the driver?
Solved: attaching the PV to the driver pod fixed the issue. With the PV attached to the driver, the parquet files were written to /var/data/testing2/ instead of /var/data/testing2/_temporary..., and there was a _SUCCESS file in /var/data/testing2. So I suspect that even though the parquet files were being generated, the write job wasn't actually completing the way I thought it was; presumably the commit step that moves task output out of _temporary runs on the driver, which couldn't see the shared volume.
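For reference, the driver-side counterparts of the executor volume settings above would look like this in the properties file (a sketch assuming the same PVC name and mount path):
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/var/data
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false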

How to select Case Class Object as DataFrame in Kafka-Spark Structured Streaming

I have a case class:
case class clickStream(userid: String, adId: String, timestamp: String)
an instance of which I wish to send with a KafkaProducer as:
val record = new ProducerRecord[String,clickStream](
"clicktream",
"data",
clickStream(Random.shuffle(userIdList).head, Random.shuffle(adList).head, new Date().toString).toString
)
producer.send(record)
which sends the record as a string to the topic, exactly as expected:
clickStream(user5,ad2,Sat Jul 18 20:48:53 IST 2020)
However, the problem is at the consumer end:
val clickStreamDF = spark.readStream
.format("kafka")
.options(kafkaMap)
.option("subscribe","clicktream")
.load()
clickStreamDF
.select($"value".as("string"))
.as[clickStream] //trying to leverage DataSet APIs conversion
.writeStream
.outputMode(OutputMode.Append())
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
Apparently using the .as[clickStream] API does not work; the exception is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`userid`' given input columns: [value];
This is what the [value] column contains:
Batch: 2
-------------------------------------------
+----------------------------------------------------+
|value |
+----------------------------------------------------+
|clickStream(user3,ad11,Sat Jul 18 20:59:35 IST 2020)|
+----------------------------------------------------+
I tried using a custom serializer as value.serializer and value.deserializer,
but I'm facing a different issue: a ClassNotFoundException related to my directory structure.
I have 3 questions:
How does Kafka use a custom deserializer class here to parse the object?
I do not fully understand the concept of Encoders and how they can be used in this case.
What is the best approach to send/receive custom case class objects with Kafka?
Since you are passing the clickStream object data as a string to Kafka, Spark will read that same string. In Spark you have to parse and extract the required fields from clickStream(user3,ad11,Sat Jul 18 20:59:35 IST 2020).
Check the code below.
clickStreamDF
.select(split(regexp_extract($"value","\\(([^)]+)\\)",1),"\\,").as("value"))
.select($"value"(0).as("userid"),$"value"(1).as("adId"),$"value"(2).as("timestamp"))
.as[clickStream] // extract all the fields from the value string first, then use .as[clickStream]; this line is arguably not required since the data is already parsed into the required shape
.writeStream
.outputMode(OutputMode.Append())
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
Sample: how to parse the clickStream string data.
scala> df.show(false)
+---------------------------------------------------+
|value |
+---------------------------------------------------+
|clickStream(user5,ad2,Sat Jul 18 20:48:53 IST 2020)|
+---------------------------------------------------+
scala> df
.select(split(regexp_extract($"value","\\(([^)]+)\\)",1),"\\,").as("value"))
.select($"value"(0).as("userid"),$"value"(1).as("adId"),$"value"(2).as("timestamp"))
.as[clickStream]
.show(false)
+------+----+----------------------------+
|userid|adId|timestamp |
+------+----+----------------------------+
|user5 |ad2 |Sat Jul 18 20:48:53 IST 2020|
+------+----+----------------------------+
What is the best approach to send/receive custom case class objects with Kafka?
Try converting your case class to JSON, Avro, or CSV, then send that message to Kafka and read it back with Spark.
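A minimal consumer-side sketch of the JSON route, assuming the producer writes each click as a plain JSON string (built with whatever JSON library you prefer) instead of calling toString; the schema simply mirrors the clickStream case class:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}
import spark.implicits._

// expected wire format: {"userid":"user5","adId":"ad2","timestamp":"Sat Jul 18 20:48:53 IST 2020"}
val clickSchema = new StructType()
  .add("userid", StringType)
  .add("adId", StringType)
  .add("timestamp", StringType)

val clicks = clickStreamDF
  .select(from_json($"value".cast("string"), clickSchema).as("click"))
  .select("click.*")
  .as[clickStream]   // typed Dataset[clickStream], usable with writeStream as before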

Convert streaming JSON to DataFrame

Question: How can I convert a JSON string to a DataFrame while selecting only the keys I want?
I just started using Spark last week and I'm still learning, so please bear with me.
I'm using Spark (2.4) Structured Streaming. The Spark app gets data (via a socket) from a Twitter stream, and the data sent is the full tweet JSON string. Below is one of the DataFrames; each row is a full JSON tweet.
+--------------------+
| value|
+--------------------+
|{"created_at":"Tu...|
|{"created_at":"Tu...|
|{"created_at":"Tu...|
+--------------------+
As Venkata suggested, I did the following, translated to Python (full code below):
schema = StructType().add('created_at', StringType(), False).add('id_str', StringType(), False)
df = lines.selectExpr('CAST(value AS STRING)').select(from_json('value', schema).alias('temp')).select('temp.*')
This is the return value
+------------------------------+-------------------+
|created_at |id_str |
+------------------------------+-------------------+
|Wed Feb 20 04:51:18 +0000 2019|1098082646511443968|
|Wed Feb 20 04:51:18 +0000 2019|1098082646285082630|
|Wed Feb 20 04:51:18 +0000 2019|1098082646444441600|
|Wed Feb 20 04:51:18 +0000 2019|1098082646557642752|
|Wed Feb 20 04:51:18 +0000 2019|1098082646494797824|
|Wed Feb 20 04:51:19 +0000 2019|1098082646817681408|
+------------------------------+-------------------+
As can be seen, only the two keys that I wanted were included in the DataFrame.
Hope this helps any newbie.
Full code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StringType
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
sc = spark.sparkContext
lines = spark.readStream.format('socket').option('host', '127.0.0.1').option('port', 9999).load()
schema = StructType().add('created_at', StringType(), False).add('id_str', StringType(), False)
df = lines.selectExpr('CAST(value AS STRING)').select(from_json('value', schema).alias('temp')).select('temp.*')
query = df.writeStream.format('console').option('truncate', 'False').start()
# when running as an app, keep the stream alive long enough to print a few batches; not needed in Jupyter
import time
time.sleep(10)
query.stop()
Here's a sample code snippet you can use to convert from json to DataFrame.
val schema = new StructType().add("id", StringType).add("pin", StringType)
val dataFrame = data
  .selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", schema).alias("tmp"))
  .select("tmp.*")

Error observed when trying to create a Hive table from a Spark DataFrame

I created a HiveContext and am then trying to create a table using a view.
Final_Data is a DataFrame.
val sqlCtx= new HiveContext(sc)
Final_Data.createOrReplaceTempView("Final_Prediction")
sqlCtx.sql("create table results as select * from Final_Prediction")
Error log below:
Name: org.apache.spark.sql.AnalysisException
Message: unresolved operator 'CreateHiveTableAsSelectLogicalPlan CatalogTable(
Table: `hve1`
Created: Mon May 01 17:44:38 CDT 2017
Last Access: Wed Dec 31 17:59:59 CST 1969
Type: MANAGED
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false;;
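One commonly suggested approach for this kind of unresolved CreateHiveTableAsSelect error, sketched here without knowing the exact cluster setup, is to use a Hive-enabled SparkSession (Spark 2.x+) instead of a separate HiveContext, and write the DataFrame directly:
import org.apache.spark.sql.SparkSession

// assumes Spark 2.x+ with the Hive libraries available on the cluster
val spark = SparkSession.builder()
  .appName("save-results")
  .enableHiveSupport()
  .getOrCreate()

Final_Data.write.saveAsTable("results")   // managed table via the DataFrame API

// or, keeping the original CTAS style:
Final_Data.createOrReplaceTempView("Final_Prediction")
spark.sql("create table results as select * from Final_Prediction")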
