I am running Spark on YARN. In spark/conf/core-site.xml there are properties like
...
<property>
<name>...</name>
<value>...</value>
</property>
<property>
<name>my.target.keyA</name>
<value>MY_TARGET_VALUE_AAA</value>
</property>
...
I want to change MY_TARGET_VALUE_AAA to MY_TARGET_VALUE_BBB without editing any configs on the client side; instead, YARN itself should edit it, based on rules we have already deployed, after Spark submits the app to YARN.
So I rewrote the YARN NodeManager code in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java:
@Override
public int launchContainer(ContainerStartContext ctx) throws IOException, ConfigurationException {
  ....
  List<String> localDirs = ctx.getLocalDirs();
  List<String> logDirs = ctx.getLogDirs();
  LOG.info("--------------- change start --------------------------------");
  // Look for the localized __spark_conf__.zip directory and rewrite the Hadoop conf inside it
  Map<Path, List<String>> localizedResources = ctx.getLocalizedResources();
  Set<Path> paths = localizedResources.keySet();
  for (Path path : paths) {
    String pathName = path.toUri().getPath();
    if (StringUtils.isNotEmpty(pathName) && pathName.contains("__spark_conf__.zip")) {
      File[] configFiles = new File(pathName).listFiles();
      if (configFiles == null) {
        continue;
      }
      for (File configFile : configFiles) {
        if (configFile.getName().equalsIgnoreCase("__spark_hadoop_conf__.xml")) {
          // backup the xml
          // add MY_TARGET_VALUE_BBB to the xml
        }
      }
    }
  }
  LOG.info("--------------- change end --------------------------------");
  FsPermission dirPerm = new FsPermission(APPDIR_PERM);
  .....
}
Then, on the NodeManager:
[yarn@test-nm]$ ll /data01/yarn/nm-local-dir/usercache/MY_USER/filecache/159/__spark_conf__.zip
total 612
drwx------ 2 yarn yarn 4096 Jan 13 17:15 __hadoop_conf__
-r-x------ 1 yarn yarn 2604 Jan 13 17:14 log4j.properties
-r-x------ 1 yarn yarn 8337 Jan 13 17:14 metrics.properties
-r-x------ 1 yarn yarn 4881 Jan 13 17:14 __spark_conf__.properties
-r-x------ 1 yarn yarn 298985 Jan 13 17:14 __spark_hadoop_conf___bak.xml # original xml
-rw-rw-r-- 1 yarn yarn 297313 Jan 13 17:15 __spark_hadoop_conf__.xml # new xml
But my newly added MY_TARGET_VALUE_BBB doesn't work.
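For reference, one way to see which value the application actually ends up with is to log it from the driver once the SparkContext is up (a minimal Scala sketch; in cluster mode the driver runs inside a YARN container, so this reflects the configuration built there rather than on the client):
// Somewhere in the application's driver code, after the SparkContext is created.
// Whether this prints MY_TARGET_VALUE_AAA or MY_TARGET_VALUE_BBB shows whether the rewrite is being picked up.
println("my.target.keyA = " + sc.hadoopConfiguration.get("my.target.keyA"))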
Related
I'm running Spark 3.2.0 on Kubernetes. The driver is running in a pod. The executor pods are all configured to attach to the same shared PV. I'm generating data, saving it to the shared PV, and then trying to reload the data. Saving the data seems to work as expected, but loading does not:
(the Spark code here is based on this repo: https://github.com/bigstepinc/SparkBench/)
# cat /tmp/spark.properties
spark.driver.port=7078
spark.master=k8s\://https\://10.10.1.2\:6443
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/var/data
spark.app.name=spark-testing
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
spark.submit.deployMode=cluster
spark.driver.host=spark-driver-svc.default.svc
spark.driver.blockManager.port=7079
spark.app.id=spark-3834e87e5d1241dc8834c53d3f170281
spark.kubernetes.container.image=xxx
spark.kubernetes.memoryOverheadFactor=0.4
spark.kubernetes.submitInDriver=true
spark.kubernetes.driver.pod.name=spark-driver
spark.executor.instances=3
# /opt/spark/bin/spark-shell --properties-file /tmp/spark.properties --deploy-mode client \
--conf spark.driver.bindAddress=<this pod's address>
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.getDefaultSession.get
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2b720a2c
scala> val chars = 'A' to 'Z'
chars: scala.collection.immutable.NumericRange.Inclusive[Char] = NumericRange A to Z
scala> val randValue = udf( (rowId:Long) => {
| val rnd = new scala.util.Random(rowId)
| (1 to 100).map( i => chars(rnd.nextInt(chars.length))).mkString
| })
randValue: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3296/0x000000084135f040@27cac84b,StringType,List(Some(class[value[0]: bigint])),Some(class[value[0]: string]),None,true,true)
scala> val df=spark.range(1000).toDF("rowId").repartition(3)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [rowId: bigint]
scala> val df2=df.withColumn("value", randValue(df("rowId")))
df2: org.apache.spark.sql.DataFrame = [rowId: bigint, value: string]
scala> df2.write.parquet("/var/data/testing2")
At this point the parquet files are on the PV. From a pod with the PV attached at /var/data:
# find /var/data/testing2/ -name "*.parquet" -exec ls -lh {} \;
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_20220128155613744592892460434050_0003_m_000001/part-00001-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_202201281556136846007947886921010_0003_m_000000/part-00000-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
-rw-r--r-- 1 185 root 37K Jan 28 15:56 /var/data/testing2/_temporary/0/task_202201281556132279227184912186575_0003_m_000002/part-00002-6560f650-2a21-4e2a-a13c-c293ed63244f-c000.snappy.parquet
But now if I try loading the data again I get an error:
scala> val df3 = spark.read.parquet("/var/data/testing2")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
On the driver pod (which is NOT attached to the shared PV), /var/data/testing2 is empty which I think is what causes that error.
My question is: why does write.parquet run on the executors but read.parquet apparently runs on the driver? Won't this be a problem if I have a dataset much larger than the memory available on the driver?
Solved: Attaching the PV to the driver pod fixed the issue. With the PV attached to the driver, the parquet files were written to /var/data/testing2/ instead of /var/data/testing2/_temporary..., and there was a file _SUCCESS in /var/data/testing2. So I suspect that even though the parquet files were being generated, the data generation step wasn't actually completing like I thought it was.
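For completeness, the driver-side equivalents of the executor volume settings in the properties file above would be something like the following (a sketch, assuming the same volume name data and claim spark-pvc):
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/var/data
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.readOnly=false
With these in place the driver pod mounts the same /var/data path as the executors (assuming the PVC allows multiple readers/writers).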
I'm trying to read a JSON file stored on my OVH Object Storage (OpenStack).
I set everything up:
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
and also the Hadoop conf:
sc=spark.sparkContext
hadoopConf=sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.swift.impl","org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
hadoopConf.set("fs.swift.service.auth.endpoint.prefix","/AUTH_")
hadoopConf.set("fs.swift.service.abc.http.port","443")
hadoopConf.set("fs.swift.service.abc.auth.url","https://auth.cloud.ovh.net/v2.0/tokens")
hadoopConf.set("fs.swift.service.abc.tenant","MYTENANT")
hadoopConf.set("fs.swift.service.abc.region","MYREG")
hadoopConf.set("fs.swift.service.abc.useApikey","false")
hadoopConf.set("fs.swift.service.abc.username","MYUSER")
hadoopConf.set("fs.swift.service.abc.password","MYPASS")
and then
spark.read.json("swift://mycontainer.abc/yyy.json")
throws the error
org.apache.hadoop.fs.swift.exceptions.SwiftException: Failed to parse Last-Modified: Tue, 21 Apr 2020 20:12:43 GMT
at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.getObjectMetadata(SwiftNativeFileSystemStore.java:237)
at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.getObjectMetadata(SwiftNativeFileSystemStore.java:182)
at ...
Caused by: java.text.ParseException: Unparseable date: "Tue, 21 Apr 2020 20:12:43 GMT"
It seems it is not able to parse the metadata date "Tue, 21 Apr 2020 20:12:43 GMT".
I cannot figure out how to solve this problem.
I want to run spark-shell in YARN mode with a certain number of cores.
The command I use is as follows:
spark-shell --num-executors 25 --executor-cores 4 --executor-memory 1G \
--driver-memory 1G --conf spark.yarn.executor.memoryOverhead=2048 --master yarn \
--conf spark.driver.maxResultSize=10G \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
-i input.scala
input.scala looks something like this:
import java.io.ByteArrayInputStream
// Plaintext sum on 10M rows
def aggrMapPlain(iter: Iterator[Long]): Iterator[Long] = {
  var res = 0L
  while (iter.hasNext) {
    val cur = iter.next
    res = res + cur
  }
  List[Long](res).iterator
}
val pathin_plain = <some file>
val rdd0 = sc.sequenceFile[Int, Long](pathin_plain)
val plain_table = rdd0.map(x => x._2).cache
plain_table.count
0 to 200 foreach { i =>
  println("Plain - 10M rows - Run " + i + ":")
  plain_table.mapPartitions(aggrMapPlain).reduce((x, y) => x + y)
}
On executing this, the Spark UI first spikes to about 40 cores and then settles at 26 cores.
On the recommendation of this, I changed the following in my yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>101</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>101</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>102400</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>102400</value>
</property>
But I still cannot force Spark to use 100 cores (25 executors × 4 cores each), which I need as I am benchmarking against earlier tests.
I am using Apache Spark 1.6.1.
Each node in the cluster, including the driver, has 16 cores and 112 GB of memory.
They are on Azure (HDInsight cluster): 2 driver nodes + 7 worker nodes.
I'm unfamiliar with Azure, but I guess YARN is YARN, so you should make sure that you have
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
in capacity-scheduler.xml.
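In XML form, using the same <property> layout as the yarn-site.xml snippet above, that setting in capacity-scheduler.xml would look something like this:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>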
(See this similar question and answer)
My Spark is installed on CDH 5.8.0 and runs its applications on YARN. There are 5 servers in the cluster. One server is the resource manager; the other four servers are node managers. Each server has 2 cores and 8 GB of memory.
The Spark application's main logic is not complex: query a table from a Postgres DB, do some business logic for each record, and finally save the results back to the DB. Here is the main code:
String columnName = "id";
long lowerBound = 1;
long upperBound = 100000;
int numPartitions = 20;
String tableBasic = "select * from table1 order by id";
DataFrame dfBasic = sqlContext.read().jdbc(JDBC_URL, tableBasic, columnName, lowerBound, upperBound, numPartitions, dbProperties);
JavaRDD<Result> rddResult = dfBasic.javaRDD().flatMap(new FlatMapFunction<Row, Result>() {
    public Iterable<Result> call(Row row) {
        List<Result> list = new ArrayList<Result>();
        ........
        return list;
    }
});
DataFrame saveDF = sqlContext.createDataFrame(rddResult, Result.class);
saveDF = saveDF.select("id", "column 1", "column 2");
saveDF.write().mode(SaveMode.Append).jdbc(SQL_CONNECTION_URL, "table2", dbProperties);
I use this command to submit the application to YARN:
spark-submit --master yarn-cluster --executor-memory 6G --executor-cores 2 --driver-memory 6G --conf spark.default.parallelism=90 --conf spark.storage.memoryFraction=0.4 --conf spark.shuffle.memoryFraction=0.4 --conf spark.executor.memory=3G --class com.Main1 jar1-0.0.1.jar
There are 7 executors and 20 partitions. When the number of table records is small, for example less than 200,000, the 20 active tasks are assigned evenly to the 7 executors, like this:
(screenshot: tasks assigned evenly across executors)
But when the number of table records is huge, for example 1,000,000, the tasks are not assigned evenly. There is always one executor that runs for a long time while the others finish quickly, and some executors are not assigned any tasks at all. Like this:
(screenshot: one executor with long-running tasks while the others sit mostly idle)
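As a quick way to see how the rows actually end up spread across the 20 JDBC partitions, something like the following can be run against the same query (a minimal Scala sketch from spark-shell, even though the application itself is Java; JDBC_URL, tableBasic, columnName, lowerBound, upperBound, numPartitions and dbProperties stand for the same values used in the code above):
// Build the same partitioned JDBC DataFrame, then count the rows in each
// partition to see whether lowerBound/upperBound/numPartitions spread the data evenly.
val dfBasic = sqlContext.read.jdbc(JDBC_URL, tableBasic, columnName,
  lowerBound, upperBound, numPartitions, dbProperties)
dfBasic.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx: $n rows") }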
Hello, I need to read the data from gz.parquet files but don't know how. I tried with Impala, but I get the same result as with parquet-tools cat: no table structure.
P.S.: any suggestions to improve the Spark code are most welcome.
I have the following gz.parquet files as the result of a data pipeline (twitter => flume => kafka => spark streaming => hive/gz.parquet files). For the Flume agent I am using agent1.sources.twitter-data.type = org.apache.flume.source.twitter.TwitterSource
The Spark code dequeues the data from Kafka and stores it in Hive as follows:
val sparkConf = new SparkConf().setAppName("KafkaTweet2Hive")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)//new org.apache.spark.sql.SQLContext(sc)
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Get the data (tweets) from kafka
val tweets = messages.map(_._2)
// adding the tweets to Hive
tweets.foreachRDD { rdd =>
  val hiveContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val tweetsDF = rdd.toDF()
  tweetsDF.write.mode("append").saveAsTable("tweet")
}
When I run the Spark Streaming app, it stores the data as gz.parquet files in the HDFS /user/hive/warehouse directory as follows:
[root@quickstart /]# hdfs dfs -ls /user/hive/warehouse/tweets
Found 469 items
-rw-r--r-- 1 root supergroup 0 2016-03-30 08:36 /user/hive/warehouse/tweets/_SUCCESS
-rw-r--r-- 1 root supergroup 241 2016-03-30 08:36 /user/hive/warehouse/tweets/_common_metadata
-rw-r--r-- 1 root supergroup 35750 2016-03-30 08:36 /user/hive/warehouse/tweets/_metadata
-rw-r--r-- 1 root supergroup 23518 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-0133fcd1-f529-4dd1-9371-36bf5c3e5df3.gz.parquet
-rw-r--r-- 1 root supergroup 9552 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-02c44f98-bfc3-47e3-a8e7-62486a1a45e7.gz.parquet
-rw-r--r-- 1 root supergroup 19228 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-0321ce99-9d2b-4c52-82ab-a9ed5f7d5036.gz.parquet
-rw-r--r-- 1 root supergroup 241 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-03415df3-c719-4a3a-90c6-462c43cfef54.gz.parquet
The schema from _metadata file is as follows:
[root@quickstart /]# parquet-tools meta hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/_metadata
creator: parquet-mr version 1.5.0-cdh5.5.0 (build ${buildNumber})
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"tweet","type":"string","nullable":true,"metadata":{}}]}
file schema: root
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tweet: OPTIONAL BINARY O:UTF8 R:0 D:1
Furthermore, if I load the data into a DataFrame in Spark, I get the output of `df.show` as follows:
+--------------------+
| tweet|
+--------------------+
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|ڕObjavro.sch...|
|��Objavro.sc...|
|ֲObjavro.sch...|
|��Objavro.sc...|
|��Objavro.sc...|
|֕Objavro.sch...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
|��Objavro.sc...|
+--------------------+
only showing top 20 rows
However, I would like to see the tweets as plain text. The read I am using is:
sqlContext.read.parquet("/user/hive/warehouse/tweets").show
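For reference, a quick way to confirm what the tweet column actually holds is to print the schema and the first characters of one row without truncation (a minimal spark-shell sketch using the same warehouse path; tweetsDF is just a local name):
// Load the parquet files and inspect what is stored in the single "tweet" column.
val tweetsDF = sqlContext.read.parquet("/user/hive/warehouse/tweets")
tweetsDF.printSchema()
// Print the first 80 characters of the first record, to see whether it is
// plain text or serialized (e.g. Avro container) bytes.
tweetsDF.select("tweet").take(1).foreach(row => println(row.getString(0).take(80)))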