Hudi DeltaStreamer job via Apache Livy

Please help: how do I pass the --props file and the --source-class option to the Livy POST API? This is the spark-submit command I want to convert:
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
--master yarn \
--deploy-mode cluster \
--conf spark.sql.shuffle.partitions=100 \
--driver-class-path $HADOOP_CONF_DIR \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field tst \
--target-base-path /user/hive/warehouse/stock_ticks_mor \
--target-table test \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous

I have converted the configs you are using into a JSON file that can be passed to the Livy API:
{
    "className": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "proxyUser": "root",
    "driverCores": 1,
    "executorCores": 2,
    "executorMemory": "1G",
    "numExecutors": 4,
    "queue": "default",
    "name": "stock_ticks_mor",
    "file": "hdfs:///tmp/hudi-utilities-bundle_2.12-0.8.0.jar",
    "conf": {
        "spark.sql.shuffle.partitions": "100",
        "spark.jars.packages": "org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.2",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.task.cpus": "1",
        "spark.executor.cores": "1"
    },
    "args": [
        "--props", "/var/demo/config/kafka-source.properties",
        "--table-type", "MERGE_ON_READ",
        "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
        "--source-ordering-field", "tst",
        "--target-base-path", "/user/hive/warehouse/stock_ticks_mor",
        "--target-table", "test",
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
        "--continuous"
    ]
}
You can submit this JSON to the Livy endpoint like this:
curl -H "X-Requested-By: admin" -H "Content-Type: application/json" -X POST -d #config.json http://localhost:8999/batches
For reference: https://community.cloudera.com/t5/Community-Articles/How-to-Submit-Spark-Application-through-Livy-REST-API/ta-p/247502
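Once the batch is accepted, you can poll its state through the same Livy REST API. A quick check, assuming the same port 8999 used above, where <batch_id> is the id returned by the POST:

# Returns the batch state (e.g. starting, running, dead) and the YARN appId
curl -H "X-Requested-By: admin" http://localhost:8999/batches/<batch_id>
# Fetches the driver log lines collected by Livy
curl -H "X-Requested-By: admin" http://localhost:8999/batches/<batch_id>/log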

Related

Passing in VM arguments to Spark application via Argo

I have a Spark application which is triggered from an Argo YAML via a dockerized image.
The Argo workflow YAML is as follows:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-argo-spark
  namespace: argo
spec:
  entrypoint: sparkapp
  templates:
    - name: sparkapp
      container:
        name: main
        command:
        args: [
          "/bin/sh",
          "-c",
          "/opt/spark/bin/spark-submit \
            --master k8s://https://kubernetes.default.svc \
            --deploy-mode cluster \
            --conf spark.kubernetes.container.image=/test-spark:latest \
            --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' \
            --conf spark.app.name=test-spark-job \
            --conf spark.jars.ivy=/tmp/.ivy \
            --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \
            --conf spark.kubernetes.namespace=argo \
            --conf spark.executor.instances=1 \
            --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
            --packages org.postgresql:postgresql:42.1.4 \
            --conf spark.kubernetes.driver.pod.name=custom-app \
            --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
            --class SparkMain \
            local:///opt/app/test-spark.jar"
        ]
        image: /test-spark:latest
        imagePullPolicy: IfNotPresent
        resources: {}
This calls spark-submit, which invokes the jar file containing the application code.
The Java code is:
public class SparkMain {
    public static void main(String[] args) {
        SparkSession spark = SparkHelper.getSparkSession("SparkMain_Application");
        System.out.println("Spark Java appn " + spark.logName());
    }
}
The Spark session is being created as follows:
public static SparkSession getSparkSession(String appName) {
    String sparkMode = System.getProperty("spark_mode");
    if (sparkMode == null) {
        sparkMode = "cluster";
    }
    if (sparkMode.equalsIgnoreCase("cluster")) {
        return createSparkSession(appName);
    } else if (sparkMode.equalsIgnoreCase("local")) {
        return createLocalSparkSession(appName);
    } else {
        throw new RuntimeException("Invalid spark_mode option " + sparkMode);
    }
}
As you can see, there is a system property that needs to be passed in:
String sparkMode = System.getProperty("spark_mode");
Can anyone tell me how to pass these VM arguments from the Argo YAML when calling spark-submit?
Also, how can we pass in multiple properties for a program?
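Not from the thread itself, but a common way to do this (a sketch, assuming the property only needs to be visible to the driver JVM): -D system properties can be appended to spark.driver.extraJavaOptions, which the workflow above already sets for the Ivy options, and to spark.executor.extraJavaOptions if the executors need it too. Multiple properties are simply space-separated inside the same quoted value, for example inside the spark-submit string in args:

--conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp -Dspark_mode=cluster' \
--conf spark.executor.extraJavaOptions='-Dspark_mode=cluster' \

With that in place, System.getProperty("spark_mode") in SparkMain should see the value on the driver.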

dataproc create cluster gcloud equivalent command in python

How do I replicate the following gcloud command in python?
gcloud beta dataproc clusters create spark-nlp-cluster \
--region global \
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--worker-machine-type n1-standard-1 \
--num-workers 2 \
--image-version 1.4-debian10 \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--optional-components=JUPYTER,ANACONDA \
--enable-component-gateway
Here is what I have so far in python:
cluster_data = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "gce_cluster_config": {"zone_uri": zone_uri},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
        "software_config": {"image_version": "1.4-debian10", "optional_components": {"JUPYTER", "ANACONDA"}},
    },
}

cluster = dataproc.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster_data}
)
Not sure how to convert these gcloud commands to python:
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--enable-component-gateway
You can try it like this:
cluster_data = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        # metadata is a map of key/value strings on gce_cluster_config
        "gce_cluster_config": {
            "zone_uri": zone_uri,
            "metadata": {"PIP_PACKAGES": "google-cloud-storage spark-nlp==2.5.3"},
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
        "software_config": {
            "image_version": "1.4-debian10",
            "optional_components": ["JUPYTER", "ANACONDA"],
        },
        # initialization_actions is a list of actions, each with an executable_file
        "initialization_actions": [
            {"executable_file": "gs://dataproc-initialization-actions/python/pip-install.sh"}
        ],
        # equivalent of --enable-component-gateway
        "endpoint_config": {"enable_http_port_access": True},
    },
}
For more options, see the GCP Cluster Configs reference.
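Not part of the original answer, but for completeness, here is a minimal sketch of how the dataproc client used in the snippets above might be constructed, assuming the google-cloud-dataproc Python library (the names project, region and cluster_data come from the question):

from google.cloud import dataproc_v1

# "dataproc" is assumed to be a ClusterControllerClient, matching the
# dataproc.create_cluster(...) call in the question. A regional cluster needs
# the matching regional endpoint; the "global" region uses the default endpoint.
dataproc = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = dataproc.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster_data}
)
cluster = operation.result()  # blocks until the cluster has been created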

Sort does not work on Text query with parse-server

The parse-server documentation is a bit outdated: http://docs.parseplatform.org/rest/guide/
Try this query:
curl -X GET \
-H "X-Parse-Application-Id: ${APPLICATION_ID}" \
-H "X-Parse-REST-API-Key: ${REST_API_KEY}" \
-G \
--data-urlencode 'where={"name":{"$text":{"$search":{"$term":"Milk"}}}}' \
--data-urlencode 'order="$score"' \
--data-urlencode 'key="$score"' \
https://localhost:1337/parse/classes/Groceries
And it will return this error:
{
    "code": 102,
    "error": "Invalid parameter for query: key"
}
Question: how do you sort by "$score" when doing a full-text search with Parse if these query parameters do not work?
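No accepted answer appears in this thread, but one thing worth trying (an assumption based on the Parse REST API, where field selection uses keys rather than key, and order/keys take plain field names rather than quoted JSON strings):

curl -X GET \
-H "X-Parse-Application-Id: ${APPLICATION_ID}" \
-H "X-Parse-REST-API-Key: ${REST_API_KEY}" \
-G \
--data-urlencode 'where={"name":{"$text":{"$search":{"$term":"Milk"}}}}' \
--data-urlencode 'order=$score' \
--data-urlencode 'keys=$score' \
https://localhost:1337/parse/classes/Groceries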

Livy always runs in local mode

I am trying to run a PySpark (or Spark) job via the Livy server with "spark.master=yarn".
What I have done:
1) In spark-defaults.conf:
spark.master yarn
spark.submit.deployMode client
2) In livy.conf:
livy.spark.master = yarn
livy.spark.deployMode = client
3) I send the request via curl with "conf": {"spark.master": "yarn"}
Example:
curl -X POST -H "Content-Type: application/json" localhost:8998/batches --data '{"file": "hdfs:///user/grzegorz/hello-world.py", "name": "MY", "conf": {"spark.master": "yarn"} }'
{"id":3,"state":"running","appId":null,"appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":["stdout: ","\nstderr: "]}
And this is what I always get in the logs:
18/01/02 14:45:07.880 qtp1758624236-28 INFO BatchSession$: Creating batch session 3: [owner: null, request: [proxyUser: None, file: hdfs:///user/grzegorz/hello-world.py, name: MY, conf: spark.master -> yarn]]
18/01/02 14:45:07.883 qtp1758624236-28 INFO SparkProcessBuilder: Running '/usr/local/share/spark/spark-2.0.2/bin/spark-submit' '--name' 'MY' '--conf' 'spark.master=local' 'hdfs:///user/grzegorz/hello-world.py'
I hope somebody has an idea of how to get past this. Thank you in advance.
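The thread has no answer, but one common cause worth ruling out (an assumption, not a confirmed fix): Livy only picks up livy.conf from its configuration directory and needs a restart after the file changes, otherwise livy.spark.master can fall back to a local default. A quick check from the Livy host, with illustrative paths:

echo $LIVY_CONF_DIR                                        # if unset, Livy falls back to $LIVY_HOME/conf
grep -E 'livy\.spark\.(master|deployMode)' $LIVY_HOME/conf/livy.conf
$LIVY_HOME/bin/livy-server stop                            # restart so livy.conf changes take effect
$LIVY_HOME/bin/livy-server start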

How to enable gzip for yii2?

Do I need to add new rules to .htaccess, or add code to index.php of Yii2?
My site is on shared hosting.
I want to compress only .css and .js files. I don't want to compress all responses.
You can make it work by attaching event handlers on yii\web\Response in index.php:
$application = new yii\web\Application($config);
$application->on(yii\web\Application::EVENT_BEFORE_REQUEST, function (yii\base\Event $event) {
    $event->sender->response->on(yii\web\Response::EVENT_BEFORE_SEND, function ($e) {
        ob_start("ob_gzhandler");
    });
    $event->sender->response->on(yii\web\Response::EVENT_AFTER_SEND, function ($e) {
        ob_end_flush();
    });
});
$application->run();
I added the following rules to .htaccess:
<IfModule mod_filter.c>
    AddOutputFilterByType DEFLATE "application/atom+xml" \
                                  "application/javascript" \
                                  "application/json" \
                                  "application/ld+json" \
                                  "application/manifest+json" \
                                  "application/rdf+xml" \
                                  "application/rss+xml" \
                                  "application/schema+json" \
                                  "application/vnd.geo+json" \
                                  "application/vnd.ms-fontobject" \
                                  "application/x-font-ttf" \
                                  "application/x-javascript" \
                                  "application/x-web-app-manifest+json" \
                                  "application/xhtml+xml" \
                                  "application/xml" \
                                  "font/eot" \
                                  "font/opentype" \
                                  "image/bmp" \
                                  "image/svg+xml" \
                                  "image/vnd.microsoft.icon" \
                                  "image/x-icon" \
                                  "text/cache-manifest" \
                                  "text/css" \
                                  "text/html" \
                                  "text/javascript" \
                                  "text/plain" \
                                  "text/vcard" \
                                  "text/vnd.rim.location.xloc" \
                                  "text/vtt" \
                                  "text/x-component" \
                                  "text/x-cross-domain-policy" \
                                  "text/xml"
</IfModule>
Configuration list: https://github.com/h5bp/server-configs
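Since the question asks for compressing only .css and .js responses, a narrower variant of the rules above would limit the filter to stylesheet and script MIME types (a sketch, assuming the shared host allows mod_filter/mod_deflate):

<IfModule mod_filter.c>
    AddOutputFilterByType DEFLATE "text/css" \
                                  "text/javascript" \
                                  "application/javascript" \
                                  "application/x-javascript"
</IfModule>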
