How to submit PySpark and Python jobs to Livy - apache-spark

I am trying to submit a PySpark job to Livy using the /batches endpoint, but I haven't found good documentation for it. Life has been easy so far because we submit Scala-compiled JAR files to Livy and specify the job entry point with className.
For the JAR file, we use:
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        'oof',
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
I am unsure how to submit a PySpark job in a similar manner, especially since the package also has some relative imports. Any thoughts?
For reference, the folder structure is below:
bar2
    __init__.py
    foo2.py
    bar3
        __init__.py
        foo3.py
I would then want to run:
from foo2 import ClassFoo
class_foo = ClassFoo(arg1, arg2)
class_foo.auto_run()

You can try passing pyFiles
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'pyFiles': ['s3://<bucket>/<folder>/foo2.py', 's3://<bucket>/<folder>/foo3.py'],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        'oof',
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
In the above example:
'pyFiles': ['s3://<bucket>/<folder>/foo2.py', 's3://<bucket>/<folder>/foo3.py']
I have tried saving the files on the master node via bootstrapping, but noticed that Livy would send the request randomly to the slave nodes, where the files might not be present.
You may also pass the files as a .zip, although I haven't tried it.
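If you go the .zip route, something like this could package the bar2 directory before uploading it to S3 (a minimal sketch, assuming the package lives in the current directory; the bucket and key names are placeholders):

import shutil
import boto3

# Zip the whole bar2 package so its internal (relative) imports keep working.
# Creates bar2_pkg.zip containing the bar2/ directory.
shutil.make_archive('bar2_pkg', 'zip', root_dir='.', base_dir='bar2')

# Upload the archive somewhere Livy can reach it (placeholder bucket/key).
s3 = boto3.client('s3')
s3.upload_file('bar2_pkg.zip', 'foo-bucket', 'artifacts/bar2_pkg.zip')

You would then reference s3://foo-bucket/artifacts/bar2_pkg.zip in pyFiles instead of the individual .py files. Note that with bar2/ at the zip root the import becomes from bar2.foo2 import ClassFoo; to keep from foo2 import ClassFoo as in the question, zip the contents of bar2 instead.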

You need to submit with file being the main Python executable and pyFiles being the additional internal libraries that are being used. My advice would be to provision the cluster with a bootstrap action that copies your own libraries over and installs the pip-installable dependencies on the master and worker nodes.
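Putting this together, a minimal sketch of the full /batches request could look like the following. The requests usage, the Livy host, and the S3 paths for main.py and the packaged zip are placeholders/assumptions; bucket_name and spark_master are the variables from the question:

import json
import requests

LIVY_URL = 'http://livy-host:8998/batches'  # placeholder Livy endpoint

data = {
    # Main Python entry point instead of a JAR + className.
    'file': 's3://foo-bucket/main.py',
    # Supporting package(s) containing the internal/relative imports.
    'pyFiles': ['s3://foo-bucket/artifacts/bar2_pkg.zip'],
    'args': [bucket_name, 'https://foo.bar.com', 'oof', spark_master],
    'name': 'foo-oof bar (pyspark)',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {'spark.submit.deployMode': 'cluster'},
}

response = requests.post(
    LIVY_URL,
    data=json.dumps(data),
    headers={'Content-Type': 'application/json'},
)
print(response.status_code, response.json())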

Related

How to connect to Glue catalog from an EMR spark-submit step

We use PySpark in an EMR cluster to run queries against our Glue database. We execute the Python scripts in two ways: through a Zeppelin notebook and through EMR steps. Connecting to the Glue database works great in Zeppelin, but not in EMR steps. When we run a query against the Glue database, we get the following error:
pyspark.sql.utils.AnalysisException: Database '{glue_database_name}' does not exist.
This is the spark configuration used in the executed .py file:
spark = SparkSession \
    .builder \
    .appName("dfp.sln.kunderelation.work") \
    .config("spark.sql.broadcastTimeout", "36000") \
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED") \
    .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED") \
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()
spark.conf.set("spark.sql.sources.ignoreDataLocality.enabled", "true")
The step is submitted using boto3 with the following configuration:
response = emr_client.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': name,
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', '--deploy-mode', 'client', '--master', 'yarn', "{path to .py script}"]
            }
        },
    ]
)
Both types of EC2 users have been given Glue and S3 privileges.
What do we need to set up to connect to the Glue database?
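One thing that may be worth checking (a hedged sketch, not a confirmed fix): the step runs spark-submit without any Glue-specific configuration on the command line, so you could try passing the same Glue metastore factory class as a Spark/Hadoop conf at submit time, before the session is created:

'Args': [
    'spark-submit',
    '--deploy-mode', 'client',
    '--master', 'yarn',
    # Assumption: supply the Glue catalog factory class via spark.hadoop.* at submit time.
    '--conf', 'spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory',
    "{path to .py script}"
]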

Hyperopt spark 3.0 issues

I am running runtime 8.1 (includes Apache Spark 3.1.1, Scala 2.12) and trying to get Hyperopt working as described in
https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/hyperopt-spark-mlflow-integration.html
but I get
py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist
when I try to run
spark_trials = SparkTrials()
Is there anything special I need to do to get this working?
Here is the cluster I am using
{
    "autoscale": {
        "min_workers": 1,
        "max_workers": 2
    },
    "cluster_name": "mlops_tiny_ml",
    "spark_version": "8.2.x-cpu-ml-scala2.12",
    "spark_conf": {},
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "us-west-2b",
        "instance_profile_arn": "arn:aws:iam::112437402463:instance-profile/databricks_instance_role_s3",
        "spot_bid_price_percent": 100,
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 3,
        "ebs_volume_size": 100
    },
    "node_type_id": "m4.large",
    "driver_node_type_id": "m4.large",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {},
    "autotermination_minutes": 120,
    "enable_elastic_disk": false,
    "cluster_source": "UI",
    "init_scripts": [],
    "cluster_id": "0xxxxxt404"
}
This is the code I am using:
https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/hyperopt-model-selection.html
Hyperopt is only included in the DBR ML runtimes, not in the stock runtimes. You can check this by looking at the release notes for each runtime: DBR 8.1 vs. DBR 8.1 ML.
And from the docs:
Databricks Runtime for Machine Learning incorporates MLflow and Hyperopt, two open source tools that automate the process of model selection and hyperparameter tuning.
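For reference, once on an ML runtime, basic SparkTrials usage with Hyperopt looks roughly like this (a minimal sketch with a toy objective, not the model-selection example from the linked notebook):

from hyperopt import SparkTrials, fmin, hp, tpe

# Toy objective: minimize (x - 3)^2; stands in for a real train/evaluate function.
def objective(x):
    return (x - 3) ** 2

spark_trials = SparkTrials(parallelism=2)  # distributes trials across the cluster

best = fmin(
    fn=objective,
    space=hp.uniform('x', -10, 10),
    algo=tpe.suggest,
    max_evals=20,
    trials=spark_trials,
)
print(best)  # e.g. {'x': ~3.0}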

ESlint override rule by nested directory

I want to disable a rule for all files inside a nested directory. I have found examples only for an exact path or by file extension, but that is not what I want.
We use this for a shared config and don't know where the directory will be; we have many of them.
I'm trying a config like this:
{
    overrides: [
        {
            files: [
                '**/test/**/*',
            ],
            rules: {
                "import/no-extraneous-dependencies": "off"
            }
        },
    ],
}
But globs like **/test/**/* and many others did not work.
Can someone help me reach this goal?
The above code should work.
How were you testing this? If it's an editor extension like VSCode, you may need to refresh things to see the latest definitions loaded.
If you are using an ESLint service like esprint, you will also need to restart it to grab the latest definitions.
Caching
Make sure that ESLint is not configured to cache results, to avoid having to cache-bust when debugging things. See the ESLint docs.
Here's an example for a react-native app with multiple overrides
module.exports = {
    ...baseConfig,
    overrides: [
        typescriptOverrides,
        e2eOverrides,
        themeOverrides,
        {
            files: ['**/*.style.js'],
            rules: {
                'sort-keys': [
                    'error',
                    'asc',
                    {
                        caseSensitive: true,
                        natural: true,
                    },
                ],
            },
        },
        {
            files: ['**/*.test.js'],
            rules: {
                'arrow-body-style': 'off',
            },
        },
    ],
};
Debugging the glob matcher
Run ESLint in debug mode to see all the files being run, for example: DEBUG=eslint:cli-engine npx eslint src/**/*.test.js
You can test your glob patterns by running an ls command. For example, ls ./src/**/*.test.js will either return all the matching files or 'no matches found'.

Write DataFrame to parquet on HDFS partitioned by multiple columns with dynamic partitionOverwriteMode

I have a dataframe which I want to save in parquet format to HDFS. I'd like to partition it by multiple columns.
When I write data to HDFS, only the directory itself and a _SUCCESS file inside it are created, but no data. I use partitionOverwriteMode=dynamic and overwrite as the save mode. At the time I execute the code, the path does not exist. If I change the save mode to append, it works fine.
I also tried writing to the local file system; in that case, both modes work correctly.
If only one partition column is specified, it works fine too.
Any ideas on how I can make overwrite work with multi-column partitioning? Any tips appreciated. Thanks!
Code sample:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

data = [
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 15},
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 14},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 12},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 13},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 4},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 2},
    {'country': 'US', 'fk_imported_at': '20191213', 'user_id': 1},
]

if __name__ == '__main__':
    conf = SparkConf()
    conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
    spark = (
        SparkSession
        .builder
        .config(conf=conf)
        .appName('test partitioning')
        .enableHiveSupport()
        .getOrCreate()
    )
    df = spark.createDataFrame(data)
    df.show()
    df.repartition(1).write.parquet('/tmp/spark_save_mode', 'overwrite', ['fk_imported_at', 'country'])
    spark.stop()
I'm submitting the application in client mode. Spark version is 2.3.0, Hadoop version is 2.6.0.
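For clarity, the positional write call above is equivalent to the explicit chained form below (just a restatement of the same write, not a confirmed fix for the overwrite behaviour):

(
    df.repartition(1)
      .write
      .mode('overwrite')
      .partitionBy('fk_imported_at', 'country')
      .parquet('/tmp/spark_save_mode')
)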

How to set PYTHONHASHSEED on AWS EMR

Is there any way to set an environment variable on all nodes of an EMR cluster?
I am getting an error regarding the hash seed when trying to use reduceByKey() in Python 3 PySpark. I can see this is a known issue, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it.
I have tried adding a variable to spark-env through the cluster configuration:
[
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "/usr/bin/python3",
                    "PYTHONHASHSEED": "123"
                }
            }
        ]
    },
    {
        "Classification": "spark",
        "Properties": {
            "maximizeResourceAllocation": "true"
        }
    }
]
but this doesn't work. I have also tried adding a bootstrap script:
#!/bin/bash
export PYTHONHASHSEED=123
but this also doesn't seem to do the trick.
I believe that /usr/bin/python3 isn't picking up the environment variable PYTHONHASHSEED that you are defining in the cluster configuration under the spark-env scope.
You ought to use python34 instead of /usr/bin/python3 and set the configuration as follows:
[
    {
        "classification": "spark-defaults",
        "properties": {
            // [...]
        }
    },
    {
        "configurations": [
            {
                "classification": "export",
                "properties": {
                    "PYSPARK_PYTHON": "python34",
                    "PYTHONHASHSEED": "123"
                }
            }
        ],
        "classification": "spark-env",
        "properties": {
            // [...]
        }
    }
]
Now, let's test it. I define a bash script that calls both Pythons:
#!/bin/bash
echo "using python34"
for i in `seq 1 10`;
do
python -c "print(hash('foo'))";
done
echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`;
do
/usr/bin/python3 -c "print(hash('foo'))";
done
The verdict:
[hadoop#ip-10-0-2-182 ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314
PS1: I am using AMI release emr-4.8.2.
PS2: Snippet inspired by this answer.
EDIT: I have tested the following using pyspark.
16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.2
/_/
Using Python version 3.4.3 (default, Sep 1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
Also created a simple application (simple_app.py):
from pyspark import SparkContext
sc = SparkContext(appName = "simple-app")
numbers = [hash('foo') for i in range(10)]
print(numbers)
Which also seems to work perfectly:
[hadoop#ip-*** ~]$ spark-submit --master yarn simple_app.py
Output (truncated):
[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]
As you can see it also works returning the same hash each time.
EDIT 2: From the comments, it seems like you are trying to compute hashes on the executors and not on the driver, so you'll need to set spark.executorEnv.PYTHONHASHSEED inside your Spark application configuration so it can be propagated to the executors (that's one way to do it).
Note: to set environment variables for the executors, including with YARN in client mode, use spark.executorEnv.[EnvironmentVariableName].
Thus the following minimalist example with simple_app.py:
from pyspark import SparkContext, SparkConf
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED","123")
sc = SparkContext(appName="simple-app", conf=conf)
numbers = sc.parallelize(['foo']*10).map(lambda x: hash(x)).collect()
print(numbers)
And now let's test it again. Here is the truncated output:
16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
I think that covers it all.
From the Spark docs:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.
Properties are listed here so I think you want this:
Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN.
spark.yarn.appMasterEnv.PYTHONHASHSEED="XXXX"
EMR docs for configuring spark-defaults.conf are here.
[
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.yarn.appMasterEnv.PYTHONHASHSEED": "XXX"
        }
    }
]
Just encountered the same problem, adding the following configuration solved it:
# Some settings...
Configurations=[
    {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python34"
                },
                "Configurations": []
            }
        ]
    },
    {
        "Classification": "hadoop-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYTHONHASHSEED": "0"
                },
                "Configurations": []
            }
        ]
    }
],
# Some more settings...
Be careful: we do not use YARN as a cluster manager; for the moment the cluster is only running Hadoop and Spark.
EDIT: Following Tim B's comment, this also seems to work with YARN installed as a cluster manager.
You could probably do it via the bootstrap script, but you'll need to do something like this:
echo "PYTHONHASHSEED=XXXX" >> /home/hadoop/.bashrc
(or possibly .profile) so that it's picked up by the Spark processes when they are launched.
Your configuration looks reasonable, though; it might be worth setting it in the hadoop-env section instead.
