Airflow EmrAddStepsOperator unable to execute Spark shaded jar - apache-spark

What should the step type be for a Spark app? I am facing an issue where the Spark master is not set or YARN is not recognized; it seems EMR is treating the application as a simple JAR rather than a spark-submit job when using EmrAddStepsOperator. Please find attached the Airflow DAG, the error, and the EMR screenshot.
[Screenshot: Amazon EMR console, manually adding a Spark job as a step]
After adding a step of type "Spark application" rather than "Custom JAR", the console gives the option to supply spark-submit arguments and main-method arguments separately. The step type can be Streaming, Spark application, or Custom JAR.
My Error Message:
> Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:385)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2574)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:934)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:928)
at com.mcf.ExtractcustomerCategoryWiseSummarizedViews$.main(ExtractcustomerCategoryWiseSummarizedViews.scala:13)
at com.mcf.ExtractcustomerCategoryWiseSummarizedViews.main(ExtractcustomerCategoryWiseSummarizedViews.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
This is an example DAG for an AWS EMR pipeline. It starts by creating a cluster, adds steps/operations, checks the steps, and finally terminates the cluster when finished.
import time
from airflow.operators.python import PythonOperator
from datetime import timedelta
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.utils.dates import days_ago
SPARK_STEPS = [
    {
        'Name': 'PerformETL',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://mydata/jars/open-0.0.1-SNAPSHOT.jar',
            # 'MainClass': 'com.sadim.main',
            'Args': ['spark-submit',
                     '--deploy-mode',
                     'cluster',
                     '--master',
                     'yarn',
                     '--class',
                     'com.sadim.ExtractcustomerCategoryWiseSummarizedViews',
                     '--mode',
                     'DeltaLoadByDays',
                     '--noOfDaysBehindTodayForDeltaLoad',
                     '1',
                     '--s3InputPath',
                     's3://data-lake/documents/accountscore/categoriseddata/',
                     '--s3OutputPathcustomerCategoryWiseSummarizedViews',
                     's3://test-data/spark_output//customerCategoryWiseSummarizedViews//test'],
        },
    }
]
SPARK_STEPS2 = [
    {
        'Name': 'sadim_test3',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://test-data/jars/scalatestnadeem-0.0.1-SNAPSHOT_v2.jar',
            'MainClass': 'com.sadim.scalatestnadeem.Test',
            'Args': ['spark-submit',
                     '--deploy-mode',
                     'client',
                     '--master',
                     'yarn',
                     '--conf',
                     'spark.yarn.submit.waitAppCompletion=true'],
        },
    }
]
SPARK_STEPS3 = [
    {
        'Name': 'sadim_test3',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://mydata/jars/open-0.0.1-SNAPSHOT_masteryarnwithoutdependencyandtest.jar',
            'MainClass': 'com.sadim.TestSadim',
            'Args': ['spark-submit',
                     '--deploy-mode',
                     'client',
                     '--master',
                     'yarn',
                     '--conf',
                     'spark.yarn.submit.waitAppCompletion=true'],
        },
    }
]
SPARK_STEPS4 = [
    {
        'Name': 'PerformETL',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://mydata/jars/open-0.0.1-SNAPSHOT.jar',
            # 'MainClass': 'com.sadim.ExtractcustomerCategoryWiseSummarizedViews',
            'Args': ['com.sadim.ExtractcustomerCategoryWiseSummarizedViews',
                     'spark-submit',
                     '--deploy-mode',
                     'client',
                     '--master',
                     'yarn',
                     '--mode',
                     'DeltaLoadByDays',
                     '--noOfDaysBehindTodayForDeltaLoad',
                     '1',
                     '--s3InputPath',
                     's3://data-lake/documents/accountscore/categoriseddata/',
                     '--s3OutputPathcustomerCategoryWiseSummarizedViews',
                     's3://test-data/spark_output//customerCategoryWiseSummarizedViews//test'],
        },
    }
]
SPARK_STEPS5 = [
    {
        'Name': 'PerformETL',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://mydata/jars/open-0.0.1-SNAPSHOT.jar',
            # 'MainClass': 'com.sadim.main',
            'Args': ['com.sadim.ExtractcustomerCategoryWiseSummarizedViews',
                     '--mode',
                     'DeltaLoadByDays',
                     '--noOfDaysBehindTodayForDeltaLoad',
                     '1',
                     '--s3InputPath',
                     's3://data-lake/documents/accountscore/categoriseddata/',
                     '--s3OutputPathcustomerCategoryWiseSummarizedViews',
                     's3://test-data/spark_output//customerCategoryWiseSummarizedViews//test'],
        },
    }
]
JOB_FLOW_OVERRIDES = {
    'Name': 'ob_emr_airflow_automation',
    'ReleaseLabel': 'emr-6.6.0',
    'LogUri': 's3://test-data/emr_logs/',
    'Instances': {
        'InstanceGroups': [
            {
                'Name': 'Master node',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 1
            },
            {
                'Name': 'Slave nodes',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 1
            }
        ],
        'Ec2SubnetId': 'subnet-03129248888a14196',
        'Ec2KeyName': 'datalake-emr-nodes',
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False
    },
    'BootstrapActions': [
        {
            'Name': 'Java11InstallBootstrap',
            'ScriptBootstrapAction': {
                'Path': 's3://test-data/jars/bootstrap.sh',
                'Args': []
            }
        }
    ],
    'Configurations': [
        {
            'Classification': 'spark-defaults',
            'Properties': {
                # no space is allowed after the '-' in '-XX:' JVM flags
                'spark.driver.defaultJavaOptions': "-XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70",
                'spark.executor.defaultJavaOptions': "-verbose:gc -Xlog:gc*::time -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70 -XX:+IgnoreUnrecognizedVMOptions"
            }
        }
    ],
    'JobFlowRole': 'DL_EMR_EC2_DefaultRole',
    'ServiceRole': 'EMR_DefaultRole',
}
with DAG(
    dag_id='emr_job_flow_manual_steps_dag_v6',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'email': ['sadim.nadeem@sadim.com'],
        'email_on_failure': False,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=1),
    start_date=days_ago(1),
    schedule_interval='0 3 * * *',
    tags=['example'],
) as dag:
    # [START howto_operator_emr_manual_steps_tasks]
    cluster_creator = EmrCreateJobFlowOperator(
        task_id='create_job_flow',
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id='aws_default',
        emr_conn_id='emr_default',
    )
    delay_python_task: PythonOperator = PythonOperator(
        task_id="delay_python_task",
        dag=dag,
        python_callable=lambda: time.sleep(400))
    step_adder = EmrAddStepsOperator(
        task_id='add_steps',
        job_flow_id=cluster_creator.output,
        aws_conn_id='aws_default',
        steps=SPARK_STEPS5,
    )
    step_checker = EmrStepSensor(
        task_id='watch_step',
        job_flow_id=cluster_creator.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )
    cluster_remover = EmrTerminateJobFlowOperator(
        task_id='remove_cluster',
        job_flow_id=cluster_creator.output,
        aws_conn_id='aws_default',
    )
    cluster_creator >> step_adder >> step_checker >> cluster_remover
    # [END howto_operator_emr_manual_steps_tasks]
    # Task dependencies created via `XComArgs`:
    # cluster_creator >> step_checker
    # cluster_creator >> cluster_remover

The issue got fixed. We need to use command-runner.jar as the Jar option in EmrAddStepsOperator and pass the specific ETL job JAR inside Args, as follows:
SPARK_STEPS = [
    {
        'Name': 'DetectAndLoadOrphanTransactions',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            # 'MainClass': 'com.sadim.main',
            'Args': [
                'spark-submit',
                '--deploy-mode',
                'cluster',
                '--master',
                'yarn',
                '--class',
                'com.sadim.DetectAndLoadOrphanTransactions',
                's3://nadeem-test-data/jars/open-0.0.1-SNAPSHOT_yarn_yarndep_maininpom.jar',
                '--Txnspath',
                's3://nadeem-test-data/spark_output//CategorizedTxnsDataset//2022_load_v2',
                '--demographyPath',
                's3://nadeem-test-data/spark_output//customerDemographics//2022_load_v2',  # the trailing comma matters: without it Python silently concatenates this string with '--outPath'
                '--outPath',
                's3://nadeem-test-data/spark_output//orphan_ihub_ids//2022_load_v2'],
        },
    }
]
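For completeness, the same step list can also be submitted outside Airflow with boto3; a minimal sketch (the region and the JobFlowId below are placeholders, not values from the DAG above):
import boto3

# Submit the command-runner.jar step to an already-running cluster.
# EmrAddStepsOperator does essentially this via its EMR hook and pushes the step ids to XCom.
emr = boto3.client('emr', region_name='us-east-1')
response = emr.add_job_flow_steps(JobFlowId='j-XXXXXXXXXXXXX', Steps=SPARK_STEPS)
print(response['StepIds'])  # the ids a step sensor would poll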

Related

collectstatic command fails and gives me a post-processing error

While deploying my django-app to Heroku I get this error. I disabled collectstatic and the push went through, but when I went to the site it wasn't working. I tried a lot of things but nothing seems to work:
Post-processing 'vendors/owl-carousel/assets/owl.carousel.css' failed!
Traceback (most recent call last):
File "manage.py", line 21, in <module>
main()
File "manage.py", line 17, in main
execute_from_command_line(sys.argv)
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/app/.heroku/python/lib/python3.6/site-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 188, in handle
collected = self.collect()
File "/app/.heroku/python/lib/python3.6/site-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 134, in collect
raise processed
whitenoise.storage.MissingFileError: The file 'vendors/owl-carousel/assets/owl.video.play.png' could not be found with <whitenoise.storage.CompressedManifestStaticFilesStorage object at 0x7f580a2b32e8>.
The CSS file 'vendors/owl-carousel/assets/owl.carousel.css' references a file which could not be found:
vendors/owl-carousel/assets/owl.video.play.png
Please check the URL references in this CSS file, particularly any
relative paths which might be pointing to the wrong location.
My settings.py looks like this:
"""
Django settings for personal_portfolio project.
Generated by 'django-admin startproject' using Django 2.2.4.
For more information on this file, see
https://docs.djangoproject.com/en/2.2/topics/settings/
For the full list of settings and their values, see
https://docs.djangoproject.com/en/2.2/ref/settings/
"""
import django_heroku
import os
# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/2.2/howto/deployment/checklist/
# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = 'y5hmuk*-4*%wor)ek2up+w+x!yx8l40j*n93#1zn!p66)k=c#2'
# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True
ALLOWED_HOSTS = []
# Application definition
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'ckeditor',
    'ckeditor_uploader',
    'blog',
    'portfolio'
]
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]
ROOT_URLCONF = 'personal_portfolio.urls'
TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'DIRS': ['templates'],
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                'personal_portfolio.custom_context.blog_context',
                'personal_portfolio.custom_context.portfolio_context',
                'django.template.context_processors.debug',
                'django.template.context_processors.request',
                'django.contrib.auth.context_processors.auth',
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]
WSGI_APPLICATION = 'personal_portfolio.wsgi.application'
# Database
# https://docs.djangoproject.com/en/2.2/ref/settings/#databases
# DATABASES = {
#     'default': {
#         'ENGINE': 'django.db.backends.postgresql_psycopg2',
#         'NAME': 'portfolio_db',
#         'USER': 'sajib1066',
#         'PASSWORD': 'sajib1099',
#         'HOST': 'localhost',
#         'PORT': '',
#     }
# }
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}
# Password validation
# https://docs.djangoproject.com/en/2.2/ref/settings/#auth-password-validators
AUTH_PASSWORD_VALIDATORS = [
    {
        'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',
    },
    {
        'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',
    },
    {
        'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',
    },
    {
        'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',
    },
]
# Internationalization
# https://docs.djangoproject.com/en/2.2/topics/i18n/
LANGUAGE_CODE = 'en-us'
TIME_ZONE = 'UTC'
USE_I18N = True
USE_L10N = True
USE_TZ = True
# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/2.2/howto/static-files/
# static files
STATIC_URL = '/static/'
STATICFILES_DIRS = [
    os.path.join(BASE_DIR, 'static'),
]
# media files
MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')
CKEDITOR_JQUERY_URL = 'https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js'
CKEDITOR_UPLOAD_PATH = "uploads/"
CKEDITOR_CONFIGS = {
    'default': {
        'toolbar': None,
    },
}
# Activate Django-Heroku.
django_heroku.settings(locals())
I tried a lot to get it to work but I am not able to figure out what's wrong. Please help!
It is because you haven't specified a static root location. Add this at the end of your settings.py and try again:
STATIC_ROOT = os.path.join(BASE_DIR, 'static_root')
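With STATIC_ROOT set, collectstatic has a directory to gather files into. A quick local check (assuming the standard manage.py layout) is:
python manage.py collectstatic
If that succeeds locally, the Heroku build should get past the post-processing step as well.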

Cannot see SQL statements when logging in Django 2.1.4

When trying to log SQL statements in Django I can see the params and the duration, but not the SQL statement itself. The context parameter sql returns None.
I am using MySQL as backend
Settings setup:
DEBUG=True
...
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': '{levelname} SQL:{sql} {duration} params:{params}',
            'style': '{',
        }
    },
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
            'formatter': 'verbose'
        }
    },
    'loggers': {
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    }
}
Sample output:
DEBUG SQL:None 0.036980390548706055 params:(1,)
So I can see the duration and the parameters, but not the SQL statements, and I do not know why.
I have based my configuration on https://docs.djangoproject.com/en/dev/topics/logging/#django-db-backends
After a long search I found out this is a bug in version 2.1.4; upgrading to 2.1.5 fixed the problem. I based this on
https://github.com/jazzband/django-debug-toolbar/issues/1124
Checked and verified. Output after update:
DEBUG SQL:SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED 0.035988569259643555 params:None
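For reference, the fix is just a version bump, e.g. (any 2.1.x release at or above 2.1.5 should carry it):
pip install --upgrade "Django>=2.1.5,<2.2"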

ftpPublish declarative pipeline

I have the following pipeline:
pipeline {
    agent any
    stages {
        ... building stuff...
        stage('push to develop') {
            when {
                branch 'develop'
            }
            steps {
                ftpPublisher paramPublish: [parameterName: ""], alwaysPublishFromMaster: true, masterNodeName: master, continueOnError: false, failOnError: false, publishers: [
                    [configName: 'cp-front', usePromotionTimestamp: false, useWorkspaceInPromotion: false, verbose: true, transfers: [
                        [asciiMode: false, cleanRemote: false, excludes: '', flatten: false, makeEmptyDirs: false, noDefaultExcludes: false, patternSeparator: '[, ]+', remoteDirectorySDF: false, removePrefix: '', sourceFiles: '**/*']
                    ]]
                ]
            }
        }
    }
}
Unfortunately, this throws:
groovy.lang.MissingPropertyException: No such property: master for class: groovy.lang.Binding
    at groovy.lang.Binding.getVariable(Binding.java:63)
    at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:264)
    at org.kohsuke.groovy.sandbox.impl.Checker$6.call(Checker.java:288)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:292)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:268)
    at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:268)
    at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.getProperty(SandboxInvoker.java:29)
    at com.cloudbees.groovy.cps.impl.PropertyAccessBlock.rawGet(PropertyAccessBlock.java:20)
    at WorkflowScript.run(WorkflowScript:22)
Which gives me about 0 idea what is going on. Any pointers?
master (a property lookup on the script binding) is not the same as 'master', which is a String literal. Maybe you made a simple typo? Quoting it, i.e. masterNodeName: 'master', should make the MissingPropertyException go away.

Getting tags from python 3 dictionary object. AWS Boto3 Python 3

I have a dictionary object that is being returned to me from AWS. I need to pull the tag based_on_ami out of this dictionary. I have tried converting it to a list, but I am new to programming and have not been able to figure out how to access Tags, since they are a few levels down in the dictionary.
What is the best way for me to pull that tag out of the dictionary and put it into a variable I can use?
{
    'Images': [
        {
            'Architecture': 'x86_64',
            'CreationDate': '2017-11-27T14:41:30.000Z',
            'ImageId': 'ami-8e73e0f4',
            'ImageLocation': '23452345234545/java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5',
            'ImageType': 'machine',
            'Public': False,
            'OwnerId': '23452345234545',
            'State': 'available',
            'BlockDeviceMappings': [
                {
                    'DeviceName': '/dev/sda1',
                    'Ebs': {
                        'Encrypted': False,
                        'DeleteOnTermination': True,
                        'SnapshotId': 'snap-0c10e8f5ced5b5240',
                        'VolumeSize': 8,
                        'VolumeType': 'gp2'
                    }
                },
                {
                    'DeviceName': '/dev/sdb',
                    'VirtualName': 'ephemeral0'
                },
                {
                    'DeviceName': '/dev/sdc',
                    'VirtualName': 'ephemeral1'
                }
            ],
            'EnaSupport': True,
            'Hypervisor': 'xen',
            'Name': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5',
            'RootDeviceName': '/dev/sda1',
            'RootDeviceType': 'ebs',
            'SriovNetSupport': 'simple',
            'Tags': [
                {
                    'Key': 'service',
                    'Value': 'baseami'
                },
                {
                    'Key': 'cloudservice',
                    'Value': 'ami'
                },
                {
                    'Key': 'Name',
                    'Value': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5'
                },
                {
                    'Key': 'os',
                    'Value': 'ubuntu 16.04 lts'
                },
                {
                    'Key': 'based_on_ami',
                    'Value': 'ami-aa2ea8d0'
                }
            ],
            'VirtualizationType': 'hvm'
        }
    ],
    'ResponseMetadata': {
        'RequestId': '2c376c75-c31f-4aba-a058-173f3b125a00',
        'HTTPStatusCode': 200,
        'HTTPHeaders': {
            'content-type': 'text/xml;charset=UTF-8',
            'transfer-encoding': 'chunked',
            'vary': 'Accept-Encoding',
            'date': 'Fri, 01 Dec 2017 18:17:53 GMT',
            'server': 'AmazonEC2'
        },
        'RetryAttempts': 0
    }
}
The best way to approach this type of problem is to find the value you're looking for, and then work outwards until you find a solution. You need to look at what is at each of those levels.
So, what are you looking for? You're looking for the Value for based_on_ami's Key. So your final step is going to be:
if obj['Key'] == 'based_on_ami':
    # do something with obj['Value'].
But how do you get there? Well, the object is inside of a list, so you'll need to iterate the list:
for tag in <some list>:
    if tag['Key'] == 'based_on_ami':
        # do something with tag['Value'].
What is that list? It's the list of tags:
for tag in image['Tags']:
    if tag['Key'] == 'based_on_ami':
        # do something with tag['Value'].
And where are those tags? In an image object that you find in a list:
for image in image_list:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            # do something with tag['Value'].
The image list is the value found at the Images key in your initial dict.
image_list = my_data['Images']
for image in image_list:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            # do something with tag['Value'].
And now you're collecting all of those values, so you'll need a list and you'll need to append to it:
result = []
image_list = my_data['Images']
for image in image_list:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            result.append(tag['Value'])
So, I took your example above, and added another based_on_ami node with the value quack:
{'ResponseMetadata': {'RequestId': '2c376c75-c31f-4aba-a058-173f3b125a00', 'RetryAttempts': 0, 'HTTPHeaders': {'vary': 'Accept-Encoding', 'transfer-encoding': 'chunked', 'server': 'AmazonEC2', 'content-type': 'text/xml;charset=UTF-8', 'date': 'Fri, 01 Dec 2017 18:17:53 GMT'}, 'HTTPStatusCode': 200}, 'Images': [{'Public': False, 'CreationDate': '2017-11-27T14:41:30.000Z', 'BlockDeviceMappings': [{'Ebs': {'SnapshotId': 'snap-0c10e8f5ced5b5240', 'VolumeSize': 8, 'Encrypted': False, 'VolumeType': 'gp2', 'DeleteOnTermination': True}, 'DeviceName': '/dev/sda1'}, {'VirtualName': 'ephemeral0', 'DeviceName': '/dev/sdb'}, {'VirtualName': 'ephemeral1', 'DeviceName': '/dev/sdc'}], 'OwnerId': '23452345234545', 'ImageLocation': '23452345234545/java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'RootDeviceName': '/dev/sda1', 'ImageType': 'machine', 'Hypervisor': 'xen', 'RootDeviceType': 'ebs', 'State': 'available', 'Architecture': 'x86_64', 'Name': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Tags': [{'Value': 'baseami', 'Key': 'service'}, {'Value': 'ami', 'Key': 'cloudservice'}, {'Value': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Key': 'Name'}, {'Value': 'ubuntu 16.04 lts', 'Key': 'os'}, {'Value': 'ami-aa2ea8d0', 'Key': 'based_on_ami'}], 'EnaSupport': True, 'SriovNetSupport': 'simple', 'ImageId': 'ami-8e73e0f4'}, {'Public': False, 'CreationDate': '2017-11-27T14:41:30.000Z', 'BlockDeviceMappings': [{'Ebs': {'SnapshotId': 'snap-0c10e8f5ced5b5240', 'VolumeSize': 8, 'Encrypted': False, 'VolumeType': 'gp2', 'DeleteOnTermination': True}, 'DeviceName': '/dev/sda1'}, {'VirtualName': 'ephemeral0', 'DeviceName': '/dev/sdb'}, {'VirtualName': 'ephemeral1', 'DeviceName': '/dev/sdc'}], 'VirtualizationType': 'hvm', 'OwnerId': '23452345234545', 'ImageLocation': '23452345234545/java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'RootDeviceName': '/dev/sda1', 'ImageType': 'machine', 'Hypervisor': 'xen', 'RootDeviceType': 'ebs', 'State': 'available', 'Architecture': 'x86_64', 'Name': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Tags': [{'Value': 'baseami', 'Key': 'service'}, {'Value': 'ami', 'Key': 'cloudservice'}, {'Value': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Key': 'Name'}, {'Value': 'ubuntu 16.04 lts', 'Key': 'os'}, {'Value': 'quack', 'Key': 'based_on_ami'}], 'EnaSupport': True, 'SriovNetSupport': 'simple', 'ImageId': 'ami-8e73e0f4'}]}
My result:
['ami-aa2ea8d0', 'quack']
info = {...}
tags = []
for image in info['Images']:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            tags.append(tag['Value'])
print(tags)
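Equivalently, the same walk over Images and Tags can be written as a single list comprehension (same logic, just more compact):
tags = [tag['Value']
        for image in info['Images']
        for tag in image['Tags']
        if tag['Key'] == 'based_on_ami']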

How to query datasets in avro format?

This works with Parquet:
val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")
I tried the same approach with Avro, but it keeps giving me an error even if I use com.databricks.spark.avro.
When I execute the following query:
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")
I get an AnalysisException. Why?
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:61)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:38)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:38)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:37)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
Changing the name of the format to com.databricks.spark.avro does not make any difference and queries fail.
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`file-path`")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.' expecting {<EOF>, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 65)
== SQL ==
SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`/uat/myfile`
-----------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
... 48 elided
Spark SQL supports the avro format through a separate spark-avro module.
A library for reading and writing Avro data from Spark SQL.
Please note that spark-avro is a separate module that is not included by default in Spark.
You should load the module using spark-submit --packages, e.g.
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
See With spark-shell or spark-submit.
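Once the package is on the classpath it registers avro as a data source short name, so the query style from the question should work as well; a PySpark sketch of the same idea (assuming the shell was started with --packages as above, and reusing the file path from the question):
# same SQL as in the question, now that the 'avro' source resolves
sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`/uat/myfile`")
sqlDF.show()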
Jacek's answer works in general, but in my environment it was not working for obscure reasons: spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 was hanging for a long time without producing any result. I solved this problem using the --jars option along with spark-shell.
Steps:
1) Go to https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/4.0.0 and copy the link address of the jar: http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
2) wget http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar .
3) spark-shell --jars <path where you downloaded the jar file>/spark-avro_2.11-4.0.0.jar
4) spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")
This got converted into a DataFrame and I was able to print it. In your case, once you have the DataFrame, you can run SQL your way.
Note: If you are not using spark-shell, you can build an uber jar with sbt or Maven that includes spark-avro_2.11-4.0.0.jar, using the Maven coordinates below.
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>
Note: A built-in Avro data source was introduced in Spark 2.4 onwards (SPARK-24768: Have a built-in AVRO data source implementation), which means that all of the above is no longer necessary. See the spark-release-2-4-0 release notes.
Using Spark, we can integrate the Avro format through the spark-avro module. The spark-avro library was originally developed by Databricks as an open source library. The spark-avro module is external and not included in spark-submit or spark-shell by default, so we need to specify it explicitly when submitting a Spark job.
In the following sections, I will explain how to integrate Spark and the Avro data format.
Spark version >= 2.4
From the Spark 2.4 release onwards, Spark SQL provides built-in support for reading and writing Apache Avro data.
Maven Dependency:
https://mvnrepository.com/artifact/org.apache.spark/spark-avro
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.12</artifactId>
    <version>2.4.5</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
SparkShell:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
Example:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Employee(id: Long, name: String, salary: Float, deptId: Int)

object SparkAvroWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Write Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeList = List(
      Employee(1, "Ranga", 10000, 1),
      Employee(2, "Vinod", 1000, 1),
      Employee(3, "Nishanth", 500000, 2),
      Employee(4, "Manoj", 25000, 1),
      Employee(5, "Yashu", 1600, 1),
      Employee(6, "Raja", 50000, 2)
    )
    val employeeDF = spark.createDataFrame(employeeList)
    employeeDF.coalesce(1).write.format("avro").mode("overwrite").save("employees.avro")
    spark.close()
  }
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkAvroReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeDF = spark.read.format("avro").load("employees.avro")
    employeeDF.printSchema()
    employeeDF.foreach(employee => println(employee))
    spark.close()
  }
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.4/SparkAvro
Spark version < 2.4
In Spark versions < 2.4, we need to explicitly specify the Avro format as com.databricks.spark.avro; otherwise we get the org.apache.spark.sql.AnalysisException: Failed to find data source: avro error.
Maven Dependency:
Spark Version    Compatible version of Avro Data Source for Spark
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1
2.0 - 2.1        3.2.0
2.2 - 2.3        4.0.0
https://mvnrepository.com/artifact/com.databricks/spark-avro
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...
SparkShell:
./bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0 ...
Examples:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Employee(id: Long, name: String, salary: Float, deptId: Int)

object SparkAvroWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Write Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeList = List(
      Employee(1, "Ranga", 10000, 1),
      Employee(2, "Vinod", 1000, 1),
      Employee(3, "Nishanth", 500000, 2),
      Employee(4, "Manoj", 25000, 1),
      Employee(5, "Yashu", 1600, 1),
      Employee(6, "Raja", 50000, 2)
    )
    val employeeDF = spark.createDataFrame(employeeList)
    employeeDF.coalesce(1).write.format("com.databricks.spark.avro").mode("overwrite").save("employees.avro")
    spark.close()
  }
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkAvroReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeDF = spark.read.format("com.databricks.spark.avro").load("employees.avro")
    employeeDF.printSchema()
    employeeDF.foreach(employee => println(employee))
    spark.close()
  }
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.3/SparkAvro
That's all, folks!
