While deploying my Django app to Heroku I get this error. I disabled collectstatic and the push succeeded, but when I went to the site it wasn't working. I tried a lot of things but nothing seems to work.
Post-processing 'vendors/owl-carousel/assets/owl.carousel.css' failed!
Traceback (most recent call last):
File "manage.py", line 21, in <module>
main()
File "manage.py", line 17, in main
execute_from_command_line(sys.argv)
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/app/.heroku/python/lib/python3.6/site-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 188, in handle
collected = self.collect()
File "/app/.heroku/python/lib/python3.6/site-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 134, in collect
raise processed
whitenoise.storage.MissingFileError: The file 'vendors/owl-carousel/assets/owl.video.play.png' could not be found with <whitenoise.storage.CompressedManifestStaticFilesStorage object at 0x7f580a2b32e8>.
The CSS file 'vendors/owl-carousel/assets/owl.carousel.css' references a file which could not be found:
vendors/owl-carousel/assets/owl.video.play.png
Please check the URL references in this CSS file, particularly any
relative paths which might be pointing to the wrong location.
My settings.py looks like this:
"""
Django settings for personal_portfolio project.
Generated by 'django-admin startproject' using Django 2.2.4.
For more information on this file, see
https://docs.djangoproject.com/en/2.2/topics/settings/
For the full list of settings and their values, see
https://docs.djangoproject.com/en/2.2/ref/settings/
"""
import django_heroku
import os
# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/2.2/howto/deployment/checklist/
# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = 'y5hmuk*-4*%wor)ek2up+w+x!yx8l40j*n93#1zn!p66)k=c#2'
# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True
ALLOWED_HOSTS = []
# Application definition
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'ckeditor',
'ckeditor_uploader',
'blog',
'portfolio'
]
MIDDLEWARE = [
'django.middleware.security.SecurityMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
'django.middleware.common.CommonMiddleware',
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
'django.middleware.clickjacking.XFrameOptionsMiddleware',
]
ROOT_URLCONF = 'personal_portfolio.urls'
TEMPLATES = [
{
'BACKEND': 'django.template.backends.django.DjangoTemplates',
'DIRS': ['templates'],
'APP_DIRS': True,
'OPTIONS': {
'context_processors': [
'personal_portfolio.custom_context.blog_context',
'personal_portfolio.custom_context.portfolio_context',
'django.template.context_processors.debug',
'django.template.context_processors.request',
'django.contrib.auth.context_processors.auth',
'django.contrib.messages.context_processors.messages',
],
},
},
]
WSGI_APPLICATION = 'personal_portfolio.wsgi.application'
# Database
# https://docs.djangoproject.com/en/2.2/ref/settings/#databases
# DATABASES = {
# 'default': {
# 'ENGINE': 'django.db.backends.postgresql_psycopg2',
# 'NAME': 'portfolio_db',
# 'USER': 'sajib1066',
# 'PASSWORD': 'sajib1099',
# 'HOST': 'localhost',
# 'PORT': '',
# }
# }
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.sqlite3',
'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
}
}
# Password validation
# https://docs.djangoproject.com/en/2.2/ref/settings/#auth-password-validators
AUTH_PASSWORD_VALIDATORS = [
{
'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',
},
{
'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',
},
{
'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',
},
{
'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',
},
]
# Internationalization
# https://docs.djangoproject.com/en/2.2/topics/i18n/
LANGUAGE_CODE = 'en-us'
TIME_ZONE = 'UTC'
USE_I18N = True
USE_L10N = True
USE_TZ = True
# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/2.2/howto/static-files/
# static files
STATIC_URL = '/static/'
STATICFILES_DIRS = [
os.path.join(BASE_DIR, 'static'),
]
# media files
MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')
CKEDITOR_JQUERY_URL = 'https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js'
CKEDITOR_UPLOAD_PATH = "uploads/"
CKEDITOR_CONFIGS = {
'default': {
'toolbar': None,
},
}
# Activate Django-Heroku.
django_heroku.settings(locals())
I tried a lot to get it to work but I am not able to figure out what's wrong. Please help!
This is because you haven't specified a static root location. Add the following at the end of your settings.py and try again:
STATIC_ROOT = os.path.join(BASE_DIR, 'static_root')
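For context, a minimal sketch of how the static-files section of settings.py could look with that line added (the directory name 'static_root' is just a convention; collectstatic will create and fill it):

# static files -- sketch, assuming the settings shown in the question
STATIC_URL = '/static/'
STATICFILES_DIRS = [
    os.path.join(BASE_DIR, 'static'),                  # your source assets
]
STATIC_ROOT = os.path.join(BASE_DIR, 'static_root')    # collectstatic copies everything here

If you disabled collectstatic through Heroku's DISABLE_COLLECTSTATIC config var, remember to unset it again (heroku config:unset DISABLE_COLLECTSTATIC) so the build runs collectstatic with the new setting.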
I have a dictionary object that is being returned to me from AWS. I need to pull the tag "based_on_ami" out of this dictionary. I have tried converting to a list, but I am new to programming and have not been able to figure out how to access Tags since they are a few levels down in the dictionary.
What is the best way for me to pull that tag out of the dictionary and put it into a variable I can use?
{
'Images':[
{
'Architecture':'x86_64',
'CreationDate':'2017-11-27T14:41:30.000Z',
'ImageId':'ami-8e73e0f4',
'ImageLocation':'23452345234545/java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5',
'ImageType':'machine',
'Public':False,
'OwnerId':'23452345234545',
'State':'available',
'BlockDeviceMappings':[
{
'DeviceName':'/dev/sda1',
'Ebs':{
'Encrypted':False,
'DeleteOnTermination':True,
'SnapshotId':'snap-0c10e8f5ced5b5240',
'VolumeSize':8,
'VolumeType':'gp2'
}
},
{
'DeviceName':'/dev/sdb',
'VirtualName':'ephemeral0'
},
{
'DeviceName':'/dev/sdc',
'VirtualName':'ephemeral1'
}
],
'EnaSupport':True,
'Hypervisor':'xen',
'Name':'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5',
'RootDeviceName':'/dev/sda1',
'RootDeviceType':'ebs',
'SriovNetSupport':'simple',
'Tags':[
{
'Key':'service',
'Value':'baseami'
},
{
'Key':'cloudservice',
'Value':'ami'
},
{
'Key':'Name',
'Value':'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5'
},
{
'Key':'os',
'Value':'ubuntu 16.04 lts'
},
{
'Key':'based_on_ami',
'Value':'ami-aa2ea8d0'
}
],
'VirtualizationType':'hvm'
}
],
'ResponseMetadata':{
'RequestId':'2c376c75-c31f-4aba-a058-173f3b125a00',
'HTTPStatusCode':200,
'HTTPHeaders':{
'content-type':'text/xml;charset=UTF-8',
'transfer-encoding':'chunked',
'vary':'Accept-Encoding',
'date':'Fri, 01 Dec 2017 18:17:53 GMT',
'server':'AmazonEC2'
},
'RetryAttempts':0
}
}
The best way to approach this type of problem is to find the value you're looking for, and then work outwards until you find a solution. You need to look at what is at each of those levels.
So, what are you looking for? You're looking for the Value of the tag whose Key is based_on_ami. So your final step is going to be:
if obj['Key'] == 'based_on_ami':
    # do something with obj['Value'].
But how do you get there? Well, the object is inside of a list, so you'll need to iterate the list:
for tag in <some list>:
    if tag['Key'] == 'based_on_ami':
        # do something with tag['Value'].
What is that list? It's the list of tags:
for tag in image['Tags']:
    if tag['Key'] == 'based_on_ami':
        # do something with tag['Value'].
And where are those tags? In an image object that you find in a list:
for image in image_list:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            # do something with tag['Value'].
The image list is the value found at the Images key in your initial dict.
image_list = my_data['Images']
for image in image_list:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            # do something with tag['Value'].
And now you're collecting all of those values, so you'll need a list and you'll need to append to it:
result = []
image_list = my_data['Images']
for image in image_list:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            result.append(tag['Value'])
So, I took your example above, and added another based_on_ami node with the value quack:
{'ResponseMetadata': {'RequestId': '2c376c75-c31f-4aba-a058-173f3b125a00', 'RetryAttempts': 0, 'HTTPHeaders': {'vary': 'Accept-Encoding', 'transfer-encoding': 'chunked', 'server': 'AmazonEC2', 'content-type': 'text/xml;charset=UTF-8', 'date': 'Fri, 01 Dec 2017 18:17:53 GMT'}, 'HTTPStatusCode': 200}, 'Images': [{'Public': False, 'CreationDate': '2017-11-27T14:41:30.000Z', 'BlockDeviceMappings': [{'Ebs': {'SnapshotId': 'snap-0c10e8f5ced5b5240', 'VolumeSize': 8, 'Encrypted': False, 'VolumeType': 'gp2', 'DeleteOnTermination': True}, 'DeviceName': '/dev/sda1'}, {'VirtualName': 'ephemeral0', 'DeviceName': '/dev/sdb'}, {'VirtualName': 'ephemeral1', 'DeviceName': '/dev/sdc'}], 'OwnerId': '23452345234545', 'ImageLocation': '23452345234545/java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'RootDeviceName': '/dev/sda1', 'ImageType': 'machine', 'Hypervisor': 'xen', 'RootDeviceType': 'ebs', 'State': 'available', 'Architecture': 'x86_64', 'Name': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Tags': [{'Value': 'baseami', 'Key': 'service'}, {'Value': 'ami', 'Key': 'cloudservice'}, {'Value': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Key': 'Name'}, {'Value': 'ubuntu 16.04 lts', 'Key': 'os'}, {'Value': 'ami-aa2ea8d0', 'Key': 'based_on_ami'}], 'EnaSupport': True, 'SriovNetSupport': 'simple', 'ImageId': 'ami-8e73e0f4'}, {'Public': False, 'CreationDate': '2017-11-27T14:41:30.000Z', 'BlockDeviceMappings': [{'Ebs': {'SnapshotId': 'snap-0c10e8f5ced5b5240', 'VolumeSize': 8, 'Encrypted': False, 'VolumeType': 'gp2', 'DeleteOnTermination': True}, 'DeviceName': '/dev/sda1'}, {'VirtualName': 'ephemeral0', 'DeviceName': '/dev/sdb'}, {'VirtualName': 'ephemeral1', 'DeviceName': '/dev/sdc'}], 'VirtualizationType': 'hvm', 'OwnerId': '23452345234545', 'ImageLocation': '23452345234545/java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'RootDeviceName': '/dev/sda1', 'ImageType': 'machine', 'Hypervisor': 'xen', 'RootDeviceType': 'ebs', 'State': 'available', 'Architecture': 'x86_64', 'Name': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Tags': [{'Value': 'baseami', 'Key': 'service'}, {'Value': 'ami', 'Key': 'cloudservice'}, {'Value': 'java8server_ubuntu16-2b71edd1-f95e-4ee5-8fd6-d8a46975fdb5', 'Key': 'Name'}, {'Value': 'ubuntu 16.04 lts', 'Key': 'os'}, {'Value': 'quack', 'Key': 'based_on_ami'}], 'EnaSupport': True, 'SriovNetSupport': 'simple', 'ImageId': 'ami-8e73e0f4'}]}
My result:
['ami-aa2ea8d0', 'quack']
info = {...}

tags = []
for image in info['Images']:
    for tag in image['Tags']:
        if tag['Key'] == 'based_on_ami':
            tags.append(tag['Value'])

print(tags)
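For reference, the same collection can also be written as a nested list comprehension; this is just a more compact, equivalent form of the loop above, assuming the same info dict:

# equivalent to the loop above; info is the dict returned by AWS
tags = [
    tag['Value']
    for image in info['Images']
    for tag in image['Tags']
    if tag['Key'] == 'based_on_ami'
]
print(tags)  # e.g. ['ami-aa2ea8d0']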
This works with Parquet:
val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")
I tried the same way with Avro but it keeps giving me an error, even if I use com.databricks.spark.avro.
When I execute the following query:
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")
I get the AnalysisException. Why?
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:61)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:38)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:38)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:37)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
Changing the name of the format to com.databricks.spark.avro does not make any difference and queries fail.
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`file-path`")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.' expecting {<EOF>, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 65)
== SQL ==
SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`/uat/myfile`
-----------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
... 48 elided
Spark SQL supports avro format through a separate spark-avro module.
A library for reading and writing Avro data from Spark SQL.
Please note that spark-avro is a separate module that is not included by default in Spark.
You should load the module using spark-submit --packages, e.g.
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
See the spark-avro documentation section "With spark-shell or spark-submit".
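With the package on the classpath, the query from the question should then resolve; a minimal sketch, keeping the placeholder path from the question:

// started with: bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")
sqlDF.show()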
Jacek's answer works in general, but in my environment it was not working for obscure reasons: spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 was hanging for a long time without producing any result.
I solved this problem using the --jars option with spark-shell.
Steps:
1) Go to https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/4.0.0 and copy the link address of the jar: http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
2) wget http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar .
3) spark-shell --jars <path where you downloaded the jar file>/spark-avro_2.11-4.0.0.jar
4) spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")
This loads the file into a DataFrame, which I was then able to print.
In your case, once you have the DataFrame you can run SQL against it, for example by registering a temporary view as sketched below.
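A minimal sketch of that step; the view name my_avro and the path are just placeholders:

// assumes spark-shell was started with --jars .../spark-avro_2.11-4.0.0.jar as above
val df = spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")
df.createOrReplaceTempView("my_avro")   // "my_avro" is a hypothetical view name
spark.sql("SELECT DISTINCT Source_Product_Classification FROM my_avro").show()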
Note: If you are not using spark-shell, you can build an uber jar with sbt or Maven that includes spark-avro_2.11-4.0.0.jar, using the Maven coordinates below.
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Note: A built-in Avro data source was introduced in Spark 2.4 onwards; see SPARK-24768
("Have a built-in AVRO data source implementation"),
which means that all of the above is no longer necessary.
See the Spark 2.4.0 release notes.
Spark Avro Integration:
Spark can integrate the Avro format through the spark-avro module. The spark-avro library was originally developed by Databricks as an open source library. The spark-avro module is external and not included in spark-submit or spark-shell by default, so we need to specify it explicitly when submitting a Spark job.
In the following sections, I will explain how to integrate Spark with the Avro data format.
Spark version >= 2.4
Spark 2.4 release onwards, Spark SQL provides built-in support for reading and writing Apache Avro data.
Maven Dependency:
https://mvnrepository.com/artifact/org.apache.spark/spark-avro
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>2.4.5</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
SparkShell:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
Example:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

case class Employee(id: Long, name: String, salary: Float, deptId: Int)

object SparkAvroWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Write Example")
    val spark = SparkSession.builder().config(conf).getOrCreate();

    val employeeList = List(
      Employee(1, "Ranga", 10000, 1),
      Employee(2, "Vinod", 1000, 1),
      Employee(3, "Nishanth", 500000, 2),
      Employee(4, "Manoj", 25000, 1),
      Employee(5, "Yashu", 1600, 1),
      Employee(6, "Raja", 50000, 2)
    );

    val employeeDF = spark.createDataFrame(employeeList);
    employeeDF.coalesce(1).write.format("avro").mode("overwrite").save("employees.avro");

    spark.close();
  }
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

object SparkAvroReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate();

    val employeeDF = spark.read.format("avro").load("employees.avro");
    employeeDF.printSchema();
    employeeDF.foreach(employee => { println(employee); });

    spark.close();
  }
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.4/SparkAvro
Spark version < 2.4
In Spark versions < 2.4, we need to explicitly specify the Avro format as com.databricks.spark.avro; otherwise we will get the org.apache.spark.sql.AnalysisException: Failed to find data source: avro. error.
Maven Dependency:
Spark Version    Compatible version of Avro Data Source for Spark
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1
2.0 - 2.1        3.2.0
2.2 - 2.3        4.0.0
https://mvnrepository.com/artifact/com.databricks/spark-avro
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...
SparkShell:
./bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0 ...
Examples:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

case class Employee(id: Long, name: String, salary: Float, deptId: Int)

object SparkAvroWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Write Example")
    val spark = SparkSession.builder().config(conf).getOrCreate();

    val employeeList = List(
      Employee(1, "Ranga", 10000, 1),
      Employee(2, "Vinod", 1000, 1),
      Employee(3, "Nishanth", 500000, 2),
      Employee(4, "Manoj", 25000, 1),
      Employee(5, "Yashu", 1600, 1),
      Employee(6, "Raja", 50000, 2)
    );

    val employeeDF = spark.createDataFrame(employeeList);
    employeeDF.coalesce(1).write.format("com.databricks.spark.avro").mode("overwrite").save("employees.avro");

    spark.close();
  }
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

object SparkAvroReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate();

    val employeeDF = spark.read.format("com.databricks.spark.avro").load("employees.avro");
    employeeDF.printSchema();
    employeeDF.foreach(employee => { println(employee); });

    spark.close();
  }
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.3/SparkAvro
That's all folks!!