Write DataFrame to parquet on HDFS partitioned by multiple columns with dynamic partitionOverwriteMode

I have a DataFrame that I want to save in Parquet format to HDFS, partitioned by multiple columns.
When I write the data to HDFS, the target directory is created with only a _SUCCESS file in it, but no data. I use partitionOverwriteMode=dynamic with overwrite as the save mode. The path does not exist at the time the code runs. If I change the save mode to append, it works fine.
I also tried writing to the local file system. In that case, both modes work correctly.
If only one partition column is specified, it works fine too.
Any ideas on how I can make overwrite work with multi-column partitioning? Any tips appreciated. Thanks!
Code sample:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
data = [
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 15},
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 14},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 12},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 13},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 4},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 2},
    {'country': 'US', 'fk_imported_at': '20191213', 'user_id': 1},
]
if __name__ == '__main__':
    conf = SparkConf()
    conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
    spark = (
        SparkSession
        .builder
        .config(conf=conf)
        .appName('test partitioning')
        .enableHiveSupport()
        .getOrCreate()
    )
    df = spark.createDataFrame(data)
    df.show()
    df.repartition(1).write.parquet('/tmp/spark_save_mode', 'overwrite', ['fk_imported_at', 'country'])
    spark.stop()
I'm submitting the application in client mode. The Spark version is 2.3.0.
The Hadoop version is 2.6.0.
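When checking whether the write produced data, it can help to know which partition directories to look for. The sketch below is plain Python (no Spark needed) and assumes the usual Hive-style layout, where each partition column becomes a `column=value` directory level in the order the columns were listed. The helper name `partition_dirs` is hypothetical and only for illustration.

```python
def partition_dirs(rows, partition_cols, base_path):
    """Return the sorted Hive-style partition directories implied by the rows."""
    dirs = set()
    for row in rows:
        # One directory level per partition column, in the order given.
        levels = ["{}={}".format(col, row[col]) for col in partition_cols]
        dirs.add("/".join([base_path.rstrip("/")] + levels))
    return sorted(dirs)


data = [
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 15},
    {'country': 'DE', 'fk_imported_at': '20191212', 'user_id': 14},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 12},
    {'country': 'US', 'fk_imported_at': '20191212', 'user_id': 13},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 4},
    {'country': 'DE', 'fk_imported_at': '20191213', 'user_id': 2},
    {'country': 'US', 'fk_imported_at': '20191213', 'user_id': 1},
]

expected_dirs = partition_dirs(
    data, ['fk_imported_at', 'country'], '/tmp/spark_save_mode')
for directory in expected_dirs:
    print(directory)
```

With the sample data this lists four two-level directories, which can be compared against the output of `hdfs dfs -ls -R /tmp/spark_save_mode`.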

Related

EMR Hudi cannot create hive connection jdbc:hive2://localhost:10000/

I am trying to save a Hudi table from a Jupyter notebook with hive-sync enabled. I am using EMR 5.28.0 with AWS Glue enabled as the catalog:
# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "creation_date", "last_update_time"]
)
# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
    'hoodie.table.name': 'my_hudi_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'my_hudi_table',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
# Write a DataFrame as a Hudi dataset
(inputDF.write
    .format('org.apache.hudi')
    .option('hoodie.datasource.write.operation', 'insert')
    .options(**hudiOptions)
    .mode('overwrite')
    .save('s3://dytyniak-test-data/myhudidataset/'))
I receive the following error:
An error occurred while calling o309.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/

I assume you are following the tutorial from AWS documentation. I got it to work using Hudi 0.9.0 by setting hive_sync.mode to hms in hudiOptions (see hudi docs):
hudiOptions = {
    'hoodie.table.name': 'my_hudi_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'my_hudi_table',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

How to submit PySpark and Python jobs to Livy

I am trying to submit a PySpark job to Livy using the /batches endpoint, but I haven't found any good documentation. So far it has been easy because we submit Scala-compiled JAR files to Livy and specify the job with className.
For the JAR file, we use:
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        "oof",
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
I am unsure how to submit a PySpark job in a similar manner when the package also has some relative imports. Any thoughts?
For reference, the folder structure is below:
bar2
__init__.py
foo2.py
bar3
__init__.py
foo3.py
I would want to then run:
from foo2 import ClassFoo
class_foo = ClassFoo(arg1, arg2)
class_foo.auto_run()

You can try passing pyFiles:
data = {
    'file': 's3://foo-bucket/bar.jar',
    'className': 'com.foo.bar',
    'jars': [
        's3://foo-bucket/common.jar',
    ],
    'pyFiles': ['s3://<bucket>/<folder>/foo2.py', 's3://<bucket>/<folder>/foo3.py'],
    'args': [
        bucket_name,
        'https://foo.bar.com',
        "oof",
        spark_master
    ],
    'name': 'foo-oof bar',
    'driverMemory': '2g',
    'executorMemory': '2g',
    'driverCores': 1,
    'executorCores': 3,
    'conf': {
        'spark.driver.memoryOverhead': '600',
        'spark.executor.memoryOverhead': '600',
        'spark.submit.deployMode': 'cluster'
    }
}
The relevant addition in the example above is:
'pyFiles': ['s3://<bucket>/<folder>/foo2.py', 's3://<bucket>/<folder>/foo3.py']
I tried saving the files on the master node via bootstrapping, but noticed that Livy would send the request to an arbitrary slave node, where the files might not be present.
You may also be able to pass the files as a .zip, although I haven't tried it.

You need to submit with file being the main Python executable, and pyFiles being the additional internal libraries that are being used. My advice would be to provision the server with a bootstrap action which copies your own libraries over, and installs the pip-installable libraries on the master and nodes.
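The advice above (the main Python script in file, supporting libraries in pyFiles) could be sketched as the following request body for the /batches endpoint. This is a minimal illustration, not a tested deployment: the S3 paths, the `main.py` entry point, the `bar2.zip` archive, and the helper name `build_pyspark_batch` are all hypothetical. Only the JSON body is built here; sending it is left to whatever HTTP client you already use.

```python
import json


def build_pyspark_batch(main_file, py_files, args=(), name=None, conf=None):
    """Build a Livy /batches request body for a PySpark job.

    Unlike the JAR case there is no 'className': Livy runs `main_file`
    as the entry script and ships `py_files` alongside it.
    """
    payload = {
        "file": main_file,
        "pyFiles": list(py_files),
        "args": [str(arg) for arg in args],
    }
    if name:
        payload["name"] = name
    if conf:
        payload["conf"] = dict(conf)
    return payload


payload = build_pyspark_batch(
    main_file="s3://foo-bucket/main.py",     # hypothetical entry script
    py_files=["s3://foo-bucket/bar2.zip"],   # hypothetical zip of the package
    args=["oof"],
    name="foo-oof bar",
    conf={"spark.submit.deployMode": "cluster"},
)

# JSON body to POST to the Livy /batches endpoint
# with the header Content-Type: application/json.
body = json.dumps(payload)
print(body)
```

Zipping the package directory (rather than listing individual .py files) is one way to keep package-relative imports working, assuming the zip preserves the package layout.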

Python nested json to csv

I can't convert this JSON to CSV. I have tried different solutions posted here using pandas and other parsers, but none of them solved it.
This is a small extract of the big JSON:
{'data': {'items': [{'category': 'cat',
'coupon_code': 'cupon 1',
'coupon_name': '$829.99/€705.79 ',
'coupon_url': 'link3',
'end_time': '2017-12-31 00:00:00',
'language': 'sp',
'start_time': '2017-12-19 00:00:00'},
{'category': 'LED ',
'coupon_code': 'code',
'coupon_name': 'text',
'coupon_url': 'link',
'end_time': '2018-01-31 00:00:00',
'language': 'sp',
'start_time': '2017-10-07 00:00:00'}],
'total_pages': 1,
'total_results': 137},
'error_no': 0,
'msg': '',
'request': 'GET api/ #2017-12-26 04:50:02'}
I'd like to get output with these columns:
category, coupon_code, coupon_name, coupon_url, end_time, language, start_time
I'm running Python 3.6 with no restrictions.
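One possible approach, sketched with only the standard library: the rows of interest already sit in a flat list under data -> items, so they can be passed to csv.DictWriter directly. The sample below uses a shortened copy of the extract above; the function name `items_to_csv` is hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Shortened copy of the extract from the question.
response = {
    'data': {
        'items': [
            {'category': 'cat', 'coupon_code': 'cupon 1',
             'coupon_name': '$829.99/€705.79 ', 'coupon_url': 'link3',
             'end_time': '2017-12-31 00:00:00', 'language': 'sp',
             'start_time': '2017-12-19 00:00:00'},
            {'category': 'LED ', 'coupon_code': 'code',
             'coupon_name': 'text', 'coupon_url': 'link',
             'end_time': '2018-01-31 00:00:00', 'language': 'sp',
             'start_time': '2017-10-07 00:00:00'},
        ],
        'total_pages': 1,
        'total_results': 137,
    },
    'error_no': 0,
    'msg': '',
}

FIELDS = ['category', 'coupon_code', 'coupon_name', 'coupon_url',
          'end_time', 'language', 'start_time']


def items_to_csv(payload, out, fields=FIELDS):
    """Write one CSV row per entry in payload['data']['items']."""
    # extrasaction='ignore' skips unexpected keys; missing keys become ''.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for item in payload['data']['items']:
        writer.writerow(item)


buffer = io.StringIO()
items_to_csv(response, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the target with `open('coupons.csv', 'w', newline='', encoding='utf-8')` and pass that file object as `out`.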

How to read data from HBase table using pyspark?

I have created a dummy HBase table called emp having one record. Below is the data.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.1540 seconds
hbase(main):006:0> scan 'emp'
ROW    COLUMN+CELL
1      column=personal data:name, timestamp=1512478562674, value=raju
1 row(s) in 0.0280 seconds
Now I have established a connection between HBase and PySpark using shc. Can you please help me with the code to read the above HBase table as a DataFrame in PySpark?
Version Details:
Spark Version 2.2.0, HBase 1.3.1, HCatalog 2.3.1

You can try it like this:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf.cloudera.hbase/hbase-site.xml
Then, in the PySpark shell:
import json
# Build the catalog with json.dumps so that it is valid JSON and the space in
# the 'personal data' column family is kept (''.join(text.split()) would strip it).
empdata = json.dumps({
    'table': {'namespace': 'default', 'name': 'emp'},
    'rowkey': 'key',
    'columns': {
        'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
        'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
    }
})
df = sqlContext \
    .read \
    .options(catalog=empdata) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.show()
Refer to this blog post for more info:
https://diogoalexandrefranco.github.io/interacting-with-hbase-from-pyspark/

How to query datasets in avro format?

This works with Parquet:
val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")
I tried the same approach with Avro, but it keeps giving me an error even if I use com.databricks.spark.avro.
When I execute the following query:
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")
I get an AnalysisException. Why?
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:61)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:38)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:38)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:37)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
Changing the format name to com.databricks.spark.avro does not make any difference; the query still fails, this time with a ParseException:
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`file-path`")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.' expecting {<EOF>, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 
'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 65)
== SQL ==
SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`/uat/myfile`
-----------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
... 48 elided
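The ParseException above points at the format name itself: com.databricks.spark.avro contains dots and is neither backtick-quoted nor separated from the path by a dot. A hedged sketch of how the query text could be assembled so that both the format and the path are backtick-quoted follows. The helper name `sql_on_file` is hypothetical, and whether the query then resolves still depends on the Avro package being on the classpath, so treat this as a string-building illustration rather than a confirmed fix.

```python
def sql_on_file(columns, fmt, path, distinct=False):
    """Build a 'run SQL on a file' query of the form `format`.`path`."""
    def quote(identifier):
        # Backtick-quote and escape embedded backticks by doubling them.
        return "`" + identifier.replace("`", "``") + "`"

    select = "SELECT DISTINCT" if distinct else "SELECT"
    return "{} {} FROM {}.{}".format(
        select, ", ".join(columns), quote(fmt), quote(path))


query = sql_on_file(
    ["Source_Product_Classification"],
    "com.databricks.spark.avro",
    "/uat/myfile",
    distinct=True,
)
print(query)
# SELECT DISTINCT Source_Product_Classification FROM `com.databricks.spark.avro`.`/uat/myfile`
```

The resulting string would then be passed to spark.sql(...) in the same way as the Parquet query.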

Spark SQL supports the Avro format through a separate spark-avro module, described as:
A library for reading and writing Avro data from Spark SQL.
Please note that spark-avro is a separate module that is not included by default in Spark.
You should load the module using the --packages option of spark-submit or spark-shell, e.g.
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
See "With spark-shell or spark-submit" in the spark-avro documentation.

Jacek's answer works in general, but in my environment it did not work for reasons I could not determine: spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 hung for a long time without producing any result.
I solved this problem by using the --jars option with spark-shell.
Steps:
1) Go to https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/4.0.0
and copy the link address of the jar: http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
2) wget http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
3) spark-shell --jars <pathwhere you downloaded jar file>/spark-avro_2.11-4.0.0.jar
4) spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")
This loaded the data into a DataFrame, which I was able to print.
In your case, once you have the DataFrame you can run SQL on it.
Note: If you are not using spark-shell, you can build an uber jar with sbt or Maven that includes spark-avro_2.11-4.0.0.jar, using the Maven coordinates below.
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Note: A built-in Avro data source was introduced in Spark 2.4. See SPARK-24768:
Have a built-in AVRO data source implementation
This means the steps above, which rely on the Databricks package, are no longer necessary from that version onwards.
See the Spark 2.4.0 release notes.

Spark Avro Integration:
Spark can read and write the Avro format through the spark-avro module. The spark-avro library was originally developed by Databricks as an open source library. The module is external and is not included in spark-submit or spark-shell by default, so it has to be specified explicitly when submitting a Spark job.
The following sections explain how to integrate Spark with the Avro data format.
Spark version >= 2.4
From the Spark 2.4 release onwards, Spark SQL provides built-in support for reading and writing Apache Avro data.
Maven Dependency:
https://mvnrepository.com/artifact/org.apache.spark/spark-avro
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>2.4.5</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
SparkShell:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
Example:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
case class Employee(id: Long, name: String, salary: Float, deptId: Int)
object SparkAvroWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Write Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeList = List(
      Employee(1, "Ranga", 10000, 1),
      Employee(2, "Vinod", 1000, 1),
      Employee(3, "Nishanth", 500000, 2),
      Employee(4, "Manoj", 25000, 1),
      Employee(5, "Yashu", 1600, 1),
      Employee(6, "Raja", 50000, 2)
    )
    val employeeDF = spark.createDataFrame(employeeList)
    employeeDF.coalesce(1).write.format("avro").mode("overwrite").save("employees.avro")
    spark.close()
  }
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object SparkAvroReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeDF = spark.read.format("avro").load("employees.avro")
    employeeDF.printSchema()
    employeeDF.foreach(employee => { println(employee) })
    spark.close()
  }
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.4/SparkAvro
Spark version < 2.4
In Spark versions < 2.4, we need to specify the Avro format explicitly as com.databricks.spark.avro; otherwise we get the error org.apache.spark.sql.AnalysisException: Failed to find data source: avro.
Maven Dependency:
Spark Version    Compatible version of Avro Data Source for Spark
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1
2.0 - 2.1        3.2.0
2.2 - 2.3        4.0.0
https://mvnrepository.com/artifact/com.databricks/spark-avro
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...
SparkShell:
./bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0 ...
Examples:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
case class Employee(id: Long, name: String, salary: Float, deptId: Int)
object SparkAvroWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Write Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeList = List(
      Employee(1, "Ranga", 10000, 1),
      Employee(2, "Vinod", 1000, 1),
      Employee(3, "Nishanth", 500000, 2),
      Employee(4, "Manoj", 25000, 1),
      Employee(5, "Yashu", 1600, 1),
      Employee(6, "Raja", 50000, 2)
    )
    val employeeDF = spark.createDataFrame(employeeList)
    employeeDF.coalesce(1).write.format("com.databricks.spark.avro").mode("overwrite").save("employees.avro")
    spark.close()
  }
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object SparkAvroReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val employeeDF = spark.read.format("com.databricks.spark.avro").load("employees.avro")
    employeeDF.printSchema()
    employeeDF.foreach(employee => { println(employee) })
    spark.close()
  }
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.3/SparkAvro
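The version rules described in this answer can be summarized in a small lookup. The sketch below is plain Python and only mirrors the format names, package coordinates, and compatibility table given above; the function name `avro_settings` is hypothetical, and it is an assumption that an artifact exists for every Scala version you might pass in.

```python
# (minimum Spark version, Databricks spark-avro version), newest first,
# copied from the compatibility table in this answer.
DATABRICKS_AVRO_VERSIONS = [
    ((2, 2), "4.0.0"),
    ((2, 0), "3.2.0"),
    ((1, 4), "2.0.1"),
    ((1, 3), "1.0.0"),
    ((1, 2), "0.2.0"),
]


def avro_settings(spark_version, scala_version):
    """Return (format name, --packages coordinate) for a Spark version."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (2, 4):
        # Spark 2.4 onwards: short format name, org.apache.spark module.
        package = "org.apache.spark:spark-avro_{}:{}".format(
            scala_version, spark_version)
        return "avro", package
    for minimum, package_version in DATABRICKS_AVRO_VERSIONS:
        if (major, minor) >= minimum:
            package = "com.databricks:spark-avro_{}:{}".format(
                scala_version, package_version)
            return "com.databricks.spark.avro", package
    raise ValueError("No spark-avro version listed for Spark " + spark_version)


print(avro_settings("2.4.5", "2.12"))
print(avro_settings("2.3.0", "2.11"))
```

The first element is what goes into format(...), and the second is what goes after --packages on the spark-submit or spark-shell command line.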
That's all, folks!
