How do I output bucketed parquet files in spark? - apache-spark

Background
I have 8k parquet files representing a table that I want to bucket by a particular column, creating a new set of 8k parquet files. I want to do this so that joins from other data sets on the bucketed column won't require re-shuffling. The documentation I'm working off of is here:
https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#bucketing-sorting-and-partitioning
Question
What's the easiest way to output parquet files that are bucketed? I want to do something like this:
df.write()
  .bucketBy(8000, "myBucketCol")
  .sortBy("myBucketCol")
  .format("parquet")
  .save("path/to/outputDir");
But according to the documentation linked above:
Bucketing and sorting are applicable only to persistent tables
I'm guessing I need to use saveAsTable as opposed to save. However, saveAsTable doesn't take a path. Do I need to create a table prior to calling saveAsTable? Is it in that table creation statement that I declare where the parquet files should be written? If so, how do I do that?

spark.sql("drop table if exists myTable");
spark.sql("create table myTable ("
+ "myBucketCol string, otherCol string ) "
+ "using parquet location '" + outputPath + "' "
+ "clustered by (myBucketCol) sorted by (myBucketCol) into 8000 buckets"
);
enlDf.write()
.bucketBy(8000, "myBucketCol")
.sortBy("myBucketCol")
.format("parquet")
.mode(SaveMode.Append)
.saveAsTable("myTable");

You can use the path option:
df.write()
  .bucketBy(8000, "myBucketCol")
  .sortBy("myBucketCol")
  .format("parquet")
  .option("path", "path/to/outputDir")
  .saveAsTable("whatever");

Related

create table stored as parquet and compressed with snappy does not work

I have tried to save data to HDFS as Snappy-compressed Parquet:
spark.sql("drop table if exists onehands.parquet_snappy_not_work")
spark.sql(""" CREATE TABLE onehands.parquet_snappy_not_work (`trans_id` INT) PARTITIONED by ( `year` INT) STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY") """)
spark.sql("""insert into onehands.parquet_snappy_not_work values (20,2021)""")
spark.sql("drop table if exists onehands.parquet_snappy_works_well")
val df = spark.createDataFrame(Seq(
  (20, 2021)
)).toDF("trans_id", "year")
df.show()
df.write.format("parquet").partitionBy("year").mode("append").option("compression","snappy").saveAsTable("onehands.parquet_snappy_works_well")
df.write.format("parquet").partitionBy("year").mode("append").option("compression","snappy").saveAsTable("onehands.parquet_snappy_works_well")
but it's not working with the pre-created table:
for onehands.parquet_snappy_not_work the file does not end with .snappy.parquet, while
onehands.parquet_snappy_works_well looks fine:
[***]$ hadoop fs -ls /data/spark/warehouse/onehands.db/parquet_snappy_works_well/year=2021
/data/spark/warehouse/onehands.db/parquet_snappy_works_well/year=2021/part-00000-f5ec0f2d-525f-41c9-afee-ce5589ddfe27.c000.snappy.parquet
[****]$ hadoop fs -ls /data/spark/warehouse/onehands.db/parquet_snappy_not_work/year=2021
/data/spark/warehouse/onehands.db/parquet_snappy_not_work/year=2021/part-00000-85e2a7a5-c281-4960-9786-4c0ea88faf15.c000
even though I have tried adding some properties:
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec;
but it still does not work.
By the way, the SQL I get from "show create table onehands.parquet_snappy_works_well", i.e.
CREATE TABLE `onehands`.`parquet_snappy_works_well` (`trans_id` INT, `year` INT) USING parquet OPTIONS ( `compression` 'snappy', `serialization.format` '1' ) PARTITIONED BY (year)
cannot be run with spark-sql.
Spark version: 2.3.1
Hadoop version: 2.9.2
What's the problem with my code? Thanks for your help.
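
One thing worth checking (a sketch, not verified against Spark 2.3.1): the session-level Parquet codec, which controls what Spark's own writer produces and may take precedence over the TBLPROPERTIES set on the pre-created Hive table.

// Assumption to test: set the writer-side codec before inserting into the
// pre-created table; whether it is honoured depends on how Spark converts
// the metastore Parquet table on write.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.sql("insert into onehands.parquet_snappy_not_work values (20, 2021)")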

databricks overwriting entire table instead of adding new partition

I have this table
CREATE TABLE `db`.`customer_history` (
  `name` STRING,
  `addrress` STRING,
  `filename` STRING,
  `dt` DATE)
USING delta
PARTITIONED BY (dt)
When I use this to load one partition's data into the table:
df.write
  .partitionBy("dt")
  .mode("overwrite")
  .format("delta")
  .saveAsTable("db.customer_history")
For some reason, it overwrites the entire table. I thought the overwrite mode only overwrites the partition data (if it exists). Is my understanding correct?
Delta makes it easy to update certain disk partitions with the replaceWhere option. You can selectively overwrite only the data that matches predicates over partition columns, like this:
# the replaceWhere predicate avoids overwriting Week 3
dataset.write.repartition(1) \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy('Year', 'Week') \
    .option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'") \
    .save("\curataed\dataset")
Note: replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions. (See the Delta Lake documentation for reference.)
In order to overwrite a single partition, use:
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "dt >= '2021-01-01'")
  .save("data_path")

How to get new/updated records from Delta table after upsert using merge?

Is there any way to get updated/inserted rows after upsert using merge to Delta table in spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")

def upsertToDelta(events: DataFrame, batchId: Long): Unit = {
  deltaTable.as("table")
    .merge(
      events.as("event"),
      "event.entityId == table.entityId")
    .whenMatched()
    .updateExpr(...)
    .whenNotMatched()
    .insertAll()
    .execute()
}

df.writeStream
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()
I know I can create another job to read updates from the delta table. But is it possible to do the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table and then have another stream or batch job fetch the changes, so you'll be able to receive information on which rows were changed/deleted/inserted. It can be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
If the table isn't registered, you can use the path instead of the table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading the stream from a table:
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.table("table_name")
and it will add three columns to the table describing the change; the most important is _change_type (note that there are two different change types for an update operation).
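
For the original goal of getting only the inserted and updated rows, a minimal sketch (Scala; "table_name" is a placeholder, and this assumes a Spark/Delta version that supports Change Data Feed and reading streams by table name):

import org.apache.spark.sql.functions.col

// Keep inserts and the post-update image of updated rows;
// drop pre-images and deletes.
val changes = spark.readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .table("table_name")
  .filter(col("_change_type").isin("insert", "update_postimage"))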
If you're worried about having another stream, it's not a problem: you can run multiple streams inside the same job. You just shouldn't block on a single query's .awaitTermination; use something like spark.streams.awaitAnyTermination() to wait on multiple streams.
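
As a sketch of that setup (query names and the console sink are placeholders, and changes refers to the change-feed stream from the previous sketch):

// Start the upsert query from the question and a second query for the change feed.
val upsertQuery = df.writeStream
  .format("delta")
  .foreachBatch(upsertToDelta _)
  .outputMode("update")
  .start()

val cdfQuery = changes.writeStream
  .format("console")  // placeholder sink, e.g. for inspecting the change feed
  .start()

// Block until any of the running queries terminates.
spark.streams.awaitAnyTermination()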
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?

InsertInto(tablename) always saving Dataframe in default database in Hive

Hi, I have two tables in Hive. From the first table I select data, create a DataFrame, and save that DataFrame into the second table in ORC format. Both tables are in the same database.
When I save this DataFrame into the second table, I get a "table not found in database" error, and if I don't use any database name it always creates and saves my DataFrame in the Hive default database. Can someone please explain why it doesn't use the user-defined database and always falls back to the default database? Below is the code I'm using; I'm also on HDP.
// creating hive session
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(sparksession).build()
hive.setDatabase("dbname")

val a = "SELECT 'all columns' from dbname.tablename"
val a1 = hive.executeQuery(a)

a1.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("database", "dbname")
  .option("table", "table_name")
  .mode("Append")
  .insertInto("dbname.table_name")
If I use insertInto(table_name) instead of insertInto(dbname.table_name), it saves the DataFrame to the default database; but if I give dbname.tablename, it reports that the table is not found in the database.
I also tried the same thing using a dbSession:
val dbSession = HiveWarehouseSession.session(sparksession).build()
dbSession.setDatabase("dbname")
Note: my second table (the target table I'm writing to) is partitioned and bucketed.
Since the target table is partitioned and bucketed, add .partitionBy(<list cols>) to the write, for example:

a1.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("database", "dbname")
  .option("table", "table_name")
  .partitionBy(<list cols>)   // <list cols> = the partition columns of the target table
  .mode("Append")
  .insertInto("dbname.table_name")

Spark write to Hive mistaken table_name as Partition spec and throws "Partition spec contains non-partition columns" error

My Hive table was defined with PARTITIONED BY (ds STRING, model STRING)
And when writing to the table in PySpark, I did
output_df \
    .repartition(250) \
    .write \
    .mode('overwrite') \
    .format('parquet') \
    .partitionBy('ds', 'model') \
    .saveAsTable('{table_schema}.{table_name}'.format(table_schema=table_schema,
                                                      table_name=table_name))
However I encountered the following error:
org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException: Partition spec {ds=2019-10-06, model=p1kr, table_name=drv_projection_table} contains non-partition columns
It seems Spark or Hive mistook table_name for a partition column. My S3 path for the table is s3://some_path/qubole/table_name=drv_projection_table, but table_name wasn't specified as part of the partitioning.
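
One way to see which partition columns and location the metastore actually recorded for the table is a DESCRIBE FORMATTED query (a sketch; "table_schema.table_name" is a placeholder for the real identifier used above):

// Prints the table's columns, partition columns, and storage location.
spark.sql("DESCRIBE FORMATTED table_schema.table_name").show(200, truncate = false)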
