Can I specify format Spark-Redshift uses to load to S3? - apache-spark

Specifically when I'm reading from an existing Redshift table, how can I specify the format it will use during the load to the temporary directory?
My load looks like this:
data = spark.read.format('com.databricks.spark.redshift') \
.option('url', REDSHIFT_URL_DEV) \
.option('dbtable', 'ods_misc.requests_2014_04') \
.option('tempdir', REDSHIFT_WEBLOG_DIR + '/2014_04') \
.load()
When I look at the data from the default load it looks like csv where it splits the columns across multiple files, so for example col1 col2... are in 0000_part1 and so on.

Related

reading delta table specific file in folder

I am trying to read a specific file from a folder which contain multiple delta files,Please refer attached screenshot
Reason I am looking to read the delta file based on the schema version. The folder mentioned above contains files with different different schema structure.
code snippet for writing a file :
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/home/games/Documents/test_delta/")
Code for reading a delta file
import pyspark[![enter image description here][1]][1]
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
path_to_data = '/home/games/Documents/test_delta/_delta_log/00000000000000000001.json'
df = spark.read.format("delta").load(path_to_data)
df.show()
error :
org.apache.spark.sql.delta.DeltaAnalysisException: /home/games/Documents/test_delta/_delta_log/ is not a Delta table.
You should use:
df = spark.read.format("delta").option("versionAsOf", 0).load(path_to_data)
You can specify other versions instead of 0 depending upon how many times how have overwritten the data. You can also use timestamps. Please see delta quick-start for more info.
Also, the delta_log folder actually contains delta transaction log in json format, not the actual data. The data is present in parent folder (test_delta in your case). The files starting with part-0000 are the ones that contain the actual data. These are .parquet files. There are no files with .delta extensions.

Column names appearing as record data in Pyspark databricks

I'm working on Pyspark python. I downloaded a sample csv file from Kaggle (Covid Live.csv) and the data from the table is as follows when opened in visual code
(Raw CSV data only partial data)
#,"Country,
Other","Total
Cases","Total
Deaths","New
Deaths","Total
Recovered","Active
Cases","Serious,
Critical","Tot Cases/
1M pop","Deaths/
1M pop","Total
Tests","Tests/
1M pop",Population
1,USA,"98,166,904","1,084,282",,"94,962,112","2,120,510","2,970","293,206","3,239","1,118,158,870","3,339,729","334,805,269"
2,India,"44,587,307","528,629",,"44,019,095","39,583",698,"31,698",376,"894,416,853","635,857","1,406,631,776"........
The problem i'm facing here, the column names are also being displayed as records in pyspark databricks console when executed with below code
from pyspark.sql.types import *
df1 = spark.read.format("csv") \
.option("inferschema", "true") \
.option("header", "true") \
.load("dbfs:/FileStore/shared_uploads/mahesh2247#gmail.com/Covid_Live.csv") \
.select("*")
Spark Jobs -->
df1:pyspark.sql.dataframe.DataFrame
#:string
Country,:string
As can be observed above , spark is detecting only two columns # and Country but not aware that 'Total Cases', 'Total Deaths' . . are also columns
How do i tackle this malformation ?
Few ways to go about this.
Fix the header in the csv before reading (should be on a single
line). Also pay attention to quoting and escape settings.
Read in PySpark with manually provided schema and filter out the bad lines.
Read using pandas, skip the first 12 lines. Add proper column names, convert to PySpark dataframe.
So , the solution is pretty simple and does not require you to 'edit' the data manually or anything of those sorts.
I just had to add .option("multiLine","true") \ and the data is displaying as desired!

Issue with Apache Hudi Update and Delete Operation on Parquet S3 File

Here I am trying to simulate updates and deletes over a Hudi dataset and wish to see the state reflected in Athena table. We use EMR, S3 and Athena services of AWS.
Attempting Record Update with a withdrawal object
withdrawalID_mutate = 10382495
updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
.withColumn("accountHolderName", lit("Hudi_Updated"))
updateDF.write.format("hudi") \
.options(**hudi_options) \
.mode("append") \
.save(tablePath)
hudiDF = spark.read \
.format("hudi") \
.load(tablePath).filter(col("withdrawalID") == withdrawalID_mutate).show()
Shows the updated record but it is actually appended in the Athena table. Probably something to do with Glue Catalogue?
Attempting Record Delete
deleteDF = updateDF #deleting the updated record above
deleteDF.write.format("hudi") \
.option('hoodie.datasource.write.operation', 'upsert') \
.option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
.options(**hudi_options) \
.mode("append") \
.save(tablePath)
still reflects the deleted record in the Athena table
Also tried using mode("overwrite") but as expected it deletes the older partitions and keeps only the latest.
Did anyone faced same issue and can guide in the right direction

Delta lake in databricks - creating a table for existing storage

I currently have an append table in databricks (spark 3, databricks 7.5)
parsedDf \
.select("somefield", "anotherField",'partition', 'offset') \
.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(f"/mnt/defaultDatalake/{append_table_name}")
It was created with a create table command before and I don't use INSERT commands to write to it (as seen above)
Now I want to be able to use SQL logic to query it without everytime going through createOrReplaceTempView every time. Is is possible to add a table to the current data without removing it? what changes do I need to support this?
UPDATE:
I've tried:
res= spark.sql(f"CREATE TABLE exploration.oplog USING DELTA LOCATION '/mnt/defaultDataLake/{append_table_name}'")
But get an AnalysisException
You are trying to create an external table exploration.dataitems_oplog
from /mnt/defaultDataLake/specificpathhere using Databricks Delta, but the schema is not specified when the
input path is empty.
While the path isn't empty.
Starting with Databricks Runtime 7.0, you can create table in Hive metastore from the existing data, automatically discovering schema, partitioning, etc. (see documentation for all details). The base syntax is following (replace values in <> with actual values):
CREATE TABLE <database>.<table>
USING DELTA
LOCATION '/mnt/defaultDatalake/<append_table_name>'
P.S. there is more documentation on different aspects of the managed vs unmanaged tables that could be useful to read.
P.P.S. Works just fine for me on DBR 7.5ML:

How can I use the saveAsTable function when I have two Spark streams running in parallel in the same notebook?

I have two Spark streams set up in a notebook to run in parallel like so.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df1 = spark \
.readStream.format("delta") \
.table("test_db.table1") \
.select('foo', 'bar')
writer_df1 = df1.writeStream.option("checkpoint_location", checkpoint_location_1) \
.foreachBatch(
lambda batch_df, batch_epoch:
process_batch(batch_df, batch_epoch)
) \
.start()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df2 = spark \
.readStream.format("delta") \
.table("test_db.table2") \
.select('foo', 'bar')
writer_df2 = merchant_df.writeStream.option("checkpoint_location", checkpoint_location_2) \
.foreachBatch(
lambda batch_df, batch_epoch:
process_batch(batch_df, batch_epoch)
) \
.start()
These dataframes then get processed row by row, with each row being sent to an API. If the API call reports an error, I then convert the row into JSON and append this row to a common failures table in databricks.
columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
error_df = spark.createDataFrame(vals, columns)
error_df.select('table_name','record','time_of_failure', 'error_or_status_code').write.format('delta').mode('Append').saveAsTable("failures_db.failures_db)"
When attempting to add the row to this table, the saveAsTable() call here throws the following exception.
py4j.protocol.Py4JJavaError: An error occurred while calling o3578.saveAsTable.
: java.lang.IllegalStateException: Cannot find the REPL id in Spark local properties. Spark-submit and R doesn't support transactional writes from different clusters. If you are using R, please switch to Scala or Python. If you are using spark-submit , please convert it to Databricks JAR job. Or you can disable multi-cluster writes by setting 'spark.databricks.delta.multiClusterWrites.enabled' to 'false'. If this is disabled, writes to a single table must originate from a single cluster. Please check https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions-faq for more details.
If I comment out one of the streams and re-run the notebook, any errors from the API calls get inserted into the table with no issues. I feel like there's some configuration I need to add but am not sure of where to go from here.
Not sure if this is the best solution, but I believe the problem comes from each stream writing to the table at the same time. I split this table into separate tables for each stream and it worked after that.

Resources