How to switch names between columns in Delta Table - Databricks? - apache-spark

What is the most effective way to switch the names of two columns in Delta Lake? Let's assume that I have the following columns:
Address | Name
And I'd like to swap names, to have:
Name | Address
First I renamed the two columns to temporary names:
spark.read.table("table") \
.withColumnRenamed("address", "name1") \
.withColumnRenamed("name", "address1") \
.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("table")
Then I rename the temporary columns to their final names:
spark.read.table("table”") \
.withColumnRenamed("name1", "name") \
.withColumnRenamed("address1", "address") \
.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("table")

What about just using the toDF function on the DataFrame, which simply sets the new names in place of the existing ones:
spark.read.table("table”") \
.toDF("name", "address")
.write....
If you have more columns, you can adjust this a bit by using a mapping between existing and new names to generate the correct list of columns:
mapping = {"address":"name", "name":"address"}
df = spark.read.table("table")
new_cols = [mapping.get(cl, cl) for cl in df.columns]
df.toDF(*new_cols).write....
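For completeness, here is a minimal sketch of the full round trip under the same assumptions as the snippets above (the table is named "table" and the overwriteSchema option is kept):
mapping = {"address": "name", "name": "address"}

df = spark.read.table("table")
# Swap names where the mapping applies, keep the rest unchanged
new_cols = [mapping.get(cl, cl) for cl in df.columns]

df.toDF(*new_cols) \
    .write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("table")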

Related

Can I exclude the column used for partitioning when writing to Parquet?

I need to create Parquet files by reading from JDBC. The table is quite big and all columns are varchars, so I created a new column with a random int to use for partitioning.
My JDBC read looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").partitionBy('random_number').parquet("parquetfile.parquet")
The generated Parquet output also contains the 'random_number' column, but I only created that column for partitioning. Is there a way to exclude it when writing the Parquet files?
Thanks for any help, I'm new to Spark :)
I'm expecting to exclude the random_number column, but I don't know whether that is possible when I need the column for partitioning.
If you want to repartition in memory by a column without writing it, you can call .repartition(col("random_number")) before writing, drop the column, and then write your data:
from pyspark.sql.functions import col

data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn', 'random_number') \
.option('lowerBound', '1') \
.option('upperBound', '200') \
.option('numPartitions', '200') \
.load() \
.repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")
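As a quick sanity check (a sketch reusing the output path from the question), reading the written files back should show a schema without random_number:
# The dropped column should no longer appear in the written Parquet files
sparkSession.read.parquet("parquetfile.parquet").printSchema()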

Is there any way to print out the exact column name and value that lead to the record going into corrupt records in spark

I have used PERMISSIVE mode to catch corrupt records based on the schema. It works, but I want to know the exact column of a given record that caused it to land in the corrupt records.
df = spark.read \
.format("csv") \
.option("mode", "PERMISSIVE") \
.option("header", "true") \
.option("timestampFormat", "yyyy-mm-dd HH.mm.ss") \
.option("dateFormat", "yyyy-mm-dd") \
.option("delimiter", ",") \
.option("escapeQuotes", "true") \
.option("multiLine", "true") \
.option("columnNameOfCorruptRecord","_corrupt_record") \
.schema(final_schema) \
.load(s3path)
Currently, I get the whole record in the _corrupt_record column.
Not exactly the column name, but the reason a record was rejected can be seen by using badRecordsPath instead of PERMISSIVE + columnNameOfCorruptRecord. Using badRecordsPath generates JSON output containing the path, the reason for rejection, and the record itself:
df = spark.read \
.format("csv") \
.option("header", "true") \
.option("timestampFormat", "yyyy-mm-dd HH.mm.ss") \
.option("dateFormat", "yyyy-mm-dd") \
.option("delimiter", ",") \
.option("escapeQuotes", "true") \
.option("multiLine", "true") \
.option("badRecordsPath","<some path>") \
.schema(final_schema) \
.load(s3path)
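The rejected records can then be inspected by reading those JSON files back. This is only a sketch: the exact directory layout under <some path> (a timestamped subfolder containing bad_records files) can vary by Databricks runtime:
# Sketch: each JSON entry carries the offending record plus the reason and path
bad_records = spark.read.json("<some path>/*/bad_records/*")
bad_records.show(truncate=False)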

upsert (merge) delta with spark structured streaming

I need to upsert data in real time (with Spark Structured Streaming) in Python.
The data is read in real time (CSV format) and then written as a Delta table (we want to update the data, which is why we use Delta's MERGE INTO).
I am using the Delta engine with Databricks.
I coded this:
from delta.tables import *
spark = SparkSession.builder \
.config("spark.sql.streaming.schemaInference", "true")\
.appName("SparkTest") \
.getOrCreate()
sourcedf= spark.readStream.format("csv") \
.option("header", True) \
.load("/mnt/user/raw/test_input") #csv data that we read in real time
spark.conf.set("spark.sql.shuffle.partitions", "1")
spark.createDataFrame([], sourcedf.schema) \
.write.format("delta") \
.mode("overwrite") \
.saveAsTable("deltaTable")
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
sourcedf.writeStream \
.format("delta") \
.foreachBatch(upsertToDelta) \
.outputMode("update") \
.option("checkpointLocation", "/mnt/user/raw/checkpoints/output")\
.option("path", "/mnt/user/raw/PARQUET/output") \
.start() \
.awaitTermination()
but nothing gets written to the output path as expected. The checkpoint path gets filled in as expected, and a display of the Delta table gives me results too:
display(table("deltaTable"))
In the Spark UI I see the writeStream step:
sourcedf.writeStream \ .format("delta") \ ....
first at Snapshot.scala:156+details
RDD: Delta Table State #1 - dbfs:/user/hive/warehouse/deltatable/_delta_log
Any idea how to fix this so I can upsert CSV data into Delta tables on S3 in real time with Spark?
Best regards
Apologies for a late reply, but just in case anyone else has the same problem: the code below worked for me. I wonder if it is because you didn't use "cloudFiles" on your readStream to make use of Auto Loader:
%python
sourcedf= spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("cloudFiles.includeExistingFiles","true") \
.schema(csvSchema) \
.load("/mnt/user/raw/test_input")
%sql
CREATE TABLE IF NOT EXISTS deltaTable(
col1 int NOT NULL,
col2 string NOT NULL,
col3 bigint,
col4 int
)
USING DELTA
LOCATION '/mnt/user/raw/PARQUET/output'
%python
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
%python
sourcedf.writeStream \
.format("delta") \
.foreachBatch(upsertToDelta) \
.outputMode("update") \
.option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
.start("/mnt/user/raw/PARQUET/output")

How to include partitioned column in pyspark dataframe read method

I am writing an Avro file based on a Parquet file. I have read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST") \
.load("/Users/rashmik/flight-time.parquet")
Writing data
I have written the file in Avro format as below:
dfParquetRePartitioned.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from the datasink/avro directory. I am using the code below to read from datasink/avro:
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load("datasink/avro/OP_CARRIER=AA")
It reads data successfully, but as expected the OP_CARRIER column is not available in the dfAvro dataframe, since it is a partition column of the first job. Now my requirement is to include the OP_CARRIER field in the second dataframe as well, i.e. in dfAvro. Could somebody help me with this?
I am referring to the Spark documentation, but I am not able to locate the relevant information. Any pointer would be very helpful.
You can replicate the same column value under a different alias:
from pyspark.sql.functions import col

dfParquetRePartitioned.withColumn("OP_CARRIER_1", col("OP_CARRIER")) \
.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
This would give you what you wanted, just under a different alias.
Or you can do it while reading. If the location is dynamic, you can easily append the column from the path:
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load(path).withColumn(newcol[0], lit(newcol[1]))
If the value is static, it's even easier to add it during the read.
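A related option worth mentioning: Spark's partition discovery can retain the partition column automatically if you point the basePath option at the dataset root while loading a single partition directory (a sketch using the paths from the question):
# With basePath anchored at the dataset root, partition discovery keeps
# OP_CARRIER in the schema even though only one partition directory is loaded
dfAvro = spark.read.format("avro") \
    .option("mode", "FAILFAST") \
    .option("basePath", "datasink/avro") \
    .load("datasink/avro/OP_CARRIER=AA")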

Fetch the most recent N records from Elasticsearch using Spark

I want to retrieve the last 50 records inserted into Elasticsearch to compute their average for an anomaly detection project.
This is how I am retrieving data from ES. However, it fetches all the data, not just the last 50 records. Is there any way to do that?
edf = spark \
.read \
.format("org.elasticsearch.spark.sql") \
.option("es.read.metadata", "false") \
.option("es.nodes.wan.only","true") \
.option("es.port","9200")\
.option("es.net.ssl","false")\
.option("es.nodes", "http://localhost") \
.load("anomaly_detection/data")
from pyspark.sql.functions import expr

# GroupBy based on the `sender` column
df3 = edf.groupBy("sender") \
.agg(expr("avg(amount)").alias("avg_amount"))
Here the aggregation runs over the entire dataset; how do I get only the last 50 rows into the DataFrame?
Input data schema format:
|sender|receiver|amount|
You can also add the query while reading the data as
query='{"query": {"match_all": {}}, "size": 50, "sort": [{"_timestamp": {"order": "desc"}}]}'
and pass it as
edf = spark \
.read \
.format("org.elasticsearch.spark.sql") \
.option("es.read.metadata", "false") \
.option("es.nodes.wan.only","true") \
.option("es.port","9200")\
.option("es.net.ssl","false")\
.option("es.nodes", "http://localhost") \
.option("query", query)
.load("anomaly_detection/data")
