PySpark: refer to a table created using SQL - apache-spark

When I create a table using SQL in Spark, for example:
sql('CREATE TABLE example AS SELECT a, b FROM c')
How can I pull that table into the Python namespace (I can't think of a better term) so that I can update it? Let's say I want to replace the NaN values in the table like so:
import pyspark.sql.functions as F
table = sql('SELECT * FROM example')
for column in table.columns:
    table = table.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))
Does this operation update the original example table created with SQL? If I were to run sql('SELECT * FROM example').show(), would I see the updated results? When the original CREATE TABLE example ... SQL runs, is example automatically added to the Python namespace?

The sql function returns a new DataFrame, so the original table is not modified. If you want to write a DataFrame's contents into a table created in Spark, do it like this:
table.write.mode("append").saveAsTable("example")
But what you are doing actually changes the schema of the table; in that case, register the DataFrame as a temporary view and create a new table from it:
table.createOrReplaceTempView("mytempTable")
sql("CREATE TABLE example2 AS SELECT * FROM mytempTable")
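Putting the pieces together, a minimal end-to-end sketch might look like this, written with spark.sql explicitly and assuming every column is of float/double type (isnan is only defined for those), reusing the example / example2 names from above:
import pyspark.sql.functions as F

table = spark.sql('SELECT * FROM example')
for column in table.columns:
    # replace NaN with NULL; isnan only works on float/double columns
    table = table.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

table.createOrReplaceTempView("mytempTable")
spark.sql("CREATE TABLE example2 AS SELECT * FROM mytempTable")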

Related

How to SELECT multiple columns dynamically in DolphinDB?

Below is part of my table schema.
I can use the statement select factor.column(2) as code from factor to get the second column.
I wonder if I can SELECT the 2nd column, and the 4th to the last column.
Using metaprogramming is a good choice.
First get the column names of the table using the function columnNames.
Select the required column names by index and join them.
Then build metacode with the function sqlCol, and create and execute the SQL statement with the functions sql and eval.
factor=table(2015.01.15 as date,`00000.SZ as code,-1.05 as factor_value,1.1 as factor01,1.2 as factor02)
colNames = factor.columnNames()
finalColNames = colNames[1] join colNames[3:]
sql(sqlCol(finalColNames), factor).eval()
code factor01 factor02
-------- -------- --------
00000.SZ 1.1 1.2

Try to avoid shuffle by manual control of table read per executor

I have:
a really huge (let's say hundreds of TB) Iceberg table B, which is partitioned by main_col, truncate[N, stamp]
a small table S with columns main_col, stamp_as_key
I want to get a dataframe (actually table) with logic:
b = spark.read.table(B)
s = spark.read.table(S)
df = b.join(F.broadcast(s), (b.main_col == s.main_col) & (s.stamp_as_key - W0 <= b.stamp) & (b.stamp <= s.stamp_as_key + W0))
df = df.groupby('main_col', 'stamp_as_key').agg(make_some_transformations)
I want to avoid a shuffle when reading table B. Iceberg has metadata tables describing all the Parquet files in a table and their contents. What seems possible to do (a rough sketch follows the list below):
read only the metadata table of B
join it with table S
repartition by the expected columns
collect the S3 paths of the real B data files
read those files from the executors independently
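A rough sketch of that idea in PySpark. The table identifiers db.B / db.S and the partition field name main_col inside the files metadata table are assumptions, and reading the Parquet files directly bypasses Iceberg's delete files, so it is only safe for append-only tables:
from pyspark.sql import functions as F

files_meta = spark.read.table("db.B.files")   # Iceberg files metadata table
s = spark.read.table("db.S")

# keep only data files whose partition value matches a key present in S
matching = (files_meta
            .select("file_path", F.col("partition.main_col").alias("main_col"))
            .join(F.broadcast(s.select("main_col").distinct()), "main_col"))

paths = [r.file_path for r in matching.select("file_path").collect()]

# read the selected data files directly; no shuffle is triggered by the read itself
b_subset = spark.read.parquet(*paths)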
Is there a better way to make this work? Also, I can change the schema of table B if needed, but main_col should stay as the first partitioner.
One more question: suppose I have such a dataframe and I saved it as a table. I need to join such tables efficiently. Am I correct that this is also impossible to do without a shuffle in classic Spark code?
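On the last question: one classic Spark way to get a shuffle-free sort-merge join is to persist both sides as tables bucketed (and sorted) by the join key with the same number of buckets. Whether it fits here depends on the range condition on stamp; the DataFrame names, table names, and bucket count in this sketch are only illustrative:
# both sides bucketed identically on the join key
df_left.write.bucketBy(256, "main_col").sortBy("main_col").saveAsTable("left_bucketed")
df_right.write.bucketBy(256, "main_col").sortBy("main_col").saveAsTable("right_bucketed")

# a join on main_col between two identically bucketed tables can then
# be planned as a sort-merge join without an exchange
joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "main_col")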

Azure Databricks Delta Table modifies the TIMESTAMP format while writing from Spark DataFrame

I am new to Azure Databricks. I am trying to write a dataframe output to a Delta table which contains a TIMESTAMP column, but strangely the TIMESTAMP pattern changes after writing to the Delta table.
My DataFrame output column holds the value in this format: 2022-05-13 17:52:09.771
But after writing it to the table, the column value is getting populated as
2022-05-13T17:52:09.771+0000
I am using the code below to generate this DataFrame output:
val pretsUTCText = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val tsUTCText: String = pretsUTCText.format(ts)
val tsUTCCol: Column = lit(tsUTCText)
val df = df2.withColumn("ts_utc", to_timestamp(timestampConverter.tsUTCCol, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) // "ts_utc" is a placeholder column name
The dataframe output returns 2022-05-13 17:52:09.771 as the TIMESTAMP pattern,
but after writing it to the Delta table I see the same value getting populated as 2022-05-13T17:52:09.771+0000.
I could not find any solution. Thanks in advance.
I have just found the same behaviour on Databricks as you, and it behaves differently from what the Databricks documentation describes. It seems that after some version Databricks shows the timezone by default, which is why you see the additional +0000. I think you can use the date_format function when you display the data if you don't want it. Also, I don't think you need 'Z' in the format text, as it stands for the timezone.
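For example, if the goal is just to render the value without the timezone suffix, date_format can format the timestamp as a plain string. A minimal sketch in PySpark (the question's code is Scala; the df and ts_utc names here are placeholders):
from pyspark.sql import functions as F

# render the timestamp column as text without the timezone suffix
display_df = df.withColumn("ts_display", F.date_format(F.col("ts_utc"), "yyyy-MM-dd HH:mm:ss.SSS"))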

dynamically create columns in pyspark sql select statement

I have a PySpark dataframe called unique_attributes. The dataframe has columns productname, productbrand, producttype, weight, id. I am partitioning by some columns and trying to get the first value of the id column using a window function. I would like to be able to dynamically pass a list of columns to partition by, so that, for example, if I wanted to add the weight column to the partition I would not have to code another col('weight') in the select, but could just pass a list instead. Does anyone have a suggestion for how to accomplish this? I have an example below.
current code:
w2 = Window().partitionBy(['productname',
                           'productbrand',
                           'producttype']).orderBy(unique_attributes.id.asc())
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'), first("id", True).over(w2).alias('matchid')).distinct()
desired dynamic code:
column_list = ['productname',
               'productbrand',
               'producttype',
               'weight']
w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())
# somehow creates
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'), col('weight'), first("id", True).over(w2).alias('matchid')).distinct()
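One way to fill in the "somehow creates" step is to unpack the column list in the select with a comprehension. A minimal sketch using the names from the question:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

column_list = ['productname', 'productbrand', 'producttype', 'weight']

w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())

first_item_id_df = unique_attributes\
    .select(*[col(c) for c in column_list],
            first("id", True).over(w2).alias('matchid')).distinct()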

How to insert csv data into an existing SQL table

I have created a database and a table (table1) using SQL syntax and executed them using spark.sql:
spark.sql("CREATE TABLE table1...");
I also loaded CSV file data into a dataframe using:
Dataset<Row> firstDF = spark.read().format("csv").load("C:/file.csv");
Now I use the following code to populate the existing table with the CSV data:
firstDF.toDF().writeTo("table1").append();
But when I select all from table1:
Dataset<Row> firstDFRes = spark.sql("SELECT * FROM table1");
firstDFRes.show();
I get it back empty (only the schema of the table, with no data).
My question is: how do I populate an existing SQL table with a dataframe?
PS: using DataFrameWriter's insertInto or saveAsTable will create the table using the CSV data and ignore the schema of the SQL-created table.
Thank you.
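One approach that may help, assuming the CSV columns line up positionally with table1's definition: read the CSV with the table's schema and use DataFrameWriter.insertInto, which writes into an existing table by position instead of recreating it. A PySpark sketch (the question uses Java; the schema below is hypothetical):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical schema matching table1's column order
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

first_df = spark.read.format("csv").schema(schema).load("C:/file.csv")

# insertInto requires the table to already exist and matches columns by position
first_df.write.insertInto("table1")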
