PySpark: refer to a table created using SQL - apache-spark

When I create a table using SQL in Spark, for example:
sql('CREATE TABLE example AS SELECT a, b FROM c')
How can I pull that table into the Python namespace (I can't think of a better term) so that I can update it? Let's say I want to replace the NaN values in the table like so:
import pyspark.sql.functions as F
table = sql('SELECT * FROM example')
for column in table.columns:
    table = table.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))
Does this operation update the original example table created with SQL? If I were to run sql('SELECT * FROM example').show(), would I see the updated results? When the original CREATE TABLE example ... SQL runs, is example automatically added to the Python namespace?

The sql function returns a new DataFrame, so the original table is not modified. If you want to write a DataFrame's contents into a table created in Spark, do it like this:
table.write.mode("append").saveAsTable("example")
But what you are doing actually changes the schema of the table; in that case, register the DataFrame as a temporary view and create a new table from it:
table.createOrReplaceTempView("mytempTable")
sql("CREATE TABLE example2 AS SELECT * FROM mytempTable")
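Putting the pieces together, a minimal end-to-end sketch might look like this, written with spark.sql explicitly and assuming every column is of float/double type (isnan is only defined for those), reusing the example / example2 names from above:
import pyspark.sql.functions as F

table = spark.sql('SELECT * FROM example')
for column in table.columns:
    # replace NaN with NULL; isnan only works on float/double columns
    table = table.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

table.createOrReplaceTempView("mytempTable")
spark.sql("CREATE TABLE example2 AS SELECT * FROM mytempTable")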

Related

How to SELECT multiple columns dynamically in DolphinDB?

Below is part of my table schema.
I can use the statement select factor.column(2) as code from factor to get the second column.
I wonder if I can SELECT the 2nd column, and the 4th to the last column.
Using metaprogramming is a good choice.
First get the column names of the table using the function columnNames.
Select the required column names by index and join them.
Then build metacode with the function sqlCol, and create and execute the SQL statement with the functions sql and eval.
factor=table(2015.01.15 as date,`00000.SZ as code,-1.05 as factor_value,1.1 as factor01,1.2 as factor02)
colNames = factor.columnNames()
finalColNames = colNames[1] join colNames[3:]
sql(sqlCol(finalColNames), factor).eval()
code factor01 factor02
-------- -------- --------
00000.SZ 1.1 1.2

Try to avoid shuffle by manual control of table read per executor

I have:
a really huge (let's say hundreds of TB) Iceberg table B, which is partitioned by main_col, truncate[N, stamp]
a small table S with columns main_col, stamp_as_key
I want to get a dataframe (actually table) with logic:
b = spark.read.table(B)
s = spark.read.table(S)
df = b.join(F.broadcast(s), (b.main_col == s.main_col) & (s.stamp_as_key - W0 <= b.stamp) & (b.stamp <= s.stamp_as_key + W0))
df = df.groupby('main_col', 'stamp_as_key').agg(make_some_transformations)
I want to avoid a shuffle when reading table B. Iceberg has metadata tables describing all the Parquet files in a table and their contents. What seems possible to do (a rough sketch follows the list below):
read only the metadata table of B
join it with table S
repartition by the expected columns
collect the S3 paths of the real B data files
read those files from the executors independently
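A rough sketch of that idea in PySpark. The table identifiers db.B / db.S and the partition field name main_col inside the files metadata table are assumptions, and reading the Parquet files directly bypasses Iceberg's delete files, so it is only safe for append-only tables:
from pyspark.sql import functions as F

files_meta = spark.read.table("db.B.files")   # Iceberg files metadata table
s = spark.read.table("db.S")

# keep only data files whose partition value matches a key present in S
matching = (files_meta
            .select("file_path", F.col("partition.main_col").alias("main_col"))
            .join(F.broadcast(s.select("main_col").distinct()), "main_col"))

paths = [r.file_path for r in matching.select("file_path").collect()]

# read the selected data files directly; no shuffle is triggered by the read itself
b_subset = spark.read.parquet(*paths)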
Is there a better way to make this work? Also, I can change the schema of table B if needed, but main_col should stay as the first partitioner.
One more question: suppose I have such a dataframe and I saved it as a table. I need to join such tables efficiently. Am I correct that this is also impossible to do without a shuffle in classic Spark code?
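On the last question: one classic Spark way to get a shuffle-free sort-merge join is to persist both sides as tables bucketed (and sorted) by the join key with the same number of buckets. Whether it fits here depends on the range condition on stamp; the DataFrame names, table names, and bucket count in this sketch are only illustrative:
# both sides bucketed identically on the join key
df_left.write.bucketBy(256, "main_col").sortBy("main_col").saveAsTable("left_bucketed")
df_right.write.bucketBy(256, "main_col").sortBy("main_col").saveAsTable("right_bucketed")

# a join on main_col between two identically bucketed tables can then
# be planned as a sort-merge join without an exchange
joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "main_col")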

Azure Databricks Delta Table modifies the TIMESTAMP format while writing from Spark DataFrame

I am new to Azure Databricks. I am trying to write a dataframe output to a Delta table which contains a TIMESTAMP column, but strangely the TIMESTAMP pattern changes after writing to the Delta table.
My DataFrame output column holds the value in this format: 2022-05-13 17:52:09.771
But after writing it to the table, the column value is getting populated as
2022-05-13T17:52:09.771+0000
I am using the code below to generate this DataFrame output:
val pretsUTCText = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val tsUTCText: String = pretsUTCText.format(ts)
val tsUTCCol: Column = lit(tsUTCText)
val df = df2.withColumn("ts_utc", to_timestamp(timestampConverter.tsUTCCol, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) // "ts_utc" is a placeholder column name
The dataframe output returns 2022-05-13 17:52:09.771 as the TIMESTAMP pattern,
but after writing it to the Delta table I see the same value getting populated as 2022-05-13T17:52:09.771+0000.
I could not find any solution. Thanks in advance.
I have just found the same behaviour on Databricks as you, and it behaves differently from what the Databricks documentation describes. It seems that after some version Databricks shows the timezone by default, which is why you see the additional +0000. I think you can use the date_format function when you display the data if you don't want it. Also, I don't think you need 'Z' in the format text, as it stands for the timezone.
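For example, if the goal is just to render the value without the timezone suffix, date_format can format the timestamp as a plain string. A minimal sketch in PySpark (the question's code is Scala; the df and ts_utc names here are placeholders):
from pyspark.sql import functions as F

# render the timestamp column as text without the timezone suffix
display_df = df.withColumn("ts_display", F.date_format(F.col("ts_utc"), "yyyy-MM-dd HH:mm:ss.SSS"))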

dynamically create columns in pyspark sql select statement

I have a PySpark dataframe called unique_attributes. The dataframe has columns productname, productbrand, producttype, weight, id. I am partitioning by some columns and trying to get the first value of the id column using a window function. I would like to be able to dynamically pass a list of columns to partition by, so that, for example, if I wanted to add the weight column to the partition I would not have to code another col('weight') in the select, but could just pass a list instead. Does anyone have a suggestion for how to accomplish this? I have an example below.
current code:
w2 = Window().partitionBy(['productname',
                           'productbrand',
                           'producttype']).orderBy(unique_attributes.id.asc())
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'), first("id", True).over(w2).alias('matchid')).distinct()
desired dynamic code:
column_list = ['productname',
               'productbrand',
               'producttype',
               'weight']
w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())
# somehow creates
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'), col('weight'), first("id", True).over(w2).alias('matchid')).distinct()
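One way to fill in the "somehow creates" step is to unpack the column list in the select with a comprehension. A minimal sketch using the names from the question:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

column_list = ['productname', 'productbrand', 'producttype', 'weight']

w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())

first_item_id_df = unique_attributes\
    .select(*[col(c) for c in column_list],
            first("id", True).over(w2).alias('matchid')).distinct()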

How to insert csv data into an existing SQL table

I have created a database and a table (table1) using SQL syntax and executed them using spark.sql:
spark.sql("CREATE TABLE table1...");
I also loaded CSV file data into a dataframe using:
Dataset<Row> firstDF = spark.read().format("csv").load("C:/file.csv");
Now I use the following code to populate the existing table with the CSV data:
firstDF.toDF().writeTo("table1").append();
But when I select all from table1:
Dataset<Row> firstDFRes = spark.sql("SELECT * FROM table1");
firstDFRes.show();
I get it back empty (only the schema of the table, with no data).
My question is: how do I populate an existing SQL table with a dataframe?
PS: using DataFrameWriter's insertInto or saveAsTable will create the table using the CSV data and ignore the schema of the SQL-created table.
Thank you.
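One approach that may help, assuming the CSV columns line up positionally with table1's definition: read the CSV with the table's schema and use DataFrameWriter.insertInto, which writes into an existing table by position instead of recreating it. A PySpark sketch (the question uses Java; the schema below is hypothetical):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical schema matching table1's column order
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

first_df = spark.read.format("csv").schema(schema).load("C:/file.csv")

# insertInto requires the table to already exist and matches columns by position
first_df.write.insertInto("table1")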
