I have a DataFrame (df_1) with the following schema:
|-- Column1: string (nullable = true)
|-- Column2: string (nullable = true)
|-- Column3: long (nullable = true)
|-- Column4: double (nullable = true)
The type of df_1 is pyspark.sql.dataframe.DataFrame.
I want to create a new column, Rank, which ranks the rows according to the window function (security_window) defined below:
import pyspark.sql.functions as F
from pyspark.sql import Window
security_window = Window.partitionBy(F.col("Column1"), F.col("Column2")).orderBy(F.col("Column3")).rangeBetween(-20, 0)
df_1.withColumn('Rank', F.rank().over(security_window))
However, when I use this window function with the mentioned DataFrame (df_1),
I get the following AnalysisException. Does anyone know what the cause could be?
pyspark.sql.utils.AnalysisException: Window Frame RANGE BETWEEN 20 PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
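As the message says, rank() (like the other ranking functions) only accepts the default frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the explicit rangeBetween(-20, 0) is what triggers the exception. Below is a minimal sketch of the two ways this usually gets resolved, assuming the ranking itself doesn't need the bounded range; the CountInRange column is just a hypothetical example of pairing the bounded frame with an aggregate instead of rank().
import pyspark.sql.functions as F
from pyspark.sql import Window

# rank() only works with the default frame, so define the window without rangeBetween
security_window = Window.partitionBy("Column1", "Column2").orderBy("Column3")
df_1 = df_1.withColumn("Rank", F.rank().over(security_window))

# if the -20..0 range frame is really needed, pair it with an aggregate
# (e.g. count of rows whose Column3 lies within the preceding 20 units), not with rank()
bounded_window = security_window.rangeBetween(-20, 0)
df_1 = df_1.withColumn("CountInRange", F.count("*").over(bounded_window))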
Related
I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an array-type object to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. But when writing the data frame to CSV, I get an error when using the explode function; from what I understand, it isn't possible to do this with two variables in the same select. Can someone help me with an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
         .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this
| started_at | finished_at | is_trial | ws10_max |
|------------|-------------|----------|----------|
| First      | row         | row      |          |
| Second     | row         | row      |          |
Thank you!
Use explode on array and select("struct.*") on struct.
df.select("trial", "id", explode('history').alias('history')),
.select('id', 'history.*', 'trial.*'))
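For context, here is a self-contained sketch of how that answer slots into the write step from the question (the output path and the df2 name are taken from the question; format("csv") is the built-in CSV source in Spark 2.x+). After the two selects, df2 contains only flat columns, so the CSV writer no longer complains.
from pyspark.sql.functions import explode

df2 = (df.select("trial", "id", explode("history").alias("history"))
         .select("id", "history.*", "trial.*"))

df2.printSchema()  # id, started_at, finished_at, is_trial, ws10_max: all flat types

(df2.write.format("csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/output/"))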
I have a data frame with a schema that looks like this:
root
|-- key: string (nullable = true)
|-- column1: string (nullable = true)
|-- column2: string (nullable = true)
|-- column3: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
and my window looks like
w = Window.partitionBy("key").orderBy("timestamp")
I would like to add a new column, calculation, to my data frame, which requires doing some logic with column1, 2 and 3 from both the current row and the previous row in the window. Something like
df2 = df.withColumn("calculation", my_udf(currentCol1, currentCol2, currentCol3, lastCol1, lastCol2, lastCol3))
lag only allows getting one column from the previous row at a time, and I don't think applying the window function 3 times to get all 3 previous values is an appropriate approach. Is there a way I can get the entire previous row (all of its columns)?
To get all the previous rows, you can use Window.unboundedPreceding in the window frame.
w = Window.partitionBy("key").orderBy("timestamp").rowsBetween(Window.unboundedPreceding, 0) # 0 is the current row
But to apply a UDF over a window, I think the option you have is to use the expensive collect_list function. You also need to do it for every column your UDF needs. For example:
import pyspark.sql.functions as F
w = Window.partitionBy("key").orderBy("timestamp").rowsBetween(Window.unboundedPreceding, 0) # 0 is the current row
df = df.withColumn('result', your_udf(F.collect_list('col1').over(w).alias('a'),
F.collect_list('col2').over(w).alias('b')))
I don't know how your UDF works, so this might not work for you.
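As an alternative sketch (not part of the answer above): lag can carry the whole previous row in one pass if you wrap the columns in a struct first. The previous column name below is just illustrative, and my_udf is the question's UDF, so it would need to accept the six values as in the question's pseudo-code.
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy("key").orderBy("timestamp")

# pack the three columns into one struct and lag that struct by one row
df = df.withColumn("previous", F.lag(F.struct("column1", "column2", "column3")).over(w))

# the UDF can now see both the current and the previous values;
# "previous" is NULL on the first row of each key
df = df.withColumn(
    "calculation",
    my_udf("column1", "column2", "column3",
           "previous.column1", "previous.column2", "previous.column3"),
)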
My Cassandra table's columns are lower case, like below:
CREATE TABLE model_family_by_id(
model_family_id int PRIMARY KEY,
model_family text,
create_date date,
last_update_date date,
model_family_name text
);
My DataFrame schema is like this:
root
|-- MODEL_FAMILY_ID: decimal(38,10) (nullable = true)
|-- MODEL_FAMILY: string (nullable = true)
|-- CREATE_DATE: timestamp (nullable = true)
|-- LAST_UPDATE_DATE: timestamp (nullable = true)
|-- MODEL_FAMILY_NAME: string (nullable = true)
So while inserting into Cassandra I am getting the error below:
Exception in thread "main" java.util.NoSuchElementException: Columns not found in table sample_cbd.model_family_by_id: MODEL_FAMILY_ID, MODEL_FAMILY, CREATE_DATE, LAST_UPDATE_DATE, MODEL_FAMILY_NAME
at com.datastax.spark.connector.SomeColumns.selectFrom(ColumnSelector.scala:44)
If I understand the source code correctly, the Spark Cassandra Connector wraps the column names in double quotes, so they become case-sensitive and don't match the names in the CQL definition.
You need to change the schema of your DataFrame: either run withColumnRenamed on it for every column, or use select with an alias for every column.
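For example, a minimal sketch that lower-cases every column name before the write (both variants do the same thing, pick one):
from pyspark.sql.functions import col

# option 1: rename every column in place
for c in df.columns:
    df = df.withColumnRenamed(c, c.lower())

# option 2 (equivalent): select every column with a lower-case alias
# df = df.select([col(c).alias(c.lower()) for c in df.columns])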
I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales dataframe (0 records); both DAY_ID columns have the same type (string).
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales df is populated from a 20GB file.
Using the same code with a small file of a few KB, the join works fine and I can see the results. The empty result dataframe occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array:_*))
.show()
What is wrong with the inner join?
Try to broadcast days_df and then apply the inner join. As days_df is tiny compared to the other table, broadcasting will help.
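A sketch of that suggestion in PySpark form (the question's code is Scala, where the same broadcast hint is available from org.apache.spark.sql.functions; sales and days_df are the question's DataFrames):
from pyspark.sql.functions import broadcast

# hint Spark to broadcast the tiny days_df so the join runs as a
# broadcast hash join instead of a shuffled join
filtered_sales = sales.join(broadcast(days_df), ["DAY_ID"])
filtered_sales.show()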
I have a very large DataFrame with this schema:
root
|-- id: string (nullable = true)
|-- ext: array (nullable = true)
| |-- element: integer (containsNull = true)
So far I have tried to explode the data, then collect_list:
select
id,
collect_list(cast(item as string))
from default.dual
lateral view explode(ext) t as item
group by
id
But this way is too expensive.
You can simply cast the ext column to a string array:
df = source.withColumn("ext", source.ext.cast("array<string>"))
df.printSchema()
df.show()
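The same cast should also work directly in Spark SQL, if you would rather stay in the query from the question (default.dual is the question's table; spark is an existing SparkSession):
df = spark.sql("SELECT id, CAST(ext AS array<string>) AS ext FROM default.dual")
df.printSchema()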