My Cassandra table columns are in lower case, like below:
CREATE TABLE model_family_by_id (
    model_family_id int PRIMARY KEY,
    model_family text,
    create_date date,
    last_update_date date,
    model_family_name text
);
My DataFrame schema is like this:
root
|-- MODEL_FAMILY_ID: decimal(38,10) (nullable = true)
|-- MODEL_FAMILY: string (nullable = true)
|-- CREATE_DATE: timestamp (nullable = true)
|-- LAST_UPDATE_DATE: timestamp (nullable = true)
|-- MODEL_FAMILY_NAME: string (nullable = true)
So while inserting into Cassandra I am getting the below error:
Exception in thread "main" java.util.NoSuchElementException: Columns not found in table sample_cbd.model_family_by_id: MODEL_FAMILY_ID, MODEL_FAMILY, CREATE_DATE, LAST_UPDATE_DATE, MODEL_FAMILY_NAME
at com.datastax.spark.connector.SomeColumns.selectFrom(ColumnSelector.scala:44)
If I understand the source code correctly, the Spark Connector wraps the column names in double quotes, so they become case-sensitive and don't match the names in the CQL definition.
You need to change the schema of your DataFrame: either run withColumnRenamed on it for every column, or use select with an alias for every column.
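For example, a minimal PySpark sketch of both options, assuming df is the DataFrame shown above:
# Either rename every column to lower case one by one...
for c in df.columns:
    df = df.withColumnRenamed(c, c.lower())
# ...or build the lower-cased schema in a single select with aliases.
from pyspark.sql.functions import col
df = df.select([col(c).alias(c.lower()) for c in df.columns])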
Does creating a Spark DataFrame and saving it in Parquet format guarantee that the order of columns in the parquet file will be preserved?
Ex) A Spark DataFrame is created with columns A, B, C, and saved as Parquet. When the Parquet files are read, will the column order always be A, B, C?
I've noticed that if I save a Spark DataFrame, and then read the parquet files, the column order is preserved:
df.select("A", "B", "C").write.save(...)
df = spark.read.load(...)
df.printSchema()
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
Then, if I save by selecting a different order of columns, and then read the parquet files, I can see the order is also what I expect:
df.select("C", "B", "A").write.save(...)
df = spark.read.load(...)
df.printSchema()
root
|-- C: string (nullable = true)
|-- B: string (nullable = true)
|-- A: string (nullable = true)
However, I can't seem to find any documentation supporting this, and the comments on this post, Is there a possibility to keep column order when reading parquet?, have conflicting information.
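That said, if a fixed column order matters downstream, a minimal sketch like the one below (the path is a placeholder) sidesteps the question entirely by selecting the columns explicitly after the read:
df = spark.read.load("/path/to/parquet")  # hypothetical path
df = df.select("A", "B", "C")             # order is now guaranteed by the select, not by the file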
I am trying to read the ORC file of a managed Hive table using the below PySpark code.
spark.read.format('orc').load('hive managed table path')
When I do a printSchema on the fetched DataFrame, it is as follows:
root
|-- operation: integer (nullable = true)
|-- originalTransaction: long (nullable = true)
|-- bucket: integer (nullable = true)
|-- rowId: long (nullable = true)
|-- currentTransaction: long (nullable = true)
|-- row: struct (nullable = true)
| |-- col1: float (nullable = true)
| |-- col2: integer (nullable = true)
|-- partition_by_column: date (nullable = true)
Now I am not able to parse this data or do any manipulation on the DataFrame. When applying an action like show(), I am getting an error saying:
java.lang.IllegalArgumentException: Include vector the wrong length
Did someone face the same issue? If yes, can you please suggest how to resolve it?
It's a known issue.
You get that error because you're trying to read a Hive ACID table, but Spark still doesn't have support for this.
Maybe you can export your Hive table to normal ORC files and then read them with Spark, or try alternatives like Hive JDBC as described here.
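As a rough sketch of the Hive JDBC route (the URL and table name below are placeholders, and the Hive JDBC driver jar has to be on the Spark classpath):
df = (spark.read.format("jdbc")
      .option("url", "jdbc:hive2://hive-server:10000/default")  # hypothetical host/database
      .option("dbtable", "db.managed_table")                    # hypothetical table name
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .load())
df.show()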
As I am not sure about the versions, you can try other ways to load the ORC file.
Using SQLContext:
val df = sqlContext.read.format("orc").load(orcfile)
OR
val df= spark.read.option("inferSchema", true).orc("filepath")
Or Spark SQL (recommended):
import spark.sql
sql("SELECT * FROM table_name").show()
I'm trying to convert JSON files to Parquet with very few transformations (adding a date), but I then need to partition this data before saving it to Parquet.
I'm hitting a wall in this area.
Here is the creation process of the table:
df_temp = spark.read.json(data_location) \
.filter(
cond3
)
df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
Then, regarding saving the converted data:
df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')
but this generates the following error:
pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n ;'
the schema being:
root
|-- action_id: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- duration: long (nullable = true)
|-- initial_value: string (nullable = true)
|-- item_class: string (nullable = true)
|-- set_value: string (nullable = true)
|-- start_time: string (nullable = true)
|-- stop_time: string (nullable = true)
|-- undo_event: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- date: date (nullable = true)
Thus I tried to change the CREATE TABLE to:
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
But this creates an error like:
...mismatched input 'PARTITIONED' expecting ...
So I discovered that PARTITIONED BY doesn't work with LIKE, but I'm running out of ideas.
If I use USING instead of LIKE, I get the error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'
How am I supposed to add a partition when creating the table?
PS: Once the schema of the table is defined with the partitions, I want to simply use:
df_final.write.format("parquet").insertInto('duration')
I finally figured out how to do it with Spark.
df_temp = spark.read.json(...)
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("""
CREATE TABLE IF NOT EXISTS {1}
USING PARQUET
PARTITIONED BY (customer_id, date)
LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
""".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')
I don't know why, but I can't use insertInto: it picks up a weird customer_id out of nowhere and doesn't append the different dates.
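A likely explanation (not verified against this exact setup): insertInto matches columns by position, not by name, so the DataFrame has to be in the table's column order, with the partition columns last. A minimal sketch, reusing the duration table from above:
# Align the DataFrame with the table's column order before the positional insert.
table_cols = spark.table("duration").columns
df_final.select(*table_cols).write.insertInto("duration")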
I have a DataFrame (df_1) with the following schema:
|-- Column1: string (nullable = true)
|-- Column2: string (nullable = true)
|-- Column3: long (nullable = true)
|-- Column4: double (nullable = true)
The type of df_1 is "pyspark.sql.dataframe.DataFrame".
I want to create a new column, Rank, which ranks the rows as per the window function (security_window) defined below:
import pyspark.sql.functions as F
from pyspark.sql import Window
window = Window.partitionBy(F.col("Column1"), F.col("Column2")).orderBy(F.col("Column3")).rangeBetween(-20, 0)
df_1.withColumn('Rank',F.rank().over(window))
However, when I use this window function with the mentioned DataFrame (df_1),
I face the following AnalysisException. Does anyone know what the cause can be?
pyspark.sql.utils.AnalysisException: Window Frame RANGE BETWEEN 20 PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
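The message itself points at the cause: rank() only supports the default frame, so it can't be combined with a custom rangeBetween. A minimal sketch of two ways around it, assuming df_1 from above:
import pyspark.sql.functions as F
from pyspark.sql import Window
# Option 1: a plain rank per (Column1, Column2), with the custom frame dropped.
rank_window = Window.partitionBy("Column1", "Column2").orderBy("Column3")
df_ranked = df_1.withColumn("Rank", F.rank().over(rank_window))
# Option 2: keep the -20..0 range, but use an aggregate that accepts a frame,
# e.g. counting the rows whose Column3 lies within the preceding 20 units.
range_window = Window.partitionBy("Column1", "Column2").orderBy("Column3").rangeBetween(-20, 0)
df_counted = df_1.withColumn("cnt_last_20", F.count("*").over(range_window))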
I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales DataFrame (0 records); both DAY_ID columns have the same type (string).
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales df is populated from a 20 GB file.
Using the same code with a small file of a few KB, the join works fine and I can see the results. The empty result DataFrame occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array:_*))
.show()
What is wrong with the inner join?
Try to broadcast days_df and then apply the inner join. As days_df is tiny compared to the sales table, broadcasting will help.
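A minimal PySpark sketch of the idea (the Scala equivalent wraps days_df in org.apache.spark.sql.functions.broadcast):
from pyspark.sql.functions import broadcast
# Hint Spark to ship the small frame to every executor and join locally.
filtered_sales = sales.join(broadcast(days_df), on="DAY_ID", how="inner")
filtered_sales.show()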