Index names must be exactly matched currently - featuretools

I am trying to add a koalas dataframe to an EntitySet. Here is the code for it:
subset_kdf_fp_eta_gt_prd.spark.print_schema()
root
|-- booking_code: string (nullable = true)
|-- order_id: string (nullable = true)
|-- restaurant_id: string (nullable = true)
|-- country_id: long (nullable = true)
|-- inferred_prep_time: long (nullable = true)
|-- inferred_wait_time: long (nullable = true)
|-- is_integrated_model: integer (nullable = true)
|-- sub_total: double (nullable = true)
|-- total_quantity: integer (nullable = true)
|-- dish_name: string (nullable = true)
|-- sub_total_in_sgd: double (nullable = true)
|-- city_id: long (nullable = true)
|-- hour: integer (nullable = true)
|-- weekday: integer (nullable = true)
|-- request_time_epoch_utc: timestamp (nullable = true)
|-- year: string (nullable = true)
|-- month: string (nullable = true)
|-- day: string (nullable = true)
|-- is_takeaway: string (nullable = false)
|-- is_scheduled: string (nullable = false)
import featuretools as ft
from woodwork.logical_types import Categorical, Double, Integer, NaturalLanguage, Datetime, Boolean

es = ft.EntitySet(id="koalas_es")
es.add_dataframe(
    dataframe_name="fp_eta_gt_prd",
    dataframe=subset_kdf_fp_eta_gt_prd,
    index="order_id",
    time_index="request_time_epoch_utc",
    already_sorted=False,
    logical_types={
        "booking_code": Categorical,
        "order_id": Categorical,
        "restaurant_id": Categorical,
        "country_id": Double,
        "inferred_prep_time": Double,
        "inferred_wait_time": Double,
        "is_integrated_model": Categorical,
        "sub_total": Double,
        "total_quantity": Integer,
        "dish_name": NaturalLanguage,
        "sub_total_in_sgd": Double,
        "city_id": Categorical,
        "hour": Categorical,
        "weekday": Categorical,
        "request_time_epoch_utc": Datetime,
        "year": Categorical,
        "month": Categorical,
        "day": Categorical,
        "is_takeaway": Categorical,
        "is_scheduled": Categorical,
    },
)
On running this, I am encountering the error Index names must be exactly matched currently. I have double-checked all the field names, index uniqueness, etc. I'm not sure what might be the cause of the error here.
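One way to narrow this down (a debugging sketch only, not a fix — the head()/to_pandas() sampling and the sample size are assumptions) is to run the identical add_dataframe call on a small pandas copy of the data; if that succeeds, the problem is specific to the koalas-backed dataframe rather than the column names or logical types:
import featuretools as ft

# Debugging sketch: pull a small pandas sample and retry the same call.
# If this works, the issue lies with the koalas dataframe (e.g. its underlying
# index), not with the field names or the logical_types mapping.
sample_pdf = subset_kdf_fp_eta_gt_prd.head(1000).to_pandas()

es_debug = ft.EntitySet(id="koalas_es_debug")
es_debug.add_dataframe(
    dataframe_name="fp_eta_gt_prd_sample",
    dataframe=sample_pdf,
    index="order_id",
    time_index="request_time_epoch_utc",
    # reuse the same logical_types mapping as in the snippet above
)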

Related

Pyspark - Expand column with struct of arrays into new columns

I have a DataFrame with a single column which is a struct type and contains an array.
users_tp_df.printSchema()
root
|-- x: struct (nullable = true)
| |-- ActiveDirectoryName: string (nullable = true)
| |-- AvailableFrom: string (nullable = true)
| |-- AvailableFutureAllocation: long (nullable = true)
| |-- AvailableFutureHours: double (nullable = true)
| |-- CreateDate: string (nullable = true)
| |-- CurrentAllocation: long (nullable = true)
| |-- CurrentAvailableHours: double (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Type: string (nullable = true)
| | | |-- Value: string (nullable = true)
I'm trying to convert the CustomFields array column into three columns:
Country;
isExternal;
Service.
So, for example, I have these values:
and the expected final dataframe output for that row will be:
Can anyone please help me in achieving this?
Thank you!
This would work:
import pyspark.sql.functions as F

# Add a row id so the pivoted values can be joined back to their original row
initial_expansion = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")

final_df = initial_expansion.join(
    initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))
                     .select("*", "CustomFields.*")
                     .groupBy("id").pivot("Name").agg(F.first("Value")),
    "id"
).drop("CustomFields")
Sample Input:
Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}
Input Structure:
root
|-- x: struct (nullable = true)
| |-- CurrentAvailableHours: integer (nullable = true)
| |-- CustomFields: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Name: string (nullable = true)
| | | |-- Value: string (nullable = true)
Output:
Output Structure (Id can be dropped):
root
|-- id: long (nullable = false)
|-- CurrentAvailableHours: integer (nullable = true)
|-- Country: string (nullable = true)
|-- Service: string (nullable = true)
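For reference, here is a self-contained sketch of the same explode-then-pivot idea, built from the sample record above with a locally created SparkSession (all names come from the sample, not the real data):
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Two custom fields on one row, mirroring the sample JSON above
df = spark.createDataFrame([
    Row(x=Row(CurrentAvailableHours=2,
              CustomFields=[Row(Name="Country", Value="Italy"),
                            Row(Name="Service", Value="Dev")]))
])

flat = df.withColumn("id", F.monotonically_increasing_id()).select("id", "x.*")
pivoted = (flat
           .withColumn("CustomFields", F.explode("CustomFields"))
           .select("*", "CustomFields.*")
           .groupBy("id")
           .pivot("Name")
           .agg(F.first("Value")))

flat.join(pivoted, "id").drop("CustomFields").show()
# yields one row with columns: id, CurrentAvailableHours, Country, Service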
Considering a mockup structure similar to the one from your example (defined inline in the CTE below), you can do it the SQL way by using the inline function:
with alpha as (
  select named_struct(
    "alpha", "abc",
    "beta", 2.5,
    "gamma", 3,
    "delta", array(
      named_struct("a", "x", "b", "y", "c", "z"),
      named_struct("a", "xx", "b", "yy", "c", "zz")
    )
  ) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c)
from alpha
The result:
alpha  beta  gamma  a    b    c
abc    2.5   3      x    y    z
abc    2.5   3      xx   yy   zz

A schema mismatch detected when writing to the Delta table Data stream write

I have .option("mergeSchema", "true") in my code, but I am still getting a schema mismatch error. I am reading parquet data whose timestamp column is a bigint (epoch milliseconds), so I converted it to a timestamp and then created a new date column, which I want to partition my data on.
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date
from pyspark.sql.types import TimestampType

df = (df.withColumn("_processed_delta_timestamp", F.current_timestamp())
        .withColumn("_input_file_name", F.input_file_name())
        # timestamp arrives as epoch milliseconds, so divide by 1000 before casting
        .withColumn("date", F.date_format(F.date_trunc("Day", (F.col("timestamp") / 1000).cast(TimestampType())), "yyyy-MM-dd"))
        .withColumn("date", to_date(F.col("date"), "yyyy-MM-dd")))

df.writeStream.format("delta") \
    .outputMode("append") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", checkpoint_path) \
    .partitionBy("date") \
    .option("path", output_path) \
    .toTable(f"{output_database_name}.{output_table_name}")
The error that I am getting:
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.
Table schema:
root
-- metric_stream_name: string (nullable = true)
-- account_id: string (nullable = true)
-- region: string (nullable = true)
-- namespace: string (nullable = true)
-- metric_name: string (nullable = true)
-- dimension: struct (nullable = true)
|-- ApiName: string (nullable = true)
-- timestamp: long (nullable = true)
-- value: struct (nullable = true)
|-- max: double (nullable = true)
|-- min: double (nullable = true)
|-- sum: double (nullable = true)
|-- count: double (nullable = true)
-- unit: string (nullable = true)
-- _processed_delta_timestamp: timestamp (nullable = true)
-- _input_file_name: string (nullable = true)
Data schema:
root
-- metric_stream_name: string (nullable = true)
-- account_id: string (nullable = true)
-- region: string (nullable = true)
-- namespace: string (nullable = true)
-- metric_name: string (nullable = true)
-- dimension: struct (nullable = true)
|-- ApiName: string (nullable = true)
-- timestamp: long (nullable = true)
-- value: struct (nullable = true)
|-- max: double (nullable = true)
|-- min: double (nullable = true)
|-- sum: double (nullable = true)
|-- count: double (nullable = true)
-- unit: string (nullable = true)
-- _processed_delta_timestamp: timestamp (nullable = true)
-- _input_file_name: string (nullable = true)
-- date: date (nullable = true)
Partition columns do not match the partition columns of the table.
Given: [`date`]
Table: [`timestamp`]
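The last lines of the error are the real clue: the existing Delta table is partitioned by timestamp, while the stream now writes with partitionBy('date'), and mergeSchema only handles added columns, not a changed partition layout. A possible way out (a sketch only, assuming the existing data may be rewritten in a one-off batch job and the stream is restarted afterwards) is to rebuild the table once with the new partition column:
from pyspark.sql import functions as F

# One-off batch rebuild of the table with the `date` partition column.
# Assumes rewriting the existing data is acceptable; output_database_name and
# output_table_name are the names from the question.
(spark.read.table(f"{output_database_name}.{output_table_name}")
      .withColumn("date", F.to_date((F.col("timestamp") / 1000).cast("timestamp")))
      .write.format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")   # allow schema and partition-layout changes
      .partitionBy("date")
      .saveAsTable(f"{output_database_name}.{output_table_name}"))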

Pyspark Cannot modify a column based on a condition when a column values are in other list

I am using Pyspark 3.0.1
I want to modify the value of a column when it is in a list.
df.printSchema()
root
|-- ID: decimal(4,0) (nullable = true)
|-- Provider: string (nullable = true)
|-- Principal: float (nullable = false)
|-- PRINCIPALBALANCE: float (nullable = true)
|-- STATUS: integer (nullable = true)
|-- Installment Rate: float (nullable = true)
|-- Yearly Percentage: float (nullable = true)
|-- Processing Fee Percentage: double (nullable = true)
|-- Disb Date: string (nullable = true)
|-- ZOHOID: integer (nullable = true)
|-- UPFRONTPROCESSINGFEEBALANCE: float (nullable = true)
|-- WITHHOLDINGTAXBALANCE: float (nullable = true)
|-- UPFRONTPROCESSINGFEEPERCENTAGE: float (nullable = true)
|-- UPFRONTPROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
|-- PROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
|-- PROCESSINGFEEVATPERCENTAGE: float (nullable = true)
|-- BUSINESSSHORTCODE: string (nullable = true)
|-- EXCTRACTIONDATE: timestamp (nullable = true)
|-- fake Fee: double (nullable = false)
|-- fake WHT: string (nullable = true)
|-- fake Fee_WHT: string (nullable = true)
|-- Agency Fee CP: string (nullable = true)
|-- Agency VAT CP: string (nullable = true)
|-- Agency WHT CP: string (nullable = true)
|-- Agency Fee_VAT_WHT CP: string (nullable = true)
|-- write_offs: integer (nullable = false)
df.head(1)
[Row(ID=Decimal('16'), Provider='fake', Principal=2000.01, PRINCIPALBALANCE=0.2, STATUS=4, Installment Rate=0.33333333, Yearly Percentage=600.0, Processing Fee Percentage=0.20, Disb Date=None, ZOHOID=3000, UPFRONTPROCESSINGFEEBALANCE=None, WITHHOLDINGTAXBALANCE=None, UPFRONTPROCESSINGFEEPERCENTAGE=None, UPFRONTPROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEVATPERCENTAGE=16.0, BUSINESSSHORTCODE='20005', EXCTRACTIONDATE=datetime.datetime(2020, 11, 25, 5, 7, 58, 6000), fake Fee=1770.7, fake WHT='312.48', fake Fee_WHT='2,083.18', Agency Fee CP='566.62', Agency VAT CP='566.62', Agency WHT CP='186.39', Agency Fee_VAT_WHT CP='5,394.41')]
The value of the column 'write_offs' is 0 for all rows, and I want to convert it to 1 if the column ID is in the following list: list1 = [299, 570, 73, 401]
Then I am doing:
df.withColumn('write_offs', when((df.filter(df['ID'].isin(list1))),1).otherwise(df['ID']))
and I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-de9f9cd49ea5> in <module>
----> 1 df.withColumn('write_offs', when((df.filter(df['ID'].isin(userinput_write_offs_ids))),lit(1)).otherwise(df['ID']))
/usr/local/spark/python/pyspark/sql/functions.py in when(condition, value)
789 sc = SparkContext._active_spark_context
790 if not isinstance(condition, Column):
--> 791 raise TypeError("condition should be a Column")
792 v = value._jc if isinstance(value, Column) else value
793 jc = sc._jvm.functions.when(condition._jc, v)
TypeError: condition should be a Column
I don't know why it is giving this error, because I did a similar operation where the condition returns a dataframe and it works.
I read how to use this isin function here:
Pyspark isin function
You need a Boolean column for when, not a dataframe:
import pyspark.sql.functions as F

df.withColumn(
    'write_offs',
    F.when(F.col('ID').isin(list1), 1)
     .otherwise(F.col('ID'))
)
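A minimal, self-contained illustration of the same pattern (toy data, and with the otherwise branch keeping the existing write_offs value instead of the ID, which is a slight variation on the answer above):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
list1 = [299, 570, 73, 401]

# Toy dataframe with only ID and write_offs, enough to show the pattern
demo = spark.createDataFrame([(299, 0), (16, 0)], ["ID", "write_offs"])

demo = demo.withColumn(
    "write_offs",
    F.when(F.col("ID").isin(list1), 1).otherwise(F.col("write_offs"))
)
demo.show()
# ID=299 -> write_offs becomes 1; ID=16 -> write_offs stays 0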

could not find implicit value for parameter sparkSession

I have a notebook with the code below that throws the error:
could not find implicit value for parameter sparkSession
import org.apache.spark.sql.{SparkSession, Row, DataFrame}
import org.apache.spark.ml.clustering.KMeans
def createBalancedDataframe(df: DataFrame, reductionCount: Int)(implicit sparkSession: SparkSession) = {
  val kMeans = new KMeans().setK(reductionCount).setMaxIter(30)
  val kMeansModel = kMeans.fit(df)
  import sparkSession.implicits._
  kMeansModel.clusterCenters.toList.map(v => (v, 0)).toDF("features", "label")
}
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
Error:
Name: Compile Error
Message: <console>:82: error: could not find implicit value for parameter sparkSession: org.apache.spark.sql.SparkSession
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)
^
StackTrace:
It would be greatly appreciated if anyone could offer any help; thank you very much in advance.
UPDATE:
Thanks to Reddy's input, after I changed it to
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)(spark)
I receive the following error:
Name: java.lang.IllegalArgumentException
Message: Field "features" does not exist.
Available fields: cc_num, trans_num, trans_time, category, merchant, amt, merch_lat, merch_long, distance, age, is_fraud
StackTrace: Available fields: cc_num, trans_num, trans_time, category, merchant, amt, merch_lat, merch_long, distance, age, is_fraud
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:93)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:254)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:340)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:305)
at createBalancedDataframe(<console>:45)
UPDATE2:
featureDF.printSchema
root
|-- cc_num: long (nullable = true)
|-- category: string (nullable = true)
|-- merchant: string (nullable = true)
|-- distance: double (nullable = true)
|-- amt: integer (nullable = true)
|-- age: integer (nullable = true)
|-- is_fraud: integer (nullable = true)
|-- category_indexed: double (nullable = false)
|-- category_encoded: vector (nullable = true)
|-- merchant_indexed: double (nullable = false)
|-- merchant_encoded: vector (nullable = true)
|-- features: vector (nullable = true)
val fraudDF = featureDF
.filter($"is_fraud" === 1)
.withColumnRenamed("is_fraud", "label")
.select("features", "label")
fraudDF.printSchema
root
|-- cc_num: long (nullable = true)
|-- trans_num: string (nullable = true)
|-- trans_time: string (nullable = true)
|-- category: string (nullable = true)
|-- merchant: string (nullable = true)
|-- amt: integer (nullable = true)
|-- merch_lat: double (nullable = true)
|-- merch_long: double (nullable = true)
|-- distance: double (nullable = true)
|-- age: integer (nullable = true)
|-- is_fraud: integer (nullable = true)
Why is the features column gone???
Assuming you have your SparkSession and it is named spark,
you can pass it explicitly this way:
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)(spark)
or create an implicit reference (spark2 or any name) in the calling environment. Example:
implicit val spark2 = spark
//some calls
// others
val balancedNonFraudDF = createBalancedDataframe(nonFraudDF, fraudCount.toInt)

Dividing rows of dataframe to simple rows in Pyspark

I have this schema and I would like to split the inside of result into columns, in order to have col1: EventCode, col2: Message, etc. I'm using Pyspark; I tried the explode function, but it doesn't seem to work on a StructType. Is there a way to do this in Spark?
root
|-- result: struct (nullable = true)
| |-- EventCode: string (nullable = true)
| |-- Message: string (nullable = true)
| |-- _bkt: string (nullable = true)
| |-- _cd: string (nullable = true)
| |-- _indextime: string (nullable = true)
| |-- _pre_msg: string (nullable = true)
| |-- _raw: string (nullable = true)
| |-- _serial: string (nullable = true)
| |-- _si: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- _sourcetype: string (nullable = true)
| |-- _time: string (nullable = true)
| |-- host: string (nullable = true)
| |-- index: string (nullable = true)
| |-- linecount: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sourcetype: string (nullable = true)
Dividing the rows of this dataframe into simple rows is easy. All you have to do is select all of the struct's columns from the dataframe and assign the result to another dataframe. Something like this:
simpleDF = df.select("result.*")
It will convert the schema given above into the following schema:
simpleDF.printSchema()
root
|-- EventCode: string (nullable = true)
|-- Message: string (nullable = true)
|-- _bkt: string (nullable = true)
|-- _cd: string (nullable = true)
|-- _indextime: string (nullable = true)
|-- _pre_msg: string (nullable = true)
|-- _raw: string (nullable = true)
|-- _serial: string (nullable = true)
|-- _si: array (nullable = true)
| |-- element: string (containsNull = true)
|-- _sourcetype: string (nullable = true)
|-- _time: string (nullable = true)
|-- host: string (nullable = true)
|-- index: string (nullable = true)
|-- linecount: string (nullable = true)
|-- source: string (nullable = true)
|-- sourcetype: string (nullable = true)
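A minimal sketch of the same idea with a made-up nested record standing in for the result struct above (the field values here are invented for the demo):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy dataframe with a single struct column
df = spark.createDataFrame([Row(result=Row(EventCode="4624", Message="An account was logged on"))])

simpleDF = df.select("result.*")  # promotes every field of the struct to a top-level column
simpleDF.printSchema()
# root
#  |-- EventCode: string (nullable = true)
#  |-- Message: string (nullable = true)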
