Error trying to apply app_types to logical_types of add_dataframe at once - featuretools

app_types
{'TARGET': Boolean,
'FLAG_MOBIL': Boolean,
'FLAG_EMP_PHONE': Boolean,
'FLAG_WORK_PHONE': Boolean,
'FLAG_CONT_MOBILE': Boolean,
'FLAG_PHONE': Boolean,
'FLAG_EMAIL': Boolean,
'REG_REGION_NOT_LIVE_REGION': Boolean,
'REG_REGION_NOT_WORK_REGION': Boolean,
'LIVE_REGION_NOT_WORK_REGION': Boolean,
'REG_CITY_NOT_LIVE_CITY': Boolean,
'REG_CITY_NOT_WORK_CITY': Boolean,
'LIVE_CITY_NOT_WORK_CITY': Boolean,
'FLAG_DOCUMENT_2': Boolean,
'FLAG_DOCUMENT_3': Boolean,
'FLAG_DOCUMENT_4': Boolean,
'FLAG_DOCUMENT_5': Boolean,
'FLAG_DOCUMENT_6': Boolean,
'FLAG_DOCUMENT_7': Boolean,
'FLAG_DOCUMENT_8': Boolean,
'FLAG_DOCUMENT_9': Boolean,
'FLAG_DOCUMENT_10': Boolean,
'FLAG_DOCUMENT_11': Boolean,
'FLAG_DOCUMENT_12': Boolean,
'FLAG_DOCUMENT_13': Boolean,
'FLAG_DOCUMENT_14': Boolean,
'FLAG_DOCUMENT_15': Boolean,
'FLAG_DOCUMENT_16': Boolean,
'FLAG_DOCUMENT_17': Boolean,
'FLAG_DOCUMENT_18': Boolean,
'FLAG_DOCUMENT_19': Boolean,
'FLAG_DOCUMENT_20': Boolean,
'FLAG_DOCUMENT_21': Boolean,
'REGION_RATING_CLIENT': Ordinal,
'REGION_RATING_CLIENT_W_CITY': Ordinal}
I am trying to apply the above dictionary, app_types, to the logical_types parameter of add_dataframe all at once. So I wrote the following code, but I get the following error:
import featuretools as ft
es = ft.EntitySet(id="clients")
es = es.add_dataframe(dataframe_name="app_train", dataframe=app_train,
index="SK_ID_CURR", logical_types=app_types)
TypeError: Must use an Ordinal instance with order values defined
I want to apply all the logical_types at once; how can I do that? This error only occurs for the ordinal columns.
The REGION_RATING_CLIENT and REGION_RATING_CLIENT_W_CITY columns of app_train only contain the values 1, 2, and 3 and are of integer type.

Thank you for your question.
In order for the column to be assigned the Ordinal logical type, you must provide an order argument, specifying the ordering of values from low to high.
Since you mentioned the columns only take on the values 1, 2, and 3, you can try changing the last two entries in the app_types dictionary to:
'REGION_RATING_CLIENT': Ordinal(order=[1, 2, 3])
'REGION_RATING_CLIENT_W_CITY': Ordinal(order=[1, 2, 3])
You can reverse the order array if you want 3 to be considered lowest, and 1 to be considered highest.
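For reference, a minimal sketch of the corrected setup, assuming the logical types are imported from woodwork.logical_types (the typing library featuretools uses) and that app_train is already loaded:
from woodwork.logical_types import Boolean, Ordinal
import featuretools as ft

# Same dictionary as before, except the two Ordinal entries now define an order
app_types['REGION_RATING_CLIENT'] = Ordinal(order=[1, 2, 3])
app_types['REGION_RATING_CLIENT_W_CITY'] = Ordinal(order=[1, 2, 3])

es = ft.EntitySet(id="clients")
es = es.add_dataframe(dataframe_name="app_train", dataframe=app_train,
                      index="SK_ID_CURR", logical_types=app_types)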

Related

Apache Spark - Performance with and without using Case Classes

I have 2 datasets, customers and orders
I want to join them on the customer key.
I tried two approaches, one using case classes and one without.
Using case classes (takes forever to complete, almost 11 minutes):
case class Customer(custKey: Int, name: String, address: String, phone: String, acctBal: String, mktSegment: String, comment: String) extends Serializable
case class Order(orderKey: Int, custKey: Int, orderStatus: String, totalPrice: Double, orderDate: String, orderQty: String, clerk: String, shipPriority: String, comment: String) extends Serializable
val customers = sc.textFile("customersFile").map(row => row.split('|')).map(cust => (cust(0).toInt, Customer(cust(0).toInt, cust(1), cust(2), cust(3), cust(4), cust(5), cust(6))))
val orders = sc.textFile("ordersFile").map(row => row.split('|')).map(order => (order(1).toInt, Order(order(0).toInt, order(1).toInt, order(2), order(3).toDouble, order(4), order(5), order(6), order(7), order(8))))
orders.join(customers).take(1)
Without case classes (completes in a few seconds):
val customers = sc.textFile("customersFile").map(row => row.split('|'))
val orders = sc.textFile("ordersFile").map(row => row.split('|'))
val customersByCustKey = customers.map(row => (row(0), row)) // customer key is the first column in customers rdd, hence row(0)
val ordersByCustKey = orders.map(row => (row(1), row)) // customer key is the second column in orders rdd, hence row(1)
ordersByCustKey.join(customersByCustKey).take(1)
Is this due to the time taken for serialization/deserialization when using case classes?
If yes, in which cases is it recommended to use case classes?
Job details using case classes:
Job details without case classes:

Spark CBO not showing rowcount for queries having partition column in query

I'm working on Spark 2.3.0, using the Cost Based Optimizer (CBO) to compute statistics for queries on external tables.
I have created an external table in Spark:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
eventID string,type string,exchange string,eventTimestamp bigint,sequenceNumber bigint
,optionID string,orderID string,side string,routingFirm string,routedOrderID string
,session string,price decimal(18,8),quantity bigint,timeInForce string,handlingInstructions string
,orderAttributes string,isGloballyUnique boolean,originalOrderID string,initiator string,leavesQty bigint
,symbol string,routedOriginalOrderID string,displayQty bigint,orderType string,coverage string
,result string,resultTimestamp bigint,nbbPrice decimal(18,8),nbbQty bigint,nboPrice decimal(18,8)
,nboQty bigint,reporter string,quoteID string,noteType string,definedNoteData string,undefinedNoteData string
,note string,desiredLeavesQty bigint,displayPrice decimal(18,8),workingPrice decimal(18,8),complexOrderID string
,complexOptionID string,cancelQty bigint,cancelReason string,openCloseIndicator string,exchOriginCode string
,executingFirm string,executingBroker string,cmtaFirm string,mktMkrSubAccount string,originalOrderDate string
,tradeID string,saleCondition string,executionCodes string,buyDetails_side string,buyDetails_leavesQty bigint
,buyDetails_openCloseIndicator string,buyDetails_quoteID string,buyDetails_orderID string,buyDetails_executingFirm string,buyDetails_executingBroker string,buyDetails_cmtaFirm string,buyDetails_mktMkrSubAccount string,buyDetails_exchOriginCode string,buyDetails_liquidityCode string,buyDetails_executionCodes string,sellDetails_side string,sellDetails_leavesQty bigint,sellDetails_openCloseIndicator string,sellDetails_quoteID string,sellDetails_orderID string,sellDetails_executingFirm string,sellDetails_executingBroker string,sellDetails_cmtaFirm string,sellDetails_mktMkrSubAccount string,sellDetails_exchOriginCode string,sellDetails_liquidityCode string,sellDetails_executionCodes string,tradeDate int,reason string,executionTimestamp bigint,capacity string,fillID string,clearingNumber string
,contraClearingNumber string,buyDetails_capacity string,buyDetails_clearingNumber string,sellDetails_capacity string
,sellDetails_clearingNumber string,receivingFirm string,marketMaker string,sentTimestamp bigint,onlyOneQuote boolean
,originalQuoteID string,bidPrice decimal(18,8),bidQty bigint,askPrice decimal(18,8),askQty bigint,declaredTimestamp bigint,revokedTimestamp bigint,awayExchange string,comments string,clearingFirm string )
PARTITIONED BY (date integer ,reporteIDs string ,version integer )
STORED AS PARQUET LOCATION '/home/test/'
I have computed statistics on the columns using the following command:
val df = spark.read.parquet("/home/test/")
val cols = df.columns.mkString(",")
val analyzeDDL = s"Analyze table events compute statistics for columns $cols"
spark.sql(analyzeDDL)
Now, when I try to get the statistics for the query:
val query = "Select * from test where date > 20180222"
It's giving me only the size and not the rowCount:
scala> val exec = spark.sql(query).queryExecution
exec: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('date > 20180222)
+- 'UnresolvedRelation `test`
== Analyzed Logical Plan ==
eventID: string, type: string, exchange: string, eventTimestamp: bigint, sequenceNumber: bigint, optionID: string, orderID: string, side: string, routingFirm: string, routedOrderID: string, session: string, price: decimal(18,8), quantity: bigint, timeInForce: string, handlingInstructions: string, orderAttributes: string, isGloballyUnique: boolean, originalOrderID: string, initiator: string, leavesQty: bigint, symbol: string, routedOriginalOrderID: string, displayQty: bigint, orderType: string, ... 82 more fields
Project [eventID#797974, type#797975, exchange#797976, eventTimestamp#797977L, sequenceNumber#...
scala>
scala> val stats = exec.optimizedPlan.stats
stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=1.0 B, hints=none)
Am I missing any steps here? How can I get the row count for the query?
Spark-version : 2.3.0
Files in the table are in parquet format.
Update
I'm able to get the statistics for a CSV file, but not for a Parquet file.
The difference between the execution plans is that for CSV we get a HiveTableRelation, while for Parquet it is a Relation.
Any idea why that is?

Insert overwrite data count miss-match in PySpark for every second run

I am using PySpark (version 2.1.1) shell to run my ETL code.
The last few lines of my PySpark ETL code look like this:
usage_fact = usage_fact_stg.union(gtac_usage).union(gtp_usage).union(upaf_src).repartition("data_date","data_product")
usage_fact.createOrReplaceTempView("usage_fact_staging")
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
Now, when the last line (the insert overwrite) is executed for the first time, the code runs fine and the output table (usageWideFactTable) has about 2.4 million rows, which is expected.
If I execute the last line again, I get the warning shown below and the row count of the output table (usageWideFactTable) drops to 0.84 million.
If I execute the last line a third time, it surprisingly runs fine and the count of the output table (usageWideFactTable) is corrected, coming back to 2.4 million.
On the fourth run, the warning appears again and count(*) of the output table comes to 0.84 million.
The same four runs in the PySpark shell are shown below:
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
I have tried running the same ETL job using Oozie as well, but every second Oozie run shows a count mismatch.
The DDL of the output table (usageWideFactTable = datawarehouse.usage_fact) is shown below:
CREATE EXTERNAL TABLE `datawarehouse.usage_fact`(
`mcs_session_id` string,
`meeting_id` string,
`session_tracking_id` string,
`session_type` string,
`session_subject` string,
`session_date` string,
`session_start_time` string,
`session_end_time` string,
`session_duration` double,
`product_name` string,
`product_tier` string,
`product_version` string,
`product_build_number` string,
`native_user_id` string,
`native_participant_id` string,
`native_participant_user_id` string,
`participant_name` string,
`participant_email` string,
`participant_type` string,
`participant_start_time` timestamp,
`participant_end_time` timestamp,
`participant_duration` double,
`participant_ip` string,
`participant_city` string,
`participant_state` string,
`participant_country` string,
`participant_end_point` string,
`participant_entry_point` string,
`os_type` string,
`os_ver` string,
`os_locale` string,
`os_architecture` string,
`os_timezone` string,
`model_id` string,
`machine_address` string,
`model_name` string,
`browser` string,
`browser_version` string,
`audio_type` string,
`voip_duration` string,
`pstn_duration` string,
`webcam_duration` string,
`screen_share_duration` string,
`is_chat_used` string,
`is_screenshare_used` string,
`is_dialout_used` string,
`is_webcam_used` string,
`is_webinar_scheduled` string,
`is_webinar_deleted` string,
`is_registrationquestion_create` string,
`is_registrationquestion_modify` string,
`is_registrationquestion_delete` string,
`is_poll_created` string,
`is_poll_modified` string,
`is_poll_deleted` string,
`is_survey_created` string,
`is_survey_deleted` string,
`is_handout_uploaded` string,
`is_handout_deleted` string,
`entrypoint_access_time` string,
`endpoint_access_time` string,
`panel_connect_time` string,
`audio_connect_time` string,
`endpoint_install_time` string,
`endpoint_download_time` string,
`launcher_install_time` string,
`launcher_download_time` string,
`join_time` string,
`likely_to_recommend` string,
`rating_reason` string,
`customer_support` string,
`native_machinename_key` string,
`download_status` string,
`native_plan_key` string,
`useragent` string,
`native_connection_key` string,
`active_time` string,
`csid` string,
`arrival_time` string,
`closed_by` string,
`close_cause` string,
`viewer_ip_address` string,
`viewer_os_type` string,
`viewer_os_ver` string,
`viewer_build` string,
`native_service_account_id` string,
`license_key` string,
`session_id` string,
`session_participant_id` string,
`featureusagefactid` string,
`join_session_fact_id` string,
`responseid` string,
`sf_data_date` string,
`spf_data_date` string,
`fuf_data_date` string,
`jsf_data_date` string,
`nps_data_date` string,
`upaf_data_date` string,
`data_source_name` string,
`data_load_date_time` timestamp)
PARTITIONED BY (
`data_date` string,
`data_product` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3://saasdata/datawarehouse/fact/UsageFact/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://saasdata/datawarehouse/fact/UsageFact/'
What might be the issue, and how do I rectify it? Also, is there any other way to achieve the same thing?
I assume this question is related to your previous one, PySpark insert overwrite issue, and boils down to the error you experienced before:
Cannot overwrite a path that is also being read from
It is thrown to protect you from data loss, not to make your life harder. There are cases where Spark won't be able to enforce this automatically, but let me repeat the point I made here: never overwrite (fully or partially) data that is used as a source for the pipeline.
At best it will cause complete data loss (hopefully you have a good backup policy).
At worst it will silently corrupt the data.
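If the staging view is derived from the same table you are overwriting, one common workaround is to materialize the intermediate result to a separate location first, so the insert overwrite never reads from the location it writes to. A rough sketch, assuming you can spare an intermediate S3 path (the staging path below is hypothetical):
# Sketch only: break the read/overwrite cycle by persisting the staged result
# to a separate location before overwriting the target table.
# The staging path is hypothetical; adjust it to your environment.
staging_path = "s3://saasdata/datawarehouse/staging/UsageFact/"

usage_fact = usage_fact_stg.union(gtac_usage).union(gtp_usage).union(upaf_src) \
    .repartition("data_date", "data_product")

# Persist the intermediate result and read it back, so the view used by the
# insert overwrite is backed by the staging location, not the target table.
usage_fact.write.mode("overwrite").parquet(staging_path)
spark.read.parquet(staging_path).createOrReplaceTempView("usage_fact_staging")

fact = spark.sql("insert overwrite table " + usageWideFactTable +
                 " partition (data_date, data_product) select * from usage_fact_staging")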

Merging two CSV Files in Azure Data Factory by using custom .NET activity

I have two CSV files, each containing many columns. I have to merge these two CSV files into a single CSV file using one column that is common to both input files.
I browsed through all the blogs and sites, and all of them point to using a custom .NET activity, so I went through this site.
But I am still not able to figure out the C# coding part. Can anyone share the code for how to merge these two CSV files using a custom .NET activity in Azure Data Factory?
Here is an example of how to join those two tab-separated files on Zip_Code column using U-SQL. This example assumes both files are held in Azure Data Lake Storage (ADLS). This script could easily be incorporated into a Data Factory pipeline:
// Get raw input from file A
#inputA =
EXTRACT
Date_received string,
Product string,
Sub_product string,
Issue string,
Sub_issue string,
Consumer_complaint_narrative string,
Company_public_response string,
Company string,
State string,
ZIP_Code string,
Tags string,
Consumer_consent_provided string,
Submitted_via string,
Date_sent_to_company string,
Company_response_to_consumer string,
Timely_response string,
Consumer_disputed string,
Complaint_ID string
FROM "/input/input48A.txt"
USING Extractors.Tsv();
// Get raw input from file B
#inputB =
EXTRACT Provider_ID string,
Hospital_Name string,
Address string,
City string,
State string,
ZIP_Code string,
County_Name string,
Phone_Number string,
Hospital_Type string,
Hospital_Ownership string,
Emergency_Services string,
Meets_criteria_for_meaningful_use_of_EHRs string,
Hospital_overall_rating string,
Hospital_overall_rating_footnote string,
Mortality_national_comparison string,
Mortality_national_comparison_footnote string,
Safety_of_care_national_comparison string,
Safety_of_care_national_comparison_footnote string,
Readmission_national_comparison string,
Readmission_national_comparison_footnote string,
Patient_experience_national_comparison string,
Patient_experience_national_comparison_footnote string,
Effectiveness_of_care_national_comparison string,
Effectiveness_of_care_national_comparison_footnote string,
Timeliness_of_care_national_comparison string,
Timeliness_of_care_national_comparison_footnote string,
Efficient_use_of_medical_imaging_national_comparison string,
Efficient_use_of_medical_imaging_national_comparison_footnote string,
Location string
FROM "/input/input48B.txt"
USING Extractors.Tsv();
// Join the two files on the Zip_Code column
#output =
SELECT b.Provider_ID,
b.Hospital_Name,
b.Address,
b.City,
b.State,
b.ZIP_Code,
a.Complaint_ID
FROM #inputA AS a
INNER JOIN
#inputB AS b
ON a.ZIP_Code == b.ZIP_Code
WHERE a.ZIP_Code == "36033";
// Output the file
OUTPUT #output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
This could also be converted into a U-SQL stored procedure with parameters for the filenames and Zip Code.
There are of course many ways to achieve this, each with their own pros and cons. The .NET custom activity, for example, might feel more comfortable for someone with a .NET background, but you'll need some compute to run it on. Importing the files into an Azure SQL Database would be a good option for someone with a SQL / database background and an Azure SQL DB in the subscription.

Convert a DataSet with single column to multiple column dataset in scala

I have a Dataset[String] with the following data:
12348,5,233,234559,4
12348,5,233,234559,4
12349,6,233,234560,5
12350,7,233,234561,6
I want to split each row and convert it to multiple columns named RegionId, PerilId, Date, EventId, ModelId. How do I achieve this?
You mean something like this:
case class NewSet(RegionId: String, PerilId: String, Date: String, EventId: String, ModelId: String)
// Requires import spark.implicits._ in scope for the Dataset encoder
val newDataset = oldDataset.map { s: String =>
  val strings = s.split(",")
  NewSet(strings(0), strings(1), strings(2), strings(3), strings(4))
}
Of course you should probably make the lambda function a little more robust...
If you have the data you specified in an RDD, then converting it to a DataFrame is pretty easy.
case class MyClass(RegionId: String, PerilId: String, Date: String,
                   EventId: String, ModelId: String)
val dataframe = sqlContext.createDataFrame(rdd, classOf[MyClass])
This DataFrame will have all the columns, with column names corresponding to the fields of the case class MyClass.
