Merging two CSV files in Azure Data Factory using a custom .NET activity

I have two CSV files, each containing many columns. I have to merge these two CSV files into a single CSV file by joining on a column that is common to both input files.
I have browsed through many blogs and sites, and all of them point to using a custom .NET activity, so I went through this site.
But I am still not able to figure out the C# coding part. Can anyone share code showing how to merge these two CSV files using a custom .NET activity in Azure Data Factory?

Here is an example of how to join two tab-separated files on a shared ZIP_Code column using U-SQL. This example assumes both files are held in Azure Data Lake Storage (ADLS). The script could easily be incorporated into a Data Factory pipeline:
// Get raw input from file A
@inputA =
EXTRACT
Date_received string,
Product string,
Sub_product string,
Issue string,
Sub_issue string,
Consumer_complaint_narrative string,
Company_public_response string,
Company string,
State string,
ZIP_Code string,
Tags string,
Consumer_consent_provided string,
Submitted_via string,
Date_sent_to_company string,
Company_response_to_consumer string,
Timely_response string,
Consumer_disputed string,
Complaint_ID string
FROM "/input/input48A.txt"
USING Extractors.Tsv();
// Get raw input from file B
@inputB =
EXTRACT Provider_ID string,
Hospital_Name string,
Address string,
City string,
State string,
ZIP_Code string,
County_Name string,
Phone_Number string,
Hospital_Type string,
Hospital_Ownership string,
Emergency_Services string,
Meets_criteria_for_meaningful_use_of_EHRs string,
Hospital_overall_rating string,
Hospital_overall_rating_footnote string,
Mortality_national_comparison string,
Mortality_national_comparison_footnote string,
Safety_of_care_national_comparison string,
Safety_of_care_national_comparison_footnote string,
Readmission_national_comparison string,
Readmission_national_comparison_footnote string,
Patient_experience_national_comparison string,
Patient_experience_national_comparison_footnote string,
Effectiveness_of_care_national_comparison string,
Effectiveness_of_care_national_comparison_footnote string,
Timeliness_of_care_national_comparison string,
Timeliness_of_care_national_comparison_footnote string,
Efficient_use_of_medical_imaging_national_comparison string,
Efficient_use_of_medical_imaging_national_comparison_footnote string,
Location string
FROM "/input/input48B.txt"
USING Extractors.Tsv();
// Join the two files on the Zip_Code column
@output =
SELECT b.Provider_ID,
b.Hospital_Name,
b.Address,
b.City,
b.State,
b.ZIP_Code,
a.Complaint_ID
FROM @inputA AS a
INNER JOIN
@inputB AS b
ON a.ZIP_Code == b.ZIP_Code
WHERE a.ZIP_Code == "36033";
// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
This could also be converted into a U-SQL stored procedure with parameters for the filenames and ZIP code.
There are of course many ways to achieve this, each with its own pros and cons. The custom .NET activity, for example, might feel more comfortable for someone with a .NET background, but you'll need some compute to run it on. Importing the files into an Azure SQL Database would be a good option for someone with a SQL/database background and an Azure SQL DB in their subscription.

Related

Azure HDInsight (cluster): importing a CSV file to storage and creating a table

I am absolutely new to coding - I know the basics, so I'm pulling my hair out on this project.
Ultimately I am attempting to link my Hadoop cluster to Tableau, which is where the bulk of my project will be focused.
I am following this guy.
However, he does not explain exactly how to link the CSV file to the cluster. After a little research, I found that I need to import the data via the cluster.
I have managed to import the CSV file via CloudXplorer. Now I just need to create the tables.
I am having no luck through Ambari (when I try to create a table, the error is "error fetching databases", and it never really uploaded my file in the first place anyway) or in Zeppelin.
My code in Zeppelin follows:
%livy2.spark
//The above magic instructs Zeppelin to use the Livy Scala interpreter
// Create an RDD using the default Spark context, sc
val SearchText = sc.textFile("wasb://test'myname'1#.blob.core.windows.net/sample/stopandsearch.csv")
// Define a schema
case class Search(Type: String, date: String, time: String, LATITUDE: String, LONGITUDE: String, Gender: String, Age_Range: String, Self_defined_Eth: String, Officer_defined_Eth: String, Legislation: String, Obj_Of_Search: String, Outcome: String)
// Map the values in the .csv file to the schema
val Search = SearchText.map(s => s.split(",")).map(
s => Search(s(6),
s(1),
s(7),
s(3),
s(6),
s(7),
s(3),
s(7),
s(12),
s(12),
s(12)
)
).toDF()
Search.registerAsTable("Search")
Search.saveAsTable("Search")
<console>:30: error: recursive value Search needs type
s => Search(s(6),
^
<console>:42: error: value toDF is not a member of org.apache.spark.rdd.RDD[U]
possible cause: maybe a semicolon is missing before `value toDF'?
).toDF()
^
Any suggestions, please? Any shortcut around this would be great; I just need to get the data into nice tables! :)
Thanks in advance.
PS: I have no idea how to get the wasb link, i.e. the HTTP link for the CSV file in the container.
I think this path is not correct:
wasb://test'myname'1#.blob.core.windows.net/sample/stopandsearch.csv
It should follow the format wasb://<container>@<storageaccount>.blob.core.windows.net/<path>, for example:
wasb://test'myname'1@<storageaccount>.blob.core.windows.net/sample/stopandsearch.csv
You are missing the storage account name, and I am assuming that test'myname'1 is the container name.
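For illustration, a Zeppelin paragraph using that form of the URI might look like the sketch below; the container and storage-account names are placeholders, and sc is the Spark context that the Livy interpreter already provides:
%livy2.spark
// Read the CSV using the wasb://<container>@<storageaccount> form of the URI
val searchText = sc.textFile("wasb://<container>@<storageaccount>.blob.core.windows.net/sample/stopandsearch.csv")
searchText.take(5).foreach(println)   // quick sanity check on the first few lines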

Azure Data Lake Analytics job fails reading data from Data Lake Store

I have a CSV file copied from Azure Blob storage to Azure Data Lake Store. The pipeline was established successfully and the file was copied.
I'm trying to write a U-SQL script, starting from the sample here:
Home -> datalakeanalysis1 -> Sample scripts -> New job
It shows me the default script:
//Define schema of file, must map all columns
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM #"/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
OUTPUT #searchlog
TO #"/Samples/Output/SearchLog_output.tsv"
USING Outputters.Tsv();
Note: my file in Data Lake Store is here:
Home -> dls1 -> Data explorer -> rdl1
How can I give the path of my CSV file in the script (my CSV file is stored in Data Lake Store)?
Also, I would like to keep my destination (output) file in Data Lake Store.
How can I modify my script to refer to the Data Lake Store path?
Edit:
I have changed my script as below:
//Define schema of file, must map all columns
@searchlog =
EXTRACT ID1 int,
ID2 int,
Date DateTime,
Rs string,
Rs1 string,
Number string,
Direction string,
ID3 int
FROM #"adl://rdl1.azuredatalakestore.net/blob1/vehicle1_09142014_JR.csv"
USING Extractors.Csv();
OUTPUT #searchlog
TO #"adl://rdl1.azuredatalakestore.net/blob1/vehicle1_09142014_JR1.csv"
USING Outputters.Csv();
However, my job fails with the attached error:
I'm also attaching the CSV file that I want to use in the job.
Sample CSV file
Is there anything wrong with the CSV file, or with my script?
Please help. Thanks.
I believe that while extracting data from the file you can pass some additional parameters to ignore the header row:
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extractor-parameters-u-sql#skipFirstNRows
@searchlog =
EXTRACT ID1 int,
ID2 int,
Date DateTime,
Rs string,
Rs1 string,
Number string,
Direction string,
ID3 int
FROM #"adl://rdl1.azuredatalakestore.net/blob1/vehicle1_09142014_JR.csv"
USING Extractors.Csv(skipFirstNRows:1);
Modifying the input file may not be possible in all scenarios, especially if the input file is being dropped by stakeholders that you cannot control.
I followed your steps and reproduced your issue.
My sample data:
ID1,ID2,Date,Rs,Rs1,Number,Direction,ID3
1,1,9/14/2014 0:00,46.81006,-92.08174,51,S,1
1,2,9/14/2014 0:00,46.81006,-92.08174,13,NE,1
1,3,9/14/2014 0:00,46.81006,-92.08174,48,NE,1
1,4,9/14/2014 0:00,46.81006,-92.08174,30,W,1
Based on the error log, I found that it can't parse the header row. So I removed the header row and everything works fine.
Modified data:
1,1,9/14/2014 0:00,46.81006,-92.08174,51,S,1
1,2,9/14/2014 0:00,46.81006,-92.08174,13,NE,1
1,3,9/14/2014 0:00,46.81006,-92.08174,48,NE,1
1,4,9/14/2014 0:00,46.81006,-92.08174,30,W,1
Usql script :
//Define schema of file, must map all columns
@searchlog =
EXTRACT ID1 int,
ID2 int,
Date DateTime,
Rs string,
Rs1 string,
Number string,
Direction string,
ID3 int
FROM #"/test/data.csv"
USING Extractors.Csv();
OUTPUT #searchlog
TO #"/testOutput/dataOutput.csv"
USING Outputters.Csv();
Output:
Hope it helps you.

Insert overwrite data count mismatch in PySpark on every second run

I am using PySpark (version 2.1.1) shell to run my ETL code.
The last few lines of my PySpark ETL code look like this:
usage_fact = usage_fact_stg.union(gtac_usage).union(gtp_usage).union(upaf_src).repartition("data_date","data_product")
usage_fact.createOrReplaceTempView("usage_fact_staging")
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
When the last line (the insert overwrite) is executed for the first time, the code runs fine and the output table (usageWideFactTable) has about 2.4 million rows, which is expected.
If I execute the last line again, I get the warning shown below and the count of the output table (usageWideFactTable) drops to 0.84 million.
If I execute the last line a third time, it surprisingly runs fine again and the count of the output table (usageWideFactTable) comes back to 2.4 million.
On the fourth run, the warning appears again and count(*) of the output table comes to 0.84 million.
The same four runs in the PySpark shell are shown below:
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 08:41:59 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
>>> fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact_staging")
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
18/04/20 09:12:17 WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
I have tried running the same ETL job using Oozie as well, but every second Oozie run shows the same count mismatch.
The DDL of the output table (usageWideFactTable = datawarehouse.usage_fact) is shown below:
CREATE EXTERNAL TABLE `datawarehouse.usage_fact`(
`mcs_session_id` string,
`meeting_id` string,
`session_tracking_id` string,
`session_type` string,
`session_subject` string,
`session_date` string,
`session_start_time` string,
`session_end_time` string,
`session_duration` double,
`product_name` string,
`product_tier` string,
`product_version` string,
`product_build_number` string,
`native_user_id` string,
`native_participant_id` string,
`native_participant_user_id` string,
`participant_name` string,
`participant_email` string,
`participant_type` string,
`participant_start_time` timestamp,
`participant_end_time` timestamp,
`participant_duration` double,
`participant_ip` string,
`participant_city` string,
`participant_state` string,
`participant_country` string,
`participant_end_point` string,
`participant_entry_point` string,
`os_type` string,
`os_ver` string,
`os_locale` string,
`os_architecture` string,
`os_timezone` string,
`model_id` string,
`machine_address` string,
`model_name` string,
`browser` string,
`browser_version` string,
`audio_type` string,
`voip_duration` string,
`pstn_duration` string,
`webcam_duration` string,
`screen_share_duration` string,
`is_chat_used` string,
`is_screenshare_used` string,
`is_dialout_used` string,
`is_webcam_used` string,
`is_webinar_scheduled` string,
`is_webinar_deleted` string,
`is_registrationquestion_create` string,
`is_registrationquestion_modify` string,
`is_registrationquestion_delete` string,
`is_poll_created` string,
`is_poll_modified` string,
`is_poll_deleted` string,
`is_survey_created` string,
`is_survey_deleted` string,
`is_handout_uploaded` string,
`is_handout_deleted` string,
`entrypoint_access_time` string,
`endpoint_access_time` string,
`panel_connect_time` string,
`audio_connect_time` string,
`endpoint_install_time` string,
`endpoint_download_time` string,
`launcher_install_time` string,
`launcher_download_time` string,
`join_time` string,
`likely_to_recommend` string,
`rating_reason` string,
`customer_support` string,
`native_machinename_key` string,
`download_status` string,
`native_plan_key` string,
`useragent` string,
`native_connection_key` string,
`active_time` string,
`csid` string,
`arrival_time` string,
`closed_by` string,
`close_cause` string,
`viewer_ip_address` string,
`viewer_os_type` string,
`viewer_os_ver` string,
`viewer_build` string,
`native_service_account_id` string,
`license_key` string,
`session_id` string,
`session_participant_id` string,
`featureusagefactid` string,
`join_session_fact_id` string,
`responseid` string,
`sf_data_date` string,
`spf_data_date` string,
`fuf_data_date` string,
`jsf_data_date` string,
`nps_data_date` string,
`upaf_data_date` string,
`data_source_name` string,
`data_load_date_time` timestamp)
PARTITIONED BY (
`data_date` string,
`data_product` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3://saasdata/datawarehouse/fact/UsageFact/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://saasdata/datawarehouse/fact/UsageFact/'
What might be the issue, and how can I rectify it? Also, is there any other way to achieve the same thing?
I assume this question is related to your previous one - PySpark insert overwrite issue - and boils down to the error you experienced before:
Cannot overwrite a path that is also being read from
It is thrown to protect you from data loss, not to make your life harder. There are cases where Spark won't be able to enforce this automatically, but let me repeat the point I made there: never overwrite (fully or partially) data that is used as a source for the same pipeline.
At best it will cause complete data loss (hopefully you have a good backup policy).
At worst it will silently corrupt the data.

Filter a list of case class objects based on a list of strings

I have a case class User(id: String, name: String, address: String, password: String) and another case class Account(userId: String, accountId: String, roles: Set[String]). I need to filter a list of Account objects (List[Account]) based on a list of userIds, which I have as a List[String] in Scala. I have been struggling with this and tried a few approaches, but couldn't get it to work. Any pointers on how I should do this would be really helpful.
Thanks!
I'm not sure I understand your question correctly, but if you're trying to keep only the Accounts whose userId is part of a separate collection that you have, you can do it like this:
val accounts: List[Account] = ???
val idsToKeep: Set[String] = ???
accounts.filter(a => idsToKeep.contains(a.userId))
For the record, if you use the contains method a lot, you are better off using a Set[String] than a List[String] to store the ids to keep.
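For completeness, here is a minimal self-contained sketch; the accounts and ids below are made-up sample values, but it shows the List[String] of userIds being converted to a Set before filtering:
case class Account(userId: String, accountId: String, roles: Set[String])

val accounts: List[Account] = List(
  Account("u1", "a1", Set("admin")),
  Account("u2", "a2", Set("viewer")),
  Account("u3", "a3", Set("viewer"))
)
val userIds: List[String] = List("u1", "u3")   // the ids you want to keep

val idsToKeep = userIds.toSet                  // one-off conversion for fast membership checks
val filtered = accounts.filter(a => idsToKeep.contains(a.userId))
// filtered == List(Account("u1","a1",Set("admin")), Account("u3","a3",Set("viewer")))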

Convert a DataSet with single column to multiple column dataset in scala

I have a dataset which is a Dataset[String], and it contains the following data:
12348,5,233,234559,4
12348,5,233,234559,4
12349,6,233,234560,5
12350,7,233,234561,6
I want to split each single-column row and convert it into multiple columns named RegionId, PerilId, Date, EventId, and ModelId. How do I achieve this?
You mean something like this:
case class NewSet(RegionId: String, PerilId: String, Date: String, EventId: String, ModelId: String)
// requires import spark.implicits._ so the Encoder for NewSet is in scope
val newDataset = oldDataset.map { s: String =>
  val strings = s.split(",")
  NewSet(strings(0), strings(1), strings(2), strings(3), strings(4))
}
Of course you should probably make the lambda function a little more robust...
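For instance, a slightly more defensive sketch (assuming oldDataset is the Dataset[String] from the question, a SparkSession named spark, and the NewSet case class above) could simply drop rows that do not split into exactly five fields:
import spark.implicits._   // Encoders needed by Dataset.flatMap

val newDataset = oldDataset.flatMap { s =>
  val fields = s.split(",", -1)   // -1 keeps trailing empty fields
  if (fields.length == 5)
    Some(NewSet(fields(0), fields(1), fields(2), fields(3), fields(4)))
  else
    None                          // drop malformed rows instead of failing
}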
If you have the data you specified in an RDD, then converting it to a DataFrame is pretty easy.
case class MyClass(RegionId: String, PerilId: String, Date: String,
                   EventId: String, ModelId: String)
val dataframe = sqlContext.createDataFrame(rdd, classOf[MyClass])
This DataFrame will have all the columns, with the column names corresponding to the fields of class MyClass.
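Alternatively, a minimal sketch that relies on the implicit toDF conversion instead; this assumes the rdd variable actually holds an RDD[MyClass], i.e. the raw lines have already been split and mapped to the case class:
import sqlContext.implicits._   // enables rdd.toDF() for RDDs of case classes

val dataframe = rdd.toDF()      // column names come from the MyClass field names
dataframe.printSchema()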
