Can I retrieve limited dataset from a CSV file in Azure Data Lake U-SQL? - azure

Can I filter data loaded from a CSV file using U-SQL's EXTRACT? I know we can limit the data using a WHERE condition in the SELECT after the EXTRACT, but I want to filter it during the EXTRACT itself.
I have a huge CSV file and I don't want to load all of it into the first rowset.
e.g.
I have a lot of auto claims in the dataset. I want to filter them while I EXTRACT, based on a date column in the dataset.

The answer is yes. However, only column pruning can be pushed into the extractor. Since no semantics are assigned to the data before you extract it with the CSV extractor, row filters are applied to the generated rowset. There are many examples out there that show you how to do this; here is an example from one of the hands-on labs.

Yes, you can filter data loaded from a CSV file. You can do something like this:
@log =
    EXTRACT UserId int,
            StartDate DateTime,
            Location string,
            ....
            ....
            Url string
    FROM "/Samples/Data/Log.csv"
    USING Extractors.Csv();

@result =
    SELECT Location, Url, StartDate
    FROM @log
    WHERE StartDate >= DateTime.Parse("2017/01/6") AND StartDate <= DateTime.Parse("2018/06/08");

OUTPUT @result
TO "/output/cleanlog.csv"
USING Outputters.Csv();

Related

Not able to change datatype of Additional Column in Copy Activity - Azure Data Factory

I am facing a very simple problem: I am not able to change the datatype of an additional column in a Copy activity in an ADF pipeline from String to Datetime.
I am trying to change the source datatype for the additional column in the mapping using JSON, but it still doesn't work with the PolyBase command.
When I run my pipeline it gives the same error.
Is it not possible to change the datatype of an additional column? By default it takes string only.
Dynamic (additional) columns return string.
Try putting the value (e.g. utcnow()) in the dynamic content of the query and casting it to the required target datatype.
Otherwise you can use a Data Flow derived column:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column
Since your source is a query, you can choose to bring the current date in the source SQL query itself, in the desired format, rather than adding it as an additional column.
Thanks
Try using formatDateTime in the additional column's dynamic content and define the desired date format, for example formatDateTime(utcnow(), 'yyyy-dd-MM').
Since the format given here is 'yyyy-dd-MM', the result will be the current date rendered in that pattern.
Note: The output here will still be of string type, as the Copy activity cannot cast the additional column to a date type.
We could either create the current date in the source SQL query or use the approach above, so that the data loads into the sink in the expected format.

How to partition S3 output files by a combination of column values?

I have data which I am crawling into AWS Glue. There I am using PySpark and converting it to Parquet format. My original data is CSV and looks something like this:
id, date, data
1, 202003, x
2, 202003, y
1, 202004, z
etc...
I am able to convert the data successfully, but I am unsure of the best way to get the desired output. The output should be split by id and date in S3, so it should have something like:
s3://bucket/outputdata/{id}_{date}/{data}.parquet
Where id and date are the actual id and date values in the data. The names of the files within obviously do not matter; I just want to be able to create "folders" in the S3 object prefix and split the data within them.
I am very new to AWS Glue and I have a feeling I am missing something very obvious.
Thanks in advance.
You can create a partition column by concatenating your two existing columns and then partitioning by the new column on write, e.g.:
from pyspark.sql.functions import concat, col, lit
df1 = df.withColumn('p', concat(col('id'), lit('_'), col('date')))
df1.write.partitionBy('p').parquet('s3://bucket/outputdata')
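Note that partitionBy writes Hive-style directory names, so the layout above will come out as s3://bucket/outputdata/p=1_202003/ rather than exactly {id}_{date}/. If the prefix must not contain the p= part, a minimal sketch under that assumption (reusing the df, id, date and bucket names from above, and assuming the number of distinct id/date combinations is small) is to write each combination to its own prefix:
from pyspark.sql.functions import col

# Write each (id, date) combination to its own prefix such as 1_202003/.
# collect() is only reasonable when the number of distinct combinations is small.
for row in df.select('id', 'date').distinct().collect():
    (df.filter((col('id') == row['id']) & (col('date') == row['date']))
       .write.mode('overwrite')
       .parquet("s3://bucket/outputdata/{}_{}".format(row['id'], row['date'])))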

Spark find max of date partitioned column

I have a Parquet dataset partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by, something like:
df.groupby().agg(max(col('batch_date'))).first()
While this would work, it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by @pasha701 would involve loading the entire Spark data frame with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that partition.
One way is to use hdfs or s3fs to load the contents of the S3 path as a list, then find the max partition and load only that. That would be more efficient.
Assuming you are using an AWS S3 path, something like this:
import s3fs

datelist = []
inpath = "s3://bucket_path/data/"

fs = s3fs.S3FileSystem(anon=False)
dirs = fs.ls(inpath)  # lists the batch_date=YYYY-MM-DD partition directories

for path in dirs:
    date = path.split('=')[1]
    datelist.append(date)

maxpart = max(datelist)
df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This does all the work on the path listing, without loading any data until it finds the partition you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Use SHOW PARTITIONS to get all partitions of the table:
show partitions TABLENAME
The output will be like:
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can get data from a specific partition using the query below:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Additional filters or a group by can also be applied on it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a table partitioned on a single date column; I haven't tried it when a table has more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns a dataframe with a single column called 'partition', with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as a string, convert it to a date, and take the max:
from pyspark.sql.functions import split, to_date
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value": "max"}).first()[0]
date_filter contains the maximum partition date and can be used in a where clause when pulling from the same table.
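For example, a minimal follow-up sketch (reusing the database.dataframe table and partitioned_col column assumed above) that reads only the latest partition:
from pyspark.sql.functions import col

# The filter is on the partition column, so Spark prunes the other partitions
# instead of scanning the whole table.
latest_df = spark.table("database.dataframe").where(col("partitioned_col") == date_filter)
latest_df.show(5)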

Adding Extraction DateTime in Azure Data Factory

I want to write a generic Data Factory pipeline in V2 with the scenario below:
Source ---> Extract (from Salesforce or some other source), which doesn't have an extraction timestamp ---> Write it to Blob with an extraction timestamp.
I want it to be generic, so I don't want to give column mapping anywhere.
Is there any way to use an expression or system variable in a custom activity to append a column to the output dataset? I'd like a very simple solution to keep the implementation realistic.
To do that you should change the query to add the column you need, using the query property in the copy activity of the pipeline. https://learn.microsoft.com/en-us/azure/data-factory/connector-salesforce#copy-activity-properties
I don't know much about Salesforce, but in SQL Server you can do the following:
SELECT *, CURRENT_TIMESTAMP as AddedTimeStamp from [schema].[table]
This will give you every field on your table and will add a column named AddedTimeStamp with the CURRENT_TIMESTAMP value in every row of the result.
Hope this helped!

Loading PIG output files into Hive table with some blank cells

I have successfully loaded a 250000-record CSV file into HDFS and performed some ETL functions on it, such as removing any characters other than 0-9, a-z and A-Z from the strings so that the data is nice and clean.
I've saved the output of this ETL to HDFS for loading into Hive. In Hive I created the schema for the table and set the appropriate data type for each column:
CREATE EXTERNAL TABLE pigOutputHive (
    id string,
    Score int,
    ViewCount int,
    OwnerUserId string,
    Body string,
    Rank int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/admin/PigOutputETL';
When I run a simple query on the data such as:
SELECT * FROM pigoutputhive LIMIT 100000;
The data looks as it should, and when I download it to my local machine and view it in Excel as a CSV it also looks good.
When I try to run the following query on the same table, I get every field returned as an integer, even for the string columns. See the screenshot below.
Can anyone see where I am going wrong? Of the original 250000 rows there are some blanks in particular fields, such as OwnerUserId. Do I need to tell Pig or Hive how to handle these?
