I have this statement:
--Insert a new column based on filename
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile');
Data_Schema = FOREACH Data GENERATE
(chararray)$1 AS Date,
(chararray)$2 AS ID,
(chararray)$3 AS Interval,
(chararray)$4 AS Code,
(chararray)$5 AS S_In,
(chararray)$6 AS S_Out,
(chararray)$7 AS C_In,
(chararray)$8 AS C_Out,
(chararray)$9 AS Traffic;
--Split into different directories
SPLIT Data_Schema INTO Src1 IF (Date == '2016-06-25.txt'),
Src2 IF (Date == '2014-07-31.txt'),
Src3 IF (Date == '2016-01-01.txt');
STORE Src1 INTO '/user/cloudera/Source_Data/2016-06-25' using PigStorage('\t');
STORE Src2 INTO '/user/cloudera/Source_Data/2014-07-31.txt' using PigStorage('\t');
STORE Src3 INTO '/user/cloudera/Source_Data/2016-01-01' using PigStorage('\t');
And here is an example of my original source data:
10000 1388530800000 39 8.600870350350515 13.86183926855984 1.7218329193014124 3.424444103320796 25.972920214509095
When I execute it, the script runs successfully; however, the files in HDFS contain no data...
Note that I add a new column based on the file name; that's why I have one more column in the FOREACH statement...
If your input files are named 2016-06-25.txt, 2014-07-31.txt and 2016-01-01.txt, then the newly added column will be referenced by $0 and will contain the file names.
You have to do it like this:
Data_Schema = FOREACH Data GENERATE
(chararray)$0 AS Date,
(chararray)$1 AS ID,
...
Or simply specify the schema while loading the files and keep the rest as it is:
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile') as (Date:chararray, ID:chararray, ... ;
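For completeness, a minimal sketch of that corrected load with the full inline schema (the field names are simply taken from the question; adjust them to your actual layout):
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile')
       as (Date:chararray, ID:chararray, Interval:chararray, Code:chararray,
           S_In:chararray, S_Out:chararray, C_In:chararray, C_Out:chararray,
           Traffic:chararray);
-- Date now holds the file name added by -tagFile, so the SPLIT on '2016-06-25.txt' etc.
-- and the STORE statements can stay exactly as they are.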
Related
I have a CSV file, and when I load it into Databricks using PySpark, it loses its structure. I tried delimiting it with a pipe and setting header = True, but nothing works. Below is what I am facing (this is only an example):
I have written the following code:
df = spark.read.csv(df_path, header = True, sep = "|")
and the result is:
----------------------------------
region, sub-region, country, owner
----------------------------------
new_jersey, daffodil, USA, Anker
Dubai, Bahamas, UAE, Nikon
All the values end up inside a single column, separated by ','.
How do I convert this into a structured DataFrame?
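Since the rows shown are separated by commas rather than pipes, the most likely fix is to read with a comma delimiter. A minimal sketch under that assumption (df_path as in the question):
df = spark.read.csv(df_path, header=True, sep=",", inferSchema=True)
df.show()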
I have a U-SQL script that uses a file pattern to find files in Azure Data Lake and extract some data from them:
DECLARE @input_file string = @"\data\{*}\{*}\{*}.avro";
@data =
    EXTRACT Column1 string,
            Column2 double
    FROM @input_file
    USING new MyExtractors.AvroExtractor();
File pattern is:
data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}
Problem: the custom extractor executes very slowly. I have many files in the Data Lake, and it takes 15 hours to process and costs USD $600 per run. Too slow and too expensive.
I only need to extract fresh data from files that are no more than 90 days old. How can I filter out old files using the file pattern, the file's last modified date, or any other technique?
You can leverage the GetMetadata activity in Azure Data Factory to retrieve the lastModifiedTime of the files.
Reference doc:
Get metadata activity in Azure Data Factory
And there's a relevant post about incremental copy:
Azure data factory | incremental data load from SFTP to Blob
You could use the .AddDays method of DateTime.Now, although whether or not this actually filters out all your files is (I think) dependent on your custom extractor, e.g.:
//DECLARE @input_file string = @"\data\{*}\{*}\{*}.csv";
DECLARE @input_file string = @"\data\{Namespace}-{EventHub}-{PartitionId}\{xdate:yyyy}-{xdate:MM}-{xdate:dd}\{Hour}-{Minute}-{Second}.csv";
// data/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}
@input =
    EXTRACT Column1 string,
            Column2 double,
            xdate DateTime,
            Namespace string,
            EventHub string,
            PartitionId string,
            Hour int,
            Minute int,
            Second int
    FROM @input_file
    USING Extractors.Csv();
    //USING new MyExtractors.AvroExtractor();
@output =
    SELECT Column1,
           Column2
    FROM @input
    WHERE xdate > DateTime.Now.AddDays(-90);
OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
In my simple tests with .Csv this worked to reduce the input stream from 4 to 3 streams, but as mentioned I'm not sure if this will work with your custom extractor.
I'm having an issue with scheduling a job in Data Factory.
I'm trying to set up an hourly scheduled job which will execute the same script each hour with a different condition.
Consider that I have a bunch of Avro files spread across Azure Data Lake Store with the following pattern:
/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}{date:MM}{date:dd}__{date:H}
Each hour new files are added to Data Lake Store.
In order to process the files only once, I decided to handle them with the help of the U-SQL virtual file set column and a SyncTable which I created in Data Lake Store.
My query looks like the following:
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";
@result =
    EXTRACT [Id] long,
            ....
            date DateTime
    FROM @file_set_path
    USING someextractor;
@rdate =
    SELECT MAX(ProcessedDate) AS ProcessedDate
    FROM dbo.SyncTable
    WHERE EntityName == "SomeEntity";
@finalResult =
    SELECT [Id], ...
    FROM @result
    CROSS JOIN @rdate AS r
    WHERE date >= r.ProcessedDate;
Since I can't use a rowset variable in the WHERE clause, I'm cross joining the single row with the set; however, even in this case U-SQL won't find the correct files and always returns the whole file set.
Is there any workaround or other approach?
I think this approach should work unless there is something not quite right somewhere, i.e. can you confirm the data types of the dbo.SyncTable table? Dump out @rdate and make sure the value you get there is what you expect.
I put together a simple demo which worked as expected. My copy of SyncTable had one record with the value of 01/01/2018:
@working =
    SELECT *
    FROM (
        VALUES
        ( (int)1, DateTime.Parse("2017/12/31") ),
        ( (int)2, DateTime.Parse("2018/01/01") ),
        ( (int)3, DateTime.Parse("2018/02/01") )
    ) AS x ( id, someDate );
@rdate =
    SELECT MAX(ProcessedDate) AS maxDate
    FROM dbo.SyncTable;
//@output =
//    SELECT *
//    FROM @rdate;
@output =
    SELECT *, (w.someDate - r.maxDate).ToString() AS diff
    FROM @working AS w
         CROSS JOIN
         @rdate AS r
    WHERE w.someDate >= r.maxDate;
OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv();
I did try this with a filepath (full script here). The thing to remember is that the custom date format H represents the hour as a number from 0 to 23. If your SyncTable date does not have a time component when you insert it, it will default to midnight (0), meaning the whole day will be collected. Your file structure should look something like this according to your pattern:
"D:\Data Lake\USQLDataRoot\Data\SomeEntity\2017\12\31\SomeEntity_2017_12_31__8\test.csv"
I note your filepath has underscores in the second section and a double underscore before the hour section (which will be between 0 and 23, single digit up to hour 10). I also notice your fileset path does not have a file type or quotes; I've used test.csv in my tests.
Basically I think the approach will work, but there is something not quite right, maybe in your file structure, the value in your SyncTable, the data type, etc. You need to go over the details, dumping out intermediate values to check, until you find the problem.
Doesn't the gist of wBob's full script resolve your issue? Here is a very slightly edited version of wBob's full script to address some of the issues you raised:
the ability to filter on SyncTable;
the last part of the pattern is the file name rather than a folder. Sample file and structure: \Data\SomeEntity\2018\01\01\SomeEntity_2018_01_01__1
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";
@input =
    EXTRACT [Id] long,
            date DateTime
    FROM @file_set_path
    USING Extractors.Text();
// in lieu of creating the actual table
@syncTable =
    SELECT * FROM
        ( VALUES
        ( "SomeEntity", new DateTime(2018,01,01,01,00,00) ),
        ( "AnotherEntity", new DateTime(2018,01,01,01,00,00) ),
        ( "SomeEntity", new DateTime(2018,01,01,00,00,00) ),
        ( "AnotherEntity", new DateTime(2018,01,01,00,00,00) ),
        ( "SomeEntity", new DateTime(2017,12,31,23,00,00) ),
        ( "AnotherEntity", new DateTime(2017,12,31,23,00,00) )
        ) AS x ( EntityName, ProcessedDate );
@rdate =
    SELECT MAX(ProcessedDate) AS maxDate
    FROM @syncTable
    WHERE EntityName == "SomeEntity";
@output =
    SELECT *,
           date.ToString() AS dateString
    FROM @input AS i
         CROSS JOIN
         @rdate AS r
    WHERE i.date >= r.maxDate;
OUTPUT @output
TO "/output/output.txt"
ORDER BY Id
USING Outputters.Text(quoting : false);
Also please note that file sets cannot perform partition elimination on dynamic joins, since the values are not known to the optimizer during the preparation phase.
I would suggest passing the sync point as a parameter from ADF to the processing script. Then the value is known to the optimizer, and file set partition elimination will kick in. In the worst case, you would have to read the value from your sync table in a previous script and use it as a parameter in the next.
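For illustration, a minimal sketch of that parameter approach (the @syncPoint name and its default value are assumptions; ADF would supply the real value when submitting the job):
DECLARE EXTERNAL @syncPoint DateTime = DateTime.Parse("2018-01-01");
DECLARE @file_set_path string = @"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";
@input =
    EXTRACT [Id] long,
            date DateTime
    FROM @file_set_path
    USING Extractors.Text();
// @syncPoint is a compile-time constant here, so the optimizer can eliminate
// file set partitions whose {date} parts fall before it.
@output =
    SELECT [Id],
           date
    FROM @input
    WHERE date >= @syncPoint;
OUTPUT @output
TO "/output/fresh.txt"
USING Outputters.Text();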
I processed data using pySpark and sqlContext using the following query:
(sqlContext.sql("select LastUpdate, Count(1) as Count from temp_t group by LastUpdate")
    .rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))
It is stored in the following format:
Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)
But I want to store the data in a Hive table as
LastUpdate Count
2016-03-14 12:27:55.01 1
. .
. .
Here is how I create the table in Hive:
CREATE TABLE Data_Count(LastUpdate string, Count int )
ROW FORMAT DELIMITED fields terminated by '|';
I tried many options but was not successful. Please help me on this.
Why not load the data into Hive itself, without going through the process of saving the file and then loading it into Hive?
from datetime import datetime, timedelta
from pyspark.sql import HiveContext, Row

hiveCtx = HiveContext(sc)

# Create sample data
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
delta = timedelta(days=1)
futureTime = currTime + delta
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]

# Parallelize the list and convert to a DataFrame
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")

# Aggregate and save the result directly as a Hive table
aggDF = hiveCtx.sql("select LastUpdate, Count(1) as Count from temp_t group by LastUpdate")
aggDF.saveAsTable("Data_Count")
You created a table, now you need to fill it with the data you generated.
This could be run from a Spark HiveContext, I believe:
LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count
Alternatively, you may want to build a table over the data:
CREATE EXTERNAL TABLE IF NOT EXISTS Data_Count(
LastUpdate DATE,
Count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/Count';
I have a simple column family in my Cassandra KeySpace. I need to access it using PIG. Please help me understand how this works:
SD = LOAD 'cassandra://SampleData/Queries' USING CassandraStorage() as (f1,f2,f3);
If I perform
X = foreach SD generate f1; dump X;
it gives me all the keys stored in the 'Queries' table. I need to be able to generate a couple (key, value), where 'key' is a row key and 'value' is the value of the column named 'UpdateTimeStamp' for that key.
I figured out that if I do
Y = foreach SD generate f2.name; dump Y;
It goes through all the rows and prints the list of names of the columns present in each row.
If I do
Z = foreach SD generate f2.value; dump Z;
it gives me the same thing as above except instead of column names, it prints column values.
I need to be able to generate a relation (key, timestamp) something like this:
T = foreach SD generate (f1, f2.value(for f2.name == 'UpdateTimeStamp'));
Obviously, Pig won't take the statement above.
The columns in Cassandra are loaded into Pig as an inner bag of tuples.
Try this:
data = LOAD 'cassandra://SampleData/Queries' USING CassandraStorage()
       AS (keycolumn, columns: bag {T: tuple(columnname, columnvalue)});
dump data; -- check what is in the data alias
data2 = FOREACH data GENERATE keycolumn, columns.columnname;
dump data2;
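To get from there to the (key, UpdateTimeStamp) relation asked about above, a nested FOREACH with a FILTER on the inner bag should work; a minimal sketch, assuming the schema declared in the previous snippet:
T = FOREACH data {
        ts = FILTER columns BY columnname == 'UpdateTimeStamp';
        GENERATE keycolumn, FLATTEN(ts.columnvalue) AS UpdateTimeStamp;
    };
dump T;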