Hello i would like to make my own hash function with the key being a column "arrival delay"
the code that i had at the moment is
# this is for the flights
def import_parse_rdd(data):
# create rdd
rdd = sc.textFile(data)
# remove the header
header = rdd.first()
rdd = rdd.filter(lambda row: row != header) #filter out header
# split by comma
split_rdd = rdd.map(lambda line: line.split(','))
row_rdd = split_rdd.map(lambda line: Row(
YEAR = int(line[0]),MONTH = int(line[1]),DAY = int(line[2]),DAY_OF_WEEK = int(line[3])
,AIRLINE = line[4],FLIGHT_NUMBER = int(line[5]),
TAIL_NUMBER = line[6],ORIGIN_AIRPORT = line[7],DESTINATION_AIRPORT = line[8],
SCHEDULED_DEPARTURE = line[9],DEPARTURE_TIME = line[10],DEPARTURE_DELAY = 0 if "".__eq__(line[11]) else float(line[11]),TAXI_OUT = 0 if "".__eq__(line[12]) else float(line[12]),
WHEELS_OFF = line[13],SCHEDULED_TIME = line[14],ELAPSED_TIME = 0 if "".__eq__(line[15]) else float(line[15]),AIR_TIME = 0 if "".__eq__(line[16]) else float(line[16]),DISTANCE = 0 if "".__eq__(line[17]) else float(line[17]),WHEELS_ON = line[18],TAXI_IN = 0 if "".__eq__(line[19]) else float(line[19]),
SCHEDULED_ARRIVAL = line[20],ARRIVAL_TIME = line[21],ARRIVAL_DELAY = 0 if "".__eq__(line[22]) else float(line[22]),DIVERTED = line[23],CANCELLED = line[24],CANCELLATION_REASON = line[25],AIR_SYSTEM_DELAY = line[26],
SECURITY_DELAY = line[27],AIRLINE_DELAY = line[28],LATE_AIRCRAFT_DELAY = line[29],WEATHER_DELAY = line[30])
)
return row_rdd
if i take flight_rdd.take(1)
[Row(YEAR=2015, MONTH=6, DAY=26, DAY_OF_WEEK=5, AIRLINE='EV', FLIGHT_NUMBER=4951, TAIL_NUMBER='N707EV', ORIGIN_AIRPORT='BHM', DESTINATION_AIRPORT='LGA', SCHEDULED_DEPARTURE='630', DEPARTURE_TIME='629', DEPARTURE_DELAY=-1.0, TAXI_OUT=13.0, WHEELS_OFF='642', SCHEDULED_TIME='155', ELAPSED_TIME=141.0, AIR_TIME=113.0, DISTANCE=866.0, WHEELS_ON='935', TAXI_IN=15.0, SCHEDULED_ARRIVAL='1005', ARRIVAL_TIME='950', ARRIVAL_DELAY=-15.0, DIVERTED='0', CANCELLED='0', CANCELLATION_REASON='', AIR_SYSTEM_DELAY='', SECURITY_DELAY='', AIRLINE_DELAY='', LATE_AIRCRAFT_DELAY='', WEATHER_DELAY='')]
is the output
i would like to make a user defined hash partitioning function with the key being the ARRIVAL_DELAY column.
If i could i would also like Min and Max value in the ARRIVAL_DELAY column to be used as a key to determine the distribution of the partition.
the furthest i have gone is that
flight_rdd.partitionBy(number of parts, key)
is what i understand
YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
2015,2,6,5,OO,6271,N937SW,FAR,DEN,1712,1701,-11,15,1716,123,117,95,627,1751,7,1815,1758,-17,0,0,,,,,,
2015,1,19,1,AA,1605,N496AA,DFW,ONT,1740,1744,4,15,1759,193,198,175,1188,1854,8,1853,1902,9,0,0,,,,,,
2015,3,8,7,NK,1068,N519NK,LAS,CLE,2220,2210,-10,12,2222,238,229,208,1824,450,9,518,459,-19,0,0,,,,,,
2015,9,21,1,AA,1094,N3EDAA,DFW,BOS,1155,1155,0,12,1207,223,206,190,1562,1617,4,1638,1621,-17,0,0,,,,,,
this is the unprocessed version of the data set
TableAlias isn't working with multiple joins.
The query:
var q = Db.From<Blog>(Db.TableAlias("b"))
.LeftJoin<Blog, BlogToBlogCategory>((b,btb)=> b.Id == btb.BlogId, Db.TableAlias("btbc"))
.Join<BlogToBlogCategory, BlogCategory>((bt,bc)=>bt.BlogCategoryId == bc.Id, Db.TableAlias("cats"))
.GroupBy(x => x.Id);
.Select("b.*, json_agg(cats) as BlogCategoriesJson");
var results = Db.Select<BlogQueryResponse>(q);
Generates this SQL:
SELECT b.*, json_agg(cats) as BlogCategoriesJson
FROM "blog" "b" LEFT JOIN "blog_to_blog_category" "btbc" ON ("b"."id" = "btbc"."blog_id") INNER JOIN "blog_category" "cats" ON ("blog_to_blog_category"."blog_category_id" = "cats"."id")
GROUP BY "b"."id"
This causes error because it is referencing "blog_to_blog_category" instead of btbc
The Db.TableAlias() only provides an alias for the target join table, your inner join does not specify the alias to use for the source table so it references the full table name as expected.
You can use Sql.TableAlias() in your LINQ Expression to reference a table alias, e.g:
var q = Db.From<Blog>(Db.TableAlias("b"))
.LeftJoin<Blog, BlogToBlogCategory>((b,btb)=> b.Id == btb.BlogId, Db.TableAlias("btbc"))
.Join<BlogToBlogCategory, BlogCategory>((bt,bc)=>
Sql.TableAlias(bt.BlogCategoryId, "btbc") == bc.Id, Db.TableAlias("cats"))
.GroupBy(x => x.Id);
.Select("b.*, json_agg(cats) as BlogCategoriesJson");
my one query is --
select * from
(
select a.model_code,a.fiscal_quarter,a.scenario_no,a.gfcid,a.field_id,a.field_value
from ccar_input_data a,ccar_model_feed_map b
where a.model_code = b.model_code (+) and a.field_id = b.field_id (+)
and a.model_code = 'COMREPROJ'
and a.scenario_no = 2 and a.fiscal_quarter = '2014-Q1' and a.gfcid = 112000000
) src
pivot
(
max(field_value) for field_id in (10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61)
)
which gives me a.model_code,a.fiscal_quarter,a.scenario_no,a.gfcid,a.field_id,a.field_value and 10,11,...61 columns
My second query is ----
select * from
(
select d.field_id,d.field_value
from ccar14a_output_view d,ccar_model_feed_map b
where d.model_code = b.model_code and d.field_id = b.field_id
and d.model_code = 'COMREPROJ'
and d.scenario_no = 2 and d.fiscal_quarter = '2014-Q1' and d.gfcid = 112000000
) src
pivot
(
max(field_value) for field_id in (62,63,64,65,66,67,69)
)
which gives me column 62, 63,...69
I want to make the row as
a.model_code,a.fiscal_quarter,a.scenario_no,a.gfcid,a.field_id,a.field_value and 10,11,...61, 62,63...69
Pl help. I am not sure how can I join this two query...HELP NEEDED (I am very new to sql)
Try this:
select * from
(
select a.model_code,a.fiscal_quarter,a.scenario_no,a.gfcid,a.field_id,a.field_value
from ccar_input_data a,ccar_model_feed_map b
where a.model_code = b.model_code (+) and a.field_id = b.field_id (+)
and a.model_code = 'COMREPROJ'
and a.scenario_no = 2 and a.fiscal_quarter = '2014-Q1' and a.gfcid = 112000000
UNION ALL
select null, null,null, null, d.field_id,d.field_value
from ccar14a_output_view d,ccar_model_feed_map b
where d.model_code = b.model_code and d.field_id = b.field_id
and d.model_code = 'COMREPROJ'
and d.scenario_no = 2 and d.fiscal_quarter = '2014-Q1' and d.gfcid = 112000000
) src
pivot
(
max(field_value) for field_id in (10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61, 62,63,64,65,66,67,69)
)