Using Pandas, I'm attempting to 'slice' (Sorry if that's not the correct term) segments of a dataframe out of one DF and into a new one, where every segment is stacked one on top of the other.
Code:
import pandas as pd
df = pd.DataFrame(
{
'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
})
print(df)
column_headers = df.columns.values.tolist()
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_step = 'STEP'
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
type_df = df[df[col_name_type].isin(types_to_check)]
for row in type_df:
if 'CMD' in row:
if 'START:' in row[col_name_step].value:
idx_start = row.iloc[::-1].str.match('VERIFY').first_valid_index() #go backwards and find first VERIFY
step_match = row[col_name_step].value[6:] #get the unique ID after Start:
idx_end = df[df[col_name_step].str.endswith(step_match, na=False)].last_valid_index() #find last instance of matching unique id
segments.append(df.loc[idx_start:idx_end, :])
segments.append(df_blank)
df_segments = pd.concat(segments)
print(df)
print(df_segments)
Nothing gets populated in my segements array so the concat function fails.
From my research I'm confident that this can be done using either .loc or .iloc, but I can't seem to get a working implementation in.
My DF:
What I am trying to make:
Any help and/or guidance would be welcome.
Edit: To clarify, I'm trying to create a new DF that is comprised of every group of rows, where the start is the "VERIFY" that comes before a "CMD" row that also contains "Start:", and the end is the matching "CMD" row that has end.
EDIT2: I think the following is something close to what I need, but I'm unsure how to get it to reliably work:
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
cmd_check = ['CMD']
verify_check = ['VERIFY']
cmd_df = df[(df[col_name_type].isin(cmd_check))]
cmd_start_df = cmd_df[(cmd_df[col_name_step].str.contains('START:'))]
for cmd_idx in cmd_start_df.index:
step_name = df.loc[cmd_idx, col_name_step][6:]
temp_df = df.loc[:cmd_idx,]
idx_start = temp_df[col_name_type].isin(verify_check).last_valid_index()
idx_end = cmd_df[cmd_df[col_name_type].str.endswith(step_name, na=False)].last_valid_index()
segments.append(df.loc[idx_start:idx_end, :])
segments.append(df_blank)
df_segments = pd.concat(segments)
you can use str.contains
segmented_df = df.loc[df['STEP'].str.contains('Start|End')]
print(segmented_df )
I created some code to accomplish the 'slicing' I wanted:
for cmd_idx in cmd_start_df.index:
step_name = df.loc[cmd_idx, col_name_step][6:]
temp_df = df.loc[:cmd_idx,:]
temp_list = temp_df[col_name_type].values.tolist()
if 'VERIFY' in temp_list:
idx_start = temp_df[temp_df[col_name_type].str.match('VERIFY')].last_valid_index()
else:
idx_start = cmd_idx
idx_end = cmd_df[cmd_df[col_name_step].str.endswith(step_name, na=False)].last_valid_index()
slides.append(df.loc[idx_start:idx_end, :])
slides.append(df_blank)
I essentially create a new DF that is a subset of the old DF up to my first START index, then I find the last_valid_index that has VERIFY, then I use that index to create a filtered DF from idx_start to idx_end and then eventually concat all those slices into one DF.
Maybe there's an easier way, but I couldn't find it.
Hello i would like to convert empty string to 0 of my RDD.
I have read 20 files and they are in like this formation.
YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
2015,2,6,5,OO,6271,N937SW,FAR,DEN,1712,1701,-11,15,1716,123,117,95,627,1751,7,1815,1758,-17,0,0,,,,,,
2015,1,19,1,AA,1605,N496AA,DFW,ONT,1740,1744,4,15,1759,193,198,175,1188,1854,8,1853,1902,9,0,0,,,,,,
2015,3,8,7,NK,1068,N519NK,LAS,CLE,2220,2210,-10,12,2222,238,229,208,1824,450,9,518,459,-19,0,0,,,,,,
2015,9,21,1,AA,1094,N3EDAA,DFW,BOS,1155,1155,0,12,1207,223,206,190,1562,1617,4,1638,1621,-17,0,0,,,,,,
i would like to fill these empty strings with the number 0 to them
def import_parse_rdd(data):
# create rdd
rdd = sc.textFile(data)
# remove the header
header = rdd.first()
rdd = rdd.filter(lambda row: row != header) #filter out header
# split by comma
split_rdd = rdd.map(lambda line: line.split(','))
row_rdd = split_rdd.map(lambda line: Row(
YEAR = int(line[0]),MONTH = int(line[1]),DAY = int(line[2]),DAY_OF_WEEK = int(line[3])
,AIRLINE = line[4],FLIGHT_NUMBER = int(line[5]),
TAIL_NUMBER = line[6],ORIGIN_AIRPORT = line[7],DESTINATION_AIRPORT = line[8],
SCHEDULED_DEPARTURE = line[9],DEPARTURE_TIME = line[10],DEPARTURE_DELAY = (line[11]),TAXI_OUT = (line[12]),
WHEELS_OFF = line[13],SCHEDULED_TIME = line[14],ELAPSED_TIME = (line[15]),AIR_TIME = (line[16]),DISTANCE = (line[17]),WHEELS_ON = line[18],TAXI_IN = (line[19]),
SCHEDULED_ARRIVAL = line[20],ARRIVAL_TIME = line[21],ARRIVAL_DELAY = line[22],DIVERTED = line[23],CANCELLED = line[24],CANCELLATION_REASON = line[25],AIR_SYSTEM_DELAY = line[26],
SECURITY_DELAY = line[27],AIRLINE_DELAY = line[28],LATE_AIRCRAFT_DELAY = line[29],WEATHER_DELAY = line[30])
)
return row_rdd
the above is the code i am running.
I am working with RDD ROW OBJECTS not a dataframe
You can use na.fill("0") to replace all nulls with "0" strings.
spark.read.csv("path/to/file").na.fill(value="0").show()
In case you need integers, you can change the schema to convert string columns to integers.
You could add this to your dataframe to apply the change to a column named 'col_name'
from pyspark.sql import functions as F
(...)
.withColumn('col_name', F.regexp_replace('col_name', ' ', 0))
You could use this syntax directly in your code
You can add if-else condition while creating Row.
Let's consider WEATHER_DELAY.
row_rdd = split_rdd.map(lambda line: Row(#allothercols,
WEATHER_DELAY = 0 if "".__eq__(line[30]) else line[30])
Please allow me another try for your problem, using foreach() method dedicated to rdd.
def f(x) = x.replace(' ', 0)
(...)
row_rdd = row_rdd.foreach(f) # to be added at the end of your script
My current dataframe looks as below:
existing_data = {'STORE_ID': ['1234','5678','9876','3456','6789'],
'FULFILLMENT_TYPE': ['DELIVERY','DRIVE','DELIVERY','DRIVE','DELIVERY'],
'FORECAST_DATE':['2020-08-01','2020-08-02','2020-08-03','2020-08-04','2020-08-05'],
'DAY_OF_WEEK':['SATURDAY','SUNDAY','MONDAY','TUESDAY','WEDNESDAY'],
'START_HOUR':[8,8,6,7,9],
'END_HOUR':[19,19,18,19,17]}
existing = pd.DataFrame(data=existing_data)
I would need the data to be exploded between the start and end hour such that each hour is a different row like below:
needed_data = {'STORE_ID': ['1234','1234','1234','1234','1234'],
'FULFILLMENT_TYPE': ['DELIVERY','DELIVERY','DELIVERY','DELIVERY','DELIVERY'],
'FORECAST_DATE':['2020-08-01','2020-08-01','2020-08-01','2020-08-01','2020-08-01'],
'DAY_OF_WEEK': ['SATURDAY','SATURDAY','SATURDAY','SATURDAY','SATURDAY'],
'HOUR':[8,9,10,11,12]}
required = pd.DataFrame(data=needed_data)
Not sure how to achieve this ..I know it should be with explode() but unable to achieve it.
If small DataFrame or performance is not important use range per both columns with DataFrame.explode:
existing['HOUR'] = existing.apply(lambda x: range(x['START_HOUR'], x['END_HOUR']+1), axis=1)
existing = (existing.explode('HOUR')
.reset_index(drop=True)
.drop(['START_HOUR','END_HOUR'], axis=1))
If performance is important use Index.repeat by subtract both columns and then add counter by GroupBy.cumcount to START_HOUR:
s = existing["END_HOUR"].sub(existing["START_HOUR"]) + 1
df = existing.loc[existing.index.repeat(s)].copy()
add = df.groupby(level=0).cumcount()
df['HOUR'] = df["START_HOUR"].add(add)
df = df.reset_index(drop=True).drop(['START_HOUR','END_HOUR'], axis=1)