I get a timeout when running this notebook in Databricks. The last step, writing to Parquet, runs for approximately 15-18 minutes before the timeout error occurs. I'm not sure where it goes wrong.
from pyspark.sql.functions import explode, sequence
# Create hours string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23')) as hh").createOrReplaceTempView('hours')
# Create minutes string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59')) as mm").createOrReplaceTempView('minutes')
# Create seconds string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59')) as ss").createOrReplaceTempView('seconds')
# Create Time string, add hour, minute, second
spark.sql(f"select CAST(CONCAT(hours.hh, ':', minutes.mm, ':', seconds.ss) as string) as Time, explode(sequence(0,23,1)) as Hour from hours cross join minutes cross join seconds ").createOrReplaceTempView('time1')
spark.sql(f"select *, explode(sequence(0,59,1)) as Minute from time1").createOrReplaceTempView('time2')
spark.sql(f"select *, explode(sequence(0,59,1)) as Second from time2").createOrReplaceTempView('time3')
# Add TimeID
spark.sql(f"select row_number() over (order by TIME) as TimeID, * from time3").createOrReplaceTempView('src')
# Add HourDescription
spark.sql(f"select *, CONCAT(CASE date_part('HOUR', Time) WHEN 0 THEN '00' WHEN 1 THEN '01' WHEN 2 THEN '02' WHEN 3 THEN '03' WHEN 4 THEN '04' WHEN 5 THEN '05' WHEN 6 THEN '06' WHEN 7 THEN '07' WHEN 8 THEN '08' WHEN 9 THEN '09' END, ':00') as HourDescription from src").createOrReplaceTempView('src1')
# Add HourBucket
spark.sql(f"select *, CONCAT(HourDescription, ' - ', CONCAT(CASE date_part('HOUR', Time) WHEN 0 THEN '01' WHEN 1 THEN '02' WHEN 2 THEN '03' WHEN 3 THEN '04' WHEN 4 THEN '05' WHEN 5 THEN '06' WHEN 6 THEN '07' WHEN 7 THEN '08' WHEN 8 THEN '09' WHEN 9 THEN '10' WHEN 10 THEN '11' WHEN 11 THEN '12' WHEN 12 THEN '13' WHEN 13 THEN '14' WHEN 14 THEN '15' WHEN 15 THEN '16' WHEN 16 THEN '17' WHEN 17 THEN '18' WHEN 18 THEN '19' WHEN 19 THEN '20' WHEN 20 THEN '21' WHEN 21 THEN '22' WHEN 22 THEN '23' WHEN 23 THEN '00' END, ':00')) as HourBucket from src1").createOrReplaceTempView('src2')
# Add DayPart
spark.sql(f"select *, CASE WHEN (Hour >= 0 AND Hour < 6) THEN 'Night' WHEN (Hour >= 6 AND Hour < 12) THEN 'Morning' WHEN (Hour >= 12 AND Hour < 18) THEN 'Afternoon' ELSE 'Evening' END as DayPart FROM src2").createOrReplaceTempView('src3')
# Add BusinessHour
spark.sql(f"select *, CASE WHEN (Hour >= 8 AND Hour < 18) THEN 'Yes' ELSE 'No' END as BusinessHour FROM src3").createOrReplaceTempView('src_final')
#Write to Parquet
df = sqlContext.sql("select * from src_final");
df.write.parquet("/mnt/xxx/xx/xxx/")
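I figured it out. The explode(sequence(...)) calls were the expensive part, especially once the one for minutes came in: each explode emits one row per sequence element for every input row, so the row count is multiplied each time instead of the table just gaining a column. I fixed the code like this: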
# Create Time string, add hour, minute, second
spark.sql(f"select CAST(CONCAT(hours.hh, ':', minutes.mm, ':', seconds.ss) as string) as Time, cast(hours.hh as int) as Hour, cast(minutes.mm as int) as Minute, cast(seconds.ss as int) as Second from hours cross join minutes cross join seconds ").createOrReplaceTempView('time')
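To see why the original version could not finish in time, it helps to count rows. The sketch below is plain Python (no Spark needed) and only does the arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Rows produced by the hours x minutes x seconds cross join:
# one row per second of the day.
base_rows = 24 * 60 * 60

# Each explode(sequence(...)) emits one row per sequence element for every
# input row, so it multiplies the row count by the sequence length.
after_hour_explode = base_rows * 24                 # view 'time1'
after_minute_explode = after_hour_explode * 60      # view 'time2'
after_second_explode = after_minute_explode * 60    # view 'time3'

print(f"cross join only:      {base_rows:,}")
print(f"after Hour explode:   {after_hour_explode:,}")
print(f"after Minute explode: {after_minute_explode:,}")
print(f"after Second explode: {after_second_explode:,}")
```

With the fix, the table stays at 86,400 rows because Hour, Minute and Second are cast from the strings already on each row. In the original, the later row_number() over (order by Time) has no PARTITION BY, so it presumably had to sort the fully multiplied row set in a single partition, which would add to the delay.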
I have a CSV file and I need to check all columns for ? and remove the rows that contain it.
Below is an example:
Column 1   Column 2   Column 3
1          ?          3
2          ?..        1
?          2          ?.
?          4          4
I tried the code below, but it does not work:
data = readData("text.csv")
print(data)

def Filter(string, substr):
    return [str for str in string if
            any(sub not in str for sub in substr)]

string = data
substr = ['?', '?.', '? ', '? ']
filter_data = Filter(string, substr)
My code to get the output as tuples is below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def readData(filename):
    data = pd.read_csv(filename, skipinitialspace=True)
    return [d for d in data.itertuples(index=False, name=None)]
data = readData("problem2.csv")
print(data)
[('18.0', 8, '307.0 ', '130.0 ', '3504.', '12.0', 70, 1, 'chevrolet chevelle malibu'), ('15.0', 8, '350.0 ', '165.0 ', '3693.', '11.5', 70, 1, 'buick skylark 320'), ('18.0', 8, '318.0 ', '150.0 ', '?.', '11.0', 70, 1, 'plymouth satellite'), ('16.0', 8, '304.0 ', '150.0 ', '3433.', '12.0', 70, 1, 'amc rebel sst'), ('17.0', 8, '302.0 ', '140.0 ', '3449.', '10.5', 70, 1, 'ford torino'), ('15.0', 8, '429.0 ', '198.0 ', '4341.', '10.0', 70, 1, 'ford galaxie 500'), ('14.0', 8, '454.0 ', '220.0 ', '4354.', '9.0', 70, 1, 'chevrolet impala'), ('14.0', 8, '440.0 ', '215.0 ', '4312.', '8.5', 70, 1, 'plymouth fury iii'),
Next, I want to remove the rows that contain '?' in any column and get the output in the same tuple format.
My input file is as follows:
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,150,?,15,70,1,ford torino
18,8,310,150,? ,16,70,1,ford galaxy
18,8,310,150, ?,17,70,1,pontiac catalina
18,8,310,150,3505,18,70,1,ford maverick
The code to replace any of the following occurrences ['?','?.',' ?','? '] is as follows:
import csv
qs = ['?','?.',' ?','? ']
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        print(row)
The output of this will be as follows:
['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '150', '', '15', '70', '1', 'ford torino']
['18', '8', '310', '150', '', '16', '70', '1', 'ford galaxy']
['18', '8', '310', '150', '', '17', '70', '1', 'pontiac catalina']
['18', '8', '310', '150', '3505', '18', '70', '1', 'ford maverick']
As you can see, the ? values in rows 3 through 6 were replaced with ''.
Ran with one more sample dataset:
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,?,3506,15,70,1,ford torino
18,8,310,160,? ,16,70,1,ford galaxy
18,8,311,170,3508, ?,70,1,pontiac catalina
18,8,312,180,3509,18,70,1,ford maverick
Output is:
['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '', '3506', '15', '70', '1', 'ford torino']
['18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy']
['18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina']
['18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick']
In this scenario, the ? is on various columns. It still addresses the problem.
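As a variation on the approach above, the whitespace variants do not have to be listed one by one if each cell is stripped before it is compared. This is a minimal sketch under some assumptions: the sample data is copied from the second dataset, io.StringIO stands in for the file, and MISSING and blank_missing are names made up for this illustration.

```python
import csv
import io

# Sample copied from the second dataset; io.StringIO stands in for 'abc.txt'.
sample = """mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,?,3506,15,70,1,ford torino
18,8,310,160,? ,16,70,1,ford galaxy
18,8,311,170,3508, ?,70,1,pontiac catalina
18,8,312,180,3509,18,70,1,ford maverick
"""

# Markers to treat as missing once surrounding whitespace is stripped.
MISSING = {'?', '?.'}

def blank_missing(row):
    """Return the row with every missing-value marker replaced by ''."""
    return ['' if cell.strip() in MISSING else cell for cell in row]

rows = [blank_missing(row) for row in csv.reader(io.StringIO(sample))]
for row in rows:
    print(row)
```

Comparing whole cells, rather than replacing substrings, leaves a ? untouched when it appears inside a longer value such as a name.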
In case you are looking for all the rows in one go, you can read all the lines into one variable and process it.
qs = {'?.':'',' ?':'','? ':'','?':''}
with open('abc.txt') as csv_file:
    lines = csv_file.readlines()
for i, text in enumerate(lines):
    # Apply each replacement in order; '?.' must come before the bare '?'
    for a, b in qs.items():
        text = text.replace(a, b)
    lines[i] = text
print(lines)
Your output data will be as follows:
['mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name\n', '18,8,307,130,3504,12,70,1,chevy malibu\n', '18,8,308,140,,14,70,1,plymoth satellite\n', '18,8,309,,3506,15,70,1,ford torino\n', '18,8,310,160,,16,70,1,ford galaxy\n', '18,8,311,170,3508,,70,1,pontiac catalina\n', '18,8,312,180,3509,18,70,1,ford maverick\n']
tuple output
Looks like you are expecting tuples as output.
Here's the code to do it:
import csv
qs = {'?.':'',' ?':'','? ':'','?':''}
final_list = []
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        final_list.append(tuple(row))
print(final_list)
The output will be as follows:
[('mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name'), ('18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu'), ('18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite'), ('18', '8', '309', '', '3506', '15', '70', '1', 'ford torino'), ('18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy'), ('18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina'), ('18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick')]
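The code above blanks the ? values but keeps every row. If the goal is to drop those rows entirely, as the question asks, one possible sketch is below. It assumes the first input file from above (inlined here through io.StringIO), and read_clean_rows and MISSING are hypothetical names chosen for this example.

```python
import csv
import io

# Sample copied from the first input file; io.StringIO stands in for 'abc.txt'.
sample = """mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,150,?,15,70,1,ford torino
18,8,310,150,? ,16,70,1,ford galaxy
18,8,310,150, ?,17,70,1,pontiac catalina
18,8,310,150,3505,18,70,1,ford maverick
"""

# Markers to treat as missing once surrounding whitespace is stripped.
MISSING = {'?', '?.'}

def read_clean_rows(csv_file):
    """Return (header, rows) as tuples, skipping rows with a missing marker."""
    reader = csv.reader(csv_file)
    header = tuple(next(reader))
    kept = [tuple(row) for row in reader
            if not any(cell.strip() in MISSING for cell in row)]
    return header, kept

header, kept = read_clean_rows(io.StringIO(sample))
print(kept)
```

With a real file, the call would be wrapped in `with open('abc.txt', newline='') as f:` and `f` passed in place of the StringIO object. The values stay as strings here; converting them to numbers is a separate step.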