Spark timeout on writing to parquet


I get a timeout when running this notebook in Databricks. The last step, writing to Parquet, runs for about 15-18 minutes before the timeout error occurs. I'm not sure where it goes wrong.
from pyspark.sql.functions import explode, sequence
# Create hours string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23')) as hh").createOrReplaceTempView('hours')
# Create minutes string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59')) as mm").createOrReplaceTempView('minutes')
# Create seconds string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59')) as ss").createOrReplaceTempView('seconds')
# Create Time string, add hour, minute, second
spark.sql(f"select CAST(CONCAT(hours.hh, ':', minutes.mm, ':', seconds.ss) as string) as Time, explode(sequence(0,23,1)) as Hour from hours cross join minutes cross join seconds ").createOrReplaceTempView('time1')
spark.sql(f"select *, explode(sequence(0,59,1)) as Minute from time1").createOrReplaceTempView('time2')
spark.sql(f"select *, explode(sequence(0,59,1)) as Second from time2").createOrReplaceTempView('time3')
# Add TimeID
spark.sql(f"select row_number() over (order by TIME) as TimeID, * from time3").createOrReplaceTempView('src')
# Add HourDescription
spark.sql(f"select *, CONCAT(CASE date_part('HOUR', Time) WHEN 0 THEN '00' WHEN 1 THEN '01' WHEN 2 THEN '02' WHEN 3 THEN '03' WHEN 4 THEN '04' WHEN 5 THEN '05' WHEN 6 THEN '06' WHEN 7 THEN '07' WHEN 8 THEN '08' WHEN 9 THEN '09' END, ':00') as HourDescription from src").createOrReplaceTempView('src1')
# Add HourBucket
spark.sql(f"select *, CONCAT(HourDescription, ' - ', CONCAT(CASE date_part('HOUR', Time) WHEN 0 THEN '01' WHEN 1 THEN '02' WHEN 2 THEN '03' WHEN 3 THEN '04' WHEN 4 THEN '05' WHEN 5 THEN '06' WHEN 6 THEN '07' WHEN 7 THEN '08' WHEN 8 THEN '09' WHEN 9 THEN '10' WHEN 10 THEN '11' WHEN 11 THEN '12' WHEN 12 THEN '13' WHEN 13 THEN '14' WHEN 14 THEN '15' WHEN 15 THEN '16' WHEN 16 THEN '17' WHEN 17 THEN '18' WHEN 18 THEN '19' WHEN 19 THEN '20' WHEN 20 THEN '21' WHEN 21 THEN '22' WHEN 22 THEN '23' WHEN 23 THEN '00' END, ':00')) as HourBucket from src1").createOrReplaceTempView('src2')
# Add DayPart
spark.sql(f"select *, CASE WHEN (Hour >= 0 AND Hour < 6) THEN 'Night' WHEN (Hour >= 6 AND Hour < 12) THEN 'Morning' WHEN (Hour >= 12 AND Hour < 18) THEN 'Afternoon' ELSE 'Evening' END as DayPart FROM src2").createOrReplaceTempView('src3')
# Add BusinessHour
spark.sql(f"select *, CASE WHEN (Hour >= 8 AND Hour < 18) THEN 'Yes' ELSE 'No' END as BusinessHour FROM src3").createOrReplaceTempView('src_final')
#Write to Parquet
df = sqlContext.sql("select * from src_final");
df.write.parquet("/mnt/xxx/xx/xxx/")

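I figured it out. The explode(sequence(...)) calls were the expensive part, especially once the one for minutes came in. Each explode multiplies the row count instead of just adding a column: the cross join already yields 24 * 60 * 60 = 86,400 rows, and the three explodes multiply that by 24, 60 and 60 again, which comes to roughly 7.5 billion rows by the time of the write. I fixed it by deriving Hour, Minute and Second directly from the joined values: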
# Create Time string, add hour, minute, second
spark.sql(f"select CAST(CONCAT(hours.hh, ':', minutes.mm, ':', seconds.ss) as string) as Time, cast(hours.hh as int) as Hour, cast(minutes.mm as int) as Minute, cast(seconds.ss as int) as Second from hours cross join minutes cross join seconds ").createOrReplaceTempView('time')
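To make the row-count argument concrete, here is a small plain-Python sketch (not Spark code). It builds one row per second of the day with the hour, minute and second taken directly from the generated values, and separately computes how many rows the three chained explodes would have produced. The function names build_time_rows and exploded_row_count are made up for this illustration.

```python
from itertools import product


def build_time_rows():
    """Return one (time_string, hour, minute, second) tuple per second of a day."""
    rows = []
    for hour, minute, second in product(range(24), range(60), range(60)):
        time_string = f"{hour:02d}:{minute:02d}:{second:02d}"
        rows.append((time_string, hour, minute, second))
    return rows


def exploded_row_count(base_rows, sequence_lengths):
    """Row count after exploding one sequence per entry in sequence_lengths."""
    count = base_rows
    for length in sequence_lengths:
        # Every existing row is repeated once per element of the sequence.
        count *= length
    return count


rows = build_time_rows()
print(len(rows))                                    # rows after the fix
print(exploded_row_count(len(rows), [24, 60, 60]))  # rows in the original version
```

The first number is the size of the table that actually needs to be written; the second is what the original notebook was asking Spark to sort and write.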

Related

How can I determine the user who provided the correct arguments when using the command?

I'm making a roulette game, which everyone probably knows.
Problem:
The command takes arguments that must be given correctly, and I need to get the user who gave those arguments correctly.
Question:
How can I do it?
I haven't tried anything yet because I don't know how to approach it :) I hope you can help, thanks! Code below:
@commands.command(brief = '''
Command usage:
Bet on a number: JM!wheel number (number) (bet)
Bet on a color: JM!wheel color (red or black) (bet)
Bet on even/odd: JM!wheel vs (even or odd) (bet)''')
async def wheel(self, ctx, mode = None, value = None, bet = None):
    # Keep the spin result as a string so it can be compared with the string lists and arguments
    result = str(random.randint(0, 36))
    numbers_red = ['1', '3', '5', '7', '9', '12', '14', '16', '18', '19', '21', '23', '25', '27', '30', '32', '34',
                   '36']
    numbers_black = ['2', '4', '6', '8', '10', '11', '13', '15', '17', '20', '22', '24', '26', '28', '29', '31', '33',
                     '35']
    numbers_even = ['2', '4', '6', '8', '10', '12', '14', '16', '18', '20', '22', '24', '26', '28', '30', '32', '34', '36']
    numbers_odd = ['1', '3', '5', '7', '9', '11', '13', '15', '17', '19', '21', '23', '25', '27', '29', '31', '33', '35']
    if mode is None and value is None and bet is None:
        await ctx.send('''
Command usage:
Bet on a number: JM!wheel number (number) (bet)
Bet on a color: JM!wheel color (red or black) (bet)
Bet on even/odd: JM!wheel vs (even or odd) (bet)''')
    if mode and value and bet:
        if mode == 'color':
            if value == 'red' and result in numbers_red:
                pass
            elif value == 'black' and result in numbers_black:
                pass
            elif value == 'green' and result == '0':
                pass
        elif mode == 'number':
            if value == result:
                pass
        elif mode == 'vs':
            if value == 'odd' and result in numbers_odd:
                pass
            elif value == 'even' and result in numbers_even:
                pass

Maybe you need this:
user = ctx.message.author
That gives you the user who invoked the command (ctx.author is a shorthand for the same thing).
I'm not sure if this is what you were asking for...
Otherwise, if you need the author of some other message, you can fetch that message by its ID and then read its author:
message = await ctx.channel.fetch_message(ID)
user = message.author
Hope this was useful.
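One way to tie this together is to move the win check into a plain function and only use ctx.author once the check passes. The sketch below is only an illustration: bet_wins, RED and BLACK are hypothetical names, the spin result is assumed to be an int, and zero is assumed to lose every color and even/odd bet.

```python
RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
BLACK = set(range(1, 37)) - RED  # every non-zero pocket that is not red


def bet_wins(mode, value, result):
    """Return True if the bet (mode, value) wins for the spun number `result`."""
    if mode == "number":
        # value arrives as text from the command, so compare as integers
        return value.isdigit() and int(value) == result
    if mode == "color":
        return (value == "red" and result in RED) or (value == "black" and result in BLACK)
    if mode == "vs":
        if result == 0:
            return False  # assumption: zero counts as neither even nor odd
        return (value == "even" and result % 2 == 0) or (value == "odd" and result % 2 == 1)
    return False


print(bet_wins("color", "red", 1))
print(bet_wins("vs", "even", 0))
```

Inside the command you could then write something like `if bet_wins(mode, value, result): await ctx.send(f"{ctx.author.mention} wins")`, so the winner is simply whoever invoked the command with a winning bet.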

How to filter string from all column from csv file using python

I have a CSV file and I need to check every column for ? and remove the rows that contain it.
Below is an example:
Column 1  Column 2  Column 3
1         ?         3
2         ?..       1
?         2         ?.
?         4         4
I tried the following, but it does not work:
data = readData("text.csv")
print(data)
def Filter(string, substr):
    return [str for str in string if
            any(sub not in str for sub in substr)]
string = data
substr = ['?', '?.', '? ', '? ']
filter_data = Filter(string, substr)
My code to get the output as tuples is below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def readData(filename):
    data = pd.read_csv(filename, skipinitialspace=True)
    return [d for d in data.itertuples(index=False, name=None)]
data = readData("problem2.csv")
print(data)
[('18.0', 8, '307.0 ', '130.0 ', '3504.', '12.0', 70, 1, 'chevrolet chevelle malibu'), ('15.0', 8, '350.0 ', '165.0 ', '3693.', '11.5', 70, 1, 'buick skylark 320'), ('18.0', 8, '318.0 ', '150.0 ', '?.', '11.0', 70, 1, 'plymouth satellite'), ('16.0', 8, '304.0 ', '150.0 ', '3433.', '12.0', 70, 1, 'amc rebel sst'), ('17.0', 8, '302.0 ', '140.0 ', '3449.', '10.5', 70, 1, 'ford torino'), ('15.0', 8, '429.0 ', '198.0 ', '4341.', '10.0', 70, 1, 'ford galaxie 500'), ('14.0', 8, '454.0 ', '220.0 ', '4354.', '9.0', 70, 1, 'chevrolet impala'), ('14.0', 8, '440.0 ', '215.0 ', '4312.', '8.5', 70, 1, 'plymouth fury iii'),
Next, I want to remove the rows that contain '?' in any column and get the output in the same tuple form.

My input file is as follows:
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,150,?,15,70,1,ford torino
18,8,310,150,? ,16,70,1,ford galaxy
18,8,310,150, ?,17,70,1,pontiac catalina
18,8,310,150,3505,18,70,1,ford maverick
The code to replace any of the following occurrences ['?','?.',' ?','? '] is as follows:
import csv
qs = ['?','?.',' ?','? ']
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        print(row)
The output of this will be as follows:
['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '150', '', '15', '70', '1', 'ford torino']
['18', '8', '310', '150', '', '16', '70', '1', 'ford galaxy']
['18', '8', '310', '150', '', '17', '70', '1', 'pontiac catalina']
['18', '8', '310', '150', '3505', '18', '70', '1', 'ford maverick']
As you can see, the ? values in rows 3 through 6 of the output were replaced with ''.
Ran with one more sample dataset:
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,?,3506,15,70,1,ford torino
18,8,310,160,? ,16,70,1,ford galaxy
18,8,311,170,3508, ?,70,1,pontiac catalina
18,8,312,180,3509,18,70,1,ford maverick
Output is:
['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '', '3506', '15', '70', '1', 'ford torino']
['18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy']
['18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina']
['18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick']
In this scenario the ? appears in various columns, and the code still handles it.
If you want to process all the rows in one go, you can read all the lines into one variable and process that.
qs = {'?.':'',' ?':'','? ':'','?':''}
with open('abc.txt') as csv_file:
    lines = csv_file.readlines()
for i, text in enumerate(lines):
    # Apply the replacements in order; the longer patterns come before the bare '?'
    for a, b in qs.items():
        text = text.replace(a, b)
    lines[i] = text
print(lines)
Your output data will be as follows:
['mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name\n', '18,8,307,130,3504,12,70,1,chevy malibu\n', '18,8,308,140,,14,70,1,plymoth satellite\n', '18,8,309,,3506,15,70,1,ford torino\n', '18,8,310,160,,16,70,1,ford galaxy\n', '18,8,311,170,3508,,70,1,pontiac catalina\n', '18,8,312,180,3509,18,70,1,ford maverick\n']
Tuple output
It looks like you are expecting tuples as output.
Here's the code to do it:
import csv
qs = {'?.':'',' ?':'','? ':'','?':''}
final_list = []
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        final_list.append(tuple(row))
print(final_list)
The output will be as follows:
[('mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name'), ('18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu'), ('18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite'), ('18', '8', '309', '', '3506', '15', '70', '1', 'ford torino'), ('18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy'), ('18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina'), ('18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick')]
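The snippets above blank out the ? values. If the goal is to drop those rows entirely, as the question asks, a variation along these lines might work. It is a sketch under one assumption: any field that contains a ? marks the row as bad (so a car name containing ? would also be dropped). The function name rows_without_question_marks is made up, and io.StringIO stands in for the real file so the example is self-contained.

```python
import csv
import io


def rows_without_question_marks(csv_file):
    """Return (header, rows) as tuples, skipping rows where any field contains '?'."""
    reader = csv.reader(csv_file, skipinitialspace=True)
    header = next(reader)
    kept = [tuple(row) for row in reader
            if not any("?" in field for field in row)]
    return tuple(header), kept


# A few lines from the sample data above, held in memory instead of a file.
sample = io.StringIO(
    "mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name\n"
    "18,8,307,130,3504,12,70,1,chevy malibu\n"
    "18,8,308,140,?.,14,70,1,plymoth satellite\n"
    "18,8,309,?,3506,15,70,1,ford torino\n"
    "18,8,312,180,3509,18,70,1,ford maverick\n"
)
header, rows = rows_without_question_marks(sample)
print(rows)
```

With a real file you would pass the object returned by open('abc.txt', newline='') instead of the StringIO.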

How to sort dictionary of list based on certain column?

I have a table stored in a dictionary.
The dictionary looks like this:
d = {
'A': ['45', '70', '5', '88', '93', '79', '87', '69'],
'B': ['99', '18', '91', '3', '92', '2', '67', '15'],
'C': ['199200128', '889172415', '221388292', '199200128', '889172415', '889172415', '199200128', '221388292'],
'D': ['10:27:05', '07:10:29', '17:04:48', '10:25:42', '07:11:18', '07:11:37', '10:38:11', '17:08:55'],
'E': ['73', '6', '95', '21', '29', '15', '99', '9']
}
I'd like to sort the dictionary by the times in column D from lowest to highest, and sum columns A, B and E for the rows that share the same value in column C, as in the image below (where the sums of A, B and E are in red):
Then, the resulting dictionary would look like this:
{
'A': ['70', '93', '79', '242', '88', '45', '133', '87', '5', '69', '161'],
'B': ['18', '92', '2', '112', '3', '99', '102', '67', '91', '15', '173'],
'C': ['889172415', '889172415', '889172415', '', '199200128', '199200128', '', '199200128', '221388292', '221388292', ''],
'D': ['07:10:29', '07:11:18', '07:11:37', '', '10:25:42', '10:27:05', '', '10:38:11', '17:04:48', '17:08:55', ''],
'E': ['6', '29', '15', '50', '21', '73', '94', '99', '95', '9', '203']
}
I am currently trying to sort the input dictionary with this code, but it doesn't seem to work for me.
>>> sorted(d.items(), key=lambda e: e[1][4])
[
('D', ['10:27:05', '07:10:29', '17:04:48', '10:25:42', '07:11:18', '07:11:37', '10:38:11', '17:08:55']),
('E', ['73', '6', '95', '21', '29', '15', '99', '9']),
('C', ['199200128', '889172415', '221388292', '199200128', '889172415', '889172415', '199200128', '221388292']),
('B', ['99', '18', '91', '3', '92', '2', '67', '15']),
('A', ['45', '70', '5', '88', '93', '79', '87', '69'])
]
>>>
Could someone help with this? Thanks.

Are you allowed to use pandas for this task?
If so, you can convert your data to a pd.DataFrame object and sort it by column D:
import pandas as pd
data = pd.DataFrame.from_dict(d, orient='columns')
data = data.sort_values(by="D")
Then convert it back to a dictionary of lists using
_dict = data.to_dict(orient='list')
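If pandas is not an option, the sorting part can also be done with the standard library by turning the columns into rows, sorting the rows, and turning them back into columns. This sketch covers only the sort by column D, not the subtotals; sort_columns_by is a hypothetical helper name. It relies on the times being zero-padded HH:MM:SS strings, so that plain string comparison matches chronological order.

```python
def sort_columns_by(table, key_column):
    """Sort a dict of equal-length lists by the values in one of its columns."""
    columns = list(table)
    key_index = columns.index(key_column)
    # zip(*...) turns the column lists into row tuples
    rows = sorted(zip(*(table[c] for c in columns)), key=lambda row: row[key_index])
    # rebuild one list per column from the sorted rows
    return {c: [row[i] for row in rows] for i, c in enumerate(columns)}


d = {
    'A': ['45', '70', '5', '88', '93', '79', '87', '69'],
    'B': ['99', '18', '91', '3', '92', '2', '67', '15'],
    'C': ['199200128', '889172415', '221388292', '199200128',
          '889172415', '889172415', '199200128', '221388292'],
    'D': ['10:27:05', '07:10:29', '17:04:48', '10:25:42',
          '07:11:18', '07:11:37', '10:38:11', '17:08:55'],
    'E': ['73', '6', '95', '21', '29', '15', '99', '9'],
}
sorted_d = sort_columns_by(d, 'D')
print(sorted_d['D'])
print(sorted_d['A'])
```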

getting different parts in a list while skipping some objects

I have created a list of dates and numbers from a text file that contains lotto results. I am trying to access certain parts of the list to create separate lists of the drawn numbers, as I would later like to do some statistical work on them, i.e. how many times a number appears across the different lists, and so show the numbers by popularity.
I am still new to Python and thought this would be a good project for testing what I know so far and for continuing to work on as I get better with Python.
My list below has the following layout: 'Day', 'Month', 'Year', 'num1', 'num2', 'num3', 'num4', 'num5', repeated for a number of days.
['03', 'March', '2020', '33', '16', '18', '10', '04', '02', 'March', '2020', '14', '13', '34', '04', '20', '01', 'March', '2020', '10', '08', '15', '02', '23', '29', 'February', '2020', '16', '28', '20', '07', '35', '28', 'February', '2020', '31', '35', '10', '30', '29', '27', 'February', '2020', '25', '26', '05', '03', '19', '26', 'February', '2020', '33', '21', '29', '11', '32', '25', 'February', '2020', '10', '19', '13', '05', '08', '24', 'February', '2020', '14', '29', '33', '31', '09', '23', 'February', '2020', '04', '27', '05', '11', '12', '22', 'February', '2020', '18', '05', '27', '34', '20', '21', 'February', '2020', '29', '10', '15', '25', '12', '20', 'February', '2020', '33', '03', '12', '27', '05', '19', 'February', '2020', '06', '14', '26', '04', '29', '18', 'February', '2020', '07', '08', '23', '32', '30', '17', 'February', '2020', '05', '32', '22', '21', '19', '16']
Here is the code I have used so far. Done this way it will take forever to get what I want, considering I would like to add more data to get better statistics.
# Read the text file and convert its contents into a list
with open('results.txt') as f:
    line = f.read()
a = line.split()
# Skip the day, month and year to return only the five numbers
list_1 = a[3:8]
list_2 = a[11:16]
print(a)
print(list_1)
print(list_2)
Any suggestions would be appreciated, as I would like to learn more and understand where I might improve my idea and make life easier. Maybe it is a somewhat difficult project to start with, but I'm thinking long term about where I am going with it... Lol

Here's a way to put it neatly into a list of lists.
range takes three arguments: the start point, the stop point, and the amount to step by. By stepping to the position of every date (every 8 items), you can slice out the section of the list you want.
import pprint
all_numbers = []
for i in range(0, len(a), 8):
    # Skip a trailing incomplete record that has no numbers
    if len(a[i + 3:i + 8]):
        all_numbers.append(a[i + 3:i + 8])
pprint.pprint(all_numbers)
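For the popularity statistics mentioned in the question, collections.Counter from the standard library can count how often each number appears across all draws. A minimal sketch, using only the first three draws from the list above as sample data (the variable names draws and counts are just for this example):

```python
from collections import Counter

# First three records from the question: day, month, year, then five numbers.
a = ['03', 'March', '2020', '33', '16', '18', '10', '04',
     '02', 'March', '2020', '14', '13', '34', '04', '20',
     '01', 'March', '2020', '10', '08', '15', '02', '23']

# One list of five numbers per draw; incomplete trailing records are skipped.
draws = [a[i + 3:i + 8] for i in range(0, len(a), 8) if len(a[i + 3:i + 8]) == 5]

# Count every number across all draws.
counts = Counter(number for draw in draws for number in draw)

# The most frequently drawn numbers first.
print(counts.most_common(3))
```

Here the numbers stay as strings such as '04'; convert them with int() first if you want to do arithmetic on them.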

How can I parse the table in this page?

I want to parse the table with id=standings-16548-grid and class=grid with-centered-columns hover. Unfortunately, when I try, the output looks as if the tr elements are completely empty. Since I'm new to this language, I was wondering if I'm missing something.
Afterwards I'll also scrape the data from the "form" sheet, not only from the "standings" sheet, but I'm trying to take one step at a time.
Below you can find my code.
I already tried using Selenium to open the page in Firefox. Then I tried to click the button that appears as soon as you open the page, in order to continue using the website. Finally, using BeautifulSoup, I tried to parse the table by specifying its ID.
I am using Python 3.7.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
driver = webdriver.Firefox(executable_path='/Applications/Python3.7/geckodriver')
driver.get('https://www.whoscored.com/Regions/108/Tournaments/5/Italy-Serie-A')
driver.implicitly_wait(20)
myDynamicElement = driver.find_element(By.XPATH, "/html/body/div[9]/div[1]/div/div/div[3]/button").click()
source = driver.execute_script("return document.documentElement.outerHTML")
soup = BeautifulSoup(source, 'lxml')
driver.quit()
table = soup.find('table', {"id":"standings-16548-grid"})
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('tr')
    row = [i.text for i in td]
    print(row)
The output of this code is:
Traceback (most recent call last):
File "/Users/Gina/PycharmProjects/Prova1/DriverProva/SeleniumScrape.py", line 12, in <module>
myDynamicElement = driver.find_element(By.XPATH, "/html/body/div[9]/div[1]/div/div/div[3]/button").click()
File "/Users/Gina/PycharmProjects/Prova1/venv/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Users/Gina/PycharmProjects/Prova1/venv/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
return self._parent.execute(command, params)
File "/Users/Gina/PycharmProjects/Prova1/venv/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/Users/Gina/PycharmProjects/Prova1/venv/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view
Process finished with exit code 1

Try the following code. It returns the expected output.
The error you are getting is:
selenium.common.exceptions.ElementNotInteractableException: Message: Element could not be scrolled into view
To avoid this error, use the JavaScript executor to click the element. I have changed the element's XPath as well.
driver.execute_script("arguments[0].click();",element)
The row loop also has to look up the td cells inside each tr, not tr again.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import time
driver = webdriver.Firefox(executable_path='/Applications/Python3.7/geckodriver')
driver.get('https://www.whoscored.com/Regions/108/Tournaments/5/Italy-Serie-A')
element=WebDriverWait(driver,20).until(ec.element_to_be_clickable((By.XPATH,"//button[contains(.,'Continue Using Site')]")))
driver.execute_script("arguments[0].click();",element)
time.sleep(3)
source=driver.page_source
soup = BeautifulSoup(source, 'lxml')
driver.quit()
table = soup.find('table', {"id":"standings-16548-grid"})
table_rows = table.find_all('tr')
for tr in table_rows[5:]:
    row = [i.text for i in tr.find_all('td')]
    print(row)
Output
['1', 'Juventus', '38', '28', '6', '4', '70', '30', '+40', '90', 'wddldl']
['2', 'Napoli', '38', '24', '7', '7', '74', '36', '+38', '79', 'lwwwwl']
['3', 'Atalanta', '38', '20', '9', '9', '77', '46', '+31', '69', 'wwwwdw']
['4', 'Inter', '38', '20', '9', '9', '57', '33', '+24', '69', 'dddwlw']
['5', 'AC Milan', '38', '19', '11', '8', '55', '36', '+19', '68', 'dlwwww']
['6', 'Roma', '38', '18', '12', '8', '66', '48', '+18', '66', 'dwdwdw']
['7', 'Torino', '38', '16', '15', '7', '52', '37', '+15', '63', 'wwdwlw']
['8', 'Lazio', '38', '17', '8', '13', '56', '46', '+10', '59', 'lwlwdl']
['9', 'Sampdoria', '38', '15', '8', '15', '60', '51', '+9', '53', 'lldldw']
['10', 'Bologna', '38', '11', '11', '16', '48', '56', '-8', '44', 'wwlwdw']
['11', 'Sassuolo', '38', '9', '16', '13', '53', '60', '-7', '43', 'dwdldl']
['12', 'Udinese', '38', '11', '10', '17', '39', '53', '-14', '43', 'dldwww']
['13', 'SPAL 2013', '38', '11', '9', '18', '44', '56', '-12', '42', 'wdwlll']
['14', 'Parma Calcio 1913', '38', '10', '11', '17', '41', '61', '-20', '41', 'dddlwl']
['15', 'Cagliari', '38', '10', '11', '17', '36', '54', '-18', '41', 'wllldl']
['16', 'Fiorentina', '38', '8', '17', '13', '47', '45', '+2', '41', 'llllld']
['17', 'Genoa', '38', '8', '14', '16', '39', '57', '-18', '38', 'lddldd']
['18', 'Empoli', '38', '10', '8', '20', '51', '70', '-19', '38', 'llwwwl']
['19', 'Frosinone', '38', '5', '10', '23', '29', '69', '-40', '25', 'lldlld']
['20', 'Chievo', '38', '2', '14', '22', '25', '75', '-50', '17', 'wdlldd']
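Since every cell comes back as text, a possible next step is to convert each row into a typed record before doing anything with it. The field names below are my reading of the columns (rank, team, played, wins, draws, losses, goals for, goals against, goal difference, points, form); they are assumptions and are not taken from the page itself. Standing and to_standing are hypothetical names.

```python
from typing import NamedTuple


class Standing(NamedTuple):
    rank: int
    team: str
    played: int
    wins: int
    draws: int
    losses: int
    goals_for: int
    goals_against: int
    goal_difference: int
    points: int
    form: str


def to_standing(row):
    """Convert one scraped row (a list of 11 strings) into a Standing."""
    rank, team, *numbers, form = row
    # int() accepts signed strings such as '+40' and '-50'
    return Standing(int(rank), team, *(int(n) for n in numbers), form)


# Two rows copied from the output above.
rows = [
    ['1', 'Juventus', '38', '28', '6', '4', '70', '30', '+40', '90', 'wddldl'],
    ['20', 'Chievo', '38', '2', '14', '22', '25', '75', '-50', '17', 'wdlldd'],
]
standings = [to_standing(row) for row in rows]
print(standings[0].team, standings[0].points)
```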
