How to run the python file in Spark - apache-spark

I have Python code in a .py file and want to know how to run it from its location. I am using Ubuntu. In my code, I fetch JSON from a URL and need to show it as a scatter graph using Spark. I am new to PySpark. Please guide me on how to achieve this. Here is my code:
import multiprocessing
import time
import json
from sseclient import SSEClient as EventSource

# Complete your function here; I just placed the code inside.
# Check it once -- I don't have the package, so you try it.
def func(n):
    # stream recent-change events from Wikimedia and write the raw JSON to a file
    file = open('w.txt', 'w', encoding='utf8')
    url = 'https://stream.wikimedia.org/v2/stream/recentchange'
    for event in EventSource(url):
        if event.event == 'message':
            try:
                change = json.loads(event.data)
            except ValueError:
                pass
            else:
                file.write(str(event.data))

if __name__ == '__main__':
    # start func as a separate process
    p = multiprocessing.Process(target=func, name="func", args=(10,))
    p.start()
    # wait 3 seconds (give your time in secs) for func
    time.sleep(3)
    # terminate func
    p.terminate()
    # cleanup
    p.join()

You have to use the spark-submit command to run your Python script with Spark (from the command-line terminal):
spark-submit /home/sample.py
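If the goal is then to load the streamed data into Spark for plotting, a minimal PySpark sketch could look like the following. This is an assumption-based example: it assumes each JSON event was written on its own line to w.txt, and it only loads the data into a DataFrame; the scatter plot itself would need a plotting library such as matplotlib.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wiki-recentchange").getOrCreate()

# each line of w.txt is assumed to hold one JSON event from the stream
df = spark.read.json("w.txt")
df.printSchema()
df.show(5)

spark.stop()
You would run this script the same way, e.g. spark-submit /home/sample.py.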

Related

torch.distributed.barrier() added on all processes not working

import torch
import os

torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
if local_rank > 0:
    torch.distributed.barrier()
print(f"Entered process {local_rank}")
if local_rank == 0:
    torch.distributed.barrier()
The above code hangs forever, but if I remove both torch.distributed.barrier() calls then both print statements get executed. Am I missing something here?
On the command line I launch the script using torchrun --nnodes=1 --nproc_per_node 2 test.py, where test.py is the name of the script.
I tried the above code with and without the torch.distributed.barrier() calls:
With the barrier() statements, I expected the statement to print for one GPU and then exit -- not as expected.
Without the barrier() statements, I expected both to print -- as expected.
Am I missing something here?
It is better to put your multiprocessing initialization code inside the if __name__ == "__main__": guard to avoid endless process generation, and to redesign the control flow to fit your purpose:
if __name__ == "__main__":
    import torch
    import os

    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    if local_rank > 0:
        torch.distributed.barrier()
    else:
        print(f"Entered process {local_rank}")
        torch.distributed.barrier()
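For reference, here is a minimal sketch of that pattern, assuming the same single-node, two-process torchrun launch as in the question. The key invariant is that every rank reaches the same barrier() calls the same number of times, and each process is pinned to its own GPU.
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin each process to its own GPU

    if local_rank == 0:
        print("rank 0 doing the one-off setup work")
    dist.barrier()  # every rank waits here, so the call counts match
    print(f"rank {local_rank} passed the barrier")

    dist.destroy_process_group()
This would be launched with torchrun --nnodes=1 --nproc_per_node 2 test.py, as in the question.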

Python subprocess for docker-compose

I have an interesting set of requirements that I am trying to meet using the Python subprocess module and docker-compose. This whole setup would be possible in a single docker-compose file, but due to the requirements this is what I would like to set up:
1. Call docker-compose using the Python subprocess module to start the test servers.
2. Print all the stdout of the docker-compose run.
3. As soon as the test server is up and running via docker-compose, call the testing scripts for that server.
This is what my docker-compose.py looks like:
import subprocess
from subprocess import PIPE
import os
from datetime import datetime

class MyLog:
    def my_log(self, message):
        date_now = datetime.today().strftime('%d-%m-%Y %H:%M:%S')
        print("{0} || {1}".format(date_now, message))

class DockercomposeRun:
    log = MyLog()

    def __init__(self):
        dir_name, _ = os.path.split(os.path.abspath(__file__))
        self.dirname = dir_name

    def run_docker_compose(self, filename):
        command_name = ["docker-compose", "-f", self.dirname + filename, "up"]
        popen = subprocess.Popen(command_name, stdin=PIPE, stdout=PIPE, stderr=PIPE, universal_newlines=True)
        return popen
Now, in my test.py, as soon as the stdout goes blank I would like to break out of the printing loop and run the rest of the tests in test.py:
docker_compose_run = DockercomposeRun()
rc = docker_compose_run.run_docker_compose('/docker-compose.yml.sas-viya-1')
for line in iter(rc.stdout.readline, ''):
    print(line, end='')
    if line == '':
        break
rc.stdout.close()
# start here actual test cases
.......
But for me the loop is never broken, even though the stdout of docker-compose goes blank after the server is up and running, and the test cases are never executed.
Is this the right approach, or how else can I achieve this?
I think the issue here is that you are not running docker-compose in detached mode and it is blocking the application run. Can you try adding "-d" to command_name?
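A hedged sketch of that change, plus a simple readiness poll before starting the tests (the host, port, and timeout below are assumptions, not taken from the question):
import socket
import subprocess
import time

def run_docker_compose_detached(compose_file):
    # "-d" makes docker-compose return once the containers are started,
    # so the call no longer blocks while streaming container logs
    command = ["docker-compose", "-f", compose_file, "up", "-d"]
    subprocess.run(command, check=True)

def wait_for_port(host, port, timeout=60):
    # poll the test server's port until it accepts connections or we time out
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False

run_docker_compose_detached("docker-compose.yml")
if wait_for_port("localhost", 8080):  # assumed host/port for the test server
    print("test server is up, starting the test cases")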

Scrapy - run at time interval

I have a spider for crawling a site and I want to run it every 10 minutes. I put it in a Python schedule and ran it. After the first run I got
ReactorNotRestartable
I tried this solution and got an
AttributeError: Can't pickle local object 'run_spider.<locals>.f'
error.
edit:
Trying how-to-schedule-scrapy-crawl-execution-programmatically, the Python program runs without error and the crawl function runs every 30 seconds, but the spider doesn't run and I don't get data.
def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(DivarSpider)
            #deferred.addBoth(lambda _: reactor.stop())
            #reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    runner = crawler.CrawlerRunner()
    deferred = runner.crawl(DivarSpider)
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result
The multiprocessing solution is a gross hack to work around a lack of understanding of how Scrapy and reactor management work. You can get rid of it, and everything becomes much simpler.
from twisted.internet.task import LoopingCall
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from yourlib import YourSpider

configure_logging()
runner = CrawlerRunner()
task = LoopingCall(lambda: runner.crawl(YourSpider))
task.start(60 * 10)
reactor.run()
The easiest way I know to do it is to use a separate script to call the script containing your Twisted reactor, like this:
cmd = ['python3', 'auto_crawl.py']
subprocess.Popen(cmd).wait()
To run your CrawlerRunner every 10 minutes, you could use a loop or a crontab entry around this script.
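A hedged sketch of such a wrapper (auto_crawl.py and the 10-minute interval come from this answer; the looping itself is one possible approach, crontab being the other):
import subprocess
import time

# re-run the crawl script in a fresh interpreter every 10 minutes,
# so the Twisted reactor always starts in a new process
while True:
    cmd = ["python3", "auto_crawl.py"]
    subprocess.Popen(cmd).wait()
    time.sleep(60 * 10)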

Multiprocess in Python while reading a file line by line is 4000x slower than uniprocess, what is wrong?

I need to read a large (10GB+) file line by line and process each line. The processing is fairly simple, so it seemed that multiprocessing was the way to go. However, when I set it up, it is much, much slower than running things linearly. My CPU usage never goes above 50%, so it's not a processing power issue.
I'm running Python 3.6 in Jupyter Notebook on a Mac.
This is what I have, working from the accepted answer posted here:
from multiprocessing import Manager, Process

def do_work(in_queue, out_list):
    while True:
        line = in_queue.get()
        # exit signal
        if line == None:
            return
        # fake work for testing
        elements = line.split("\t")
        out_list.append(elements)

if __name__ == "__main__":
    num_workers = 4
    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)
    # start the workers
    pool = []
    for i in range(num_workers):
        p = Process(target=do_work, args=(work, results))
        p.start()
        pool.append(p)
    # produce data
    with open(file_on_my_machine, 'rt', newline="\n") as f:
        for line in f:
            work.put(line)
    for p in pool:
        p.join()
    # get the results
    print(sorted(results))

How to emulate terminal interaction like PuTTY using pySerial

I tried to use pySerial to build a simple terminal to interact with COM1 on my PC. I created two threads, one for READ and the other for WRITE. However:
def write_comport():
    global ser, cmd, log_file
    print("enter Q to quit")
    while True:
        cmd = raw_input(">>:")
        # print(repr(var))
        if cmd == 'Q':
            ser.close()
            sys.exit()
            log_file.close()
        else:
            ser.write(cmd + '\r\n')
            write_to_file("[Input]" + cmd, log_file)
            time.sleep(1)
    pass

def read_comport():
    global ser, cmd, log_file
    while True:
        element = ser.readline().strip('\n')
        if "~ #" in str(element):
            continue
        if cmd == str(element).strip():
            continue
        if "img" in str(element):
            print("got:" + element)
            beep()
        print element
        write_to_file(cmd, log_file)
    pass

def write_to_file(str, f):
    f.write(str)
    f.flush

def main():
    try:
        global read_thr, write_thr
        beep()
        port_num = 'COM4'
        baudrate = 115200
        init_serial_port(port_num, baudrate)
        read_thr = Thread(target=read_comport)
        read_thr.start()
        write_thr = Thread(target=write_comport)
        write_thr.start()
        while True:
            pass
    except Exception as e:
        print(e)
        exit_prog()
But the behavior of my code is not as smart as PuTTY or any other terminal, because my functions cannot detect when the reader is done.
Is there a better way to achieve this goal?
By the way, I tried to save the log into a txt file in real time, but when I open the file while the process is running, it seems nothing is written to my text log file.
