I would like to get as much information as possible from within the executor, while it is executing, but can't seem to find any information on how to accomplish that other than by using the Web UI. For example, it would be useful to know which file is being processed by which executor, and when.
I need this flexibility for debugging, but cannot find any information about it.
Thank you
One way to accomplish this is with mapPartitionsWithContext.
Example code:
import org.apache.spark.TaskContext
val a = sc.parallelize(1 to 9, 3)
def myfunc(tc: TaskContext, iter: Iterator[Int]): Iterator[Int] = {
  tc.addOnCompleteCallback(() => println(
    "Partition: " + tc.partitionId +
    ", AttemptID: " + tc.attemptId
  ))
  iter.toList.filter(_ % 2 == 0).iterator
}
a.mapPartitionsWithContext(myfunc)
a.collect
API: https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.TaskContext
However, this does not answer the question about how to see which file was processed, and when.
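As a hedged sketch of one way to get at the file question (not part of the original answer, and assuming DataFrame-based file reads on Spark 2.x or later): pyspark.sql.functions.input_file_name() tags each row with its source file, and TaskContext inside foreachPartition can log which partition and attempt touched which files. The path below is a placeholder.
from pyspark import TaskContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# "/path/to/logs" is a placeholder; replace with your own input location.
df = spark.read.text("/path/to/logs").withColumn("source_file", input_file_name())

def log_partition(rows):
    tc = TaskContext.get()
    files = {row["source_file"] for row in rows}
    # This print lands in the executor's stdout log, visible per task.
    print(f"partition={tc.partitionId()} attempt={tc.taskAttemptId()} files={files}")

df.foreachPartition(log_partition)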
Related
I want to know if it is possible to run a Databricks job from a notebook using code, and how to do it.
I have a job with multiple tasks and many contributors, and a job created to execute it all. Now we want to run the job from a notebook to test new features without creating a new task in the job, and also to run the job multiple times in a loop, for example:
for i in [1, 2, 3]:
    run job with parameter i
Regards
What you need to do is the following:
Install the databricksapi package: %pip install databricksapi==1.8.1
Create your job and return an output. You can do that by exiting the notebook like this:
import json

dbutils.notebook.exit(json.dumps({"result": f"{_result}"}))
If you want to pass a dataframe, you have to pass it as a JSON dump too; there is official documentation about that from Databricks, so check it out.
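As a minimal sketch (assuming a small Spark DataFrame named df inside the job notebook), the JSON-dump idea could look like this; collect() pulls everything to the driver, so keep the result small:
import json

# Hedged sketch: serialize a *small* DataFrame before exiting the notebook.
payload = [row.asDict() for row in df.collect()]
dbutils.notebook.exit(json.dumps({"result": payload}, default=str))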
Get the job id; you will need it later. You can get it from the job's details in Databricks.
In the notebook that triggers the job (the caller), you can use the following code.
import json
import time
from databricksapi import Jobs  # assumed import path for the databricksapi package installed above


def run_ks_job_and_return_output(params):
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    # workspace URL and API token taken from the notebook context
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']

    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs_instance
    runs_job_id = jobs_instance.runJob(****************, 'notebook', params)  # **** is the job id

    # poll until the run shows up in the list of completed runs
    run_is_not_completed = True
    while run_is_not_completed:
        current_run = [run for run in jobs_instance.runsList('completed')['runs']
                       if run['run_id'] == runs_job_id['run_id']
                       and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
            current_run = current_run[0]
            print(f"Result state: {current_run['state']['result_state']}, "
                  f"You can check the resulting output in the following link: {current_run['run_page_url']}")
            note_output = jobs_instance.runsGetOutput(runs_job_id['run_id'])['notebook_output']
            return note_output


run_ks_job_and_return_output({'parm1': 'george',
                              'variable': "values1"})
If you want to run the job many times in parallel, you can do the following (first make sure that you have increased the max concurrent runs in the job settings):
from multiprocessing.pool import ThreadPool

pool = ThreadPool(1000)
results = pool.map(lambda j: run_ks_job_and_return_output({'table': 'george',
                                                           'variable': "values1",
                                                           'j': j}),
                   [str(x) for x in range(2, len(snapshots_list))])
There is also the possibility of saving the whole HTML output, but maybe you are not interested in that. In any case, I will answer that in another post on Stack Overflow.
Hope it helps.
You can use the following steps:
Note-01:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")
result = dbutils.widgets.get("foo")+"-"+dbutils.widgets.get("foo2")
def display():
    print("Function Display: " + result)

dbutils.notebook.exit(result)
Note-02:
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    dbutils.notebook.run("Note-01 path", 60, {"foo": x, "foo2": 'Azure'})
I need to know, programmatically in PySpark, what the log level is.
I know I can set it, by doing:
# spark is a SparkSession object
spark.sparkContext.setLogLevel(log_level)
But there is no equivalent method for retrieving the log level.
Any ideas? Thanks!
I finally came up with a solution, by accessing the Spark session's JVM (py4j underneath):
def get_log_level(spark):
    log_manager = spark._jvm.org.apache.log4j.LogManager
    trace = spark._jvm.org.apache.log4j.Level.TRACE
    debug = spark._jvm.org.apache.log4j.Level.DEBUG
    info = spark._jvm.org.apache.log4j.Level.INFO
    warn = spark._jvm.org.apache.log4j.Level.WARN
    error = spark._jvm.org.apache.log4j.Level.ERROR
    fatal = spark._jvm.org.apache.log4j.Level.FATAL
    logger = log_manager.getRootLogger()

    if logger.isEnabledFor(trace):
        return "TRACE"
    elif logger.isEnabledFor(debug):
        return "DEBUG"
    elif logger.isEnabledFor(info):
        return "INFO"
    elif logger.isEnabledFor(warn):
        return "WARN"
    elif logger.isEnabledFor(error):
        return "ERROR"
    elif logger.isEnabledFor(fatal):
        return "FATAL"
    else:
        return None
Most probably there is a better way to do it.
This will return the effective log level of the root logger in your Spark session:
log_manager = spark._jvm.org.apache.log4j.LogManager
logger = log_manager.getRootLogger().getEffectiveLevel()
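If you need the level as a plain string (for example, to compare it against "INFO"), note that the object returned above is a py4j handle to an org.apache.log4j.Level, so its toString() gives the familiar name; a small usage sketch assuming the log4j 1.x bindings shipped with Spark 2.x:
# `logger` above is actually a py4j handle to an org.apache.log4j.Level object
level_name = logger.toString()
print(level_name)  # e.g. "INFO"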
Spark is Open Source, right?
The source code will show you many things that are not in the documentation. And the unit tests will give you hints about things not covered in tutorials.
Demo: browse the Spark project on GitHub and search for setLogLevel.
OK, GitHub's internal search is usually not great, but for a single specific keyword it is worth trying. And indeed, the very first result gives you this interesting snippet from a unit test (here pinned to branch 2.4):
https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
val originalLevel = org.apache.log4j.Logger.getRootLogger().getLevel
try {
  // Avoid outputting a lot of expected warning logs
  spark.sparkContext.setLogLevel("error")
  ...
} finally {
  spark.sparkContext.setLogLevel(originalLevel.toString)
  ...
}
So the setLogLevel method appears to be a (very) thin wrapper around the Log4J API.
And that's exactly what it is:
https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/SparkContext.scala
def setLogLevel(logLevel: String) {
  ...
  Utils.setLogLevel(org.apache.log4j.Level.toLevel(upperCased))
}
https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/util/Utils.scala
def setLogLevel(l: org.apache.log4j.Level) {
  org.apache.log4j.Logger.getRootLogger().setLevel(l)
}
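Translated to PySpark, the same save-change-restore pattern from the unit test looks roughly like this (a sketch assuming the log4j 1.x bindings used by Spark 2.x):
# Remember the current root-logger level, lower the verbosity for a noisy
# block of work, then restore the original level afterwards.
original_level = spark._jvm.org.apache.log4j.Logger.getRootLogger().getLevel()
try:
    spark.sparkContext.setLogLevel("ERROR")
    # ... noisy work goes here ...
finally:
    spark.sparkContext.setLogLevel(original_level.toString())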
Basically I'm trying to create a multiple-choice test that uses information stored inside lists to change the questions/answers by position.
So far I have this:
import random
DATASETS = [["You first enter the car", "You start the car","You reverse","You turn",
"Coming to a yellow light","You get cut off","You run over a person","You have to stop short",
"in a high speed chase","in a stolen car","A light is broken","The car next to you breaks down",
"You get a text message","You get a call","Your out of gas","Late for work","Driving angry",
"Someone flips you the bird","Your speedometer stops working","Drinking"],
["Put on seat belt","Check your mirrors","Look over your shoulder","Use your turn signal",
"Slow to a safe stop","Relax and dont get upset","Call 911", "Thank your brakes for working",
"Pull over and give up","Ask to get out","Get it fixed","Offer help","Ignore it","Ignore it",
"Get gas... duh","Drive the speed limit","Don't do it","Smile and wave","Get it fixed","Don't do it"],
[''] * 20,
['B','D','A','A','C','A','B','A','C','D','B','C','D','A','D','C','C','B','D','A'],
[''] * 20]
def main():
    questions(0)
    answers(1)

def questions(pos):
    for words in range(len(DATASETS[0])):
        DATASETS[2][words] = input("\n" + str(words + 1) + ".)What is the proper procedure when %s" % DATASETS[0][words] +
                                   '\nA.)' + random.choice(DATASETS[1]) + '\nB.)%s' % DATASETS[1][words] + '\nC.)'
                                   + random.choice(DATASETS[1]) + '\nD.)' + random.choice(DATASETS[1]) +
                                   "\nChoose your answer carefully: ")

def answers(pos):
    for words in range(len(DATASETS[0])):
        DATASETS[4] = list(x == y for x, y in zip(DATASETS[2], DATASETS[3]))  # == compares values; `is` checks identity
    print(DATASETS)
I apologize if the code seems crude to some... I'm in my first year of classes and this is my first bout of programming.
List 3 is my key for the right answers; I want my code in questions() to change the position of the correct answer so that it correlates to the key provided....
I've tried for loops, if statements and while loops but just can't get it to do what I envision. Any help is greatly appreciated.
tmp = "\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] + '\nA.)'
if DATASETS[3][words] == 'A':               # if the answer key is A
    tmp = tmp + DATASETS[1][words]          # append the first choice as correct choice
else:
    tmp = tmp + random.choice(DATASETS[1])  # if not, randomise the choice
Do similar if-else for 'B', 'C', and 'D'
Once your question is formulated, you can use it:
DATASETS[2][words] = input(tmp)
This is a bit long but I am not sure if any shorter way exists.
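Spelled out for all four letters, a complete sketch of the prompt-building step could look like the following. It reuses the DATASETS layout from the question; the helper name build_question is made up for illustration, and like the original suggestion it does not stop a random filler from duplicating the correct answer.
import random

def build_question(words):
    # Place the correct choice (DATASETS[1][words]) at the letter given by the
    # answer key (DATASETS[3][words]); fill the other slots with random choices.
    prompt = "\n" + str(words + 1) + ".)What is the proper procedure when %s" % DATASETS[0][words]
    for letter in ['A', 'B', 'C', 'D']:
        if DATASETS[3][words] == letter:
            choice = DATASETS[1][words]          # correct answer goes in this slot
        else:
            choice = random.choice(DATASETS[1])  # filler choice
        prompt += "\n" + letter + ".)" + choice
    return prompt + "\nChoose your answer carefully: "

# Inside questions(), for each index `words`:
# DATASETS[2][words] = input(build_question(words))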
I set up a periodic task using celery beat. The task runs and I can see the result in the console.
I want to have a Python script that collects the results produced by the tasks.
I could do it like this:
# client.py
from cfg_celery import app

task_id = '337fef7e-68a6-47b3-a16f-1015be50b0bc'

try:
    x = app.AsyncResult(task_id)  # use the task_id defined above
    print(x.get())
except:
    print('some error')
Anyway, as you can see, for this test I had to copy the task_id printed to the celery beat console (so to speak) and hardcode it in my script. Obviously this is not going to work in real production.
I hacked around it by setting the task_id in the celery config file:
# cfg_celery.py
from celery import Celery

app = Celery('celery_config',
             broker='redis://localhost:6379/0',
             include=['taskos'],
             backend='redis'
             )

app.conf.beat_schedule = {
    'something': {
        'task': 'tasks.add',
        'schedule': 10.0,
        'args': (16, 54),
        'options': {'task_id': "my_custom_id"},
    }
}
This way I can read it like this:
# client.py
from cfg_celery import app

task_id = 'my_custom_id'

try:
    x = app.AsyncResult(task_id)  # again, use the task_id defined above
    print(x.get())
except:
    print('some error')
The problem with this approach is that I lose the previous results (previous to the call of client.py).
Is there some way I can read a list of the task_id's in the celery backend?
If I have more than one periodic tasks, can I get a list of task_id's from each periodic task?
Can I use app.tasks.key() to accomplish this? If so, how?
P.S.: I'm not a native English speaker and I'm new to celery, so please be nice if I used some terminology incorrectly.
OK. I am not sure if nobody answered this because it is difficult or because my question is too dumb.
Anyway, what I wanted to do is to get the results of my 'celery-beat' tasks from another Python process.
Within the same process there was no problem: I could access the task id and everything was easy from there on. But from another process I didn't find a way to retrieve a list of the finished tasks.
I tried python-RQ (it is nice), but when I saw that I couldn't do that with RQ either, I came to understand that I had to make use of Redis storage capabilities manually. So I got what I wanted by doing this:
- Use 'bind=True' to be able to introspect from within the task function.
- Once I have the result of the function, I write it to a list in Redis (I used a little trick to limit the size of this list).
- Now I can, from an independent process, connect to the same Redis server and retrieve the results stored in that list.
My files ended up being like this:
cfg_celery.py : here I define the way the tasks are going to be called.
# cfg_celery.py
from celery import Celery

appo = Celery('celery_config',
              broker='redis://localhost:6379/0',
              include=['taskos'],
              backend='redis'
              )

'''
urlea is decorated as a periodic_task, so there is no need to register it here.
But since add needs args, I register it manually to pass them in.
'''
appo.conf.beat_schedule = {
    'q_loco': {
        'task': 'taskos.add',
        'schedule': 10.0,
        'args': (16, 54),
        # 'options': {'task_id': "lcura"},
    }
}
taskos.py : these are the tasks.
# taskos.py
from cfg_celery import appo
from celery.decorators import periodic_task
from redis import Redis
from datetime import timedelta
import requests, time

rds = Redis()

@appo.task(bind=True)
def add(self, a, b):
    # result of the operation. very dummy.
    result = a + b
    # storing in redis as a string (newer redis-py versions only accept bytes/str/numbers)
    r = (self.request.id, time.time(), result)
    rds.lpush('my_results', str(r))
    # for this test I want to have at most 5 results stored in redis
    long = rds.llen('my_results')
    while long > 5:
        x = rds.rpop('my_results')
        print('popping out', x)
        long = rds.llen('my_results')
    time.sleep(1)
    return a + b

@periodic_task(run_every=20)
def urlea(url='https://www.fullstackpython.com/'):
    inicio = time.time()
    R = dict()
    try:
        resp = requests.get(url)
        R['vato'] = url + " = " + str(resp.status_code * 10)
        R['num palabras'] = len(resp.text.split())
    except:
        R['vato'] = None
        R['num palabras'] = 0
    print('u {} : {}'.format(url, time.time() - inicio))
    time.sleep(0.8)  # trick so the difference shows up more clearly
    return R
consumer.py : the independent process that can get the results.
# consumer.py
from redis import Redis

nombre_lista = 'my_results'
rds = Redis()
tamaño = rds.llen(nombre_lista)
ultimos_resultados = list()
for i in range(tamaño):
    ultimos_resultados.append(rds.rpop(nombre_lista))
print(ultimos_resultados)
I am relatively new to programming and I hope that this answer can help noobs like me. If I got something wrong feel free to make the corrections as necessary.
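A related hedged note (not part of the approach above): with backend='redis', Celery itself stores each finished result under a key named celery-task-meta-<task_id>, so an independent process can also list task ids and results straight from the backend, assuming the default key prefix and database:
import json
from redis import Redis

rds = Redis()  # same Redis instance used as broker/backend in this setup
for key in rds.scan_iter("celery-task-meta-*"):
    meta = json.loads(rds.get(key))
    print(meta["task_id"], meta["status"], meta.get("result"))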
I set up a simple test to stream text files from S3 and got it to work when I tried something like:
val input = ssc.textFileStream("s3n://mybucket/2015/04/03/")
and with log files going into the bucket, everything would work fine.
But if there was a subfolder, it would not find any files that got put into the subfolder (and yes, I am aware that HDFS doesn't actually use a folder structure):
val input = ssc.textFileStream("s3n://mybucket/2015/04/")
So I tried simply using wildcards, like I have done before with a standard Spark application:
val input = ssc.textFileStream("s3n://mybucket/2015/04/*")
But when I try this, it throws an error:
java.io.FileNotFoundException: File s3n://mybucket/2015/04/* does not exist.
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1483)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:176)
at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:134)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
.....
I know for a fact that you can use wildcards when reading file input in a standard Spark application, but it appears that with streaming input it neither does that nor automatically processes files in subfolders. Is there something I'm missing here?
Ultimately, what I need is a streaming job running 24/7 that monitors an S3 bucket that has logs placed in it by date.
So something like
s3n://mybucket/<YEAR>/<MONTH>/<DAY>/<LogfileName>
Is there any way to hand it the top-most folder and have it automatically read files that show up in any subfolder (since obviously the date will change every day)?
EDIT
So upon digging into the documentation at http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources it states that nested directories are not supported.
Can anyone shed some light as to why this is the case?
Also, since my files will be nested based upon their date, what would be a good way of solving this problem in my streaming application? It's a little complicated since the logs take a few minutes to get written to S3 and so the last file being written for the day could be written in the previous day's folder even though we're a few minutes into the new day.
Some "ugly but working solution" can be created by extending FileInputDStream.
Writing sc.textFileStream(d) is equivalent to
new FileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
You can create a CustomFileInputDStream that extends FileInputDStream. The custom class will copy the compute method from FileInputDStream and adjust the findNewFiles method to your needs.
Changing the findNewFiles method from:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()
    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds  // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")
    val filter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
        "Consider increasing the batch size or reducing the number of " +
        "files in the monitored directory."
      )
    }
    newFiles
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
to:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()
    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds  // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")
    val filter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val directories = fs.listStatus(directoryPath).filter(_.isDirectory)
    val newFiles = ArrayBuffer[FileStatus]()
    directories.foreach(directory => newFiles.append(fs.listStatus(directory.getPath, filter) : _*))
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
        "Consider increasing the batch size or reducing the number of " +
        "files in the monitored directory."
      )
    }
    newFiles.map(_.getPath.toString).toArray
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
will check for files in all first-degree subfolders; you can adjust it to use the batch timestamp in order to access only the relevant "subdirectories".
I created the CustomFileInputDStream as I mentioned and activated it by calling:
new CustomFileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
It seems to behave as expected.
When I write a solution like this, I must add some points for consideration:
You are breaking Spark's encapsulation and creating a custom class that you will have to maintain on your own as time passes.
I believe a solution like this is a last resort. If your use case can be implemented in a different way, it is usually better to avoid it.
If you have a lot of "subdirectories" on S3 and check each one of them, it will cost you.
It would be very interesting to understand whether Databricks doesn't support nested files only because of a possible performance penalty, or whether there is a deeper reason I haven't thought about.
We had the same problem. We joined the subfolder names with commas:
List<String> paths = new ArrayList<>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
try {
    Date start = sdf.parse("2015/02/01");
    Date end = sdf.parse("2015/04/01");
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(start);
    while (calendar.getTime().before(end)) {
        paths.add("s3n://mybucket/" + sdf.format(calendar.getTime()));
        calendar.add(Calendar.DATE, 1);
    }
} catch (ParseException e) {
    e.printStackTrace();
}

String joinedPaths = StringUtils.join(",", paths.toArray(new String[paths.size()]));
val input = ssc.textFileStream(joinedPaths);
I hope this solves your problem.