I need to find out programmatically, in PySpark, what the current log level is.
I know I can set it with:
# spark is a SparkSession object
spark.sparkContext.setLogLevel(log_level)
But there is no equivalent method for retrieving the log level.
Any ideas? Thanks!
I finally came up with a solution, by accessing the Spark session's JVM (py4j underneath):
def get_log_level(spark):
    log_manager = spark._jvm.org.apache.log4j.LogManager
    trace = spark._jvm.org.apache.log4j.Level.TRACE
    debug = spark._jvm.org.apache.log4j.Level.DEBUG
    info = spark._jvm.org.apache.log4j.Level.INFO
    warn = spark._jvm.org.apache.log4j.Level.WARN
    error = spark._jvm.org.apache.log4j.Level.ERROR
    fatal = spark._jvm.org.apache.log4j.Level.FATAL
    logger = log_manager.getRootLogger()
    # Check from the most verbose level down; the first enabled one is the current level
    if logger.isEnabledFor(trace):
        return "TRACE"
    elif logger.isEnabledFor(debug):
        return "DEBUG"
    elif logger.isEnabledFor(info):
        return "INFO"
    elif logger.isEnabledFor(warn):
        return "WARN"
    elif logger.isEnabledFor(error):
        return "ERROR"
    elif logger.isEnabledFor(fatal):
        return "FATAL"
    else:
        return None
There is most probably a better way of doing it.
This returns the log level set in your Spark session:
log_manager = spark._jvm.org.apache.log4j.LogManager
level = log_manager.getRootLogger().getEffectiveLevel()
Spark is Open Source, right?
The source code will show you many things that are not in the documentation. And the unit tests will give you hints about things not covered in tutorials.
Demo: browse the Spark project on Github and search for setLogLevel.
OK, the Github internal search usually sucks, but on a single specific keyword it's worth trying. And indeed the very 1st answer gives you this interesting snippet, from a unit test (here reset to branch 2.4):
https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
val originalLevel = org.apache.log4j.Logger.getRootLogger().getLevel
try {
  // Avoid outputting a lot of expected warning logs
  spark.sparkContext.setLogLevel("error")
  ...
} finally {
  spark.sparkContext.setLogLevel(originalLevel.toString)
  ...
}
So the setLogLevel method appears to be a (very) thin wrapper around the Log4J API.
And that is exactly what it is:
https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/SparkContext.scala
def setLogLevel(logLevel: String) {
  ...
  Utils.setLogLevel(org.apache.log4j.Level.toLevel(upperCased))
}
https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/util/Utils.scala
def setLogLevel(l: org.apache.log4j.Level) {
  org.apache.log4j.Logger.getRootLogger().setLevel(l)
}
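The save-and-restore pattern from that unit test carries over to PySpark through the py4j gateway. Here is a minimal sketch, assuming spark is a SparkSession whose JVM exposes the Log4j 1.x API used in the answers above; temporary_log_level is a hypothetical helper name, not part of Spark.

```python
from contextlib import contextmanager


@contextmanager
def temporary_log_level(spark, level):
    """Set the Spark log level for the duration of a block, then restore it.

    Hypothetical helper: it reads the root logger through the same py4j path
    as the answers above (spark._jvm.org.apache.log4j).
    """
    root = spark._jvm.org.apache.log4j.LogManager.getRootLogger()
    # getEffectiveLevel() returns a Level object; toString() gives e.g. "WARN".
    original = root.getEffectiveLevel().toString()
    spark.sparkContext.setLogLevel(level)
    try:
        yield original
    finally:
        # Restore the previous level even if the block raised.
        spark.sparkContext.setLogLevel(original)
```

Inside a job this would be used as "with temporary_log_level(spark, "ERROR"): ...", with the previous level available as the value bound by "as".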
I have a json output (y) like this below.
{
"WebACL":{
"Name":"aBlockKnownBadInputs-WebAcl",
"Id":"4312a5d0-9878-4feb-a083-09d7a9cfcfbb",
"ARN":"arn:aws:wafv2:us-east-1:100467320728:regional/webacl/aBlockKnownBadInputs-WebAcl/4312a5d0-9878-4feb-a083-09d7a9cfcfbb",
"DefaultAction":{
"Allow":{
}
},
"Description":"",
"Rules":[
{
"Name":"AWS-AWSManagedRulesKnownBadInputsRuleSet",
"Priority":500,
"Statement":{
"ManagedRuleGroupStatement":{
"VendorName":"AWS",
"Name":"AWSManagedRulesKnownBadInputsRuleSet"
}
},
"OverrideAction":{
"None":{
}
},
"VisibilityConfig":{
"SampledRequestsEnabled":true,
"CloudWatchMetricsEnabled":true,
"MetricName":"AWS-AWSManagedRulesKnownBadInputsRuleSet"
}
}
]
}
}
I want to extract "AWS-AWSManagedRulesKnownBadInputsRuleSet" from the section:-
"Name":"AWS-AWSManagedRulesKnownBadInputsRuleSet",
"Priority":500,
"Statement":{
"ManagedRuleGroupStatement":{
"VendorName":"AWS",
"Name":"AWSManagedRulesKnownBadInputsRuleSet"
At the minute my code is returning a key error:
KeyError: 'Rules[].Statement[].ManagedRuleGroupStatement[].Name'
The format of this line is clearly wrong, but I don't know why.
ruleset = y['Rules[].Statement[].ManagedRuleGroupStatement[].Name']
My code block:
respons = client.get_web_acl(Name=(acl),Scope='REGIONAL',Id=(ids))
for y in response['WebACLs']:
    ruleset = y['Rules[].Statement[].ManagedRuleGroupStatement[].Name']
Does anyone know what I'm doing wrong here?
UPDATE :- In case anyone else comes up against this problem, I fixed this by doing it a slightly different way.
#Requesting the info from AWS via get_web_acl request
respons = client.get_web_acl(Name=(acl),Scope='REGIONAL',Id=(ids))
#Converting the dict output to a string to make it searchable
result = json.dumps(respons)
#Defining what I want to search for
fullstring = "AWS-AWSManagedRulesKnownBadInputsRuleSet"
#Searching the output & printing the result: if = true / else = false
if fullstring in result:
    print("Found WAF ruleset: AWS-AWSManagedRulesKnownBadInputsRuleSet!")
else:
    print("WAF ruleset not found!")
Also, as part of my research I found a cool thing called jello.
(https://github.com/kellyjonbrazil/jello).
jello is similar to jq in that it processes JSON and JSON Lines data except it uses standard python dict and list syntax.
So, I copied my json into a file called file.json
Ran cat file.json | jello -s to print a grep-able schema by using the -s option
Found the bit I was interested in - in my case the name of the ManagedRuleGroupStatement and ran the following:
cat file.json | jello -s _.WebACL.Rules[0].Statement.ManagedRuleGroupStatement.Name
This gave me exactly what I wanted!
It doesn't work inside a python script but was something new and interesting to learn!
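Inside a Python script, the same value can be reached with plain dict and list indexing, because the response is an ordinary nested dict. The sketch below assumes the response has the shape of the JSON shown in the question; managed_rule_group_names is a hypothetical helper name and the sample is a trimmed copy of that JSON.

```python
def managed_rule_group_names(response):
    """Return (rule name, managed rule group name) pairs from a get_web_acl response."""
    pairs = []
    # "Rules" is a list, so it has to be iterated (or indexed); a key such as
    # 'Rules[].Statement[]...' is looked up literally and raises KeyError.
    for rule in response["WebACL"]["Rules"]:
        managed = rule.get("Statement", {}).get("ManagedRuleGroupStatement")
        if managed is not None:
            pairs.append((rule["Name"], managed["Name"]))
    return pairs


# Trimmed copy of the JSON from the question.
sample = {
    "WebACL": {
        "Name": "aBlockKnownBadInputs-WebAcl",
        "Rules": [
            {
                "Name": "AWS-AWSManagedRulesKnownBadInputsRuleSet",
                "Priority": 500,
                "Statement": {
                    "ManagedRuleGroupStatement": {
                        "VendorName": "AWS",
                        "Name": "AWSManagedRulesKnownBadInputsRuleSet",
                    }
                },
            }
        ],
    }
}

print(managed_rule_group_names(sample))
```

Note that the top-level key in the JSON is WebACL (singular), and that the rule's own Name carries the "AWS-" prefix while the ManagedRuleGroupStatement Name does not.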
I want to know whether it is possible to run a Databricks job from a notebook using code, and how to do it.
I have a job with multiple tasks and many contributors, and we have a job created to execute it all. Now we want to run the job from a notebook to test new features without creating a new task in the job, and also to run the job multiple times in a loop, for example:
for i in [1,2,3]:
run job with parameter i
Regards
What you need to do is the following:
Install databricksapi: %pip install databricksapi==1.8.1
Create your job and have it return an output. You can do that by exiting the notebook like this:
import json
dbutils.notebook.exit(json.dumps({"result": f"{_result}"}))
If you want to pass a dataframe, you have to pass it as a JSON dump too; there is official Databricks documentation about that, so check it out.
Get the job id; you will need it later. You can get it from the job details in Databricks.
In the executor notebook you can use the following code.
import json
import time

# Jobs comes from the databricksapi package installed in the first step.

def run_ks_job_and_return_output(params):
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    # Read the API URL and token from the notebook context
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']
    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs_instance
    runs_job_id = jobs_instance.runJob(****************, 'notebook',
                                       params)  # **** is the job id
    run_is_not_completed = True
    while run_is_not_completed:
        # Poll the completed runs until this run shows up among them
        current_run = [run for run in jobs_instance.runsList('completed')['runs']
                       if run['run_id'] == runs_job_id['run_id']
                       and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
    current_run = current_run[0]
    print(f"Result state: {current_run['state']['result_state']}, You can check the resulted output in the following link: {current_run['run_page_url']}")
    note_output = jobs_instance.runsGetOutput(runs_job_id['run_id'])['notebook_output']
    return note_output

run_ks_job_and_return_output({'parm1': 'george',
                              'variable': "values1"})
If you want to run the job many times in parallel you can do the following (first make sure that you have increased the max concurrent runs in the job settings):
from multiprocessing.pool import ThreadPool
pool = ThreadPool(1000)
results = pool.map(lambda j: run_ks_job_and_return_output({'table': 'george',
                                                           'variable': "values1",
                                                           'j': j}),
                   [str(x) for x in range(2, len(snapshots_list))])
There is also the possibility of saving the whole HTML output, but maybe you are not interested in that. In any case I will cover that in another post on StackOverflow.
Hope it helps.
You can use the following steps:
Note-01:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")
result = dbutils.widgets.get("foo")+"-"+dbutils.widgets.get("foo2")
def display():
    print("Function Display: "+result)
dbutils.notebook.exit(result)
Note-02:
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    dbutils.notebook.run("Note-01 path", 60, {"foo": x,"foo2":'Azure'})
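The loop from the question ("for i in [1,2,3]: run job with parameter i") can be wrapped so that the exit values are collected as well. This is a minimal sketch: run_for_each is a hypothetical helper, and the runner is passed in as a parameter on the assumption that it behaves like dbutils.notebook.run(path, timeout_seconds, arguments), which also lets the loop be exercised outside Databricks.

```python
def run_for_each(run_notebook, path, values, timeout_seconds=60):
    """Run one notebook once per value and collect what each run returned.

    run_notebook is assumed to have the same calling convention as
    dbutils.notebook.run(path, timeout_seconds, arguments).
    """
    results = {}
    for value in values:
        # Widget values are read back as strings in Note-01, so convert up front.
        arguments = {"foo": str(value), "foo2": "Azure"}
        results[value] = run_notebook(path, timeout_seconds, arguments)
    return results
```

In a notebook the call would look like run_for_each(dbutils.notebook.run, "Note-01 path", [1, 2, 3]); each entry of the returned dict holds whatever Note-01 passed to dbutils.notebook.exit.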
I have this block of code that basically translates text from one language to another using the cloud translate API. The problem is that this code always throws the error: "Caller's project doesn't match parent project". What could be the problem?
translation_separator = "translated_text: "
language_separator = "detected_language_code: "
translate_client = translate.TranslationServiceClient()
# parent = translate_client.location_path(
#     self.translate_project_id, self.translate_location
# )
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = (
    os.getcwd()
    + "/translator_credentials.json"
)
# Text can also be a sequence of strings, in which case this method
# will return a sequence of results for each text.
try:
    result = str(
        translate_client.translate_text(
            request={
                "contents": [text],
                "target_language_code": self.target_language_code,
                "parent": f'projects/{self.translate_project_id}/'
                          f'locations/{self.translate_location}',
                "model": self.translate_model
            }
        )
    )
    print(result)
except Exception as e:
    print("error here>>>>>", e)
Your issue seems to be related to the authentication method that you are using in your application; please follow the guide on authentication methods for the Translate API. If you are trying to pass the credentials in code, you can explicitly point to your service account file with:
def explicit():
    from google.cloud import storage
    # Explicitly use service account credentials by specifying the private key
    # file.
    storage_client = storage.Client.from_service_account_json(
        'service_account.json')
Also, there is a codelab for getting started with the Translation API in Python; it is a good step-by-step guide to running the Translate API from Python.
If the issue persists, you can try filing an issue on the Public Issue Tracker for Google Support.
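One thing worth ruling out (an assumption about the cause, not a confirmed diagnosis) is that the service account key belongs to a different project than the one named in the request's parent. The sketch below only reads the key file with the standard library and compares its project_id field; credentials_project_matches is a hypothetical helper name.

```python
import json


def credentials_project_matches(credentials_path, parent_project_id):
    """Compare the project_id in a service account key file with the project used in `parent`."""
    with open(credentials_path, encoding="utf-8") as handle:
        key_file = json.load(handle)
    # Service account key files carry the owning project under "project_id".
    return key_file.get("project_id") == parent_project_id
```

It may also matter that the question's code sets GOOGLE_APPLICATION_CREDENTIALS after the client has been constructed, so the client may have picked up different default credentials; setting the variable before creating the client is worth trying.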
I am trying to understand how cats effect Cancelable works. I have the following minimal app, based on the documentation
import java.util.concurrent.{Executors, ScheduledExecutorService}
import cats.effect._
import cats.implicits._
import scala.concurrent.duration._
object Main extends IOApp {
  def delayedTick(d: FiniteDuration)
                 (implicit sc: ScheduledExecutorService): IO[Unit] = {
    IO.cancelable { cb =>
      val r = new Runnable {
        def run() =
          cb(Right(()))
      }
      val f = sc.schedule(r, d.length, d.unit)
      // Returning the cancellation token needed to cancel
      // the scheduling and release resources early
      val mayInterruptIfRunning = false
      IO(f.cancel(mayInterruptIfRunning)).void
    }
  }
  override def run(args: List[String]): IO[ExitCode] = {
    val scheduledExecutorService =
      Executors.newSingleThreadScheduledExecutor()
    for {
      x <- delayedTick(1.second)(scheduledExecutorService)
      _ <- IO(println(s"$x"))
    } yield ExitCode.Success
  }
}
When I run this:
❯ sbt run
[info] Loading global plugins from /Users/ethan/.sbt/1.0/plugins
[info] Loading settings for project stackoverflow-build from plugins.sbt ...
[info] Loading project definition from /Users/ethan/IdeaProjects/stackoverflow/project
[info] Loading settings for project stackoverflow from build.sbt ...
[info] Set current project to cats-effect-tutorial (in build file:/Users/ethan/IdeaProjects/stackoverflow/)
[info] Compiling 1 Scala source to /Users/ethan/IdeaProjects/stackoverflow/target/scala-2.12/classes ...
[info] running (fork) Main
[info] ()
The program just hangs at this point. I have many questions:
Why does the program hang instead of terminating after 1 second?
Why do we set mayInterruptIfRunning = false? Isn't the whole point of cancellation to interrupt a running task?
Is this the recommended way to define the ScheduledExecutorService? I did not see examples in the docs.
This program waits 1 second, and then returns () (then unexpectedly hangs). What if I wanted to return something else? For example, let's say I wanted to return a string, the result of some long-running computation. How would I extract that value from IO.cancelable? The difficulty, it seems, is that IO.cancelable returns the cancelation operation, not the return value of the process to be cancelled.
Pardon the long post but this is my build.sbt:
name := "cats-effect-tutorial"
version := "1.0"
fork := true
scalaVersion := "2.12.8"
libraryDependencies += "org.typelevel" %% "cats-effect" % "1.3.0" withSources() withJavadoc()
scalacOptions ++= Seq(
"-feature",
"-deprecation",
"-unchecked",
"-language:postfixOps",
"-language:higherKinds",
"-Ypartial-unification")
You need to shut down the ScheduledExecutorService. Try this:
Resource.make(IO(Executors.newSingleThreadScheduledExecutor))(se => IO(se.shutdown())).use {
  se =>
    for {
      x <- delayedTick(5.second)(se)
      _ <- IO(println(s"$x"))
    } yield ExitCode.Success
}
I was able to find an answer to these questions although there are still some things that I don't understand.
Why does the program hang instead of terminating after 1 second?
For some reason, Executors.newSingleThreadScheduledExecutor() causes things to hang. To fix the problem, I had to use Executors.newSingleThreadScheduledExecutor(new Thread(_)). It appears that the only difference is that the first version is equivalent to Executors.newSingleThreadScheduledExecutor(Executors.defaultThreadFactory()), although nothing in the docs makes it clear why this is the case.
Why do we set mayInterruptIfRunning = false? Isn't the whole point of cancellation to interrupt a running task?
I have to admit that I do not understand this entirely. Again, the docs were not especially clarifying on this point. Switching the flag to true does not seem to change the behavior at all, at least in the case of Ctrl-c interrupts.
Is this the recommended way to define the ScheduledExecutorService? I did not see examples in the docs.
Clearly not. The way that I came up with was loosely inspired by this snippet from the cats effect source code.
This program waits 1 second, and then returns () (then unexpectedly hangs). What if I wanted to return something else? For example, let's say I wanted to return a string, the result of some long-running computation. How would I extract that value from IO.cancelable? The difficulty, it seems, is that IO.cancelable returns the cancelation operation, not the return value of the process to be cancelled.
The IO.cancelable { ... } block returns IO[A], and the callback cb has type Either[Throwable, A] => Unit. This suggests that whatever is fed into cb is what the IO.cancelable expression will return (wrapped in IO). So to return the string "hello" instead of (), we rewrite delayedTick:
def delayedTick(d: FiniteDuration)
               (implicit sc: ScheduledExecutorService): IO[String] = { // Note IO[String] instead of IO[Unit]
  IO.cancelable[String] { cb => // Note the String type parameter
    val r = new Runnable {
      def run() =
        cb(Right("hello")) // Note "hello" instead of ()
    }
    val f: ScheduledFuture[_] = sc.schedule(r, d.length, d.unit)
    IO(f.cancel(true)).void // the cancellation token must be an IO[Unit]
  }
}
You need to explicitly terminate the executor at the end. It is not managed by the Scala or Cats runtime, so it will not exit by itself; that is why your app hangs instead of exiting immediately.
With mayInterruptIfRunning = false, a task that has already started is allowed to finish. You can set it to true to interrupt the thread running the task, but that is not recommended.
There are many ways to create a ScheduledExecutorService; which one to use depends on your needs. For this case it doesn't matter, apart from the shutdown issue in point 1.
You can return anything from the cancelable IO by calling cb(Right("put your stuff here")). The only thing that stops you from retrieving the returned A is the cancellation taking effect: you won't get anything if you stop it before it gets to that point. Try returning IO(f.cancel(mayInterruptIfRunning)).delayBy(FiniteDuration(2, TimeUnit.SECONDS)).void and you will get what you expected. Because 2 seconds > 1 second, your code gets enough time to run before it is cancelled.
I setup a simple test to stream text files from S3 and got it to work when I tried something like
val input = ssc.textFileStream("s3n://mybucket/2015/04/03/")
and in the bucket I would have log files go in there and everything would work fine.
But if there was a subfolder, it would not find any files that got put into the subfolder (and yes, I am aware that hdfs doesn't actually use a folder structure)
val input = ssc.textFileStream("s3n://mybucket/2015/04/")
So, I tried to simply do wildcards like I have done before with a standard spark application
val input = ssc.textFileStream("s3n://mybucket/2015/04/*")
But when I try this it throws an error
java.io.FileNotFoundException: File s3n://mybucket/2015/04/* does not exist.
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1483)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:176)
at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:134)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
.....
I know for a fact that you can use wildcards when reading fileInput for a standard spark applications but it appears that when doing streaming input, it doesn't do that nor does it automatically process files in subfolders. Is there something I'm missing here??
Ultimately what I need is a streaming job to be running 24/7 that will be monitoring an S3 bucket that has logs placed in it by date
So something like
s3n://mybucket/<YEAR>/<MONTH>/<DAY>/<LogfileName>
Is there any way to hand it the top most folder and it automatically read files that show up in any folder (cause obviously the date will increase every day)?
EDIT
So upon digging into the documentation at http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources it states that nested directories are not supported.
Can anyone shed some light as to why this is the case?
Also, since my files will be nested based upon their date, what would be a good way of solving this problem in my streaming application? It's a little complicated since the logs take a few minutes to get written to S3 and so the last file being written for the day could be written in the previous day's folder even though we're a few minutes into the new day.
An "ugly but working" solution can be created by extending FileInputDStream.
Writing sc.textFileStream(d) is equivalent to
new FileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
You can create a CustomFileInputDStream that extends FileInputDStream. The custom class copies the compute method from the FileInputDStream class and adjusts the findNewFiles method to your needs.
Changing the findNewFiles method from:
private def findNewFiles(currentTime: Long): Array[String] = {
try {
lastNewFileFindingTime = clock.getTimeMillis()
// Calculate ignore threshold
val modTimeIgnoreThreshold = math.max(
initialModTimeIgnoreThreshold, // initial threshold based on newFilesOnly setting
currentTime - durationToRemember.milliseconds // trailing end of the remember window
)
logDebug(s"Getting new files for time $currentTime, " +
s"ignoring files older than $modTimeIgnoreThreshold")
val filter = new PathFilter {
def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
}
val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)
val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
logInfo("Finding new files took " + timeTaken + " ms")
logDebug("# cached file times = " + fileToModTime.size)
if (timeTaken > slideDuration.milliseconds) {
logWarning(
"Time taken to find new files exceeds the batch size. " +
"Consider increasing the batch size or reducing the number of " +
"files in the monitored directory."
)
}
newFiles
} catch {
case e: Exception =>
logWarning("Error finding new files", e)
reset()
Array.empty
}
}
to:
private def findNewFiles(currentTime: Long): Array[String] = {
try {
lastNewFileFindingTime = clock.getTimeMillis()
// Calculate ignore threshold
val modTimeIgnoreThreshold = math.max(
initialModTimeIgnoreThreshold, // initial threshold based on newFilesOnly setting
currentTime - durationToRemember.milliseconds // trailing end of the remember window
)
logDebug(s"Getting new files for time $currentTime, " +
s"ignoring files older than $modTimeIgnoreThreshold")
val filter = new PathFilter {
def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
}
val directories = fs.listStatus(directoryPath).filter(_.isDirectory)
val newFiles = ArrayBuffer[FileStatus]()
directories.foreach(directory => newFiles.append(fs.listStatus(directory.getPath, filter) : _*))
val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
logInfo("Finding new files took " + timeTaken + " ms")
logDebug("# cached file times = " + fileToModTime.size)
if (timeTaken > slideDuration.milliseconds) {
logWarning(
"Time taken to find new files exceeds the batch size. " +
"Consider increasing the batch size or reducing the number of " +
"files in the monitored directory."
)
}
newFiles.map(_.getPath.toString).toArray
} catch {
case e: Exception =>
logWarning("Error finding new files", e)
reset()
Array.empty
}
}
This will check for files in all first-level subfolders; you can adjust it to use the batch timestamp in order to access the relevant "subdirectories".
I created the CustomFileInputDStream as I mentioned and activated it by calling:
new CustomFileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
It seems to behave as expected.
When I write a solution like this I must add some points for consideration:
You are breaking Spark encapsulation and creating a custom class that you alone will have to support as time passes.
I believe that a solution like this is a last resort. If your use case can be implemented in a different way, it is usually better to avoid a solution like this.
If you have a lot of "subdirectories" on S3 and check each one of them, it will cost you.
It would be very interesting to understand whether Databricks doesn't support nested files just because of the possible performance penalty, or whether there is a deeper reason I haven't thought about.
We had the same problem. We joined the subfolder names with commas.
List<String> paths = new ArrayList<>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
try {
    Date start = sdf.parse("2015/02/01");
    Date end = sdf.parse("2015/04/01");
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(start);
    // One path per day, from start (inclusive) to end (exclusive)
    while (calendar.getTime().before(end)) {
        paths.add("s3n://mybucket/" + sdf.format(calendar.getTime()));
        calendar.add(Calendar.DATE, 1);
    }
} catch (ParseException e) {
    e.printStackTrace();
}
String joinedPaths = StringUtils.join(",", paths.toArray(new String[paths.size()]));
val input = ssc.textFileStream(joinedPaths);
I hope that in this way your problem is solved.
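For a PySpark job, the same comma-joined list of daily paths can be built with the standard library. This is a minimal sketch that mirrors the Java loop above (start inclusive, end exclusive); daily_paths is a hypothetical helper name, and whether textFileStream accepts the joined string is taken from the answer above rather than verified here.

```python
from datetime import date, timedelta


def daily_paths(bucket_prefix, start, end):
    """Build one path per day in [start, end) and join them with commas."""
    paths = []
    current = start
    while current < end:
        # Same layout as the question: <prefix>/<YEAR>/<MONTH>/<DAY>
        paths.append(f"{bucket_prefix}/{current:%Y/%m/%d}")
        current += timedelta(days=1)
    return ",".join(paths)


joined_paths = daily_paths("s3n://mybucket", date(2015, 2, 1), date(2015, 4, 1))
```

The resulting joined_paths string plays the same role as joinedPaths in the Java snippet.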