In my deployment.yaml file I have defined a static cluster as such:
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "11.2.x-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "Standard_DS3_v2"
I use this for all of my tasks. In one of the tasks, I save a DataFrame using:
transactions.createOrReplaceGlobalTempView("transactions")
And in another task (which depends on the previous task), I try to read the temporary view as such:
global_temp_db = session.conf.get("spark.sql.globalTempDatabase")
# Load wallet features
transactions = session.sql(f"""SELECT *
                               FROM """ + global_temp_db + """.transactions""")
But I get the error:
AnalysisException: Table or view not found: global_temp.transactions; line 2 pos 43;
'Project [*]
+- 'UnresolvedRelation [global_temp, transactions], [], false
Both tasks run within the same SparkSession, so why can it not find my global temp view?
Unfortunately this won't work unless you're using the cluster-reuse feature: otherwise each task gets a new cluster (and therefore a new SparkSession), so you won't be able to cross-reference this view between tasks.
A more Pythonic approach would be to add the code that initializes the view to every task, e.g. if you're using the pre-defined Task class:
class TaskWithPreInitializedView(Task):
    def _add_transactions_view(self):
        transactions = ...  # some code to define the view
        transactions.createOrReplaceGlobalTempView(...)

    def launch(self):
        self._add_transactions_view()


class RealTask(TaskWithPreInitializedView):
    def launch(self):
        super().launch()
        ...  # your code
Since creating a view is a very cheap operation that doesn't take much time, this is quite an efficient approach.
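For illustration, a minimal sketch of how RealTask might then read the view inside its own task (this assumes, as in dbx-style projects, that the Task base class exposes the SparkSession as self.spark):

# Sketch only: assumes the Task base class provides `self.spark` (a SparkSession).
class RealTask(TaskWithPreInitializedView):
    def launch(self):
        super().launch()  # re-creates global_temp.transactions in this task's session
        global_temp_db = self.spark.conf.get("spark.sql.globalTempDatabase")
        transactions = self.spark.sql(f"SELECT * FROM {global_temp_db}.transactions")
        transactions.show()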
I would like to use the Ansible 2.9.9 Python API to fetch a config file from the servers in my hosts file and parse it to JSON format.
I don't know how to call an existing Ansible task using the Python API.
Going through the Ansible API documentation, how do I integrate my Ansible task with the sample code?
Sample.py
#!/usr/bin/env python
import json
import shutil
from ansible.module_utils.common.collections import ImmutableDict
from ansible.parsing.dataloader import DataLoader
from ansible.vars.manager import VariableManager
from ansible.inventory.manager import InventoryManager
from ansible.playbook.play import Play
from ansible.executor.task_queue_manager import TaskQueueManager
from ansible.plugins.callback import CallbackBase
from ansible import context
import ansible.constants as C
class ResultCallback(CallbackBase):
    """A sample callback plugin used for performing an action as results come in

    If you want to collect all results into a single object for processing at
    the end of the execution, look into utilizing the ``json`` callback plugin
    or writing your own custom callback plugin
    """

    def v2_runner_on_ok(self, result, **kwargs):
        """Print a json representation of the result

        This method could store the result in an instance attribute for retrieval later
        """
        host = result._host
        print(json.dumps({host.name: result._result}, indent=4))


# since the API is constructed for CLI it expects certain options to always be set in the context object
context.CLIARGS = ImmutableDict(connection='local', module_path=['/to/mymodules'], forks=10, become=None,
                                become_method=None, become_user=None, check=False, diff=False)
# initialize needed objects
loader = DataLoader() # Takes care of finding and reading yaml, json and ini files
passwords = dict(vault_pass='secret')
# Instantiate our ResultCallback for handling results as they come in. Ansible expects this to be one of its main display outlets
results_callback = ResultCallback()
# create inventory, use path to host config file as source or hosts in a comma separated string
inventory = InventoryManager(loader=loader, sources='localhost,')
# variable manager takes care of merging all the different sources to give you a unified view of variables available in each context
variable_manager = VariableManager(loader=loader, inventory=inventory)
# create data structure that represents our play, including tasks, this is basically what our YAML loader does internally.
play_source = dict(
    name = "Ansible Play",
    hosts = 'localhost',
    gather_facts = 'no',
    tasks = [
        dict(action=dict(module='shell', args='ls'), register='shell_out'),
        dict(action=dict(module='debug', args=dict(msg='{{shell_out.stdout}}')))
    ]
)
# Create play object, playbook objects use .load instead of init or new methods,
# this will also automatically create the task objects from the info provided in play_source
play = Play().load(play_source, variable_manager=variable_manager, loader=loader)
# Run it - instantiate task queue manager, which takes care of forking and setting up all objects to iterate over host list and tasks
tqm = None
try:
    tqm = TaskQueueManager(
        inventory=inventory,
        variable_manager=variable_manager,
        loader=loader,
        passwords=passwords,
        stdout_callback=results_callback,  # Use our custom callback instead of the ``default`` callback plugin, which prints to stdout
    )
    result = tqm.run(play)  # most interesting data for a play is actually sent to the callback's methods
finally:
    # we always need to cleanup child procs and the structures we use to communicate with them
    if tqm is not None:
        tqm.cleanup()

    # Remove ansible tmpdir
    shutil.rmtree(C.DEFAULT_LOCAL_TMP, True)
sum.yml : playbook that generates a summary file for each host
- hosts: staging
  tasks:
    - name: pt_mysql_sum
      shell: PTDEST=/tmp/collected;mkdir -p $PTDEST;cd /tmp;wget percona.com/get/pt-mysql-summary;chmod +x pt*;./pt-mysql-summary -- --user=adm --password=***** > $PTDEST/pt-mysql-summary.txt;cat $PTDEST/pt-mysql-summary.out;
      register: result
      environment:
        http_proxy: http://proxy.example.com:8080
        https_proxy: https://proxy.example.com:8080

    - name: ansible_result
      debug: var=result.stdout_lines

    - name: fetch_log
      fetch:
        src: /tmp/collected/pt-mysql-summary.txt
        dest: /tmp/collected/pt-mysql-summary-{{ inventory_hostname }}.txt
        flat: yes
hosts file
[staging]
vm1 ansible_ssh_host=10.40.50.41 ansible_ssh_user=testuser ansible_ssh_pass=*****
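For reference, here is a hedged, untested sketch of one way to wire these pieces together and run an existing playbook such as sum.yml from the Python API in Ansible 2.9, reusing the ResultCallback class from Sample.py above; the file paths are placeholders:

#!/usr/bin/env python
# Hedged sketch (not tested): run an existing playbook (e.g. sum.yml) against a
# hosts file, reusing the ResultCallback defined in Sample.py above.
from ansible import context
from ansible.module_utils.common.collections import ImmutableDict
from ansible.parsing.dataloader import DataLoader
from ansible.inventory.manager import InventoryManager
from ansible.vars.manager import VariableManager
from ansible.executor.playbook_executor import PlaybookExecutor

# CLI-style options that PlaybookExecutor expects to find in the context object
context.CLIARGS = ImmutableDict(connection='smart', module_path=[], forks=10,
                                become=None, become_method=None, become_user=None,
                                check=False, diff=False, verbosity=0,
                                syntax=False, start_at_task=None)

loader = DataLoader()
inventory = InventoryManager(loader=loader, sources=['hosts'])          # path to your hosts file
variable_manager = VariableManager(loader=loader, inventory=inventory)

executor = PlaybookExecutor(playbooks=['sum.yml'],                      # path to your playbook
                            inventory=inventory,
                            variable_manager=variable_manager,
                            loader=loader,
                            passwords=dict(vault_pass='secret'))

# Swap in the custom callback so task results are printed as JSON
# (relies on a private attribute, but is a common pattern with this API).
executor._tqm._stdout_callback = ResultCallback()

executor.run()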
I went through the official Google Cloud docs, but I can't figure out how to use this to list the resources of a specific organization by providing the organization ID:
organizations = CloudResourceManager.Organizations.Search()
projects = emptyList()
parentsToList = queueOf(organizations)
while (parent = parentsToList.pop()) {
// NOTE: Don't forget to iterate over paginated results.
// TODO: handle PERMISSION_DENIED appropriately.
projects.addAll(CloudResourceManager.Projects.List(
"parent.type:" + parent.type + " parent.id:" + parent.id))
parentsToList.addAll(CloudResourceManager.Folders.List(parent))
}
You can use Cloud Asset Inventory for this. I wrote this code to perform a sink (export) into BigQuery.
import os
from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2
def asset_to_bq(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))

    output_config = asset_service_pb2.OutputConfig()
    output_config.bigquery_destination.dataset = 'projects/{}/datasets/{}'.format(os.getenv('PROJECT_ID'),
                                                                                  os.getenv('DATASET'))
    output_config.bigquery_destination.table = 'asset_export'
    output_config.bigquery_destination.force = True

    response = client.export_assets(parent, output_config)
    # For waiting for the export to finish:
    # response.result()
    # Do stuff after the export
    return "done", 200


if __name__ == "__main__":
    asset_to_bq('')
Be careful if you use it: the sink must be done into an empty or non-existing table, or you must set force to true.
In my case, a few minutes after the Cloud Scheduler job that triggers my function and extracts the data to BigQuery, I have a Scheduled Query in BigQuery that copies the data to another table, to keep the history.
Note: It's also possible to configure an extract in Cloud Storage if you prefer.
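As a sketch of that Cloud Storage option, only the output_config changes (this reuses the client and parent from the function above; the bucket name is a placeholder):

# Hedged sketch: same export as above, but written to a Cloud Storage object.
# The bucket name is a placeholder.
output_config = asset_service_pb2.OutputConfig()
output_config.gcs_destination.uri = 'gs://my-asset-export-bucket/asset_export.json'

response = client.export_assets(parent, output_config)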
I hope this is a starting point for you to achieve what you want to do.
I am able to list the projects, but I also want to list the folders and the resources under each folder (folder.name, tags), and I also want to specify the organization ID so that I only get resource information from a specific organization.
import os
from google.cloud import resource_manager
def export_resource(organizations):
    client = resource_manager.Client()
    for project in client.list_projects():
        print("%s, %s" % (project.project_id, project.status))
Hazelcast Jet prints the DAG definition on the console once the job is started.
It converts the Pipeline definition into a DAG.
Here is a Pipeline definition:
private Pipeline buildPipeline() {
    Pipeline p = Pipeline.create();
    p.drawFrom(Sources.<String, Record>remoteMapJournal("record", getClientConfig(), START_FROM_OLDEST))
     .addTimestamps((v) -> getTimeStamp(v), 3000)
     .peek()
     .groupingKey((v) -> Tuple2.tuple2(getUserID(v), getTranType(v)))
     .window(WindowDefinition.sliding(SLIDING_WINDOW_LENGTH_MILLIS, SLIDE_STEP_MILLIS))
     .aggregate(counting())
     .map((v) -> getMapKey(v))
     .drainTo(Sinks.remoteMap("Test", getClientConfig()));
    return p;
}
And here is the DAG definition printed on the console:
.vertex("remoteMapJournalSource(record)").localParallelism(1)
.vertex("sliding-window-step1").localParallelism(4)
.vertex("sliding-window-step2").localParallelism(4)
.vertex("map").localParallelism(4)
.vertex("remoteMapSink(Test)").localParallelism(1)
.edge(between("remoteMapJournalSource(record)", "sliding-window-step1").partitioned(?))
.edge(between("sliding-window-step1", "sliding-window-step2").partitioned(?).distributed())
.edge(between("sliding-window-step2", "map"))
.edge(between("map", "remoteMapSink(Test)"))
Is there any way to get the DAG definition with all the details like sliding window details, aggregation APIs etc ?
No, it's technically impossible. If you write a lambda (for example for a key extractor), there's no way to display the code that defined the lambda. The only way for you to get more information is to embed that information into the vertex name.
In Jet 0.7, this printout will be changed to the graphviz format so that you can copy-paste it to a tool and see the DAG as an image.
We have evolved our Origen usage such that we have a params file and a flow file for each test module (scan, mbist, etc.). We are now at the point where we need to take into account the test insertion when handling the DUT model and the test flow generation. I can see here that using a job flag is the preferred method for specifying test insertion specifics into the flow file. And this video shows how to specify a test insertion when simulating the test flow. My question is how can a test insertion be specified when not generating a flow, only loading params files into the DUT model? Take this parameter set that defines some test conditions for a scan/ATPG test module.
scan.define_params :test_flows do |p|
  p.flows.ws1.chain = [:vmin, :vmax]
  p.flows.ft1.chain = [:vmin, :vmax]
  p.flows.ws1.logic = [:vmin, :vmax]
  p.flows.ft1.logic = [:vmin]
  p.flows.ws1.delay = [:pmax]
  p.flows.ft1.delay = [:pmin]
end
You can see in the parameter set hierarchy that there are two test insertions defined: 'ws1' and 'ft1'. Am I right to assume that the --job option only sets a flag somewhere when used with the origen testers:run command? Or can this option be applied to origen i, such that just loading some parameter sets will have access to the job selected?
thx
There's no built-in way to do what you want here, but given that you are using parameters in this example, the way I would do it is to align your parameter contexts with the job names:
scan.define_params :ws1 do |p|
  p.flows.chain = [:vmin, :vmax]
  p.flows.logic = [:vmin, :vmax]
  p.flows.delay = [:pmax]
end

scan.define_params :ft1 do |p|
  p.flows.chain = [:vmin, :vmax]
  p.flows.logic = [:vmin]
  p.flows.delay = [:pmin]
end
There are various ways to actually set the current context; one way would be to have a target setup per job:
# target/ws1.rb
MyDUT.new
dut.params = :ws1
# target/ft1.rb
MyDUT.new
dut.params = :ft1
This assumes that the scan object is configured to track the parameter context of the top-level DUT - http://origen-sdk.org/origen//guides/models/parameters/#Tracking_the_Context_of_Another_Object
I would like to use the threads library (or perhaps parallel) for loading/preprocessing data into a queue, but I am not entirely sure how it works. In summary:
1. Load data (tensors), pre-process the tensors (this takes time, hence why I am here) and put them in a queue. I would like to have as many threads as possible doing this so that the model is not waiting, or not waiting for long.
2. For the tensor at the top of the queue, extract it, forward it through the model, and remove it from the queue.
I don't really understand the example in https://github.com/torch/threads enough. A hint or example as to where I would load data into the queue and train would be great.
EDIT 14/03/2016
In this example "https://github.com/torch/threads/blob/master/test/test-low-level.lua" using a low level thread, does anyone know how I can extract data from these threads into the main thread?
Look at this multi-threaded data provider:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua
It runs this file in the thread:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua#L18
by calling it here:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua#L30-L43
And afterwards, if you want to queue a job into the thread, you provide two functions:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua#L84
The first one runs inside the thread, and the second one runs in the main thread after the first one completes.
Hopefully that makes it a bit more clear.
If Soumith's examples in the previous answer are not very easy to use, I suggest you build your own pipeline from scratch. I provide here an example of two synchronized threads: one for writing data and one for reading data:
local t = require 'threads'
t.Threads.serialization('threads.sharedserialize')
local tds = require 'tds'
local dict = tds.Hash() -- only local variables work here, and only tables or tds.Hash()
dict[1] = torch.zeros(4)
local m1 = t.Mutex()
local m2 = t.Mutex()
local m1id = m1:id()
local m2id = m2:id()
m1:lock()
local pool = t.Threads(
  1,
  function(threadIdx)
  end
)

pool:addjob(
  function()
    local t = require 'threads'
    local m1 = t.Mutex(m1id)
    local m2 = t.Mutex(m2id)
    while true do
      m2:lock()
      dict[1] = torch.randn(4)
      m1:unlock()
      print('W ===> ')
      print(dict[1])
      collectgarbage()
      collectgarbage()
    end
    return __threadid
  end,
  function(id)
  end
)

-- Code executing on master:
local a = 1
while true do
  m1:lock()
  a = dict[1]
  m2:unlock()
  print('R --> ')
  print(a)
end