NameError: name 'countryCodeMap' is not defined - apache-spark

I am trying to implement a Spark program on a Databricks cluster, following this documentation: https://databricks.com/blog/2018/07/09/analyze-games-from-european-soccer-leagues-with-apache-spark-and-databricks.html
Now, after this line of code:
def mapKeyToVal(mapping):
    def mapKeyToVal_(col):
        return mapping.get(col)
    return udf(mapKeyToVal_, StringType())
I am using this:
gameInfDf = gameInfDf.withColumn("country_code", mapKeyToVal(countryCodeMap)("country"))
And I am getting the error: NameError: name 'countryCodeMap' is not defined
It would be great if anyone could help me with this.

https://databricks.com/blog/2018/07/09/analyze-games-from-european-soccer-leagues-with-apache-spark-and-databricks.html is the official Databricks guide. You need to click the link in that post and IMPORT the .dbc archive. You will then see the various setup cells, e.g. the mapping dictionaries that are needed. Good stuff.
Here are some of those maps:
situationMap = {1:'Open play', 2:'Set piece', 3:'Corner', 4:'Free kick', 99:'NA'}
countryCodeMap = {'germany':'DEU', 'france':'FRA', 'england':'GBR', 'spain':'ESP', 'italy':'ITA'}
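The NameError just means that the dictionary has not been defined in your notebook before you reference it. A minimal sketch of the fix, assuming gameInfDf has already been loaded with a country column as in the question:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define the lookup dictionary before it is used in the UDF call.
countryCodeMap = {'germany': 'DEU', 'france': 'FRA', 'england': 'GBR', 'spain': 'ESP', 'italy': 'ITA'}

def mapKeyToVal(mapping):
    def mapKeyToVal_(col):
        return mapping.get(col)
    return udf(mapKeyToVal_, StringType())

# Now the name resolves, because countryCodeMap exists in the notebook's namespace.
gameInfDf = gameInfDf.withColumn("country_code", mapKeyToVal(countryCodeMap)("country"))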

Related

How to pass RunProperties while calling the glue workflow using boto3 and python in lambda function?

My Python code in the Lambda function:
import json
import boto3
from botocore.exceptions import ClientError
glue_client = boto3.client('glue')
default_run_properties = {'s3_path': 's3://bucketname/abc.zip'}
response = glue_client.start_workflow_run(Name="Testing", RunProperties=default_run_properties)
print(response)
I am getting an error like this:
"errorMessage": "Parameter validation failed:\nUnknown parameter in input: \"RunProperties\", must be one of: Name",
"errorType": "ParamValidationError",
I also tried this:
session = boto3.session.Session()
glue_client = session.client('glue')
But got the same error.
Can anyone tell me how to pass the RunProperties when starting the Glue workflow run? The RunProperties are dynamic and need to come from the Lambda event.
I had the same issue and this is a bit tricky. I do not like my solution, so maybe someone else has a better idea? See here: https://github.com/boto/boto3/issues/2580
And also here: https://docs.aws.amazon.com/glue/latest/webapi/API_StartWorkflowRun.html
So, you cannot pass the parameters when starting the workflow, which is a shame in my opinion, because even the CLI suggests that you can: https://docs.aws.amazon.com/cli/latest/reference/glue/start-workflow-run.html
However, you can update the default parameters before you start the workflow. These values are then set for everyone. If you expect any "concurrency" issues then this is not a good way to go. You need to decide whether to reset the values afterwards or just leave them for the next start of the workflow.
I start my workflows like this:
glue_client.update_workflow(
    Name=SHOPS_WORKFLOW_NAME,
    DefaultRunProperties={
        's3_key': file_key,
        'market_id': segments[0],
    },
)
workflow_run_id = glue_client.start_workflow_run(
    Name=SHOPS_WORKFLOW_NAME
)
These values then show up as the run properties of the next workflow run.
I had the same problem and asked in AWS re:Post. The problem is the old boto3 version used in Lambda. They recommended two ways to work around this issue:
Update the workflow run properties immediately after start_workflow_run:
default_run_properties = {'s3_path': 's3://bucketname/abc.zip'}
response = glue_client.start_workflow_run(Name="Testing")
updateRun = glue_client.put_workflow_run_properties(
    Name="Testing",
    RunId=response['RunId'],
    RunProperties=default_run_properties
)
Or you can create a Lambda layer for your Lambda function and include a newer boto3 version there.
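Putting the first workaround together with the dynamic values from the Lambda event, a minimal handler sketch could look like this (the event key 's3_path' and the workflow name 'Testing' are assumptions taken from the question):

import boto3

glue_client = boto3.client('glue')

def lambda_handler(event, context):
    # Hypothetical event shape: the S3 path arrives in the Lambda event payload.
    run_properties = {'s3_path': event['s3_path']}

    # Start the workflow run first, then attach the dynamic properties to that run.
    response = glue_client.start_workflow_run(Name="Testing")
    glue_client.put_workflow_run_properties(
        Name="Testing",
        RunId=response['RunId'],
        RunProperties=run_properties,
    )
    return {'RunId': response['RunId']}

Depending on the boto3 version bundled with the Lambda runtime, you may still need the layer mentioned above for put_workflow_run_properties to be available.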

Understanding argparse to get dynamic maps with Geo-Location of tweets

I have found this Python code online (twitter_map_clustered.py) which (I think) helps create a map using the geodata of different tweets:
from argparse import ArgumentParser
import folium
from folium.plugins import MarkerCluster
import json

def get_parser():
    parser = ArgumentParser()
    parser.add_argument('--geojson')
    parser.add_argument('--map')
    return parser

def make_map(geojson_file, map_file):
    tweet_map = folium.Map(location=[50, 5], max_zoom=20)
    marker_cluster = MarkerCluster().add_to(tweet_map)
    geodata = json.load(open(geojson_file))
    for tweet in geodata['features']:
        tweet['geometry']['coordinates'].reverse()
        marker = folium.Marker(tweet['geometry']['coordinates'], popup=tweet['properties']['text'])
        marker.add_to(marker_cluster)
    # Save to HTML map file
    tweet_map.save(map_file)

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()
    make_map(args.geojson, args.map)
I managed to extract the geo information from different tweets and save it into a geo_data.json file. However, I have trouble understanding the code, especially the function get_parser().
It seems that we need to add arguments when running the file from the command prompt. The argument should be geo_data.json. However, it is also asking for a map: parser.add_argument('--map')
Why is that the case? In the code, aren't we creating the map here?
#Save to HTML map file
tweet_map.save(map_file)
Can you please help me? How would you run the Python script? Is there anything important I am missing?
As explained in the argparse documentation, it simply asks for the name of the geojson file and for a name that your code will use to save the map.
Therefore, you will run:
python twitter_map_clustered.py --geojson geo_data.json --map mymap.html
and you will get a map named mymap.html.
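If it helps to make the two options more self-explanatory, the parser could also mark them as required and give them help text. A small sketch of get_parser() (same option names as in the question):

from argparse import ArgumentParser

def get_parser():
    parser = ArgumentParser(description='Plot geotagged tweets on a clustered folium map.')
    # --geojson is the input file you created (geo_data.json), --map is the HTML file to write.
    parser.add_argument('--geojson', required=True, help='input GeoJSON file, e.g. geo_data.json')
    parser.add_argument('--map', required=True, help='output HTML map file, e.g. mymap.html')
    return parser

Running the script with --help then prints exactly what each argument is for.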

ScriptRunConfig with datastore reference on AML

When trying to run a ScriptRunConfig, using:
src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', ds.as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config)
run = experiment.submit(config=src)
It doesn't work and breaks with this when I submit the job :
... lots of things... and then
TypeError: Object of type 'DataReference' is not JSON serializable
However, if I run it with the Estimator, it works. One of the differences is that with a ScriptRunConfig we're using a list for the parameters, while the Estimator takes a dictionary.
Thanks for any pointers!
Being able to use DataReference in ScriptRunConfig is a bit more involved than doing just ds.as_mount(). You will need to convert it into a string in arguments and then update the RunConfiguration's data_references section with the DataReferenceConfiguration created from ds. Please see here for an example notebook on how to do that.
If you are just reading from the input location and not doing any writes to it, please check out Dataset. It allows you to do exactly what you are doing without doing anything extra. Here is an example notebook that shows this in action.
Below is a short version of the notebook
from azureml.core import Dataset
# more imports and code

ds = Datastore(workspace, 'mydatastore')
dataset = Dataset.File.from_files(path=(ds, 'path/to/input-data/within-datastore'))

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', dataset.as_named_input('input').as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config)
run = experiment.submit(config=src)
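For completeness, inside train.py the mounted location simply arrives as a command-line argument. A minimal sketch (the argument names mirror the ones used above):

# train.py (sketch)
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--input-data-dir', type=str, dest='input_data_dir')
parser.add_argument('--reg', type=float, default=0.99)
args = parser.parse_args()

# The dataset is mounted on the compute target, so this is just a local folder path.
print('Files in mounted input folder:', os.listdir(args.input_data_dir))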
You can also see the how-to-migrate-from-estimators-to-scriptrunconfig page in the official documentation.
The core code for using a DataReference in a ScriptRunConfig is:
# if you want to pass a DataReference object, such as the below:
datastore = ws.get_default_datastore()
data_ref = datastore.path('./foo').as_mount()

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      # cast the DataReference object to str
                      arguments=['--data-folder', str(data_ref)],
                      compute_target=compute_target,
                      environment=pytorch_env)

# set a dict of the DataReference(s) you want on the `data_references` attribute
# of the ScriptRunConfig's underlying RunConfiguration object.
src.run_config.data_references = {data_ref.data_reference_name: data_ref.to_config()}

error while calling function inside another function

I have a function using newspaper3k which extracts the summary for a given URL:
from newspaper import Article

def article_summary(row):
    url = row
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    text = article.summary
    return text
I have a pandas DataFrame with a column named url:
url
https://www.xyssss.com/dddd
https://www.sbkaksbk.com/shshshs
https://www.ascbackkkc.com/asbbs
............
............
There is another function main_code() which runs perfectly fine and inside which I'm using article_summary. I want to combine both functions, article_summary and main_code(), into one function final_code.
The 1st function is article_summary, shown above. Here is the 2nd function:
def main_code():
    article_data['article'] = article_data['url'].apply(article_summary)
    return article_data['article']
Then I did:
def final_code():
    article_summary()
    main_code()
But final_code() is not giving any output; it shows TypeError: article_summary() missing 1 required positional argument: 'row'
Are those the actual URLs you're using? If so, they seem to be causing an ArticleException; I tested your code with some Wikipedia pages and it works.
On that note, are you working with just one df? If not, it's probably a good idea to pass it as a variable to the function.
Edit after comments:
I think a tutorial on Python functions will be beneficial. That said, in regard to your specific question, calling the functions the way you described will make article_summary run twice, which is not needed in this case. As I said earlier, you should pass the DataFrame as an argument to the function; a tutorial on global vs. local variables and how to use them will also help.
The error you're getting is because article_summary requires a 'row' argument, so you cannot call it without one (please see the functions tutorial).
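A minimal sketch of what that advice looks like in code, with the DataFrame passed in explicitly and apply() supplying the 'row' argument (column names follow the question):

from newspaper import Article
import pandas as pd

def article_summary(row):
    # row is a single URL string handed over by apply()
    article = Article(row)
    article.download()
    article.parse()
    article.nlp()
    return article.summary

def final_code(article_data):
    # apply() calls article_summary once per URL, so there is no need
    # to call article_summary() separately without arguments.
    article_data['article'] = article_data['url'].apply(article_summary)
    return article_data['article']

# usage: summaries = final_code(article_data)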

passing additional arguments for `createOptFlow_DualTVL1` for Python

I am trying to use createOptFlow_DualTVL1 from OpenCV for Python. I am using this book (page 120) as a reference (https://books.google.de/books?id=TrZTDwAAQBAJ&pg=PA119&lpg=PA119&dq=cv2.createOptFlow_DualTVL1()+parameters+python&source=bl&ots=3sQlqNwhDX&sig=h7Ve4FzpsVJKkjrNXpy8t_Arpvw&hl=en&sa=X&ved=0ahUKEwicu6_EveTbAhUDb1AKHQp1Bq8Q6AEIUDAD#v=onepage&q=cv2.createOptFlow_DualTVL1()%20parameters%20python&f=false).
According to the book, I need to create an instance of cv2.DualTVL1OpticalFlow class by calling the cv2.createOptFlow_DualTVL1 function. Then the optical flow can be calculated by calling the calc function of the created instance. This function takes the previous frame, the current frame and the optical flow initialisations as arguments. The book also says that other algorithm parameters can be set by using set. But I am unable to pass any other parameter as argument in Python. Is there a way to set additional algorithm parameters? Below is a snippet of what I have until now.
optical_flow = cv2.createOptFlow_DualTVL1()
flow = optical_flow.calc(new_prvs, new_nxt, None)
I found the solution. For cv2 version '4.1.0', replace:
optical_flow = cv2.createOptFlow_DualTVL1()
flow = optical_flow.calc(new_prvs, new_nxt, None)
with:
optical_flow = cv2.optflow.DualTVL1OpticalFlow_create()
flow = optical_flow.calc(new_prvs, new_nxt, None)
The parameter descriptions can be found here: http://docs.opencv.org/3.3.0/dc/d47/classcv_1_1DualTVL1OpticalFlow.html
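To answer the original question about the extra algorithm parameters: once you have the instance, you tune it through the setter methods listed in that class documentation. A sketch (setTau and setLambda are examples from the documented parameter list, and the values shown are just the documented defaults):

import cv2

# Requires the opencv-contrib-python package for the cv2.optflow module.
optical_flow = cv2.optflow.DualTVL1OpticalFlow_create()

# Tune algorithm parameters via the setters documented for DualTVL1OpticalFlow.
optical_flow.setTau(0.25)
optical_flow.setLambda(0.15)

# new_prvs and new_nxt are the grayscale previous/current frames from the question.
flow = optical_flow.calc(new_prvs, new_nxt, None)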
