I'm trying to figure out how to get nested data as a dictionary/property from a YAML file.
The code below works if I provide the function with only one level, for example:
result = parse_yaml_file(config_yaml_file, 'section')
but it fails if I try something like:
result = parse_yaml_file(yaml_file, 'section.sub-section')
or
result = parse_yaml_file(yaml_file, '[\'section\'][\'sub-section\']')
Python 3 code:
import json
import yaml

def parse_yaml_file(yml_file, section):
    print('section : ' + section)
    data_dict = {}
    try:
        with open(yml_file) as f:
            # safe_load avoids PyYAML's deprecated no-Loader yaml.load()
            data_dict = yaml.safe_load(f)
    except (FileNotFoundError, IOError):
        exit_with_error('Issue finding/opening ' + yml_file)  # helper defined elsewhere
    if not section:
        return data_dict
    else:
        return data_dict.get(section)
result = parse_yaml_file(yaml_file, 'section.sub-section.property')
print(json.dumps(result, indent=4))
Is it possible to parse only one part/section of the YAML file?
Or just retrieve one sub-section/property from the parsed result?
I know I can get it from the dictionary like :
data_dict['section']['sub-section']['property']
but I want it to be flexible, not hardcoded, since the path to the data is provided as an argument to the function.
Thanks a lot for your help.
You could try using a library to help search the parsed YAML, e.g. dpath:
https://pypi.org/project/dpath/
import yaml
import dpath.util

def parse_yaml(yml_file, section):
    with open(yml_file, 'r') as f:
        data_dict = yaml.safe_load(f)
    # search() returns the subtree(s) matching the glob-style path
    return dpath.util.search(data_dict, section)

parse_yaml('file.yml', 'section/sub-section')
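If you would rather avoid an extra dependency: PyYAML reads the whole document in one go anyway, so the practical approach is to parse once and then walk the resulting dictionary. A minimal sketch of the dotted-path lookup from the question (get_nested and the 'section.sub-section.property' syntax are assumptions carried over from the question, not a standard API):
import yaml

def get_nested(yml_file, dotted_path):
    with open(yml_file) as f:
        data = yaml.safe_load(f)
    if not dotted_path:
        return data
    for key in dotted_path.split('.'):
        if not isinstance(data, dict):
            return None  # path descends into a non-mapping value
        data = data.get(key)
    return data

result = get_nested('config.yml', 'section.sub-section.property')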
I have a file with non-standard formatting, so I need to use spark.DataFrameReader (spark.read.csv) on the raw file directly so that I can set the appropriate parsing configurations.
How can I do this?
You'll want to follow the methodology over here. I strongly recommend using unit-test-based methods to iterate on your code until you recover the file contents.
Your compute function code will look like:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Parse each raw file individually so the reader options apply per file,
    # then stack the per-file DataFrames into one.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.option("header", "true").csv(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs)
    return output_df


@transform(
    the_output=Output("ri.foundry.main.dataset.my-awesome-output"),
    the_input=Input("ri.foundry.main.dataset.my-awesome-input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    # ls() yields file status entries; build fully qualified paths from their
    # dataset-relative .path attribute. Spark reads .gz text files transparently.
    files = [hadoop_path + '/' + f.path for f in input_filesystem.ls('**/*.csv.gz')]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
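As a generic reference for the "appropriate parsing configurations" part, here is a plain PySpark sketch (outside Foundry) of common reader options; the option names are standard spark.read.csv options, while the values and path are made-up examples:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("sep", ";")           # non-comma delimiter
      .option("quote", '"')
      .option("escape", "\\")
      .option("multiLine", "true")  # records spanning multiple lines
      .csv("/path/to/raw/file.csv.gz"))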
I am using the interpolation feature of the configparser module. My problem is that, for a value which is referred to across different sections, I would like the date or datetime to be part of its value. Below is what I could come up with; while it does the job, I feel there could be a better and more elegant way to handle this.
import configparser
import datetime
from io import StringIO


def update_tmstmp_value(config_string):
    # Before parsing, rewrite every 'tmstmp' line, replacing the strftime
    # format string with the formatted current date.
    fo = StringIO(config_string)
    data = ''
    for line in fo.readlines():
        if line.startswith('tmstmp'):
            key, tmstmp_str = line.strip().split('=')
            try:
                value = datetime.datetime.now().strftime(tmstmp_str)
            except ValueError:
                value = datetime.datetime.now().strftime('%Y%m%d')
            data += key + '=' + value + '\n'
        else:
            data += line
    return data


config_data = """
[DEFAULT]
tmstmp=%Y%m%d
type=REPLACE_ME_%(tmstmp)s

[section1]
item1=val1_%(type)s
item2=val2

[section2]
item3=val3_%(type)s
item4=val4
"""

config = configparser.ConfigParser()
modified_config_data = update_tmstmp_value(config_data)
config.read_string(modified_config_data)
print(config.items('section1'))
Output:
[('tmstmp', '20190402'), ('type', 'REPLACE_ME_20190402'), ('item1', 'val1_REPLACE_ME_20190402'), ('item2', 'val2')]
I have the following functions within a wrapper class around ConfigParser that does pretty much what you are trying to do.
This way I can add the date to any of the option values I want to.
from datetime import datetime


def get_date(self):
    """
    Sets up a date-formatted string.
    :return: Date string
    """
    return datetime.now().strftime("%Y%b%d")


def prepend_date_to_var(self, sect, option):
    """
    Allows a date to be added to a section variable.
    :param sect: INI section to look for the variable in
    :param option: option to look up under the INI section
    :return: None; the date is appended to the option's value in place
    """
    if self.config_parser.get(sect, option):
        var = self.config_parser.get(sect, option)
        var_with_date = var + '_' + self.get_date()
        self.config_parser.set(sect, option, var_with_date)
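For what it's worth, a more self-contained alternative (a sketch, not part of the original answers): rather than rewriting the config text, inject the formatted timestamp as a runtime default. Values passed via defaults= land in the DEFAULT section and take part in interpolation:
import configparser
import datetime

config_data = """
[DEFAULT]
type=REPLACE_ME_%(tmstmp)s

[section1]
item1=val1_%(type)s
item2=val2
"""

# tmstmp is computed at parse time instead of living in the file.
config = configparser.ConfigParser(
    defaults={'tmstmp': datetime.datetime.now().strftime('%Y%m%d')}
)
config.read_string(config_data)
print(config.items('section1'))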
I got a string from a server response:
responseString:"{"session":"vvSbMInXHRJuZQ==","age":7200,"prid":"901Vjmx9qenYKw","userid":"user_1"}"
then I do:
responseString[1..-2].tokenize(',')
got:
[""session":"vvSbMInXHRJuZQ=="", ""age":7200", ""prid":"901Vjmx9qenYKw"", ""userid":"user_1""]
get(3) got:
""userid":"user_1""
What I need is user_1. Is there any way I can actually get it? I have been stuck here; other JSON methods give similar results. How do I remove the outer double quotes?
Thanks.
If you pull out the proper JSON from responseString, then you can use JsonSlurper, as shown below:
import groovy.json.JsonSlurper

def s = 'responseString:"{"session":"vvSbMInXHRJuZQ==","age":7200,"prid":"901Vjmx9qenYKw","userid":"user_1"}"'
// Capture everything between responseString:" and the trailing quote.
def matcher = (s =~ /responseString:"(.*)"/)
assert matcher.matches()
def responseStr = matcher[0][1]

def jsonSlurper = new JsonSlurper()
def json = jsonSlurper.parseText(responseStr)
assert "user_1" == json.userid
This code can help you get to the userid.
def str = 'responseString:"{:"session":"vvSbMInXHRJuZQ==","age":7200,"prid":"901Vjmx9qenYKw","userid":"user_1","hdkshfsd":"sdfsdfsdf"}'
def match = (str =~ /"userid":"(.*?)"/)
log.info match[0][1]
This pattern lets you grab any of the quoted values from the string. Note that age's value is a bare number (no surrounding quotes), so for it the pattern needs a small tweak:
def match = (str =~ /"age":(\d+)/)
@Michael's code is also correct; it's just that you clarified that you specifically want the userid.
As part of my Ph.D. research, I am scraping numerous webpages and searching the scrape results for keywords.
This is how I do it thus far:
import pandas as pd
import requests

# load data as a pandas data frame with column df.url
df = pd.read_excel('sample.xls', header=0)

# define keyword search function
def contains_keywords(link, keywords):
    try:
        output = requests.get(link).text
        return int(any(x in output for x in keywords))
    except requests.exceptions.RequestException:
        return "Wrong/Missing URL"

# define the relevant keywords
mykeywords = ('for', 'bar')

# store search results in new column 'results'
df['results'] = df.url.apply(lambda l: contains_keywords(l, mykeywords))
This works just fine. I only have one problem: the list of relevant keywords mykeywords changes frequently, whilst the webpages stay the same. Running the code takes a long time, since I request the same pages over and over.
I have two questions:
(1) Is there a way to store the result of requests.get(link).text?
(2) And if so, how do I search within the saved file(s) to produce the same result as with the current script?
As always, thank you for your time and help! /R
You can download the content of the URLs and save each one in a separate file in a directory (e.g. 'links'):
import os
import requests

def get_link(url):
    # Derive a filesystem-safe file name from the URL.
    file_name = os.path.join('/path/to/links', url.replace('/', '_').replace(':', '_'))
    try:
        r = requests.get(url)
    except Exception as e:
        print("Failed to get " + url)
    else:
        with open(file_name, 'w') as f:
            f.write(r.text)
Then modify the contains_keywords function to read local files, so you won't have to use requests every time you run the script.
def contains_keywords(link, keywords):
    file_name = os.path.join('/path/to/links', link.replace('/', '_').replace(':', '_'))
    try:
        with open(file_name) as f:
            output = f.read()
        return int(any(x in output for x in keywords))
    except Exception as e:
        print("Can't access file: {}\n{}".format(file_name, e))
        return "Wrong/Missing URL"
Edit: I just added a try/except block in get_link and used an absolute path for file_name.
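To tie the two steps together, a small usage sketch (assuming the same df and mykeywords as in the question): download every page once, then re-run only the cheap local search whenever the keyword list changes:
# one-time (or occasional) download pass
df.url.apply(get_link)

# fast, repeatable search pass against the cached files
df['results'] = df.url.apply(lambda l: contains_keywords(l, mykeywords))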
def a():
    import json
    path = open('C:\\Users\\Bishal\\code\\57.json').read()
    config = json.load(path)
    for key in config:
        return key
You have already read the file with path = open(r'C:\Users\Bishal\code\57.json').read(), so path is a string holding the file's contents, not a file object; json.load expects a file-like object with a read() method, so the call fails.
Either pass the open file directly to json.load, or read the contents and parse the string with json.loads (note the s).
Option 1:
path = open(r'C:\Users\Bishal\code\57.json').read()
config = json.loads(path)
Option 2:
path = open(r'C:\Users\Bishal\code\57.json')
config = json.load(path)
path.close()
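A tidier variant of Option 2 (standard Python, not from the original answer) uses a context manager so the file is closed automatically:
import json

with open(r'C:\Users\Bishal\code\57.json') as f:
    config = json.load(f)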
Then you can do whatever you like with the result:
for key, item in config.items():
    print('{} - {}'.format(key, item))