Wrapping WriteToText Within DoFn

I'm trying to wrap WriteToText within a DoFn to allow for some customization/flexibility in how I write files. Specifically, I want to write different files based on an argument/input (a value provider argument). This is the code I have so far:
class WriteCustomFile(beam.DoFn):
    def __init__(self, input, output):
        self.input = input
        self.output = output

    def process(self, element):
        import re

        def FileVal(path):
            File1Regex = re.compile(r"[^\w](testfile)[\w]+(\.csv|\.txt)$")
            File2Regex = re.compile(r"[^\w](tester)[\w-]+(\.csv|\.txt)$")
            PathStr = str(path)
            if File1Regex.search(PathStr) != None:
                return "file1"
            elif File2Regex.search(PathStr) != None:
                return "file2"

        File1Header = "Header1,Header2,Header3,Header4,Header5"
        File2Header = "Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8"

        if FileVal(self.input.get()) == "file1":
            yield WriteToText(self.output.get(), shard_name_template='', header=File1Header)
        elif FileVal(self.input.get()) == "file2":
            yield WriteToText(self.output.get(), shard_name_template='', header=File2Header)
When I call this DoFn from within the pipeline, it does not write a file. What can I do to get this DoFn to work, or is there a better way to handle this?
Thank you!

Here, the best thing to do is probably to partition your input into multiple PCollections (either using Partition or a DoFn with multiple outputs) and write each one out separately.
More generally, one can use Dynamic Destinations, but this is not yet supported for Python.
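As a rough sketch of the Partition approach (the routing rule, the sample input, and the output paths here are placeholders, not taken from the question):

import apache_beam as beam
from apache_beam.io import WriteToText

FILE1_HEADER = "Header1,Header2,Header3,Header4,Header5"
FILE2_HEADER = "Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8"

def by_width(element, num_partitions):
    # Hypothetical rule: route rows with five or fewer columns to file1, the rest to file2.
    return 0 if len(element.split(",")) <= 5 else 1

with beam.Pipeline() as p:
    rows = p | beam.Create(["a,b,c,d,e", "a,b,c,d,e,f,g,h"])  # placeholder input
    file1_rows, file2_rows = rows | beam.Partition(by_width, 2)
    file1_rows | "WriteFile1" >> WriteToText("file1_out", shard_name_template="", header=FILE1_HEADER)
    file2_rows | "WriteFile2" >> WriteToText("file2_out", shard_name_template="", header=FILE2_HEADER)

Each branch gets its own WriteToText with its own header, which is the effect the DoFn in the question was trying to achieve.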

Related

How to check which function has been returned in python?

I have two methods which take different numbers of arguments. Here are the two functions:
def jumpMX(self, IAS, list):
    pass

def addMX(self, IAS):
    pass
I am using a function which will return one of these functions to main. I have stored this returned function in a variable named operation.
Since the number of parameters differs between the two, how do I identify which function has been returned?
if operation == jumpMX:
    operation(IAS, list)
elif operation == addMX:
    operation(IAS)
What is the syntax for this? Thanks in advance!
You can identify a function through its __name__ attribute:
def foo():
    pass

print(foo.__name__)
# Output: foo
...or in your case:
operation.__name__  # returns either "jumpMX" or "addMX", depending on which function is stored in operation
Here's a demo you can modify to your needs:
import random  # used only for demo purposes

def jumpMX(self, IAS, list):
    pass

def addMX(self, IAS):
    pass

def FunctionThatWillReturnOneOrTheOtherOfTheTwoFunctionsAbove():
    # Randomly returns either jumpMX or addMX
    # to simulate different scenarios
    funcs = [jumpMX, addMX]
    randomFunc = random.choice(funcs)
    return randomFunc

operation = FunctionThatWillReturnOneOrTheOtherOfTheTwoFunctionsAbove()
name = operation.__name__
if name == "jumpMX":
    operation(IAS, list)  # assumes IAS and list are defined by the caller
elif name == "addMX":
    operation(IAS)
You can import those functions and test for equality, as with most objects in Python.
classes.py
class MyClass:
    @staticmethod
    def jump(self, ias, _list):
        pass

    @staticmethod
    def add(self, ias):
        pass
main.py
from classes import MyClass

myclass_instance = MyClass()
operation = get_op()  # your function that returns MyClass.jump or MyClass.add
if operation == MyClass.jump:
    operation(myclass_instance, ias, _list)
elif operation == MyClass.add:
    operation(myclass_instance, ias)
However, I must emphasize that I don't know what you're trying to accomplish and this seems like a terribly contrived way of doing something like this.
Also, your Python code examples are not properly formatted. See PEP 8, which proposes a standard style guide for Python.

Dictionary to switch between methods with different arguments

A common workaround for the lack of a case/switch statement in Python is the use of a dictionary. I am trying to use this to switch between methods as shown below, but the methods have different argument sets and it's unclear how I can accommodate that.
def method_A():
    pass

def method_B():
    pass

def method_C():
    pass

def method_D():
    pass

def my_function(arg=1):
    switch = {
        1: method_A,
        2: method_B,
        3: method_C,
        4: method_D,
    }
    option = switch.get(arg)
    return option()

my_function(input)  # input would be read from a file or the command line
If I understand correctly, the dictionary keys become associated with the different methods, so calling my_function subsequently calls the method which corresponds to the key I gave as input. But that leaves no opportunity to pass any arguments to those subsequent methods. I can use default values, but that really isn't the point. The alternative is nested if-else statements to choose, which doesn't have this problem but is arguably less readable and less elegant.
Thanks in advance for your help.
The trick is to accept *args and **kwargs in my_function and pass the **kwargs on to your chosen function, which evaluates them there.
def method_A(**kwargs):
    print(kwargs.get("what"))  # uses the value of key "what"

def method_B(**kwargs):
    print(kwargs.get("whatnot", "Not provided"))  # uses another key's value

def my_function(*args, **kwargs):
    arg = kwargs.pop("arg", 1)  # get the arg value or default to 1
    switch = {
        1: method_A,
        2: method_B,
    }
    option = switch.get(arg)
    return option(**kwargs)

my_function(arg=1, what="hello")  # could provide arg=1 or arg=2
my_function(arg=2, what="hello")
Output:
hello
Not provided
See Use of *args and **kwargs for more on it.
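A related sketch, for completeness: functools.partial (not used in the answer above, just a common alternative) lets you bind the differing arguments when the dictionary is built, so every entry can then be called with no arguments:

from functools import partial

def method_A(what):
    print(what)

def method_B(whatnot="Not provided"):
    print(whatnot)

switch = {
    1: partial(method_A, "hello"),  # arguments are bound when the dict is built
    2: partial(method_B),
}

switch[1]()  # prints: hello
switch[2]()  # prints: Not provided

This keeps the call sites uniform at the cost of fixing the arguments up front.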

Write a recursive function to list all paths of parts.txt

Write a function list_files_recursive that returns a list of the paths of all the parts.txt files without using the os module's walk generator. Instead, the function should use recursion. The input will be a directory name.
Here is the code I have so far, and I think it's basically right, but the output is not one whole list.
def list_files_recursive(top_dir):
    rec_list_files = []
    list_dir = os.listdir(top_dir)
    for item in list_dir:
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path)
        else:
            if os.path.basename(item_path) == 'parts.txt':
                rec_list_files.append(os.path.join(item_path))
    print(rec_list_files)
    return rec_list_files
This is part of the output I'm getting (from the print statement):
['CarItems/Honda/Accord/1996/parts.txt']
[]
['CarItems/Honda/Odyssey/2000/parts.txt']
['CarItems/Honda/Odyssey/2002/parts.txt']
[]
So the problem is that it's not one list and that there are empty lists in there. I don't quite know why this isn't working and have tried everything to work through it. Any help is much appreciated!
This is very close, but the issue is that list_files_recursive's child calls don't pass results back to the parent. One way to fix this is to concatenate all of the lists from each child call together; another is to pass a reference to a single list all the way through the call chain.
Note that in rec_list_files.append(os.path.join(item_path)), there's no point in calling os.path.join with only a single argument. The print(rec_list_files) should also be omitted; it's a side effect that makes the output confusing to interpret, so only print in the caller. Additionally,
else:
    if ...:
can be written more clearly as elif:, since the two are logically equivalent. It's always a good idea to reduce nesting of conditionals whenever possible.
Here's the approach that works by extending the parent list:
import os

def list_files_recursive(top_dir):
    files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            files.extend(list_files_recursive(item_path))
            # ^^^^^^ add child results to the parent's list
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files

if __name__ == "__main__":
    print(list_files_recursive("foo"))
Or by passing a result list through the call tree:
import os

def list_files_recursive(top_dir, files=None):
    if files is None:  # avoids a shared mutable default argument
        files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path, files)
            # ^^^^^ pass our result list recursively
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files

if __name__ == "__main__":
    print(list_files_recursive("foo"))
A major problem with these functions is that they only work for finding files named precisely parts.txt, since that string literal was hardcoded. That makes them pretty much useless for anything but the immediate purpose. We should add a parameter allowing the caller to specify the target file to search for, making the function general-purpose.
Another problem is that the function doesn't do what its name claims: list_files_recursive should really be called find_file_recursive, or, due to the hardcoded string, find_parts_txt_recursive.
Beyond that, the function is a strong candidate for turning into a generator function, which is a common Python idiom for traversal, particularly for situations where the subdirectories may contain huge amounts of data that would be expensive to keep in memory all at once. Generators also allow the flexibility of using the function to cancel the search after the first match, further enhancing its (re)usability.
The yield keyword also makes the function code itself very clean: we can avoid the problem of keeping a result data structure entirely and just fire off result items on demand.
Here's how I'd write it:
import os

def find_file_recursive(top_dir, target):
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            yield from find_file_recursive(item_path, target)
        elif os.path.basename(item_path) == target:
            yield item_path

if __name__ == "__main__":
    print(list(find_file_recursive("foo", "parts.txt")))
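As a small usage sketch (the "foo" directory is a placeholder), the generator makes it cheap to stop at the first match instead of walking the whole tree:

first_match = next(find_file_recursive("foo", "parts.txt"), None)  # None if no match
print(first_match)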

Calling various functions based on input in Python 3.x

I'm writing some code to get various data from a class (which extracts data from a '.csv' file). I was wondering if there was a way to call one of these methods based on the name of an input.
I've attempted to create a function called get(), which takes in 'param_name', the name of the method contained within the class that I want to call. I was wondering if there was a more elegant way to solve this without creating a large number of if statements.
def get(param_name):
    # Some initialisation of the .csv file goes here. This works as intended.
    list_of_objects = []  # a list of objects with methods function1(), function2() for getting data out of the .csv
    for item in list_of_objects:
        if param_name == "name of function 1":
            return function1()
        if param_name == "name of function 2":
            return function2()
You could store your functions in a dictionary, like so:
function_dict = {
    'function_1': function_1,
    'function_2': function_2,
}
To use these you could do:
function_to_use = function_dict.get(param_name)
function_to_use(*args, **kwargs)  # *args and **kwargs are the arguments to call with
If you want to return a list after you have applied the function to every item in list_of_objects, instead of the for loop you could do:
list(map(function_to_use, list_of_objects))
You could use __getattribute__:
class Alpha:
    def f1(self):
        print("F1")

x = Alpha()
x.__getattribute__('f1')()
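As a side note, the built-in getattr performs the same lookup and is generally considered the more idiomatic spelling; its optional third argument also gives you a fallback value:

getattr(x, 'f1')()             # equivalent to x.__getattribute__('f1')()
print(getattr(x, 'f2', None))  # prints None instead of raising AttributeError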
You can do that using globals(), which returns a dict containing the module's global names, including its functions.
def fun1():
    print('this is fun1')

def fun2():
    print('this is fun2')

def get(func_name):
    globals()[func_name]()

get('fun1')
get('fun2')
Will Output:
this is fun1
this is fun2

Load inconsistent data in pymongo

I am working with pymongo and want to ensure that saved data can be loaded even if additional data elements have been added to the schema.
I have used this for classes that don't need to have the information processed before assigning it to class attributes:
class MyClass(object):
    def __init__(self, instance_id):
        # set default values
        self.database_id = instance_id
        self.myvar = 0
        # load values from the database
        self.__load()

    def __load(self):
        data_dict = Collection.find_one({"_id": self.database_id})
        for key, attribute in data_dict.items():
            self.__setattr__(key, attribute)
However, this doesn't work in classes where I have to process the data from the database:
class Example(object):
    def __init__(self, name):
        self.name = name
        self.database_id = None
        self.member_dict = {}
        self.load()

    def load(self):
        data_dict = Collection.find_one({"name": self.name})
        self.database_id = data_dict["_id"]
        for element in data_dict["element_list"]:
            self.process_element(element)
        for member_name, member_info in data_dict["member_class_dict"].items():
            self.member_dict[member_name] = MemberClass(member_info)

    def process_element(self, element):
        print("Do Stuff")
Two example use cases I have are:
1) A list of strings that are used to set flags; this is done by calling a function with the string as the argument (process_element above).
2) A dictionary of dictionaries which are used to create a list of instances of a class (MemberClass(member_info) above).
I tried creating properties to handle this but found that __setattr__ doesn't look for properties.
I know I could redefine __setattr__ to look for specific names but it is my understanding that this would slow down all set interactions with the class and I would prefer to avoid that.
I also know I could use a bunch of try/excepts to catch the errors but this would end up making the code very bulky.
I don't mind the load function being slowed down a bit for this but very much want to avoid anything that will slow down the class outside of loading.
So the solution I came up with uses the idea behind changing the __setattr__ method, but handles the special cases in the load function instead of in __setattr__.
def load(self):
    data_dict = Collection.find_one({"name": self.name})
    for key, attribute in data_dict.items():  # was world_data, which is undefined here
        if key == "_id":
            self.database_id = attribute
        elif key == "element_list":
            for element in attribute:
                self.process_element(element)
        elif key == "member_class_dict":
            for member_name, member_info in attribute.items():
                self.member_dict[member_name] = MemberClass(member_info)
        else:
            self.__setattr__(key, attribute)
This provides all of the functionality of overriding the __setattr__ method without slowing down any future calls to __setattr__ outside of loading the class.
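For illustration only, the same idea can also be written with a dispatch dictionary for the special keys. This is an alternative sketch, not the answer's approach, and it assumes the Collection, MemberClass, and schema from the question:

def load(self):
    data_dict = Collection.find_one({"name": self.name})

    def load_elements(elements):
        for element in elements:
            self.process_element(element)

    def load_members(info_dict):
        for member_name, member_info in info_dict.items():
            self.member_dict[member_name] = MemberClass(member_info)

    handlers = {
        "_id": lambda v: setattr(self, "database_id", v),
        "element_list": load_elements,
        "member_class_dict": load_members,
    }
    for key, attribute in data_dict.items():
        # Unknown keys fall back to plain attribute assignment.
        handlers.get(key, lambda v, k=key: setattr(self, k, v))(attribute)

Like the if/elif version, this keeps all of the special-case handling inside load, so ordinary calls to __setattr__ are unaffected.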
