csv file process in Python - python-3.x

I work with CSV data like the following:
ticker,exchange_country,company_name,price,exchange_rate,shares_outstanding,net_income
1,HK,CK HUTCHISON HOLDINGS LTD,1.404816984,7.757949829,3859.677979,31633
2,HK,CLP HOLDINGS LTD,1.312602194,7.757949829,2526.450928,16319
3,HK,HONG KONG & CHINA GAS CO LTD,0.234939214,7.757949829,12717.04199,7546.200195
11,HK,HANG SENG BANK LTD,2.198193203,7.757949829,1911.843018,15451
I have a StockStatRecord class:
class StockStatRecord:
    def __init__(self, stock_load):
        self.name = stock_load[0]
        self.company_name = stock_load[2]
        self.exchange_country = stock_load[1]
        self.price = stock_load[3]
        self.exchange_rate = stock_load[4]
        self.shares_outstanding = stock_load[5]
        self.net_income = stock_load[6]
How should I create another class that extracts the data from that CSV, parses it, creates a new record, and returns the record created? This class also needs to validate the rows while reading. Validation fails for any row that is missing any piece of information, or if the name (symbol or player name) is empty, or if any of the numbers (int or float) cannot be parsed (watch out for division by zero).

There are several ways of doing this: either rolling out the code yourself, or using a Python module made for verifying data schemas, like Colander, or the extended CSV reader in Pandas (as Zwinck posted in the comment above).
What is not usually needed is a separate class to check values: you can do that in the same class, or, more commonly, have a base class that implements the data-validation mechanisms and then just add extra information on each field in the actual data classes. And finally, if you need to process data and hand an object back, there is no need for a class, because in Python functions can exist independently of classes; there is no need to hammer every piece of code into a class.
One simple thing to do there is to use Python's csv.DictReader instead of csv.reader to read the rows. That way each piece of data is already bound to its column name, as a dict, instead of sitting in a list where you have to track the column numbers manually. Then define a property for each column that needs validation, so that the fields are validated when they are set, and an __init__ method that simply assigns all fields to their respective attributes:
import csv

class StockStatRecord:
    def __init__(self, row):
        # setattr goes through the property setters defined below
        for key, value in row.items():
            setattr(self, key, value)

    # The property name must match the CSV column name
    # (the sample data calls this column "ticker").
    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        if not value:  # example verification for empty name
            raise ValueError("empty name")
        self._name = value

    # continue for other fields

all_records = []
with open("mydatafile.csv", newline="") as f:
    for row in csv.DictReader(f):
        try:
            all_records.append(StockStatRecord(row))
        except ValueError:
            print("Some error at record: {}".format(row))
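The point above about plain functions can be sketched as follows. This is only a minimal illustration under stated assumptions: the column names are taken from the sample header in the question, BadData, parse_row and load_records are hypothetical names, and treating a zero exchange_rate as invalid is a guess at what the "division by zero" warning refers to, since the question does not say which value is used as a divisor.

```python
import csv
import io

# Column names come from the sample CSV header in the question.
NUMERIC_FIELDS = ("price", "exchange_rate", "shares_outstanding", "net_income")
REQUIRED_FIELDS = ("ticker", "exchange_country", "company_name") + NUMERIC_FIELDS


class BadData(Exception):
    """Hypothetical exception raised when a row fails validation."""


def parse_row(row):
    """Validate one DictReader row and return a cleaned dict."""
    for field in REQUIRED_FIELDS:
        # DictReader fills columns missing from a short row with None.
        if row.get(field) is None or not row[field].strip():
            raise BadData("missing value for {!r}".format(field))
    record = {field: row[field].strip() for field in REQUIRED_FIELDS}
    for field in NUMERIC_FIELDS:
        try:
            record[field] = float(record[field])
        except ValueError:
            raise BadData("{!r} is not a number: {!r}".format(field, row[field]))
    if record["exchange_rate"] == 0:
        # Assumption: later computations divide by the exchange rate.
        raise BadData("exchange_rate is zero")
    return record


def load_records(lines):
    """Return (good_records, rejected_rows) for an iterable of CSV lines."""
    good, rejected = [], []
    for row in csv.DictReader(lines):
        try:
            good.append(parse_row(row))
        except BadData as exc:
            rejected.append((row, str(exc)))
    return good, rejected


# io.StringIO stands in for an open file; the invalid rows are made up.
sample = io.StringIO(
    "ticker,exchange_country,company_name,price,exchange_rate,shares_outstanding,net_income\n"
    "1,HK,CK HUTCHISON HOLDINGS LTD,1.404816984,7.757949829,3859.677979,31633\n"
    ",HK,NO TICKER LTD,1.0,7.757949829,100.0,5\n"
    "3,HK,BAD PRICE LTD,abc,7.757949829,100.0,5\n"
    "4,HK,SHORT ROW LTD,1.0\n"
)
records, errors = load_records(sample)
print(len(records), "valid;", len(errors), "rejected")
```

Returning the rejected rows together with the reason keeps the decision about logging or aborting with the caller; the cleaned dict could just as well be passed to a record class.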

Related

How can I assign a custom object to xarray data values?

I have created a DataArray using xarray successfully:
df_invoice_features = xr.DataArray(data=None,
                                   dims={"y", "x"},
                                   coords={"y": unique_invoices, "x": cols})
I created a custom class and assigned one value of xarray to the instance of this class:
class MyArray:
    def __init__(self, s):
        self.arr = np.array((s))

    def set(self, idx, val):
        self.arr[idx] = val

    def get(self):
        return self.arr
df_invoice_features.loc['basket_value_brand', invoice_id] = MyArray(len_b)
The assignment succeeds as well.
But when I try to update the array held by this class instance:
df_invoice_features.loc['basket_value_brand', invoice_id].set(0, 10)
It returns this error:
AttributeError: 'DataArray' object has no attribute 'set'
How can I use an array, dictionary or my custom object inside xarray data values?
So df_invoice_features.loc['basket_value_brand', invoice_id] doesn't actually return MyArray(len_b). Instead, it returns an xarray DataArray; specifically the subset of your full DataArray at the coordinate ['basket_value_brand', invoice_id]. This doesn't just include the value at that location (MyArray(len_b)), but also all the other information stored at that DataArray location; i.e., your coordinates, metadata, etc.
If you want to access the actual value at that location, you'll have to use .values; i.e.,
df_invoice_features.loc['basket_value_brand', invoice_id].values
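That should give you access to the MyArray(len_b) you're looking for. Note that for a single location, .values is a zero-dimensional NumPy array of dtype object wrapping your instance, so you may need .values.item() to unwrap it before calling set. However, I'm not entirely clear what you would like to do with that class. If you're trying to change the value of your DataArray at that location, this bit of the xarray docs in particular may be useful to review.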

Best way to model JSON data in python

This question may be opinion based, but I figured I'd give it a shot.
I am attempting to create a variety of classes that get their values from JSON data. The JSON data is not under my control, so I have to parse the data and select the values I want. My current implementation subclasses UserDict from Python 3's collections module. However, I have had iterations where I directly created attributes and set their values to the parsed data.
The reason I changed to using UserDict is the ease of using the update function.
However, it feels odd to access values as MyClass['attribute'] rather than MyClass.attribute.
Is there a more pythonic way to model this data?
I am not 100% convinced that this makes sense, but you could try this:
class MyClass(object):
    def __init__(self, **kwargs):
        for key in kwargs.keys():
            setattr(self, key, kwargs[key])

my_json = {"a": 1, "b": 2, "c": 3}
my_instance = MyClass(**my_json)
print(my_instance.a)
# 1
print(my_instance.b)
# 2
print(my_instance.c)
# 3
--- edit
in case you have nested data you could also try this:
class MyClass(object):
    def __init__(self, **kwargs):
        for key in kwargs.keys():
            if isinstance(kwargs[key], dict):
                setattr(self, key, MyClass(**kwargs[key]))
            else:
                setattr(self, key, kwargs[key])

my_json = {"a": 1, "b": 2, "c": {"d": 3}}
my_instance = MyClass(**my_json)
print(my_instance.a)
# 1
print(my_instance.b)
# 2
print(my_instance.c.d)
# 3
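A related option, sketched here as an alternative rather than as what the answer above proposes, is to let the standard library build the attribute-style objects while the JSON text is being decoded. The example assumes the data arrives as a JSON string; the keys are made up.

```python
import json
from types import SimpleNamespace

raw = '{"a": 1, "b": 2, "c": {"d": 3}, "rows": [{"e": 4}]}'

# object_hook is called for every decoded JSON object (dict), innermost first,
# so nested objects, including those inside lists, become namespaces too.
data = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

print(data.a)          # 1
print(data.c.d)        # 3
print(data.rows[0].e)  # 4
```

Keys that are not valid Python identifiers cannot be reached with dot syntax, so this only suits JSON whose keys look like attribute names.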

Load inconsistent data in pymongo

I am working with pymongo and want to ensure that saved data can still be loaded even if additional data elements have been added to the schema.
I have used this for classes that don't need to have the information processed before assigning it to class attributes:
class MyClass(object):
    def __init__(self, instance_id):
        # set default values
        self.database_id = instance_id
        self.myvar = 0
        # load values from database
        self.__load()

    def __load(self):
        data_dict = Collection.find_one({"_id": self.database_id})
        for key, attribute in data_dict.items():
            self.__setattr__(key, attribute)
However, in classes where I have to process the data from the database, this doesn't work:
class Example(object):
    def __init__(self, name):
        self.name = name
        self.database_id = None
        self.member_dict = {}
        self.load()

    def load(self):
        data_dict = Collection.find_one({"name": self.name})
        self.database_id = data_dict["_id"]
        for element in data_dict["element_list"]:
            self.process_element(element)
        for member_name, member_info in data_dict["member_class_dict"].items():
            self.member_dict[member_name] = MemberClass(member_info)

    def process_element(self, element):
        print("Do Stuff")
Two example use cases I have are:
1) A list of strings that are used to set flags; this is done by calling a function with the string as the argument. (def process_element above)
2) A dictionary of dictionaries which are used to create a list of instances of a class. (MemberClass(member_info) above)
I tried creating properties to handle this but found that __setattr__ doesn't look for properties.
I know I could redefine __setattr__ to look for specific names but it is my understanding that this would slow down all set interactions with the class and I would prefer to avoid that.
I also know I could use a bunch of try/excepts to catch the errors but this would end up making the code very bulky.
I don't mind the load function being slowed down a bit for this but very much want to avoid anything that will slow down the class outside of loading.
So the solution I came up with borrows the idea of changing the __setattr__ method, but handles the special cases in the load function instead of in __setattr__.
def load(self):
    data_dict = Collection.find_one({"name": self.name})
    for key, attribute in data_dict.items():
        if key == "_id":
            self.database_id = attribute
        elif key == "element_list":
            for element in attribute:
                self.process_element(element)
        elif key == "member_class_dict":
            for member_name, member_info in attribute.items():
                self.member_dict[member_name] = MemberClass(member_info)
        else:
            # any other key is stored as a plain attribute
            self.__setattr__(key, attribute)
This provides all of the functionality of overriding the __setattr__ method without slowing down any future calls to __setattr__ outside of loading the class.
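If the if/elif chain grows, the same idea can be written as a lookup table from key to handler. The following is only a sketch of that variation: the _loaders mapping and the _load_* method names are hypothetical, MemberClass is a stand-in, the flags list is a placeholder for whatever process_element does, and a plain dict replaces the result of Collection.find_one so the example runs without a database.

```python
class MemberClass:
    """Stand-in for the MemberClass from the question."""
    def __init__(self, info):
        self.info = info


class Example:
    def __init__(self, data_dict):
        self.database_id = None
        self.flags = []
        self.member_dict = {}
        # Keys that need processing, mapped to the bound method that handles them.
        self._loaders = {
            "_id": self._load_id,
            "element_list": self._load_elements,
            "member_class_dict": self._load_members,
        }
        self.load(data_dict)

    def load(self, data_dict):
        for key, value in data_dict.items():
            handler = self._loaders.get(key)
            if handler is not None:
                handler(value)
            else:
                # Unknown keys (for example fields added to the schema later)
                # are stored as plain attributes.
                setattr(self, key, value)

    def _load_id(self, value):
        self.database_id = value

    def _load_elements(self, elements):
        for element in elements:
            self.flags.append(element)  # placeholder for process_element

    def _load_members(self, members):
        for name, info in members.items():
            self.member_dict[name] = MemberClass(info)


# A plain dict standing in for the result of Collection.find_one(...).
doc = {
    "_id": 42,
    "name": "demo",
    "element_list": ["flag_a", "flag_b"],
    "member_class_dict": {"m1": {"x": 1}},
    "added_later": True,
}
obj = Example(doc)
print(obj.database_id, obj.name, obj.flags, obj.added_later)
```

As in the answer above, the extra lookup happens only inside load, so ordinary attribute access on the instance is unaffected.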

Creating a list of Class objects from a file with no duplicates in attributes of the objects

I am currently taking some computer science courses in school, have come to a dead end, and need a little help. As the title says, I need to create a list of class objects from a file, where objects with a duplicate attribute are not added to the list. I was able to do this successfully with a Python set(), but apparently that isn't allowed for this particular assignment. I have tried various other ways but can't seem to get it working without a set. I believe the point of the assignment is to compare data structures in Python and use the slowest method possible, as it also has to be timed. My code using set() is below.
import time

class Students:
    def __init__(self, LName, FName, ssn, email, age):
        self.LName = LName
        self.FName = FName
        self.ssn = ssn
        self.email = email
        self.age = age

    def getssn(self):
        return self.ssn

def main():
    t1 = time.time()
    f = open('InsertNames.txt', 'r')
    studentlist = []
    seen = set()
    for line in f:
        parsed = line.split(' ')
        parsed = [i.strip() for i in parsed]
        if parsed[2] not in seen:
            studentlist.append(Students(parsed[0], parsed[1], parsed[2], parsed[3], parsed[4]))
            seen.add(parsed[2])
        else:
            print(parsed[2], 'already in list, not added')
    f.close()
    print('final list length: ', len(studentlist))
    t2 = time.time()
    print('time = ', t2-t1)

main()
Note that the only duplicates to check for are those of the .ssn attribute, and a duplicate should not be added to the list. Is there a way to check what is already in the list by that specific attribute before adding it?
Edit: I forgot to mention that only one list is allowed in memory.
You can write
if not any(s.ssn==parsed[2] for s in studentlist):
without committing to this comparison as the meaning of ==. At this level of work, you probably are expected to write out the loop and set a flag yourself rather than use a generator expression.
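Written out as an explicit loop with a flag, that check might look like the sketch below. The helper name ssn_in_list, the Student class and the sample rows are made up for illustration, and io.StringIO stands in for the real file.

```python
import io


class Student:
    def __init__(self, lname, fname, ssn, email, age):
        self.lname = lname
        self.fname = fname
        self.ssn = ssn
        self.email = email
        self.age = age


def ssn_in_list(students, ssn):
    """Linear search: return True if any student already has this ssn."""
    found = False
    for student in students:
        if student.ssn == ssn:
            found = True
            break  # no need to look further
    return found


# io.StringIO stands in for open('InsertNames.txt'); the rows are made up.
source = io.StringIO(
    "Doe Jane 111-11-1111 jane@example.com 20\n"
    "Roe Rick 222-22-2222 rick@example.com 21\n"
    "Doe Janet 111-11-1111 janet@example.com 22\n"
)

student_list = []
for line in source:
    parsed = line.split()
    if ssn_in_list(student_list, parsed[2]):
        print(parsed[2], "already in list, not added")
    else:
        student_list.append(Student(*parsed))

print("final list length:", len(student_list))
```

Each insertion scans the whole list, so the total work grows quadratically with the number of students, which is presumably the slow behaviour the assignment wants timed.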
Since you already took the time to write a class representing a student and since ssn is a unique identifier for the instances, consider writing an __eq__ method for that class.
def __eq__(self, other):
    return self.ssn == other.ssn
This will make your life easier when you want to compare two students, and in your case make a list (specifically not a set) of students.
Then your code would look something like:
# Student: the question's class (named Students there) with __eq__ added
student_list = []
with open('InsertNames.txt') as f:
    for line in f:
        student = Student(*line.strip().split())
        if student not in student_list:
            student_list.append(student)
Explanation
Opening a file with a with statement makes your code cleaner and ensures the file is closed correctly even if an error occurs. And since 'r' is the default mode for open, it doesn't need to be there.
You should strip the line before splitting it, just to handle some edge cases, but this is not obligatory.
Calling split with no argument splits on any run of whitespace, so the ' ' argument isn't necessary either. To be clear, this does not mean that a single space character is the default: split() and split(' ') behave differently when there are consecutive spaces.
Creating the student before adding it to the list may sound like too much overhead for this simple use, but since only one __init__ method is called it is not that bad. The plus side is that the not in test makes the code more readable.
The in operator (and not in, of course) checks whether the object is in the list by using the __eq__ method of that object. Since you implemented that method, in works for your Student class instances.
The student is added only if it doesn't already exist in the list.
One final thing: no list is created here other than the return value of split and the student_list you created.
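To see the membership test relying on __eq__, here is a small self-contained sketch. The isinstance guard is an addition that is not in the answer above, and the names and values are made up.

```python
class Student:
    def __init__(self, lname, fname, ssn, email, age):
        self.lname = lname
        self.fname = fname
        self.ssn = ssn
        self.email = email
        self.age = age

    def __eq__(self, other):
        # Added guard: only compare against other Student instances.
        if not isinstance(other, Student):
            return NotImplemented
        return self.ssn == other.ssn


a = Student("Doe", "Jane", "111-11-1111", "jane@example.com", "20")
b = Student("Doe", "Janet", "111-11-1111", "janet@example.com", "22")
c = Student("Roe", "Rick", "222-22-2222", "rick@example.com", "21")

roster = [a]
print(b in roster)         # True: same ssn, so b == a
print(c in roster)         # False
print(a == "111-11-1111")  # False: not a Student
print(Student.__hash__)    # None
```

The last line shows a side effect worth knowing: a class that defines __eq__ without __hash__ becomes unhashable, so these instances can no longer be put in a set or used as dict keys. That is harmless for this assignment, where sets are not allowed anyway.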

Dynamically add methods to a class in Python 3.0

I'm trying to write a Database Abstraction Layer in Python which lets you construct SQL statements using chained function calls such as:
results = (db.search("book")
    .author("J. K. Rowling")
    .price("<40.00")
    .title("Harry")
    .execute())
but I am running into problems when I try to dynamically add the required methods to the db class.
Here are the important parts of my code:
import inspect

def myName():
    return inspect.stack()[1][3]

class Search():
    def __init__(self, family):
        self.family = family
        self.options = ['price', 'name', 'author', 'genre']
        # self.options is generated based on family, but this is an example
        for opt in self.options:
            self.__dict__[opt] = self.__Set__
        self.conditions = {}

    def __Set__(self, value):
        self.conditions[myName()] = value
        return self

    def execute(self):
        return self.conditions
However, when I run an example such as:
print(db.search("book").price(">4.00").execute())
it outputs:
{'__Set__': 'harry'}
Am I going about this the wrong way? Is there a better way to get the name of the function being called or to somehow make a 'hard copy' of the function?
You can simply add the search functions (methods) after the class is created:
class Search:  # The class does not include the search methods, at first
    def __init__(self):
        self.conditions = {}

def make_set_condition(option):  # Factory function that generates a "condition setter" for "option"
    def set_cond(self, value):
        self.conditions[option] = value
        return self
    return set_cond

for option in ('price', 'name'):  # The class is extended with additional condition setters
    setattr(Search, option, make_set_condition(option))

>>> Search().name("Nice name").price('$3').conditions  # Example
{'price': '$3', 'name': 'Nice name'}
PS: This class has an __init__() method that does not have the family parameter (the condition setters are dynamically added at runtime, but are added to the class, not to each instance separately). If Search objects with different condition setters need to be created, then the following variation on the above method works (the __init__() method has a family parameter):
import types

class Search:  # The class does not include the search methods, at first
    def __init__(self, family):
        self.conditions = {}
        for option in family:  # The instance is extended with additional condition setters
            # The new 'option' attributes must be methods, not regular functions:
            setattr(self, option, types.MethodType(make_set_condition(option), self))

def make_set_condition(option):  # Factory function that generates a "condition setter" for "option"
    def set_cond(self, value):
        self.conditions[option] = value
        return self
    return set_cond

>>> o0 = Search(('price', 'name'))  # Example
>>> o0.name("Nice name").price('$3').conditions
{'price': '$3', 'name': 'Nice name'}
>>> dir(o0)  # Each Search object has its own condition setters (here: name and price)
['__doc__', '__init__', '__module__', 'conditions', 'name', 'price']
>>> o1 = Search(('director', 'style'))
>>> o1.director("Louis L").conditions  # New method name
{'director': 'Louis L'}
>>> dir(o1)  # Each Search object has its own condition setters (here: director and style)
['__doc__', '__init__', '__module__', 'conditions', 'director', 'style']
Reference: http://docs.python.org/howto/descriptor.html#functions-and-methods
If you really need search methods that know about the name of the attribute they are stored in, you can simply set it in make_set_condition() with
set_cond.__name__ = option # Sets the function name
(just before the return set_cond). Before doing this, the method Search.price has the following name:
>>> Search.price
<function set_cond at 0x107f832f8>
after setting its __name__ attribute, you get a different name:
>>> Search.price
<function price at 0x107f83490>
Setting the method name this way makes possible error messages involving the method easier to understand.
Firstly, you are not adding anything to the class, you are adding it to the instance.
Secondly, you don't need to access dict. The self.__dict__[opt] = self.__Set__ is better done with setattr(self, opt, self.__Set__).
Thirdly, don't use __xxx__ as attribute names. Those are reserved for Python-internal use.
Fourthly, as you noticed, Python is not easily fooled. The internal name of the method you call is still __Set__, even though you access it under a different name. :-) The name is set when you define the method as a part of the def statement.
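The fourth point can be checked directly with a few lines. This is a small illustrative sketch with made-up names, not part of the code in the question.

```python
def original(value):
    return value

alias = original          # a second name for the same function object
print(alias.__name__)     # original


class Demo:
    def method(self):
        return "called"


d = Demo()
d.other = d.method        # bound method stored under another attribute name
print(d.other.__name__)   # method
print(d.other())          # called
```

The attribute name used at the call site never reaches the function, which is why inspecting the stack in the question always reports __Set__.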
You probably want to create and set the options methods with a metaclass. You also might want to actually create those methods instead of trying to use one method for all of them. If you really want to use only one method, __getattr__ is the way to go, but it can be a bit fiddly and I generally recommend against it. Lambdas or other dynamically generated methods are probably better.
Here is some working code to get you started (not the whole program you were trying to write, but something that shows how the parts can fit together):
class Assign:
    def __init__(self, searchobj, key):
        self.searchobj = searchobj
        self.key = key

    def __call__(self, value):
        self.searchobj.conditions[self.key] = value
        return self.searchobj

class Book():
    def __init__(self, family):
        self.family = family
        self.options = ['price', 'name', 'author', 'genre']
        self.conditions = {}

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails
        if key in self.options:
            return Assign(self, key)
        # AttributeError keeps hasattr() and getattr() defaults working
        raise AttributeError('There is no option for: %s' % key)

    def execute(self):
        # XXX do something with the conditions.
        return self.conditions

b = Book('book')
print(b.price(">4.00").author('J. K. Rowling').execute())
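For comparison, the metaclass route mentioned above could look roughly like the following. This is a sketch under assumptions: SearchMeta, BookSearch and the class-level options tuple are hypothetical names, and the answer itself does not show a metaclass.

```python
class SearchMeta(type):
    """Hypothetical metaclass: adds one chainable setter per name in `options`."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        for option in namespace.get("options", ()):
            setattr(cls, option, mcls._make_setter(option))
        return cls

    @staticmethod
    def _make_setter(option):
        def setter(self, value):
            self.conditions[option] = value
            return self  # returning self is what allows chaining
        setter.__name__ = option  # nicer repr and error messages
        return setter


class BookSearch(metaclass=SearchMeta):
    options = ("price", "name", "author", "genre")

    def __init__(self):
        self.conditions = {}

    def execute(self):
        return self.conditions


print(BookSearch().price(">4.00").author("J. K. Rowling").execute())
# {'price': '>4.00', 'author': 'J. K. Rowling'}
```

Here the setters are real methods created once per class, so they show up in dir() and misspelled options fail with the usual AttributeError.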

Resources