Unsuccessful in trying to convert a column of strings to integers in Python (hoping to sort)

I am attempting to sort a dataframe by a column called 'GameId', which is currently of type string, and the sort result is unexpected. I have tried the following, but the type check still returns string.
TEST['GameId'] = TEST['GameId'].astype(int)
type('GameId')
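Note: type('GameId') checks the type of the string literal 'GameId', not the column, so it will always report str. Checking the column's dtype shows whether the conversion worked; a minimal sketch with a made-up frame standing in for TEST:

```python
import pandas as pd

# hypothetical data standing in for the TEST dataframe from the question
TEST = pd.DataFrame({"GameId": ["10", "2", "33"]})

# astype(int) returns a new Series; assign it back to the column
TEST["GameId"] = TEST["GameId"].astype(int)

print(type("GameId"))        # <class 'str'> -- the type of the literal, always
print(TEST["GameId"].dtype)  # an integer dtype, so the conversion did work

# sorting now compares numbers, not strings
print(TEST.sort_values("GameId")["GameId"].tolist())  # [2, 10, 33]
```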

One way to make the data handling easier is using dataclasses!

from dataclasses import dataclass

# the dataclass decorator generates __init__ and friends from the type hints
@dataclass
class Columns:
    channel_id: int
    frequency_hz: int
    power_dBmV: float
    name: str

# this class wraps the dataclass so the data can be accessed as
# self.data.frequency_hz, self.data.power_dBmV, etc.
class RadioChannel:
    radio_values = ['channel_id', 'frequency_hz', 'power_dBmV']

    def __init__(self, data):  # self means this instance, like 'this' in other languages
        # keyword names must match the dataclass field names exactly
        self.data = Columns(channel_id=data[0], frequency_hz=data[1],
                            power_dBmV=data[4], name=data[3])

    def present_data(self):
        # this class method is optional
        from rich.console import Console
        from rich.table import Table
        console = Console()
        table = Table(title="My Radio Channels")
        for item in self.radio_values:
            table.add_column(item)
        # rich tables expect strings, so convert the values
        table.add_row(str(self.data.channel_id), str(self.data.frequency_hz),
                      str(self.data.power_dBmV))
        console.print(table)
# ignore this if its confusing
# now inside the functional part of your script
if __name__ == '__main__':
    myData = []
    # reading an imaginary file here
    with open("my_radio_data_file", 'r') as myfile:
        for line in myfile:
            # each line is one string, so split it into fields first
            myData.append(line.strip().split(','))
    # my data would look like ["value", value, 00, 0.0, "hello joe, from: world"]
    ch1 = RadioChannel(data=myData[0])
    ch1.present_data()
This way you can just call the class on each line of a data file and print it to see if it lines up. Once you get the hang of it, it starts to get fun.
I used rich's console here, but it works well with pandas and normal dataframes too!
Dataclasses help the interpreter (and the reader) follow the data with type hints and class structure.
Good luck and have fun!

Related

python: multiple functions or abstract classes when dealing with data flow requirement

This is more of a design question, and I am not sure how to handle it. I have a script preprocessing.py where I read a .csv file with a text column that I would like to preprocess by removing punctuation, certain characters, etc.
What I have done so far is write a class with several functions, as follows:
import pandas as pd

class Preprocessing(object):
    def __init__(self, file):
        self.my_data = pd.read_csv(file)

    def remove_punctuation(self):
        self.my_data['text'] = self.my_data['text'].str.replace('#', '')

    def remove_hyphen(self):
        self.my_data['text'] = self.my_data['text'].str.replace('-', '')

    def remove_words(self):
        self.my_data['text'] = self.my_data['text'].str.replace('reference', '')

    def save_data(self):
        self.my_data.to_csv('my_data.csv')

def preprocessing(file_my):
    f = Preprocessing(file_my)
    f.remove_punctuation()
    f.remove_hyphen()
    f.remove_words()
    f.save_data()
    return f

if __name__ == '__main__':
    preprocessing('/path/to/file.csv')
Although it works fine, I would like to be able to expand the code easily and have smaller classes instead of one large class, so I decided to use an abstract class:
import pandas as pd
from abc import ABC, abstractmethod

my_data = pd.read_csv('/Users/kgz/Desktop/german_web_scraping/file.csv')

class Preprocessing(ABC):
    @abstractmethod
    def processor(self):
        pass

class RemovePunctuation(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('#', '')

class RemoveHyphen(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('-', '')

class Removewords(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('reference', '')

final_result = [cls().processor() for cls in Preprocessing.__subclasses__()]
print(final_result)
So now each class is responsible for one task, but there are a few issues I do not know how to handle since I am new to abstract classes. First, I am reading the file outside the classes, and I am not sure if that is good practice. If not, should I pass it as an argument to the processor function, or have another class that is responsible for reading the data?
Second, having one class with several functions allowed for a flow, so every transformation happened in order (i.e., first punctuation is removed, then hyphens are removed, etc.), but I do not know how to handle this order and dependency with abstract classes.
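For the ordering issue, one common option is to keep the small step classes but register them in an explicit list and fold the text through it, instead of relying on __subclasses__() (whose order is an implementation detail). A minimal sketch, with hypothetical step names:

```python
from abc import ABC, abstractmethod

class Step(ABC):
    """One preprocessing step; subclasses transform and return the text."""
    @abstractmethod
    def process(self, text: str) -> str: ...

class RemovePunctuation(Step):
    def process(self, text):
        return text.replace("#", "")

class RemoveHyphen(Step):
    def process(self, text):
        return text.replace("-", "")

class RemoveWords(Step):
    def process(self, text):
        return text.replace("reference", "")

# the order of transformations now lives in exactly one place
PIPELINE = [RemovePunctuation(), RemoveHyphen(), RemoveWords()]

def preprocess(text: str) -> str:
    for step in PIPELINE:
        text = step.process(text)
    return text

print(preprocess("see #reference-list"))  # see list
```

The same idea extends to a dataframe column: each step would take and return a Series, and the file reading can live in whatever code builds the pipeline input, keeping the steps free of I/O.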

XML parsing with a Class call

I parsed an xml file with the xml.etree Python module. It works well, but now I am trying to call this code as a module/class from a main program. I would like to pass the xml tree and a filename for the csv output to the class.
My dummy script that calls the class:
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='phonebook.xml')
root = tree.getroot()
from xml2csv import Vcard_xml2csv
my_export = Vcard_xml2csv(tree, 'phonebook.csv')
my_export.write_csv()
here is the class:
class Vcard_xml2csv:
    """The XML phone book export to a csv file"""
    def __init__(self, tree, csvfilename):
        root = tree.getroot()
        self.csvfilename = csvfilename

    def write_content(contact, csvfilename):
        with open(csvfilename, mode='a+') as phonebook_file:
            contact_writer = csv.writer(phonebook_file, delimiter=',', quotechar=" ", quoting=csv.QUOTE_MINIMAL)
            contact_writer.writerow([contact])  # delete [] if you want only text separated by commas

    def write_csv(tree):
        for contacts in tree.iter(tag='phonebook'):
            for contact in contacts.findall("./contact"):
                row = []
                for category in contact.findall("./category"):
                    print('Category: ', category.text)
                    category = category.text
                    row.append(category)
                for person in contact.findall("./person/realName"):
                    print('realName: ', person.text)
                    realName = person.text
                    row.append(realName)
                for mod_time in contact.findall("./mod_time"):
                    print('mod_time: ', mod_time.text)
                    mod_time = mod_time.text
                    row.append(mod_time)
                for uniqueid in contact.findall("./uniqueid"):
                    print('uniqueid: ', uniqueid.text)
                    uniqueid_ = uniqueid.text
                    row.append(uniqueid_)
                numberlist = []
                for number in contact.findall("./telephony/number"):
                    print('id', number.attrib['id'], 'type:', number.attrib['type'], 'prio:', number.attrib['prio'], 'number: ', number.text)
                    id_ = number.attrib['id']
                    numberlist.append(id_)
                    type_ = number.attrib['type']
                    numberlist.append(type_)
                    prio_ = number.attrib['prio']
                    numberlist.append(prio_)
                    number_ = number.text
                    numberlist.append(number_)
                contact = row + numberlist
                write_content(contact, csvfilename)
                numberlist = []
I am getting the error below:
for contacts in tree.iter(tag='phonebook'):
AttributeError: 'Vcard_xml2csv' object has no attribute 'iter'
Thanks for your help!
When we define a method in a class, like write_csv() in the example, the first parameter is always the class instance. Think of it as a way to access class attributes and methods. Conventionally, for readability, the first parameter is called self.
In the write_csv method, the name tree is actually bound to the class instance, and that is the reason you see the error. The resolution is to define the method like the following:
def write_csv(self, tree)
....
....
and the call to the method would be:
my_export.write_csv(tree)
I hope this helps.
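Putting the fix together, a minimal corrected sketch of the class (simplified to two fields, and fed an in-memory phonebook instead of phonebook.xml so it runs standalone): every method takes self, and the tree is stored on the instance so write_csv() needs no argument at all.

```python
import csv
import xml.etree.ElementTree as ET

class Vcard_xml2csv:
    """Export an XML phone book to a csv file."""

    def __init__(self, tree, csvfilename):
        self.tree = tree               # keep the tree on the instance
        self.csvfilename = csvfilename

    def write_content(self, contact):
        # 'self' gives access to the filename stored in __init__
        with open(self.csvfilename, mode="a+", newline="") as phonebook_file:
            csv.writer(phonebook_file).writerow(contact)

    def write_csv(self):
        for contacts in self.tree.iter("phonebook"):
            for contact in contacts.findall("./contact"):
                row = [c.text for c in contact.findall("./category")]
                row += [p.text for p in contact.findall("./person/realName")]
                self.write_content(row)

# tiny in-memory phonebook standing in for phonebook.xml
xml_doc = ("<phonebook><contact><category>work</category>"
           "<person><realName>Joe</realName></person></contact></phonebook>")
tree = ET.ElementTree(ET.fromstring(xml_doc))
Vcard_xml2csv(tree, "phonebook.csv").write_csv()
```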

__post_init__ of python 3.x dataclasses is not called when loaded from yaml

Please note that I have already referred to the StackOverflow question here. I post this question to investigate whether calling __post_init__ manually is safe or not; please read the question to the end.
Check the code below. In step 3 we load dataclass A from a yaml string; note that this does not call the __post_init__ method.
import dataclasses
import yaml

@dataclasses.dataclass
class A:
    a: int = 55

    def __post_init__(self):
        print("__post_init__ got called", self)

print("\n>>>>>>>>>>>> 1: create dataclass object")
a = A(33)
print(a)  # print dataclass
print(dataclasses.fields(a))

print("\n>>>>>>>>>>>> 2: dump to yaml")
s = yaml.dump(a)
print(s)  # print yaml repr

print("\n>>>>>>>>>>>> 3: create class from str")
a_ = yaml.load(s)
print(a_)  # print dataclass loaded from yaml str
print(dataclasses.fields(a_))
The only solution I see for now is calling __post_init__ myself at the end, as in the snippet below.
a_.__post_init__()
I am not sure whether this is a safe recreation of a yaml-serialized dataclass. It will also pose a problem when __post_init__ takes kwargs, in the case where dataclass fields are of dataclasses.InitVar type.
This behavior is working as intended. You are dumping an existing object, so when you load it pyyaml intentionally avoids initializing the object again. The direct attributes of the dumped object will be saved even if they are created in __post_init__ because that function runs prior to being dumped. When you want the side effects that come from __post_init__, like the print statement in your example, you will need to ensure that initialization occurs.
There are a few ways to accomplish this. You can use either the metaclass approach or the constructor/representer approach described in pyyaml's documentation. You could also manually alter the dumped string in your example to use '!!python/object/new:' instead of '!!python/object:'. If your eventual goal is to have the yaml file generated in a different manner, then this might be a solution.
See below for an update to your code that uses the metaclass approach and calls __post_init__ when loading from the dumped class object. The call to cls(**fields) in from_yaml ensures that the object is initialized. yaml.load uses cls.__new__ to create objects tagged with ''!!python/object:' and then loads the saved attributes into the object manually.
import dataclasses
import yaml

@dataclasses.dataclass
class A(yaml.YAMLObject):
    a: int = 55

    def __post_init__(self):
        print("__post_init__ got called", self)

    yaml_tag = '!A'
    yaml_loader = yaml.SafeLoader

    @classmethod
    def from_yaml(cls, loader, node):
        fields = loader.construct_mapping(node, deep=True)
        return cls(**fields)

print("\n>>>>>>>>>>>> 1: create dataclass object")
a = A(33)
print(a)  # print dataclass
print(dataclasses.fields(a))

print("\n>>>>>>>>>>>> 2: dump to yaml")
s = yaml.dump(a)
print(s)  # print yaml repr

print("\n>>>>>>>>>>>> 3: create class from str")
a_ = yaml.load(s, Loader=A.yaml_loader)
print(a_)  # print dataclass loaded from yaml str
print(dataclasses.fields(a_))

What's get_products() missing 1 required positional argument: 'self'

I am programming for a friend of mine, for fun and practice, to make myself better at Python 3.6.3, and I don't really understand why I get this error.
TypeError: get_products() missing 1 required positional argument: 'self'
I have done some research; it says I should initialize the object, which I did, but it is still giving me this error. Can anyone tell me where I went wrong? Or is there a better way to do it?
from datetime import datetime, timedelta
from time import sleep
from gdax.public_client import PublicClient
# import pandas
import requests

class MyGdaxHistoricalData(object):
    """class for fetching candle data for a given currency pair"""
    def __init__(self):
        print([productList['id'] for productList in PublicClient.get_products()])
        # self.pair = input("""\nEnter your product name separated by a comma.
        self.pair = [i for i in input("Enter: ").split(",")]
        self.uri = 'https://api.gdax.com/products/{pair}/candles'.format(pair=self.pair)

    @staticmethod
    def dataToIso8681(data):
        """convert a datetime object to the ISO-8601 format
        Args:
            date(datetime): The date to be converted
        Return:
            string: The ISO-8601 formatted date
        """
        return 0

if __name__ == "__main__":
    import gdax
    MyData = MyGdaxHistoricalData()
    # MyData = MyGdaxHistoricalData(input("""\nEnter your product name separated by a comma.
    # print(MyData.pair)
You probably missed creating an object of PublicClient. Try PublicClient().get_products()
Edited:
Why do I need an object of PublicClient?
A simple rule of thumb in OOP: if you want to use a property (attribute) or behavior (method) of a class, you need an object of that class. Otherwise you need to make it static; in Python, use the @staticmethod decorator.
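The difference can be seen in a small sketch, with a made-up client class standing in for PublicClient:

```python
class FakeClient:
    """Stand-in for gdax's PublicClient."""

    def get_products(self):
        # an instance method: needs 'self', i.e. a created object
        return [{"id": "BTC-USD"}, {"id": "ETH-USD"}]

    @staticmethod
    def get_time():
        # a static method: no 'self', callable on the class itself
        return "time"

# calling the instance method on the class raises the TypeError from the question
try:
    FakeClient.get_products()
except TypeError as exc:
    print(exc)  # ... missing 1 required positional argument: 'self'

# create the object first, then call
print([p["id"] for p in FakeClient().get_products()])  # ['BTC-USD', 'ETH-USD']

# static methods work either way
print(FakeClient.get_time())
```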

csv file process in Python

I work with csv data as follows:
ticker,exchange_country,company_name,price,exchange_rate,shares_outstanding,net_income
1,HK,CK HUTCHISON HOLDINGS LTD,1.404816984,7.757949829,3859.677979,31633
2,HK,CLP HOLDINGS LTD,1.312602194,7.757949829,2526.450928,16319
3,HK,HONG KONG & CHINA GAS CO LTD,0.234939214,7.757949829,12717.04199,7546.200195
11,HK,HANG SENG BANK LTD,2.198193203,7.757949829,1911.843018,15451
I have a StockStatRecord class:
class StockStatRecord:
    def __init__(self, stock_load):
        self.name = stock_load[0]
        self.company_name = stock_load[2]
        self.exchange_country = stock_load[1]
        self.price = stock_load[3]
        self.exchange_rate = stock_load[4]
        self.shares_outstanding = stock_load[5]
        self.net_income = stock_load[6]
How am I supposed to create another class that extracts the data from that CSV, parses it, creates a new record, and returns the record created? This class also needs to validate the rows while reading: validation should fail for any row that is missing a piece of information, whose name (symbol or player name) is empty, or whose numbers (int or float) cannot be parsed (watch out for division by zero).
There are several ways of doing this: either rolling out the code yourself, or using a Python module made for verifying data schemas, like Colander, or the extended CSV reader in Pandas (as Zwinck posted in the comment above).
A separate class to check values is not usually needed; you can do that in the same class. More commonly, you would have a base class that implements the data-validation mechanisms, and then just attach extra information to each field on the actual data classes. And if you need to process data and return an object, there is no need for a class at all, because in Python you can have functions independent of classes; there is no need to hammer every piece of code into a class.
One simple approach is to (1) use Python's csv.DictReader instead of csv.reader to read the rows, so each piece of data is already bound to its column name as a dict instead of a list where you have to track column numbers manually; then (2) define a property for each column that needs validation, so fields are validated on assignment; and (3) write an __init__ method that simply assigns all fields to their respective attributes:
class StockStatRecord:
    def __init__(self, row):
        for key, value in row.items():
            setattr(self, key, value)

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        if not value:  # example verification for an empty name
            raise ValueError
        self._name = value

    # continue for other fields

import csv

reader = csv.DictReader(open("mydatafile.csv"))
all_records = []
for row in reader:
    try:
        all_records.append(StockStatRecord(row))
    except ValueError:
        print("Some error at record: {}".format(row))
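For the numeric validation the question asks about, the same property pattern works. A hypothetical sketch for one numeric column (the price field from the sample CSV), which both parses the float and guards against a later division by zero:

```python
class PricedRecord:
    """Minimal sketch: validate one numeric CSV field via a property."""

    def __init__(self, row):
        for key, value in row.items():
            setattr(self, key, value)

    @property
    def price(self):
        return self._price

    @price.setter
    def price(self, value):
        try:
            parsed = float(value)
        except (TypeError, ValueError):
            raise ValueError("price is not a number: {!r}".format(value))
        if parsed == 0:
            # reject zero now rather than dividing by it later
            raise ValueError("price must be non-zero")
        self._price = parsed

record = PricedRecord({"ticker": "1", "price": "1.404816984"})
print(record.price)  # 1.404816984

try:
    PricedRecord({"ticker": "2", "price": "n/a"})
except ValueError as exc:
    print(exc)  # price is not a number: 'n/a'
```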
