I want to process a bunch of text files using NLTK, splitting them on a particular keyword. I am therefore trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):
    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView
However, my knowledge of inheritance is rusty, and it seems my override is not taken into account. The output of
corpus = CustomCorpusReader("/path/to/files/", ".*")
print(corpus.words())
is identical to the output of
corpus = PlaintextCorpusReader("/path/to/files", ".*")
print(corpus.words())
I guess I'm missing something obvious, but what?
The documentation actually suggests two ways of defining a custom corpus view:
Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.
Subclass StreamBackedCorpusView, and override the read_block() method.
It also suggests the first way is easier, and indeed I managed to get it working as follows:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat

class CustomCorpusReader(PlaintextCorpusReader):
    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )

corpus = CustomCorpusReader("/path/to/files/", ".*")
print(corpus.custom())
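A likely explanation for why the subclass override in the question had no effect: as far as I can tell from the NLTK source, PlaintextCorpusReader.words() passes its own block reader to self.CorpusView, and the StreamBackedCorpusView constructor stores any supplied block_reader on the instance under the name read_block. If that reading is right, the instance attribute shadows the overridden method. The sketch below reproduces only that mechanism; BaseView, CustomView and default_reader are hypothetical stand-ins, not NLTK code.

```python
class BaseView:
    """Simplified stand-in for StreamBackedCorpusView (not the real class)."""

    def __init__(self, block_reader=None):
        if block_reader:
            # An instance attribute takes precedence over any method defined
            # on the class, including an override in a subclass.
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError


class CustomView(BaseView):
    def read_block(self, stream):
        return ["custom"]


def default_reader(stream):
    # Plays the role of the reader's own word-block function.
    return ["default"]


shadowed = CustomView(default_reader)  # a block_reader is passed in: override ignored
respected = CustomView()               # no block_reader: the override is used
print(shadowed.read_block(None), respected.read_block(None))
```

Under that assumption, a subclass override only takes effect if the reader's methods construct the view without passing a block reader, which is why supplying the block reader directly (the first way) works.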
I have more of a design question, and I am not sure how to handle it. I have a script, preprocessing.py, in which I read a .csv file with a text column that I would like to preprocess by removing punctuation, certain characters, etc.
So far I have written a class with several methods, as follows:
import pandas as pd

class Preprocessing(object):
    def __init__(self, file):
        self.my_data = pd.read_csv(file)

    def remove_punctuation(self):
        self.my_data['text'] = self.my_data['text'].str.replace('#','')

    def remove_hyphen(self):
        self.my_data['text'] = self.my_data['text'].str.replace('-','')

    def remove_words(self):
        self.my_data['text'] = self.my_data['text'].str.replace('reference','')

    def save_data(self):
        self.my_data.to_csv('my_data.csv')

def preprocessing(file_my):
    f = Preprocessing(file_my)
    f.remove_punctuation()
    f.remove_hyphen()
    f.remove_words()
    f.save_data()
    return f

if __name__ == '__main__':
    preprocessing('/path/to/file.csv')
Although it works fine, I would like to be able to expand the code easily and have smaller classes instead of one large class. So I decided to use an abstract class:
import pandas as pd
from abc import ABC, abstractmethod

my_data = pd.read_csv('/Users/kgz/Desktop/german_web_scraping/file.csv')

class Preprocessing(ABC):
    @abstractmethod
    def processor(self):
        pass

class RemovePunctuation(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('#', '')

class RemoveHyphen(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('-', '')

class Removewords(Preprocessing):
    def processor(self):
        return my_data['text'].str.replace('reference', '')

final_result = [cls().processor() for cls in Preprocessing.__subclasses__()]
print(final_result)
So now each class is responsible for one task, but there are a few issues I do not know how to handle, since I am new to abstract classes. First, I am reading the file outside the classes, and I am not sure whether that is good practice. If not, should I pass the data as an argument to the processor function, or have another class that is responsible for reading the data?
Second, having one class with several functions allowed for a flow, so every transformation happened in order (i.e., first punctuation is removed, then hyphens are removed, etc.), but I do not know how to handle this ordering and dependency with abstract classes.
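One possible way to address both points, offered as a minimal sketch rather than a definitive design: pass the data to each step as an argument, and let a small pipeline object apply the steps in an explicit order. The names Step, RemoveSubstring and Pipeline are hypothetical, and the sketch works on plain strings instead of a pandas column to stay self-contained.

```python
from abc import ABC, abstractmethod


class Step(ABC):
    """One preprocessing step: receives text and returns the transformed text."""

    @abstractmethod
    def process(self, text: str) -> str:
        ...


class RemoveSubstring(Step):
    """Hypothetical generic step: delete every occurrence of one substring."""

    def __init__(self, target: str):
        self.target = target

    def process(self, text: str) -> str:
        return text.replace(self.target, "")


class Pipeline:
    """Applies the steps in the order given, feeding each result to the next."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, texts):
        results = []
        for text in texts:
            for step in self.steps:
                text = step.process(text)
            results.append(text)
        return results


pipeline = Pipeline([
    RemoveSubstring("#"),
    RemoveSubstring("-"),
    RemoveSubstring("reference"),
])
cleaned = pipeline.run(["#cross-reference list", "no change"])
print(cleaned)
```

Because the order lives in the list handed to Pipeline, it no longer depends on the order in which subclasses happen to be defined. With pandas, the same structure should carry over by having each step take and return a Series (for example via Series.str.replace), with the file read once by whoever builds the pipeline.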
I have a BaseClass and two classes (Volume and testing) which inherit from BaseClass. The class Volume uses a function driving_style from another Python module. I am trying to write another function, test_score, which needs to access variables computed inside driving_style so that I can use them in further computations. The results are then used by the class testing, as shown.
from training import Accuracy
import ComputeData
import model

class BaseClass(object):
    def __init__(self, connections):
        self.Type = 'Stock'
        self.A = connections.A
        self.log = self.B.log

    def getIDs(self, assets):
        ids = pandas.Series(assets.ids, index=assets.B)
        return ids

class Volume(BaseClass):
    def __init__(self, connections):
        BaseClass.__init__(self, connections)
        self.daystrade = 30
        self.high_low = True

    def learning(self, data, rootClass):
        params.daystrade = self.daystrade
        params.high_low = self.high_low
        style = Accuracy.driving_style()
        return self.Object(data.universe, style)

class testing(BaseClass):
    def __init__(self, connections):
        BaseClass.__init__(self, connections)

    def learning(self, data, rootClass):
        test_score = Accuracy.test_score()
        return self.Object(data.universe, test_score)
def driving_style(date, modelDays, params):
    daystrade = params.daystrade
    high_low = params.high_low
    DriveDays = model.DateRange(date, params.daystrade)
    StopBy = ComputeData.instability(DriveDays)
    if high_low:
        style = ma.average(StopBy)
    else:
        style = ma.mean(StopBy)
    return style

def test_score(date, modelDays, params):
    # I want to access the following from the method driving_style:
    DriveDays = ...
    StopBy = ...
    # test_score is computed from DriveDays and StopBy, and is then used in the
    # learning method of the class "testing", which inherits some params from BaseClass.
    return test_score
You can't use locals from a call to a function that was made elsewhere and has already returned.
A bad solution is to store them as globals that you can read from later (but that get replaced on every new call). A better solution might be to return the relevant info to the caller along with the existing return values (return style, DriveDays, StopBy) and somehow get it to where it needs to go. If necessary, you could wrap the function in a class and store the computed values as attributes on an instance of the class, while keeping the return type the same.
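The class-wrapping variant could look roughly like the sketch below. DrivingStyle is a hypothetical name, and the two computations inside __call__ are made-up stand-ins for model.DateRange and ComputeData.instability, whose real behaviour is not shown in the question.

```python
from statistics import mean


class DrivingStyle:
    """Callable wrapper that remembers the intermediate values of its last call."""

    def __init__(self):
        self.drive_days = None
        self.stop_by = None

    def __call__(self, start, n_days):
        # Made-up stand-ins for model.DateRange and ComputeData.instability.
        self.drive_days = list(range(start, start + n_days))
        self.stop_by = [day * 0.5 for day in self.drive_days]
        return mean(self.stop_by)  # same return type as before: a single number


driving_style = DrivingStyle()
style = driving_style(10, 4)
print(style, driving_style.drive_days, driving_style.stop_by)
```

Callers that only want the style value are unaffected, while code that needs the intermediates can read them from the instance. The attributes are still overwritten on every call, so this shares some of the drawbacks of the globals approach.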
But the best solution is probably to refactor, so the stuff you want is computed by dedicated methods that you can call directly from test_score and driving_style independently, without duplicating code or creating complicated state dependencies.
In short, basically any time you think you need to access locals from another function, you're almost certainly experiencing an XY problem.
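As a rough illustration of the refactoring idea, the intermediate values can be produced by small dedicated functions that both callers use independently. Everything here is an assumption made for the sake of a runnable example: compute_drive_days and compute_stop_by stand in for model.DateRange and ComputeData.instability, compute_test_score corresponds to the question's test_score, and its formula is a placeholder.

```python
from statistics import mean


def compute_drive_days(start, n_days):
    # Made-up stand-in for model.DateRange(date, params.daystrade).
    return list(range(start, start + n_days))


def compute_stop_by(drive_days):
    # Made-up stand-in for ComputeData.instability(DriveDays).
    return [day * 0.5 for day in drive_days]


def driving_style(start, n_days):
    stop_by = compute_stop_by(compute_drive_days(start, n_days))
    return mean(stop_by)


def compute_test_score(start, n_days):
    drive_days = compute_drive_days(start, n_days)
    stop_by = compute_stop_by(drive_days)
    # Placeholder formula; the real score is whatever the question intends.
    return max(stop_by) - min(stop_by) + len(drive_days)


print(driving_style(10, 4), compute_test_score(10, 4))
```

Neither function depends on the other having been called first, so there is no hidden state to keep in sync.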
Background: I am working on an NLP problem where I need to extract different types of features from different types of text documents. I currently have a setup with a FeatureExtractor base class, which is subclassed once per document type; each subclass calculates a different set of features and returns a pandas DataFrame as output.
All these subclasses are in turn called by a wrapper class, FeatureExtractionRunner, which runs every subclass, calculates the features for all docs, and returns the output for all document types.
Problem: this pattern leads to lots of subclasses. I currently have 14 subclasses, since I have 14 types of docs, and the number might grow further. That is too many classes to maintain. Is there an alternative way of doing this with less subclassing?
Here is some representative sample code for what I explained:
from abc import ABCMeta, abstractmethod

class FeatureExtractor(metaclass=ABCMeta):
    # base feature extractor class
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        return NotImplemented

class ExtractorTypeA(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_shape_features(self):
        return None

    def _calculate_size_features(self):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods like
        # (the helpers read self.document, so no argument is passed)
        f1 = self._calculate_shape_features()
        f2 = self._calculate_size_features()
        # do some calculations on the document and return a pandas dataframe by merging them (merge f1, f2....etc)
        data = "dataframe-1"
        return data

class ExtractorTypeB(FeatureExtractor):
    # do some feature calculations.....
    def _calculate_some_fancy_features(self):
        return None

    def _calculate_some_more_fancy_features(self):
        return None

    def doc_to_features(self):
        # calls all the fancy feature calculation methods
        f1 = self._calculate_some_fancy_features()
        f2 = self._calculate_some_more_fancy_features()
        # do some calculations on the document and return a pandas dataframe (merge f1, f2 etc)
        data = "dataframe-2"
        return data

class ExtractorTypeC(FeatureExtractor):
    # do some feature calculations.....
    def doc_to_features(self):
        # do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data

class FeatureExtractionRunner:
    # a class to call all types of feature extractors
    def __init__(self, document, *args, **kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        # more of these extractors would be there

    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        # more such extractors would be there....
        return [type_a_features, type_b_features, type_c_features]

all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()
Answering the question first: you may avoid subclassing entirely at the cost of writing the __init__ method each time. Or you may get rid of the classes entirely and convert them to a bunch of functions. Or you may even merge all the classes into a single one. Note that none of these approaches will make the code simpler or more maintainable; they would just change its shape to some extent.
IMHO this situation is a perfect example of inherent problem complexity, by which I mean that the domain (NLP) and the particular use case (doc feature extraction) are complex in and of themselves.
For example, featureX and featureY are likely to be totally different things that cannot be calculated together, so you end up with one method each. Similarly, the procedure for merging these features into a dataframe might differ from the one for merging the fancy features. Having lots of functions/classes in this situation seems totally reasonable to me, and keeping them separate is both logical and good for maintainability.
That said, real code reduction might be possible if you can combine some feature calculation methods into a more generic function, though I can't say for sure whether that is possible.
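To make the "bunch of functions" option concrete, here is a minimal sketch in which each document type is described by a list of feature functions instead of a subclass. All function names, the EXTRACTORS mapping and the features themselves are invented for illustration, and plain dicts stand in for the pandas dataframes.

```python
def shape_features(document):
    # Hypothetical feature: number of lines in the document.
    return {"n_lines": document.count("\n") + 1}


def size_features(document):
    # Hypothetical feature: number of characters.
    return {"n_chars": len(document)}


def word_features(document):
    # Hypothetical feature: number of whitespace-separated tokens.
    return {"n_words": len(document.split())}


# Document type -> feature functions to apply (names made up for illustration).
EXTRACTORS = {
    "type_a": [shape_features, size_features],
    "type_b": [word_features],
    "type_c": [size_features, word_features],
}


def doc_to_features(document, doc_type):
    """Merge the output of every feature function registered for doc_type."""
    features = {}
    for func in EXTRACTORS[doc_type]:
        features.update(func(document))
    return features


def extract_all(document):
    return {doc_type: doc_to_features(document, doc_type) for doc_type in EXTRACTORS}


all_features = extract_all("some document")
print(all_features)
```

This mainly pays off when document types share feature functions, as type_a and type_c do here. If every type needs its own unique calculations, the number of functions stays about the same as the number of methods before, which is the point made above about inherent complexity.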
I have a method I would like to unit test. The method expects a file path, which is then opened - using a context manager - to parse a value which is then returned, should it be present, simple enough.
@staticmethod
def read_in_target_language(file_path):
    """
    .. note:: Language code attributes/values can occur
              on either the first or the second line of bilingual.
    """
    with codecs.open(file_path, 'r', encoding='utf-8') as source:
        line_1, line_2 = next(source), next(source)
        get_line_1 = re.search(
            '(target-language=")(.+?)(")', line_1, re.IGNORECASE)
        get_line_2 = re.search(
            '(target-language=")(.+?)(")', line_2, re.IGNORECASE)
        if get_line_1 is not None:
            return get_line_1.group(2)
        else:
            return get_line_2.group(2)
I want to avoid testing against external files - for obvious reasons - and do not wish to create temp files. In addition, I cannot use StringIO in this case.
How can I mock the file_path object in my unit test case? Ultimately I would need to create a mock path that contains differing values. Any help is gratefully received.
(Disclaimer: I don't speak Python, so I'm likely to err in details)
I suggest that you instead mock codecs. Make the mock's open method return an object with test data to be returned from the read calls. That might involve creating another mock object for the return value; I don't know if there are some stock classes in Python that you could use for that purpose instead.
Then, in order to actually enable testing the logic, add a parameter to read_in_target_language that represents an object that can assume the role of the original codecs object, i.e. dependency injection by argument. For convenience I guess you could default it to codecs.
I'm not sure how far Python's duck typing goes with regards to static vs instance methods, but something like this should give you the general idea:
def read_in_target_language(file_path, opener=codecs):
    ...
    with opener.open(file_path, 'r', encoding='utf-8') as source:
If the above isn't possible you could just add a layer of indirection:
class CodecsOpener:
    ...
    def open(self, file_path, access, encoding):
        return codecs.open(file_path, access, encoding)

class MockOpener:
    ...
    def __init__(self, open_result):
        self.open_result = open_result

    def open(self, file_path, access, encoding):
        return self.open_result
...
def read_in_target_language(file_path, opener=CodecsOpener()):
    ...
    with opener.open(file_path, 'r', encoding='utf-8') as source:
        ...
...
def test():
    readable_data = ...
    opener = MockOpener(readable_data)
    result = <class>.read_in_target_language('whatever', opener)
    <check result>
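For reference, the dependency-injection idea above can be written as runnable Python along these lines. This is only a sketch under some assumptions: the function is shown at module level rather than as a static method, FakeOpener is a hypothetical test double, the sample lines are invented, and (unlike the original) the function returns None when neither line matches. It avoids both temp files and StringIO by wrapping a plain iterator of lines in contextlib.nullcontext.

```python
import codecs
import re
from contextlib import nullcontext

PATTERN = re.compile(r'target-language="(.+?)"', re.IGNORECASE)


def read_in_target_language(file_path, opener=codecs):
    """Return the target-language value found on line 1 or line 2, else None."""
    with opener.open(file_path, 'r', encoding='utf-8') as source:
        line_1, line_2 = next(source), next(source)
    match = PATTERN.search(line_1) or PATTERN.search(line_2)
    return match.group(1) if match else None


class FakeOpener:
    """Hypothetical test double: open() ignores the path and yields canned lines."""

    def __init__(self, lines):
        self.lines = lines

    def open(self, file_path, mode='r', encoding=None):
        # nullcontext makes the line iterator usable in a with-statement.
        return nullcontext(iter(self.lines))


first = FakeOpener(['<xliff target-language="de-DE">\n', '<body>\n'])
second = FakeOpener(['<?xml version="1.0"?>\n', '<file Target-Language="fr-FR">\n'])
on_line_1 = read_in_target_language('ignored', first)
on_line_2 = read_in_target_language('ignored', second)
print(on_line_1, on_line_2)
```

Production code keeps calling read_in_target_language(path) and gets the real codecs module by default; only the tests pass an opener.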
I'm still learning and like to build things that I will eventually be doing on a regular basis in the future, to give me a better understanding on how x does this or y does that.
I haven't learned much about how classes work entirely yet, but I set up a call that will go through multiple classes.
getattr(monster, monster_class.str().lower())(1)
Which calls this:
class monster:
    def vampire(x):
        monster_loot = {'Gold':75, 'Sword':50.3, 'Good Sword':40.5, 'Blood':100.0, 'Ore':.05}
        if x == 1:
            loot_table.all_loot(monster_loot)
Which in turn calls this...
class loot_table:
    def all_loot(monster_loot):
        loot = ['Gold', 'Sword', 'Good Sword', 'Ore']
        loot_dropped = {}
        for i in monster_loot:
            if i in loot:
                loot_dropped[i] = monster_loot[i]
        drop_chance.chance(loot_dropped)
And then, finally, gets to the last class.
class drop_chance:
    def chance(loot_list):
        loot_gained = []
        for i in loot_list:
            x = random.uniform(0.0,100.0)
            if loot_list[i] >= x:
                loot_gained.append(i)
        return loot_gained
And it all works, except it's not returning loot_gained. I'm assuming it's just being returned to the loot_table class and I have no idea how to bypass it all the way back down to the first line posted. Could I get some insight?
Keep using return.
def foo():
    return bar()

def bar():
    return baz()

def baz():
    return 42

print(foo())
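Applied to the code in the question, this means every function in the chain has to return what the next one gives it. The sketch below is one way to write it; the classes are renamed to CamelCase and the methods marked as static, which are my choices and not part of the original.

```python
import random


class DropChance:
    @staticmethod
    def chance(loot_list):
        loot_gained = []
        for name, percent in loot_list.items():
            if percent >= random.uniform(0.0, 100.0):
                loot_gained.append(name)
        return loot_gained


class LootTable:
    @staticmethod
    def all_loot(monster_loot):
        loot = ['Gold', 'Sword', 'Good Sword', 'Ore']
        loot_dropped = {k: v for k, v in monster_loot.items() if k in loot}
        return DropChance.chance(loot_dropped)  # pass the result up


class Monster:
    @staticmethod
    def vampire(x):
        monster_loot = {'Gold': 75, 'Sword': 50.3, 'Good Sword': 40.5,
                        'Blood': 100.0, 'Ore': .05}
        if x == 1:
            return LootTable.all_loot(monster_loot)  # and up again
        return []


loot_gained = getattr(Monster, 'vampire')(1)
print(loot_gained)
```

The list built in DropChance.chance now travels back through LootTable.all_loot and Monster.vampire to the original call site.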
I haven't learned much about how classes work entirely yet...
Rather informally, a class definition is a description of the objects of that class (a.k.a. instances of the class) that are to be created in the future. The class definition contains the code (the definitions of the methods). The object (the class instance) basically contains the data. A method is a kind of function that can take arguments and that is capable of manipulating the object's data.
This way, classes represent the behaviour of real-world objects, and class instances simulate the existence of those objects. Methods represent actions that the objects apply to themselves.
From that point of view, a class identifier should be a noun that describes the category of objects of the class. A class instance identifier should also be a noun that names the object. A method identifier is usually a verb that describes the action.
In your case, the class drop_chance is suspicious, if only because of its name.
If you want to print something reasonable about the object, say using print(monster), then define the __str__() method of the class (see the doc).
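A small sketch of those naming conventions together with __str__; the attributes and the take_damage method are invented for the example.

```python
class Monster:
    """A noun for the class; each instance is one concrete monster."""

    def __init__(self, name, health):
        self.name = name
        self.health = health

    def take_damage(self, amount):
        # A verb for the method: an action the object applies to itself.
        self.health = max(0, self.health - amount)

    def __str__(self):
        # Used by print() and str() to describe the object.
        return f"{self.name} ({self.health} HP)"


vampire = Monster("Vampire", 30)
vampire.take_damage(12)
print(vampire)  # prints "Vampire (18 HP)"
```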