How to pre-process data before pandas.read_csv() - python-3.x

I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. do some search/replace on it.
I tried to open the file and do the pre-processing in a generator, which I then hand over to read_csv():
def in_stream():
    with open("some.csv") as csvfile:
        for line in csvfile:
            l = re.sub(r'","', r',', line)
            yield l
df = pd.read_csv(in_stream())
Sadly, this just throws a
ValueError: Invalid file path or buffer object type: <class 'generator'>
However, looking at pandas' source, I'd expect it to be able to work on iterators, and thus on generators.
I only found this article (Using a custom object in pandas.read_csv()), which outlines how to wrap a generator in a file-like object, but it seems to work only on files in byte mode.
So in the end I'm looking for a pattern to build a pipeline that opens a file, reads it line-by-line, allows pre-processing and then feeds it into e.g. pandas.read_csv().

After further investigation of pandas' source, it became apparent that it doesn't simply require an iterable; it also wants a file-like object, expressed by having a read method (is_file_like() in inference.py).
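Roughly speaking, that check is plain duck typing. A simplified approximation (my sketch, not pandas' exact code) looks like this:

    def looks_file_like(obj):
        # simplified approximation of pandas' is_file_like():
        # the object must be iterable and expose a read or write method
        if not (hasattr(obj, "read") or hasattr(obj, "write")):
            return False
        return hasattr(obj, "__iter__")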
So, I built a generator the old way
class InFile(object):
    def __init__(self, infile):
        self.infile = open(infile)

    def __next__(self):
        return self.next()

    def __iter__(self):
        return self

    def read(self, *args, **kwargs):
        return self.__next__()

    def next(self):
        try:
            line: str = self.infile.readline()
            line = re.sub(r'","', r',', line)  # do some fixing
            return line
        except:
            self.infile.close()
            raise StopIteration
This works in pandas.read_csv():
df = pd.read_csv(InFile("some.csv"))
To me this looks overly complicated, and I wonder whether there is a better (more elegant) solution.

Here's a solution that will work for smaller CSV files. All lines are first read into memory, processed, and concatenated. This will probably perform badly for larger files.
import re
from io import StringIO
import pandas as pd

with open('file.csv') as file:
    lines = [re.sub(r'","', r',', line) for line in file]

df = pd.read_csv(StringIO('\n'.join(lines)))
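For larger files, one alternative is to stream through a thin file-like wrapper whose read() method pulls from a line generator. This is only an illustrative sketch (not from the answers above), reusing the same '","' substitution:

    import re
    import pandas as pd

    class PreprocessedFile:
        # minimal file-like wrapper around a generator of pre-processed lines (illustrative sketch)
        def __init__(self, path):
            self._gen = self._lines(path)
            self._buffer = ""

        def _lines(self, path):
            with open(path) as f:
                for line in f:
                    yield re.sub(r'","', r',', line)  # same search/replace as above

        def read(self, size=-1):
            # pandas calls read(); serve data from the generator through a small buffer
            while size < 0 or len(self._buffer) < size:
                try:
                    self._buffer += next(self._gen)
                except StopIteration:
                    break
            if size < 0:
                data, self._buffer = self._buffer, ""
            else:
                data, self._buffer = self._buffer[:size], self._buffer[size:]
            return data

        def __iter__(self):
            # also satisfies pandas' file-like check and allows plain line iteration
            return self._gen

    df = pd.read_csv(PreprocessedFile("some.csv"))

Unlike the StringIO approach above, the buffer keeps only a chunk of the file in memory at a time.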

Related

How do I suppress the output from log_step while running a Jupyter notebook?

On this page, there is a log_step function that is used to record what each step in a pandas pipeline is doing. The exact function is:
from functools import wraps
import datetime as dt

def log_step(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        time_taken = str(dt.datetime.now() - tic)
        print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}s")
        return result
    return wrapper
and it is used in the following fashion:
import pandas as pd

df = pd.read_csv('https://calmcode.io/datasets/bigmac.csv')

@log_step
def start_pipeline(dataf):
    return dataf.copy()

@log_step
def set_dtypes(dataf):
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['currency_code', 'date']))
My question is: how do I keep the @log_step decorator in front of my functions and remain able to use it at will, while by default its output is not printed when I run my Jupyter notebook? I suspect the answer comes down to something more general about using decorators, but I don't really know what to look for. Thanks!
You can indeed remove the print statement or, if you don't want to alter the decorating function, you can also redirect sys.stdout to avoid seeing the prints, as explained here.
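For illustration, here is a sketch (not part of the original answer) of a log_step variant gated by a module-level flag, so the decorator can stay in front of the functions while printing is off by default:

    from functools import wraps
    import datetime as dt

    VERBOSE = False  # hypothetical flag; flip to True when you want to see the log output

    def log_step(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            tic = dt.datetime.now()
            result = func(*args, **kwargs)
            if VERBOSE:
                time_taken = str(dt.datetime.now() - tic)
                print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}s")
            return result
        return wrapper

Alternatively, wrapping the pipeline call in contextlib.redirect_stdout(io.StringIO()) silences the prints without touching the decorator at all.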

How to read multiple text files as strings from two folders at the same time using readline() in python?

I currently have a version of the following script that uses two simple readline() snippets to read single-line .txt files from two different folders. It runs under Ubuntu 18.04 and Python 3.6.7, not using glob.
I am now encountering a NameError when trying to read multiple text files from the same folders using sorted(glob.glob(...)).
readlines() causes an error because the input from the .txt files must be strings, not lists.
I am new to Python. I have tried online Python formatters, reindent.py, etc., but with no success.
I am hoping it's a simple indentation issue, so that it won't be a problem in future scripts.
Current error from code below:
Traceback (most recent call last):
  File "v1-ReadFiles.py", line 21, in <module>
    context_input = GenerationInput(P1=P1, P3=P3,
NameError: name 'P1' is not defined
Current modified script:
import glob
import os

from src.model_use import TextGeneration
from src.utils import DEFAULT_DECODING_STRATEGY, LARGE
from src.flexible_models.flexible_GPT2 import FlexibleGPT2
from src.torch_loader import GenerationInput
from transformers import GPT2LMHeadModel, GPT2Tokenizer

for name in sorted(glob.glob('P1_files/*.txt')):
    with open(name) as f:
        P1 = f.readline()

for name in sorted(glob.glob('P3_files/*.txt')):
    with open(name) as f:
        P3 = f.readline()

if __name__ == "__main__":
    context_input = GenerationInput(P1=P1, P3=P3,
                                    genre=["mystery"],
                                    persons=["Steve"],
                                    size=LARGE,
                                    summary="detective")
    print("PREDICTION WITH CONTEXT WITH SPECIAL TOKENS")
    model = GPT2LMHeadModel.from_pretrained('models/custom')
    tokenizer = GPT2Tokenizer.from_pretrained('models/custom')
    tokenizer.add_special_tokens(
        {'eos_token': '[EOS]',
         'pad_token': '[PAD]',
         'additional_special_tokens': ['[P1]', '[P2]', '[P3]', '[S]', '[M]', '[L]', '[T]', '[Sum]', '[Ent]']}
    )
    model.resize_token_embeddings(len(tokenizer))
    GPT2_model = FlexibleGPT2(model, tokenizer, DEFAULT_DECODING_STRATEGY)
    text_generator_with_context = TextGeneration(GPT2_model, use_context=True)
    predictions = text_generator_with_context(context_input, nb_samples=1)
    for i, prediction in enumerate(predictions):
        print('prediction n°', i, ': ', prediction)
Thanks to afghanimah here:
Problem with range() function when used with readline() or counter - reads and processes only last line in files
I dropped glob and also moved all the model/tokenizer loading before the 'with open ...' block:
with open("data/test-P1-Multi.txt","r") as f1, open("data/test-P3-Multi.txt","r") as f3:
for i in range(5):
P1 = f1.readline()
P3 = f3.readline()
context_input = GenerationInput(P1=P1, P3=P3, size=LARGE)
etc.
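A slightly more general sketch (my illustration, not part of the answer above) pairs the two files line by line with zip, so it does not depend on a hard-coded range(5):

    with open("data/test-P1-Multi.txt") as f1, open("data/test-P3-Multi.txt") as f3:
        for P1, P3 in zip(f1, f3):
            # strip the trailing newline that line iteration (like readline()) keeps
            P1, P3 = P1.rstrip("\n"), P3.rstrip("\n")
            context_input = GenerationInput(P1=P1, P3=P3, size=LARGE)
            # ... run the model on context_input as in the full script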

Function not working, Syntax errors and more

The other day I was working on a project for an image-captioning model in Keras. But when I run it, I face a host of errors. Note that I am using the Atom editor and a Python virtual environment, running everything from the command line.
train_features = load_photo_features(os.path('C:/Users/neelg/Documents/Atom_projects/Main/features.pkl'), train)
On this line, I receive this error:
  File "C:\Users\neelg\Documents\Atom_projects\Main\Img_cap.py", line 143
    train_features = load_photo_features(os.path('C:/Users/neelg/Documents/Atom_projects/Main/features.pkl'), train)
    ^
SyntaxError: invalid syntax
I think the syntax of the function call is correct, yet the error persists. So, in a separate file, I copied the function and tried to isolate the problem.
Code for the standalone function:
from pickle import load
import os

def load_photo_features(filename, dataset):
    all_features = load(open(filename, 'rb'))
    features = {k: all_features[k] for k in dataset}
    return features

filename = 'C:/Users/neelg/Documents/Atom_projects/Main/Flickr8k_text/Flickr8k.trainImages.txt'
train_features = load_photo_features(os.path('C:/Users/neelg/Documents/Atom_projects/Main/features.pkl'), train)
Now a different type of problem crops up:
Traceback (most recent call last):
  File "C:\Users\neelg\Documents\Atom_projects\Main\testing.py", line 10, in <module>
    train_features = load_photo_features(os.path('C:/Users/neelg/Documents/Atom_projects/Main/features.pkl'), train)
TypeError: 'module' object is not callable
Any help? I am trying to import the Flickr_8k dataset, which contains random pictures, along with another small dataset holding the labels of those photographs...
P.S.: Please test the code in your own editor before submitting suggestions, because I suspect there is some core problem arising from the system encoding (as suggested by some others). Also, it is not possible to post the whole code due to its length and the multiple files it requires.
This error comes from the fact that you're calling os.path, which is a module, not a function. Just remove it; you don't need it in this use case, since a plain string is enough for the filename passed to open().
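For illustration, the call would then simply pass the path string (assuming the train set of photo identifiers is already defined elsewhere in the script):

    # os.path is a module and cannot be called; pass the path string directly
    train_features = load_photo_features('C:/Users/neelg/Documents/Atom_projects/Main/features.pkl', train)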
I was about to ask you the same question as @ted: why do you use os.path when you are trying to load the file?
Normally I use the following code for loading from a pickle file:
import pickle

def load_obj(filename):
    with open(filename, "rb") as fp:
        return pickle.load(fp, encoding='bytes')
Furthermore, if I try something like this, it works:
from pickle import load
import os
import pdb

def load_photo_features(filename):
    all_features = load(open(filename, 'rb'))
    pdb.set_trace()
    # features = {k: all_features[k] for k in dataset}
    # return features

train_features = load_photo_features('train.pkl')
I do not know what the dataset input should be in order to proceed, but loading the pickle file works fine.

Multiprocessing error for NLP application

I'm working on an NLP project. I have a massive dataset of 180 million words. Before I begin training, I want to correct the spelling of the words, for which I use TextBlob's spell correction. Because TextBlob takes a while to process anyway, correcting the spelling of 180 million words would take an insanely long time. So here is my approach (code follows below):
1. Load the corpus
2. Split the corpus into a list of sentences using the nltk tokenizer
3. Multiprocessing: apply the function to every item of the list generated in step 2
Here is my code:
import codecs
import multiprocessing
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize

class SpellCorrect():
    def __init__(self):
        pass

    def load_data(self, path):
        with codecs.open(path, "r", "utf-8") as file:
            data = file.read()
        return sent_tokenize(data)

    def correct_spelling(self, data):
        data = TextBlob(data)
        return str(data.correct())

    def merge_cleaned_corpus(self, result, path):
        result = " ".join(temp for temp in result)
        with codecs.open(path, "a", "utf-8") as file:
            file.write(result)

if __name__ == "__main__":
    SpellCorrect = SpellCorrect()
    data = SpellCorrect.load_data(path)
    correct_spelling = SpellCorrect.correct_spelling
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    result = pool.apply_async(correct_spelling, (data, ))
    result = result.get()
    SpellCorrect.merge_cleaned_corpus(tuple(result), path)
When I run this, I get the following error:
_pickle.PicklingError: Can't pickle <class '__main__.SpellCorrect'>: it's not the same object as __main__.SpellCorrect
This error is generated at the line in my code that says result = result.get()
My probably wrong guess is that the parallel-processing component completed successfully and applied my clean-up to every sentence, but that I'm unable to retrieve those results.
Can someone tell me why this error is being generated and what I can do to fix it? Thanks in advance!
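For illustration only (an assumption, not a verified fix): the line SpellCorrect = SpellCorrect() rebinds the class name to an instance, which is the usual cause of this "not the same object" pickling error, and apply_async also hands the entire list to a single worker. A minimal sketch that avoids both issues uses a module-level function with pool.map:

    import multiprocessing
    from textblob import TextBlob

    def correct_spelling(sentence):
        # module-level function: picklable by the multiprocessing workers
        return str(TextBlob(sentence).correct())

    if __name__ == "__main__":
        sentences = ["some sentnce", "anothr one"]  # hypothetical data; use load_data() in practice
        with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
            # map distributes one sentence per task instead of sending the whole list to one worker
            corrected = pool.map(correct_spelling, sentences)
        print(" ".join(corrected))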

Pytorch DataLoader multiple data source

I am trying to use the PyTorch DataLoader with my own dataset, but I am not sure how to load multiple data sources:
My current code:
class MultipleSourceDataSet(Dataset):
    def __init__(self, json_file, root_dir, transform=None):
        with open(root_dir + 'block0.json') as f:
            self.result = torch.Tensor(json.load(f))
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.result[0])

    def __getitem__(self):
        None
The data source is 50 blocks under root_dir = ~/Documents/blocks/
I split them beforehand and avoid combining them directly, since this is a very big dataset.
How can I load them into a single dataloader?
For DataLoader you need a single Dataset; your problem is that you have multiple 'json' files and you only know how to create a Dataset from each 'json' separately.
What you can do in this case is to use ConcatDataset that contains all the single-'json' datasets you create:
import os
import torch.utils.data as data

class SingeJsonDataset(data.Dataset):
    # implement a single json dataset here...

list_of_datasets = []
for j in os.listdir(root_dir):
    if not j.endswith('.json'):
        continue  # skip non-json files
    list_of_datasets.append(SingeJsonDataset(json_file=j, root_dir=root_dir, transform=None))

# once all single json datasets are created you can concat them into a single one:
multiple_json_dataset = data.ConcatDataset(list_of_datasets)
Now you can feed the concatenated dataset into data.DataLoader.
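For completeness, here is a minimal sketch of what the single-json dataset could look like (my assumption: each block file holds one json array of numeric samples):

    import json
    import os
    import torch
    import torch.utils.data as data

    class SingeJsonDataset(data.Dataset):
        # sketch of a dataset backed by one json block file
        def __init__(self, json_file, root_dir, transform=None):
            with open(os.path.join(root_dir, json_file)) as f:
                self.samples = torch.tensor(json.load(f))
            self.transform = transform

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            sample = self.samples[idx]
            if self.transform is not None:
                sample = self.transform(sample)
            return sample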
I should revise my question into two different sub-questions:
1. How to deal with large datasets in PyTorch to avoid memory errors
2. If I am separating a large dataset into small chunks, how can I load multiple mini-datasets?
For question 1:
PyTorch DataLoader can prevent this issue by creating mini-batches. Here you can find further explanations.
For question 2:
Please refer to Shai's answer above.
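Putting the two answers together, a usage sketch (batch size, shuffling, and worker count are arbitrary illustrative choices) could look like:

    from torch.utils.data import DataLoader

    # multiple_json_dataset is the ConcatDataset built in the answer above
    loader = DataLoader(multiple_json_dataset, batch_size=32, shuffle=True, num_workers=2)

    for batch in loader:
        # each iteration holds only one mini-batch in memory at a time
        pass  # replace with the training step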
