AttributeError: 'DatasetDict' object has no attribute 'to_tf_dataset' - NLP

I am working on fine-tuning for an NLP project using the Hugging Face library.
Here is the code I am having trouble with. Has anyone been able to solve this problem?
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = testdata.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["labels"],
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True
)
NB: I have seen suggestions about upgrading to the latest versions, and I have done that, but the problem persists.

I faced the same problem. In my case I was working with a CSV file. I used the following code to load the dataset:
from datasets import load_dataset
dataset_training = load_dataset("csv", data_files="file.csv")
Then the to_tf_dataset method returned:
AttributeError: 'DatasetDict' object has no attribute 'to_tf_dataset'
To overcome this issue I loaded the content as a pandas DataFrame and then converted it using another method:
import pandas as pd
from datasets import Dataset

data = pd.read_csv("file.csv")
dataset = Dataset.from_pandas(data)
After that, the to_tf_dataset method worked correctly. I have no explanation for why this works, but it did for me.
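For what it's worth, a likely explanation is that load_dataset returns a DatasetDict (one Dataset per split), while to_tf_dataset is defined on the individual Dataset objects. Selecting the split first should therefore also work; here is a minimal sketch assuming the local CSV ends up under the default "train" split and that the tokenized columns and data_collator are set up as in the question:

from datasets import load_dataset

dataset_dict = load_dataset("csv", data_files="file.csv")  # returns a DatasetDict
train_dataset = dataset_dict["train"]  # pick the split; this is a Dataset
# (tokenize train_dataset first so the input_ids / attention_mask columns exist)

tf_dataset = train_dataset.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["labels"],
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True,
)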

Related

How to load large multi-file parquet datasets for tensorflow/pytorch

I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives an out-of-memory error.
I have also tried using petastorm, but that doesn't work with make_reader() because the data isn't in the petastorm format.
with make_batch_reader('dir') as reader:
dataset = make_petastorm_dataset(reader)
When I used make_batch_reader() and then make_petastorm_dataset(reader), it again gave a 'zip not iterable' error or something along those lines.
I am not sure how to load the file into Python for ML training.
Some quick help would be greatly appreciated.
Thanks
Zash
For pyarrow, you can list the directory with Python, iterate over the *.parquet files, open each one as a pq.ParquetFile, and read it one row group at a time. This alleviates the memory pressure, but won't be especially fast without parallelization.
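A minimal sketch of that row-group approach (the directory path is a placeholder):

import glob
import pyarrow.parquet as pq

for path in glob.glob("dir/*.parquet"):
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        table = pf.read_row_group(i)   # one row group as a pyarrow.Table
        df = table.to_pandas()         # convert and feed to your training loop
        # ... process df, then let it go out of scope to free memory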
For petastorm, you are right to use make_batch_reader(). The error messages are admittedly not always helpful, but you can inspect the stack trace and investigate where in the petastorm code it originates.
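One detail that commonly trips people up (and is only an assumption about what happened here): make_batch_reader expects a dataset URL rather than a bare path, so a local directory needs a file:// prefix. A sketch:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Bare paths like 'dir' often fail; an absolute file:// URL is expected.
with make_batch_reader("file:///absolute/path/to/dir") as reader:
    dataset = make_petastorm_dataset(reader)   # a tf.data.Dataset of batches
    for batch in dataset.take(1):
        print(batch)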
You can load the entire dataset using dask with the code below.
You can also load only chunks of the data when needed, by computing only those partitions via the index (assuming you have a suitable index).
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

@delayed
def load_chunk(pth):
    x = ParquetFile(pth).to_pandas()
    x = x.drop('[unwanted_columns_to_save_space]', axis=1)  # placeholder: columns to drop
    return x

files = glob.glob('./your_path/*.parquet')
ddf = dd.from_delayed([load_chunk(f) for f in files])
df = ddf.compute()

Numpy error on .arange command

I am trying to use sklearn NMF on a binary file (.bin) imported via numpy and converted to uint8. I import the file with no problem, but it comes in as a 1D array, and when I try to arrange it into a 2D array (which sklearn NMF requires) it errors. I have imported numpy and sklearn.
Import data:
m1 = np.fromfile('file', dtype='uint8')
Code it errors on (I added the - symbol following advice from the docs; it also errors without the - symbol):
m1.arange(962240400).reshape((31020,-31020))
The error:
AttributeError: 'numpy.ndarray' object has no attribute 'arange'
I have tried looking at the official docs and stack overflow, but nothing seems to be working. If anyone has any ideas as to why my code is wrong that would be great.
Use np.arange(962240400).reshape((31020, -31020)); arange is a function of numpy, not a method of the array m1.
Use arange in place of arrange; there should be only one 'r'.
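As an aside (and only an assumption about the underlying goal): since the data is already loaded into m1, what is probably wanted is to reshape that existing array rather than create a new one with arange. A sketch, assuming the file really holds 31020 x 31020 uint8 values ('file' is the question's placeholder path):

import numpy as np

m1 = np.fromfile('file', dtype='uint8')   # 1D array of length 962240400
m1_2d = m1.reshape((31020, -1))           # -1 lets numpy infer the second dimension
# m1_2d now has shape (31020, 31020) and can be passed to sklearn's NMF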

CSV Input in gensim LDA via corpora.csvcorpus

I want to use LDA in gensim for topic modeling over a few thousand documents.
For that, I'm using a CSV file as input, in the form of a term-document matrix.
Currently, an error occurs when running the following code:
from gensim import corpora
import_path ="TDM.csv"
dictionary = corpora.csvcorpus(import_path, labels='true')
The error is the following:
dictionary = corpora.csvcorpus(import_path, labels='true')
AttributeError: module 'gensim.corpora' has no attribute 'csvcorpus'
Am I using the module correctly and if so, where is my mistake?
Thanks in advance.
This also bugged me for quite a while.
It looks like csvcorpus is actually still in the experimental stage, as you can see in their GitHub issue: https://github.com/RaRe-Technologies/gensim/issues/1583
I would recommend going the old-fashioned way and using the csv package to read your CSV file instead.
Cheers.
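A minimal sketch of that route, assuming TDM.csv is purely numeric with one row per term and one column per document (adjust the parsing to your file's actual layout); gensim's matutils.Dense2Corpus can then wrap the dense matrix as a corpus:

import csv
import numpy as np
from gensim import matutils, models

# Read the term-document matrix with the csv package.
with open("TDM.csv", newline="") as f:
    rows = [[float(v) for v in row] for row in csv.reader(f)]
tdm = np.array(rows)

# Each column is treated as one document's term counts.
corpus = matutils.Dense2Corpus(tdm, documents_columns=True)

# Without an id2word mapping, topics are reported by term id rather than by term.
lda = models.LdaModel(corpus=corpus, num_topics=10)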

AttributeError: 'function' object has no attribute

I see there are many questions with this title or about this issue, but I still don't understand why it occurs.
I have imported Pandas and Numpy.
Then I read my file using pd.read_excel.
Then I viewed the head of my file using .head()
Now, after I sliced my data, the .head method was still working fine. But then it suddenly throws an AttributeError, which goes away once I re-import my file; after some time, it gives the same error again. What am I doing wrong? I don't clearly understand this error.
import pandas as pd
import numpy as np

sales = pd.read_excel('SALESC.xlsx', header=0)
sales.isnull().sum()
sales["Date"] = pd.to_datetime(sales['Date of document'])
sales = sales[pd.notnull(sales['Quantity sold']) &
              pd.notnull(sales['Unit selling price including tax'])]
sales = sales.iloc[:, [3, 6, 8, 9, 10, 11, 19, 35, 39]]
sales.head(5)
Can someone explain the problem and how to resolve it? Thanks in advance.
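The snippet above doesn't show the line that actually raises the error, but one common way this exact message appears in a notebook is when the DataFrame's name gets rebound to a function defined in a later cell, which would also explain why re-running the import cell temporarily fixes it. The sketch below only illustrates that failure mode; it is not a diagnosis of this particular file:

import pandas as pd

sales = pd.DataFrame({"Quantity sold": [1, 2]})   # stands in for the read_excel result
sales.head(5)                                     # works: sales is a DataFrame

def sales():                                      # a later cell accidentally reuses the name
    pass

# sales.head(5)  # would now raise: AttributeError: 'function' object has no attribute 'head'
print(type(sales))                                # shows what the name is bound to right now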

How to import .dta via pandas and describe data?

I am new to python and have a simple problem. In a first step, I want to load some sample data I created in Stata. In a second step, I would like to describe the data in python - that is, I'd like a list of the imported variable names. So far I've done this:
from pandas.io.stata import StataReader
reader = StataReader('sample_data.dta')
data = reader.data()
dir()
I get the following error:
anaconda/lib/python3.5/site-packages/pandas/io/stata.py:1375: UserWarning: 'data' is deprecated, use 'read' instead
warnings.warn("'data' is deprecated, use 'read' instead")
What does it mean and how can I resolve the issue? And, is dir() the right way to get an understanding of what variables I have in the data?
Using pandas.io.stata.StataReader.data to read from a Stata file has been deprecated since pandas version 0.18.1, hence the warning.
Instead, you should use pandas.read_stata to read the file, as shown:
import pandas as pd

df = pd.read_stata('sample_data.dta')
df.dtypes  # return the dtypes of the columns
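To answer the second part of the question: the DataFrame's columns attribute is a more direct way to list the imported variable names than dir(), for example:

import pandas as pd

df = pd.read_stata('sample_data.dta')
print(list(df.columns))   # the imported variable names
print(df.dtypes)          # names together with their types
print(df.describe())      # basic summary statistics for numeric variables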
Sometimes this did not work for me, especially when the dataset is large. So what I propose here is a two-step approach (Stata, then Python).
In Stata write the following commands:
export excel Cevdet.xlsx, firstrow(variables)
and to copy the variable labels write the following
describe, replace
list
export excel using myfile.xlsx, replace first(var)
restore
This will generate two files for you: Cevdet.xlsx and myfile.xlsx.
Now go to your Jupyter notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('Cevdet.xlsx')
This will allow you to read both files into Jupyter (Python 3).
My advice is to also save this data frame to disk (especially if it is big):
df.to_pickle('Cevdet')
The next time you open Jupyter, you can simply run:
df=pd.read_pickle("Cevdet")
