Torchtext TabularDataset: data.Field doesn't contain actual imported data? - pytorch

I learned from the Torchtext documentation that the way to import csv files is through TabularDataset. I did it like this:
train = data.TabularDataset(path='./data.csv',
                            format='csv',
                            fields=[("label", data.Field(use_vocab=True, include_lengths=False)),
                                    ("statement", data.Field(use_vocab=True, include_lengths=True))],
                            skip_header=True)
"label" and "statement" are the header names of the two columns in my csv file. I defined them as data.Field, but "label" and "statement" don't seem to actually contain the data from my csv file, despite being recognized as data field objects by the console with no problem. I discovered the issue when I tried to build a vocab list with statement.build_vocab(train, max_size=25000): printing len(statement.vocab) returned 2, which obviously doesn't reflect the actual data in the csv file. Did I do something wrong when importing the csv data, or is my vocab building wrong? Is there a separate method to put the data in the field objects? Thanks!!

The fields must be defined separately, like this:
TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, include_lengths=True)
LABEL = data.Field(sequential=True, tokenize=tokenize, lower=True)
train = data.TabularDataset(path='./data.csv',
                            format='csv',
                            fields=[("label", LABEL),
                                    ("statement", TEXT)],
                            skip_header=True)
test = data.TabularDataset(path='./test.csv',
                           format='csv',
                           fields=[("label", LABEL),
                                   ("statement", TEXT)],
                           skip_header=True)
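The reason defining the fields separately matters is that build_vocab must be called on a Field object you still hold a reference to; a Field created inline inside the fields list leaves no name to call it through. With the definitions above, the follow-up step (assuming the legacy torchtext API) would be TEXT.build_vocab(train, max_size=25000) and LABEL.build_vocab(train). A minimal stand-in sketch, using a hypothetical Field class so it runs without torchtext, showing that the dataset and your code share the same mutable object:

```python
# Hypothetical stand-in for torchtext's Field, to illustrate the
# shared-object mutation without needing torchtext installed.
class Field:
    def __init__(self):
        self.vocab = None

    def build_vocab(self, tokens):
        # Mutates this object in place, just like torchtext's build_vocab.
        self.vocab = sorted(set(tokens))

TEXT = Field()
fields = [("statement", TEXT)]   # the dataset holds the very same object

TEXT.build_vocab(["the", "cat", "sat", "the"])
# The mutation is visible through the dataset's reference too:
print(fields[0][1].vocab)        # ['cat', 'sat', 'the']
```

In the question, the Field objects were created inline and immediately forgotten, so there was nothing to call build_vocab on afterwards.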

Related

Convert panda dataframe to h5 file

I want to store my dataframe in h5 file. My dataframe is:
dfbhbh=pd.DataFrame([m1bhbh,m2bhbh,adcobhbh,edcobhbh,alisabhbh,elisabhbh,tevolbhbh,distbhbh,metalbhbh,compbhbh,weightbhbh]).T
dfbhbh.columns=['m_1','m_2','a_DCO','e_DCO','a_LISA','e_LISA','t_evol','dist','Z','component','weight']
I am trying to convert it using:
hf=h5py.File('anew', 'w')
for i in range(len(dfbhbh)):
    hf.create_dataset('simulations', list(dfbhbh.iloc[i]))
And I'm getting the error
TypeError: Can't convert element 9 (low_alpha_disc) to hsize_t
I removed the entire 'component' array (even though it is extremely significant), but the code still did not run.
I also tried to insert directly the data in the h5 file like this
hf.create_dataset('simulations', m1bhbh)
I got this error
Dimensionality is too large (dimensionality is too large)
The variable 'm1bhbh' is an array of floats with length 1499.
Try:
hf.create_dataset('simulations', data = m1bhbh)
instead of
hf.create_dataset('simulations', m1bhbh)
(Don't forget to clear outputs before running it.)
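The reason the keyword matters: in h5py, create_dataset's second positional argument is shape, not data, so passing the array positionally makes h5py try to interpret it as a shape tuple, which produces the hsize_t and dimensionality errors. A minimal sketch, with a random stand-in for m1bhbh since the original data isn't shown:

```python
import numpy as np
import h5py

m1bhbh = np.random.rand(1499)  # stand-in for the original array

with h5py.File('anew.h5', 'w') as hf:
    # data= is required: the second positional argument is `shape`, not `data`
    hf.create_dataset('simulations', data=m1bhbh)

with h5py.File('anew.h5', 'r') as hf:
    print(hf['simulations'].shape)  # (1499,)
```

For a whole DataFrame, pandas' own df.to_hdf('anew.h5', key='simulations') (backed by PyTables) is usually simpler than writing columns one at a time, though mixed-type columns like the string 'component' column need to be stored as strings.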

Extracting specific data from a [pandas.core.frame.DataFrame] variable

While extracting data from a .csv file using pandas, I wanted to collect the labels of the various columns in that file. Instead of hardcoding them, I was trying to extract them from the variable created by the code below:
train_data = pd.read_csv("Anydatasheet.csv")
features = ["Pclass","Age", "Fare", "Parch", "SibSp","Sex","Embarked"]
X = pd.get_dummies(train_data[features])
X.head()
(By labels above, I mean the bold text circled in the image attached)
Can anyone tell me how to do it?
(Image data source: Kaggle Titanic problem data)
What you are looking for are the column names. You can read them directly:
train_data = pd.read_csv("Anydatasheet.csv")
features_name = train_data.columns
or, if you want them as a regular Python list:
train_data = pd.read_csv("Anydatasheet.csv")
features_name = train_data.columns.tolist()
example:
import pandas as pd
df = pd.DataFrame({"city":[1,1], "A":[5,6]})
print(df.columns.tolist())
output:
['city', 'A']

Python - In a ML code. Getting error : IndexError: list index out of range

I was going through some ML Python code to understand what it does and how it works. A YouTube video took me to this code: random-forests-tutorials. The code uses a hard-coded array/list, but if I use a file as input, it throws
IndexError: list index out of range in the print_tree function
Could someone please help me resolve this? I have not changed anything else in the program besides pointing it at a file as input instead of the hard-coded array.
I created this function to read the CSV data from the header and training files. To read the testing data file I have a similar function that does not read row[5], since it does not exist; the testing file is one column short.
def getBackData(filename):
    with open(filename, newline='') as csvfile:
        rawReader = csv.reader(csvfile, delimiter=',', quotechar='"')
        if "_training" in filename:
            parsed = ((row[0],
                       int(row[1]),
                       int(row[2]),
                       int(row[3]),
                       row[4],
                       row[5])
                      for row in rawReader)
        else:
            parsed = rawReader
        theData = list(parsed)
    return theData
So in the code am using the variables as
training_data = fs.getBackData(fileToUse + "_training.dat")
header = fs.getBackData(fileToUse + "_header.dat")
testing_data = fs.getBackData(fileToUse + "_testing.dat")
Sample Data for Header is
header = ["CYCLE", "PASSES", "FAILURES", "IGNORED", "STATUS", "ACCEPTANCE"]
Sample for Training Data is
"cycle1",10,5,1,"fail","discard"
"cycle2",7,9,0,"fail","discard"
"cycle3",14,2,0,"pass","accept"
Sample for Testing Data is
"cycle71",11,4,1,"failed"
"cycle72",16,0,0,"passed"
I can't believe myself. I was wondering why it was so difficult to use a CSV file when everything else is so easy in Python. My bad, I am new to it. I finally figured out what was causing the index to go out of range.
The function getBackData should be used for the training and testing data only.
A separate function is required for the header: it has the same number of columns, but every value is a string.
Actually, I was using getBackData for the header too, and it was returning the CSV (containing the headers) as a 2D list. That's what it normally does, and that was causing the issue.
The header was supposed to be read as header[index], but the code was treating it as header[row][col]; that's what I missed. I had assumed Python would be intelligent enough to return a 1-D list when the CSV contains only one row.
Deserves a smiley :-)
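The header fix described above can be sketched with the csv module directly: csv.reader always yields rows, so even a one-row file comes back as a 2-D structure, and the header needs next() or an index to flatten it. A minimal sketch using an in-memory stand-in for the header file:

```python
import csv
import io

# Stand-in for the _header.dat file (a single row, all strings)
header_file = io.StringIO('CYCLE,PASSES,FAILURES,IGNORED,STATUS,ACCEPTANCE\n')

# csv.reader always yields rows, so even a single-row file is 2-D:
rows = list(csv.reader(header_file))
print(rows[0][5])        # 'ACCEPTANCE' -- accessed as header[row][col], the bug

# Flatten it by taking the single row:
header_file.seek(0)
header = next(csv.reader(header_file))
print(header[5])         # 'ACCEPTANCE' -- accessed as header[index], as intended
```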

Combining multiple data rows on key lookup

So I am trying to combine multiple CSV files. I have one csv with a current part number list of products we stock. Sorry, I can't embed images as I am new. I've seen many similar posts, but none with both a merge and a groupby together.
(screenshot: current_products)
I have another csv with a list of image files that are associated with that part but are split up on to multiple rows. This list also has many more parts listed than we offer so merging based on the current_products sku is important.
(screenshot: product_images)
I would like to reference the first csv for the parts I currently use and combine the image files in the following format.
(screenshot: newestproducts)
I get an AttributeError: 'function' object has no attribute 'to_csv', although when I just print the output in the terminal it appears to be the way I want it.
current_products = 'currentproducts.csv'
product_images = 'productimages.csv'
image_list = 'newestproducts.csv'
df_current_products = pd.read_csv(current_products)
df_product_images = pd.read_csv(product_images)
df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)
df_merged = pd.merge(df_current_products, df_product_images[['sku','images']], on='sku', how='left')
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index
#print(df_output)
df_output.to_csv(image_list, index=False)
You are missing () after reset_index:
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index()
Without the parentheses, df_output is the reset_index method itself rather than a DataFrame (print type(df_output) to see this), so it has no attribute named to_csv.
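A minimal end-to-end sketch of the merge + groupby pattern, with made-up skus and image names, and a single images column for simplicity (in the question the merged column is images_y because both frames carried an images column):

```python
import pandas as pd

# Made-up stand-ins for the two CSVs
df_current_products = pd.DataFrame({'sku': ['A1', 'B2']})
df_product_images = pd.DataFrame({'sku': ['A1', 'A1', 'B2', 'C3'],
                                  'images': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']})

# Left merge keeps only the skus we stock (C3 is dropped)
df_merged = pd.merge(df_current_products, df_product_images[['sku', 'images']],
                     on='sku', how='left')

# Note the () on reset_index -- without it, df_output is the method itself
df_output = df_merged.groupby('sku')['images'].apply('&&'.join).reset_index()
print(list(df_output['images']))  # ['a.jpg&&b.jpg', 'c.jpg']
```

One caveat: if a sku has no matching images, the left join leaves NaN in that row and '&&'.join will raise; drop or fill those rows before grouping.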

Why do my lists become strings after saving to csv and re-opening? Python

I have a Dataframe in which each row contains a sentence followed by a list of part-of-speech tags, created with spaCy:
df.head()
question POS_tags
0 A title for my ... [DT, NN, IN,...]
1 If one of the ... [IN, CD, IN,...]
When I write the DataFrame to a csv file (encoding='utf-8') and re-open it, it looks like the data format has changed with the POS tags now appearing between quotes ' ' like this:
df.head()
question POS_tags
0 A title for my ... ['DT', 'NN', 'IN',...]
1 If one of the ... ['IN', 'CD', 'IN',...]
When I now try to use the POS tags for some operations, it turns out they are no longer lists but have become strings that even include the quotation marks. They still look like lists but are not. This is clear when doing:
q = df['POS_tags']
q = list(q)
print(q)
Which results in:
["['DT', 'NN', 'IN']"]
What is going on here?
I either want the column 'POS_tags' to contain lists, even after saving to csv and re-opening. Or I want to do an operation on the column 'POS_tags' to have the same lists again that SpaCy originally created. Any advice how to do this?
To preserve the exact structure of the DataFrame, an easy solution is to serialize it in pickle format with pd.to_pickle instead of using CSV, which always throws away all information about data types and requires manual reconstruction after re-import. One drawback of pickle is that it's not human-readable.
# Save to pickle
df.to_pickle('pickle-file.pkl')
# Save with compression
df.to_pickle('pickle-file.pkl.gz', compression='gzip')
# Load pickle from disk
df = pd.read_pickle('pickle-file.pkl') # or...
df = pd.read_pickle('pickle-file.pkl.gz', compression='gzip')
Fixing lists after importing from CSV
If you've already imported from CSV, this should convert the POS_tags column from strings to python lists:
from ast import literal_eval
df['POS_tags'] = df['POS_tags'].apply(literal_eval)
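A round-trip sketch showing the problem and the literal_eval fix, using a tiny made-up DataFrame written to an in-memory buffer instead of a file:

```python
import io
from ast import literal_eval

import pandas as pd

# Tiny made-up frame standing in for the spaCy-tagged data
df = pd.DataFrame({'question': ['A title for my ...'],
                   'POS_tags': [['DT', 'NN', 'IN']]})

# Round-trip through CSV
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

# After re-import, the column holds the string repr of each list:
print(type(df2['POS_tags'][0]))   # <class 'str'>

# literal_eval parses the repr back into real lists:
df2['POS_tags'] = df2['POS_tags'].apply(literal_eval)
print(df2['POS_tags'][0])         # ['DT', 'NN', 'IN']
```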
