What does this piece of Python code do with a PyTorch dataset?
datasets = split_dataset(dataset(),splits=split)
datasets['_unlab'] = dmap(lambda mb: mb[0],dataset())
The second line seems particularly cryptic.
And here are links to split_dataset and dmap:
Code for split_dataset
Code for dmap
For a university assignment we were given the task of predicting the author's name from a title, abstract, and year. We received one train set and one test set, the latter without the authors' names.
To start, we left out the 'year' column, but our teacher told us it would be important to include it. The title and abstract columns have been combined, and we used a hashing vectorizer and TF-IDF transformer to convert the text data to numerical features. We then applied hyperparameter tuning and trained an SGDClassifier from scikit-learn.
Our problem now is that we want to use two independent variables (title/abstract and year), but after transforming and vectorizing the title/abstract we are unable to add the year column to the training data.
Do you have any tips to combine those?
The TF-IDF transformer produces a sparse matrix that is printed as (row, column) value pairs; we could append the year, but that would lead to an enormous dataset.
The picture shows the data before vectorization and transformation
My question now is how to vectorize the 'labels' and 'year' columns together using the HashingVectorizer and TfidfTransformer so that the result can be fed to the SGDClassifier.
X = merged_data[['labels', 'year']]
label_vectorizer = HashingVectorizer(ngram_range=(1,2), n_features=2**18)
labels_vect = label_vectorizer.transform(X)
transformer = TfidfTransformer()
X = transformer.fit_transform(labels_vect)
y = merged_data['authorId']
print(X)
Output:
(0, 100016) -1.0
(1, 137240) 1.0
Or:
X = merged_data['labels']
label_vectorizer = HashingVectorizer(ngram_range=(1,2), n_features=2**18)
labels_vect = label_vectorizer.transform(X)
transformer = TfidfTransformer()
X = transformer.fit_transform(labels_vect)
y = merged_data['authorId']
print(X)
Output:
(0, 258452) 0.06262005509267683
(0, 258265) 0.09434206300878606
(0, 255798) 0.0942876246256502
(0, 254434) -0.06461787945275276
(0, 245917) 0.06743365473981061
(0, 245256) -0.06461787945275276
...
(12128, 84695) -0.14448685821655255
(12128, 73740) -0.09330855785758727
(12128, 68673) -0.09849506054591504
(12128, 64492) -0.12171733351456865
(12128, 62658) -0.13655467338246333
And then I would still need to append the year? I am really not sure how to approach this.
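One commonly suggested approach, sketched below under the assumption that merged_data is the DataFrame described above: vectorize only the text column, then horizontally stack the resulting sparse matrix with the year column using scipy.sparse.hstack, so the year becomes one extra feature column instead of an enormous dense copy. (In the first snippet, HashingVectorizer iterates over the DataFrame itself, which yields the column names 'labels' and 'year' as two "documents"; that is presumably why its output only contains rows 0 and 1.)
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

text = merged_data['labels']                        # combined title + abstract
year = merged_data[['year']].to_numpy(dtype=float)  # numeric column, optionally scaled

label_vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
labels_vect = label_vectorizer.transform(text)      # sparse (n_samples, 2**18)
X_text = TfidfTransformer().fit_transform(labels_vect)

# Append the year as one extra sparse column; no dense copy is created.
X = hstack([X_text, csr_matrix(year)]).tocsr()
y = merged_data['authorId']

clf = SGDClassifier()
clf.fit(X, y)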
There are already two posts about this topic, but they have not been updated for the recent TF 2.1 release...
In brief, I've got a lot of tif images to read and parse with a specific pipeline.
import functools
import tensorflow as tf
import numpy as np

files = ...   # a list of str (paths to the tif files)
labels = ...  # a list of int
n_unique_label = len(np.unique(labels))

gen = functools.partial(generator, file_list=files, label_list=labels, param1=x1, param2=x2)
dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.float32, tf.int32))
dataset = dataset.map(lambda b, c: (b, tf.one_hot(c, depth=n_unique_label)))
This processing works well. Nevertheless, I need to parallelize the file-parsing part, so I tried the following solution:
files = ...  # a list of str
files = tf.data.Dataset.from_tensor_slices(files)

def wrapper(file_path):
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.py_function(parser, inp=[file_path], Tout=[tf.float32])

dataset = files.map(wrapper, num_parallel_calls=2)
The difference is that here I parse one file at a time with the parser function. However, it does not work:
File "loader.py", line 643, in tif_parser
image = numpy.array(Image.open(file_path)).astype(float)
File "python3.7/site-packages/PIL/Image.py", line 2815, in open
fp = io.BytesIO(fp.read())
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'read'
[[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]
As far as I understand, the tif_parser function does not receive a plain string but a tensor. For now, this function is fairly simple:
import numpy
from PIL import Image

def tif_parser(file_path, param1=1, param2=2):
    image = numpy.array(Image.open(file_path)).astype(float)
    image /= 255.0
    return image
Here is how I proceeded:
dataset = tf.data.Dataset.from_tensor_slices((files, labels))

def wrapper(file_path, label):
    import functools
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.data.Dataset.from_generator(parser, (tf.float32, tf.int32), args=(file_path, label))

dataset = dataset.interleave(wrapper, cycle_length=tf.data.experimental.AUTOTUNE)
# The labels are converted to one-hot vectors; this could be integrated into tif_parser
dataset = dataset.map(lambda i, l: (i, tf.one_hot(l, depth=unique_label_count)))
dataset = dataset.shuffle(buffer_size=file_count, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
Concretely, a data set is generated every time the parser is called. The parser is run cycle_length times at each call, meaning that cycle_length images are read at once. This suits my specific case, because I cannot load all the images in memory. I am unsure whether prefetch is used correctly here.
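For completeness, here is a minimal sketch of the map/py_function route from the failed attempt above; this is my own assumption about the fix, not part of the original pipeline. The EagerTensor that tf.py_function hands to the Python function can be converted to a plain path string with .numpy().decode() before PIL opens it.
import functools
import numpy
import tensorflow as tf
from PIL import Image

def tif_parser(file_path, param1=1, param2=2):
    # file_path arrives as an EagerTensor inside tf.py_function;
    # convert it to a plain Python string before handing it to PIL.
    path = file_path.numpy().decode("utf-8")
    image = numpy.array(Image.open(path)).astype(numpy.float32)
    image /= 255.0
    return image

def wrapper(file_path):
    parser = functools.partial(tif_parser, param1=1, param2=2)  # illustrative parameters
    image = tf.py_function(parser, inp=[file_path], Tout=tf.float32)
    image.set_shape([None, None, None])  # restore rank information lost by py_function
    return image

dataset = tf.data.Dataset.from_tensor_slices(files).map(wrapper, num_parallel_calls=2)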
I apologise in advance as I cannot reproduce the dataset I'm working with, so I am just going to describe the steps and hope someone is familiar with the whole process.
I'm trying to use Gensim's LDA to extract topics from a list of text documents.
from gensim.models import LdaModel
from gensim.corpora import Dictionary
I build the dictionary and the corpus:
dictionary = Dictionary(final_docs)
corpus = [dictionary.doc2bow(doc) for doc in final_docs]
where final_docs is a list of lists with cleaned tokens for each text like this:
final_docs = [['cat','dog','animal'],['school','university','education'],...['music','dj','pop']]
Then I initialize the model like this:
# Set training parameters:
num_topics = 60
chunksize = 100
passes = 20
iterations = 400
eval_every = None

# Make an index-to-word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                 alpha='auto', eta='auto',
                 iterations=iterations, num_topics=num_topics,
                 passes=passes, eval_every=eval_every)
I can print the topics and their terms (the 10 most important), and they make sense. So it seems to be working fine.
for idx in range(num_topics):
    print("Topic #%s:" % idx, model.print_topic(idx, 10))
BUT I struggle to plot all the documents as clusters using Bokeh. (And I really need Bokeh, because I compare the same plot from different models.) I know I have to reduce the dimensionality to 2, and I try to do that with CountVectorizer and then t-SNE:
from sklearn.feature_extraction.text import CountVectorizer

docs_vect = [' '.join(txt) for txt in final_docs]
cvectorizer = CountVectorizer(min_df=6, max_df=0.50, max_features=10000, stop_words=stop)
cvz = cvectorizer.fit_transform(docs_vect)
X_lda = model.fit_transform(cvz)  # this line raises the error below
But I get this error: AttributeError: 'LdaModel' object has no attribute 'fit_transform'
I'm definitely doing something wrong with CountVectorizer. Could anyone help me out?
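A minimal sketch of the usual workaround, assuming the model and corpus built above: a gensim LdaModel has no fit_transform, so instead take the per-document topic distribution from the already-trained model, reduce it to 2-D with t-SNE, and hand those coordinates to Bokeh.
import numpy as np
from sklearn.manifold import TSNE

# Build an (n_docs, num_topics) matrix of topic probabilities from the trained model.
doc_topics = np.zeros((len(corpus), num_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
        doc_topics[i, topic_id] = prob

# Reduce to 2-D; these coordinates can then be used in a Bokeh scatter plot.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(doc_topics)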
I am working on a multiclass classification problem with an imbalanced dataset of images (different classes). I tried the imblearn library, but it does not work on an image dataset.
I have a dataset of images belonging to 3 classes, namely A, B, and C. A has 1000 images, B has 300, and C has 100. I want to oversample classes B and C to avoid the data imbalance. Please let me know how to oversample the image dataset using Python.
Actually, imblearn.over_sampling only resamples 2-D inputs. So one way to oversample your image dataset with this library is to use reshaping alongside it; you can:
reshape your images
oversample them
reshape the new dataset back to the original dimensions
Suppose you have an image dataset of shape (5000, 28, 28, 3) stored as an ndarray. Following the steps above, you can use the solution below:
# X : current dataset
# y : labels
from imblearn.over_sampling import RandomOverSampler

# flatten each image into a 1-D feature vector
reshaped_X = X.reshape(X.shape[0], -1)

# oversampling
oversample = RandomOverSampler()
oversampled_X, oversampled_y = oversample.fit_resample(reshaped_X, y)

# reshaping X back to the first dims
new_X = oversampled_X.reshape(-1, 28, 28, 3)
Hope that was helpful!
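A quick sanity check, using the variables from the snippet above: with RandomOverSampler's default strategy, every class should end up with as many samples as the majority class.
import numpy as np

# Counts per class after oversampling; all classes should now match the largest one.
classes, counts = np.unique(oversampled_y, return_counts=True)
print(dict(zip(classes, counts)))
print(new_X.shape)  # first dimension grows to n_classes * majority_class_count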
I have a simple dataframe consisting of one column containing 10320 numerical observations. I'm simulating time-series data by feeding the data into a plot with a window of 200 observations at a time. Here is the plotting code:
import time
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas

fig_size = plt.rcParams["figure.figsize"]
fig, axes = plt.subplots(1, 1, figsize=(19, 5))

dframe = dframe.set_index(np.arange(0, len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe) / window)
i = 0

while i < iterations:
    frm = window * i
    if i == iterations - 1:
        to = len(dframe)  # let the last window run to the end of the data
    else:
        to = frm + window
    df = dframe[frm:to]
    if len(df) > 100:
        df = df.set_index(np.arange(0, len(df)))
        plt.gca().cla()
        plt.plot(df.index, df[0])
        plt.axhline(y=std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.axhline(y=-std, xmin=0, xmax=len(df[0]), c='gray', linestyle='--', lw=2)
        plt.ylim(min(dframe[0]) - 0.5, max(dframe[0]))
        plt.xlim(-50, window + 50)
        display.clear_output(wait=True)
        display.display(plt.gcf())
        canvas = FigureCanvas(fig)
        canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
    i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply a theanets RNN LSTM to the data to detect anomalies, unsupervised. Because I am doing it unsupervised, I don't think I need to split my data into training and test sets. I haven't found much of anything that makes sense to me so far, and I have been googling for about 2 hours. Just hoping you guys may be able to help. I want to put the prediction output of the RNN on the graph as well and define a threshold such that, if the error is too large, the values are identified as anomalous. If you need more information, please comment and let me know. Thank you!
READING
Like neurons, LSTM networks are built of interconnected LSTM blocks whose training is done via backpropagation through time.
Classical anomaly detection on time series requires predicting the series at one or more future points and computing the error between those predictions and the true values. A prediction error above a threshold indicates an anomaly.
SOLUTION
Having said this:
You have to train the network, so you need both a training set and a test set.
Use N inputs to predict M outputs (decide on N and M by experimentation; pick values for which the training error is low).
Slide a window of (N+M) elements over the input data and use each such array of (N+M) items, also termed a frame, to train or test the network, as sketched below.
Typically we use the first 90% of the series for training and the last 10% for testing.
This scheme will fail if training is not done properly: there will be large prediction errors that are not anomalies. So make sure to provide enough training, and, most importantly, shuffle the training frames so that all variations are covered.
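A minimal sketch of this frame-and-threshold scheme, assuming the dframe from the question; numpy builds the frames, and scikit-learn's MLPRegressor stands in for the theanets LSTM (N, M, and the 3-sigma threshold are illustrative choices, not part of the answer above).
import numpy as np
from sklearn.neural_network import MLPRegressor

series = dframe[0].to_numpy()
N, M = 50, 1  # N past points predict the next M points (illustrative values)

# Slide a window of N+M elements over the series to build the frames.
frames = np.array([series[i:i + N + M] for i in range(len(series) - N - M + 1)])
X, y = frames[:, :N], frames[:, N:].ravel()

# First 90% of the frames for training (shuffled), last 10% for testing.
split = int(0.9 * len(frames))
idx = np.random.permutation(split)
model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
model.fit(X[idx], y[idx])

# Prediction error on the held-out frames; errors above the threshold are flagged.
errors = np.abs(model.predict(X[split:]) - y[split:])
threshold = errors.mean() + 3 * errors.std()
anomalies = np.where(errors > threshold)[0] + split + N  # indices in the original series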