Size of WebDataset in PyTorch

When it comes to the PyTorch DataLoader, which takes a standard dataset (e.g. datasets.ImageFolder), we can find the size of the dataset used by the dataloader with len(dataloader). However, what about WebDataset?
As WebDataset is a PyTorch Dataset, is it possible to get the size of a loader that takes a WebDataset?
https://webdataset.github.io/webdataset/

WebDataset doesn't provide a __len__ method, as it conforms to the PyTorch IterableDataset interface. IterableDataset is designed for stream-like data, for which a len() is considered inappropriate.
If you have code that depends on len() being available, you can set the length to some value using with_length():
>>> dataset = wds.WebDataset(url)
>>> len(dataset)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'WebDataset' has no len()
>>> dataset = dataset.with_length(10)
>>> len(dataset)
10
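If you then wrap the dataset in a DataLoader, len(loader) also becomes available, because for an IterableDataset recent PyTorch versions fall back to len(dataset). A minimal sketch, assuming a hypothetical shard URL and a sample count you know or can estimate:
import webdataset as wds
from torch.utils.data import DataLoader

url = "shards/train-{000000..000099}.tar"   # hypothetical shard pattern
nsamples = 100000                           # assumed, known from how the shards were built

# Declare the length so that len() works on the dataset and on the loader.
dataset = wds.WebDataset(url).with_length(nsamples)
loader = DataLoader(dataset, batch_size=32)

print(len(dataset))   # 100000
print(len(loader))    # 3125 (100000 / 32, rounded up)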

Related

Enquiry on uniformly distributed random numbers using Python

How can I randomly generate 5,000 integers uniformly distributed in [1, 100] and find their mean using Python? I tried np.random.randint(100, size=5000), but I got the error message below while trying to get the mean.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'numpy.ndarray' object is not callable
You can use np.random.randint:
import numpy as np
r=np.random.randint(0,100,5000)
Then use mean to find the mean of that:
>>> np.mean(r)
49.4686
You can also use the array method of mean():
>>> r.mean()
49.4686
You can also do it in one line (note that randint's upper bound is exclusive, so use 101 to include 100, and size=5000 to match the question):
np.random.randint(1, 101, size=5000).mean()
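If you prefer the newer Generator API, it has an endpoint flag that makes the upper bound inclusive, which avoids the off-by-one entirely; a small sketch:
import numpy as np

rng = np.random.default_rng()
# endpoint=True makes 100 itself a possible value, i.e. the range is [1, 100].
r = rng.integers(1, 100, size=5000, endpoint=True)
print(r.mean())   # close to 50.5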

How to get prediction for a single data entry

I have a trained model stored in a pickle file. All I need to do is take a single-row dataframe in pandas and get a prediction by passing it to the model.
To handle the categorical columns, I have used one-hot encoding. So to convert the pandas dataframe to a numpy array, I also applied one-hot encoding to the single-row dataframe. But it shows me an error.
import pickle
import category_encoders as ce
import pandas as pd
pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)
ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_t = pd.read_pickle("case1.pkl")
X_t_ohe = ohe.fit_transform(X_t)
X_t_ohe = X_t_ohe.fillna(0)
Ypredict = pickle_model.predict(X_t_ohe)
print(Ypredict[0])
Traceback (most recent call last):
File "Predict.py", line 14, in
Ypredict = pickle_model.predict(X_t_ohe)
File "/home/neo/anaconda3/lib/python3.6/site-> packages/sklearn/linear_model/base.py", line 289, in predict
scores = self.decision_function(X)
File "/home/neo/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 93 features per sample; expecting 989
This happens because OneHotEncoder converts your dataframe into many numerical columns, while your pickled model was trained on the original data, which does not have the same dimensions (the same number of columns).
To rectify this, fit the one-hot encoder on the training data, retrain your model on the encoded features, and save both the fitted encoder and the model as pickle files; at prediction time, load the fitted encoder and call transform() instead of fit_transform() so the columns line up with what the model saw during training.
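A minimal sketch of that workflow, assuming a scikit-learn LogisticRegression as a stand-in for your model and a hypothetical training dataframe (only the fit-once / transform-later pattern is the point here):
import pickle
import pandas as pd
import category_encoders as ce
from sklearn.linear_model import LogisticRegression

# --- training time (X_train / y_train are stand-ins for your real training data) ---
X_train = pd.DataFrame({"color": ["red", "blue", "red", "green"], "size": [1, 2, 3, 4]})
y_train = [0, 1, 0, 1]

ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_train_ohe = ohe.fit_transform(X_train)        # fit the encoder ONCE, on the training data
model = LogisticRegression().fit(X_train_ohe, y_train)

with open("pickle_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("encoder.pkl", "wb") as f:            # save the fitted encoder as well
    pickle.dump(ohe, f)

# --- prediction time ---
with open("pickle_model.pkl", "rb") as f:
    pickle_model = pickle.load(f)
with open("encoder.pkl", "rb") as f:
    ohe = pickle.load(f)

X_t = pd.read_pickle("case1.pkl")               # the single-row dataframe from the question
X_t_ohe = ohe.transform(X_t)                    # transform only; do NOT fit again
print(pickle_model.predict(X_t_ohe)[0])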

Tensorflow TypeError: 'numpy.ndarray' object is not callable

While trying to predict with the model I am getting this numpy.ndarray error. It might be the return statement of the prepare function. What can be done to get rid of this error?
import cv2
import tensorflow as tf
CATEGORIES = ["Dog", "Cat"]
def prepare(filepath):
    IMG_SIZE = 50  # 50 in txt-based
    img_array = cv2.imread(filepath, cv2.IMREAD_GRAYSCALE)
    new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
    return new_array.reshape(-1, IMG_SIZE, IMG_SIZE, 1)
model = tf.keras.models.load_model("64x3-CNN.model")
prediction = model.predict([prepare('dog.jpg')])
print(prediction) # will be a list in a list.
I tried giving the full path, but the same error persists.
TypeError Traceback (most recent call last)
<ipython-input-45-f9de27e9ff1e> in <module>
15
16 prediction = model.predict([prepare('dog.jpg')])
---> 17 print(prediction) # will be a list in a list.
18 print(CATEGORIES[int(prediction[0][0])])
TypeError: 'numpy.ndarray' object is not callable
Not sure what the rest of your code looks like, but if you use 'print' as a variable name in Python 3 you can get this error:
import numpy as np
x = np.zeros((2,2))
print = np.ones((2,2))
print(x)
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'numpy.ndarray' object is not callable
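If that is what happened in the notebook, deleting the shadowing variable makes the builtin visible again; a quick sketch:
import numpy as np

print = np.ones((2, 2))   # accidentally shadowing the builtin
del print                 # remove the variable; the name resolves to the builtin again
print("print works again")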
This type of error mostly occurs when trying to print an array instead of a simple string or a single number, so I would recommend changing:
17 print(prediction) # will be a list in a list.
18 print(CATEGORIES[int(prediction[0][0])])
to:
17 print(str(prediction)) # will be a list in a list.
18 print(str(CATEGORIES[int(prediction[0][0])]))

Getting TypeError: 'DecisionTreeClassifier' object is not iterable in Spark ML

I am trying to implement a decision tree in Spark MLlib with the help of the Coursera course "Machine Learning for Big Data". I got the error below:
<class 'pyspark.ml.classification.DecisionTreeClassifier'>
Traceback (most recent call last):
File "C:/sparkcourse/Pycharmproject/Decisiontree.py", line 65, in <module>
model=modelpipeline.fit(traindata)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable
Here is the code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()
weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#print(weatherdata.columns)
#for input features we explicitly take the columns
featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#print(featurescolumn)
weatherdata=weatherdata.drop("number")
#print(weatherdata.columns)
#missing value dealing
weatherdata=weatherdata.na.drop()
#print(weatherdata.count(),len(weatherdata.columns))
#create a categorical variable to denote whether humidity is low (we work with the relative_humidity_3pm column here): if the value is
#less than 25% the categorical value is 0, otherwise it is 1. using Binarizer will solve this
binarizer=Binarizer(threshold=24.99999,inputCol='relative_humidity_3pm',outputCol='low_humid')
#we transform whole weatherdata into Binarizer categorical value
binarizerDf=binarizer.transform(weatherdata)
#binarizerDf.select("relative_humidity_3pm",'low_humid').show(4)
#aggregating the features that will be used to make predictions into a single column
#The inputCols argument specifies the list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.
assembler=VectorAssembler(inputCols=featurescolumn,outputCol="features")
assembled=assembler.transform(binarizerDf)
#assembled.select("features").show(1)
#splitting train and test data by calling randomSplit
(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting
print(traindata.count(),testdata.count())
#create decision trees Model
#----------------------------------
#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.
decisiontree=DecisionTreeClassifier(labelCol="label",featuresCol="features",maxDepth=5,minInstancesPerNode=20,impurity="gini")
print(type(decisiontree))
#creating the model by training the decision tree; a pipeline handles this
modelpipeline=Pipeline(stages=decisiontree)
model=modelpipeline.fit(traindata)
#predicting test data
predictions=model.transform(testdata)
#showing predictedvalue
prediction=predictions.select('prediction','label').show(5)
The course uses Spark 1.6 in a Cloudera VM, but I have integrated Spark 2.1.0 with PyCharm.
stages should be a sequence of PipelineStages (Transformers or Estimators), not a single Estimator. Replace:
Pipeline(stages=decisiontree)
with
Pipeline(stages=[decisiontree])
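Since stages is just a Python list, the feature steps from the question can be chained into the same pipeline as well. A sketch reusing the names defined above; note that labelCol is assumed here to be the binarized 'low_humid' column rather than a literal 'label' column:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

decisiontree = DecisionTreeClassifier(labelCol="low_humid", featuresCol="features",
                                      maxDepth=5, minInstancesPerNode=20, impurity="gini")
# Split the raw data and let the pipeline apply Binarizer, VectorAssembler and the tree in order.
(traindata, testdata) = weatherdata.randomSplit([0.80, 0.20], seed=1234)
modelpipeline = Pipeline(stages=[binarizer, assembler, decisiontree])
model = modelpipeline.fit(traindata)
model.transform(testdata).select("prediction", "low_humid").show(5)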

sklearn LogisticRegression does not accept csr_matrix

I am a newbie and I have to classify the words of a lexicon according to the De Pauw and Wagacha (1998) method (basically, maxent on character n-grams). The data is very large (500,000 entries and millions of n-grams), so I must load the samples as a sparse matrix. But I ran into a problem.
sklearn.linear_model.LogisticRegression().fit(X,y) says it does not accept scipy.sparse.csr.csr_matrix training vectors. I got this error:
Traceback (most recent call last):
File "test-LR-4.py", line 8, in <module>
clf.fit(X,y)
File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 441, in fit
% type(X))
ValueError: Training vectors should be array-like, not <class 'scipy.sparse.csr.csr_matrix'>
for the following script:
from sklearn.linear_model import LogisticRegression
import numpy as np
import scipy.sparse as sp
X = sp.csr_matrix([[0, 1, 2],[1, 2, 3],[3, 2, 1]])
y = np.array(range(3))
clf=LogisticRegression(dual=True)
clf.fit(X,y)
As mentioned in the comments by @Andreas and @Fred Foo, upgrading the sklearn version (> 0.13) will solve the problem.
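For reference, recent scikit-learn releases accept sparse input directly, so the same snippet runs once the version is new enough; a quick sketch to check (note that dual=True requires the liblinear solver on current versions):
import sklearn
print(sklearn.__version__)        # should be newer than 0.13

from sklearn.linear_model import LogisticRegression
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([[0, 1, 2], [1, 2, 3], [3, 2, 1]])
y = np.array(range(3))
clf = LogisticRegression(solver='liblinear', dual=True)
clf.fit(X, y)                     # accepted: no "Training vectors should be array-like" error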
