Issue with using .take() function with spark 2+ pyspark

This is the code I am using. It runs fine without data.take(), but raises an error in PySpark when I use it:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
data = sc.textFile("re_u.data")
pData=data.take(2000)
ratings = pData.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
This gives the following error:
AttributeError Traceback (most recent call last)
<ipython-input-12-c9c51af1b2e9> in <module>
2 data = sc.textFile("re_u.data")
3 pData=data.take(2000)
----> 4 ratings = pData.map(lambda l: l.split(','))\
5 .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
AttributeError: 'list' object has no attribute 'map'
Update:
@Hristo Iliev, your change helped, but I then ran into another issue because ratings is now a list. Thank you for your help!
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))\
.take(2000)
rank = 20
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
This gives the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-7e35afff970b> in <module>
1 rank = 20
2 numIterations = 20
----> 3 model = ALS.train(ratings, rank, numIterations)
C:\spark\spark-3.0.0-preview2-bin-hadoop2.7\python\pyspark\mllib\recommendation.py in train(cls, ratings, rank, iterations, lambda_, blocks, nonnegative, seed)
271 (default: None)
272 """
--> 273 model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
274 lambda_, blocks, nonnegative, seed)
275 return MatrixFactorizationModel(model)
C:\spark\spark-3.0.0-preview2-bin-hadoop2.7\python\pyspark\mllib\recommendation.py in _prepare(cls, ratings)
227 else:
228 raise TypeError("Ratings should be represented by either an RDD or a DataFrame, "
--> 229 "but got %s." % type(ratings))
230 first = ratings.first()
231 if isinstance(first, Rating):
TypeError: Ratings should be represented by either an RDD or a DataFrame, but got <class 'list'>.
Please help!

take() is an action that takes the specified number of elements from the top of an RDD and transfers them to the driver program. What you get from it is a Python list with the requested elements, which:
is local to the driver, hence you should not take too many elements
doesn't have a map() method, simply because the Python list class has no map() method
What you most likely want to do is first apply the transformations to the data RDD and then take() from the transformed RDD:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))\
.take(2000)
You'll get a list of Rating instances.
Since you pass the data further down to ALS, which takes distributed data, i.e., an RDD, and not a driver-local list, you have three choices:
Parallelise the list again, turning it into an RDD:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))\
.take(2000)
ratingsRDD = sc.parallelize(ratings)
rank = 20
numIterations = 20
model = ALS.train(ratingsRDD, rank, numIterations)
Use the sample() method to sample a subset of the data in the RDD:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))\
.sample(False, 0.1, 42)
rank = 20
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
Here sample(False, 0.1, 42) means: sample without replacement (False), take approximately 10% of the original data, and use 42 as the seed of the pseudorandom number generator. Fixing the seed allows for reproducibility while testing. You should adjust 0.1 to the proper value so that you get about 2000 samples. Notice that those samples will be taken from random places inside the RDD and will most likely not be the first 2000.
Emulate take() while staying within the realm of RDDs:
data = sc.textFile("re_u.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))\
.zipWithIndex()\
.filter(lambda l: l[1] < 2000)\
.map(lambda l: l[0])
rank = 20
numIterations = 20
model = ALS.train(ratings, rank, numIterations)
zipWithIndex() creates tuples of the RDD content where the first element comes from the RDD and the second one is the index in the RDD (essentially, the line number). You can then filter only elements with index less than 2000 and then get rid of the index using map(lambda l: l[0]).
The second option, sample(), is probably the best one.
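For the sample() option, the fraction can be derived from the total row count instead of being guessed. The following is a minimal sketch in plain Python; the helper name sample_fraction is hypothetical, and the Spark calls appear only in comments because they need a running SparkContext.

```python
def sample_fraction(target_rows, total_rows):
    """Return the fraction to pass to RDD.sample() to get about target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Never ask for more than 100% of the data.
    return min(1.0, target_rows / total_rows)


# Hypothetical usage with the ratings RDD from the answer above:
#   total = ratings.count()
#   subset = ratings.sample(False, sample_fraction(2000, total), 42)
print(sample_fraction(2000, 100000))
```

Because sample() is probabilistic, the resulting subset will only be approximately 2000 rows long.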

Related

DataCollatorForMultipleChoice gives KeyError: 'labels' in trainer.train

I am working on multiple-choice QA. I am using the official huggingface/transformers notebook, which is implemented for the SWAG dataset.
I want to use it for other multiple-choice datasets, so I made some dataset-related modifications. All of the code is given in the notebook.
The SWAG dataset contains the following columns, including 'label':
train: Dataset({
features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
num_rows: 73546
})
The dataset that I want to use has the following columns, including 'answerKey' as the target:
train: Dataset({
features: ['id', 'question_stem', 'choices', 'answerKey'],
num_rows: 4957
})
The error is raised in the data collator, which is:
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        print(features[0].keys())
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
The error occurs in the following lines:
label_name = "label" if "label" in features[0].keys() else "labels"
labels = [feature.pop(label_name) for feature in features]
The error is raised when calling trainer.train():
KeyError Traceback (most recent call last)
<ipython-input-64-3435b262f1ae> in <module>()
----> 1 trainer.train()
5 frames
<ipython-input-60-d1262e974b03> in <listcomp>(.0)
18 print(features[0].keys())
19 label_name = "label" if "label" in features[0].keys() else "labels"
---> 20 labels = [feature.pop(label_name) for feature in features]
21 batch_size = len(features)
22 num_choices = len(features[0]["input_ids"])
KeyError: 'labels'
I don't know what causes the error. I think it is related to the target keys, but I could not solve it. Any ideas?
Thanks,

I got the same error and realised it was due to the lookup: the collator looks for either "label" or "labels" as a feature in your dataset.
If your answerKey column holds the label, you can rename that field to "label".
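The collator also converts the labels to an int64 tensor, so a renamed column has to contain integers. The following is a minimal sketch of deriving an integer "label" from "answerKey". It assumes the answer keys are the letters A to D, which the question does not show; the function name add_label and the sample record are hypothetical.

```python
def add_label(example, choices="ABCD"):
    """Return a copy of example with an integer 'label' derived from 'answerKey'."""
    updated = dict(example)
    # Assumption: answerKey is one of the letters in `choices`.
    updated["label"] = choices.index(example["answerKey"])
    return updated


# Hypothetical record shaped like the dataset in the question.
example = {"id": "q1", "question_stem": "sample question", "answerKey": "C"}
print(add_label(example)["label"])

# With the Hugging Face datasets library this could be applied per row:
#   dataset = dataset.map(add_label)
```

If the answer keys use another encoding, such as digits, the choices argument would need to change accordingly.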

Performing a Principal Component Analysis to reconstruct time series creates more values than expected

I want to do a Principal Component Analysis, following this notebook, to reconstruct the DJIA (I'm using alpha_vantage) from its components (found with Quandl). Yet when I reconstruct the values by multiplying the principal components by their weights, I end up with more values than the original dataframe has:
kernel_pca = KernelPCA(n_components=5).fit(df_z_components)
pca_5 = kernel_pca.transform(-daily_df_components)
weights = fn_weighted_average(kernel_pca.lambdas_)
reconstructed_values = np.dot(pca_5, weights)
Indeed, daily_df_components is built from the DJIA components returned by the Quandl API, which seems to have more data than alpha_vantage, the library I use to get the DJIA index.
Here is the full code:
"""
Obtaining the components data from quandl
"""
import quandl
QUANDL_API_KEY = 'MYKEY'
quandl.ApiConfig.api_key = QUANDL_API_KEY
SYMBOLS = [
'AAPL', 'MMM', 'BA', 'AXP', 'CAT',
'CVX', 'CSCO', 'KO', 'DD', 'XOM',
'GS', 'HD', 'IBM', 'INTC', 'JNJ',
'JPM', 'MCD', 'MRK', 'MSFT', 'NKE',
'PFE', 'PG', 'UNH', 'UTX', 'TRV',
'VZ', 'V', 'WMT', 'WBA', 'DIS'
]
wiki_symbols = ['WIKI/%s'%symbol for symbol in SYMBOLS]
df_components = quandl.get(
wiki_symbols,
start_date='2017-01-01',
end_date='2017-12-31',
column_index=11)
df_components.columns = SYMBOLS
filled_df_components = df_components.fillna(method='ffill')
daily_df_components = filled_df_components.resample('24h').ffill()
daily_df_components = daily_df_components.fillna(method='bfill')
"""
Download the all-time DJIA dataset
"""
from alpha_vantage.timeseries import TimeSeries
# Update your Alpha Vantage API key here...
ALPHA_VANTAGE_API_KEY = 'MYKEY'
ts = TimeSeries(key=ALPHA_VANTAGE_API_KEY, output_format='pandas')
df, meta_data = ts.get_intraday(symbol='DIA',interval='1min', outputsize='full')
# Finding eigenvectors and eigenvalues
from sklearn.decomposition import KernelPCA
fn_z_score = lambda x: (x - x.mean())/x.std()
df_z_components = daily_df_components.apply(fn_z_score)
fitted_pca = KernelPCA().fit(df_z_components)
# Normalise the eigenvalues so they sum to 1 (fitted_pca must exist first)
fn_weighted_average = lambda x: x/x.sum()
weighted_values = fn_weighted_average(fitted_pca.lambdas_)[:5]
# Reconstructing the Dow Average with PCA
import numpy as np
kernel_pca = KernelPCA(n_components=5).fit(df_z_components)
pca_5 = kernel_pca.transform(-daily_df_components)
weights = fn_weighted_average(kernel_pca.lambdas_)
reconstructed_values = np.dot(pca_5, weights)
# Combine PCA and Index to compare
df_combined = djia_2020_weird.copy()
df_combined['pca_5'] = reconstructed_values
But it returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-2808dc14f789> in <module>()
9 # Combine PCA and Index to compare
10 df_combined = djia_2020_weird.copy()
---> 11 df_combined['pca_5'] = reconstructed_values
12 df_combined = df_combined.apply(fn_z_score)
13 df_combined.plot(figsize=(12,8));
3 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
746 if len(data) != len(index):
747 raise ValueError(
--> 748 "Length of values "
749 f"({len(data)}) "
750 "does not match length of index "
ValueError: Length of values (361) does not match length of index (14)
Indeed, reconstructed_values has 361 values, while df_combined has only 14.
Here is this last dataframe:
DJI
date
2021-01-21 NaN
2021-01-22 311.37
2021-01-23 310.03
2021-01-24 310.03
2021-01-25 310.03
2021-01-26 309.01
2021-01-27 309.49
2021-01-28 302.17
2021-01-29 305.25
2021-01-30 299.20
2021-01-31 299.20
2021-02-01 299.20
2021-02-02 302.13
2021-02-03 307.86
Maybe the reason is that the notebook author was able to get the data for the whole year he was interested in, whereas when I fetch the data I only seem to get two weeks?

Ahoy there, I'm the author of the notebook. It seems Quandl no longer provides historical prices of DJIA after the time of writing, and copyright wasn't granted to redistribute the data. For research, you may consider other free stock tickers to proxy DJIA.
The example usages have been updated in the repo to demonstrate KernelPCA, as explained here.
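The ValueError in the question arises because assigning a NumPy array to a DataFrame column requires both to have the same length. The following is a minimal sketch, using synthetic data rather than the Quandl or Alpha Vantage downloads, of aligning the two series on their dates instead. All values and date ranges here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the objects in the question (assumptions).
component_dates = pd.date_range("2021-01-01", periods=40, freq="D")
reconstructed_values = np.arange(40, dtype=float)
index_dates = pd.date_range("2021-01-21", periods=14, freq="D")
df_index = pd.DataFrame({"DJI": np.linspace(300.0, 313.0, 14)}, index=index_dates)

# Give the reconstructed values the dates they were computed for,
# then keep only the dates present in both series.
pca_series = pd.Series(reconstructed_values, index=component_dates, name="pca_5")
df_combined = df_index.join(pca_series, how="inner")
print(len(df_combined))
```

This only helps when the two date ranges overlap. If the component prices and the index come from different periods, the inner join comes back empty, and the data source itself is the issue.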

NameError: name 'players_data' is not defined

I got this error. How do I define players_data?
NameError Traceback (most recent call last)
in
----> 1 data = np.vstack((asia[1:], eu[1:], na[1:], oc[1:], sea[1:], players_data[1:]))
2 df = pd.DataFrame({data[0, i]: data[1:, i] for i in range(data.shape[1])})
3 m = asfloat(data[1:, :4])
NameError: name players_data is not defined
asia = open_exl('pubg_as.xls', 0)
eu = open_exl('pubg_eu.xls', 0)
na = open_exl('pubg_na.xls', 0)
oc = open_exl('pubg_oc.xls', 0)
sea = open_exl('pubg_sea.xls', 0)
# Load all data
all_data = np.genfromtxt('PUBG_Player_Statistics.csv', delimiter=',')
all_data[:, 28] = all_data[:, 28] * 100
# Train data
train_data = all_data[1:2000, :][:, [3, 2, 28, 9]]
test_data = all_data[2000:, :][:, [3, 2, 28, 9]]
data = np.vstack((asia[1:], eu[1:], na[1:], oc[1:], sea[1:], players_data[1:]))
df = pd.DataFrame({data[0, i]: data[1:, i] for i in range(data.shape[1])})
m = asfloat(data[1:, :4])

It is hard to know what value was originally stored in players_data, as this is incomplete code from another user. However, based on what they were doing, my guess is that players_data is:
players_data = train_data
Why?
They used the k-means algorithm to create 6 clusters that represent the following categories:
['Normal player', 'Waller', 'Experienced Player', 'Both', 'God', 'Aimbot']
The first 5 variables passed to vstack hold information about the best players on the 5 servers. They wanted to use this information and combine it with data on "normal players".
In the end, they used neither "train_data" nor "test_data"; however, in the README.md, they mention the following:
The reason we mix data from the normal data set is to increase the density of normal players, and make the clustering tight. After test, we think 2000 rows of data has the best performance.
An important thing to notice is that, in the train and test data, they selected 4 columns:
[3, 2, 28, 9]
These are the same columns that they used in the "top performers" files:
def open_exl(address, idx):
    data = xlrd.open_workbook(address)
    table = data.sheets()[idx]
    rows = table.nrows
    ct_data = []
    for row in range(rows):
        ct_data.append(table.row_values(row))
    return np.array(ct_data)[:, :4]
Since the code has inconsistencies, it might not be possible to reproduce their results; however, it seems like a great opportunity to play with the data and compare your results with their earlier investigation.
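The column selection matters because np.vstack requires every stacked array to have the same number of columns. The following is a minimal sketch with made-up numbers: top_players stands in for the four-column arrays returned by open_exl, and all_data is a hypothetical 30-column table in place of the CSV.

```python
import numpy as np

# Stand-in for asia[1:], eu[1:], ... (already reduced to 4 columns).
top_players = np.array([[1.0, 2.0, 3.0, 4.0],
                        [5.0, 6.0, 7.0, 8.0]])

# Hypothetical full statistics table with 30 columns.
all_data = np.arange(60, dtype=float).reshape(2, 30)

# Select the same four columns as in the question so the shapes match.
players_data = all_data[:, [3, 2, 28, 9]]

data = np.vstack((top_players, players_data))
print(data.shape)
```

Without the column selection, vstack would raise a ValueError because 4-column and 30-column arrays cannot be stacked.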

Retrieve latent factors from pyspark matrix factorization model

Newbie to Spark and PySpark.
I am following the collaborative filtering tutorial here.
I was able to train the model. However, I don't know how to access the latent factors (vectors) corresponding to users and products.
Reproducing the top part of the code from the above link here:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)
How can I extract the latent factors from model?

Try:
model.userFeatures()
and
model.productFeatures()
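Both methods return an RDD of (id, factor vector) pairs. The following is a minimal sketch of how such pairs can be used once they have been collected to the driver. The pairs are hand-written stand-ins rather than output from a trained model, and predicted_rating is a hypothetical helper that takes the dot product of a user vector and a product vector.

```python
import numpy as np

# Stand-ins for model.userFeatures().collect() and
# model.productFeatures().collect(); the numbers are made up.
user_pairs = [(1, [0.5, 1.0]), (2, [1.5, -0.5])]
product_pairs = [(10, [2.0, 1.0]), (20, [0.0, 3.0])]

user_factors = {uid: np.array(vec) for uid, vec in user_pairs}
product_factors = {pid: np.array(vec) for pid, vec in product_pairs}


def predicted_rating(user_id, product_id):
    """Dot product of the user's and the product's latent factor vectors."""
    return float(user_factors[user_id] @ product_factors[product_id])


print(predicted_rating(1, 10))
```

Collecting the factors is only reasonable when the number of users and products is small enough to fit in driver memory.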

Spark mllib Collaborative Filtering, ValueError: RDD is empty

I'm new to Spark and am running the implicit collaborative filtering example from MLlib (linked here). When I run the following code on my data, I get this error:
ValueError: RDD is empty
Here is my data:
101,1000010,1
101,1000011,1
101,1000015,1
101,1000017,1
101,1000019,1
102,1000010,1
102,1000012,1
102,1000019,1
103,1000011,1
103,1000012,1
103,1000013,1
103,1000014,1
103,1000017,1
104,1000010,1
104,1000012,1
104,1000013,1
104,1000014,1
104,1000015,1
104,1000016,1
104,1000017,1
105,1000017,1
And my code:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
data = sc.textFile("s3://xxxxxxxxxxxx.csv")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(l[0], l[1], float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
alpha = 0.01
model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# convert pyspark pipeline to DF
ratesAndPreds.toDF().show()
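One possible cause, offered as an assumption rather than a confirmed diagnosis, is a key type mismatch in the join. The code above builds Rating objects from strings without int(), so the keys taken from ratings are strings. If the keys coming back from predictAll are integers, no pair matches, the join is empty, and toDF() has nothing to infer a schema from. The following plain-Python sketch emulates the join with dictionaries to show the effect.

```python
# Keys as produced by l.split(','): strings.
ratings = {("101", "1000010"): 1.0}
# Keys as a model might return them (assumed to be integers).
predictions = {(101, 1000010): 0.9}

# Emulate an inner join on the key.
joined = {k: (ratings[k], predictions[k])
          for k in ratings.keys() & predictions.keys()}
print(len(joined))

# Converting the ids to int first makes the keys comparable.
fixed_ratings = {(int(u), int(p)): r for (u, p), r in ratings.items()}
joined_fixed = {k: (fixed_ratings[k], predictions[k])
                for k in fixed_ratings.keys() & predictions.keys()}
print(len(joined_fixed))
```

In the PySpark code this would correspond to building each record as Rating(int(l[0]), int(l[1]), float(l[2])), as the other examples on this page do.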
