Matrix in row of RDD to DataFrame

First off, sorry for the long-winded explanation.
Hi there, I am trying to convert some data (in the form of an RDD) to a DataFrame, but it's a bit more complex than just that.
I have an RDD where each item is a Row() with a matrix (a list of lists) called features and a list called labels.
I want to convert this RDD to a DataFrame where each row holds a single list of features and a scalar label. As you can see, the problem arises because the features in the RDD are matrices rather than vectors.
Thanks,

The solution was to flatMap the features and labels for each row (at the RDD level):
flatMap(lambda row: [(f, l) for f, l in zip(row.feature, row.label)])
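A minimal end-to-end sketch of that approach (assuming a SparkSession named spark and Row fields named features and labels; adjust the field names to match your data):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Each item: a 2x2 feature matrix paired with two labels.
rdd = spark.sparkContext.parallelize([
    Row(features=[[1.0, 2.0], [3.0, 4.0]], labels=[0, 1]),
])

# Pair each matrix row with its label, emitting one output record per pair.
flat = rdd.flatMap(lambda row: [(f, l) for f, l in zip(row.features, row.labels)])
df = flat.toDF(["features", "label"])
# df now holds two rows: ([1.0, 2.0], 0) and ([3.0, 4.0], 1)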

Related

How to create a single-row pandas DataFrame with headers from a big pandas DataFrame

I have a DataFrame which contains 55000 rows and 3 columns.
I want to return every row of this big DataFrame as a DataFrame, to use it as a parameter for different functions.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get a DataFrame that way; it returns a Series. How could I solve this?
I think this is not really necessary, because the index of each returned Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is obviously very slow.
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a DataFrame:
l = [pd.DataFrame(df.iloc[i]).T for i in range(len(df))]
This makes a list with one single-row DataFrame per row in df (the .T keeps the original columns as headers instead of producing a one-column frame).
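For reference, indexing with a list also returns a one-row DataFrame directly, without the Series round-trip:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

row_df = df.iloc[[1]]   # one-row DataFrame, columns preserved as headers
row_ser = df.iloc[1]    # the same row as a Series (its index equals df.columns)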

How to multiply a vector with some specific columns in a DataFrame in python?

I am working with an LR model for predictive modeling. After fitting, I get some parameters that need to be used to generate predicted values.
Say I received the parameter vector a = [a_0, a_1, a_2] after using only two of the variables to fit the best model, while the data frame has more than three variables. Now, I want to multiply the vector a with the corresponding columns.
The Excel formula would be something like:
$A$2 + J3*$A$3 + L3*$A$4
or, more generally:
e = 2 + 8.9*b + 7*d
I tried to get the parameters into an array and then multiply. So far I can do it if the number of columns is the same as the length of the array, but I have no idea how to do it for specific columns only.
Here df_1 is the actual DataFrame, df_2 is the desired DataFrame, and a is the vector that we want to multiply with df_1.
It's better to use NumPy arrays for such operations. If you already have a DataFrame, you can pull the NumPy arrays out of it as below and perform the operations:
import numpy as np

np_a = a.values                    # parameter vector as an array
np_df1 = df_1[["b", "d"]].values   # only the columns the model uses
df_2 = df_1.copy()
df_2["e"] = np.sum(np_a[1:] * np_df1, axis=1) + np_a[0]   # intercept + weighted sum

How to fillna() all columns of a dataframe from a single row of another dataframe with identical structure

I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n columns that represent observed values in those time buckets, e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, each yhat_df column value is simply the last observed value from the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd
yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is a simpler way, especially one that does not require me to go column by column.
I tried the following, but it only populates the column values where the PeriodIndex values match. It seems fillna() internally attempts a join() of sorts on the index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore the index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the DataFrame directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
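A minimal sketch of the dictionary-based approach with made-up data (the column names here are just examples):

import pandas as pd

train_idx = pd.period_range("2020-01-01", periods=3, freq="D")
test_idx = pd.period_range("2020-01-04", periods=2, freq="D")
train_df = pd.DataFrame({"Sales": [10, 11, 12], "Price": [1.0, 1.1, 1.2]}, index=train_idx)
test_df = pd.DataFrame(index=test_idx, columns=train_df.columns)

# Last observed training row as {column: value}, broadcast over the test index.
last_obs = train_df.tail(1).to_dict("records")[0]   # {'Sales': 12, 'Price': 1.2}
yhat_df = pd.DataFrame(last_obs, index=test_df.index, columns=test_df.columns)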

Spark Dataset: how to change alias of the columns after a flatmap?

I have two spark datasets that I'm trying to join. The join keys are nested in dataset A, so I must flatmap it out first before joining with dataset B. The problem is that as soon as I flatmap that field, the column name becomes the default "_1", "_2", etc. Is it possible to change the alias somehow?
A.flatMap(a => a.keys).join(B).where(...)
After applying a transformation like flatMap or map, you lose the column names. This is logical, because such a transformation does not guarantee that the number of columns, or the datatype inside each column, remains the same; that's why the column names fall back to the defaults.
What you can do is fetch all the previous column names and then apply them to the resulting dataset, like this:
val columns = A.columns
A.flatMap(a => a.keys).toDF(columns: _*).join(B).where(...)
Note that this will only work if the number of columns is the same after applying the flatMap.
Hope this clears up your issue.
Thanks
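For reference, the same renaming pattern in PySpark (a sketch with hypothetical data; the important part is passing the desired names to toDF):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each element carries a list of (key, value) tuples; flatMap flattens them,
# which would otherwise leave the default column names _1 and _2.
rdd = spark.sparkContext.parallelize([[("k1", "v1"), ("k2", "v2")]])
df = rdd.flatMap(lambda pairs: pairs).toDF(["key", "value"])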

How to create subsets of an RDD with column wise splitting in Pyspark?

I have a large dataset as one RDD. I want to create about 100 column-wise subsets of this RDD, so that I can run a map transformation on each subset separately in a loop.
My RDD looks for example like this:
(1,2,3,...,1000)
(1,2,3,...,1000)
(1,2,3,...,1000)
I want a column wise split, for example 10 splits, so one subset should look like this:
(1,2,3,...,100)
(1,2,3,...,100)
(1,2,3,...,100)
How can I do that in PySpark?
You can use range and a loop:
for i in range(0, 1000, 100):
    # Bind i as a default argument so each lambda captures the current
    # slice offset rather than the loop variable's final value.
    rdd.map(lambda row, i=i: row[i:i + 100]).someOtherOperation(...)
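A minimal runnable sketch of the slicing itself (assuming a SparkContext named sc and 1000-element rows, as in the question):

# Three rows of (1, 2, 3, ..., 1000), as in the question.
rdd = sc.parallelize([tuple(range(1, 1001)) for _ in range(3)])

for i in range(0, 1000, 100):
    subset = rdd.map(lambda row, i=i: row[i:i + 100])
    # The first slice yields rows like (1, 2, 3, ..., 100).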
