Extracting array column from spark dataframe - apache-spark

My Spark dataframe has an array column, and I have to generate new columns by extracting data from that single array column. Are there any methods available for this?
id Amount
10 [Tax:10,Total:30,excludingTax:20]
11 [Total:30]
12 [Tax:05,Total:35,excludingTax:30]
I have to generate this dataframe.
ID Tax Total
10 10 30
11 0 30
12 05 35

If you know for sure that [Tax:10,Total:30,excludingTax:20] are the only fields, always in the same order, you can map over the entire dataframe and extract them as Amount[0], Amount[1], and so on.
Then assign them to an instance of a case class and finally convert back to a dataframe.
The only thing you have to be careful about is not calling Amount[3] when Amount has only 2 values; that is easily handled by checking the array length.
Alternatively, if you don't know the order, the best approach is a JSON RDD: loop through the JSON objects, parse them, create a new row, and finally convert that back to a dataframe. A rough PySpark sketch of the key-based idea is shown below.
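Purely as an illustration (not the case-class route described above), here is a minimal PySpark sketch that side-steps the ordering problem by turning the array of "key:value" strings into a map column. It assumes Amount really is an array<string> shaped like the sample, and that Spark 3.1+ is available for transform with a Python lambda.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question (assumed schema: array of "key:value" strings)
df = spark.createDataFrame(
    [(10, ["Tax:10", "Total:30", "excludingTax:20"]),
     (11, ["Total:30"]),
     (12, ["Tax:05", "Total:35", "excludingTax:30"])],
    ["id", "Amount"],
)

# Turn each "key:value" string into a (key, value) struct, then into a map
kv = F.map_from_entries(
    F.transform("Amount", lambda s: F.struct(F.split(s, ":")[0], F.split(s, ":")[1]))
)

result = df.select(
    "id",
    F.coalesce(kv["Tax"], F.lit("0")).alias("Tax"),  # default to 0 when Tax is absent
    kv["Total"].alias("Total"),
)
result.show()
```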

Related

Converting simple returns to monthly log returns

I have a pandas DataFrame with simple daily returns. I need to convert it to monthly log returns and add a column to the current DataFrame. I have to use np.log to compute the monthly return, but I can only compute daily log returns. Below is my code.
df['return_monthly'] = np.log(df['Simple Daily Returns'] + 1)
The code only produces daily log returns. Are there any particular methods I should be using in the above code to get monthly returns?
Please see my input pandas DataFrame; the third column in the Excel screenshot is the expected output.
The question is a little confusing, but it seems like you want to group the rows by month. This can be done with DataFrame.resample if you have a datetime index, DataFrame.groupby, or DataFrame.pivot.
Here is a simple implementation; let us know if this isn't what you're looking for. Note that your values are less than 1, so the logs are negative; you can adjust as needed. I aggregated the months with sum, but there are many other aggregation functions such as mean(), median(), size() and more; see the pandas documentation for a full list of aggregation functions.
import pandas as pd
import numpy as np
# create a dataframe with 1220 values that match your dataset
df = pd.DataFrame({
    'Date': pd.date_range(start='1/1/2019', end='5/4/2022', freq='1D'),
    'Return': np.random.uniform(low=1e-6, high=1.0, size=1220)  # avoid log(0), which returns NaN
}).set_index('Date')  # set the index to the date so we can use resample
df['Log_return'] = np.log(df['Return'])  # daily log returns
df = df.resample('M').sum()              # aggregate each month with sum; output shown below
Return Log_return
Date
2019-01-31 14.604863 -33.950987
2019-02-28 13.118111 -32.025086
2019-03-31 14.541947 -32.962914
2019-04-30 14.212689 -33.684422
2019-05-31 14.154918 -33.347081
2019-06-30 10.710209 -43.474120
2019-07-31 12.358001 -43.051723
2019-08-31 17.932673 -30.328784
...

Arithmetic operations for groups within a dataframe

I have loaded multiple CSVs (time series) to create one dataframe. This dataframe contains data for multiple stocks. Now I want to calculate the 1 month return for all the datapoints.
There are 172 datapoints for each stock, i.e. from index 0 to 171. The time series for the next stock starts from index 0 again.
When I try to calculate the 1 month return, it gets calculated correctly for all datapoints except index 0 of each new stock, because it takes the difference with index 171 of the previous stock.
I want the return to be calculated per stock name, so I tried a for loop, but it doesn't seem to work.
E.g. in the attached image (highlighted) the 1 month return mixes company ITC with SHREECEM. I expect that for SHREECEM the first value of 1Mreturn should be NaN.
Using groupby instead of a for loop you can get the result you want:
Mreturn_function = lambda df: df['mean_price'].diff(periods=1)/df['mean_price'].shift(1)*100
gw_stocks.groupby('CompanyName').apply(Mreturn_function)
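A hedged alternative sketch, assuming the same gw_stocks, CompanyName and mean_price names as above: GroupBy.pct_change computes the same per-group percentage change in one step and automatically leaves NaN in the first row of each company.

```python
# Same calculation via pct_change; the first row of each company comes out as NaN.
gw_stocks['1Mreturn'] = gw_stocks.groupby('CompanyName')['mean_price'].pct_change() * 100
```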

How to Solve "IndexError: single positional indexer is out-of-bounds" With DataFrames of Varying Shapes

I have checked the other posts about IndexError: single positional indexer is out-of-bounds but could not find solutions that explain my problem.
I have a DataFrame that looks like:
Date Balance
0 2020-01-07 168.51
1 2020-02-07 179.46
2 2020-03-07 212.15
3 2020-04-07 221.68
4 2020-05-07 292.23
5 2020-06-07 321.61
6 2020-07-07 332.27
7 2020-08-07 351.63
8 2020-09-07 372.26
My problem is I want to run a script that takes in a DataFrame like the one above and returns the balance of each row using something like df.iloc[2][1]. However, the DataFrame can be anywhere from 1 to 12 rows in length. So if I call df.iloc[8][1] and the DataFrame is less than 9 rows in length, I get the IndexError.
If I want to return the balance for every row using df.iloc[]... how can I handle the index errors without using 12 different try and except statements?
Also, the problem is simplified here and the DataFrame can get rather large, so I want to stay away from looping if possible.
Thanks!!
My solution was to loop over the length of the DataFrame and append each balance to a list, then pad the list to a length of 12 with NaN values.
import numpy as np

N = 12
num_months = len(df)
list_balance_months = []
for month in range(num_months):
    list_balance_months.append(df.iloc[month][1])  # column 1 holds the Balance
list_balance_months += [np.nan] * (N - len(list_balance_months))
(balance_month_1, balance_month_2, balance_month_3, balance_month_4,
 balance_month_5, balance_month_6, balance_month_7, balance_month_8,
 balance_month_9, balance_month_10, balance_month_11,
 balance_month_12) = list_balance_months
With this solution, if balance_month_11 is used and the DataFrame only has 4 months of data, it holds np.nan (nan) instead of raising an IndexError.
Please let me know if you can think of a simpler solution!
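One possibly simpler sketch, assuming the default 0..n-1 index and the Balance column from the question: reindex pads the missing positions with NaN in a single step.

```python
# Positions 0 through 11; rows that don't exist come back as NaN.
balances = df['Balance'].reindex(range(12)).tolist()
balance_month_1 = balances[0]  # NaN rather than an IndexError when rows are missing
```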

Restructuring pyspark dataframe

I am solving a regression problem. For that I have clustered the data first and applied a regression model to each cluster. Now I want to implement another regression model that takes the predicted output of each cluster as a feature and outputs the aggregated predicted value.
I have already implemented the clustering and regression models in pyspark, but I am not able to extract the output of each cluster as a feature for input to another regression model.
How can this conversion be achieved efficiently in pyspark (preferably) or pandas?
Current dataframe :
date cluster predVal actual
31-03-2019 0 14 13
31-03-2019 1 24 15
31-03-2019 2 13 10
30-03-2019 0 14 13
30-03-2019 1 24 15
30-03-2019 2 13 10
Required dataframe
date predVal0 predVal1 predVal2 actual
31-03-2019 14 24 13 38 // 13+15+10
30-03-2019 14 24 13 38 // 13+15+10
You want to do a pivot in PySpark and then add the actual column by summing it per date. You should proceed in three steps.
First, apply a pivot: your index is the date, the column to pivot on is cluster, and the value column is predVal.
from pyspark.sql.functions import first
df_pivot = df.groupBy('date').pivot('cluster').agg(first('predVal'))
Then, you should apply a sum
df_actual = df.groupBy('date').sum('actual')
At the end, you can join the actual column with the pivoted data on the date column:
df_final = df_pivot.join(df_actual, ['date'])
This link answers your question pretty well:
- https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
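Putting the three steps together, a consolidated sketch (assuming the column names from the sample dataframe; the predVal{i} renaming is purely illustrative) might look like this:

```python
from pyspark.sql import functions as F

# Step 1: pivot predVal by cluster, one row per date
df_pivot = df.groupBy('date').pivot('cluster').agg(F.first('predVal'))

# pivot names the new columns after the cluster values ("0", "1", "2"),
# so rename them to predVal0, predVal1, ...
for c in df_pivot.columns:
    if c != 'date':
        df_pivot = df_pivot.withColumnRenamed(c, f'predVal{c}')

# Step 2: sum the actual values per date
df_actual = df.groupBy('date').agg(F.sum('actual').alias('actual'))

# Step 3: join the two on date
df_final = df_pivot.join(df_actual, ['date'])
df_final.show()
```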

Compare cell in excel

I have some data like the below in one column.
Value
-----
A#show
20
20
B#show
20
25
30
C#show
10
10
10
10
D#show
10
E#show
10
20
I want to compare the values below each cell whose text ends with "show"; if there is only one value, no comparison is needed.
Value Comparison
----------------------
A#show Same
20
20
B#show different
20
25
30
C#show same
10
10
10
10
D#show only one
10
E#show different
10
20
I think it should be possible using a VBA script.
It's a bit unclear what you're trying to compare between values. However, there is a way to do this without VBA.
1) In the second column, create a "Header" column which names the header each value belongs to. The first entry would just be A#show, and the following rows (assuming the values are in column A and this formula sits in column B) would be:
=IFERROR(IF(A2*1>0,B1),A2)
2) In the third column, you can use COUNTIF to see if the header has more than 2 entries (indicating it has values to compare). Here is where you can apply whatever comparative metric you'd like. If it's something unformulaic, just use a pivot table with the 3 columns.
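For reference only (the question asks about Excel/VBA), the same grouping-and-comparison logic can be sketched in pandas; a single 'Value' column is assumed to hold both the "#show" headers and the numbers.

```python
import pandas as pd

s = pd.Series(['A#show', 20, 20, 'B#show', 20, 25, 30, 'C#show', 10, 10, 10, 10,
               'D#show', 10, 'E#show', 10, 20], name='Value')

is_header = s.astype(str).str.endswith('#show')
header = s.where(is_header).ffill()        # step 1: propagate each header down
values = s[~is_header].astype(float)       # the numeric rows only

def compare(group):
    # step 2: "only one" if a single value, else same/different
    if len(group) < 2:
        return 'only one'
    return 'same' if group.nunique() == 1 else 'different'

print(values.groupby(header[~is_header]).apply(compare))
```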
