Spark: How to get value for each day in interval?

I have a table with values and a date starting from which this value is valid:
param    validfrom   value
param1   01-01-2022  1
param2   03-01-2022  2
param1   05-01-2022  11
param1   07-01-2022  1
I need the value of each parameter for each day up to a specified date.
For example, for 06-01-2022 I have to get:
param    validfrom   value
param1   01-01-2022  1
param1   02-01-2022  1
param1   03-01-2022  1
param1   04-01-2022  1
param1   05-01-2022  11
param1   06-01-2022  11
param2   03-01-2022  2
param2   04-01-2022  2
param2   05-01-2022  2
param2   06-01-2022  2
In other words, I need to get rows for the missing dates and fill their values with the previous value.
I use a window function to get the last value, like this:
// Partition by the composite key columns (CompositeKey is a sequence of column names)
val windowPartitionByCompositeKey = Window.partitionBy(CompositeKey.map(col): _*)

spark.table("table")
  .where($"$ValidFrom" <= repDt)  // keep only rows valid on or before the report date
  .distinct()
  // find the latest validfrom per key and keep only those rows
  .withColumn("max_validfrom", max($"$ValidFrom").over(windowPartitionByCompositeKey))
  .where($"$ValidFrom" === $"max_validfrom")
  .drop("max_validfrom")
  .show()
and I get the following result:

param    validfrom   value
param2   03-01-2022  2
param1   05-01-2022  11
But I need the value of each param on each day up to the specified date.
How can I get this?
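One possible approach (a sketch, not from the thread, written in PySpark rather than the asker's Scala, and assuming validfrom is, or has been cast to, a DateType column): generate one row per param per day with sequence/explode, then forward-fill the last known value with last(..., ignorenulls=True) over a window ordered by date.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
rep_dt = "2022-01-06"  # the specified report date

src = spark.table("table").where(F.col("validfrom") <= F.lit(rep_dt).cast("date"))

# One row per param per day, from the param's first validfrom up to rep_dt
# (sequence over dates steps by one day by default).
days = (src.groupBy("param")
           .agg(F.min("validfrom").alias("start"))
           .select("param",
                   F.explode(F.sequence(F.col("start"),
                                        F.lit(rep_dt).cast("date"))).alias("validfrom")))

# Join the known values back in and carry the last non-null value forward.
w = Window.partitionBy("param").orderBy("validfrom")
result = (days.join(src, ["param", "validfrom"], "left")
              .withColumn("value", F.last("value", ignorenulls=True).over(w)))
result.orderBy("param", "validfrom").show()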

Related

To extract distinct values for all categorical columns in dataframe

I have a situation where I need the number of distinct values for each of the categorical columns in my data frame.
The dataframe looks like this:
Gender Function Segment
M IT LE
F IT LM
M HR LE
F HR LM
The output should give me the following:
Variable_Name Distinct_Count
Gender 2
Function 2
Segment 2
How to achieve this?
Use nunique, then pass the resulting Series into a new dataframe and set the column names:
df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
print(df_unique)
Variable DistinctCount
0 Gender 2
1 Function 2
2 Segment 2
This is not elegant, yet it won't fail to produce the expected output:

import pandas as pd

new_data = {'Variable_Name': [], 'Distinct_Count': []}
for i in list(df):
    new_data['Variable_Name'].append(i)
    new_data['Distinct_Count'].append(df[i].nunique())
new_df = pd.DataFrame(new_data)
print(new_df)
Output:
Variable_Name Distinct_Count
0 Gender 2
1 Function 2
2 Segment 2

How to put datetime data into a subset of a dataframe while keeping the data type?

I have a dataframe with names as the index and a column of birth dates, e.g.
> df_birthdate
date
Paul 2009-03-07
Peter 2000-06-23
Pauline 2001-03-03
Paula 2002-02-17
> type(df_birthdate.date[0])
pandas._libs.tslibs.timestamps.Timestamp
> df_huge = pd.DataFrame({'School': ['A','A','A','A','B','B','B','B']})
> df_huge['new_date'] = ''
> idx_t = df_huge.School == 'A'
And I have a huge dataframe called df_huge that I want to put the dates into. I know that the order won't change.
df_huge.loc[idx_t, "new_date"] = df_birthdate.values
The above code works for me in most cases; however, when the 'date' column is in datetime format, after applying .values the data I put into df_huge is no longer in datetime format. Any suggestion for putting 'date' from df_birthdate into a specific location of df_huge? Many thanks.
You can omit df_huge['new_date'] = '': pre-filling with empty strings forces the column to object dtype, so the assigned timestamps lose their datetime type. Without it, the column is created as datetime64[ns]:
idx_t = df_huge.School == 'A'
df_huge.loc[idx_t, "new_date"] = df_birthdate.to_numpy()
print (df_huge)
School new_date
0 A 2009-03-07
1 A 2000-06-23
2 A 2001-03-03
3 A 2002-02-17
4 B NaT
5 B NaT
6 B NaT
7 B NaT
print (df_huge.dtypes)
School object
new_date datetime64[ns]
dtype: object

How to apply a formula in pandas

I am trying to apply a formula to a column but am not able to.
I have data in a dataframe:

Date                  2018-04-16 00:00:00
Quantity              8317.000
Total Value (Lacs)    259962.50
I want to apply a formula to the Total Value (Lacs) column using pandas.
The formula is: [Total Value (Lacs) multiplied by 100000] divided by [Quantity (000's) multiplied by 100].
I have tried something like:
a = df['Total Value (Lacs)']
b = df['Quantity']
c = (a * 100000 / b * 100)
print (c)
or
df['Price'] = ((df['Total Value (Lacs)']) * 100000 / (df['Quantity']) * 100)
print (df)
error:
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Edit
I have tried the code below:
df['Price'] = float((float(df['Total Value (Lacs)'])) * 100000 / float((df['Quantity'])) * 100)
but I am getting the wrong value:
price 312567632.6
expecting:
price 31256.76326
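A side note not from the thread, though the arithmetic is easy to check: the wrong value above is off by a factor of exactly 10,000 because of operator precedence. a * 100000 / b * 100 evaluates left to right as (a * 100000 / b) * 100, so the denominator of the stated formula needs its own parentheses:

a = 259962.50   # Total Value (Lacs), from the sample row
b = 8317.000    # Quantity

print(a * 100000 / b * 100)    # 312567632.56..., i.e. (a * 100000 / b) * 100, the wrong value
print(a * 100000 / (b * 100))  # 31256.76326..., the expected price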
Edit 1
A TypeError means that you've tried to apply the operator / to two strings. There's no such operator defined for the str type in Python, so you should convert your data to some numeric type, float in your case.
I didn't understand exactly what your data looks like. But if it's like this:
df
Out:
Date                 Quantity  Total Value (Lacs)
2018-04-16 00:00:00  8317.000  259962.50
2018-04-17 00:00:00  7823.000  234004.50
you can convert all the columns to the correct numeric type (I suppose that the Date column is the index):

df_float = df.apply(pd.to_numeric)
df_float.dtypes
Out:
Quantity              float64
Total Value (Lacs)    float64
dtype: object
After that, you can just work with the columns:

df['Price'] = (df_float['Total Value (Lacs)'] * 100000
               / df_float['Quantity'] * 100)
df['Price']
Out:
2018-04-16 00:00:00 319930.7592441217
2018-04-17 00:00:00 334309.8102814262
Another approach is to define a function and apply it to each row with pd.DataFrame.apply:
def get_price(row):
    try:
        price = (float(row['Total Value (Lacs)']) * 100000
                 / float(row['Quantity']) * 100)
    except (TypeError, ValueError):  # if there's bad data in this row, we can't convert to float
        price = None
    return price

df['Price'] = df.apply(get_price, axis=1)
df['Price']
Out:
2018-04-16 00:00:00 319930.7592441217
2018-04-17 00:00:00 334309.8102814262
axis=1 means "apply to each row".
If you have transposed data, as in your example, you should transpose it back, or apply the function to each column using axis=0.
Edit 2:
Looks like your data is just a single column with dtype pd.Series, so if you select a row with data['Quantity'], you'll get something like '8317.000' of type str. The row-wise DataFrame.apply approach above won't work on a Series, of course. So, in that case you may act this way:
index_to_convert = ['Quantity', 'Total Value (Lacs)']
data[index_to_convert] = pd.to_numeric(data[index_to_convert])
so only those entries are converted to numeric. Then just apply the formula:
data['Price'] = (data['Total Value (Lacs)'] * 100000
                 / data['Quantity'] * 100)
data
Out:
Date 2018-04-16 00:00:00
Quantity 8317
Total Value (Lacs) 259962
Price 3.12568e+08
But in most cases this solution is not so handy; I strongly advise converting your data to a DataFrame and working with that, because a DataFrame provides more flexibility and capabilities. The conversion process:
df = data.to_frame().T.set_index('Date')
There are three consecutive actions:
Convert your data into a DataFrame
Transpose it (so the entries become columns)
Set "Date" as the index column
Results:
df
Out:
Quantity Total Value (Lacs)
Date
2018-04-16 00:00:00 8317.00 259962.50
After the previous steps you can apply the Edit 1 code to your data. It is also applicable when there is more than one record in your data.
One more thing: if your data has more than one value for each index entry, i.e. multiple quantities etc.:
data
Out:
Date 2018-04-16 00:00:00
Quantity 8317.00
Total Value (Lacs) 259962.50
Date 2018-04-17 00:00:00
Quantity 6434.00
Total Value (Lacs) 230002.50
You can also convert it into a pd.DataFrame, step by step.
Group your data by the index entries and aggregate each group into a list:
data.groupby(level=0).apply(list)
Out:
Date [2018-04-16 00:00:00, 2018-04-17 00:00:00]
Quantity [8317.00, 6434.00]
Total Value (Lacs) [259962.50, 230002.50]
Then apply pd.Series to each row:
data.groupby(level=0).apply(list).apply(pd.Series)
Out:
                                      0                    1
Date                2018-04-16 00:00:00  2018-04-17 00:00:00
Quantity                        8317.00              6434.00
Total Value (Lacs)            259962.50            230002.50
Transpose the returned DataFrame and set the 'Date' column as the index:
data.groupby(level=0).apply(list).apply(pd.Series).T.set_index('Date')
Out:
Quantity Total Value (Lacs)
Date
2018-04-16 00:00:00 8317.00 259962.50
2018-04-17 00:00:00 6434.00 230002.50
Apply the solution from Edit 1.
Hope it helps!
You are getting this error because the data extracted from the dataframe are strings, as shown in your error; you will need to convert the strings into floats.
Convert your dataframe to values instead of strings. You can achieve that with:
values = df.values
Then you can extract the values from this array.
Alternatively, after extracting data from the dataframe, convert it to float using:
b = float(df['Quantity'])
Use this:
df['price'] = (df['Total Value (Lacs)'].apply(pd.to_numeric) * 100000
               / df['Quantity'].apply(pd.to_numeric) * 100)

Return string if values exists in column

I have one column that contains text such as,
column1
3
4
5
6
7
8
9.2
10
11
txt1
txt2
I want to create a new column2 that gives me the following output.
column1 column2
3 3-6
4 3-6
5 3-6
6 3-6
7 7-10
8 7-10
9.2 7-10
10 7-10
11 11
txt1 txt1
txt2 txt2
I have tried the following DAX function, but I can't get it to work, as it only returns the "value if false". The format of my Column1 is text.
column2 = IF(CONTAINS(Table1;Table1[column1];"3";Table1[Column1];"4");"3-8";"9.5-10").........
I have tried the FIND function as well, without luck.
Does someone have any tips? If someone knows how to do this in Excel, perhaps it could be figured out that way? :D
/D
I'm not sure exactly what your logic for bucketing values is, but you should be able to write something along these lines:
Column2 = SWITCH(TRUE(),
    ISERROR(VALUE(Table1[Column1])), Table1[Column1],
    VALUE(Table1[Column1]) >= 3 && VALUE(Table1[Column1]) <= 6, "3-6",
    VALUE(Table1[Column1]) >= 7 && VALUE(Table1[Column1]) <= 10, "7-10",
    Table1[Column1])
This SWITCH function returns the result for the first condition that evaluates to true; otherwise, it returns the last argument. The first pair checks whether the value can be converted to a number and, if not, returns the original value. The next two pairs check whether the number falls in certain ranges and return the specified strings for those ranges.
Here's a link that explains the SWITCH(TRUE()...) construction in more detail:
https://powerpivotpro.com/2015/03/the-diabolical-genius-of-switch-true/
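Since the asker also invited a non-DAX route, here is the same bucketing logic sketched in pandas; this is a hypothetical translation for comparison, not from the thread:

import pandas as pd

def bucket(v):
    # Mirror the SWITCH(TRUE(), ...) logic: non-numeric values pass through.
    try:
        x = float(v)
    except ValueError:
        return v
    if 3 <= x <= 6:
        return "3-6"
    if 7 <= x <= 10:
        return "7-10"
    return v

df = pd.DataFrame({"column1": ["3", "4", "5", "6", "7", "8",
                               "9.2", "10", "11", "txt1", "txt2"]})
df["column2"] = df["column1"].apply(bucket)
print(df)  # reproduces the desired column2 from the question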

How to get occurrences of string lengths for a column of strings in postgresql

Given a text/string column as input, I want to compute the lengths of the strings in the column, and the count for each length.
E.g. a column with the strings:
'Ant'
'Cat'
'Dog'
'Human'
''
NULL
'A human'
Would give:
0 : 1
3 : 3
5 : 1
7 : 1
NOTE: the NULL value hasn't been counted as a 0-character string; it is ignored.
length() comes to mind:
select length(col), count(*)
from t
where col is not null
group by length(col)
order by length(col);
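For the sample column above this returns, matching the expected output in the question:

 length | count
--------+-------
      0 |     1
      3 |     3
      5 |     1
      7 |     1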
