How to find the lowest number in an array - apache-spark

This is the data frame. I'm trying to get the lowest number in the key column and create an additional column showing that lowest value. I tried the following, but it isn't working.
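import org.apache.spark.sql.functions.{col, udf}
// Assuming key is an array<string> column: Spark passes it to a Scala UDF as Seq[String]
val minUdf = udf((arr: Seq[String]) => {
  val filtered = arr.filterNot(_ == "")  // drop empty strings before converting
  if (filtered.isEmpty) 0
  else filtered.map(_.toInt).min
})
dat1.select(col("placekey"), minUdf(col("key")).as("lowest_value")).show(false)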
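The per-row logic the UDF is aiming for can be sketched in plain Python, independent of Spark. This is only an illustration of the rule (ignore empty strings, return 0 when nothing is left, otherwise the smallest integer); the function name `lowest_value` is made up for the example, and treating a missing array as empty is an added assumption.

```python
def lowest_value(arr):
    """Smallest integer among the non-empty strings in arr, or 0 if there are none."""
    # Treat a missing (None) array the same as an empty one: an assumption, not in the original
    filtered = [s for s in (arr or []) if s != ""]
    return min(int(s) for s in filtered) if filtered else 0


example = lowest_value(["7", "", "3", "12"])
```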

Related

Python function to perform calculation among each group of data frame

I need a function that performs the action described below.
The dataset is:
The expected output is the value in the 'Difference' column; the remaining columns are inputs.
Please note that within each group we first need to identify the maximum 'Closing_time'; the corresponding amount is the maximum value for that period. Each row's value is then subtracted from the maximum value detected for the previous period, and the result is the difference for that cell.
If a record has no previous period, the max value is NA and the difference is NA for all records in that period.
Additional points: within each group (Cost_centre, Account, Year, Month), Closing_time values are ordered so that D-0 00 CST is the minimum and D-0 18 CST is the maximum within D-0; similarly, across D-0, D+1, D+3 etc., D+3 is the maximum.
I first tried to find whether a previous period exists for each group, then to find the maximum time within each period and the corresponding amount.
Then, using that maximum value, I tried to subtract each record's Amount from it,
but I can't work out how to implement this. Kindly help.
After posting the above question I came up with this solution.
I split it into 3 parts:
a) First find the previous year and month for each cost_centre and account.
b) Find the maximum Closing_time within each group of cost_centre, account, year and month. Then pick the corresponding Amount value as the amount.
c) Using the amount from b, subtract the current amount from it to get the difference.
import numpy as np
import pandas as pd
def prevPeriod(df):
    # Build a (year, month) tuple for the period preceding each row
    period = []
    for i in range(df.shape[0]):
        if df['Month'][i] == 1:
            # January rolls back to December of the previous year
            val_year = df['Year'][i] - 1
            val_month = 12
        else:
            val_year = df['Year'][i]
            val_month = df['Month'][i] - 1
        period.append((val_year, val_month))
    print(period)
    df['Previous_period'] = period
    return df
def max_closing_time(group_list):
    # Return the latest Closing_time in the group, e.g. 'D+3 18 CST'.
    # Each value is assumed to look like 'D-0 00 CST' or 'D+1 18 CST'.
    def sort_key(item):
        day, hour = item.replace('CST', '').replace('D', '').split()
        return int(day), int(hour)
    # Compare on (day, hour) together, so the hour is only compared within the
    # latest day; returning the original string keeps it comparable with
    # df['Closing_time'] (including 'D-0' and zero-padded hours).
    return max(group_list, key=sort_key)
def calculate_difference(df):
    diff = []
    for i in range(df.shape[0]):
        prev_year = df['Previous_period'][i][0]
        prev_month = df['Previous_period'][i][1]
        # Rows of the same cost centre and account in the previous period
        prev_rows = df[(df['Cost_centre'] == df['Cost_centre'][i]) &
                       (df['Account'] == df['Account'][i]) &
                       (df['Year'] == prev_year) & (df['Month'] == prev_month)]
        if prev_rows.empty:
            # No previous period: the difference is undefined
            diff.append(np.nan)
        else:
            max_closing_time_value = prev_rows['Max_Closing_time'].iloc[0]
            max_amount_consider = prev_rows[prev_rows['Closing_time'] == max_closing_time_value]['Amount']
            print('max_amount_consider is', max_amount_consider)
            # Previous period's closing amount minus the current row's amount
            found_diff = max_amount_consider.iloc[0] - df['Amount'][i]
            diff.append(found_diff)
    df['Variance'] = diff
    return df
def calculate_variance(df):
    '''
    Input data frame is the result of the query used above to fetch the data
    '''
    try:
        df = prevPeriod(df)
    except Exception:
        print('Error occurred in prevPeriod function')
        raise
    # prerequisite for max_time_period: latest Closing_time per group,
    # stored under the name that calculate_difference expects
    df2 = (df.groupby(['Cost_centre', 'Account', 'Year', 'Month'])['Closing_time']
             .apply(max_closing_time)
             .reset_index(name='Max_Closing_time'))
    df = pd.merge(df, df2, on=['Cost_centre', 'Account', 'Year', 'Month'])
    # final calculation
    try:
        final_result = calculate_difference(df)
    except Exception:
        print('Error in calculate_difference')
        raise
    return final_result
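The three steps above could also be done without row-by-row loops. The following is a rough sketch, not a drop-in replacement: it assumes the columns are named exactly `Cost_centre`, `Account`, `Year`, `Month`, `Closing_time` and `Amount`, that every `Closing_time` looks like `D+1 18 CST`, and the names `add_difference` and `Prev_max_amount` are invented for the example. The idea is to take each period's closing amount, relabel it with the following period, and left-merge it back so that rows without a previous period get NaN.

```python
import pandas as pd


def add_difference(df):
    keys = ["Cost_centre", "Account", "Year", "Month"]
    out = df.copy()
    # Split 'D+3 18 CST' into a signed day offset and an hour
    parts = out["Closing_time"].str.extract(r"D([+-]\d+)\s+(\d+)")
    out["_day"] = parts[0].map(int)
    out["_hour"] = parts[1].map(int)
    # Last row per group after sorting = latest closing time of that period
    closing = (
        out.sort_values(["_day", "_hour"])
        .groupby(keys)
        .tail(1)[keys + ["Amount"]]
        .rename(columns={"Amount": "Prev_max_amount"})
    )
    # Relabel each period's closing amount with the period that follows it
    closing["Month"] = closing["Month"] + 1
    rollover = closing["Month"] > 12
    closing.loc[rollover, "Month"] = 1
    closing.loc[rollover, "Year"] += 1
    # Periods with no predecessor get NaN from the left merge
    out = out.merge(closing, on=keys, how="left")
    out["Difference"] = out["Prev_max_amount"] - out["Amount"]
    return out.drop(columns=["_day", "_hour"])


# Made-up sample data, only to exercise the function
sample = pd.DataFrame({
    "Cost_centre": ["A"] * 5,
    "Account": ["X"] * 5,
    "Year": [2020, 2020, 2020, 2021, 2021],
    "Month": [12, 12, 12, 1, 1],
    "Closing_time": ["D-0 00 CST", "D+3 18 CST", "D+1 18 CST", "D-0 00 CST", "D+1 06 CST"],
    "Amount": [100, 150, 120, 130, 170],
})
result = add_difference(sample)
```

In the sample, December 2020 has no previous period, so its differences are NaN, while the January 2021 rows are compared against the December amount recorded at `D+3 18 CST`.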

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
             id        date    price
93   6021501535  2014-07-25   430000
93   6021501535  2014-12-23   700000
313  4139480200  2014-06-18  1384000
313  4139480200  2014-12-09  1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430000, 1384000]
second_list = [700000, 1400000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question I think it's not necessary to save them into a list because you could also store them somewhere else (e.g. another DataFrame) and plot them. The functions below should help with filling wherever you want to store your data.
def date(your_id):
    # First and second dates recorded for the given id (column 1 is 'date')
    rows = df.loc[df['id'] == your_id]
    first_date = rows.iloc[0, 1]
    second_date = rows.iloc[1, 1]
    return first_date, second_date
def price(your_id):
    # Prices on the first and second dates for the given id (column 2 is 'price')
    first_date, second_date = date(your_id)
    price_first_date = df.loc[(df['id'] == your_id) & (df['date'] == first_date)].iloc[0, 2]
    price_second_date = df.loc[(df['id'] == your_id) & (df['date'] == second_date)].iloc[0, 2]
    return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If, for example, you now want to store your data in a new df, you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1,len(selected_ids)+1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
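If plain lists are still wanted, one possible shortcut is to number the rows within each id with `groupby(...).cumcount()` and pick positions 0 and 1. This is a sketch on a copy of the sample data; it assumes the rows are already in date order within each id (otherwise sort by `date` first) and uses a default index rather than the one shown in the question.

```python
import pandas as pd

# Sample data from the question, rebuilt with a default index
df = pd.DataFrame({
    "id": [6021501535, 6021501535, 4139480200, 4139480200],
    "date": ["2014-07-25", "2014-12-23", "2014-06-18", "2014-12-09"],
    "price": [430000, 700000, 1384000, 1400000],
})

# Position of each row within its id group: 0 for the first sale, 1 for the second
order = df.groupby("id").cumcount()

first_list = df.loc[order == 0, "price"].tolist()
second_list = df.loc[order == 1, "price"].tolist()
```

Both lists keep the ids in the order in which they first appear in the frame, so the two lists line up element by element for plotting.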

Spotfire: How to increment variables to build scoring mechanism?

I'm trying to figure out how I could use variables in Spotfire (online version) to build a scoring mechanism and populate a calculated column with the final result.
I have a couple of values stored in columns that I would use to evaluate and attribute a score like this:
if column1<10 then segment1 = segment1 + 1
if column1>10 then segment2 = segment2+1
...ETC...
In the end each "segment" should have a score and I would like to simply display the name of the segment that has the highest score.
Ex:
Segment1 has a final value of 10
Segment2 has a final value of 22
Segment3 has a final value of 122
I would display Segment3 as value for the calculated column
Using only "IF" would lead to a complicated nested IF structure, so I'm looking for something more like a script.
Is there a way to achieve this with Spotfire?
Thanks
Laurent
To cycle through the data rows and calculate a running score, you can use an IronPython script. The script below reads the numeric data from Col1 and Col2 of a data table named "Data Table", calculates a score for each row, and writes it to a tab-delimited text string. When done, it adds the result to the Spotfire table using the Add Columns function. Note that the existing data needs a unique identifier; if it has none, the RowId() function can be used to create a calculated column holding a unique row id.
from Spotfire.Dxp.Data import *
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin
from Spotfire.Dxp.Data.Import import *
from System import Array
def add_column(table, text, col_name):
    # read the text data into memory
    mem_stream = MemoryStream()
    writer = StreamWriter(mem_stream)
    writer.Write(text)
    writer.Flush()
    mem_stream.Seek(0, SeekOrigin.Begin)
    # define the structure of the text data
    settings = TextDataReaderSettings()
    settings.Separator = "\t"
    settings.SetDataType(0, DataType.Integer)
    settings.SetColumnName(0, 'ID')
    settings.SetDataType(1, DataType.Real)
    settings.SetColumnName(1, col_name)
    # create a data source from the in memory text data
    data = TextFileDataSource(mem_stream, settings)
    # define the relationship between the existing table (left) and the new data (right)
    leftColumnSignature = DataColumnSignature("Store ID", DataType.Integer)
    rightColumnSignature = DataColumnSignature("ID", DataType.Integer)
    columnMap = {leftColumnSignature:rightColumnSignature}
    ignoredColumns = []
    columnSettings = AddColumnsSettings(columnMap, JoinType.LeftOuterJoin, ignoredColumns)
    # now add the column(s)
    table.AddColumns(data, columnSettings)
#get the data table
table=Document.Data.Tables["Data Table"]
#place data cursor on a specific column
cursorCol1 = DataValueCursor.CreateFormatted(table.Columns["Col1"])
cursorCol2 = DataValueCursor.CreateFormatted(table.Columns["Col2"])
cursorColId = DataValueCursor.CreateFormatted(table.Columns["ID"])
cursorsList = Array[DataValueCursor]([cursorCol1, cursorCol2, cursorColId])
text = ""
rowsToInclude = IndexSet(table.RowCount,True)
#iterate through table column rows to retrieve the values
for row in table.GetRows(rowsToInclude, cursorsList):
    score = 0
    # get the current values from the cursors
    col1Val = cursorCol1.CurrentDataValue.ValidValue
    col2Val = cursorCol2.CurrentDataValue.ValidValue
    id = cursorColId.CurrentDataValue.ValidValue
    # now apply rules for scoring
    if col1Val <= 3:
        score -= 3
    elif col1Val > 3 and col2Val > 50:
        score += 10
    else:
        score += 5
    # one tab-delimited line per row: id, then score
    text += "%d\t%f\r\n" % (id, score)
add_column(table, text, 'Score_Result')
For an approach with no scripting, but also no accumulation, you can use calculated columns.
To get the scores, you can use a calculated column with case statements. For Segment 1, you might have:
case
when [Col1] > 100 then 10
when [Col1] < 100 and [Col2] > 600 then 20
end
Then, once you have the scores, you can create a calculated column, say [MaxSegment]. The expression for this will be Max([Segment1],[Segment2],[Segment3]...). Then display the value of [MaxSegment].
The Max function in this case acts as a row expression and calculates the max value across the given columns for each row.
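The scoring idea itself (increment a counter per segment, then report the name of the segment with the highest score) can be sketched in plain Python. This is not Spotfire API code: the rules, the `column2` threshold and the function name `top_segment` are hypothetical, and the row is represented as a simple dict.

```python
def top_segment(row):
    """Return the name of the segment with the highest score for one row."""
    scores = {"Segment1": 0, "Segment2": 0, "Segment3": 0}
    # Each rule increments one segment, as in the question's pseudocode
    if row["column1"] < 10:
        scores["Segment1"] += 1
    if row["column1"] > 10:
        scores["Segment2"] += 1
    if row["column2"] > 600:  # hypothetical extra rule
        scores["Segment3"] += 2
    # max over the dict keys, ranked by score; ties go to the first segment listed
    return max(scores, key=scores.get)


labels = [top_segment(r) for r in (
    {"column1": 4, "column2": 100},
    {"column1": 12, "column2": 900},
)]
```

Unlike taking the maximum of the segment scores, this returns the segment's name, which is what the question asks to display.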

How to populate a dataframe column based on the value of another column

Suppose I have 3 dataframe variables: res_df_union is the main dataframe and df_res and df_vacant are subdataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant, and if they match, to assign vacant_has_res in res_df_union with the value of 1.
Note: I am using GeoPandas (a gpd GeoDataFrame) instead of plain pandas because I am working with spatial data, but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueid's that match. Now I just want to look for those unique id's in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash, or never finishes running (after several hours). What am I doing wrong and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1
for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis = 1)
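The loop above re-runs `apply` over the whole frame once for every matching id and overwrites the column each time, which is likely why it never finishes. A vectorized membership test with `Series.isin` is one way to avoid that. The sketch below uses a small made-up pandas frame in place of the GeoDataFrame (a GeoDataFrame supports the same column operations) and the column names from the question.

```python
import pandas as pd

# Made-up stand-in for res_df_union
res_df_union = pd.DataFrame({
    "uniqueid": [1, 2, 2, 3, 4],
    "Parc_Res": [1, 1, 0, 0, 1],
    "Parc_Vacant": [0, 0, 1, 1, 0],
})

res_ids = res_df_union.loc[res_df_union["Parc_Res"] == 1, "uniqueid"]
vac_ids = res_df_union.loc[res_df_union["Parc_Vacant"] == 1, "uniqueid"]

# Ids present in both subsets; a set intersection replaces the nested loops
vacant_res_ids = set(vac_ids) & set(res_ids)

# 1 where the row's uniqueid is in the matching set, 0 otherwise
res_df_union["vacant_has_res"] = res_df_union["uniqueid"].isin(vacant_res_ids).astype(int)
```

Here only id 2 appears in both subsets, so both rows with `uniqueid == 2` are flagged.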

How to perform excel MATCHINDEX or lookups in Python?

I need to find, for the max value of each column in the data frame below, the corresponding row value from the first column (type).
Let's suppose this is the df3 data frame shown below.
I need to find the relevant "type" row value for the max value of each numeric column.
For example: max(price) => 8,85,113 => corr_logx (this is the output I need for all the numeric variables)
type        price     Bedroom_3  Bedroom_4
corr_x      3,56,315  5,01,687   6,05,458
corr_logx   8,85,113  2,27,955   1,28,834
corr_expx   8,34,503  3,62,952   2,30,759
corr_sqrtx  6,29,162  3,36,964   8,96,593
corr_sqrex  7,79,030  8,54,539   6,07,960
Output I need:
Variable   transfrm_used
Price      corr_logx
Bedroom_3  corr_sqrex
Bedroom_4  corr_sqrtx
For this:
I created a list of numeric column names and used it as the range of a for loop. This lets me step through each column.
Within each column I then created another for loop that steps through each row to match against the max value of the column.
If the value matches, it should return the relevant value from the first column (column 0, named type). Otherwise it should keep looking for a match.
cols_list = df3.columns.difference(['type'])
transfrm_used = []
variable = []
for col_name in cols_list:
    variable.append(col_name)  # this gives the respective column name
    print(variable)
    for rows in range(0, 5):
        # positional lookup; df3[col_name] (not df3.col_name) selects the column named by the variable
        if df3[col_name].iloc[rows] == df3[col_name].max():  # works as Match
            transfrm_used.append(df3['type'].iloc[rows])
        else:
            continue
print('done')
I am looking for a result format that gives both the column names, like price, Bedroom_3 etc., and the relevant value from the type column, like corr_logx.
In Excel this is done using MATCHINDEX and lookups. Here is the complete data frame and the expected result:
type        price     Bedroom_3  Bedroom_4
corr_x      3,56,315  5,01,687   6,05,458
corr_logx   8,85,113  2,27,955   1,28,834
corr_expx   8,34,503  3,62,952   2,30,759
corr_sqrtx  6,29,162  3,36,964   8,96,593
corr_sqrex  7,79,030  8,54,539   6,07,960
                max_price  max_Bedroom_3  max_Bedroom_4
max_value:      8,85,113   8,54,539       8,96,593
relevant_type:  corr_logx  corr_sqrex     corr_sqrtx
If you're sure that each max value occurs only once, .max() is enough to find the max of a column. Then we can find the type of that row:
max_value = {}
relevant_type = {}
for col_name in cols_list:
    max_val = df3[col_name].max()
    max_value[col_name] = max_val
    # .iloc[0] takes the single matching 'type' value rather than a Series
    relevant_type[col_name] = df3.loc[df3[col_name] == max_val, 'type'].iloc[0]
Then, we can add these two dicts as rows at the bottom of the dataframe:
# DataFrame.append was removed in pandas 2.0; pd.concat returns a new frame, so assign the result
df3 = pd.concat([df3, pd.DataFrame([max_value, relevant_type])], ignore_index=True)
Or we can make a new dataframe:
combined_dict = {x: [max_value[x],relevant_type[x]] for x in max_value}
df = pd.DataFrame(combined_dict, index=['max value','relevant type'])
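A more compact route, closer to Excel's INDEX/MATCH pattern, is `DataFrame.idxmax`, which returns the index label of the maximum in each column. The sketch below rebuilds the sample data with plain integers; if the real columns hold strings such as `8,85,113`, they would need converting to numbers first. The column names `max_value` and `relevant_type` follow the expected output, and `result` is just an illustrative name.

```python
import pandas as pd

df3 = pd.DataFrame({
    "type": ["corr_x", "corr_logx", "corr_expx", "corr_sqrtx", "corr_sqrex"],
    "price": [356315, 885113, 834503, 629162, 779030],
    "Bedroom_3": [501687, 227955, 362952, 336964, 854539],
    "Bedroom_4": [605458, 128834, 230759, 896593, 607960],
})

# With 'type' as the index, idxmax returns the type label of each column's maximum
numeric = df3.set_index("type")
result = pd.DataFrame({
    "max_value": numeric.max(),
    "relevant_type": numeric.idxmax(),
})
```

`result` has one row per numeric column; if a maximum occurs more than once, `idxmax` reports the first occurrence.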
