Spotfire: How to increment variables to build scoring mechanism?

I'm trying to figure out how I could use variables in Spotfire (online version) to build a scoring mechanism and populate a calculated column with the final result.
I have a couple of values stored in columns that I would use to evaluate and attribute a score like this:
if column1<10 then segment1 = segment1 + 1
if column1>10 then segment2 = segment2+1
...ETC...
In the end each "segment" should have a score and I would like to simply display the name of the segment that has the highest score.
Ex:
Segment1 has a final value of 10
Segment2 has a final value of 22
Segment3 has a final value of 122
I would display Segment3 as value for the calculated column
Using only "IF" would lead me to a complicated IF structure so I'm more looking for something that looks more like a script.
Is there a way to achieve this with Spotfire?
Thanks
Laurent

To cycle through the data rows and calculate a running score, you can use an IronPython script. The script below reads the numeric data from Col1 and Col2 of a data table named "Data Table", calculates a score value for each row, and writes the results to a tab-delimited text string. When done, it adds the scores to the Spotfire table using the Add Columns function. Note that the existing data needs a unique identifier; if there is none, the RowId() function can be used in a calculated column to create one.
from Spotfire.Dxp.Data import *
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin
from Spotfire.Dxp.Data.Import import *
from System import Array

def add_column(table, text, col_name):
    # read the text data into memory
    mem_stream = MemoryStream()
    writer = StreamWriter(mem_stream)
    writer.Write(text)
    writer.Flush()
    mem_stream.Seek(0, SeekOrigin.Begin)
    # define the structure of the text data
    settings = TextDataReaderSettings()
    settings.Separator = "\t"
    settings.SetDataType(0, DataType.Integer)
    settings.SetColumnName(0, 'ID')
    settings.SetDataType(1, DataType.Real)
    settings.SetColumnName(1, col_name)
    # create a data source from the in-memory text data
    data = TextFileDataSource(mem_stream, settings)
    # define the relationship between the existing table (left) and the new data (right);
    # the left signature must name the unique id column of the existing table
    leftColumnSignature = DataColumnSignature("ID", DataType.Integer)
    rightColumnSignature = DataColumnSignature("ID", DataType.Integer)
    columnMap = {leftColumnSignature: rightColumnSignature}
    ignoredColumns = []
    columnSettings = AddColumnsSettings(columnMap, JoinType.LeftOuterJoin, ignoredColumns)
    # now add the column(s)
    table.AddColumns(data, columnSettings)

# get the data table
table = Document.Data.Tables["Data Table"]

# place data cursors on the columns to read
cursorCol1 = DataValueCursor.CreateFormatted(table.Columns["Col1"])
cursorCol2 = DataValueCursor.CreateFormatted(table.Columns["Col2"])
cursorColId = DataValueCursor.CreateFormatted(table.Columns["ID"])
cursorsList = Array[DataValueCursor]([cursorCol1, cursorCol2, cursorColId])

text = ""
rowsToInclude = IndexSet(table.RowCount, True)

# iterate through the table rows to retrieve the values
for row in table.GetRows(rowsToInclude, cursorsList):
    score = 0
    # get the current values from the cursors
    col1Val = cursorCol1.CurrentDataValue.ValidValue
    col2Val = cursorCol2.CurrentDataValue.ValidValue
    id = cursorColId.CurrentDataValue.ValidValue
    # now apply the scoring rules
    if col1Val <= 3:
        score -= 3
    elif col1Val > 3 and col2Val > 50:
        score += 10
    else:
        score += 5
    text += "%d\t%f\r\n" % (id, score)

add_column(table, text, 'Score_Result')
For an approach with no scripting, but also no accumulation, you can use calculated columns.
To get the scores, you can use a calculated column with case statements. For Segment 1, you might have:
case
    when [Col1] > 100 then 10
    when [Col1] < 100 and [Col2] > 600 then 20
end
Then, once you have the scores, you can create a calculated column, say [MaxSegment], with the expression Max([Segment1],[Segment2],[Segment3],...) and display the value of [MaxSegment].
The Max function in this case acts as a row expression, calculating the maximum value across the given columns for each row.
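Note that Max([Segment1],[Segment2],...) returns the highest score itself, not the segment name the question asks to display. As a sketch, assuming three segment columns, a further case expression can map the maximum back to a name (ties resolve to the first segment listed):
case
    when [Segment1] >= [Segment2] and [Segment1] >= [Segment3] then "Segment1"
    when [Segment2] >= [Segment1] and [Segment2] >= [Segment3] then "Segment2"
    else "Segment3"
end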

Related

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
             id        date    price
93   6021501535  2014-07-25   430000
93   6021501535  2014-12-23   700000
313  4139480200  2014-06-18  1384000
313  4139480200  2014-12-09  1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430000, 1384000]
second_list = [700000, 1400000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question, it's not strictly necessary to save the values into lists; you could also store them somewhere else (e.g. another DataFrame) and plot from there. The functions below should help with filling whatever structure you want to store your data in.
def date(your_id):
    # return the first and second date for the given id (assumes two rows per id)
    first_date = df.loc[df['id'] == your_id].iloc[0, 1]
    second_date = df.loc[df['id'] == your_id].iloc[1, 1]
    return first_date, second_date

def price(your_id):
    # look up the price on each of the two dates for the given id
    first_date, second_date = date(your_id)
    price_first_date = df.loc[(df['id'] == your_id) & (df['date'] == first_date)].iloc[0, 2]
    price_second_date = df.loc[(df['id'] == your_id) & (df['date'] == second_date)].iloc[0, 2]
    return price_first_date, price_second_date

price_first_date, price_second_date = price(6021501535)
If, for example, you now want to store your data in a new DataFrame, you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1, len(selected_ids) + 1),
                      columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
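If you'd rather avoid the per-id lookups entirely, here is a minimal groupby sketch (assuming the df from the question, with two rows per id) that gets both prices per id in one step; 'first' and 'last' are built-in pandas aggregations:
first_second = (df.sort_values('date')
                  .groupby('id')['price']
                  .agg(['first', 'last']))
# first_second['first'] holds the earlier price per id,
# first_second['last'] holds the later one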

How to populate a dataframe column based on the value of another column

Suppose I have 3 dataframe variables: res_df_union is the main dataframe and df_res and df_vacant are subdataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant, and if they match, to assign vacant_has_res in res_df_union with the value of 1.
*Note: I am using geoPandas (gpd Dataframe) instead of just pandas because I am working with spatial data but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueid's that match. Now I just want to look for those unique id's in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash, or never finishes running (after several hours). What am I doing wrong and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1

for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis=1)
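For reference, this kind of membership test can be vectorised, which avoids both the nested loops and the repeated apply (each apply call rescans the whole frame, which explains the runtime). A minimal sketch using isin, assuming the column names from the question:
# flag every row whose uniqueid appears in vacant_res_ids, in one pass
res_df_union['vacant_has_res'] = res_df_union['uniqueid'].isin(vacant_res_ids).astype(int)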

How to perform Excel MATCH/INDEX or lookups in Python?

I basically need to find the relevant row value from the first column, i.e. type, for the max value of each numeric column in the DataFrame below.
Let's suppose this is the df3 DataFrame shown below.
I need to find the relevant "type" row value for the max value of each numeric column.
For example: max(price) => 8,85,113 => corr_logx (this is the output I need to find for all the numeric variables).
type        price     Bedroom_3  Bedroom_4
corr_x      3,56,315  5,01,687   6,05,458
corr_logx   8,85,113  2,27,955   1,28,834
corr_expx   8,34,503  3,62,952   2,30,759
corr_sqrtx  6,29,162  3,36,964   8,96,593
corr_sqrex  7,79,030  8,54,539   6,07,960
Output I need:
Variable   transfrm_used
Price      corr_logx
Bedroom_3  corr_sqrex
Bedroom_4  corr_sqrtx
For this:
I created a list of the numeric column names and used it as the range of a for loop, which lets me step into each column.
Within each column, I then created another for loop that steps through each row, looking for a match with the column's max value.
If a value matches, the loop should return the relevant value from the first column (column 0, named type); otherwise it should keep looking for the match.
cols_list = df3.columns.difference(['type'])
transfrm_used = []
variable = []
for col_name in cols_list:
    variable.append(col_name)  # this gives the respective column name
    print(variable)
    for rows in range(0, 5):
        if df3[rows, col_name] == np.max(df3.col_name):  # works as Match
            transfrm_used.append(df3[rows, 0])
        else:
            continue
print('done')
I am looking for a result format where I can get both column names like price, Bedroom_3 etc. and its relevant value from type column like corr_logx.
In Excel this is done using MATCH/INDEX and lookups. Here is the complete DataFrame and the expected result:
type        price     Bedroom_3  Bedroom_4
corr_x      3,56,315  5,01,687   6,05,458
corr_logx   8,85,113  2,27,955   1,28,834
corr_expx   8,34,503  3,62,952   2,30,759
corr_sqrtx  6,29,162  3,36,964   8,96,593
corr_sqrex  7,79,030  8,54,539   6,07,960

                max_price  max_Bedroom_3  max_Bedroom_4
max_value:       8,85,113       8,54,539       8,96,593
relevant_type:  corr_logx     corr_sqrex     corr_sqrtx
If you're sure that there's only one instance of each max value, .max() will find the max of a column. Then we can look up the type of that row:
max_value = {}
relevant_type = {}
for col_name in cols_list:
    max_val = df3[col_name].max()
    max_value[col_name] = max_val
    # .iloc[0] pulls the scalar out of the single-row selection
    relevant_type[col_name] = df3.loc[df3[col_name] == max_val, 'type'].iloc[0]
Then we can append these two dicts as rows at the bottom of the dataframe (append returns a new DataFrame, so assign the result back):
df3 = df3.append(max_value, ignore_index=True)
df3 = df3.append(relevant_type, ignore_index=True)
Or we can make a new dataframe:
combined_dict = {x: [max_value[x],relevant_type[x]] for x in max_value}
df = pd.DataFrame(combined_dict, index=['max value','relevant type'])
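For what it's worth, pandas has a direct analogue of MATCH/INDEX for this task: idxmax. A minimal sketch, assuming df3 and cols_list as above and that the numeric columns actually hold numbers:
# with 'type' as the index, idxmax returns for each column
# the index label (the type) of the row holding its maximum
indexed = df3.set_index('type')[cols_list]
relevant_type = indexed.idxmax()
max_value = indexed.max()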

Formatting specific rows in a Jupyter Notebook dataframe output

I'm presenting a data frame in Jupyter Notebook. The initial data type of the data frame is float. I want to present rows 1 & 3 of the printed table as integers and rows 2 & 4 as percentage. How do I do that? (I've spent numerous hours looking for a solution with no success)
Here's the code I'm using:
# Creating the table
clms = sales.columns
indx = ['# of Poeple', '% of Poeple', '# Purchased per Activity', '% Purchased per Activity']
basic_stats = pd.DataFrame(data=np.nan, index=indx, columns=clms)
basic_stats.head()

# Calculating the # of people who took part in each activity
for clm in sales.columns:
    basic_stats.iloc[0][clm] = int(round(sales[sales[clm] > 0][clm].count(), 0))

# Calculating the % of people who took part in each activity from the total email list
for clm in sales.columns:
    basic_stats.iloc[1][clm] = round((basic_stats.iloc[0][clm] / sales['Sales'].count()) * 100, 2)

# Calculating the # of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[2][clm] = int(round(sales[(sales[clm] > 0) & (sales['Sales'] > 0)][clm].count()))

# Calculating the % of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[3][clm] = round((basic_stats.iloc[2][clm] / basic_stats.iloc[0][clm]) * 100, 2)

# Present the table
basic_stats
Here's the printed table:
[Output table of the 'basic_stats' data frame in Jupyter Notebook]
Integer representation
You already assign integers to the cells of rows 1 and 3. The reason these integers are printed as floats is that all columns have the data type float64, which is caused by the way you initially create the DataFrame. You can view the data types by printing the .dtypes attribute:
basic_stats = pd.DataFrame(data=np.nan,index=indx,columns=clms)
print(basic_stats.dtypes)
# Prints:
# column1 float64
# column2 float64
# ...
# dtype: object
If you don't provide the data keyword argument in the constructor of the DataFrame, the data type of each cell will be object, which can be any object:
basic_stats = pd.DataFrame(index=indx,columns=clms)
print(basic_stats.dtypes)
# Prints:
# column1 object
# column2 object
# ...
# dtype: object
When the data type of a cell is object, the content is formatted using its built-in methods, which leads to integers being formatted properly.
Percentage representation
In order to display percentages, you can use a custom class that prints a float number the way you want:
class PercentRepr(object):
    """Represents a floating point number as percent"""
    def __init__(self, float_value):
        self.value = float_value
    def __str__(self):
        return "{:.2f}%".format(self.value * 100)
Then just use this class for the values of the percentage rows (index 1 and 3):
# Creating the table
clms = sales.columns
indx = ['# of Poeple', '% of Poeple', '# Purchased per Activity', '% Purchased per Activity']
basic_stats = pd.DataFrame(index=indx, columns=clms)
basic_stats.head()

# Calculating the # of people who took part in each activity
for clm in sales.columns:
    basic_stats.iloc[0][clm] = int(round(sales[sales[clm] > 0][clm].count(), 0))

# Calculating the % of people who took part in each activity from the total email list
for clm in sales.columns:
    basic_stats.iloc[1][clm] = PercentRepr(basic_stats.iloc[0][clm] / sales['Sales'].count())

# Calculating the # of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[2][clm] = int(round(sales[(sales[clm] > 0) & (sales['Sales'] > 0)][clm].count()))

# Calculating the % of people who took part in each activity AND that bought the product
for clm in sales.columns:
    basic_stats.iloc[3][clm] = PercentRepr(basic_stats.iloc[2][clm] / basic_stats.iloc[0][clm])

# Present the table
basic_stats
Note: This actually changes the data in your dataframe! If you want to do further processing with the percentage rows, be aware that they no longer contain float objects.
Here's one way; it's kind of a hack, but if it's simply for pretty printing, it'll work.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(20).reshape(4, 5))

# first and third rows display as integers
df.loc[0, :] = df.loc[0, :] * 100
df.loc[2, :] = df.loc[2, :] * 100
df.loc[0, :] = df.loc[0, :].astype(int).astype(str)
df.loc[2, :] = df.loc[2, :].astype(int).astype(str)

# second and fourth rows display as percents (with 2 decimals)
df.loc[1, :] = np.round(df.loc[1, :].values.astype(float), 4).astype(float) * 100
df.loc[3, :] = np.round(df.loc[3, :].values.astype(float), 4).astype(float) * 100
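If you'd rather not touch the data at all, a display-only alternative is the pandas Styler; a minimal sketch, assuming the same 4x5 frame as above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(20).reshape(4, 5))

# render rows 0 and 2 as integers and rows 1 and 3 as percentages,
# leaving the underlying float data untouched
df.style.format('{:.0f}', subset=pd.IndexSlice[[0, 2], :]) \
        .format('{:.2%}', subset=pd.IndexSlice[[1, 3], :])
Since the Styler only affects rendering in the notebook, downstream computations still see the original floats.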

Creating a dictionary of dictionaries from csv file

Hi, so I am trying to write a function, classify(csv_file), that creates a default dictionary of dictionaries from a csv file. The first "column" (the first item in each row) is the key for each entry in the dictionary, and the second "column" (the second item in each row) contains the values.
However, I want to alter the values by calling on two functions (in this order):
trigram_c(string): creates a default dictionary of trigram counts within the string (these become the inner values)
normal(tri_counts): takes the output of trigram_c and normalises the counts (i.e. converts the raw count for each trigram into a normalised number).
Thus, my final output will be a dictionary of dictionaries:
{value: {trigram1 : normalised_count, trigram2: normalised_count}, value2: {trigram1: normalised_count...}...} and so on
My current code looks like this:
def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((l_rows[0], l_rows[1]) for rows in l_rows)
For example, if the csv file was:
Snippet1, "It was a dark stormy day"
Snippet2, "Hello world!"
Snippet3, "How are you?"
The final output would resemble:
{Snippet1: {'It ': 0.5352, 't w': 0.43232}, Snippet2: {'Hel' : 0.438724,...}...} and so on.
(Of course there would be more than just two trigram counts, and the numbers are just random for the purpose of the example).
Any help would be much appreciated!
First of all, please check your classify function, because it doesn't run as written. Here is a corrected version:
import csv

def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((row[0], row[1]) for row in l_rows)
    return classified
It returns a dictionary whose keys come from the first column and whose values are the strings from the second column.
You should then iterate over every dictionary entry and pass its value to the trigram_c function. I didn't understand exactly how you calculated the trigram counts, but if, for example, you just count how many times each trigram appears in the string, you could use the function below. If you want a different counting scheme, you only need to update the code in the for loop.
def trigram_c(string):
    trigram_dict = {}
    start = 0
    end = 3
    for i in range(len(string) - 2):
        # you could implement your own counting logic in this loop
        trigram = string[start:end]
        if trigram in trigram_dict.keys():
            trigram_dict[trigram] += 1
        else:
            trigram_dict[trigram] = 1
        start += 1
        end += 1
    return trigram_dict
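The question also needs the normal step and the final wiring; here is a minimal sketch, assuming "normalise" means dividing each trigram count by the total number of trigrams in the snippet:
def normal(tri_counts):
    # convert raw counts into relative frequencies that sum to 1
    total = float(sum(tri_counts.values()))
    return {tri: count / total for tri, count in tri_counts.items()}

def classify(csv_file):
    # key: first csv column; value: normalised trigram counts of the second column
    l_rows = list(csv.reader(open(csv_file)))
    return {row[0]: normal(trigram_c(row[1])) for row in l_rows}
With the sample csv this produces the {snippet: {trigram: normalised_count, ...}, ...} shape described above.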
