How to perform Excel MATCH/INDEX-style lookups in Python? - python-3.x

I basically need to find the relevant row value from the first column (type) for the max value of each numeric column in the data frame below.
Suppose this is the data frame df3, shown below.
I need to find the relevant "type" row value for the max value of every numeric column.
For example: max(price) => 8,85,113 => corr_logx (this is the output I need for all the numeric variables).
type price Bedroom_3 Bedroom_4
corr_x 3,56,315 5,01,687 6,05,458
corr_logx 8,85,113 2,27,955 1,28,834
corr_expx 8,34,503 3,62,952 2,30,759
corr_sqrtx 6,29,162 3,36,964 8,96,593
corr_sqrex 7,79,030 8,54,539 6,07,960
Output I need:
Variable transfrm_used
Price corr_logx
Bedroom_3 corr_sqrex
Bedroom_4 corr_sqrtx
For this:
I created a list of the numeric column names and used it as the range of a for-loop, which steps into each column.
Within each column, a nested for-loop steps through each row to compare the cell with the column's max value.
If the value matches, it should return the relevant value from the first column (column 0, named type); otherwise it should keep looking for the match.
cols_list = df3.columns.difference(['type'])
transfrm_used = []
variable = []
for col_name in cols_list:
    variable.append(col_name)  # this gives the respective column name
    print(variable)
    for rows in range(0, 5):
        if df3.loc[rows, col_name] == df3[col_name].max():  # works like MATCH
            transfrm_used.append(df3.loc[rows, 'type'])
        else:
            continue
print('done')
I am looking for a result format where I get both the column name (price, Bedroom_3, etc.) and its relevant value from the type column (like corr_logx).
In Excel this is done with MATCH/INDEX and lookups. Here is the complete data frame and expected result:
type price Bedroom_3 Bedroom_4
corr_x 3,56,315 5,01,687 6,05,458
corr_logx 8,85,113 2,27,955 1,28,834
corr_expx 8,34,503 3,62,952 2,30,759
corr_sqrtx 6,29,162 3,36,964 8,96,593
corr_sqrex 7,79,030 8,54,539 6,07,960
max_price max_Bedroom_3 max_Bedroom_4
max_value: 8,85,113 8,54,539 8,96,593
relevant_type: corr_logx corr_sqrex corr_sqrtx

If you're sure that there's only one instance of each column's max, a .max() will find it. Then we can find the type of that row:
max_value = {}
relevant_type = {}
for col_name in cols_list:
    max_val = df3[col_name].max()
    max_value[col_name] = max_val
    # .iloc[0] extracts the scalar from the (single) matching row
    relevant_type[col_name] = df3.loc[df3[col_name] == max_val, 'type'].iloc[0]
Then we can append these two dicts as rows at the bottom of the dataframe (append returns a new frame rather than modifying in place; in newer pandas, where DataFrame.append was removed, use pd.concat instead):
df3 = df3.append(max_value, ignore_index=True)
df3 = df3.append(relevant_type, ignore_index=True)
Or we can make a new dataframe:
combined_dict = {x: [max_value[x],relevant_type[x]] for x in max_value}
df = pd.DataFrame(combined_dict, index=['max value','relevant type'])
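For reference, pandas can collapse the whole MATCH/INDEX step into one call. A minimal sketch, assuming the comma-formatted figures have been parsed as plain integers:

import pandas as pd

df3 = pd.DataFrame({
    'type': ['corr_x', 'corr_logx', 'corr_expx', 'corr_sqrtx', 'corr_sqrex'],
    'price': [356315, 885113, 834503, 629162, 779030],
    'Bedroom_3': [501687, 227955, 362952, 336964, 854539],
    'Bedroom_4': [605458, 128834, 230759, 896593, 607960],
})

# With 'type' as the index, idxmax returns, for each numeric column,
# the index label (i.e. the type) of the row holding the max value.
result = df3.set_index('type').idxmax().rename('transfrm_used')
print(result)
# price        corr_logx
# Bedroom_3    corr_sqrex
# Bedroom_4    corr_sqrtx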

Related

How to remove a complete row when no match found in a column's string values with any object from a given list?

Please help me complete this piece of code. Let me know if any other detail is required.
Thanks in advance!
Given: a column 'PROD_NAME' from a pandas dataframe of string type (e.g. Smiths Crinkle Cut Chips Chicken g), and a list of certain words (['Chip', 'Chips', etc.]).
To do: if none of the words from the list is contained in a row's string, we drop that whole row. Basically we're removing unnecessary products from the dataframe.
Here's my code:
# create a function to keep only those products which have
# chip, chips, doritos, dorito, pringle, Pringles, Chps, chp in their name
import copy as cp

def onlyChips(df, *cols):
    temp = []
    chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
    copy = cp.deepcopy(df)
    for col in [*cols]:
        for i in range(len(copy[col])):
            for item in chips:
                if item not in copy[col][i]:
                    flag = False
                else:
                    flag = True
                    break
            # drop only those strings which don't have any match from the
            # chips list, i.e. if flag never became True
            if not flag:
                # drop the whole row
    return <new created dataframe>
new = onlyChips(df_txn, 'PROD_NAME')
Filter the rows instead of deleting them. Create a boolean mask for each row: use str.contains on each column you need to search, check whether any of the searched columns match row-wise, and keep only the matching rows.
search_cols = ['PROD_NAME']
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
df = df[mask]
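A small self-contained run of that mask, on made-up rows, to show the filtering end to end:

import pandas as pd

chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
df = pd.DataFrame({'PROD_NAME': ['Smiths Crinkle Cut Chips Chicken 170g',
                                 'Cobs Popd Sea Salt Popcorn 110g']})
search_cols = ['PROD_NAME']
# str.contains builds one boolean column per searched column; any(axis=1)
# keeps a row if at least one searched column matched one of the chip words
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
print(df[mask])  # only the Smiths Chips row survives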

pandas: change the previous cell value of a column based on conditions in another column

I have a Pandas dataset that looks like this:
(image: a dataset of words and their features)
I would like to replace the "x" in the "Gender" column under this condition: if a word from a list (like "Mädchen") appears in the "Words" column, then "Neutral" should be put in the "Gender" column of the previous row (whose Words value is a number).
So, for example, this:
Gender Words
x 10.
x Mädchen
Should become:
Gender Words
Neutral 10.
x Mädchen
I have already tried np.where like this:
Food2_case["Gender"]= np.where(Food2_case.Words.isin(["Mädchen"]), (dropped_data.Words.str.contains('\d',regex= True) == 'A'), "x")
But I've got this error:
ValueError: operands could not be broadcast together with shapes
(8000,) (275988,) ()
Try the following:
for index, row in Food2_case.iterrows():
    if isinstance(row['Words'], str):
        if 'Mädchen' in row['Words']:
            Food2_case.loc[index - 1, 'Gender'] = 'Neutral'
If I understood your question correctly, it should work.
[EDIT]
If you want to check for words other than Mädchen, you can do the following:
words_to_check = ['Mädchen', ...]
for index, row in Food2_case.iterrows():
    if isinstance(row['Words'], str):
        if any(x in row['Words'] for x in words_to_check):
            Food2_case.loc[index - 1, 'Gender'] = 'Neutral'
# Create dataset
import pandas as pd

data = pd.DataFrame([[0, 0, 0], [10, "Madchen", 5]]).T
data.columns = ["Gender", "Words"]
# Shift the column of interest (pull each row's next value up)
data.loc[:, "iswordin"] = data.Words.shift(-1)
# Do what you want to do
data.loc[data.iswordin.isin(["Madchen", "Girl", "boy", "..."]), "Gender"] = "Neutral"
# Now you can drop the "iswordin" column, which is no longer useful
data = data.drop(columns="iswordin")
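The same shift idea can be written against the frame from the question directly. A sketch, assuming Food2_case has the Gender and Words columns shown above; note isin matches whole cell values, so switch to str.contains if you need substring matching:

words_to_check = ['Mädchen']
# shift(-1) pulls each row's *next* Words value up, so comparing it against
# the word list flags the previous row, the one that should become Neutral
next_word = Food2_case['Words'].shift(-1)
Food2_case.loc[next_word.isin(words_to_check), 'Gender'] = 'Neutral'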

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So, for example, the first observation in train_df has attribute_ids 147, 616 and 813, which map (in the labels_df) to culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am way off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from @Danny:
sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num']
               for i in x])
*Please note: I am running the above code on samples of each DF due to run times on the original DFs.
which returned: (screenshot of the output omitted)
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
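One such faster lookup: build a plain dict from labels_df once, then translate each row's id list with a comprehension instead of filtering the whole frame per id. A sketch assuming the attribute_id and attribute_name columns used above:

# Build the id -> name mapping a single time
labels_map = dict(zip(labels_df['attribute_id'], labels_df['attribute_name']))
# Translate every id in each row's list in one pass
df['new_col'] = df['attribute_ids'].apply(lambda ids: [labels_map[i] for i in ids])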
This is super ugly and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need:
Split train_df['attribute_ids'] into their own cells/columns:
helper_df = train_df['attribute_ids'].str.split(expand=True)
Combine train_df with the helper_df so I keep the id column (they are photo ids):
train_df2 = pd.concat([train_df, helper_df], axis=1)
Drop the original attribute_ids column:
train_df2.drop(columns='attribute_ids', inplace=True)
Rename the new columns (rename returns a new frame, so assign it back):
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})
Convert the labels_df into a dictionary:
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
my_map = create_file_mapping(labels_df)
Map and replace the tag numbers with their corresponding tag names:
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
Create a new column holding each observation's tags as one string of concatenated values:
train_df3['new_col'] = train_df3[train_df3.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis=1)

Spotfire: How to increment variables to build scoring mechanism?

I'm trying to figure out how I could use variables in Spotfire (online version) to build a scoring mechanism and populate a calculated column with the final result.
I have a couple of values stored in columns that I would use to evaluate and attribute a score like this:
if column1<10 then segment1 = segment1 + 1
if column1>10 then segment2 = segment2+1
...ETC...
In the end each "segment" should have a score and I would like to simply display the name of the segment that has the highest score.
Ex:
Segment1 has a final value of 10
Segment2 has a final value of 22
Segment3 has a final value of 122
I would display Segment3 as value for the calculated column
Using only "IF" would lead me to a complicated IF structure so I'm more looking for something that looks more like a script.
Is there a way to achieve this with Spotfire?
Thanks
Laurent
To cycle through the data rows and calculate a running score, you can use an IronPython script. The script below is reading the numeric data from Col1 and Col2 of a data table named "Data Table". It calculates a score value for each row and writes it to a tab delimited text string. When done, it adds it to the Spotfire table using the Add Columns function. Note, the existing data needs to have a unique identifier. If not, the RowId() function can be used to create a calculated column for a unique row id.
from Spotfire.Dxp.Data import *
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin
from Spotfire.Dxp.Data.Import import *
from System import Array

def add_column(table, text, col_name):
    # read the text data into memory
    mem_stream = MemoryStream()
    writer = StreamWriter(mem_stream)
    writer.Write(text)
    writer.Flush()
    mem_stream.Seek(0, SeekOrigin.Begin)
    # define the structure of the text data
    settings = TextDataReaderSettings()
    settings.Separator = "\t"
    settings.SetDataType(0, DataType.Integer)
    settings.SetColumnName(0, 'ID')
    settings.SetDataType(1, DataType.Real)
    settings.SetColumnName(1, col_name)
    # create a data source from the in-memory text data
    data = TextFileDataSource(mem_stream, settings)
    # define the relationship between the existing table (left) and the new data (right)
    leftColumnSignature = DataColumnSignature("Store ID", DataType.Integer)
    rightColumnSignature = DataColumnSignature("ID", DataType.Integer)
    columnMap = {leftColumnSignature: rightColumnSignature}
    ignoredColumns = []
    columnSettings = AddColumnsSettings(columnMap, JoinType.LeftOuterJoin, ignoredColumns)
    # now add the column(s)
    table.AddColumns(data, columnSettings)

# get the data table
table = Document.Data.Tables["Data Table"]
# place data cursors on specific columns
cursorCol1 = DataValueCursor.CreateFormatted(table.Columns["Col1"])
cursorCol2 = DataValueCursor.CreateFormatted(table.Columns["Col2"])
cursorColId = DataValueCursor.CreateFormatted(table.Columns["ID"])
cursorsList = Array[DataValueCursor]([cursorCol1, cursorCol2, cursorColId])
text = ""
rowsToInclude = IndexSet(table.RowCount, True)
# iterate through the table rows to retrieve the values
for row in table.GetRows(rowsToInclude, cursorsList):
    score = 0
    # get the current values from the cursors
    col1Val = cursorCol1.CurrentDataValue.ValidValue
    col2Val = cursorCol2.CurrentDataValue.ValidValue
    id = cursorColId.CurrentDataValue.ValidValue
    # now apply the rules for scoring
    if col1Val <= 3:
        score -= 3
    elif col1Val > 3 and col2Val > 50:
        score += 10
    else:
        score += 5
    text += "%d\t%f\r\n" % (id, score)
add_column(table, text, 'Score_Result')
For an approach with no scripting, but also no accumulation, you can use calculated columns.
To get the scores, you can use a calculated column with case statements. For Segment 1, you might have:
case
when [Col1] > 100 then 10
when [Col1] < 100 and [Col2] > 600 then 20
end
Then, once you have the scores, you can create a calculated column, say [MaxSegment]. Its expression will be Max([Segment1],[Segment2],[Segment3]...). Then display the value of [MaxSegment].
The Max function here acts as a row expression, calculating the max value across the given columns for each row.
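If you also need the name of the winning segment rather than its value, a case expression in another calculated column can pick it. A sketch assuming three segment columns; extend the comparisons for more segments:
case
  when [Segment1] >= [Segment2] and [Segment1] >= [Segment3] then "Segment1"
  when [Segment2] >= [Segment3] then "Segment2"
  else "Segment3"
end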

How to populate a dataframe column based on the value of another column

Suppose I have 3 dataframe variables: res_df_union is the main dataframe, and df_parc_res and df_parc_vacant are sub-dataframes created from res_df_union. They all share two columns, uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_parc_res and df_parc_vacant, and where they match, to set vacant_has_res in res_df_union to 1.
*Note: I am using geoPandas (a gpd GeoDataFrame) instead of just pandas because I am working with spatial data, but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works: I have a list of uniqueids that match. Now I just want to look for those unique ids in res_df_union and assign res_df_union['vacant_has_res'] = 1. When I run the following, it either crashes my IDE or never finishes (even after several hours). What am I doing wrong, and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1

for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis=1)
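The nested loop plus per-row apply is what makes this blow up: apply re-scans the whole frame once for every matching id. A minimal vectorized sketch, assuming the column names above:

# set intersection replaces the nested id-matching loops
vacant_res_ids = set(unq_id_vac) & set(unq_id_res)
# isin flags every row whose uniqueid is in the matched set, in a single pass
mask = res_df_union['uniqueid'].isin(vacant_res_ids)
res_df_union.loc[mask, 'vacant_has_res'] = 1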
