How to check a string in one column and replace it in another column using pandas

I have a big CSV file; please find sample data below:
Name,value,data
jack,X16206,hi this is X16206
Riti,X1620600,I want to change X16206.
Aadii,X16206,New value is X1620600.
jan,abc700134,something new 20600.
I have an alphanumeric value, X16206, that sometimes appears with 00 appended and sometimes without, in both the value column and the data column.
I want to take the string from the value column and replace its occurrence in the sentence in the data column with '[exact]'.
expected output:
Name,value,data
jack,X16206,hi this is [exact]
Riti,X1620600,I want to change [exact].
Aadii,X16206,New value is [exact].
jan,abc700134,something new 20600.
what I have tried so far
df1['num'] = np.where(df1['value'].str.len().isin({6,8}), 1, -1)
def myfn2(row):
    if row['num']==1:
        row['New_data']=row['data'].replace(row['value'],'[exact]')
    else:
        row['New_data']=row['data']
    return row
df1=df1.apply(myfn2,axis=1)
Output I got
Name,value,data,num,New_data
jack,X16206,hi this is X16206,1,hi this is [exact]
Riti,X1620600,I want to change X16206,1,I want to change X16206.
Aadii,X16206,New value is X1620600,1,New value is [exact]00.
jan,abc700134,something new 20600,-1,something new 20600.
Can anyone please help me how to do this?

Try:
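import re
def fn(x):
    # strip a trailing "00" from the value when it follows four digits
    v = re.sub(r"(?<=\d{4})00$", "", x["value"])
    # escape the value so regex metacharacters in it are matched literally
    return re.sub(re.escape(v) + "0?0?", "[exact]", x["data"])
df["data"] = df.apply(fn, axis=1)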
print(df)
Prints:
Name value data
0 jack X16206 hi this is [exact]
1 Riti X1620600 I want to change [exact].
2 Aadii X16206 New value is [exact].
3 jan abc700134 something new 20600.
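To check the pattern outside a DataFrame first, the same two-step idea can be sketched with only the standard library. The helper name mask_value and the rows list are illustrative assumptions, not part of the original answer; the row values are copied from the sample data in the question.

```python
import re

def mask_value(value, text, token="[exact]"):
    """Replace `value` (with or without a trailing "00") in `text` by `token`."""
    # Normalise the value: drop a trailing "00" that follows four digits.
    base = re.sub(r"(?<=\d{4})00$", "", value)
    # Escape the base so it is matched literally, then allow up to two zeros.
    pattern = re.escape(base) + r"0?0?"
    return re.sub(pattern, token, text)

# (value, data) pairs from the sample in the question
rows = [
    ("X16206", "hi this is X16206"),
    ("X1620600", "I want to change X16206."),
    ("X16206", "New value is X1620600."),
    ("abc700134", "something new 20600."),
]
results = [mask_value(value, text) for value, text in rows]
```

With a DataFrame, the helper could be applied as df["data"] = [mask_value(v, t) for v, t in zip(df["value"], df["data"])].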

Related

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
id date price
93 6021501535 2014-07-25 430000
93 6021501535 2014-12-23 700000
313 4139480200 2014-06-18 1384000
313 4139480200 2014-12-09 1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430000, 1384000]
second_list = [700000, 1400000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question I think it's not necessary to save them into a list because you could also store them somewhere else (e.g. another DataFrame) and plot them. The functions below should help with filling wherever you want to store your data.
def date(your_id):
    first_date = df.loc[(df['id']==your_id)].iloc[0,1]
    second_date = df.loc[(df['id']==your_id)].iloc[1,1]
    return first_date, second_date
def price(your_id):
    first_date, second_date = date(your_id)
    # filter on your_id rather than a hard-coded id
    price_first_date = df.loc[(df['id']==your_id) & (df['date']==first_date)].iloc[0,2]
    price_second_date = df.loc[(df['id']==your_id) & (df['date']==second_date)].iloc[0,2]
    return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If now for example you want to store your data in a new df you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1,len(selected_ids)+1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
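If plain lists are what you need, an alternative sketch (not the approach in the answer above) numbers the rows within each id using groupby(...).cumcount() and selects by that position. It assumes the rows are already sorted by date within each id; the DataFrame is rebuilt here from the sample in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [6021501535, 6021501535, 4139480200, 4139480200],
    "date": ["2014-07-25", "2014-12-23", "2014-06-18", "2014-12-09"],
    "price": [430000, 700000, 1384000, 1400000],
})

# Position of each row within its id group: 0 for the first sale, 1 for the second.
position = df.groupby("id").cumcount()

# Select prices by position; the original row order is preserved.
first_list = df.loc[position == 0, "price"].tolist()
second_list = df.loc[position == 1, "price"].tolist()
```

If the dates might be out of order, sorting with df.sort_values(["id", "date"]) first would make "first" and "second" mean earliest and next-earliest, at the cost of reordering the ids.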

How to remove a complete row when no match found in a column's string values with any object from a given list?

Please help me complete this piece of code. Let me know if any other detail is required.
Thanks in advance!
Given: a column 'PROD_NAME' from pandas dataframe of string type (e.g. Smiths Crinkle Cut Chips Chicken g), a list of certain words (['Chip', 'Chips' etc])
To do: if none of the words from the list is contained in the strings of the dataframe objects, we drop the whole row. Basically we're removing unnecessary products from a dataframe.
This is what data looks like:
Here's my code:
# create a function to Keep only those products which have
# chip, chips, doritos, dorito, pringle, Pringles, Chps, chp, in their name
def onlyChips(df, *cols):
    temp = []
    chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
    copy = cp.deepcopy(df)
    for col in [*cols]:
        for i in range(len(copy[col])):
            for item in chips:
                if item not in copy[col][i]:
                    flag = False
                else:
                    flag = True
                    break;
            # drop only those string which doesn't have any match from chips list, if flag never became True
            if not flag:
                # drop the whole row
    return <new created dataframe>
new = onlyChips(df_txn, 'PROD_NAME')
Filter the rows instead of deleting them. Create a boolean mask for each row: use str.contains on each column you need to search, then check row-wise whether any of the columns matches the given criteria. Keep only the rows where the mask is True.
search_cols = ['PROD_NAME']
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
df = df[mask]
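A self-contained sketch of the mask approach for a single column follows. The product names are hypothetical, and case=False and na=False are additions that the answer above does not use; they are included on the assumption that lowercase names and missing values may occur in the real data.

```python
import re
import pandas as pd

chips = ["Chip", "Chips", "Doritos", "Dorito", "Pringle", "Pringles", "Chps", "Chp"]

# Hypothetical product names, used only for illustration.
df = pd.DataFrame({"PROD_NAME": [
    "Smiths Crinkle Cut Chips Chicken 170g",
    "Old El Paso Salsa Dip 300g",
    "Doritos Corn Chp Supreme 380g",
    "pringles original 134g",
]})

# One alternation pattern; re.escape guards against special characters in the words.
pattern = "|".join(re.escape(word) for word in chips)

# case=False also catches lowercase names; na=False treats missing names as no match.
mask = df["PROD_NAME"].str.contains(pattern, case=False, na=False)
filtered = df[mask]
```

Because this is a substring match, a name such as "Chipotle" would also pass; wrapping the pattern in word boundaries (\b) is one way to tighten it if that matters.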

change the column value based on the multiple columns

I am developing code to search for a keyword in the given data.
For example, I have data in column A and I want to check whether a substring is present in each row: if yes, return that keyword against the data; if the keyword is not present, return 'blank'.
import pandas as pd
data = pd.read_excel("C:/Users/606736.CTS/Desktop/Keyword.xlsx")
# dropping null value columns to avoid errors
data.dropna(inplace = True)
# Converting the column to uppercase
data["Uppercase"]= data["Skill"].str.upper()
# Below is the keywords I want to search in the data
sub =['MEMORY','PASSWORD','DISK','LOGIN','RESET']
# I have used the below code, which is creating multiple columns &
# giving me the boolean output
for keyword in sub:
    data[keyword] = data.astype(str).sum(axis=1).str.contains(keyword)
# what I want is, search the keyword if it exists give me the keyword
# name else blank
Try this:
import numpy as np
data['Keyword'] = np.nan
for i in sub:
    # the first keyword (in the order of sub) found as a whole word wins
    data.loc[(data['Uppercase'].apply(lambda x: i in x.split(' '))) & (data['Keyword'].isna()), 'Keyword'] = i
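A loop-free alternative can be sketched with str.extract and a single alternation pattern. This is a different technique from the loop above, and it behaves slightly differently: it returns the keyword that appears first in the text, whereas the loop returns the first match in the order of sub. The sample Skill values are made up for illustration.

```python
import pandas as pd

sub = ["MEMORY", "PASSWORD", "DISK", "LOGIN", "RESET"]

# Hypothetical sample data standing in for the Excel file.
data = pd.DataFrame({"Skill": ["reset my password", "disk is full", "printer jam"]})
data["Uppercase"] = data["Skill"].str.upper()

# \b...\b matches whole words only; the single capture group holds the keyword.
pattern = r"\b(" + "|".join(sub) + r")\b"

# Rows without any keyword come back as NaN, which is turned into a blank string.
data["Keyword"] = data["Uppercase"].str.extract(pattern, expand=False).fillna("")
```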

Spotfire: How to increment variables to build scoring mechanism?

I'm trying to figure out how I could use variables in Spotfire (online version) to build a scoring mechanism and populate a calculated column with the final result.
I have a couple of values stored in columns that I would use to evaluate and attribute a score like this:
if column1<10 then segment1 = segment1 + 1
if column1>10 then segment2 = segment2+1
...ETC...
In the end each "segment" should have a score and I would like to simply display the name of the segment that has the highest score.
Ex:
Segment1 has a final value of 10
Segment2 has a final value of 22
Segment3 has a final value of 122
I would display Segment3 as value for the calculated column
Using only "IF" would lead to a complicated IF structure, so I'm looking for something more like a script.
Is there a way to achieve this with Spotfire?
Thanks
Laurent
To cycle through the data rows and calculate a running score, you can use an IronPython script. The script below is reading the numeric data from Col1 and Col2 of a data table named "Data Table". It calculates a score value for each row and writes it to a tab delimited text string. When done, it adds it to the Spotfire table using the Add Columns function. Note, the existing data needs to have a unique identifier. If not, the RowId() function can be used to create a calculated column for a unique row id.
from Spotfire.Dxp.Data import *
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin
from Spotfire.Dxp.Data.Import import *
from System import Array
def add_column(table, text, col_name):
    # read the text data into memory
    mem_stream = MemoryStream()
    writer = StreamWriter(mem_stream)
    writer.Write(text)
    writer.Flush()
    mem_stream.Seek(0, SeekOrigin.Begin)
    # define the structure of the text data
    settings = TextDataReaderSettings()
    settings.Separator = "\t"
    settings.SetDataType(0, DataType.Integer)
    settings.SetColumnName(0, 'ID')
    settings.SetDataType(1, DataType.Real)
    settings.SetColumnName(1, col_name)
    # create a data source from the in memory text data
    data = TextFileDataSource(mem_stream, settings)
    # define the relationship between the existing table (left) and the new data (right)
    leftColumnSignature = DataColumnSignature("Store ID", DataType.Integer)
    rightColumnSignature = DataColumnSignature("ID", DataType.Integer)
    columnMap = {leftColumnSignature:rightColumnSignature}
    ignoredColumns = []
    columnSettings = AddColumnsSettings(columnMap, JoinType.LeftOuterJoin, ignoredColumns)
    # now add the column(s)
    table.AddColumns(data, columnSettings)
#get the data table
table=Document.Data.Tables["Data Table"]
#place data cursor on a specific column
cursorCol1 = DataValueCursor.CreateFormatted(table.Columns["Col1"])
cursorCol2 = DataValueCursor.CreateFormatted(table.Columns["Col2"])
cursorColId = DataValueCursor.CreateFormatted(table.Columns["ID"])
cursorsList = Array[DataValueCursor]([cursorCol1, cursorCol2, cursorColId])
text = ""
rowsToInclude = IndexSet(table.RowCount,True)
#iterate through table column rows to retrieve the values
for row in table.GetRows(rowsToInclude, cursorsList):
    score = 0
    # get the current values from the cursors
    col1Val = cursorCol1.CurrentDataValue.ValidValue
    col2Val = cursorCol2.CurrentDataValue.ValidValue
    id = cursorColId.CurrentDataValue.ValidValue
    # now apply rules for scoring
    if col1Val <= 3:
        score -= 3
    elif col1Val > 3 and col2Val > 50:
        score += 10
    else:
        score += 5
    text += "%d\t%f\r\n" % (id, score)
add_column(table, text, 'Score_Result')
For an approach with no scripting, but also no accumulation, you can use calculated columns.
To get the scores, you can use a calculated column with case statements. For Segment 1, you might have:
case
when [Col1] > 100 then 10
when [Col1] < 100 and [Col2] > 600 then 20
end
Then, once you have the scores, you can create a calculated column, say [MaxSegment]. The expression for this will be Max([Segment1],[Segment2],[Segment3]...). Then display the value of [MaxSegment].
The max function in this case is acting as a row expression and is calculating the max value across the row of the columns given.
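The question asks for the name of the winning segment rather than its score. That selection step can be sketched in plain Python (no Spotfire API is involved here). The column names, the third rule and its weight are assumptions made for illustration; the first two rules follow the ones in the question.

```python
def top_segment(row):
    """Return the name of the segment with the highest score for one row."""
    scores = {"Segment1": 0, "Segment2": 0, "Segment3": 0}

    # Rules modelled on the question; the column2 rule is hypothetical.
    if row["column1"] < 10:
        scores["Segment1"] += 1
    if row["column1"] > 10:
        scores["Segment2"] += 1
    if row["column2"] > 50:
        scores["Segment3"] += 2

    # max() with key= picks the name whose score is largest;
    # on a tie, the first segment in insertion order wins.
    return max(scores, key=scores.get)

# Hypothetical rows, one dict per data row.
rows = [
    {"column1": 5, "column2": 20},
    {"column1": 15, "column2": 20},
    {"column1": 15, "column2": 80},
]
winners = [top_segment(row) for row in rows]
```

Keeping the scores in a dictionary avoids the nested IF structure: each rule is an independent increment, and the winner is chosen once at the end.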

How to perform excel MATCHINDEX or lookups in Python?

I need to find the corresponding row value from the first column (type) for the max value of each column in the DataFrame below.
Let's suppose this is the df3 DataFrame shown below.
I need to find the corresponding "type" row value for the max value of every numeric column.
for example: max(price) => 8,85,113 => corr_logx (this is the output I need to find for all the numeric variables)
type price Bedroom_3 Bedroom_4
corr_x 3,56,315 5,01,687 6,05,458
corr_logx 8,85,113 2,27,955 1,28,834
corr_expx 8,34,503 3,62,952 2,30,759
corr_sqrtx 6,29,162 3,36,964 8,96,593
corr_sqrex 7,79,030 8,54,539 6,07,960
Output I need:
Variable transfrm_used
Price corr_logx
Bedroom_3 corr_sqrex
Bedroom_4 corr_sqrtx
For this:
I created a list of numeric column names and used it as range in for-loop. This helps me to step in each column.
Within each column then I created another for loop which will step in each row to match with the max value of the column.
If value matches then it should result in relevant value from the first column/ column 0/ column name - type. Otherwise it should continue looking for the match.
cols_list = df3.columns.difference(['type'])
transfrm_used = []
variable = []
for col_name in cols_list:
    variable.append(col_name) # this gives the respective column name
    print(variable)
    for rows in range(0,5):
        if df3[rows,col_name] == np.max(df3.col_name): # works as Match
            transfrm_used.append(df3[rows,0])
        else:
            continue
print('done')
I am looking for a result format where I can get both column names like price, Bedroom_3 etc. and its relevant value from type column like corr_logx.
In Excel this is done using INDEX/MATCH and lookups. Here is the complete data frame and the expected result:
type price Bedroom_3 Bedroom_4
corr_x 3,56,315 5,01,687 6,05,458
corr_logx 8,85,113 2,27,955 1,28,834
corr_expx 8,34,503 3,62,952 2,30,759
corr_sqrtx 6,29,162 3,36,964 8,96,593
corr_sqrex 7,79,030 8,54,539 6,07,960
max_price max_Bedroom_3 max_Bedroom_4
max_value: 8,85,113 8,54,539 8,96,593
relevant_type: corr_logx corr_sqrex corr_sqrtx
If you're sure each column's max value occurs only once, .max() is enough to find it. Then we can look up the type of that row:
max_value = {}
relevant_type = {}
for col_name in cols_list:
    max_val = df3[col_name].max()
    max_value[col_name] = max_val
    # .iloc[0] takes the single matching 'type' value instead of a Series
    relevant_type[col_name] = df3.loc[df3[col_name] == max_val,'type'].iloc[0]
Then, we can add these two dicts as rows at the bottom of the dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df3 = pd.concat([df3, pd.DataFrame([max_value, relevant_type])], ignore_index=True)
Or we can make a new dataframe:
combined_dict = {x: [max_value[x],relevant_type[x]] for x in max_value}
df = pd.DataFrame(combined_dict, index=['max value','relevant type'])
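The closest pandas counterpart to an Excel INDEX/MATCH lookup on the maximum is idxmax, which returns the index label of the largest value in each column. The sketch below is an alternative to the loop above; it assumes the comma-formatted numbers from the question have already been parsed as integers.

```python
import pandas as pd

df3 = pd.DataFrame({
    "type": ["corr_x", "corr_logx", "corr_expx", "corr_sqrtx", "corr_sqrex"],
    "price": [356315, 885113, 834503, 629162, 779030],
    "Bedroom_3": [501687, 227955, 362952, 336964, 854539],
    "Bedroom_4": [605458, 128834, 230759, 896593, 607960],
})

# With 'type' as the index, idxmax() returns the type label of each column's max.
numeric = df3.set_index("type")
result = pd.DataFrame({
    "max_value": numeric.max(),
    "relevant_type": numeric.idxmax(),
})
```

If a column's max occurs more than once, idxmax reports only the first occurrence.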
