I am developing code for searching for a keyword in some given data.
For example, I have data in column A and I want to check whether a
substring is present in each row: if yes, return that keyword against
the data, and if no keyword is present, return a blank.
import pandas as pd
data = pd.read_excel("C:/Users/606736.CTS/Desktop/Keyword.xlsx")
# dropping null value columns to avoid errors
data.dropna(inplace = True)
# Converting the column to uppercase
data["Uppercase"]= data["Skill"].str.upper()
# Below is the keywords I want to search in the data
sub =['MEMORY','PASSWORD','DISK','LOGIN','RESET']
# I have used the below code, which is creating multiple columns &
# giving me the boolean output
for keyword in sub:
    data[keyword] = data.astype(str).sum(axis=1).str.contains(keyword)
# What I want is: search for the keyword; if it exists, give me the keyword
# name, else blank
Try this:
import numpy as np

data['Keyword'] = np.nan
for i in sub:
    data.loc[data['Uppercase'].apply(lambda x: i in x.split(' ')) & data['Keyword'].isna(), 'Keyword'] = i
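As an alternative sketch (using the same data and sub as above): str.extract can pull out the first keyword that matches as a whole word in one vectorized call, returning NaN where nothing matches.

pattern = r'\b(' + '|'.join(sub) + r')\b'
data['Keyword'] = data['Uppercase'].str.extract(pattern, expand=False)
# Replace the NaN produced for non-matching rows with a blank string.
data['Keyword'] = data['Keyword'].fillna('')

If a row contains more than one keyword, the leftmost match in the string is the one returned.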
I have a data frame (df).
The data frame contains a string column called supported_cpu, whose values are comma-separated strings.
I want to use this data for an ML model.
I had to get unique values for the column (supported_cpu). The output is a (list) of unique values.
def pars_string(df, col):
    # Separate the column from the string using split
    data = df[col].value_counts().reset_index()
    data['index'] = data['index'].str.split(",")
    # Create a list including all of the items, which are separated by commas
    df_01 = []
    for i in range(data.shape[0]):
        for j in data['index'][i]:
            df_01.append(j)
    # get unique values from df_01
    list_01 = list(set(df_01))
    # there are leading or trailing spaces in list_01 which need to be removed to get unique values
    list_02 = [x.strip(' ') for x in list_01]
    # get unique values from list_02
    list_03 = list(set(list_02))
    return list_03
supported_cpu_list = pars_string(df=df,col='supported_cpu')
The output is a list of the unique CPU values.
I want to map this output to the data frame to encode it for the ML model.
How could I store the output in the data frame? Note: some rows have multiple values (more than one CPU).
Input: string type separated by a comma
Output: I do not know what it should be.
I really recommend that anyone who is starting to use pandas read about vectorization and think in terms of columns (aka Series). That is the way it was built and the way it is supposed to be used.
From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. You can use the Series string methods to split that column, then flatten the resulting lists with itertools.chain:
from itertools import chain

df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
unique_vals = set(chain(*df['supported_cpu'].tolist()))
# drop empty entries and strip the leading/trailing spaces noted in the question
unique_vals = {item.strip() for item in unique_vals if item}
Rows with multiple values should be parsed into single values before ML model training. The list can be converted to a dataframe simply with pd.DataFrame(supported_cpu_list).
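If the goal is to feed these values to an ML model, one hedged sketch is to one-hot encode each CPU into its own 0/1 column with str.get_dummies (this assumes the original comma-separated strings are still in df['supported_cpu'], i.e. it is run before the .str.split above):

cpu_dummies = df['supported_cpu'].str.get_dummies(sep=',')
# Strip the leading/trailing spaces mentioned in the question from the new column names,
# then merge any columns that differ only by that whitespace.
cpu_dummies.columns = cpu_dummies.columns.str.strip()
cpu_dummies = cpu_dummies.T.groupby(level=0).max().T
df = df.join(cpu_dummies)

Each row then carries a 1 for every CPU listed in it and 0 elsewhere, which handles the multi-value rows directly.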
I have a pandas object df and I would like to save that to .csv:
df.to_csv('output.csv', index = False)
Even though the data frame is displayed correctly in the terminal after printing, in the *.csv some lines are shifted several columns forward. I do not know how to demonstrate that in minimal working code; I tried it with the one problematic column alone, and that column came out correctly in the *.csv. What should I check, please? The whole column contains strings.
After advice:
selected['SpType'] = selected['SpType'].str.replace('\t', '')
I obtained an error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
selected['SpType'] = selected['SpType'].str.replace('\t', '')
If the tabs are the problem, you could just replace all tabs.
If the tabs occur in column column_name you could do something like:
df['column_name'] = df['column_name'].str.replace('\t', '')
If the problem is in several columns, you could loop over all columns. eg.:
for col in df.columns:
    df[col] = df[col].str.replace('\t', '')
df.to_csv('output.csv', index=False)
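The SettingWithCopyWarning from the question is a separate issue: it indicates that selected is (or may be) a slice of another dataframe. A minimal sketch of one common way to avoid it, assuming selected was created by filtering or slicing a larger frame, is to take an explicit copy before assigning:

# Work on an explicit copy so the assignment no longer targets a view of the original frame.
selected = selected.copy()
selected['SpType'] = selected['SpType'].str.replace('\t', '')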
Please help me complete this piece of code. Let me know if any other detail is required.
Thanks in advance!
Given: a column 'PROD_NAME' from a pandas dataframe of string type (e.g. "Smiths Crinkle Cut Chips Chicken g"), and a list of certain words (['Chip', 'Chips', etc.]).
To do: if none of the words from the list is contained in a row's string, drop the whole row. Basically we are removing unnecessary products from the dataframe.
Here's my code:
import copy as cp

# create a function to keep only those products which have
# chip, chips, doritos, dorito, pringle, pringles, chps, chp in their name
def onlyChips(df, *cols):
    temp = []
    chips = ['Chip', 'Chips', 'Doritos', 'Dorito', 'Pringle', 'Pringles', 'Chps', 'Chp']
    copy = cp.deepcopy(df)
    for col in [*cols]:
        for i in range(len(copy[col])):
            for item in chips:
                if item not in copy[col][i]:
                    flag = False
                else:
                    flag = True
                    break
            # drop only those strings which don't have any match from the chips list (flag never became True)
            if not flag:
                # drop the whole row
    return <new created dataframe>
new = onlyChips(df_txn, 'PROD_NAME')
Filter the rows instead of deleting them. Create a boolean mask for each row. Use str.contains on each column you need to search and see if any of the columns match the given criteria row-wise. Filter the rows if not.
search_cols = ['PROD_NAME']
mask = df[search_cols].apply(lambda x: x.str.contains('|'.join(chips))).any(axis=1)
df = df[mask]
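Note that str.contains is case-sensitive by default, so lowercase variants like 'chips' would be missed by the list above. A hedged variant of the same mask (same chips list and search_cols) that ignores case and treats missing product names as non-matches:

pattern = '|'.join(chips)
mask = df[search_cols].apply(lambda x: x.str.contains(pattern, case=False, na=False)).any(axis=1)
df = df[mask]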
I have 2 data frames (final_combine_df & acs_df) that share a column ('CBG'). Dataframe acs_df has 2 additional columns that I want to add to the combined dataframe (acs_total_persons & acs_total_housing_units). For the 'CBG' values in acs_df that match those in final_combine_df, I want to add the acs_total_persons & acs_total_housing_units values to that row.
acs_df.head()
           CBG  acs_total_persons  acs_total_housing_units
0  10010211001             1925.0                   1013.0
1  10030114011             2668.0                   1303.0
2  10070100043              930.0                    532.0
3  10139534001             1570.0                    763.0
4  10150021023             1059.0                    379.0
I tried combine_acs_merge = pd.concat([final_combine,acs_df], sort=True) but it did not seem to match them up. I also tried combine_acs_merge = final_combine.merge(acs_df, on='CBG') and got
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
What do I need to do here?
Note: Column acs_df['CBG'] is type numpy.float64, not a string but it should still return. Oddly, when I run the following: print(acs_df.loc[acs_df['CBG'] == '01030114011']) it returns an empty dataframe. I created the acs_df from a csv file (see below). Is that creating a problem?
acs_df = pd.read_csv(acs_data)
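The ValueError means the two 'CBG' key columns have different dtypes (one was parsed as a number, the other kept as a string), and the empty result for '01030114011' suggests a leading zero was dropped when the CSV was read. A hedged sketch of one fix, assuming the CBG codes are 11-digit identifiers and acs_df['CBG'] has no missing values, is to cast both keys to zero-padded strings before merging:

# Cast both key columns to the same string type; zfill restores any leading zero
# that was lost when the column was parsed as a number.
acs_df['CBG'] = acs_df['CBG'].astype('int64').astype(str).str.zfill(11)
final_combine['CBG'] = final_combine['CBG'].astype(str).str.zfill(11)
combine_acs_merge = final_combine.merge(
    acs_df[['CBG', 'acs_total_persons', 'acs_total_housing_units']],
    on='CBG', how='left')

how='left' keeps every row of final_combine and attaches the two acs columns only where the CBG values match.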
I want to write values from a dataframe into a tkinter treeview/table, but I am not able to do this.
my code:
from tkinter import Tk, filedialog
from tkinter import ttk
import pandas as pd

# Setting up the tkinter window.
root = Tk()
tree = ttk.Treeview(root)
# Taking file input through a dialog box from the user.
file = filedialog.askopenfile(parent=root, mode='rb', title='Choose a xlsx file')
# Reading the excel file selected by the user and then creating a dataframe of that file.
xls = pd.read_excel(file)
df = pd.DataFrame(xls)
# Taking all the column headings in a variable "df_col".
df_col = df.columns.values
# All the column names are generated dynamically.
tree["columns"] = (df_col)
counter = len(df)
# Generating a for loop to create columns and give headings to them through the df_col var.
for x in range(len(df_col)):
    tree.column(x, width=100)
    tree.heading(x, text=df_col[x])
# Generating a for loop to print values of the dataframe in the treeview columns.
for i in range(counter):
    tree.insert('', 0, values=(df[df_col[x]]][i]))
It is not printing the columns, and it shows KeyError: 0.
The first argument of tree.column() should be the column name, which you assigned with:
tree["columns"]=(df_col)
The problem is that you have named the columns using strings, but you are attempting to access them using integers in:
for x in range(len(df_col)):
    tree.column(x, width=100)
    tree.heading(x, text=df_col[x])
Above, you are attempting to access tree.column(0) instead of tree.column('Company'), hence the key error.
Try instead:
for x in range(len(df_col)):
    tree.column(df_col[x], width=100)
    tree.heading(df_col[x], text=df_col[x])
Note that df_col is an ndarray, not a dataframe, which is why df_col[x] works correctly (df[x] would give a key error). This is because df.columns.values returns an ndarray. As a side note, it may be a bit confusing to name an ndarray df_col.
There are also a few issues with your insert. The second argument should correspond to the index of the entry you wish to address. One solution is then to use a row index as the second argument, followed by a row label as text="rowLabel", followed by a list of values for the row:
tree.insert('', i, text=rowLabels[i], values=df.iloc[i,:].tolist())
Where rowLabels should be defined as whatever you want to use in the first column of the table. I would suggest using an index column from the spreadsheet here, if possible. It could be defined by:
rowLabels = df.iloc[:,indexColumn].tolist()
or:
rowLabels = df.index.tolist()
The latter is viable if df has named indices defined by a column during the spreadsheet import. In the former, indexColumn is an int referring to a column number in df that contains unique identifiers.
The option values=df.iloc[i,:].tolist() converts all columns of the ith row into a list, and, since we have passed an index value (the second argument) that gets larger, the call will insert a new row every loop (from the python tkinter docs entry on Treeview --> insert: "if index is greater than or equal to the current number of children, it is inserted at the end").
Finally, I am not sure if you did not post the end of your code, but, in order for the tree to show up, you will also need to use pack, grid, etc.
tree.pack()
or
tree.grid(row=0, column=0)
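Putting the pieces above together, a minimal sketch of the two corrected loops (assuming root, tree, df and df_col are defined as in the question, and using the dataframe index as the row label) might look like:

# Configure one Treeview column per dataframe column.
for col in df_col:
    tree.column(col, width=100)
    tree.heading(col, text=col)
# Insert one row per dataframe row, using the dataframe index as the row label.
for i in range(len(df)):
    tree.insert('', i, text=str(df.index[i]), values=df.iloc[i, :].tolist())
tree.pack()
root.mainloop()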
References:
https://docs.python.org/3/library/tkinter.ttk.html#tkinter.ttk.Treeview
This helpful example makes a few of the steps clear:
https://knowpapa.com/ttk-treeview/
As I was reading over your code, I noticed that on the last line you have an extra bracket:
df[df_col[x]]]
for i in range(counter):
    tree.insert('', 0, values=(df[df_col[x]]][i]))
I would assume that would explain the KeyError.