Conditionally concatenate variable names into a new variable in Python - python-3.x

I have a data set with 3 columns and occasional NAs. I am trying to create a new string column called 'check' that concatenates, between underscores ('_'), the names of the variables that are not NA in each row. I pasted my code below, as well as the data that I have, the data that I need, and what I actually get (see the links after the code). For some reason the conditional I have in place seems to be completely ignored, and example_set['check'] = example_set['check'] + column is executed on every loop iteration, with or without the conditional block. I assume there is a Python/Pandas quirk that I haven't fully understood... Can you please help?
import numpy as np
import pandas as pd

example_set = pd.DataFrame({
    'A': [3, 4, np.nan],
    'B': [1, np.nan, np.nan],
    'C': [3, 4, 5]
})
example_set

columns = list(example_set.columns)
example_set['check'] = '_'
for column in columns:
    for row in range(example_set.shape[0]):
        if example_set[column][row] != np.nan:
            example_set['check'] = example_set['check'] + column
        else:
            continue
example_set
Data that I have
Data that I was hoping to get
What I actually get

First, the reason your conditional is ignored: np.nan != np.nan always evaluates to True, because NaN never compares equal to anything, including itself. Use pd.isna / pd.notna to test for missing values instead. As for an alternative approach: find the cells that have null values with isna, pick out the NaN column names per row with numpy.compress, take the difference of those from the full set of columns, format the strings to your taste and create the new column:
columns = example_set.columns
# note: Index.difference returns the sorted complement, which here matches the A, B, C order
example_set['check'] = [f'_{"".join(columns.difference(np.compress(boolean, columns)))}_'
                        for boolean in example_set.isna().to_numpy()]
     A    B  C  check
0  3.0  1.0  3  _ABC_
1  4.0  NaN  4   _AC_
2  NaN  NaN  5    _C_
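For what it's worth, a common pandas idiom gets the same result without compress: matrix-multiplying a boolean frame with the (object-dtype) column names via dot concatenates the name of every True cell, row by row. A minimal sketch, assuming the same example_set as above:
cols = ['A', 'B', 'C']
# notna() marks the non-missing cells; .dot string-concatenates the matching names per row
example_set['check'] = '_' + example_set[cols].notna().dot(pd.Index(cols)) + '_'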

The simplest strategy is to copy the df and create the new column in the copy. Iterate over the rows of the old df, filter out the NaN cells, then take the index labels that remain.
Join them into a string and put those values into the new df.
It is probably not the most efficient method, but it should be easy to understand.
Here is some code to get you going:
nset = example_set.copy()
nset["checked"] = "__"  # placeholder; use "_" if you want the exact _ABC_ format
for s in range(example_set.shape[0]):
    serie = example_set.iloc[s]             # one row as a Series
    nserie = serie[serie.notnull()]         # drop the NaN cells
    names = "".join(nserie.index.tolist())  # the remaining column names
    nset.at[s, "checked"] = "__" + names + "__"
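On the three-column example frame (before any check column is added), this should produce something like:
     A    B  C  checked
0  3.0  1.0  3  __ABC__
1  4.0  NaN  4   __AC__
2  NaN  NaN  5    __C__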

Please try:
import math

import numpy as np
import pandas as pd

example_set = pd.DataFrame({
    'A': [3, 4, np.nan],
    'B': [1, np.nan, np.nan],
    'C': [3, 4, 5]
})

example_set['check'] = '_ABC_'
for i in range(len(example_set)):
    list_ = example_set.iloc[i].values.tolist()
    # .at avoids the chained-assignment pitfall of example_set['check'][i] = ...
    if math.isnan(list_[0]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('A', '')
    if math.isnan(list_[1]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('B', '')
    if math.isnan(list_[2]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('C', '')
Output:
     A    B  C  check
0  3.0  1.0  3  _ABC_
1  4.0  NaN  4   _AC_
2  NaN  NaN  5    _C_

Related

How to find complete empty row in pandas

I am working on a dataset in which I need to find the completely empty rows.
example:
     A    B    C    D
0  NaN  NaN  NaN  NaN
1    1   ss  NaN  3.0
2    2   bb   w2  4.0
3  NaN  NaN  NaN  NaN
Currently, I am using
import pandas as pd

nan_col = []
for col in df.columns:
    if df.loc[df[col].isnull()].empty != True:
        nan_col.append(col)
But this captures the null values in the specified columns, whereas I need to capture the null rows.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify a completely null row in the dataframe?
You can check whether every value in a row is missing with DataFrame.isna and DataFrame.all(axis=1), and then get the index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
# alternative; slower on a huge dataframe
# L = df[df.isna().all(axis=1)].index.tolist()
print(L)
[0, 3]
Or you could use dropna with set and sorted: get the index after dropping the rows that are all NaN, get the index of the whole dataframe, and use ^ (symmetric difference) to find the values that aren't in both indexes; sorted then turns the resulting set back into a sorted list, like below:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If you might have a duplicate index, you can use a list comprehension that iterates through the whole df's index and keeps a value when its position isn't in the dropna index; enumerate supplies the positions, so it would still work even if all the index labels are duplicates, like below:
idx = df.dropna(how='all').index
print([i for index, i in enumerate(df.index) if index not in idx])
Both snippets output:
[0, 3]
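For reference, a self-contained sketch that builds the question's frame and runs both approaches (the column values are transcribed from the example above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1, 2, np.nan],
                   'B': [np.nan, 'ss', 'bb', np.nan],
                   'C': [np.nan, np.nan, 'w2', np.nan],
                   'D': [np.nan, 3.0, 4.0, np.nan]})

print(df.index[df.isna().all(axis=1)].tolist())                 # [0, 3]
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))  # [0, 3]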

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another, empty column to it with the same number of rows.
To make it clearer: currently my data frame has 1050 rows (and only one column); I want the new shape to be 1050×2, with the second column completely empty.
A pandas DataFrame always has columns, so to add a new default column filled with missing values, use the current number of columns as the new column's name:
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# which, for a one-column df like this, is the same as
# df[1] = np.nan
print(df)
   0   1
0  2 NaN
1  3 NaN
2  4 NaN
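If you'd rather not compute the name yourself, a hedged alternative is reindex, which can add several empty columns at once (the extra column names here are just illustrative):
# columns that don't exist yet are created and filled with NaN
df = df.reindex(columns=[0, 1, 2, 3])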

Create pandas dataframe from csv rows in string or list format

I convert some data into a CSV-string-like format, row by row. For example, the rows look like this:
string format
1st row: "A,B,R,K,S,E"
2nd row: "B,C,S,E,G,Q,W,R,W" # sometimes longer rows
3rd row: "A,E,R,E,S" # sometimes shorter rows
or list format
1st row: ['A','B','R','K','S','E']
2nd row: ['B','C','S','E','G','Q','W','R','W']
3rd row: ['A','E','R','E','S']
I can also add \n at the end of each row.
I want to create a pandas dataframe from these rows, but I'm not sure how.
Normally I would save this data into a .csv file and then use pd.read_csv, but I want to skip that step.
Thanks for the help
This will solve your problem:
import pandas as pd

First_row = ['A','B','R','K','S','E']
Second_row = ['B','C','S','E','G','Q','W','R','W']
Third_row = ['A','E','R','E','S']
df = pd.DataFrame({'1st row': pd.Series(First_row),
                   '2nd row': pd.Series(Second_row),
                   '3rd row': pd.Series(Third_row)})
answer = df.T
answer
         0  1  2  3  4    5    6    7    8
1st row  A  B  R  K  S    E  NaN  NaN  NaN
2nd row  B  C  S  E  G    Q    W    R    W
3rd row  A  E  R  E  S  NaN  NaN  NaN  NaN
Method 1: from lists
Pass a 2D list (a list of row lists) straight to the DataFrame constructor; each inner list becomes a row and shorter rows are padded with NaN. Passing the lists as separate columns, as above, would put the values column-wise instead, which is why the transpose was needed. A sketch of both methods follows below.
Method 2: from strings
Split each string on ',' to reduce it to the list case, or join the strings with newlines and parse them with pd.read_csv via io.StringIO.
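A minimal sketch of both methods, reusing the rows from the question (names=range(9) assumes you know the maximum row length):
import io
import pandas as pd

rows_as_lists = [['A', 'B', 'R', 'K', 'S', 'E'],
                 ['B', 'C', 'S', 'E', 'G', 'Q', 'W', 'R', 'W'],
                 ['A', 'E', 'R', 'E', 'S']]
# Method 1: one row per inner list; short rows are padded with NaN
df_from_lists = pd.DataFrame(rows_as_lists)

rows_as_strings = ["A,B,R,K,S,E", "B,C,S,E,G,Q,W,R,W", "A,E,R,E,S"]
# Method 2: feed the joined strings to read_csv without touching the disk;
# explicit names stop the parser from choking on the ragged row lengths
df_from_strings = pd.read_csv(io.StringIO('\n'.join(rows_as_strings)),
                              header=None, names=range(9))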

Merging sheets of excel using python

I am trying to take the data from two sheets and compare them with each other; where the names match, I want to append a column. Let me explain what I am doing and what output I am trying to get using Python.
This is my sheet1 from excel.xlsx:
It contains four columns: NAME, CLASS, AGE and GROUP.
This is my sheet2 from excel.xlsx:
It contains a DEFAULT column and a NAME column with extra names in it.
So now I am trying to match the names in sheet2 with those in sheet1; if a name in sheet1 matches one in sheet2, I want to add the DEFAULT value corresponding to that name from sheet2.
This is the output I need:
As you can see, only Ravi and Neha have a DEFAULT in sheet2 whose name matches a name in sheet1. Suhas and Aish don't have any DEFAULT value, so nothing appears for them.
This is the code I tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I get an output excel like this:
There is no DEFAULT against Ravi.
Please help me get the expected output using Python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2), it's quite easy, and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map  (note: df['Default'] = ..., since attribute assignment can't create a new column)
df['Default'] = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will have the following output:
    Name  Default  Class  Age Group
0    NaN      NaN      4    2   tig
1   Ravi      2.0      5    5  rose
2    NaN      NaN      3    3  lily
3  Suhas      NaN      5    5  rose
4    NaN      NaN      2    2   sun
5   Neha      3.0      5    5  rose
6    NaN      NaN      5    2   sun
7   Aish      NaN      5    5  rose
Then you overwrite the original sheet using df.to_excel.
EDIT
The code you shared has 3 problems, one of which seems to be a language barrier: you only need one of the options I gave you. Secondly, there was a missing ' when reading the first sheet into df1. And lastly, you were inconsistent with the dataframe names: you defined df1 and df2 but then used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd

df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')  # here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
# Now choose one of the options; I used map here, but you can pick any of them
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx', index=False)
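If you'd rather write the result back into the original workbook instead of a new play.xlsx, one option is pd.ExcelWriter (a sketch; the file and sheet names are the question's):
# writes both sheets into one workbook; note this recreates the file,
# so any sheets not written here would be lost
with pd.ExcelWriter('stack.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)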

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then each element in the column is divided by the maximum value of that column, giving a value between 0 and 1 for every element, in the original order. Each result is appended to a list which should be added to the source spreadsheet as a new column.
Currently, the nested loops perform correctly apart from the final step, as far as I understand. A column is added to the spreadsheet for each source column, EXCEPT that its values are always those computed for the final column of the source spreadsheet rather than for each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    # listname = i[4:] + '_norm'
    df2 = pd.read_excel(i, header=0, index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j / maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc + filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet, renamed according to listname. But the data in each of these new columns is identical and relates to the final column of the spreadsheet. It seems to be overwriting the values on each pass (as if looping through the entire spreadsheet rather than outputting per column) before adding the result to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this. You can simplify your code:
import pandas as pd

# Make an example dataframe, since none was provided
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [5, 6, 7, 8]})
print(df)

   col1  col2
0     1     5
1     2     6
2     3     7
3     4     8
Now we can use DataFrame.apply to divide each column by its maximum, add_suffix to give the new columns a _norm suffix, and pd.concat to join them to the original dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x / x.max()).add_suffix('_norm')], axis=1)
print(df_conc)

   col1  col2  col1_norm  col2_norm
0     1     5       0.25      0.625
1     2     6       0.50      0.750
2     3     7       0.75      0.875
3     4     8       1.00      1.000
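To fold this back into the original multi-file loop, a rough sketch (distlist, the save location and the filename slicing are carried over from the question; it assumes all remaining columns are numeric):
import pandas as pd

for i in distlist:
    df = pd.read_excel(i, header=0).dropna(axis=0, how='any')
    # normalize every column by its own maximum and suffix the new names
    norm = df.apply(lambda x: x / x.max()).add_suffix(' norm')
    pd.concat([df, norm], axis=1).to_excel('E:/test/' + i[:-18] + '_Normalised.xlsx',
                                           index=False)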
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't noticeable.
Thanks for your help @Erfan
