How to add value to specific index that is out of bounds - python-3.x

I have a list of lists
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?

According to the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd

# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
   0  1  2  3  4
0  1  1  1  1  1
1  2  2  2  2  2
2  3  3  3  3  3
3  4  4  4  4  4
4  5  5  5  5  5
for adding empty columns:
# add new columns, not empty but filled w/ NaN
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7, row index 4
df.loc[4, 7] = 123
print(df)
output:
   0  1  2  3  4   5   6      7
0  1  1  1  1  1 NaN NaN    NaN
1  2  2  2  2  2 NaN NaN    NaN
2  3  3  3  3  3 NaN NaN    NaN
3  4  4  4  4  4 NaN NaN    NaN
4  5  5  5  5  5 NaN NaN  123.0
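As a side note, the three NaN columns could also be added in one step with DataFrame.reindex; a minimal sketch, assuming the target column labels are 0-7 as in the example above:
# reindex adds any missing column labels and fills them with NaN
df = df.reindex(columns=range(8))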

Related

Multiplying 2 pandas dataframes generates nan

I have 2 dataframes as below
import pandas as pd
dat = pd.DataFrame({'val1' : [1,2,1,2,4], 'val2' : [1,2,1,2,4]})
dat1 = pd.DataFrame({'val3' : [1,2,1,2,4]})
Now I want to multiply each column of dat by dat1, so I did the below:
dat * dat1
However, this generates NaN values for all elements.
Could you please help with the correct approach? I could run a for loop over each column of dat, but I wonder if there is a better method available to do the same.
Thanks for your pointer.
When doing multiplication (or any arithmetic operation), pandas performs index alignment. For dataframes this goes for both the index and the columns. Where labels match, it multiplies; otherwise it puts NaN, and the result has the union of the indices and columns of the operands.
So, to "avoid" this alignment, make dat1 a label-unaware data structure, e.g., a NumPy array:
In [116]: dat * dat1.to_numpy()
Out[116]:
   val1  val2
0     1     1
1     4     4
2     1     1
3     4     4
4    16    16
To see what's "really" being multiplied, you can align yourself:
In [117]: dat.align(dat1)
Out[117]:
(   val1  val2  val3
 0     1     1   NaN
 1     2     2   NaN
 2     1     1   NaN
 3     2     2   NaN
 4     4     4   NaN,
    val1  val2  val3
 0   NaN   NaN     1
 1   NaN   NaN     2
 2   NaN   NaN     1
 3   NaN   NaN     2
 4   NaN   NaN     4)
(Extra: dat and dat1 happen to share the same index here; change the index of one of them and align again to see the union behaviour.)
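A minimal sketch of that experiment (dat1_shifted is just an illustrative name):
# give dat1 a different index, then align: both results now span the union 0..6
dat1_shifted = dat1.copy()
dat1_shifted.index = range(2, 7)
left, right = dat.align(dat1_shifted)
print(left)
print(right)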
You need to change two things:
use mul with axis=0
use a Series instead of dat1 (otherwise the multiplication will try to align the columns, and there are no common column names between your two dataframes)
out = dat.mul(dat1['val3'], axis=0)
output:
   val1  val2
0     1     1
1     4     4
2     1     1
3     4     4
4    16    16
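As a side note, dat1.squeeze() collapses the single-column frame into the same Series without hard-coding the column name, so the following should be equivalent:
out = dat.mul(dat1.squeeze(), axis=0)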

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is only joined to the product with the maximum sequence number; the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for the margin and item tables
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
    new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc (using numpy for the NaN):
import numpy as np

new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
               .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temp column to both frames: in df_item it is True where the sequence is maximal, and in df_margin it is True everywhere; then merge with how='outer' and drop the temp column:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max()
              .reset_index().merge(df_margin), how='left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')

How to change the value of a cell that contains NaN to another specific value?

I have a dataframe that contains NaN values in a particular column. While iterating through the rows, if I come across a NaN (using the isnan() method), I need to change it to some other value (since I have some conditions). I tried using replace() and fillna() with the limit parameter as well, but they modify the whole column when they come across the first NaN value. Is there any method by which I can assign a value to a specific NaN rather than changing all the values of a column?
Example: the dataframe looks like this:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect it to be like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I want to change the target_class values based on some conditions and assign the values from the above list.
I believe you need to replace NaN values with 1 only for the indexes specified in the list idx:
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
EDIT:
From the comments, the solution is to assign values by index and column labels with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
I suppose your conditions for imputing the NaN values do not depend on the number of them in a column. In the code below I store all the imputation rules in one function that receives as parameters the entire row (containing the NaN) and the column you are investigating. If you also need the whole dataframe for the imputation rules, just pass it into the replace_nan function. In the example I impute the col element with the mean of the other columns.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row

df = pd.DataFrame(np.random.rand(5, 3), columns=['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing you should do is make the right assignment, that is, assign values only in the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the column 'label'. Then we load the dataset:
df = pd.read_csv('dataset.csv')
Now we build the appropriate condition:
cond = df['label'].isnull()
Now we make the assignment over those rows (I don't know your assignment logic, so I simply assign the value 1 to the NaNs):
df.loc[cond, 'label'] = 1
There are other, more concise approaches; the fillna() method could also be used, but you would need to provide the logic for it.
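For instance, a minimal sketch of the fillna() route, assuming the fill values per row index are known in advance (the 1.0 and 0.0 below are only illustrative):
import pandas as pd

df = pd.read_csv('dataset.csv')
# fillna with a Series aligns on the index, so only rows 3 and 4 are filled
fill_values = pd.Series({3: 1.0, 4: 0.0})
df['label'] = df['label'].fillna(fill_values)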

Copy and Paste Values Based on a Condition in Python

I am trying to populate column 'C' with values from column 'A' based on conditions in column 'B'. Example: if column 'B' is NaN, then the row in column 'C' should equal the row in column 'A'. If column 'B' is NOT NaN, then leave column 'C' as is (i.e. NaN). Finally, the values in column 'A' that were copied to column 'C' should be removed from column 'A'.
Original Dataset:
index A B C
0 6 nan nan
1 6 nan nan
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 3 nan nan
8 4 nan nan
Output:
index A B C
0 nan nan 6
1 nan nan 6
2 9 3 nan
3 9 3 nan
4 2 8 nan
5 2 8 nan
6 3 4 nan
7 nan nan 3
8 nan nan 4
Below is what I have tried so far, but it's not working.
def impute_unit(cols):
    Legal_Block = cols[0]
    Legal_Lot = cols[1]
    Legal_Unit = cols[2]

    if pd.isnull(Legal_Lot):
        return 3
    else:
        return Legal_Unit

bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
                                           'Legal_Unit']].apply(impute_unit, axis=1)
Seems like you need:
import numpy as np

df['C'] = np.where(df.B.isna(), df.A, df.C)
df['A'] = np.where(df.B.isna(), np.nan, df.A)
A different, maybe fancy way to do it would be to swap A and C values only when B is np.nan
m = df.B.isna()
df.loc[m, ['A', 'C']] = df.loc[m, ['C', 'A']].values
In other words, change
bk_Final_tax['Legal_Unit'] = bk_Final_tax[['Legal_Block', 'Legal_Lot',
                                           'Legal_Unit']].apply(impute_unit, axis=1)
for
bk_Final_tax['Legal_Unit'] = np.where(bk_Final_tax.Legal_Lot.isna(),
                                      bk_Final_tax.Legal_Block, bk_Final_tax.Legal_Unit)
bk_Final_tax['Legal_Block'] = np.where(bk_Final_tax.Legal_Lot.isna(),
                                       np.nan, bk_Final_tax.Legal_Block)
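For reference, a small self-contained check of the np.where approach, with the frame built by hand from the table in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, 6, 9, 9, 2, 2, 3, 3, 4],
                   'B': [np.nan, np.nan, 3, 3, 8, 8, 4, np.nan, np.nan],
                   'C': np.nan})

df['C'] = np.where(df.B.isna(), df.A, df.C)
df['A'] = np.where(df.B.isna(), np.nan, df.A)
print(df)  # matches the expected output above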

Merging lines that have duplicates and summing the last column

I have this input text file
1;2;29.02.2017;10.00-11.00;5;
1;2;29.02.2017;10.00-11.00;3;
1;3;02.02.2017;09.00-10.00;4;
1;3;03.02.2017;12.00-13.00;2;
1;3;28.02.2017;08.00-09.00;6;
1;3;29.02.2017;10.00-11.00;3;
1;3;29.02.2017;10.00-11.00;2;
1;3;29.02.2017;11.00-12.00;2;
1;3;29.02.2017;12.00-13.00;3;
10;11;28.02.2017;08.00-09.00;6;
10;11;28.02.2017;08.00-09.00;1;
10;12;02.02.2017;09.00-10.00;8;
10;12;28.02.2017;08.00-09.00;2;
10;12;28.02.2017;08.00-09.00;1;
The values separated by ';' are as follows:
1 - id_1 (str), 2 - id_2 (str), 3 - date (str), 4 - time (str), 5 - area (int)
As output, I need a text file that contains only the lines from the input whose fields 1, 2, 3 and 4 are duplicated, with the area summed. Lines without duplicates should be dropped, e.g.
1;2;29.02.2017;10.00-11.00;8; (sum of 5 from line 1 and 3 from line 2)
1;3;29.02.2017;10.00-11.00;5;
10;11;28.02.2017;08.00-09.00;7;
10;12;28.02.2017;08.00-09.00;3;
What I have achieved so far is dropping the lines that have no duplicates, but I had to remove the area column first.
I used this:
seen = set()
for line1 in imp:
    line1_lower = line1.lower()

    if line1_lower in seen:
        print(line1)
    else:
        seen.add(line1_lower)
You can use read_csv first with the names parameter to create the column names (since the csv has no header):
import pandas as pd
from io import StringIO
temp=u"""1;2;29.02.2017;10.00-11.00;5;
1;2;29.02.2017;10.00-11.00;3;
1;3;02.02.2017;09.00-10.00;4;
1;3;03.02.2017;12.00-13.00;2;
1;3;28.02.2017;08.00-09.00;6;
1;3;29.02.2017;10.00-11.00;3;
1;3;29.02.2017;10.00-11.00;2;
1;3;29.02.2017;11.00-12.00;2;
1;3;29.02.2017;12.00-13.00;3;
10;11;28.02.2017;08.00-09.00;6;
10;11;28.02.2017;08.00-09.00;1;
10;12;02.02.2017;09.00-10.00;8;
10;12;28.02.2017;08.00-09.00;2;
10;12;28.02.2017;08.00-09.00;1;"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", names=['id_1','id_2','date','time','area','tmp'])
print (df)
id_1 id_2 date time area tmp
0 1 2 29.02.2017 10.00-11.00 5 NaN
1 1 2 29.02.2017 10.00-11.00 3 NaN
2 1 3 02.02.2017 09.00-10.00 4 NaN
3 1 3 03.02.2017 12.00-13.00 2 NaN
4 1 3 28.02.2017 08.00-09.00 6 NaN
5 1 3 29.02.2017 10.00-11.00 3 NaN
6 1 3 29.02.2017 10.00-11.00 2 NaN
7 1 3 29.02.2017 11.00-12.00 2 NaN
8 1 3 29.02.2017 12.00-13.00 3 NaN
9 10 11 28.02.2017 08.00-09.00 6 NaN
10 10 11 28.02.2017 08.00-09.00 1 NaN
11 10 12 02.02.2017 09.00-10.00 8 NaN
12 10 12 28.02.2017 08.00-09.00 2 NaN
13 10 12 28.02.2017 08.00-09.00 1 NaN
Then groupby and aggregate size and sum; finally, use boolean indexing to remove the non-duplicates, i.e. keep rows where size is greater than 1:
cols = ['id_1','id_2','date','time']
df = df.groupby(cols)['area'].agg(['size', 'sum'])
df = df[df['size'] > 1].drop('size',axis=1).reset_index()
print (df)
id_1 id_2 date time sum
0 1 2 29.02.2017 10.00-11.00 8
1 1 3 29.02.2017 10.00-11.00 5
2 10 11 28.02.2017 08.00-09.00 7
3 10 12 28.02.2017 08.00-09.00 3
Another solution is to select the duplicated rows first by boolean indexing with duplicated, and then aggregate the sum:
cols = ['id_1','id_2','date','time']
mask = df.duplicated(cols, keep=False)
df = df[mask].groupby(cols, as_index=False)['area'].sum()
print (df)
id_1 id_2 date time area
0 1 2 29.02.2017 10.00-11.00 8
1 1 3 29.02.2017 10.00-11.00 5
2 10 11 28.02.2017 08.00-09.00 7
3 10 12 28.02.2017 08.00-09.00 3
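If the aggregated rows then need to go back to a ';'-separated text file like the input, DataFrame.to_csv can be used (the filename is just an example; the trailing ';' of the input lines is not reproduced):
df.to_csv('output.txt', sep=';', index=False, header=False)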
