Increase the values in a column values based on values in other column in pandas - python-3.x

I have my source data in the form of csv file as below:
id,col1,col2
123,11|22|33||||||,val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7
I need to add a new column(fnlsrc) which will have the values based on values in Col2 and Col1, i.e if col1 has 9 values(separated with pipe) and col2 has 3 values(separated with pipe), then in fnlsrc column I have to load 9 values(separated with pipe) 3 set of col2(val1|val3|val2|val1|val3|val2|val1|val3|val2). Please refer the output below, which will help in understanding the requirement easily:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7
I have tried following code, but its adding only the one set:
zipped = zip(df['col1'], df['col2'])
for s,t in zipped:
count = int((s.count('|') + 1)/(t.count('|') + 1))
for val in range(count):
df['fnlsrc'] = t

As the new column is based on the other two, I would use panda's apply() function. I defined a function that calculates the new column value based on the other two columns, which is then applied to each row:
def new_value(x):
# Find out number of values in both columns
col1_numbers = x['col1'].count('|') + 1
col2_numbers = x['col2'].count('|') + 1
# Calculate how many times col2 should appear in the new column
repetition = int(col1_numbers/col2_numbers)
# Create list of strings containing the values of the new column
values = [x['col2']]*repetition
# Join the list of strings with pipes
return '|'.join(values)
# Apply the function on every row
df['fnlsrc'] = df.apply(lambda x:new_value(x), axis=1)
df
Output:
id col1 col2 fnlsrc
0 123 11|22|33|||||| val1|val3|val2 val1|val3|val2|val1|val3|val2|val1|val3|val2
1 456 99||77|||88|||||||||6| val4|val5|val6|val7 val4|val5|val6|val7|val4|val5|val6|val7|val4|v...
Full output in your input format:
id,col1,col2,fnlsrc
123,11|22|33||||||,val1|val3|val2,val1|val3|val2|val1|val3|val2|val1|val3|val2
456,99||77|||88|||||||||6|,val4|val5|val6|val7,val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7|val4|val5|val6|val7

Related

Find and Add Missing Column Values Based on Index Increment Python Pandas Dataframe

Good Afternoon!
I have a pandas dataframe with an index and a count.
dictionary = {1:5,2:10,4:3,5:2}
df = pd.DataFrame.from_dict(dictionary , orient = 'index' , columns = ['count'])
What I want to do is check from df.index.min() to df.index.max() that the index increment is 1. If a value is missing like in my case the 3 is missing then I want to add 3 to the index with a 0 in the count.
The output will look like the below df2 but done in a programmatic fashion so I can use it on a much bigger dataframe.
RESULTS EXAMPLE DF:
dictionary2 = {1:5,2:10,3:0,4:3,5:2}
df2 = pd.DataFrame.from_dict(dictionary2 , orient = 'index' , columns = ['count'])
Thank you much!!!
Ensure the index is sorted:
df = df.sort_index()
Create an array that starts from the minimum index to the maximum index
complete_array = np.arange(df.index.min(), df.index.max() + 1)
Reindex, fill the null value with 0, and optionally change the dtype to Pandas Int:
df.reindex(complete_array, fill_value=0).astype("Int16")
count
1 5
2 10
3 0
4 3
5 2

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer, currently, the size of my data frame is 1050 (since only one column), I want the new size to be 1050*2 with the second column being completely empty.
In pandas in DataFrame are always columns, so for new default column filled by missing values use length of columns:
s = pd.Series([2,3,4])
df = s.to_frame()
df[len(df.columns)] = np.nan
#what is same for one column df like
#df[1] = np.nan
print (df)
0 1
0 2 NaN
1 3 NaN
2 4 NaN

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around it. Does anyone know any
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
print(grp)
# Iterate through month
for mnth in month_dates:
# Make mask where month and group
mask = (df.index.month == mnth.month) & (df['group'] == grp)
for col in process_cols:
# Set values of divide_df
divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
df['month'] = df.index.month
print(df['month'])
df['year'] = df.index.year
print(df['year'])
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
print("###################", x.name, x['month'])
for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by
print(column)
mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
print(mean_col)
col_name = "norm" + str(column)
x[col_name] = x[column] / mean_col # norm
return x
normalize_cols = df.columns.tolist()
normalize_cols.remove('group')
#normalize_cols.remove('mode')
df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see according to df.apply() documentation that the first row always runs twice. I'm just not sure why this fails on the second time through.
Assuming that the requirement is to group the columns by mean and the month, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
col_name = "norm" + str(column)
x[col_name] = x[column] / mean_col # norm
return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is how the code would look like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
col_name = "norm" + str(column)
x[col_name] = x[column] / mean_col # norm
return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
#listname = i[4:] + '_norm'
df2 = pd.read_excel(i,header=0,index_col=None, skip_blank_lines=True)
df3 = df2.dropna(axis=0, how='any')
cols = []
for column in df3:
cols.append(column)
for x in cols:
listname = x + ' norm'
maxval = df3[x].max()
print(maxval)
mylist = []
for j in df3[x]:
findNL = (j/maxval)
mylist.append(findNL)
df3[listname] = mylist
saveloc = 'E:/test/'
filename = i[:-18] + '_Normalised.xlsx'
df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You dont need a for loop for this. You can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1':[1, 2, 3, 4],
'col2':[5, 6, 7, 8]})
print(df)
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
Now we can use DataFrame.apply and use add_suffix to give the new columns _norm suffix and after that concat the columns to one final dataframe
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't notable.
Thanks for your help #Erfan

Python pandas ranking dataframe columns

I am trying to use rank function on two columns in my dataframe.
Problem:
One of the column contains blank values which is not allowing me to do groupby before ranking.
ERROR: ValueError: Length mismatch: Expected axis has 1122 elements, new values have 1814 elements
df_source['col1'] = df_source['col1'].apply(lambda \
x:x.strip()).replace('',np.nan)
df_source['Rank'] = df_source.groupby(by=['col0','col1']) \
['col1'].transform(lambda x: x.rank(na_option='bottom'))
**Actual:**
col0 col1
98630 a
a
90211 a
31111 a
b
23323 c
**Expected**
col0 col1 Rank
98630 a 1
a 2
90211 a 1
31111 a 1
b 1
23323 c 1
This code gives the expected result. I have tried to avoid groupby function for columns with null values.
df['col0'] = df['col0'].replace('', np.nan)
df_int = df.loc[df['col0'].notnull(), 'col1'].unique()
df = df[~(df['col0'].isin(df_int) & df['col1'].isnull())]

Resources