Assigning values from one row to multiple rows - python-3.x

I am trying to assign the values from a single row in a DataFrame to multiple rows. I have a DF def_security where the first row looks like this (the column headers are AGG and SPY, and the row index is the date)
AGG SPY
2006-01-01 95 21
The rest of the DF all have zeros.
AGG SPY
2006-01-02 0 0
...........
I would like to assign the same values as the first row (the values are calculated, not assigned scalars) to the next 250 rows of def_security. The column headers are user input, so neither the number of columns nor the header names are pre-defined. However, every row has the same number of columns.
I am trying with the code
def_security.iloc[1:251] = def_security.iloc[0]
but it returns the error message "could not broadcast input array from shape(250) into shape (250,2)".
What is the easiest way to do this ?

You were nearly right :-)
Try this:
def_security.iloc[1:251] = def_security.iloc[0].values

Alternatively, if the two columns happen to be AGG and SPY, assigning each column separately should also work:
def_security.iloc[1:251, -2] = def_security.iloc[0].at['AGG']
def_security.iloc[1:251, -1] = def_security.iloc[0].at['SPY']
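A minimal runnable sketch of the `.values` fix, using a small made-up frame in place of def_security:

```python
import pandas as pd

# Made-up stand-in for def_security: first row computed, the rest zeros
df = pd.DataFrame(0, index=pd.date_range("2006-01-01", periods=5),
                  columns=["AGG", "SPY"])
df.iloc[0] = [95, 21]

# .values hands pandas a plain (2,) NumPy array, which broadcasts
# across the (3, 2) target slice and avoids the Series index
# alignment that caused the "could not broadcast" error
df.iloc[1:4] = df.iloc[0].values
print(df)
```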

Related

Iterate in column for specific value and insert 1 if found or 0 if not found in new column python

I have a DataFrame as shown in the attached image. My columns of interest are fgr and fgr1. As you can see, they both contain values corresponding to years.
I want to iterate over the two columns, and for each value present, put a 1 in the column for that year and 0 in all the others.
For example, in fgr the first value is 2028. So, the first row in column 2028 will have a value 1 and all other columns have value 0.
I tried using lookup but I did not succeed. So, any pointers will be really helpful.
Example dataframe
Data:
Data file in Excel
This will do the job. You could use for loops as well, but I think this approach will be faster.
df["Matched"] = df["fgr"].isin(df["fgr1"])*1
Basically you check whether values from one column appear in another column, which gives you True or False. You then multiply by 1 to get 1 and 0 instead of True and False.
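A runnable sketch of that idea, with made-up year values standing in for the question's data:

```python
import pandas as pd

# Made-up stand-in for the fgr / fgr1 columns
df = pd.DataFrame({"fgr": ["2028", "2030", "2031"],
                   "fgr1": ["2030", "2040", "2041"]})

# isin gives a True/False mask; multiplying by 1 converts it to 1/0
df["Matched"] = df["fgr"].isin(df["fgr1"]) * 1
print(df["Matched"].tolist())  # [0, 1, 0]
```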
From this answer
Not the most efficient, but it should work for your case (it can be time-consuming on a large dataset):
import numpy as np

s = df.reset_index().melt(['index', 'fgr', 'fgr1'])
s['value'] = s.variable.eq(s.fgr.str[:4]).astype(int)
s['value2'] = s.variable.eq(s.fgr1.str[:4]).astype(int)
s['final'] = np.where(s['value'] + s['value2'] > 0, 1, 0)
yourdf = s.pivot_table(index=['index', 'fgr', 'fgr1'], columns='variable',
                       values='final', aggfunc='first').reset_index(level=[1, 2])
yourdf

{Python} - [Pandas] - How to sum columns based on a "less than" condition in the column name

First, explaining the dataframe: the values in the columns '0-156', '156-234', '234-546' .... '>76830' are the percentage distribution for each range of distances in meters, totaling 100%.
The 'Cell Name' column identifies the data element for the other columns, and the 'Distance' column (in meters) is what triggers the desired sum.
I need to sum the values of those columns ('0-156', '156-234', '234-546' .... '>76830') whose range lies below the value of the 'Distance' column.
Below creation code for testing.
import pandas as pd
# initialize list of lists
data = [['Test1',0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
['Test2',0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
['Test3',0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Cell Name','0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','>76830','Distance'])
Example of what should be done:
The value of column 'Distance' = 158.412286772863, so we would have to sum the values of the columns '0-156' and '156-234', totaling 19.43116162%.
Thanks so much!
As I understand it, you want to sum up all the percentage values in a row, where the lower value of the column-description (in case of '0-156' it would be 0, in case of '156-234' it would be 156, and so on...) is smaller than the value in the distance column.
First I would suggest, that you transform your string-like column-names into values, as an example:
lowerlimit=df.columns[2]
>>'156-234'
Then read the string only up to the '-' and convert it to a number:
int(lowerlimit[:lowerlimit.find('-')])
>> 156
You can loop this through all your columns and make a new row for the lower limits.
For a bit more simplicity I left out the first column for your example, and added another first row with the lower limits of each column, that you could generate as described above. Then this code works:
data = [[0,156,234,546,1014,1950,3510,6630,14430,30030,53430,76830,1e-23],[0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
[0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
[0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','76830-','Distance'])
df['lastindex']=None
df['sum']=None
After creating basically your dataframe, I add two columns 'lastindex' and 'sum'.
Then I am searching for the last index in every row that has its lower limit below the distance given in that row (df.iloc[x,-3]); afterwards I'm summing up the respective columns in that row.
import numpy as np

for i in np.arange(1, len(df)):
    df.at[i, 'lastindex'] = np.where(df.iloc[0, :-3] < df.iloc[i, -3])[0][-1]
    df.at[i, 'sum'] = sum(df.iloc[i][0:df.at[i, 'lastindex'] + 1])
I hope, this is helpful. Best, lepakk
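As an alternative sketch (not part of the answer above): the lower bound can be parsed out of each column name and the sum done with a per-row mask. The frame below is a trimmed, made-up version of the question's data:

```python
import pandas as pd

# Trimmed, made-up version of the question's frame
df = pd.DataFrame(
    [["Test1", 5.0, 10.0, 20.0, 200.0],
     ["Test2", 1.0, 2.0, 3.0, 600.0]],
    columns=["Cell Name", "0-156", "156-234", "234-546", "Distance"],
)

# Lower bound of each range-style column, parsed from its name
range_cols = [c for c in df.columns if "-" in c]
lower = {c: int(c.split("-")[0]) for c in range_cols}

# Sum, per row, the columns whose lower bound lies below 'Distance'
df["sum"] = df.apply(
    lambda row: sum(row[c] for c in range_cols if lower[c] < row["Distance"]),
    axis=1,
)
print(df["sum"].tolist())  # [15.0, 6.0]
```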

How to populate one column in a dataframe from the truncated value of another column

I have column in a Pandas dataframe (final_combine_df) that is called GEOID. I will have a 15 character string number like this : '371899201001045'. I want to create a new column in my data frame called 'CB_GrpID' that is equal to just the first 12 characters of the GEOID values (ex: '371899201001'). I tried this, but it just returned the same GEOID value (non-truncated) in the new 'CB_GrpID':
final_combine_df['CB_GrpID'] = final_combine_df['GEOID'][:12]
What am I doing wrong here?
final_combine_df.iloc[0]['CB_GrpID']
>>371899201001045
pandas.Series.str
Working with text
The str accessor is what you're looking for. It gives access to the strings in each cell along with "vectorized" string methods.
final_combined_df['GEOID'].str[:12]
What you were doing:
final_combined_df['GEOID'][:12]
Was just getting the first 12 elements of the column.
Alternatively, use a lambda function to return the first 12 characters of each string. Note that Python indexing starts at 0 and the upper limit of a slice is exclusive, so [:12] keeps the characters at indices 0 through 11. Just FYI in case you were unaware.
df['new_var'] = df['old_var'].apply(lambda x: x[:12])
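Both approaches can be checked on a small made-up frame:

```python
import pandas as pd

# Made-up stand-in for final_combine_df
df = pd.DataFrame({"GEOID": ["371899201001045", "371899201001046"]})

# .str applies the slice to every cell rather than to the Series itself
df["CB_GrpID"] = df["GEOID"].str[:12]
print(df["CB_GrpID"].tolist())  # ['371899201001', '371899201001']
```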

Compare two columns and output new column based on order of reference column

I'm trying to compare two columns (lists) with the same IDs, just in a different order. I want to take the first column as the reference order, compare it to the second column, and then reorder the second column to match the first column's order in a new column (or list). From there I can pull the corresponding columns that match the order of the first column (price, demographic, etc.).
Input:
First column (reference column):
12321
12323
324214
32313452
1232132
fs2421
sfasrfas
asfasd
Second column (re-order necessary):
12321
sfasrfas
12323
324214
1232132
fs2421
asfasd
32313452
I have tried writing a for loop in Python with two separate lists of the column IDs, as well as INDEX/MATCH in Excel, but I can only seem to output 'matching' IDs.
Excel
=INDEX($A$2:$A$589,MATCH(C2,$A$2:$A$589,0),2)
Python
## set up empty lists and split the IDs from the second list by
## whether they appear in the first (reference) list
matched_IDs = []
unique_IDs = []
for Part_No in updated_2_list:
    if Part_No in updated_1_list:
        matched_IDs.append(Part_No)
    else:
        unique_IDs.append(Part_No)
print(matched_IDs)
len(matched_IDs)
I expect to match the order of first column in new column (or list).
Output:
Third column (new column after second column was re-ordered)
12321
12323
324214
32313452
1232132
fs2421
sfasrfas
asfasd
You mean like this:
=INDEX(C:C,MATCH(A1,C:C,0))
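In pandas, the same reordering can be sketched by reindexing the second column's rows to match the reference order; the 'ID' and 'price' column names below are made up for illustration:

```python
import pandas as pd

# Reference order from the first column
ref = ["12321", "12323", "324214", "32313452",
       "1232132", "fs2421", "sfasrfas", "asfasd"]

# Made-up second frame: the same IDs in a different order, plus a
# hypothetical 'price' column that should travel with them
second = pd.DataFrame({
    "ID": ["12321", "sfasrfas", "12323", "324214",
           "1232132", "fs2421", "asfasd", "32313452"],
    "price": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Reindex the second frame's rows to follow the reference order;
# the corresponding columns (price, demographic, ...) come along
reordered = second.set_index("ID").reindex(ref).reset_index()
print(reordered["ID"].tolist())
```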

Choose exact value in DataFrame

I'm looking through a UCI Adult dataframe (https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data). I want to output and count all the rows, where native country is 'Germany'. The following code:
df[df['native-country']=="Germany"]
shows me that all the rows are False. Is there another way to count these rows and/or print them out? Dummies might not be an option, since there are more than 20 different countries in the dataframe.
I think you have leading whitespace in the country field.
Try
df[df['native-country']==" Germany"]
Or
df[df['native-country'].str.contains("Germany")]
Your command df[df['native-country']=="Germany"] should already print only rows that match the condition. If you're seeing rows of False values, you might actually be executing df['native-country']=="Germany", which returns a boolean mask of True and False.
To count the occurrences of each unique value in the native-country column, try:
df['native-country'].value_counts()
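A small sketch of the whitespace issue with made-up data; the Adult data file pads each value with a leading space after the comma:

```python
import pandas as pd

# Made-up rows mimicking the Adult file's leading-space padding
df = pd.DataFrame({"native-country": [" Germany", " United-States", " Germany"]})

print((df["native-country"] == "Germany").sum())   # 0 -- no exact match
print((df["native-country"] == " Germany").sum())  # 2

# Stripping the whitespace once avoids having to remember the padding
df["native-country"] = df["native-country"].str.strip()
print(df["native-country"].value_counts()["Germany"])  # 2
```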
