How to get values of one column based on another column using specific match values - python-3.x

I have 6 columns: [Voltage, Bus, Load, load_Values, transmission, transmission_Values]. Each column whose name ends in Values contains the numerical value for its corresponding column. The CSV file looks like this:
Voltage     Bus  Load     load_Values  transmission     transmission_Values
Voltage(1)  2    load(1)  3            transmission(1)  2
Voltage(2)  2    load(2)  4            transmission(2)  3
Voltage(5)  3    load(3)  5            transmission(3)  5
I have to fetch the value of Bus based on transmission and load. For example:
To get the value of bus, first I need to fetch the value of transmission(2), which is 3. Based on this value, I need to get the value of load, which is load(3) = 5. Next, based on this value, I have to get the value of Voltage(5), which is 3.
I tried to get the value of a single column based on its corresponding column value:
total = df[df['load']=='load(1)']['load_Values']
next_total = df[df['transmission']==f'transmission({total})']['transmission_Values']
v_total = df[df['Voltage']=='Voltage(5)']['Voltage_Values']
How do I get all these values automatically? For example, if I have 1100 values in every column, how can I fetch the values for all 1100 rows?
This is how the dataset looks:
So, to get the value of VRES_LD, which is a new column: I have to look in the I__ND_LD column for the value I__ND_LD(1), whose corresponding value stored in I__ND_LD_Values is 10. Once I have the value 10, I have to look in the I__BS_ND column for I__BS_ND(10), whose value in I__BS_ND_Values is 5.0. Based on this value, I have to find the value of V_BS(5), which is 0.986009. This value should then be stored in the new column VRES_LD. Please let me know if you get it now.

I generalized your solution so you can work with as many values as you want.
I changed the name "Load_Value" to "load_value_name" to avoid confusion since there is a variable named "load_value" in lowercase.
You can start with as many values as you want; in our example we start with "1":
start_values = [1]
load_value_name = [f"I__ND_LD({n})" for n in start_values]
#Output: but you'll have more than one if needed
['I__ND_LD(1)']
Then we fetch all the values:
load_values = df[df['I__ND_LD'].isin(load_value_name)]['I__ND_LD_Values'].values.astype(int)
#output: again, more if needed
array([10])
Let's get the bus names:
bus_names = [f"I__BS_ND({n})" for n in load_values]
bus_values = df[df['I__BS_ND'].isin(bus_names)]['I__BS_ND_Values'].values.astype(int)
#output
array([5])
And finally voltage:
voltage_names = [f"V_BS({n})" for n in bus_values]
voltage_values = df[df['V_BS'].isin(voltage_names)]['V_BS_Values'].values
#output
array([0.98974069])
Notes:
Instead of rounding I downcast to int; and the .isin() method looks for all occurrences, so you can fetch all of the values at once.
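If you have 1100 values in every column, you can wrap the same three steps in a helper and feed it the whole range of starting indices at once. A minimal sketch, assuming the column names from your screenshot (note that .isin() returns matches in row order, so it does not preserve the pairing between each start value and its final result):

import pandas as pd

def chain_lookup(df: pd.DataFrame, start_values) -> pd.Series:
    # Step 1: I__ND_LD(n) -> I__ND_LD_Values
    names = [f"I__ND_LD({n})" for n in start_values]
    loads = df.loc[df['I__ND_LD'].isin(names), 'I__ND_LD_Values'].astype(int)
    # Step 2: I__BS_ND(load) -> I__BS_ND_Values
    names = [f"I__BS_ND({n})" for n in loads]
    buses = df.loc[df['I__BS_ND'].isin(names), 'I__BS_ND_Values'].astype(int)
    # Step 3: V_BS(bus) -> V_BS_Values
    names = [f"V_BS({n})" for n in buses]
    return df.loc[df['V_BS'].isin(names), 'V_BS_Values']

# all 1100 starting indices at once:
# vres = chain_lookup(df, range(1, 1101))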

If I understand correctly, you should be able to create key/value tables and use merge. The step to voltage is a little unclear, but the basic idea below should work, I think:
df = pd.DataFrame({'voltage': {0: 'Voltage(1)', 1: 'Voltage(2)', 2: 'Voltage(5)'},
                   'bus': {0: 2, 1: 2, 2: 3},
                   'load': {0: 'load(1)', 1: 'load(2)', 2: 'load(3)'},
                   'load_values': {0: 3, 1: 4, 2: 5},
                   'transmission': {0: 'transmission(1)',
                                    1: 'transmission(2)',
                                    2: 'transmission(3)'},
                   'transmission_values': {0: 2, 1: 3, 2: 5}})
load = df[['load', 'load_values']].copy()
trans = df[['transmission','transmission_values']].copy()
load['load'] = load['load'].str.extract(r'(\d+)').astype(int)
trans['transmission'] = trans['transmission'].str.extract(r'(\d+)').astype(int)
(df[['bus']].merge(trans, how='left', left_on='bus', right_on='transmission')
.merge(load, how='left', left_on='transmission_values', right_on='load'))
resulting in:
   bus  transmission  transmission_values  load  load_values
0    2             2                    3   3.0          5.0
1    2             2                    3   3.0          5.0
2    3             3                    5   NaN          NaN
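The step to voltage can be handled with one more key/value table, continuing from the frames above; a sketch assuming (as in the question's walk-through) that load_values maps onto the number inside Voltage(n) and that bus holds the value to return:

volt = df[['voltage', 'bus']].copy()
volt['voltage'] = volt['voltage'].str.extract(r'(\d+)').astype(int)

step2 = (df[['bus']]
         .merge(trans, how='left', left_on='bus', right_on='transmission')
         .merge(load, how='left', left_on='transmission_values', right_on='load'))
# load_values comes back as float because of the unmatched row; use a
# nullable integer so it can join against the extracted voltage number
step2['load_values'] = step2['load_values'].astype('Int64')
result = step2.merge(volt, how='left', left_on='load_values', right_on='voltage',
                     suffixes=('', '_voltage'))

For the sample data, bus_voltage in the first two rows is 3, matching Voltage(5) -> 3.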

I think you need to do 3 things.
1. You need to put a number inside a string. You do it like this:
n_cookies = 3
f"I want {n_cookies} cookies"
#Output
I want 3 cookies
2. Let's say the values you need to fetch are:
transmission_values = [2,5,20]
You then need to fetch those load values:
load_values_to_fetch = [f"transmission({n})" for n in transmission_values]
#output
['transmission(2)', 'transmission(5)', 'transmission(20)']
3. Get all the voltage values from the df. Use the .isin() method:
voltage_value = df[df['Voltage'].isin(load_values_to_fetch)]['Voltage_Values'].values
I hope I understood the problem correctly. Try it and let us know, because I can't test the code without data.
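Putting the three pieces together on the sample table from the question, a minimal end-to-end sketch (assuming, as in the question's walk-through, that the value belonging to Voltage(n) sits in the Bus column):

import pandas as pd

df = pd.DataFrame({
    'Voltage': ['Voltage(1)', 'Voltage(2)', 'Voltage(5)'],
    'Bus': [2, 2, 3],
    'Load': ['load(1)', 'load(2)', 'load(3)'],
    'load_Values': [3, 4, 5],
    'transmission': ['transmission(1)', 'transmission(2)', 'transmission(3)'],
    'transmission_Values': [2, 3, 5],
})

# transmission(2) -> 3
t_vals = df[df['transmission'].isin(['transmission(2)'])]['transmission_Values'].values
# load(3) -> 5
l_vals = df[df['Load'].isin([f"load({n})" for n in t_vals])]['load_Values'].values
# Voltage(5) -> Bus value 3
bus = df[df['Voltage'].isin([f"Voltage({n})" for n in l_vals])]['Bus'].values
print(bus)  # [3]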

Related

Combine multiple rows based on Id and other column using pandas python [duplicate]

I've spent hours browsing everywhere now trying to create a multi-index from a dataframe in pandas. This is the dataframe I have (posting an Excel sheet mockup; I do have this in a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access it via new_df.index, where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
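Since the original data is only visible as a screenshot, here is a minimal sketch with made-up user_id/account_num/dates/sales values showing the MultiIndex that groupby produces:

import pandas as pd

df = pd.DataFrame({
    'user_id':     [1, 1, 1, 2],
    'account_num': [10, 10, 11, 20],
    'dates':       ['2020-01', '2020-01', '2020-02', '2020-02'],
    'sales':       [100, 50, 75, 30],
})

out = df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
print(type(out.index))   # <class 'pandas.core.indexes.multi.MultiIndex'>
print(out.index.names)   # ['user_id', 'account_num', 'dates']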
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
Here, type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
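To use it, assign the new index back onto the frame; a short sketch:

currentDataFrame.index = midx
# optionally drop the now-redundant columns:
# currentDataFrame = currentDataFrame.drop(columns=['user_id', 'account_num'])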
There are two ways to do it; albeit not exactly like you have shown, they work.
Say you have the following df:
     A      B  C    D
0  nil    one  1  NaN
1  bar    one  5  5.0
2  foo    two  3  8.0
3  bar  three  2  1.0
4  foo    two  4  2.0
5  bar    two  6  NaN
1. Workaround 1:
df.set_index('A', append=True, drop=False).reorder_levels(order=[1,0]).sort_index()
This will return:
         A      B  C    D
A
bar 1  bar    one  5  5.0
    3  bar  three  2  1.0
    5  bar    two  6  NaN
foo 2  foo    two  3  8.0
    4  foo    two  4  2.0
nil 0  nil    one  1  NaN
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
           C    D
A   B
bar one    5  5.0
    three  2  1.0
    two    6  NaN
foo two    3  8.0
    two    4  2.0
nil one    1  NaN
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.

Use of Replace() in Python Dataframe for Multiple Columns but same value

Query: I need to replace one old value with one new value in a bunch of columns (not all columns) of a dataframe. The question is about the syntax to be used: is there a shorter syntax?
Sample Dataframe:
df = pd.DataFrame({'A': [0,1,2,3,4],
                   'B': [5,6,7,0,9],
                   'C': [2,0,9,3,0],
                   'D': [1,3,0,5,2]})
I need every 0 to be replaced with 10 in the above df, but only in columns A and C (not in B or D).
Code that I use to do this:
Method 1: Two separate commands.
df['A'].replace({0:10},inplace=True)
df['C'].replace({0:10},inplace=True)
Method 2: One command using dictionary in dictionary
df.replace({'A': {0:10}, 'C': {0:10}},inplace=True)
Method 3: Keeping new value out of dictionary
df.replace({'A':0,'C':0},10,inplace=True)
Expected Outcome:
    A  B   C  D
0  10  5   2  1
1   1  6  10  3
2   2  7   9  0
3   3  0   3  5
4   4  9  10  2
I am able to get the expected outcome using all three methods. But can I give a list of columns and enter the old and new values for replacement only once?
Something like:
df.replace({['col_ref'...]: {'old':'new'}})
#OR
df['col_ref'...].replace()
In my scenario, 26 columns out of 52 need replacing, and the value is to be replaced through a regex command. I can store the regex command in a variable and use method 2, but that still requires entering the variable name 26 times. Is there a shorter way where I enter these 26 columns and the regex replacement {'r':'r2'} only once?
I was looking at how to do this quicker myself this week and found this setup to use instead of a for loop:
col_list = ['A', 'B']
df[col_list] = df[col_list].replace(0, 10)
If you are using regex for a string:
col_list = ['A', 'B']
df[col_list] = df[col_list].replace(r'[\$,]', '', regex=True)
I tried this.
for col in [list of columns]:
    df.replace({col: {'r': 'r2'}}, regex=True, inplace=True)
This is the shortest way I could think of to write minimum code characters.
However, if there is a faster way, other answers are welcome.
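Building on method 2 above, the nested dictionary can also be generated with a comprehension, so the mapping itself is typed only once however many columns there are; a minimal sketch (col_list and {'r': 'r2'} stand in for your 26 columns and your regex mapping):

col_list = ['A', 'C']  # your 26 column names here
df.replace({col: {'r': 'r2'} for col in col_list}, regex=True, inplace=True)

# or select the columns once and assign the result back:
df[col_list] = df[col_list].replace({'r': 'r2'}, regex=True)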

How to use dataframe column in for loop

I am trying to implement a formula to create a new column in a DataFrame from an existing column, where the new value is a summation from 0 up to a number present in some other column.
I was trying something like this:
dataset['B']=sum([1/i for i in range(dataset['A'])])
I know something like this would work
dataset['B']=sum([1/i for i in range(10)])
but I want to make this 10 dynamic based on some different column.
I keep on getting this error.
TypeError: 'Series' object cannot be interpreted as an integer
First of all, I should admit that I could not completely understand your question. However, what I understood is that you want to iterate over the rows of a DataFrame and make a new column by doing some operation on each value.
If that is so, then I would recommend the following link.
Regarding TypeError: 'Series' object cannot be interpreted as an integer:
range() takes an integer as input, i.e. [i for i in range(10)] gives you [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. Here, however, you are passing the whole Series dataset['A'] rather than a single integer, which results in the error you are having. Moreover, if you notice, the first value range() yields is a zero, so 1/i would raise a different error (division by zero). As a result, you might have to rewrite the code as [1/i for i in range(1, row_value_of_dataset['A'])].
It would be highly appreciated if you could give an example of what your DataFrame might look like and what your desired output is. Then perhaps it is easier to post a solution.
BTW forgot to post what I understood from your question:
#assume the data:
>>> import pandas as pd
>>> data = pd.DataFrame({'A': (1, 2, 3, 4)})
#the data
>>> data
   A
0  1
1  2
2  3
3  4
#doing operation on each of the rows
>>> data['B'] = data.apply(lambda row: sum([1/i for i in range(1, row.A)]), axis=1)
# Column B is the newly added data
>>> data
   A         B
0  1  0.000000
1  2  1.000000
2  3  1.500000
3  4  1.833333
Perhaps explicitly use cumsum, or even apply?
Anyway, you are trying to move an array/list directly into a dataframe, and pandas seems to view it as a dictionary. Try something like this; I've not tested it:
import numpy as np
array_x = [(x, 1/x) for x in dataset.A.tolist()]  # or `dataset['A'].tolist()`
df = pd.DataFrame(data=np.asarray(array_x))
df.columns = ['A', 'B']
Here the idea is to break the Series apart into a list and feed the list into a dataframe. This can be done without going Series -> list -> dataframe, which is not very efficient.
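The cumsum suggestion above can be spelled out as follows: precompute the running sum of 1/i once, then look it up per row. A sketch assuming integer values in A, reproducing the apply example's range(1, row.A):

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4]})

# partial[k] = sum of 1/i for i = 1..k, with partial[0] = 0
max_a = int(data['A'].max())
partial = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, max_a))))

# range(1, row.A) sums 1/i for i = 1..A-1, which is partial[A-1]
data['B'] = partial[data['A'].to_numpy() - 1]
print(data)
#    A         B
# 0  1  0.000000
# 1  2  1.000000
# 2  3  1.500000
# 3  4  1.833333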

Python - How to dynamically exclude a column name from a list of columns of a Panda Dataframe

So far I am able to get the list of all column names present in the dataframe, or to get specific column names based on their datatype, starting letters, etc.
Now my requirement is to get the whole list of column names, or a sublist, and to exclude one column from it (i.e. the target variable / label column; this is part of machine learning, so I am using the terms used in machine learning).
Please note I am not talking about the data present in those columns. I am just taking the column names and want to exclude a particular column by its name.
Please see the below example for better understanding:
# Get all the column names from a Dataframe
df.columns
Index(['transactionID', 'accountID', 'transactionAmountUSD',
'transactionAmount', 'transactionCurrencyCode',
'accountAge', 'validationid', 'LABEL'],
dtype='object')
# Get only the Numeric Variables (Columns with numeric values in it)
df._get_numeric_data().columns
Index(['transactionAmountUSD', 'transactionAmount', 'accountAge', 'LABEL'],
dtype='object')
Now, in order to get the remaining column names, I am subtracting both of the above commands:
string_cols = list(set(list(df.columns))-set(df._get_numeric_data().columns))
Ok everything goes well until I hit this.
I have found out that the LABEL column, though it has numeric values, should not be present in the list of numeric variables; it should be excluded.
(i.e.) I want to exclude a particular column by its name (not by its index in the list, but explicitly by name).
I tried similar statements like the following ones, but in vain. Any inputs on this will be helpful.
set(df._get_numeric_data().columns-set(df.LABEL)
set(df._get_numeric_data().columns-set(df.LABEL.column)
set(df._get_numeric_data().columns-set(df['LABEL'])
I am sure I am missing a very basic thing but not able to figure it out.
First of all, you can exclude all numeric columns much more simply with
df.select_dtypes(exclude=[np.number])
  transactionID accountID transactionCurrencyCode validationid
0             a         a                       a            a
1             a         a                       a            a
2             a         a                       a            a
3             a         a                       a            a
4             a         a                       a            a
Second of all, there are many ways to drop a column. See this post.
df._get_numeric_data().drop('LABEL', axis=1)
   transactionAmountUSD  transactionAmount  accountAge
0                     1                  1           1
1                     1                  1           1
2                     1                  1           1
3                     1                  1           1
4                     1                  1           1
If you really wanted the columns, use pd.Index.difference
df._get_numeric_data().columns.difference(['LABEL'])
Index(['accountAge', 'transactionAmount', 'transactionAmountUSD'], dtype='object')
Setup
df = pd.DataFrame(
    [['a', 'a', 1, 1, 'a', 1, 'a', 1]] * 5,
    columns=[
        'transactionID', 'accountID', 'transactionAmountUSD',
        'transactionAmount', 'transactionCurrencyCode',
        'accountAge', 'validationid', 'LABEL']
)
Pandas' index supports set operations, so to exclude one column from column index you can just write something like
import pandas as pd
df = pd.DataFrame(columns=list('abcdef'))
print(df.columns.difference({'b'}))
which will return to you
Index(['a', 'c', 'd', 'e', 'f'], dtype='object')
I hope this is what you want :)
Considering the LABEL column as your output and the other features as your input, you can try this:
feature_names = [x for x in df._get_numeric_data().columns if x not in ['LABEL']]
input = df[feature_names]
output = df['LABEL']
Hope this helps.
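Combining the answers above for the machine-learning split described in the question, a short sketch (LABEL is the label column, as in the question):

import numpy as np

# numeric feature columns, minus the label
feature_cols = df.select_dtypes(include=[np.number]).columns.difference(['LABEL'])
X = df[feature_cols]   # features
y = df['LABEL']        # target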

Why did 'reset_index(drop=True)' function unwantedly remove column?

I have a Pandas dataframe named data_match. It contains columns '_worker_id', '_unit_id', and 'caption'. (Please see attached screenshot for some of the rows in this dataframe)
Let's say the index column is not in ascending order (I want the index to be 0, 1, 2, 3, 4...n) and I want it to be in ascending order. So I ran the following function attempting to reset the index column:
data_match=data_match.reset_index(drop=True)
I was able to get the function to return the correct output on my computer using Python 3.6. However, when my coworker ran that function on his computer, also using Python 3.6, the '_worker_id' column got removed.
Is this due to the (drop=True) clause next to reset_index? I don't know why it worked on my computer and not on my coworker's. Can anybody advise?
As the saying goes, "What happens in your interpreter stays in your
interpreter". It's impossible to explain the discrepancy without seeing the
full history of commands entered into both Python interactive sessions.
However, it is possible to venture a guess:
df.reset_index(drop=True)
drops the current index of the DataFrame and replaces it with an index of
increasing integers. It never drops columns.
So, in your interactive session, _worker_id was a column. In your co-worker's
interactive session, _worker_id must have been an index level.
The visual difference can be somewhat subtle. For example, below, df has a
_worker_id column while df2 has a _worker_id index level:
In [190]: df = pd.DataFrame({'foo':[1,2,3], '_worker_id':list('ABC')}); df
Out[190]:
  _worker_id  foo
0          A    1
1          B    2
2          C    3
In [191]: df2 = df.set_index('_worker_id', append=True); df2
Out[191]:
              foo
  _worker_id
0 A             1
1 B             2
2 C             3
Notice that the name _worker_id appears one line below foo when it is an
index level, and on the same line as foo when it is a column. That is the only
visual clue you get when looking at the str or repr of a DataFrame.
So to repeat: when _worker_id is a column, the column is unaffected by
df.reset_index(drop=True):
In [194]: df.reset_index(drop=True)
Out[194]:
  _worker_id  foo
0          A    1
1          B    2
2          C    3
But _worker_id is dropped when it is part of the index:
In [195]: df2.reset_index(drop=True)
Out[195]:
   foo
0    1
1    2
2    3
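If, like the co-worker, you have _worker_id as an index level and want to keep it, move it back to a column instead of dropping it; a small sketch using df2 from above:

df2.reset_index(level='_worker_id')   # only the _worker_id level becomes a column again
# or move every index level back into columns (nothing is dropped):
df2.reset_index()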
