I have a column with values like 9(05), X(05), X(15). I want to split the 9, X, X part into one column and the data in () into another column. How can I do that? - python-3.x

I have a column with values like 9(05), X(05), X(15). I want to split the 9, X, X part into one column and the data in () into another column. How can I do that?
The input column is:
9(05)
x(05)
x(15)
x(15)
s9(07)

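Use Series.str.extract with two capture groups, one for the text before the parentheses and one for the text inside them:
pat = r'(.*?)\((.*?)\)'
df[['a','b']] = df['col'].str.extract(pat, expand=True)
print(df)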
col a b
0 9(05) 9 05
1 x(05) x 05
2 x(15) x 15
3 x(15) x 15
4 s9(07) s9 07
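The answer above assumes df already exists. A self-contained sketch of the same idea follows; the column names type and length and the stricter pattern are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"col": ["9(05)", "x(05)", "x(15)", "x(15)", "s9(07)"]})

# Named groups become the column names of the extracted frame.
# "type" and "length" are hypothetical names chosen for this sketch.
pat = r"^(?P<type>[^()]+)\((?P<length>\d+)\)$"
parts = df["col"].str.extract(pat)

# Attach the two new columns to the original frame
df = df.join(parts)
print(df)
```

The extracted values stay strings, so a leading zero such as 05 is preserved; convert with astype(int) only if numeric values are wanted.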

Related

Create Multiple Dataframes using Loop & function

I have a DataFrame with over 1M rows, similar to this:
ID Date Amount
x May 1 10
y May 2 20
z May 4 30
x May 1 40
y May 1 50
z May 2 60
x May 1 70
y May 5 80
a May 6 90
b May 8 100
x May 10 110
I have to sort the data by date and then create new dataframes depending on how many times each ID has a value in the Amount column. So if x has made a purchase 3 times, I need it in 3 different dataframes. The first_purchase dataframe would contain every ID that has purchased at least once, irrespective of date or amount.
If an ID purchases 3 times, I need that ID to appear in the first purchase dataframe, then the second, and then the third, with Date and Amount.
Doing it manually is easy:
df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]
The second dataframe would be created with:
after_1stpurchase = after_1stpurchase.sort_values('Date')
second_purchase = after_1stpurchase.drop_duplicates('ID')
after_2ndpurchase = after_1stpurchase[~after_1stpurchase.index.isin(second_purchase.index)]
How do I write a loop that produces each of these dataframes?
IIUC, this achieves what you want.
import pandas as pd
import numpy as np
# source data for the dataframe
data = {
    "ID": ["x","y","z","x","y","z","x","y","a","b","x"],
    "Date": ["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
    "Amount": [10,20,30,40,50,60,70,80,90,100,110]
}
df = pd.DataFrame(data)
# convert the Date column to datetime and still maintain the format like "May 01"
df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
# sort the values on ID and Date
df.sort_values(by=['ID', 'Date'], inplace=True)
df.reset_index(inplace=True, drop=True)
print(df)
Original Dataframe:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 40 May 01 x
4 70 May 01 x
5 110 May 10 x
6 50 May 01 y
7 20 May 02 y
8 80 May 05 y
9 60 May 02 z
10 30 May 04 z
# create a list of unique ids
list_id = sorted(set(df['ID']))
# number of transactions to separate out per ID.
# For example, to record the first 3 entries for
# each ID, iter would be 3. This creates three new
# dataframes, one per transaction number.
iter = 3
# one list of single-row frames per transaction number
parts = [[] for _ in range(iter)]
for val in list_id:
    tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    # consider only the top iter (=3) rows to be distributed
    counter = min(tmp_df.shape[0], iter)
    for idx in range(counter):
        # DataFrame.append was removed in pandas 2.0,
        # so collect the rows and concatenate them afterwards
        parts[idx].append(tmp_df.iloc[[idx]])
# build one dataframe per transaction number (empty if no ID got that far)
df_list = [pd.concat(p, ignore_index=True) if p else pd.DataFrame() for p in parts]
for out_df in df_list:
    print(out_df)
Transaction #1:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 50 May 01 y
4 60 May 02 z
Transaction #2:
Amount Date ID
0 40 May 01 x
1 20 May 02 y
2 30 May 04 z
Transaction #3:
Amount Date ID
0 70 May 01 x
1 80 May 05 y
Note that in your data there are four transactions for 'x'. If you also want to track the fourth transaction, change the value of 'iter' to 4 and you will get a fourth dataframe with the following value:
Amount Date ID
0 110 May 10 x
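On a frame with over 1M rows, looping over every ID may be slow. A possible alternative, sketched here under assumptions, numbers each ID's purchases with groupby(...).cumcount() and then splits on that number. The year 2020 is an arbitrary assumption added only so the dates parse, and purchase_no and purchases are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["x", "y", "z", "x", "y", "z", "x", "y", "a", "b", "x"],
    "Date": ["May 01", "May 02", "May 04", "May 01", "May 01", "May 02",
             "May 01", "May 05", "May 06", "May 08", "May 10"],
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
})

# Assumption: the data has no year, so an arbitrary one is prepended for parsing
df["Date"] = pd.to_datetime("2020 " + df["Date"], format="%Y %b %d")

# Stable sort keeps the original order among purchases made on the same date
df = df.sort_values("Date", kind="mergesort")

# 1 for each ID's first purchase, 2 for its second, and so on
df["purchase_no"] = df.groupby("ID").cumcount() + 1

# One dataframe per purchase number, keyed by that number
purchases = {
    n: g.drop(columns="purchase_no").reset_index(drop=True)
    for n, g in df.groupby("purchase_no")
}

print(purchases[1])  # every ID's first purchase
print(purchases[2])  # second purchases only
```

The dictionary grows to as many entries as the most frequent ID has purchases, so no iteration count has to be chosen in advance.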

Use Switch/Case Statement to build DF2, by Iterating Over Rows in DF1

I've loaded data from a tab-delimited file into a DF. The tab data is a form filled out with a template.
A critical concept is that a variable number of rows makes up one entry in the form. In DF1 below, every time the index is "A", a new record is starting. So the code will need to iterate through the rows to rebuild each record in DF2. Each record will be represented as one row in DF2.
Since each "A" row in DF1 starts a new form entry (and a corresponding row in DF2), we can see that DF1 below contains just two entries in my example, so there will be just two rows in DF2. Also important: there is a different number of pieces of data (columns) in each row. Z has 2 (then NAs), A has 3, B has 4.
All of this needs to be mapped to DF2 depending on the index letters Z, A, B (note there are more index letters, but this is simplified for this example).
DF1
- A B C D
Z xyz 5 NA NA
A COA aa bb NA
B RE 01 02 03
B DE 04 05 06
A COB dd ee NA
B RE 01 02 03
B DE 04 05 06
In the past I've done this type of thing in VBA and would have used a CASE statement to transform the data. I've found a good start using dictionaries in this thread:
Replacements for switch statement in Python?
One code example in the above thread suggests using a dictionary as a case statement:
return {
    'a': 1,
    'b': 2,
}[x]
This seems like it would work, although I'm not certain how to apply it in practice. In addition, for each of A, B, etc. above, I need to carry out multiple instructions, depending on the index letter. For the most part, the instructions say where to map the data in DF2. For example, in my case:
Index A:
    Map column A to DF2.iloc[1]['B']
    Map column B to DF2.iloc[1]['C']
    Map column C to DF2.iloc[1]['D']
Index B:
    Would have four instructions, similar to above.
DF2 would end up looking like so
- A B C D E F G H I J K L
1 xyz COA aa bb RE 01 02 03 DE 04 05 06
2 xyz COB dd ee RE 01 02 03 DE 04 05 06
So for each row in DF1, a different number of instructions is being performed depending on the "index letter." All instructions are telling the code where to put the data in DF2. The mapping instruction for each different index letter will always be the same for the columns, only the row will be changing (some type of counter as you move from one record group to the next in DF2).
How can I handle the different number of instructions for each type of index letter in a switch/case type format?
Thank you
I think you can use:
#filter only 2,3 index rows
df1 = df[df.index.isin([2,3])].copy()
#create new column for same value if 2 in index
df1['new'] = np.where(df1.index == 2, 'Z', df1.A)
#create groups by compare 2
df1['g'] = (df1.index == 2).cumsum()
#convert columns to index and reshape, then change order
df1 = (df1.set_index(['g','new']).unstack()
          .swaplevel(0,1, axis=1)
          .sort_index(axis=1, ascending=[False, True]))
#default columns names
df1.columns = range(len(df1.columns))
print (df1)
0 1 2 3 4 5 6 7 8 9 10 11
g
1 ABC aa bb cc R 01 02 NaN D NaN 03 04
2 DEF dd ee ff R 01 02 NaN D NaN 03 04
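The dictionary idea from the question can also be applied directly to the DF1 example shown there. The following is only a sketch under assumptions: the rows list stands in for DF1, WIDTH is a hypothetical dictionary giving how many leading values each index letter contributes (1 for Z, because only xyz appears in the expected DF2), and the Z row is assumed to apply to every record that follows it.

```python
import pandas as pd

# Stand-in for DF1: (index letter, values in that row), NAs already dropped
rows = [
    ("Z", ["xyz", "5"]),
    ("A", ["COA", "aa", "bb"]),
    ("B", ["RE", "01", "02", "03"]),
    ("B", ["DE", "04", "05", "06"]),
    ("A", ["COB", "dd", "ee"]),
    ("B", ["RE", "01", "02", "03"]),
    ("B", ["DE", "04", "05", "06"]),
]

# Dictionary used in place of a switch/case:
# how many leading values each index letter maps into DF2 (assumed)
WIDTH = {"Z": 1, "A": 3, "B": 4}

header = []    # values from the Z row, shared by every record
records = []   # one list of values per DF2 row

for letter, values in rows:
    picked = values[:WIDTH[letter]]
    if letter == "Z":
        header = picked
    elif letter == "A":
        # an A row starts a new record, prefixed with the shared Z values
        records.append(header + picked)
    else:
        # any other letter extends the record currently being built
        records[-1].extend(picked)

df2 = pd.DataFrame(records, columns=list("ABCDEFGHIJKL"),
                   index=range(1, len(records) + 1))
print(df2)
```

Supporting another index letter then only requires a new entry in WIDTH, provided its values are simply appended to the current record.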

Dataframe: Computed row based on cell above and cell on the left

I have a dataframe with a bunch of integer values. I then compute the column totals and append it as a new row to the dataframe. So far so good.
Now I want to append another computed row where the value of each cell is the sum of cell above and the cell on the left. You can see what I mean below:
----------------------------------------------------------------
|250000 |0 |145000 |145000 |220000 |165000 |145000 |145000 |
----------------------------------------------------------------
|250000 |250000 |395000 |540000 |760000 |925000 |1070000|1215000 |
----------------------------------------------------------------
How can this be done?
I think you need Series.cumsum, applied to the last row (the total row) selected with DataFrame.iloc:
import pandas as pd
df = pd.DataFrame({
    'B':[4,5,4],
    'C':[7,8,9],
    'D':[1,3,5],
    'E':[5,3,6],
})
df.loc['sum'] = df.sum()
df.loc['cumsum'] = df.iloc[-1].cumsum()
# if only the cumsum row is needed
# df.loc['cumsum'] = df.sum().cumsum()
print(df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
sum 13 24 9 14
cumsum 13 37 46 60
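As a quick check against the numbers in the question, the following sketch applies cumsum to the totals row shown there; the names totals and running are illustrative.

```python
import pandas as pd

# Column totals copied from the first row of the table in the question
totals = pd.Series([250000, 0, 145000, 145000, 220000, 165000, 145000, 145000])

# Each value is the total above it plus the running value to its left
running = totals.cumsum()
print(running.tolist())
```

The result reproduces the second row of the table in the question, which suggests that "cell above plus cell on the left" is a cumulative sum of the totals row.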

How do I copy to a range, rather than a list, of columns?

I am looking to append several columns to a dataframe.
Let's say I start with this:
import pandas as pd
dfX = pd.DataFrame({'A': [1,2,3,4],'B': [5,6,7,8],'C': [9,10,11,12]})
dfY = pd.DataFrame({'D': [13,14,15,16],'E': [17,18,19,20],'F': [21,22,23,24]})
I am able to append the dfY columns to dfX by defining the new columns in list form:
dfX[[3,4]] = dfY.iloc[:,1:3].copy()
...but I would rather do so this way:
dfX.iloc[:,3:4] = dfY.iloc[:,1:3].copy()
The former works! The latter executes, returns no errors, but does not alter dfX.
Are you looking for this?
dfX = pd.concat([dfX, dfY], axis = 1)
It returns
A B C D E F
0 1 5 9 13 17 21
1 2 6 10 14 18 22
2 3 7 11 15 19 23
3 4 8 12 16 20 24
You can append several dataframes this way, e.g. pd.concat([dfX, dfY, dfZ], axis = 1)
If you need to append only columns D and E from dfY to dfX, use
pd.concat([dfX, dfY[['D', 'E']]], axis = 1)
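To stay with positional ranges, as the question asks, one option is to slice dfY by position and concatenate the result. The sketch below also shows that iloc[:, 3:4] on dfX selects zero columns, since position 3 does not exist yet; iloc can only address existing positions, which is likely why that assignment has nothing to write to. The name empty_shape is illustrative.

```python
import pandas as pd

dfX = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
dfY = pd.DataFrame({'D': [13, 14, 15, 16], 'E': [17, 18, 19, 20], 'F': [21, 22, 23, 24]})

# dfX has only positions 0..2, so this slice selects no columns at all
empty_shape = dfX.iloc[:, 3:4].shape
print(empty_shape)

# Select the source columns by position range, then append them by concatenation
dfX = pd.concat([dfX, dfY.iloc[:, 1:3]], axis=1)
print(dfX)
```

Unlike dfX[[3,4]] = ..., this keeps the original column names E and F from dfY.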

Index and Match Multiple Criteria in Excel

I have two tables of data with the same columns. The first is populated with experimental data:
Treatment Species TimeofDay Temp Light X
M B AM 25 25 2
M B PM 26 50 3
M B PM 27 150 4
M C AM 25 25 5
M C AM 26 150 6
M C PM 27 500 7
M C PM 28 800 9
A B AM 25 25 2
A B PM 26 50 3
A B PM 27 150 4
A C AM 25 25 5
A C AM 26 150 6
A C PM 27 500 7
A C PM 28 800 9
In the second table I want to fill in the "X" column with values from the first table so that in the given row A) all of the categorical values match exactly those from table one AND B) Temp and Light values are closest (but won't be exact) to matching values in table 1.
For example, if in Table 2 I have:
Treatment Species TimeofDay Temp Light X
M B PM 30 200
I would want the function to give me 4 in the X column.
I am familiar with indexing a value in one column given a match in another and with finding a closest matching value in a column, but I am having trouble putting all these criteria together:
=INDEX(Table1!F$2:F$20,
MATCH(Table2!A$2,Table1!A$2:A$20,0))
MATCH(Table2!B$2,Table1!B$2:B$20,0))
MATCH(Table2!C$2,Table1!C$2:C$20,0))
MATCH(MIN(ABS(Table1!D$2:D$20-Table2!D$2)),ABS(Table1!D$2:D$20-Table2!D$2), 0))
MATCH(MIN(ABS(Table1!E$2:E$20-Table2!E$2)),ABS(Table1!E$2:E$20-Table2!E$2),0))
Thank you in advance for your help.
This formula works; put it in F2:
=INDEX(Table1!F:F,MATCH(MIN(IF(A2=Table1!A:A,IF(B2 = Table1!B:B,IF(C2 = Table1!C:C,ABS((Table1!D:D+Table1!E:E)-(D2+E2)))))),IF(A2=Table1!A:A,IF(B2 = Table1!B:B,IF(C2 = Table1!C:C,ABS((Table1!D:D+Table1!E:E)-(D2+E2))))),0))
This is an array formula and must be confirmed with Ctrl-Shift-Enter when exiting edit mode. Then copy down.
As to your other comment: if two rows have the same difference, the first one in order takes priority.
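For readers who find the nested IFs hard to follow, the logic of the formula can be sketched in Python with pandas. This is an illustration of what the formula computes, not a replacement for it; lookup_x is a hypothetical helper name. Like the formula, it compares the sum Temp + Light rather than each value separately.

```python
import pandas as pd

# Table 1 from the question
table1 = pd.DataFrame(
    [("M", "B", "AM", 25, 25, 2), ("M", "B", "PM", 26, 50, 3),
     ("M", "B", "PM", 27, 150, 4), ("M", "C", "AM", 25, 25, 5),
     ("M", "C", "AM", 26, 150, 6), ("M", "C", "PM", 27, 500, 7),
     ("M", "C", "PM", 28, 800, 9), ("A", "B", "AM", 25, 25, 2),
     ("A", "B", "PM", 26, 50, 3), ("A", "B", "PM", 27, 150, 4),
     ("A", "C", "AM", 25, 25, 5), ("A", "C", "AM", 26, 150, 6),
     ("A", "C", "PM", 27, 500, 7), ("A", "C", "PM", 28, 800, 9)],
    columns=["Treatment", "Species", "TimeofDay", "Temp", "Light", "X"],
)

def lookup_x(table, treatment, species, time_of_day, temp, light):
    """Return X from the closest row whose categorical columns match exactly."""
    # Exact match on the three categorical columns (the nested IFs)
    candidates = table[(table["Treatment"] == treatment)
                       & (table["Species"] == species)
                       & (table["TimeofDay"] == time_of_day)]
    if candidates.empty:
        return None
    # Same distance as the formula: |(Temp + Light) - (temp + light)|
    diff = ((candidates["Temp"] + candidates["Light"]) - (temp + light)).abs()
    # idxmin returns the first row with the smallest difference (ties: first wins)
    return candidates.loc[diff.idxmin(), "X"]

result = lookup_x(table1, "M", "B", "PM", 30, 200)
print(result)
```

For the Table 2 row from the question (M, B, PM, 30, 200) the candidates are the two M/B/PM rows, and the one with Temp 27 and Light 150 is closer, so X is 4.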
