Use Switch/Case Statement to build DF2, by Iterating Over Rows in DF1 - python-3.x

I've loaded data from a tab-delimited file into a DataFrame (DF1). The data comes from a form filled out using a template.
A critical concept is that a variable number of rows makes up one entry in the form. In DF1 below, every time the index is "A", a new record starts. So the code needs to iterate through the rows to rebuild each record in DF2, where each record is represented as one row.
Because each "A" row in DF1 starts a new form entry (and a corresponding row in DF2), DF1 below contains just two entries, so DF2 will have just two rows. Also important: each row holds a different number of data items (columns). Z has 2 (then NAs), A has 3, B has 4.
All of this needs to be mapped to DF2 depending on the index letters Z, A, B (note there are more index letters but this is simplified for this example).
DF 1
- A B C D
Z xyz 5 NA NA
A COA aa bb NA
B RE 01 02 03
B DE 04 05 06
A COB dd ee NA
B RE 01 02 03
B DE 04 05 06
In the past I've done this type of thing in VBA and would have used a CASE statement to transform the data. I've found a good start using dictionaries in this thread:
Replacements for switch statement in Python?
One code example at the above thread suggests using a dictionary type case statement:
return {
    'a': 1,
    'b': 2,
}[x]
This seems like it would work, although I'm not certain how to apply it in practice. In addition, for each A, B, etc. above, I need to carry out multiple instructions, depending on the index letter. For the most part, the instructions say where to map the data in DF2. For example:
Index A:
Map column A to DF2.iloc[1]['B']
Map column B to DF2.iloc[1]['C']
Map column C to DF2.iloc[1]['D']
Index B:
Would have four instructions, similar to above.
DF2 would end up looking like so
- A B C D E F G H I J K L
1 xyz COA aa bb RE 01 02 03 DE 04 05 06
2 xyz COB dd ee RE 01 02 03 DE 04 05 06
So for each row in DF1, a different number of instructions is being performed depending on the "index letter." All instructions are telling the code where to put the data in DF2. The mapping instruction for each different index letter will always be the same for the columns, only the row will be changing (some type of counter as you move from one record group to the next in DF2).
How can I handle the different number of instructions for each type of index letter in a switch/case type format?
Thank you

I think you can use:
import numpy as np
# keep only the rows whose index is 2 or 3
df1 = df[df.index.isin([2,3])].copy()
# new column: 'Z' where the index is 2, otherwise the value from column A
df1['new'] = np.where(df1.index == 2, 'Z', df1.A)
# start a new group at every row whose index is 2
df1['g'] = (df1.index == 2).cumsum()
# move g and new into the index, reshape to wide, then reorder the columns
df1 = (df1.set_index(['g','new']).unstack()
          .swaplevel(0,1, axis=1)
          .sort_index(axis=1, ascending=[False, True]))
# default column names
df1.columns = range(len(df1.columns))
print (df1)
0 1 2 3 4 5 6 7 8 9 10 11
g
1 ABC aa bb cc R 01 02 NaN D NaN 03 04
2 DEF dd ee ff R 01 02 NaN D NaN 03 04
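The dictionary-as-switch idea from the question can also be sketched directly. This is only a minimal illustration, assuming the DF1 layout shown in the question and assuming (from the desired DF2) that Z contributes one value, A three and B four. The handler names (on_z, on_a, on_b) and the state dictionary are hypothetical helpers, not part of pandas.

```python
import pandas as pd

# DF1 from the question; the index letter decides how each row is handled
index = ["Z", "A", "B", "B", "A", "B", "B"]
data = [
    ["xyz", "5", None, None],
    ["COA", "aa", "bb", None],
    ["RE", "01", "02", "03"],
    ["DE", "04", "05", "06"],
    ["COB", "dd", "ee", None],
    ["RE", "01", "02", "03"],
    ["DE", "04", "05", "06"],
]
df1 = pd.DataFrame(data, index=index, columns=list("ABCD"))

# one handler per index letter (hypothetical names); each one knows
# which columns of its row belong in the output record
def on_z(state, row):
    # Z holds a value shared by every record (assumption based on DF2)
    state["header"] = [row["A"]]

def on_a(state, row):
    # an A row starts a new record
    state["records"].append(state["header"] + [row["A"], row["B"], row["C"]])

def on_b(state, row):
    # a B row extends the record currently being built
    state["records"][-1].extend([row["A"], row["B"], row["C"], row["D"]])

# the dictionary plays the role of the switch/case statement
handlers = {"Z": on_z, "A": on_a, "B": on_b}

state = {"header": [], "records": []}
for letter, row in df1.iterrows():
    handlers[letter](state, row)

df2 = pd.DataFrame(
    state["records"],
    columns=list("ABCDEFGHIJKL"),
    index=range(1, len(state["records"]) + 1),
)
print(df2)
```

Adding another index letter then means writing one more handler and registering it in the dictionary; the loop itself does not change.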

Related

Create Multiple Dataframes using Loop & function

I have a df over 1M rows similar to this
ID Date Amount
x May 1 10
y May 2 20
z May 4 30
x May 1 40
y May 1 50
z May 2 60
x May 1 70
y May 5 80
a May 6 90
b May 8 100
x May 10 110
I have to sort the data by date and then create new dataframes depending on how many purchases each ID has made. So if x has made 3 purchases, I need it in 3 different dataframes. The first_purchase dataframe would contain every ID that has purchased at least once, irrespective of date or amount.
If an ID purchases 3 times, I need that ID to be in the first-purchase dataframe, then the second and then the third, each with its Date and Amount.
Doing it manually is easy:
df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]
The second dataframe would be created with:
after_1stpurchase = after_1stpurchase.sort_values('Date')
second_purchase = after_1stpurchase.drop_duplicates('ID')
after_2ndpurchase = after_1stpurchase[~after_1stpurchase.index.isin(second_purchase.index)]
How do I create a loop that gives me each of these dataframes?
IIUC, I was able to achieve what you wanted.
import pandas as pd
import numpy as np
# source data for the dataframe
data = {
"ID":["x","y","z","x","y","z","x","y","a","b","x"],
"Date":["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
"Amount":[10,20,30,40,50,60,70,80,90,100,110]
}
df = pd.DataFrame(data)
# convert the Date column to datetime and still maintain the format like "May 01"
df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
# sort the values on ID and Date
df.sort_values(by=['ID', 'Date'], inplace=True)
df.reset_index(inplace=True, drop=True)
print(df)
Original Dataframe:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 40 May 01 x
4 70 May 01 x
5 110 May 10 x
6 50 May 01 y
7 20 May 02 y
8 80 May 05 y
9 60 May 02 z
10 30 May 04 z
# create a sorted list of unique ids
list_id = sorted(set(df['ID']))
# create an empty list that will contain the dataframes
df_list = []
# number of purchases to separate out per id.
# For example, to record 3 entries for each id,
# iter is 3. This creates three new dataframes
# that hold the first, second and third
# transactions respectively.
# (note: the name iter shadows the Python builtin)
iter = 3
for i in range(iter):
    df_list.append(pd.DataFrame())
for val in list_id:
    tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    # consider only the top iter(=3) values to be distributed
    counter = min(tmp_df.shape[0], iter)
    for idx in range(counter):
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        df_list[idx] = pd.concat([df_list[idx], tmp_df.loc[tmp_df.index == idx]])
for out_df in df_list:
    out_df.reset_index(drop=True, inplace=True)
    print(out_df)
Transaction #1:
Amount Date ID
0 90 May 06 a
1 100 May 08 b
2 10 May 01 x
3 50 May 01 y
4 60 May 02 z
Transaction #2:
Amount Date ID
0 40 May 01 x
1 20 May 02 y
2 30 May 04 z
Transaction #3:
Amount Date ID
0 70 May 01 x
1 80 May 05 y
Note that in your data there are four transactions for 'x'. If, say, you wanted to track the fourth transaction as well, all you need to do is change the value of 'iter' to 4, and you will get a fourth dataframe with the following value:
Amount Date ID
0 110 May 10 x
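A loop-free alternative is to number each ID's purchases with groupby and cumcount and then split on that number. The following is a minimal sketch, not the answer's method. It assumes the zero-padded "May 01" style strings from the answer, which happen to sort correctly as text only because every date is in the same month; real data would need proper datetimes. The column name purchase_no is a hypothetical helper.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["x", "y", "z", "x", "y", "z", "x", "y", "a", "b", "x"],
    "Date": ["May 01", "May 02", "May 04", "May 01", "May 01", "May 02",
             "May 01", "May 05", "May 06", "May 08", "May 10"],
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
})

# stable sort keeps the original order for purchases on the same date
df = df.sort_values("Date", kind="stable")

# 1 for each ID's first purchase, 2 for its second, and so on
df["purchase_no"] = df.groupby("ID").cumcount() + 1

# one dataframe per purchase number, keyed by that number
purchases = {
    n: group.drop(columns="purchase_no").reset_index(drop=True)
    for n, group in df.groupby("purchase_no")
}

for n, frame in purchases.items():
    print(f"Purchase #{n}")
    print(frame)
```

Here purchases[1] plays the role of first_purchase, purchases[2] of second_purchase, and the number of dataframes follows from the data rather than from a preset count.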

Sample dataframe with number of records sampled per hour predefined

I have to sample a dataframe (df1) and I have another dataframe (df2) that tells me how many records I should retrieve from each hour of the day.
For example,
df1:
Hour number
0. 00 A
1. 00 B
2. 00 C
3. 01 D
4. 01 A
5. 01 B
6. 01 D
df2:
Hour number
0. 00 1
1. 01 2
So that in the end, I would get for example, record number 1 for midnight and records 3 and 5 for 1 am (or any other combination so long as it respects the number in df2)
The thing is that I need to write this in a function in order for me to call this inside another function.
So far I have
def sampling(frame):
return np.random.choice(frame.index)
but I am failing to add the constraints of the df2.
Could anybody help?
First we add the number of samples required as a new column using merge, and then apply sample to each group of Hour values. Finally we remove the added column by returning all but the last column:
import pandas as pd

def sampling(df1, df2):
    merged = df1.merge(df2, on='Hour')
    # use .iloc[0] because the group's first row label is not always 0
    sampled = merged.groupby('Hour').apply(lambda x: x.sample(x['Number'].iloc[0]))
    return sampled.reset_index(level=0, drop=True).iloc[:, :-1]

df1 = pd.DataFrame({'Hour': [0,0,0,1,1,1,1], 'Value': list('ABCDABD')})
df2 = pd.DataFrame({'Hour': [0,1], 'Number': [1,2]})
sampling(df1, df2)
Result:
Hour Value
2 0 C
4 1 A
5 1 B
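The same constraint can also be written without groupby.apply, whose handling of the grouping column differs between pandas versions. This is a minimal sketch under the answer's column names (Hour, Value, Number); the function name sample_per_hour and the seed parameter are hypothetical additions for reproducibility.

```python
import pandas as pd

def sample_per_hour(df1, df2, seed=None):
    # map each hour to the number of rows requested for it
    wanted = dict(zip(df2["Hour"], df2["Number"]))
    parts = []
    for hour, group in df1.groupby("Hour"):
        # never ask for more rows than the hour actually has
        n = min(int(wanted.get(hour, 0)), len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)

df1 = pd.DataFrame({"Hour": [0, 0, 0, 1, 1, 1, 1], "Value": list("ABCDABD")})
df2 = pd.DataFrame({"Hour": [0, 1], "Number": [1, 2]})

result = sample_per_hour(df1, df2, seed=0)
print(result)
```

Because it is a plain function of two dataframes, it can be called from inside another function as the question requires.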

I am having a column like this 9(05),X(05),X(15). I want to separate this 9,X,X into one column and data in () into another column. How can I do that?

input column is
9(05)
x(05)
x(15)
x(15)
s9(07)
Use extract:
pat = r'(.*?)\((.*?)\)'
df[['a','b']] = df['col'].str.extract(pat, expand=True)
print (df)
col a b
0 9(05) 9 05
1 x(05) x 05
2 x(15) x 15
3 x(15) x 15
4 s9(07) s9 07
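For completeness, a self-contained version of the extract call, assuming the input lives in a column named col as in the answer:

```python
import pandas as pd

df = pd.DataFrame({"col": ["9(05)", "x(05)", "x(15)", "x(15)", "s9(07)"]})

# group 1: everything before the opening parenthesis
# group 2: everything between the parentheses
pat = r"(.*?)\((.*?)\)"
df[["a", "b"]] = df["col"].str.extract(pat, expand=True)
print(df)
```

Both new columns hold strings, so the leading zero in values such as 05 is preserved.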

Excel: Index match if day of date matches?

I have created a time line in excel like this:
Sheet 2
A B C D E
____________________________
01 02 03 04 05
I have some data in sheet 1
Column A Column E
01/01/2017 Supplier X
05/01/2017 Supplier B
I am wanting to return the name of a supplier using index match where the day (listed on my timeline) matches the date in column A on sheet 1.
Here's what I'm trying to use, but it produces #N/A and #VALUE! errors:
=INDEX(Sheet1!$E:$E,MATCH(F$22,DAY(Sheet1!$A:$A),0))
Desired result:
A B C D E
Supplier X_____________________ Supplier B
01 02 03 04 05
Please can someone show me where I am going wrong?

Generating AA, AB, ..., ZZ in excel using numbers 0-25

I have created column as mentioned below:
0 A 00
1 B 01
2 C 02
3 D 03
4 E 04
5 F 05
6 G 06
7 H 07
8 I 08
9 J 09
10 K 010
I have two columns: the first has 0-10 and the second has A-K. I have created a third column holding the combinations AA, AB, ... through ZZ encoded as numbers (00, 01, ... through 010). I need to substitute the numbers with the letters using VLOOKUP and an IF condition. Kindly help me with it.
Can I do it with the help of =VLOOKUP() and/or =IF() alone?
Sorry, not really sure what you're after; I can't read your initial starting data very clearly.
To replace a value, you can:
=SUBSTITUTE(A1,"0","A")
not sure if that does what you want, though?
I'll start a new answer, it's different enough from the first post.
So, taking your initial data:
I add 2 more columns: they are simply the mapping in column C split out into columns D and E. At the time of writing, I have no idea how you would do that split programmatically, so until we have the rules I'll just do this by hand, and we can plug that other part in later.
0 A 00 0 0
1 B 01 0 1
2 C 02 0 2
3 D 03 0 3
4 E 04 0 4
5 F 05 0 5
6 G 06 0 6
7 H 07 0 7
8 I 08 0 8
9 J 09 0 9
10 K 010 0 10
And then use this formula:
=CONCATENATE(VLOOKUP(D1,$A$1:$B$11,2,FALSE),VLOOKUP(E1,$A$1:$B$11,2,FALSE))
and you'll end up with this:
AA
AB
AC
AD
AE
AF
AG
AH
AI
AJ
AK
If your ultimate goal is to generate AA thru ZZ, first enter the following UDF in a standard module:
Public Function BumpString(s As String) As String
    Dim aryIN(1 To 1000) As Integer
    Dim L As Long, i As Long, j As Long, carry As Boolean
    L = Len(s)
    j = 1
    For i = L To 1 Step -1
        aryIN(j) = Asc(Mid(s, i, 1))
        j = j + 1
    Next i
    For i = 1 To L
        carry = False
        aryIN(i) = aryIN(i) + 1
        If aryIN(i) = 91 Then
            aryIN(i) = 65
            carry = True
        Else
            Exit For
        End If
    Next
    If carry Then
        L = L + 1
        aryIN(L) = 65
    End If
    For i = 1 To L
        BumpString = Chr(aryIN(i)) & BumpString
    Next i
End Function
Then pick a cell, say D1, and enter AA. Below it, in D2, enter:
=bumpstring(D1)
Then copy D2 down through D676.
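Outside Excel, the same AA through ZZ sequence can be sketched in a few lines of Python. This is only an illustration of the numbering idea (0-25 mapped to A-Z, two positions), not a replacement for the UDF; the function name pair_from_number is hypothetical.

```python
import string
from itertools import product

letters = string.ascii_uppercase  # index 0 -> "A", ..., 25 -> "Z"

# every two-letter combination in order: AA, AB, ..., ZZ
pairs = ["".join(p) for p in product(letters, repeat=2)]

def pair_from_number(n):
    # treat n (0..675) as a two-digit base-26 number
    first, second = divmod(n, 26)
    return letters[first] + letters[second]

print(pairs[:3], pairs[-1], len(pairs))
print(pair_from_number(27))
```

The list has 26 * 26 = 676 entries, which is why the formula above is copied down to D676.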
You don't need VLOOKUP for this.
You can use the CODE function with the first and last character:
=(CODE(LEFT(A1))-65) & (CODE(RIGHT(A1))-65)
Note that many combinations will map to the same number. For example, "BU" (1 and 20) and "MA" (12 and 0) both give 120.
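The collision is easy to check in Python, along with one possible way around it. The zero-padded variant below is an assumption of mine, not something the answer proposes; both function names are hypothetical.

```python
def encode(pair):
    # mirrors the CODE(...)-65 formula: letter positions joined without padding
    return "".join(str(ord(ch) - 65) for ch in pair)

def encode_padded(pair):
    # assumed fix: pad each position to two digits so the split is unambiguous
    return "".join(f"{ord(ch) - 65:02d}" for ch in pair)

print(encode("BU"), encode("MA"))
print(encode_padded("BU"), encode_padded("MA"))
```

With padding, "BU" becomes 0120 and "MA" becomes 1200, so the two no longer collide.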
