assign value based on timestamps - python-3.x

I have two dataframes. The first one is:
data = [
[11,'a',1],
[16,'b',2],
[15,'a',3],
[19,'b',4]
]
data=pd.DataFrame(data)
and the second one is:
find=[
[4,'a'],
[11,'b'],
[11,'a'],
[16,'b'],
[17,'a'],
]
find=pd.DataFrame(find)
I'd like to assign values to the second dataframe based on the first one. There are a few conditions that need to be checked, for example:
1. if the 1st row is 4 and a, then return 1
2. if the 2nd row is 11 and b, then return 2
3. if the 3rd row is 11 and a, then return 1
4. if the 4th row is 16 and b, then return 4
I tried writing a for loop to do this, but the dataset is pretty big, so it takes too long to run and in the end it fails.
Is there a good solution for this? I'd appreciate any help!
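One loop-free sketch, under an assumption: for each (timestamp, letter) in find, take the value from the data row with the same letter and the nearest timestamp at or after it. This reproduces the first three examples (the fourth does not quite fit the sample data), so treat it as a starting point. The column names ts/key/value are made up for readability.
import pandas as pd

data = pd.DataFrame([[11, 'a', 1], [16, 'b', 2], [15, 'a', 3], [19, 'b', 4]],
                    columns=['ts', 'key', 'value'])
find = pd.DataFrame([[4, 'a'], [11, 'b'], [11, 'a'], [16, 'b'], [17, 'a']],
                    columns=['ts', 'key'])

# merge_asof needs both frames sorted by the merge column
result = pd.merge_asof(find.sort_values('ts'), data.sort_values('ts'),
                       on='ts', by='key', direction='forward')
print(result)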

Related

How to look into previous three row values to Current Row in Python after applying Group by

How can I get the following expected output in Python?
Sample input with expected output:
ACTUAL_EXPECTED_OUTPUT is the expected output column.
The scenario: for each account we need to look at the prior three observations of the IS_DEFAULT column, and if a 1 appears in any of those three observations the result should be 1, else 0.
Group by the account id (using order by MONTH_SINCE_DISB if needed); then, for each account id, look at the prior three observations and mark the new column as 1 if any of them contains a 1, else 0. The same logic should be applied iteratively for every account id.
Something like this should work
import numpy as np

# Create temp column: once the first 1 is found, ffill the rest to 1 for that ACCT_ID
df['ISDEFAULT_TEMP'] = df.groupby('ACCT_ID')['IS_DEFAULT'].apply(lambda x: x.replace(to_replace=0, method='ffill'))

# Create a condition using that new column: if the cumsum > 2 for an ACCT_ID, then True
# (i.e. an IS_DEFAULT=1 has been seen 2 rows ago)
cond = df.groupby('ACCT_ID')['ISDEFAULT_TEMP'].transform('cumsum') > 2

# Define the new column given the condition
df['ACTUAL_EXPECTED_OUTPUT'] = np.where(cond, 1, 0)
df.drop('ISDEFAULT_TEMP', axis=1, inplace=True)
df
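If the requirement is literally "mark 1 when any of the three prior rows for the same account had IS_DEFAULT equal to 1", a shifted rolling maximum is another way to express it. A minimal sketch, assuming the column names from the question:
import numpy as np
import pandas as pd

df = df.sort_values(['ACCT_ID', 'MONTH_SINCE_DISB'])
# For each account, look only at the three rows *before* the current one
prior_any = (df.groupby('ACCT_ID')['IS_DEFAULT']
               .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).max()))
df['ACTUAL_EXPECTED_OUTPUT'] = np.where(prior_any.fillna(0) > 0, 1, 0)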

How to write function to extract n+1 column in pandas

I have an Excel file with 200 columns. The first column is the number of visits, and the other columns contain the number of people for that number of visits.
Visits A B C D
2 10 0 30 40
3 5 6 0 1
4 2 3 1 0
I want to write a function so that I get multiple dataframes: one with the Visits column and A, one with the Visits column and B, and so on (I want a function because the number of columns will increase in the future and I want to automate the process). Also, I want to remove the rows with 0.
Desired output:
dataframe 1:
Visits A
2 10
3 5
4 2
dataframe 2:
Visits B
3 6
4 3
This is my first question, so sorry if it is not properly framed. Thank you for your help.
Use DataFrame.items:
for i, col in df.set_index('Visits').items():
    print(col[col.ne(0)].to_frame(i).reset_index())
You can create a dict to store each result under the name of its column:
dfs = {i: col[col.ne(0)].to_frame(i).reset_index() for i, col in df.set_index('Visits').items()}
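A self-contained sketch of the approach above, using the sample data from the question (column names taken from the post, everything else assumed):
import pandas as pd

df = pd.DataFrame({'Visits': [2, 3, 4],
                   'A': [10, 5, 2],
                   'B': [0, 6, 3],
                   'C': [30, 0, 1],
                   'D': [40, 1, 0]})

dfs = {name: col[col.ne(0)].to_frame(name).reset_index()
       for name, col in df.set_index('Visits').items()}

print(dfs['B'])   # the Visits/B pairs with the zero row dropped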

How to get a column name based on function

Suppose I have a dataset:
A B C Final
1 2 3 C
4 5 6 C
I want Final to return the column name of the max value. In the example above, 3 and 6 are the highest values, so column C is returned in the 'Final' column. Is there an efficient way/formula to get the column names without using VBA? The real dataset contains 60 columns.
I would try something like this to get the column letter.
=SUBSTITUTE(ADDRESS(1,MATCH(MAX(A3:C3),A3:C3,0),4),1,"")
If you want the header, I think there's an easier way, but the extension from this would be
=INDIRECT(SUBSTITUTE(ADDRESS(1,MATCH(MAX(A3:C3),A3:C3,0),4),1,"")&1)
For the header, a better way is
=INDEX(A1:C1,1,MATCH(MAX(A3:C3),A3:C3,0))
HLOOKUP is a way without VBA, but sadly HLOOKUP only searches the first row of the array, so you have to add a header row at the bottom:
A B C Final
1 2 3 <formula1>
4 5 6 <formula2>
A B C
If the dataset starts in the top-left corner (headers in row 1, data from row 2), Formula1 would be:
=HLOOKUP(MAX(A2:C2);A2:C4;3;FALSE)
Formula2 would be
=HLOOKUP(MAX(A3:C3);A3:C4;2;FALSE)
etc...
=HLOOKUP(MAX($A3:$C3);$A3:$C$4;<manual change here backward>;FALSE)
You can use the vector form of the LOOKUP function.
If your first row, with A B C, contains the labels, then:
=LOOKUP(2, 1/(MAX(A2:C2)=A2:C2),$A$1:$C$1)
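If the same table ends up in a pandas DataFrame rather than a spreadsheet, the column label of the row-wise maximum can be taken directly with idxmax. A minimal sketch using the sample values from the question:
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})
df['Final'] = df[['A', 'B', 'C']].idxmax(axis=1)   # 'C' for both rows
print(df)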

How to count numbers within a certain range of rows?

I would like to count values over every 7 rows; the data are in one column. I use this formula, but it is not working.
From B8 to B14329, for every 7 rows, count how many values equal 3, so that I know how many 3s there are in every 7 rows.
=COUNTIFS(B8:B14329, OFFSET($B$7,(ROW()-12)*7,0,7,1),B8:B14329,=3)
Thanks a lot!
I want something like this:
data count
3
2
3
1
3
3
1 4
1
2
2
3
3
1
1 2
.....
....
...
Simple and easy:
=SUMPRODUCT((B8:B14329=3)*(MOD(ROW(B8:B14329),7)=1))
Just change the =1 to your needs. To start with row 1 use =1, row 2 =2 ... row 6 =6, row 7 =0. This way, to start counting at row 8 it is =1.
EDIT: having your example now, you want something completely different... lol.
=IF(MOD(ROW(),7)=0,COUNTIF(A8:A14,3),"")
Put this in row 14 and then drag down... change the =0 as you need it.
Here's what I would do:
Add a new column with the row index (8 to 14329 in your case).
Add yet another column with a formula that tells whether the row index you just added is a multiple of 7; put its value as "TRUE" or "FALSE".
You can use the MOD function to check the remainder of the division:
=MOD(Number, Divisor)
By now you should have, aside from the columns you already have, something like:
8-----FALSE
9-----FALSE
10-----FALSE
11-----FALSE
12-----FALSE
13-----FALSE
14-----TRUE
15-----FALSE
Once you have that, just apply a filter on the TRUE/FALSE column, select the TRUE values, and you will be able to count the number of 3s in the actual value column, by also filtering on it.
I hope it helps; it's easier than a really messy formula.
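For completeness, the same per-7-rows count is short in pandas if the column ever ends up in a Series; a minimal sketch using the sample values from the question (everything else is assumed):
import numpy as np
import pandas as pd

s = pd.Series([3, 2, 3, 1, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1])  # sample data from the post
# Label each row with its 7-row block number, then count the 3s per block
counts = s.eq(3).groupby(np.arange(len(s)) // 7).sum()
print(counts)   # block 0 -> 4, block 1 -> 2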

Controlling the data partition in Apache Spark

Data Looks Like:
col 1 col 2 col 3 col 4
row 1 row 1 row 1 row 1
row 2 row 2 row 2 row 2
row 3 row 3 row 3 row 3
row 4 row 4 row 4 row 4
row 5 row 5 row 5 row 5
row 6 row 6 row 6 row 6
Problem: I want to partition this data so that, say, row 1 and row 2 are processed as one partition, row 3 and row 4 as another, row 5 and row 6 as another, and then create JSON data by merging each partition with the column headers (column headers as keys, row values as values).
Output should be like:
[
{col1:row1, col2:row1, col3:row1, col4:row1},
{col1:row2, col2:row2, col3:row2, col4:row2},
{col1:row3, col2:row3, col3:row3, col4:row3},
{col1:row4, col2:row4, col3:row4, col4:row4},...
]
I tried using repartition(num), which is available in Spark, but it does not partition the data exactly as I want, and therefore the generated JSON data is not valid. I had an issue with my program taking the same time to process the data regardless of the number of cores used, which can be found here, and repartition was suggested by @Patrick McGloin. The code mentioned in that problem is what I am trying to do.
I guess what you need is partitionBy. In Scala you can provide a custom-built HashPartitioner to it, while in Python you pass partitionFunc. There are a number of examples out there in Scala, so let me briefly explain the Python flavour.
partitionBy expects an RDD of tuples, with the first element being the key; partitionFunc is applied to that key. Let's assume you organise your data in the following fashion:
(ROW_ID, (A, B, C, ...)) where ROW_ID = [1, 2, 3, ..., k]. You can always add ROW_ID and remove it afterwards.
To get a new partition every two rows:
rdd.partitionBy(numPartitions=int(rdd.count() / 2),
                partitionFunc=lambda key: (key - 1) // 2)
With ROW_ID starting at 1, partitionFunc produces the sequence 0, 0, 1, 1, 2, 2, ...; this number is the partition to which the given row will belong.
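A minimal end-to-end PySpark sketch of the idea above (the sample rows, column names and the 2-rows-per-partition grouping are assumptions based on the question):
from pyspark import SparkContext
import json

sc = SparkContext.getOrCreate()

cols = ["col1", "col2", "col3", "col4"]
rows = [("row%d" % i,) * 4 for i in range(1, 7)]   # 6 rows x 4 columns, as in the sample

# Key each row with a 0-based ROW_ID so partitionFunc can group pairs of rows
keyed = sc.parallelize(rows).zipWithIndex().map(lambda t: (t[1], t[0]))

num_parts = int(keyed.count() / 2)
partitioned = keyed.partitionBy(num_parts, partitionFunc=lambda key: key // 2)

# Each partition now holds two consecutive rows; turn them into JSON objects
def to_json(records):
    for _, values in records:
        yield json.dumps(dict(zip(cols, values)))

print(partitioned.mapPartitions(to_json).collect())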
