I have a binary variable (Var C) that identifies when another variable (Var B) is above or below a different variable (Var A). There is typically a series of the same value for Var C. I'd like to make a new variable (Var D) that represents the unique group of data for the time in between switches.
Hope this helps
VarA VarB VarC VarD
30 28 1 1
32 28 1 1
33 30 1 1
32 32 1 1
34 33 1 1
35 36 0 2
37 38 0 2
38 39 0 2
39 39 0 2
40 39 1 3
38 37 1 3
37 36 1 3
35 33 1 3
Thanks in advance for any help
If your data is in columns A through C you can assign value 1 to cell D2 and use the following formula for the rest of the rows:
=IF(C3<>C2,D2+1,D2)
In D2 enter 1
In D3 enter:
=IF(C3=C2,D2,D2+1)
and copy downward.
Related
Click here for the imageI m trying to create a list from 3 different series which will be of the shape "({A} {B} {C})" where A denotes the 1st element from series 1, B is for 1st element from series 2, C is for 1st element from series 3 and this way it should create a list containing 600 element.
List 1 List 2 List 3
u_p0 1 v_p0 2 w_p0 7
u_p1 21 v_p1 11 w_p1 45
u_p2 32 v_p2 25 w_p2 32
u_p3 45 v_p3 76 w_p3 49
... .... ....
u_p599 56 v_p599 78 w_599 98
Now I want the output list as follows
(1 2 7)
(21 11 45)
(32 25 32)
(45 76 49)
.....
These are the 3 series I created from a dataframe
r1=turb_1.iloc[qw1] #List1
r2=turb_1.iloc[qw2] #List2
r3=turb_1.iloc[qw3] #List3
Pic of the seriesFor the output I think formatted string python method will be useful but I m quite not sure how to proceed.
turb_3= ["({A} {B} {C})".format(A=i,B=j,C=k) for i in r1 for j in r2 for k in r3]
Any kind of help will be useful.
Use pandas.DataFrame.itertuples with str.format:
# Sample data
print(df)
col1 col2 col3
0 1 2 7
1 21 11 45
2 32 25 32
3 45 76 49
fmt = "({} {} {})"
[fmt.format(*tup) for tup in df[["col1", "col2", "col3"]].itertuples(False, None)]
Output:
['(1 2 7)', '(21 11 45)', '(32 25 32)', '(45 76 49)']
I have a data frame with numbers in multiple columns listed by date, what I'm trying to do is find out the most frequently occurring numbers across the whole data set, also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject) :
# clean up special characters in the column names and make the date column the index as a date type.
pdobject["Date"] = pd.to_datetime(pdobject["Date"])
pdobject = pdobject.set_index('Date')
for column in pdobject:
if '#' in column:
pdobject = pdobject.rename(columns={column:column.replace('#','')})
return pdobject
def lotimport() :
lotret = {}
# list files in data directory with csv filename
for lotpath in [f for f in glob.glob("data/*.csv")]:
lotname = lotpath.split('\\')[1].split('.')[0]
lotret[lotname] = lotnorm(pd.read_csv(lotpath))
return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
print(f"{i} - {type(i)}")
lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)
for columns
df.max(axis=1)
for index
Ok so the final answer I came up with was a mix of a few things including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and then turn the values into the index, after that sort the column by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject) :
# clean up special characters in the column names and make the date column the index as a date type.
pdobject["Date"] = pd.to_datetime(pdobject["Date"])
pdobject = pdobject.set_index('Date')
for column in pdobject:
if '#' in column:
pdobject = pdobject.rename(columns={column:column.replace('#','')})
return pdobject
def lotimport() :
lotret = {}
# list files in data directory with csv filename
for lotpath in [f for f in glob.glob("data/*.csv")]:
lotname = lotpath.split('\\')[1].split('.')[0]
lotret[lotname] = lotnorm(pd.read_csv(lotpath))
return lotret
lotcomb = pd.Series([],dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)
My current dataframe data is as follows:
df=pd.DataFrame([[1.4,3.5,4.6],[2.8,5.4,6.4],[7.8,6.5,5.8]],columns=['t','i','m'])
t i m
0 14 35 46
1 28 54 64
2 28 34 64
3 78 65 58
My goal is to apply a vectorized operations on a df with a conditions as follows (pseudo code):
New column of answer starts with value of 1.
For row in df.itertuples():
if (m > i) & (answer in row-1 is an odd number):
answer in row = answer in row-1 + m
elif (m > i):
answer in row = answer in row-1 - m
else:
answer in row = answer in row-1
The desired output is as follows:
t i m answer
0 14 35 46 1
1 28 54 59 60
2 78 12 58 2
3 78 91 48 2
Any elegant solution would be appreciated.
I have ProductID:
ProductID
12
85
14
14
14
17
65
65
I need to create a new column where i could count how many times i have the same ProductID. I mean:
ProductID Count
12 1
85 1
14 1
14 2
14 3
17 1
65 1
65 2
The formula I currently have is = IF((A3=A2),F2+1,1), but it doesn't work.
Use this formula if you want the count of each individual entry to be same. (i.e count for 14 will be 3 as it exists three times in the data set)
=COUNTIF($A$2:A9,A2)
Output is attached for your reference.
In B2 enter:
=COUNTIF($A$1:A2,A2)
and copy down
Is there a way to substitute the cell address containing a text string as the array criteria in the following formula?
=SUM(SUMIF(A5:A10,{1,22,3},E5:E10))
So instead of {1,22,3}, "1, 22, 3" is entered in cell A2 the formula becomes
=SUM(SUMIF(A5:A10,A2,E5:E10))
I have tried but get 0 as a result (refer C16)
A B C D E F G H
1 Tree
2 {1,22,3} 1
3 22
4 Tree Profit 3
5 1 105
6 2 96
7 1 105
8 1 75
9 2 76.8
10 1 45
11
12 330 =SUM(SUMIF(A5:A10,{1,22,3},B5:B10))
13
14 330 =SUMPRODUCT(SUMIF(A5:A10,E2:E3,B5:B10))
15
16 0 =SUM(SUMIF(A5:A10,A2,B5:B10))
17 NB: Custom Format "{"#"}" on Cell A2 I enter 1,22,3 so it displays {1,22,3}
Ok so after some further searching (see Excel string to criteria) and trial and error I have come up with the following solution.
Using Name Manager I created UDF called GetList which Refers to:
=EVALUATE(Sheet1!$A$3) NB: Cell A3 has this formula in it =TEXT(A2,"{#}")
I then used the following formula:
=SUMPRODUCT(SUMIF($A$5:$A$12,GetList,$B$5:$B$12))
which gives the desired result of 321 as per the other two formulas (see D12 below).
If anyone can suggest a better solution then feel free to do so.
Thanks to Dennis to my original post regarding table
A B C D E
1 Tree
2 1,22,3 1
3 {1,22,3} =TEXT(A2,"{#}") 22
4 Tree Profit 3
5 11 105
6 22 96
7 1 105
8 3 75
9 2 76.8
10 1 45
11
12 321 =SUMPRODUCT(SUMIF($A$5:$A$12,GetList,$B$5:$B$12))
13
14 321 =SUM(SUMIF(A5:A10,{1,22,3},B5:B10))
15
16 321 =SUMPRODUCT(SUMIF(A5:A10,E2:E3,B5:B10))
17
18 0 =SUM(SUMIF(A5:A10,A2,B5:B10))
19 NB: Custom Format "{"#"}" on Cell A2 I enter 1,22,3 so it displays {1,22,3}