Unique function in python - python-3.x

for i in df["col"].unique():
...
Here unique function is called after each iteration of loop or is it just called once and stores the result in memory??
Asking this just to check if unique function is executed after every iteration then there is chance that even in next iteration i might be same as in previous iteration.
Ex first time unique function is called then df["col"].unique() gives [1,2] so i would be 1 for first iteration and in second iteration unique function is again called and i may again get 1 as value.

The construction you are using first calculates the .unique() function and uses the result of that function, if iterable, to loop over.
If you'd want the loop to evaluate a function every iteration, you could use structures like:
list = [x.function() for x in items]
check this link for more information:
https://opensource.com/article/18/3/loop-better-deeper-look-iteration-python

Here is a quick code example to see how it works :
df = pd.DataFrame(0, index=np.arange(2), columns=['1','2'])
df.iloc[0][0]=1
for i in df['1'].unique():
print(f'unique values : {i}')
print(df)
df.iloc[1][0]=2
First, we create a 2x2 Dataframe of zeroeswith a 1 at [1][1] position :
1 2
0 1 0
1 0 0
Then we call unique to get all the uniques values of column 1 (i.e 0 and 1).
During the loop, we change the value one of the cell on column 1 (the one we iterate over). But as you can see in the output, it does not add any iteration in the loop.
This means that df.unique() store the result before iterating over it just as MdBrainz said and that modifying the value during the loop ain't going to change how many time the loop iterate.
Output :
unique values : 1
1 2
0 1 0
1 0 0
unique values : 0
1 2
0 1 0
1 2 0

Related

Excel - mark duplicate value and if required or not

I have a data like this:
ID ID-Name CountUnique Required Available **Results**
1-Line 1 Line 1 1 1 Y Y
2-Line 1 Line 1 0 0 N Y-Duplicate
3-Line 1 Line 1 0 1 N Y-Duplicate
1-Line 2 Line 2 1 0 N N-Duplicate
2-Line 2 Line 2 0 1 N N
3-Line 2 Line 2 0 1 N N-Duplicate
I am using excel and I want to use an if condition to determine when a ID have the same values and if it is available or not. If it has duplicates and is available, I want to have Y (for the one that is available) and Y-Duplicate in Results column for all the same ID-Names (regardless if other ID-Names are available or not) and if not available similar logic.
How can I do this for the whole sheet?
My attempts were based on the following logic. If I am able to do it for individual steps then I can combine. The issue I noticed is that I need to take into account the ID-Name and have it used.
Current formulas:
=IF(AND([#Available] = "Y", [#CountUnique] =1),"Y", "Y-Duplicate")
=IF(AND([#Available] = "Y", [#CountUnique] =0),"Y-Duplicate","")
=IF(AND([#Available] = "N", [#CountUnique] =1),"N", "N-Duplicate")
=IF(AND([#Available] = "N", [#CountUnique] =0),"N-Duplicate","")
Thanks.
Not sure if I understood your logic properly, but just in case, note that you may benefit from you field CountUnique
My formula in Results is:
=IF(C2=1;E2;E2&"-Duplicate")

Excel - assign values based on the first unique item

I have got an excel question that I can not answer. Here is my table:
ID Key Count Unique Available Text Results
1 0 Text-1 Dupe-Y
2 1 Y Text-1 Y
3 0 Text-1 Dupe-Y
4 0 Text-1 Dupe-Y
5 1 N Text-2 N
6 1 Y Text-3 Y
7 0 Text-2 Dupe-N
8 0 Duplicate Text-2 Dupe-N
9 0 Duplicate Text-2 Dupe-N
10 0 Y Text-2 Dupe-N
Id Key is just unique key.
Count unique picks up the first time each value in column Text appears. Available can have Y, N, Duplicate and Text is the main column I need to analyze my table. The Results are for the first time each value in Text appears (Count unique = 1), if there is a value in Available then that is the value I need, if Count Unique is 0 then is either Dupe-Y or Dupe-N depending on the value in Available.
I tried with a formula like this one but got stuck after initial progress. =IF(B2=0,"",IFERROR(IF(COUNTIF(D:D,D2)>1,IF(COUNTIF($D:$D,D2)=1,"",C2),1),1))
Note that the column Results is the one I need to populate with a formula that is not affected by sorting or lack of it.
I guess you got all those values and you just need a formula for column Results.
My formul will work only if the data is sorted like in your example. If sorting changes, formula will fail:
My formula is:
=IF(B2=1;D2;"Dupe-"&RIGHT(G1;1))

Is there a way to convert my column of incrementing integers separated by zero to the number of intervals encountered so far in a pandas datafram?

I'm working in pandas and I have a column in my dataframe filled by 0s and incrementing integers starting at one. I would like to add another column of integers but that column would be a counter of how many intervals separated by zero we have encountered to this point. For example my data would like like
Index
1
2
3
0
1
2
0
1
and I would like it to look like
Index IntervalCount
1 1
2 1
3 1
0 1
1 2
2 2
0 2
1 2
Is it possible to do this with vectorized operation or do I have to do this iteratively? Note, it's not important that it be a new column could also overwrite the old one.
You can use cumsum function.
df["IntervalCount"] = (df["Index"] == 1).cumsum()

Intermediate steps in evaluation of Frequency formula

This has reference to [SO question]Counting unique list of items from range based on criteria from other ranges
Formula Suggested by Scot Craner is :
=SUM(--(FREQUENCY(IF(B2:B7<=25,IF(C2:C7<=35,COUNTIF(A2:A7,"<"&A2:A7),""),""),COUNTIF(A2:A7,"<"&A2:A7))>0))
I have been able to understand clearly the logic and evaluation of the formula except for this step shown in the attached snapshots.
As per MS Office document:
FREQUENCY(data_array, bins_array) The FREQUENCY function syntax has
the following arguments: Data_array Required. An array of or
reference to a set of values for which you want to count frequencies.
If data_array contains no values, FREQUENCY returns an array of zeros.
Bins_array Required. An array of or reference to intervals into
which you want to group the values in data_array. If bins_array
contains no values, FREQUENCY returns the number of elements in
data_array.
It is clear to me as to How {1;1;4;0;"";"") comes in data_array and also how {1;1;4;0;5;3} comes in bins_array.But how it evaluates to {2;0;1;1;0;0;0} is not clear to me.
Would appreciate if someone can lucidly explain it.
So you wants to know how
FREQUENCY({1;1;4;0;"";""},{1;1;4;0;5;3}) evaluates to {2;0;1;1;0;0;0}?
Problem is that the bins_array not needs to be sorted to make FREQUENCY working. But of course it internally must sort the bins_array to get the intervals into which to group the values in data_array. Then it groups and counts and then it returns the counted numbers in the same order the bins was given in bins_array.
Scores Bins
1 1
1 1
4 4
0 0
"" 5
"" 3
Bins sorted
0 (<=0)
1 (>0, <=1)
1 (>1, <=1) == not possible
3 (>1, <=3)
4 (>3, <=4)
5 (>4, <=5)
(>5)
Bin Description Result
1 Number of scores (>0, <=1) 2
1 Number of scores (>1, <=1) == not possible 0
4 Number of scores (>3, <=4) 1
0 Number of scores (<=0) 1
5 Number of scores (>4, <=5) 0
3 Number of scores (>1, <=3) 0
Number of scores (>5) 0

pandas how to flatten a list in a column while keeping list ids for each element

I have the following df,
A id
[ObjectId('5abb6fab81c0')] 0
[ObjectId('5abb6fab81c3'),ObjectId('5abb6fab81c4')] 1
[ObjectId('5abb6fab81c2'),ObjectId('5abb6fab81c1')] 2
I like to flatten each list in A, and assign its corresponding id to each element in the list like,
A id
ObjectId('5abb6fab81c0') 0
ObjectId('5abb6fab81c3') 1
ObjectId('5abb6fab81c4') 1
ObjectId('5abb6fab81c2') 2
ObjectId('5abb6fab81c1') 2
I think the comment is coming from this question ? you can using my original post or this one
df.set_index('id').A.apply(pd.Series).stack().reset_index().drop('level_1',1)
Out[497]:
id 0
0 0 1.0
1 1 2.0
2 1 3.0
3 1 4.0
4 2 5.0
5 2 6.0
Or
pd.DataFrame({'id':df.id.repeat(df.A.str.len()),'A':df.A.sum()})
Out[498]:
A id
0 1 0
1 2 1
1 3 1
1 4 1
2 5 2
2 6 2
This probably isn't the most elegant solution, but it works. The idea here is to loop through df (which is why this is likely an inefficient solution), and then loop through each list in column A, appending each item and the id to new lists. Those two new lists are then turned into a new DataFrame.
a_list = []
id_list = []
for index, a, i in df.itertuples():
for item in a:
a_list.append(item)
id_list.append(i)
df1 = pd.DataFrame(list(zip(alist, idlist)), columns=['A', 'id'])
As I said, inelegant, but it gets the job done. There's probably at least one better way to optimize this, but hopefully it gets you moving forward.
EDIT (April 2, 2018)
I had the thought to run a timing comparison between mine and Wen's code, simply out of curiosity. The two variables are the length of column A, and the length of the list entries in column A. I ran a bunch of test cases, iterating by orders of magnitude each time. For example, I started with A length = 10 and ran through to 1,000,000, at each step iterating through randomized A entry list lengths of 1-10, 1-100 ... 1-1,000,000. I found the following:
Overall, my code is noticeably faster (especially at increasing A lengths) as long as the list lengths are less than ~1,000. As soon as the randomized list length hits the ~1,000 barrier, Wen's code takes over in speed. This was a huge surprise to me! I fully expected my code to lose every time.
Length of column A generally doesn't matter - it simply increases the overall execution time linearly. The only case in which it changed the results was for A length = 10. In that case, no matter the list length, my code ran faster (also strange to me).
Conclusion: If the list entries in A are on the order of a few hundred elements (or less) long, my code is the way to go. But if you're working with huge data sets, use Wen's! Also worth noting that as you hit the 1,000,000 barrier, both methods slow down drastically. I'm using a fairly powerful computer, and each were taking minutes by the end (it actually crashed on the A length = 1,000,000 and list length = 1,000,000 case).
Flattening and unflattening can be done using this function
def flatten(df, col):
col_flat = pd.DataFrame([[i, x] for i, y in df[col].apply(list).iteritems() for x in y], columns=['I', col])
col_flat = col_flat.set_index('I')
df = df.drop(col, 1)
df = df.merge(col_flat, left_index=True, right_index=True)
return df
Unflattening:
def unflatten(flat_df, col):
flat_df.groupby(level=0).agg({**{c:'first' for c in flat_df.columns}, col: list})
After unflattening we get the same dataframe except column order:
(df.sort_index(axis=1) == unflatten(flatten(df)).sort_index(axis=1)).all().all()
>> True
To create unique index you can call reset_index after flattening

Resources