Counting in how many rows a value exists - python-3.x

I would like to count the frequency of words in a data frame. Here is an example of what I'm trying to achieve.
words = ['Dungeon',
'Crawling',
'Puzzle',
'RPG',]
desc =
0 [Dungeon, count, game, kid, draw, toddler, Unique]
1 [Beautiful, simple, music, application, toddle]
2 [Fun, intuitive, number, game, baby, toddler]
Note that desc is a pandas data frame with 1690 rows.
Now I would like to check whether words[i] is in desc.
I do not want a nested for loop, so I wrote a function that just checks whether the word is in a row of desc. I then want to apply() it to each row and sum the result.
The function I got is:
def tmp(word, desc):
    return (word in desc)
However, when I call desc.apply(tmp, args = words[0]) I get the error: tmp() takes 2 positional arguments but 8 were given. When I call it manually with tmp(words[0], desc[0]), it works just fine.

To avoid loops, build a DataFrame from the lists with the DataFrame constructor, test membership with DataFrame.isin, and count the True values per row with sum:
s = pd.DataFrame(desc.tolist()).isin(words).sum(axis=1)
print(s)
0 1
1 0
2 0
dtype: int64
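As for the error in the question: args expects a tuple, so passing the bare string words[0] unpacks 'Dungeon' into seven one-character arguments, which together with the row makes eight. Below is a minimal sketch of the apply-based approach, assuming desc is a Series of lists. The helper name contains and the three sample rows are illustrative only.

```python
import pandas as pd

words = ["Dungeon", "Crawling", "Puzzle", "RPG"]

# Illustrative stand-in for the real 1690-row data (assumed to be a Series of lists).
desc = pd.Series([
    ["Dungeon", "count", "game", "kid", "draw", "toddler", "Unique"],
    ["Beautiful", "simple", "music", "application", "toddle"],
    ["Fun", "intuitive", "number", "game", "baby", "toddler"],
])

def contains(row, word):
    # apply passes the row first; extra arguments come from args.
    return word in row

# args must be a tuple, hence the trailing comma.
rows_with_first_word = int(desc.apply(contains, args=(words[0],)).sum())

# Number of rows in which each word appears.
rows_per_word = {w: int(desc.apply(contains, args=(w,)).sum()) for w in words}
print(rows_per_word)
```

Note that the parameter order is swapped relative to tmp in the question, because apply always supplies the row as the first positional argument.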

Related

How to find the max values of groupby pandas with the column name

Actually, I want to find the maximum value per teacher_prefix together with the prefix itself (e.g. in this case mrs, and the value, which is 639471).
Input can be found
Try using pd.Series.idxmax(). It returns the index label at which the maximum value occurs.
#grouped_df[[grouped_df.idxmax()]] #If series
grouped_df.loc[grouped_df.idxmax()] #If dataFrame
teacher_prefix
mrs 639471
Name: sum, dtype: int64
If you want one or more rows based on the top n largest values:
grouped_df.nlargest(2, 'sum')
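A small self-contained sketch of both calls on a grouped Series. The column name approved and all values are hypothetical, since the original input is not shown.

```python
import pandas as pd

# Hypothetical data; the real input from the question is not available.
df = pd.DataFrame({
    "teacher_prefix": ["mrs", "ms", "mrs", "mr", "ms", "mrs"],
    "approved": [1, 1, 1, 0, 1, 1],
})

# Sum per prefix gives a Series indexed by teacher_prefix.
grouped = df.groupby("teacher_prefix")["approved"].sum()

top_prefix = grouped.idxmax()   # index label of the largest sum
top_value = grouped.max()       # the largest sum itself
top_two = grouped.nlargest(2)   # the two largest sums, labels kept

print(top_prefix, top_value)
print(top_two)
```

On a Series, nlargest takes only n; the column argument used above in grouped_df.nlargest(2, 'sum') applies when grouped_df is a DataFrame.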

How can I replace a particular column in a data frame based on a condition (categorical variables)?

I need to replace the salary status in a df with 1 if the salary is greater than 50,000 and with 0 if it is less than or equal to 50,000.
The DataFrame shape is 30162 x 13.
I have tried this:
data2['SalStat']=data2['SalStat'].map({"less than or equal to 50,000":0,"greater than 50,000":1})
I also tried data2['SalStat']
and loc without any success.
How can I do the same?
I think your solution is nice.
If you want to match only by a substring, e.g. by greater, use Series.str.contains to build a boolean mask and convert it to 0/1:
data2['SalStat']=data2['SalStat'].str.contains('greater').astype(int)
Or (note that Series.view is deprecated as of pandas 2.2, so prefer astype on recent versions):
data2['SalStat']=data2['SalStat'].str.contains('greater').view('i1')
Try this
def status(d): return 0 if d == 'less than or equal to 50,000' else 1
data2['SalStat'] = list(map(status ,data2['SalStat']))
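One possible reason the map attempt appears to fail: Series.map returns NaN for every value that does not exactly equal a dictionary key. The sketch below assumes, purely for illustration, that the raw strings carry leading whitespace. The sample rows are made up.

```python
import pandas as pd

# Made-up rows; the leading spaces are an assumption for illustration.
data2 = pd.DataFrame({
    "SalStat": [
        " less than or equal to 50,000",
        " greater than 50,000",
        " greater than 50,000",
    ]
})

mapping = {"less than or equal to 50,000": 0, "greater than 50,000": 1}

# Without cleaning, no key matches and every result is NaN.
all_unmatched = bool(data2["SalStat"].map(mapping).isna().all())

# Stripping whitespace first makes the keys match exactly.
encoded = data2["SalStat"].str.strip().map(mapping)
print(encoded.tolist())
```

Checking data2['SalStat'].unique() before mapping shows the exact strings that the keys have to match.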

Pandas .apply() function not always being called in python 3

Hello, I want to increment a global variable 'count' through a function that is called on a pandas dataframe of length 1458.
I have read other answers that talk about .apply() not being in place.
I therefore followed their advice, but the count variable still ends up as 4.
count = 0
def cc(x):
    global count
    count += 1
    print(count)
# Expected final value of count is 1458, but instead it is 4.
# I think it is 4 because 'PoolQC' is a categorical column with 4 possible values.
all_data['tempo'] = all_data['PoolQC'].apply(cc)
# prints 4 instead of 1458
print("Count final value is ", count)
Yes, the effect you observe comes from the column having a categorical dtype: pandas calls the function once per category rather than once per row. Is counting the only thing you are doing there? I guess not, but why do you need such a calculation? Can't you use df.shape?
I see a couple of options here:
1. Change the type of the column, e.g.
all_data['tempo'] = all_data['PoolQC'].astype(str).apply(cc)
2. Use a different, non-categorical column.
3. Use df.shape to see how many rows you have in the df.
4. Use apply on the whole DataFrame, like all_data['tempo'] = all_data.apply(cc, axis=1).
In that case you can still use whatever is in all_data['PoolQC'] within the cc function, like:
def cc(x):
    global count
    count += 1
    print(count)
    return x['PoolQC']
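A minimal sketch of the first option, assuming a small made-up all_data frame. It records calls in a list instead of a global counter, which is a substitution for illustration rather than the approach from the question.

```python
import pandas as pd

# Made-up stand-in for the real 1458-row frame.
all_data = pd.DataFrame({
    "PoolQC": pd.Categorical(["Ex", "Gd", "Ex", "Fa", "Gd", "Ex"])
})

calls = []

def cc(value):
    # Record every invocation so the number of calls can be inspected afterwards.
    calls.append(value)
    return value

# Converting to str first drops the categorical dtype, so cc runs once per row.
all_data["tempo"] = all_data["PoolQC"].astype(str).apply(cc)

n_calls = len(calls)
n_rows = len(all_data)
print(n_calls, n_rows)
```

If only the row count is needed, len(all_data) or all_data.shape[0] gives it directly without calling apply at all.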

How can you randomly return an element from a list?

In this question we are asked to randomly return an element from a list, where "rand()" is uniformly distributed from 0 to 1 and "list" is a list of elements.
import random
def r(lst):
    return lst[int(random.uniform(0, 1) * len(lst))]
However, random.choice() is easier to use:
https://docs.python.org/3/library/random.html
You should mention in the question if there are multiple answers to choose from.
return list[int(len(list)*rand())]
This is the correct answer. Multiplying the number of elements len(list) by a random number between 0 and 1 gives you a random number between 0 and len(list). You use int() to convert the value to an integer, effectively rounding it down, and then select the item at that position.
return list[(len(list)/rand())]
This doesn't work. len(list) will usually be an integer > 1, and dividing it by a number between 0 and 1 always gives an even bigger number, so you always try to get an item that is after the last one in the list. Also, the index will be a float, but the index must be an integer.
return list[int(rand()) # I assume you wanted a closing square bracket here
This will always select the first element. It's a random number between 0 and 1 rounded down => 0.
return list[len(list)} # same thing here
This will always try to select the element after the last one, which results in an error. Also, this can't even be random without the rand() function.
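The correct option can be sketched in Python as follows. The function name pick and the injectable rand parameter are illustrative additions so that the boundary cases can be checked deterministically.

```python
import random

def pick(items, rand=random.random):
    # rand() is assumed to return a float in [0, 1), so the index
    # int(len(items) * rand()) always falls in 0 .. len(items) - 1.
    return items[int(len(items) * rand())]

items = ["a", "b", "c", "d"]

picked = pick(items)                          # uses random.random
first = pick(items, rand=lambda: 0.0)         # lowest possible draw
last = pick(items, rand=lambda: 0.999999)     # draw just below 1
builtin = random.choice(items)                # standard-library equivalent

print(picked, first, last, builtin)
```

random.choice does the same job and also raises a clear IndexError on an empty sequence.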

Count number of occurrences of a string and relabel

I have a n x 1 cell that contains something like this:
chair
chair
chair
chair
table
table
table
table
bike
bike
bike
bike
pen
pen
pen
pen
chair
chair
chair
chair
table
table
etc.
I would like to rename these elements so they will reflect the number of occurrences up to that point. The output should look like this:
chair_1
chair_2
chair_3
chair_4
table_1
table_2
table_3
table_4
bike_1
bike_2
bike_3
bike_4
pen_1
pen_2
pen_3
pen_4
chair_5
chair_6
chair_7
chair_8
table_5
table_6
etc.
Please note that the underscore (_) is necessary. Could anyone help? Thank you.
Interesting problem! This is the procedure that I would try:
1. Use unique, and its third output parameter in particular, to assign each string in your cell array a unique ID.
2. Initialize an empty array, then create a for loop that goes through each unique string (given by the first output of unique) and creates a numerical sequence from 1 up to as many times as we have encountered this string. Place this numerical sequence in the corresponding positions where we have found each string.
3. Use strcat to attach each element in the array created in Step #2 to each cell array element in your problem.
Step #1
Assuming that your cell array is defined as a bunch of strings stored in A, we would call unique this way:
[names, ~, ids] = unique(A, 'stable');
The 'stable' is important as the IDs that get assigned to each unique string are done without re-ordering the elements in alphabetical order, which is important to get the job done. names will store the unique names found in your array A while ids would contain unique IDs for each string that is encountered. For your example, this is what names and ids would be:
names =
'chair'
'table'
'bike'
'pen'
ids =
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
1
1
1
1
2
2
names is actually not needed in this algorithm. However, I have shown it here so you can see how unique works. Also, ids is very useful because it assigns a unique ID for each string that is encountered. As such, chair gets assigned the ID 1, followed by table getting assigned the ID of 2, etc. These IDs will be important because we will use these IDs to find the exact locations of where each unique string is located so that we can assign those linear numerical ranges that you desire. These locations will get stored in an array computed in the next step.
Step #2
Let's pre-allocate this array for efficiency. Let's call it loc. Then, your code would look something like this:
loc = zeros(numel(A), 1);
for idx = 1 : numel(names)
    id = find(ids == idx);
    loc(id) = 1 : numel(id);
end
As such, for each unique name we find, we look for every location in the ids array that matches this particular name found. find will help us find those locations in ids that match a particular name. Once we find these locations, we simply assign an increasing linear sequence from 1 up to as many names as we have found to these locations in loc. The output of loc in your example would be:
loc =
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
5
6
7
8
5
6
Notice that this corresponds with the numerical sequence (the right most part of each string) of your desired output.
Step #3
Now all we have to do is piece loc together with each string in our cell array. We would thus do it like so:
out = strcat(A, '_', num2str(loc));
What this does is that it takes each element in A, concatenates a _ character and then attaches the corresponding numbers to the end of each element in A. Because we want to output strings, you need to convert the numbers stored in loc into strings. To do this, you must use num2str to convert each number in loc into their corresponding string equivalents. Once you find these, you would concatenate each number in loc with each element in A (with the _ character of course). The output is stored in out, and we thus get:
out =
'chair_1'
'chair_2'
'chair_3'
'chair_4'
'table_1'
'table_2'
'table_3'
'table_4'
'bike_1'
'bike_2'
'bike_3'
'bike_4'
'pen_1'
'pen_2'
'pen_3'
'pen_4'
'chair_5'
'chair_6'
'chair_7'
'chair_8'
'table_5'
'table_6'
For your copying and pasting pleasure, this is the full code. Be advised that I've nulled out the first output of unique as we don't need it for your desired output, so the loop runs up to max(ids), the number of unique strings, instead of numel(names):
[~, ~, ids] = unique(A, 'stable');
loc = zeros(numel(A), 1);
for idx = 1 : max(ids)
    id = find(ids == idx);
    loc(id) = 1 : numel(id);
end
out = strcat(A, '_', num2str(loc));
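For readers more at home in Python, the same three steps can be sketched roughly as below. This is a plain-Python analogue, not a translation of the MATLAB built-ins, and the function name label_occurrences is made up for illustration.

```python
def label_occurrences(items):
    # Step 1: assign each distinct string an id in order of first appearance
    # (the analogue of the third output of unique with 'stable').
    first_seen = {}
    ids = [first_seen.setdefault(s, len(first_seen)) for s in items]

    # Step 2: for each id, write 1..k into the positions where it occurs.
    loc = [0] * len(items)
    for uid in range(len(first_seen)):
        positions = [i for i, v in enumerate(ids) if v == uid]
        for n, pos in enumerate(positions, start=1):
            loc[pos] = n

    # Step 3: join each string with its running number.
    return [f"{s}_{n}" for s, n in zip(items, loc)]

A = ["chair", "chair", "table", "table", "bike", "chair"]
out = label_occurrences(A)
print(out)
```

Like the MATLAB loop, this scans the id list once per distinct string, which is fine for small inputs.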
If you want an alternative to unique, you can work with a hash table, which in MATLAB means using the containers.Map object. You can then store the number of occurrences of each individual label and create the new labels on the go, as in the code below.
data={'table','table','chair','bike','bike','bike'};
map=containers.Map(data,zeros(numel(data),1)); % labels=keys, counts=values (zeroed)
new_data=data; % initialize cell array that will hold the outputs
for ii=1:numel(data)
    map(data{ii}) = map(data{ii})+1; % increment count of the current label
    new_data{ii} = sprintf('%s_%d',data{ii},map(data{ii})); % format output
end
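The hash-table idea carries over directly to a Python dictionary. A minimal sketch follows; the function name relabel is illustrative.

```python
from collections import defaultdict

def relabel(data):
    counts = defaultdict(int)   # label -> occurrences seen so far
    out = []
    for label in data:
        counts[label] += 1                       # increment count of current label
        out.append(f"{label}_{counts[label]}")   # build the new label on the go
    return out

new_data = relabel(["table", "table", "chair", "bike", "bike", "bike"])
print(new_data)
```

This makes a single pass over the data, so the cost grows linearly with the number of elements.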
This is similar to rayryeng's answer but replaces the for loop by bsxfun. After the strings have been reduced to unique labels (line 1 of code below), bsxfun is applied to create a matrix of pairwise comparisons between all (possibly repeated) labels. Keeping only the lower "half" of that matrix and summing along rows gives how many times each label has previously appeared (line 2). Finally, this is appended to each original string (line 3).
Let your cell array of strings be denoted as c.
[~, ~, labels] = unique(c); %// transform each string into a unique label
s = sum(tril(bsxfun(@eq, labels, labels.')), 2); %'// accumulated occurrence number
result = strcat(c, '_', num2str(s)); %// build result
Alternatively, the second line could be replaced by the more memory-efficient
n = numel(labels);
M = cumsum(full(sparse(1:n, labels, 1)));
s = M((1:n).' + (labels-1)*n);
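The pairwise-comparison idea can be sketched with NumPy broadcasting, which plays the role of bsxfun here. This is an analogue under the assumption that the labels fit in a 1-D array, and the sample array c is made up.

```python
import numpy as np

c = np.array(["chair", "chair", "table", "chair", "table"])

# Transform each string into an integer label.
_, labels = np.unique(c, return_inverse=True)

# n x n matrix of pairwise equality between labels.
same = labels[:, None] == labels[None, :]

# Keep the lower triangle (including the diagonal) and sum along rows:
# entry i counts how often label i has appeared up to and including position i.
s = np.tril(same).sum(axis=1)

result = [f"{name}_{k}" for name, k in zip(c, s)]
print(result)
```

As with the bsxfun version, the comparison matrix needs memory quadratic in the number of elements.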
I'll give you pseudocode; try it yourself, and post the code if it doesn't work.
Initialize a counter to 1
Iterate over the cell
    If this is not the first element, check whether the string is the same as the previous value
        Yes - increment counter
        No - reset counter to 1
    end
    sprintf the string value + counter into a new array
Hope this helps!
