Pandas - Data Frame string value count - python-3.x

I need to know how many times a string appears in my DataFrame. I used the following statement:
print(df['Mobile Register' == df.col1].shape[0])
My problem is that I need to find every row that contains Mobile Register, because in my DataFrame this string can be Mobile Register 1, Mobile Register 2, Mobile Register 3, Mobile Register n ...
I understand that my command only matches the exact string, so what do I need to do to get the count?

Use Series.str.contains():
df.loc[df['col_name'].str.contains('Mobile Register',na=False)]
This returns all the rows that satisfy the condition.
To count the non-null values of a specific column among the rows that meet this condition, use:
df.loc[df['col_name'].str.contains('Mobile Register',na=False),'column_name'].count()
If you need the non-null count of every column among the rows that satisfy this condition:
df.loc[df['col_name'].str.contains('Mobile Register',na=False)].count()
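If only the number of matching rows is needed, the boolean mask can be summed directly. The following is a minimal sketch on made-up data; the column name col1 is taken from the question, and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample data; 'col1' is the column name used in the question.
df = pd.DataFrame({
    "col1": [
        "Mobile Register 1",
        "Desk Register",
        "Mobile Register 2",
        None,
        "Mobile Register 10",
    ],
})

# Boolean mask: True where the text contains the substring.
# na=False treats missing values as non-matches; regex=False does a plain substring search.
mask = df["col1"].str.contains("Mobile Register", na=False, regex=False)

# Summing a boolean Series counts the True values.
n_matches = int(mask.sum())
print(n_matches)  # 3
```

Unlike .count(), which skips null values in the selected column, mask.sum() counts matching rows regardless of what the other columns contain.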

I don't know whether this is helpful or the best solution, but I got what I wanted with the following code:
names = df.col1.tolist()
count = 0
for value in names:
    print(value)
    # str() guards against non-string values such as NaN
    if 'Mobile Register' in str(value):
        count = count + 1

Related

Excel: Calculations with two variables, that both share the same ID

I have been working on a little project in which I analyze some data from a game that I play. My dataset looks like this:
As you can see, it consists of:
Match ID
Map the match was played on
Team name
First pick, second pick and third pick (characters/players)
Points the teams won from this match
What side they played on (A or B)
Who won
Whether they're in the top 64 teams
Currently I am trying to analyze how certain picks perform against other picks. For example, I would like to see how the Xelor first pick (cell D2) performs against all other first picks. To do this, I need to count the number of times the Xelor first pick played against each other first pick, and how many times the Xelor pick won. I don't have any problems doing that, but the catch is that I need to make sure I only compare the Xelor first picks with other first picks from the same match (same match ID). For example, I would compare the Xelor first pick (D2) vs the Steamer first pick (D3), as they share the same match ID.
I came up with a messy solution earlier with simple formulas, but it made for a table that had no data every other row, which resulted in some problems analyzing the data. I am now struggling with the Index and Match functions to make a pretty table for my needs, but I am having a hard time.
If anyone could give me a hand on how to do this, or has any clever ideas on how to analyze all picks vs other picks, let me know!
So, it turns out that the UNIQUE and XLOOKUP functions made this an easy problem to solve.
First, I made a new column showing just the unique match ID values:
=UNIQUE(A:A)
Then, next to that column, I looked up the first pick of the A side team using XLOOKUP:
=XLOOKUP(M2;A:A;C:C;;0;1)
I then did the same in another column for the team on the other side using an inverse search direction:
=XLOOKUP(M2;A:A;C:C;;0;-1)
Lastly, to see which of the two first picks won, I used this formula in a fourth column:
=IF(XLOOKUP(M2;A:A;H:H;;0;1)="Win";N2;O2)
This resulted in the following table (M:P):
Thanks for the help, David!
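For readers working outside Excel, the same pairing by match ID can be sketched in pandas. This is only an illustration under assumptions: the column names (match_id, side, pick1, result) and the sample rows are hypothetical, and each match is assumed to have exactly one A-side row and one B-side row.

```python
import pandas as pd

# Hypothetical data: two rows per match, one for each side.
games = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "side": ["A", "B", "A", "B"],
    "pick1": ["Xelor", "Steamer", "Iop", "Xelor"],
    "result": ["Win", "Loss", "Loss", "Win"],
})

# Split by side and index by match ID so the two halves align row by row.
side_a = games[games["side"] == "A"].set_index("match_id")
side_b = games[games["side"] == "B"].set_index("match_id")

# One row per match: first pick of side A, first pick of side B.
pairs = pd.DataFrame({
    "pick_a": side_a["pick1"],
    "pick_b": side_b["pick1"],
})

# Keep pick_a where side A won, otherwise take pick_b.
pairs["winner"] = pairs["pick_a"].where(side_a["result"] == "Win", pairs["pick_b"])
print(pairs)
```

The resulting table has one row per match, which avoids the empty every-other-row layout described in the question.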
You could try something like this in M2 cell:
=IF(L2="","",COUNTIFS(TB_GAMES[W/L/D],"Win",
TB_GAMES[Pick 1],L2,TB_GAMES[Match],$K$2))
Then you can expand the formula down.
Column L holds the unique Pick 1 values for the match given in K2:
=UNIQUE(FILTER(TB_GAMES[Pick 1], TB_GAMES[Match]=K2))
Update
In case you want to calculate the scores for all the Pick 1 players at once, you can try the following:
=LET(winSet, FILTER(TB_GAMES[Pick 1], TB_GAMES[W/L/D]="Win"),
matches,XMATCH(winSet, UNIQUE(winSet)),
freq,FREQUENCY(matches, UNIQUE(matches)), SORT(HSTACK(UNIQUE(winSet),
FILTER(freq, freq<>0)),2,-1)
)
Note: Because the input comes from a FILTER function (an array, not a range), we cannot pass it as the range argument of COUNTIF or COUNTIFS, so we use XMATCH/FREQUENCY to achieve the same result. For more information, see my answer to the question: How to count the number of trades made on a Excel spreadsheet using a custom conditional formula? The same idea and explanation apply here.
The HSTACK function is used only to combine the winners with the number of wins for each player. Finally, the result is sorted by score in descending order.
This would be the result on O2 cell:

How do I get the top N results based on wildcard criteria via a formula in Excel 2019?

Since I've exhausted about every resource I could find in regards to this question, I figured it was finally time to ask this community.
I have a very large (15k+ row) dataset from which I want to generate a report of the top 25 largest values in one of the columns. However, there are additional criteria to consider beyond the values in that one column. I have already done this with fewer criteria, but adding more is giving me trouble.
My (working) formula for Top N with some criteria:
{=LARGE(IF('IMPORTED DATA'!$X$4:$X$1048576 = IF('Data Cleanup'!$AX$3 = 1, "Gaming Designed", "Not Gaming Designed"), 'IMPORTED DATA'!$BH$4:$BH$1048576), ROW(A2) - ROW(A$1))}
The issue comes when I add another criterion that uses wildcard characters to identify the 'correct' rows. Here is what I've come up with so far, but the COUNTIF portion always evaluates to true, so the added criterion is not actually applied:
{=LARGE(IF(COUNTIF('IMPORTED DATA'!$P$6:$P$1048576, IF('Data Cleanup'!$AX$3 = 1, "?????", "????")) * ('IMPORTED DATA'!$X$6:$X$1048576 = IF('Data Cleanup'!$AX$3 = 1, "Gaming Designed", "Not Gaming Designed")) * ('IMPORTED DATA'!$E$6:$E$1048576 <> "All Other (Suppressed)"), 'IMPORTED DATA'!$BH$6:$BH$1048576), ROW(A2) - ROW(A$1))}
I tried to work around IF statements not accepting wildcard characters by using COUNTIF, but to no avail.
I understand that this is a bit of a rough question, but I'll do my best to respond to as many questions as I can to help clarify.
A couple more bits of information that may be helpful:
This is entirely based in Excel 2019. I know that FILTER would be an easy solution, but I don't have access to it in this version of Excel.
The reason for using wildcards is that it was the easiest way to distinguish between the two categories to sort: above or below 100Hz. Anything under 100Hz will be 4 characters, while anything above will be 5.
I also need other data from the same row as the results, so any methods must be also applicable to MATCH criteria so that I can look up the rest of the data with the same search parameters.
It's very hard to understand without seeing the data.
As I understand it, adding a helper column to the dataset that encodes the criteria you want would solve your problem.
At least that is how I do it.
You need to create a ranking column in the data sheet.
Ranking formula: =COUNTIFS($M$3:$M$233,">="&M3,$K$3:$K$233,K3)
With this formula you can add as many criteria as you want.
Index formula: =INDEX($K$3:$K$233,MATCH(1,($K$3:$K$233=$B$1)*($N$3:$N$233=A3),0))
Change the column references to match your data.
There is no need for ROW() functions; a simple number sequence will work.
Good luck
Ended up solving this in a very simple method thanks to Scott Craner's comment.
Since wildcards don't work in IF statements, using LEN did the trick. Note the parentheses around the LEN comparison: Excel evaluates * before =. Final formula ended up being:
{=LARGE(IF((LEN('IMPORTED DATA'!$P$6:$P$30000) = IF('Data Cleanup'!$AX$3 = 1, 5, 4)) * ('IMPORTED DATA'!$X$6:$X$30000 = IF('Data Cleanup'!$AX$3 = 1, "Gaming Designed", "Not Gaming Designed")) * ('IMPORTED DATA'!$E$6:$E$30000 <> "All Other (Suppressed)"), 'IMPORTED DATA'!$BH$6:$BH$30000), ROW(A2) - ROW(A$1))}
Thank you to everyone for your help!
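For comparison, the same "filter on several criteria, then take the top N" logic can be sketched in pandas. Everything here is an assumption made for illustration: the column names (refresh, category, brand, units), the sample rows, and the gaming switch that stands in for the 'Data Cleanup'!$AX$3 cell.

```python
import pandas as pd

# Hypothetical data standing in for the imported sheet.
df = pd.DataFrame({
    "refresh": ["60Hz", "144Hz", "75Hz", "240Hz", "165Hz"],
    "category": [
        "Not Gaming Designed",
        "Gaming Designed",
        "Gaming Designed",
        "Gaming Designed",
        "Gaming Designed",
    ],
    "brand": ["A", "B", "C", "All Other (Suppressed)", "D"],
    "units": [500, 300, 900, 700, 400],
})

gaming = True  # plays the role of the 1/0 switch cell in the workbook

# Each criterion is its own boolean Series; & combines them row by row.
mask = (
    (df["refresh"].str.len() == (5 if gaming else 4))
    & (df["category"] == ("Gaming Designed" if gaming else "Not Gaming Designed"))
    & (df["brand"] != "All Other (Suppressed)")
)

# nlargest keeps whole rows, so the other columns of each result stay available.
top = df[mask].nlargest(2, "units")
print(top)
```

Because every condition is parenthesized before being combined, the precedence issue that affects = and * in the Excel formula does not arise here.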

How can I separate consecutive strings without any delimiters?

My input data is a VCF (Variant Call Format) file. Each line that I am interested in looks like this:
chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111
I want to count the presence (1) of a specific deletion at a specific position (22232) supported by n samples. For this reason I looked at the SUPP_VEC= values; however, I don't know how to split them, because 1) the value is a string and 2) it has no delimiters. How could I add a space between every character? Or how could I split/count the values from SUPP_VEC= in Python 3?
I was also curious to know what SUPP means. I found one record with SUPP=2 and checked in Excel whether the presence (1)/absence (0) flags in SUPP_VEC add up to the value of SUPP; however, I could only count 1 instead of 2. Does anybody know what SUPP means?
The reason for my procedure is to have a frequency table for a specific deletion type.
I hope I made myself clear.
Thank you in advance.
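One possible approach, sketched below: a Python string is already a sequence of characters, so it can be iterated without any delimiter. The record used here is a shortened, hypothetical line, and parse_info is an illustrative helper name, not part of any VCF library.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO field such as 'SUPP=3;SUPP_VEC=1011' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Entries without '=' are flags; store them as True.
        fields[key] = value if sep else True
    return fields


# Hypothetical, shortened record (tab-separated, INFO is the 8th column).
line = "chrI\t22232\tDEL00BED\tN\t<DEL>\t.\tPASS\tSUPP=3;SUPP_VEC=1011"
columns = line.split("\t")
info = parse_info(columns[7])

# Iterate over the characters directly: no delimiter is required.
flags = [int(ch) for ch in info["SUPP_VEC"]]
n_present = sum(flags)

# If a space-separated version is wanted, join the characters.
spaced = " ".join(info["SUPP_VEC"])

print(flags)      # [1, 0, 1, 1]
print(n_present)  # 3
print(spaced)     # 1 0 1 1
```

Collecting n_present per record (or the flags list per sample) gives the raw material for a frequency table.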

Pandas: check a sequence or pattern

I need to check whether there is a particular pattern in the columns; it's easier to see with some data.
As you can see, there are two punch-ins next to each other.
I need a way to detect this pattern, because normally you need to clock out before you clock in again.
This is a mistake from my system, and I need a way to detect it in pandas.
I was thinking of using .apply(function, axis=1).
Thank you in advance.
Best,
Using pandas.DataFrame.shift(), this code compares each row with the previous row and creates a column 'flag' that is True when they are exactly the same:
comparison = df == df.shift()
df['flag'] = comparison['Date'] & comparison['Name'] & comparison['Activity']
With your data, the output is:
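To try this without the original data, here is a self-contained sketch of the same shift() comparison. The rows are made up, and the column names Date, Name and Activity are assumed from the answer's code.

```python
import pandas as pd

# Made-up punch log; the second row repeats a "punch in" without a "punch out".
df = pd.DataFrame({
    "Date": ["2021-01-04"] * 4,
    "Name": ["Ana", "Ana", "Ana", "Ana"],
    "Activity": ["punch in", "punch in", "punch out", "punch in"],
})

# df.shift() moves every row down by one, so each row is compared
# with the row directly above it. The first row has nothing above it.
same_as_previous = (df == df.shift()).all(axis=1)

df["flag"] = same_as_previous
print(df)
```

Only the second row is flagged, since it is the only one identical to the row above it. With several employees in one table, the data would need to be sorted or grouped by name first so that consecutive rows belong to the same person.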

How to find a string within a string

I have a list of about 100,000 site link strings.
Each link is unique, but each contains a consistent ?Item= marker.
After the item number, the link either ends or continues after an & symbol.
My question is: how do I pull out the item numbers?
I know the REPLACE function offers similar functionality, but it works with fixed sizes, and in my case the strings can differ in length.
Link example:
www.site.com?sadfsf?sdfsdf&adfasfd?Item=JGFGGG55555
or
www.site.com?sadfsf?sdfsdf&adfasfd?Item=JGFGGG55555&sdafsdfsdfsdf
In both cases I need to get JGFGGG55555 only
If this is always the last portion of the string, you can use the following:
=MID(A1, FIND("?Item=", A1) + 6, 99)
This assumes:
no item number is longer than 99 characters.
no additional fields follow the item number.
Edit:
With the update to your question, it is apparent you have some strings with additional data after the ?Item= field. Without using VBA there is not a simple means of using MID and FIND to extract this.
However you could create a column which acts as a placeholder.
For example, create a column using:
=MID(A1,FIND("?Item=",A1)+6,99)
This gets you the following value: JGFGGG55555&sdafsdfsdfsdf
Next, create a column using:
=IF(ISERROR(FIND("&",B1)),B1,LEFT(B1,FIND("&",B1)-1))
This produces: JGFGGG55555 by searching the first value for a & and using the portion before it. If it is not found, the first value is simply repeated.
This formula should work for both the examples given:
=MID(A1,FIND("=",A1)+1,IFERROR(FIND("&",A1,FIND("=",A1))-FIND("=",A1)-1,LEN(A1)))
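Outside Excel, the same extraction can be sketched in Python with a regular expression. The two links are the examples from the question; extract_item and ITEM_PATTERN are illustrative names chosen here.

```python
import re

# Capture everything after '?Item=' up to the next '&' or the end of the string.
ITEM_PATTERN = re.compile(r"\?Item=([^&]*)")


def extract_item(link: str):
    """Return the item number from a link, or None if there is no ?Item= part."""
    match = ITEM_PATTERN.search(link)
    return match.group(1) if match else None


links = [
    "www.site.com?sadfsf?sdfsdf&adfasfd?Item=JGFGGG55555",
    "www.site.com?sadfsf?sdfsdf&adfasfd?Item=JGFGGG55555&sdafsdfsdfsdf",
]

for link in links:
    print(extract_item(link))  # JGFGGG55555 in both cases
```

The character class [^&]* stops at the first & after the marker, so the length of the item number does not need to be known in advance.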

Resources