How do we find how many different samples are possible? - statistics

We have 8 groups of cats:
+ Group A: 18 cats eat food type A.
+ Group B: 3 cats eat food type B.
+ Group C: 4 cats eat food type C.
+ Group D: 2 cats eat food type D.
+ Group E: 4 cats eat food type E.
+ Group F: 2 cats eat food type F.
+ Group G: 13 cats eat food type G.
+ Group H: 4 cats eat food type H.
A random sample of 8 cats is drawn.
a) How many different samples are possible?
b) How many different samples of size 8 are possible subject to the constraint that no 2 cats may have the same food type?
Could you help me with a) and b)?
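A minimal sketch of both counts in Python, under some assumptions that the question does not state explicitly: the 50 cats are distinguishable, the order within a sample does not matter, and for part b) a sample of 8 with no repeated food type must take exactly one cat from each of the 8 groups.

```python
from math import comb, prod

# Group sizes taken from the question (food types A..H).
group_sizes = {"A": 18, "B": 3, "C": 4, "D": 2, "E": 4, "F": 2, "G": 13, "H": 4}

total_cats = sum(group_sizes.values())   # 50 cats in total
sample_size = 8

# a) Unordered samples of 8 distinct cats out of 50: C(50, 8).
all_samples = comb(total_cats, sample_size)

# b) One cat from each of the 8 groups: multiply the group sizes.
one_per_food = prod(group_sizes.values())

print(all_samples)    # 536878650
print(one_per_food)   # 179712
```

If the cats within a group were treated as interchangeable, the counts would be different, so the interpretation should be checked against the course material.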


Fuzzy String Matching using Python

I have a training dataset, for example:
Letter  Word
A       Apple
B       Bat
C       Cat
D       Dog
E       Elephant
and I need to check a dataframe such as:
AD   Apple Dog
AE   Applet Elephant
DC   Dog Cow
EB   Elephant Bag
AED  Apple Elephant Dog
D    Door
ABC  All Bat Cat
The instances AD, AE and EB are almost accurate (Apple and Applet are considered close to each other, and similarly Bat and Bag), but DC doesn't match.
Required output:
Letters  Words               Status
AD       Apple Dog           Accept
AE       Applet Elephant     Accept
DC       Dog Cow             Reject
EB       Elephant Bag        Accept
AED      Apple Elephant Dog  Accept
D        Door                Reject
ABC      All Bat Cat         Accept
ABC is accepted because 2 of its 3 words match.
Accepted words need to match at 70% or better (fuzzy match), although the threshold is subject to change.
How can I find these matches using Python?
You can use thefuzz to solve your problem:
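# Python env: pip install thefuzz
# Conda env: conda install thefuzz
import numpy as np
from thefuzz import fuzz

THRESHOLD = 70

# df1 is the training table (columns Letter, Word); df2 is the table to check (columns Letters, Words).
# Split each 'Letters' string into single letters, look up the word for each letter in df1,
# then join the expected words back into one string per original row.
df2['Others'] = (df2['Letters'].apply(list).explode().reset_index()
                 .merge(df1, left_on='Letters', right_on='Letter')
                 .groupby('index')['Word'].agg(' '.join))
# Score the actual words against the expected words (0 to 100).
df2['Ratio'] = df2.apply(lambda x: fuzz.ratio(x['Words'], x['Others']), axis=1)
df2['Status'] = np.where(df2['Ratio'] > THRESHOLD, 'Accept', 'Reject')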
Output:
>>> df2
  Letters               Words              Others  Ratio  Status
0      AD           Apple Dog           Apple Dog    100  Accept
1      AE     Applet Elephant      Apple Elephant     97  Accept
2      DC             Dog Cow             Dog Cat     71  Accept
3      EB        Elephant Bag        Elephant Bat     92  Accept
4     AED  Apple Elephant Dog  Apple Dog Elephant     78  Accept
5       D                Door                 Dog     57  Reject
6     ABC         All Bat Cat       Apple Cat Bat     67  Reject
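Note that this whole-string comparison accepts DC and rejects ABC, which differs from the output requested in the question. A possible alternative is to score each word against the reference word for its letter and accept a row when enough words match. The sketch below is not the answer's method: it uses the standard library's difflib instead of thefuzz, pairs words with letters by position, and relies on two assumed thresholds (a per-word cutoff of 0.65, chosen because Bat and Bag score about 0.67, which is below the 70% mentioned in the question, and a rule that more than half of the words must match).

```python
from difflib import SequenceMatcher

# Reference table from the question (Letter -> Word).
REFERENCE = {"A": "Apple", "B": "Bat", "C": "Cat", "D": "Dog", "E": "Elephant"}

WORD_THRESHOLD = 0.65   # assumed per-word similarity cutoff (0..1)
ROW_THRESHOLD = 0.5     # assumed: more than half of the words must match


def word_matches(word, expected):
    """Return True if the two words are similar enough according to difflib."""
    ratio = SequenceMatcher(None, word.lower(), expected.lower()).ratio()
    return ratio >= WORD_THRESHOLD


def row_status(letters, words):
    """Compare the i-th word with the reference word of the i-th letter."""
    pairs = list(zip(letters, words.split()))
    if not pairs:
        return "Reject"
    hits = sum(
        1 for letter, word in pairs
        if letter in REFERENCE and word_matches(word, REFERENCE[letter])
    )
    return "Accept" if hits / len(pairs) > ROW_THRESHOLD else "Reject"


rows = [("AD", "Apple Dog"), ("AE", "Applet Elephant"), ("DC", "Dog Cow"),
        ("EB", "Elephant Bag"), ("AED", "Apple Elephant Dog"),
        ("D", "Door"), ("ABC", "All Bat Cat")]
for letters, words in rows:
    print(letters, words, row_status(letters, words))
```

With these assumed thresholds the sample rows get the statuses listed in the question; a stricter per-word cutoff of 0.70 would reject EB, so both values should be tuned on real data.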

Using awk to include file name with format in column

I'm working on wrangling some data to ingest into Hive. The problem is, I have overwrites in my historical data so I need to include the file name in the text files so that I can dispose of the duplicated rows which have been updated in subsequent files.
The way I've chosen to go about this is to use awk to add the file name to each file, then after I ingest into Hive I can use HQL to filter out my deprecated rows.
Here is my sample data (tab-delimited):
animal legs eyes
hippo 4 2
spider 8 8
crab 8 2
mite 6 0
bird 2 2
I've named it long_name_20180901.txt
I've figured out how to add my new column from this post:
awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt
which results in:
long_name_20180901.txt animal legs eyes
long_name_20180901.txt hippo 4 2
long_name_20180901.txt spider 8 8
long_name_20180901.txt crab 8 2
long_name_20180901.txt mite 6 0
long_name_20180901.txt bird 2 2
But, being a beginner, I don't know how to augment this command to:
1) make the column name (first line) something like "file_name"
2) use a regex in awk to extract just the part of the file name that I need and discard the rest. I really just want "long_name_(.{8,}).txt" (the stuff in the capturing group).
Target output is:
file animal legs eyes
20180901 spider 8 8
20180901 crab 8 2
20180901 mite 6 0
20180901 bird 2 2
Thanks for your time!! I'm a total newbie to awk.
This would handle one or multiple input files:
awk -v OFS='\t' '
NR==1 { print "file", $0 }
FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }
{ print fname, $0 }
' *.txt
Alternatively, you can set the prefix to "file" in a BEGIN block and, once the header line has been printed, switch it to the part of FILENAME that you want for the remaining lines.
awk 'BEGIN{f="file\t"} NF{print f $0; if (f=="file\t") {l=split(FILENAME, a, /[_.]/); f=a[l-1]"\t"};}' long_name_20180901.txt
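For comparison, the same transformation can be sketched in Python. This is only an illustration, not part of either awk answer: the helper name `tag_rows` is hypothetical, and the pattern assumes the wanted part of the file name is exactly 8 digits before ".txt", which is narrower than the ".{8,}" group in the question.

```python
import re

# Assumed file-name pattern: an underscore, 8 digits, then ".txt" at the end.
DATE_IN_NAME = re.compile(r"_(\d{8})\.txt$")


def tag_rows(filename, lines, header_label="file"):
    """Prefix the header with header_label and every data row with the date."""
    match = DATE_IN_NAME.search(filename)
    if match is None:
        raise ValueError(f"no 8-digit date found in {filename!r}")
    date = match.group(1)
    tagged = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip blank lines, like the NF test in awk
        # The first non-blank line is treated as the header.
        prefix = header_label if not tagged else date
        tagged.append(f"{prefix}\t{line}")
    return tagged


sample = ["animal\tlegs\teyes", "hippo\t4\t2", "spider\t8\t8"]
for row in tag_rows("long_name_20180901.txt", sample):
    print(row)
```

When several files are processed, the header would need to be emitted only for the first one, as the first awk answer does with NR==1.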

Bash code to structure proteomics data

I need help restructuring my dataset so that I can perform downstream analysis. I am currently working with proteomics data and want to perform a comparative analysis. The problem is the protein IDs: in general, one protein can have more than one ID, and the IDs are separated by ";". I need to print the entire line once for each protein ID. For example:
Input file :
tom dick harry jan
a;b;c 1 2 3 4
d;e 4 5 7 3
Desired output:
tom dick harry jan
a 1 2 3 4
b 1 2 3 4
c 1 2 3 4
d 4 5 7 3
e 4 5 7 3
many many thanks in advance
$ awk 'NR==1{$0="key "$0} {split($1,a,/;/); for (i=1; i in a; i++) { $1=a[i]; print } }' file | column -t
key tom dick harry jan
a 1 2 3 4
b 1 2 3 4
c 1 2 3 4
d 4 5 7 3
e 4 5 7 3
You can trivially remove the word "key" from the output if you don't like it but IMHO having some columns with and some without headers is a very bad idea - just makes any further processing more difficult.
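#!/bin/bash
# Reads the table from standard input.
read -r header
printf "%4s %s\n" "" "$header"
# Loop until read fails at end of input (a bare "while true" would never stop).
while read -r ids values
do
    # Replace the semicolons with spaces so the IDs split into separate words.
    for id in $(tr ';' ' ' <<< "$ids")
    do
        printf "%-4s %s\n" "$id" "$values"
    done
done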
This reads the header and prints it (formatted slightly differently), then reads each remaining line and prints one output line for each ID given at the start of that line. To find the IDs, the ID string is split on semicolons (;).
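The same splitting logic can also be sketched in Python. The function name `expand_ids` is hypothetical, and the sketch assumes whitespace-separated columns with the ID list in the first column, as in the sample input.

```python
def expand_ids(lines, sep=";"):
    """Yield the header unchanged, then one row per protein ID."""
    rows = iter(lines)
    header = next(rows, None)
    if header is None:
        return                            # empty input: nothing to yield
    yield header.rstrip("\n")
    for line in rows:
        parts = line.split(None, 1)       # first field = IDs, remainder = values
        if not parts:
            continue                      # skip blank lines
        ids = parts[0]
        values = parts[1].rstrip("\n") if len(parts) > 1 else ""
        for protein_id in ids.split(sep):
            if protein_id:
                yield f"{protein_id} {values}".rstrip()


sample = ["tom dick harry jan", "a;b;c 1 2 3 4", "d;e 4 5 7 3"]
for row in expand_ids(sample):
    print(row)
```

Because it is a generator, it can be fed an open file object directly and does not need to hold the whole table in memory.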

How to characterize a distribution of values?

I'll try to explain it with an example.
In a school there are n classes. Each class has k students, with k between 1 and 700; both n and k are known.
I need a way to characterize, for each class, the distribution of the students' names. For example, in class A there are 10 students: 3 are named "John", 3 "Mark" and 3 "Anne". In another class there are 100 students and everyone is named "Anton".
I need a measure that indicates the distribution of names in each class. For example (the exact scale is not important), it could be 1 if everyone in a class has the same name and 0 if no two students in the class share a name.
In other words, I need a way to sort classes by the distribution of names.
Sounds like you want a "contingency table". It's arbitrary which of your variables you want to have as rows vs. columns, but the table entries are either counts or proportions of how many occurrences fall in the intersection of the categories.
With the example you gave:
                    Class
                 A        B
            __________________
      Anne  |    3   |    0   |   3
Names Anton |    0   |  100   | 100
      John  |    3   |    0   |   3
      Mark  |    3   |    0   |   3
    Unknown |    1   |    0   |   1
            |--------|--------|----
                10      100   | 110
Values at the right and along the bottom are called the "marginal totals", or if proportions, "marginal distributions". The bottom right corner is the grand total of your data, obtained by summing the row or column margins. (They better come out the same!) For proportions, the sum must be 1.
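A minimal sketch of building such a table with the standard library. The `records` list is made-up data that reproduces the counts in the example, and the printed layout is only illustrative.

```python
from collections import Counter

# Hypothetical (class, name) records reproducing the example counts.
records = ([("A", "John")] * 3 + [("A", "Mark")] * 3 + [("A", "Anne")] * 3
           + [("A", "Unknown")] + [("B", "Anton")] * 100)

cell_counts = Counter(records)                          # (class, name) -> count
class_totals = Counter(cls for cls, _ in records)       # margins along the bottom
name_totals = Counter(name for _, name in records)      # margins at the right
grand_total = len(records)

classes = sorted(class_totals)
for name in sorted(name_totals):
    cells = " | ".join(f"{cell_counts[(cls, name)]:>4}" for cls in classes)
    print(f"{name:>8} | {cells} | {name_totals[name]:>4}")
bottom = " | ".join(f"{class_totals[cls]:>4}" for cls in classes)
print(f"{'Total':>8} | {bottom} | {grand_total:>4}")
```

Dividing every count by `grand_total` would turn the entries into proportions, in which case the margins sum to 1.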

Creating pivot tables and tables with relationships

I am doing my final year dissertation research and I just gathered all the results from my questionnaire, from which I got 210 responses.
I am really struggling with Excel here, so I hope somebody can help me with this.
I am running Office 2011 for Mac. I have a single file structured as follows:
               You are a... | You like... | Gender
participant 1  Carrot       | blu, red    | male
participant 2  Pear         | blu         | female
participant 3  Carrot       | red         | female
I was able to create a pivot table counting the "You are a..." answers, like so:
Carrot | 2
Pear   | 1
Total  | 3
I was also able to convert these totals to percentages by right-clicking the cell and selecting
Field settings > Options > Show data as > % of total.
(by the way, there is a way to add another column with percentages next to the original numbers)
Anyway, my issues are the following:
1) How can I see how many carrots and pears there are for each gender? Something like:
        Male | Female | Total
Carrot  1    | 1      | 2
Pear    0    | 1      | 1
Total   1    | 2      | 3
2) Also, how can I count how many participants liked blue, given that all the selected values are stored in a single field?
PS: all the data is already consistent since, apart from a few fields, the questions were multiple-choice selections.
Thank you very much to anyone who can help!
