How do I analyze goodness of fit between two contingency tables?


I have two contingency tables that I performed a chi-square test on. I would like to know if these two tables have similar distributions/frequency of the data using a goodness of fit test. I'm not sure how to do this and how to format the data. Thanks in advance for your help!
discharge = data.frame(decreasing.sign = c(0, 0, 9, 7, 1, 1),
                       decreasing.trend = c(2, 3, 35, 27, 8, 6),
                       increase.trend = c(8, 27, 34, 16, 4, 3),
                       increase.sign = c(0, 2, 7, 0, 0, 0),
                       row.names = c("Ridge and Valley", "Blue Ridge", "Piedmont", "Southeastern Plains", "Middle Atlantic Coastal Plain", "Southern Coastal Plain"))
groundwater = data.frame(decreasing.sign = c(0, 1, 6, 45, 6, 16),
                         decreasing.trend = c(0, 1, 3, 28, 5, 5),
                         increase.trend = c(1, 5, 6, 32, 9, 5),
                         increase.sign = c(1, 0, 0, 4, 2, 20),
                         row.names = c("Ridge and Valley", "Blue Ridge", "Piedmont", "Southeastern Plains", "Middle Atlantic Coastal Plain", "Southern Coastal Plain"))
chisq = chisq.test(discharge) # add ", simulate.p.value = TRUE" inside the parentheses if the table contains zeros
chisq
chisq2 = chisq.test(groundwater) # add ", simulate.p.value = TRUE" inside the parentheses if the table contains zeros
chisq2

Related

Normalising units/Replace substrings based on lists using Python

I am trying to normalize weight units in a string.
For example (original - normalized):
1. SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre - SUCO MARACUJA COM GENGIBRE PCS 300 ML
2. OVOS CAIPIRAS ANA MARIA BRAGA 10UN - OVOS CAIPIRAS ANA MARIA BRAGA 10U
3. SUCO MARACUJA MAMAO PCS 300 Gram - SUCO MARACUJA MAMAO PCS 300 G
4. SUCO ABACAXI COM MACA PCS 300Milli litre - SUCO ABACAXI COM MACA PCS 300ML
The keyword table is:
unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre','Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
I tried to treat these lists as a lookup table, but I am having difficulty comparing two dataframes or tables in Python.
I tried the code below.
import re

unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre','Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
z = 'SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'
# for row in mongo_docs:
#     z = row['clean_hntproductname']
for x in unit:
    for y in norm_unit:
        # True when the string ends with the unit x (case-insensitive)
        if re.search(r'\s' + x + r'$', z, re.I):
            pass  # the replacement and write-back steps are still commented out
            # clean_hntproductname = t.lower().replace(x.lower(),y.lower())
            # myquery3 = { "_id" : row['_id']}
            # newvalues3 = { "$set": {"clean_hntproductname" : 'clean_hntproductname'} }
            # ds_hnt_prod_data.update_one(myquery3, newvalues3)
I'm using Python (Jupyter) with MongoDB (Compass), fetching data from Mongo and writing back to it.

From my understanding, you want to update every row whose text contains one of the words in the unit list, replacing that word with the corresponding entry in norm_unit.
(Disclaimer: I'm not familiar with MongoDB or Python.)
What you want is a mapping (a hash, i.e. a dict in Python) from each word you want to change to its replacement.
Here's a trivial solution (not the best one, but it should point you in the right direction):
unit_conversions = {
    'Kilo': 'KG',
    'Kilogram': 'KG',
    'Gram': 'G',
}
# pseudo-code
for each row that you want to update
    item_description = get the value of the string in the column
    for each key in unit_conversions (e.g. 'Kilo')
        see if the item_description contains the key
        if it does, replace it with unit_conversions[key] (e.g. 'KG')
    update the row
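
The mapping idea above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the names UNIT_MAP, UNIT_RE and normalise_unit are made up for illustration, the dictionary is built from the unit and norm_unit lists in the question, and the unit is assumed to sit at the very end of the string, directly after a digit or a space. The MongoDB read and update steps are left out.

```python
import re

# Lower-cased keys from the question's `unit` list mapped to `norm_unit`.
UNIT_MAP = {
    'kilo': 'KG', 'kilogram': 'KG', 'gram': 'G', 'milligram': 'MG',
    'millilitre': 'ML', 'milli litre': 'ML', 'dozen': 'DZ', 'litre': 'L',
    'un': 'U', 'und': 'U', 'unid': 'U', 'unidad': 'U',
    'unidade': 'U', 'unidades': 'U',
}

# Try longer keys first so 'Kilogram' is preferred over 'Kilo'.
_alternatives = '|'.join(
    re.escape(key) for key in sorted(UNIT_MAP, key=len, reverse=True)
)

# Assumption: the unit ends the string and follows a digit or whitespace.
UNIT_RE = re.compile(r'(?<=[\d\s])(' + _alternatives + r')$', re.IGNORECASE)


def normalise_unit(name):
    """Replace a trailing unit word with its normalised abbreviation."""
    return UNIT_RE.sub(lambda m: UNIT_MAP[m.group(1).lower()], name)


for example in [
    'SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre',
    'OVOS CAIPIRAS ANA MARIA BRAGA 10UN',
    'SUCO MARACUJA MAMAO PCS 300 Gram',
    'SUCO ABACAXI COM MACA PCS 300Milli litre',
]:
    print(normalise_unit(example))
```

Using one compiled pattern avoids looping over both lists for every row; in the Mongo loop, the result of normalise_unit(row['clean_hntproductname']) would be the value to write back.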

In Python, headings not in the same row

I extracted three columns from a larger data frame (recent_grads) as follows:
df = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()
However, when I print df, it comes out as follows:
                                          Men     Women
Major_category
Agriculture & Natural Resources       40357.0   35263.0
Arts                                 134390.0  222740.0
Biology & Life Science               184919.0  268943.0
Business                             667852.0  634524.0
Communications & Journalism          131921.0  260680.0
Computers & Mathematics              208725.0   90283.0
Education                            103526.0  455603.0
Engineering                          408307.0  129276.0
Health                                75517.0  387713.0
Humanities & Liberal Arts            272846.0  440622.0
Industrial Arts & Consumer Services  103781.0  126011.0
Interdisciplinary                      2817.0    9479.0
Law & Public Policy                   91129.0   87978.0
Physical Sciences                     95390.0   90089.0
Psychology & Social Work              98115.0  382892.0
Social Science                       256834.0  273132.0
How do I get the Major_category heading in the same row as the Men and Women headings? I tried to put the three columns in a new data frame as follows:
df1 = df[['Major_category', 'Men', 'Women']].copy()
This gives me an error (Major_category not in index).

You should try reset_index (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html), which moves the index back into a regular column:
df = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()
# Move Major_category out of the index, then print the output.
md = df.reset_index()
print(md)
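
To see the effect of reset_index on its own, here is a small self-contained sketch. The three-row recent_grads frame and its numbers are invented for illustration only; the column names are the ones from the question.

```python
import pandas as pd

# Tiny stand-in for the real recent_grads data (values are made up).
recent_grads = pd.DataFrame({
    'Major_category': ['Arts', 'Arts', 'Health'],
    'Men': [10.0, 5.0, 7.0],
    'Women': [20.0, 1.0, 9.0],
})

# After groupby(...).sum(), Major_category is the index, not a column.
totals = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()

# reset_index turns the index back into an ordinary column.
flat = totals.reset_index()

print(totals.columns.tolist())
print(flat.columns.tolist())
```

After the reset, flat[['Major_category', 'Men', 'Women']] selects all three columns without the "not in index" error.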

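It seems you want Major_category back as an ordinary dataframe column. Alternatively, pass as_index=False to groupby so that it never becomes the index:
df = recent_grads.groupby('Major_category', as_index=False)[['Men', 'Women']].sum()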

Cosine Similarity of rows based on selected pandas columns

I am trying to make an item-item based movie recommender. In the movies dataset I have meta data about movies such as title, genres, directors, actors, producers, writers, year_of_release etc. Currently I am calculating the similarity based on the genre column only using tf-idf vectorizer by splitting the genre column into a list and it is working completely fine. Here is the code I am using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def vector_cosine(df, index):
    # TF-IDF vectors of the genres column; returns the similarity row for one movie
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, stop_words='english')
    tfidf_matrix = tf.fit_transform(df['genres'])
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    return cosine_sim[index]

# Function that gets movie recommendations based on the cosine similarity score of movie genres
def genre_recommendations(title, idx):
    sim_scores = list(enumerate(vector_cosine(movies, idx)))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:6]  # skip the first entry, normally the movie itself
    movie_indices = [i[0] for i in sim_scores]
    # titles is defined elsewhere in my code
    recommendations = list(titles.iloc[movie_indices])
    if title in recommendations:
        recommendations.remove(title)
    return recommendations
and this is the sample data:
{"movie_id":"217d1207-effc-4605-bdef-899339615fe6","title":"Mamma Mia! Here We Go Again","release_date":"2018","actor_names":["Amanda Seyfried","Andy Garc\u00eda","Cher","Christine Baranski","Colin Firth","Dominic Cooper","Jessica Keenan Wynn","Julie Walters","Lily James","Meryl Streep","Pierce Brosnan","Stellan Skarsg\u00e5rd"],"director_names":["Ol Parker"],"producer_names":["Gary Goetzman","Judy Craymer"],"genres":"['Comedy', 'Music', 'Romance']","rating_aus":"PG","rating_nzl":"PG","rating_usa":null} {"movie_id":"59a0a1cd-5b83-401d-9b30-caa2c7e4f46c","title":"The Spy Who Dumped Me","release_date":"2018","actor_names":["Carolyn Pickles","Fred Melamed","Gillian Anderson","Hasan Minhaj","Ivanna Sakhno","James Fleet","Jane Curtin","Justin Theroux","Kate McKinnon","Mila Kunis","Paul Reiser","Sam Heughan"],"director_names":["Susanna Fogel"],"producer_names":null,"genres":"['Action', 'Comedy']","rating_aus":null,"rating_nzl":"R16","rating_usa":null} {"movie_id":"35e19192-6fb0-4c2b-9591-0d70deb30db6","title":"Beirut","release_date":"2018","actor_names":["Alon Aboutboul","Dean Norris","Douglas Hodge","Jon Hamm","Jonny Coyne","Kate Fleetwood","Larry Pine","Le\u00efla Bekhti","Mark Pellegrino","Rosamund Pike","Shea Whigham","Sonia Okacha"],"director_names":["Brad Anderson"],"producer_names":["Mike Weber","Monica Levinson","Shivani Rawat","Ted Field","Tony Gilroy"],"genres":"['Action', 'Thriller']","rating_aus":"MA15+","rating_nzl":"M","rating_usa":null} {"movie_id":"6e3c57d8-51c9-4b20-afe7-e09d949b2d7f","title":"Mile 22","release_date":"2018","actor_names":["Alexandra Vino","Iko Uwais","John Malkovich","Lauren Cohan","Lauren Mary Kim","Mark Wahlberg","Nikolai Nikolaeff","Poorna Jagannathan","Ronda Jean Rousey","Sala Baker","Sam Medina","Terry Kinney"],"director_names":["Peter Berg"],"producer_names":["Mark Wahlberg","Peter Berg","Stephen Levinson"],"genres":"['Action']","rating_aus":"MA15+","rating_nzl":"R16","rating_usa":null} 
{"movie_id":"cde85555-c944-44ef-a3b1-a5bf842646ad","title":"The Meg","release_date":"2018","actor_names":["Cliff Curtis","James Gaylyn","Jason Statham","Jessica McNamee","Li Bingbing","Masi Oka","Page Kennedy","Rainn Wilson","Robert Taylor","Ruby Rose","Tawanda Manyimo","Winston Chao"],"director_names":["Jon Turteltaub"],"producer_names":["Belle Avery","Colin Wilson","Lorenzo di Bonaventura"],"genres":"['Action', 'Science Fiction', 'Thriller']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"d1496722-2713-4b29-a323-18a0e2a0d6f3","title":"Monster's Ball","release_date":"2001","actor_names":["Amber Rules","Billy Bob Thornton","Charles Cowan Jr.","Coronji Calhoun","Gabrielle Witcher","Halle Berry","Heath Ledger","Peter Boyle","Sean Combs","Taylor LaGrange","Taylor Simpson","Yasiin Bey"],"director_names":["Marc Forster"],"producer_names":["Lee Daniels"],"genres":"['Drama', 'Romance']","rating_aus":null,"rating_nzl":"R16","rating_usa":null} {"movie_id":"844b309e-34ef-47bf-b981-eadc1d915886","title":"How to Be a Latin Lover","release_date":"2017","actor_names":["Anne McDaniels","Eugenio Derbez","Kristen Bell","Mckenna Grace","Michael Cera","Michaela Watkins","Raquel Welch","Rob Corddry","Rob Huebel","Rob Lowe","Rob Riggle","Salma Hayek"],"director_names":["Ken Marino"],"producer_names":null,"genres":"['Comedy']","rating_aus":"M","rating_nzl":"R13","rating_usa":null} {"movie_id":"991e3711-9918-41a0-b660-3c53ffa4901c","title":"Good Fortune","release_date":"2016","actor_names":null,"director_names":["Joshua Tickell","Rebecca Harrell Tickell"],"producer_names":null,"genres":"['Documentary']","rating_aus":"PG","rating_nzl":null,"rating_usa":null} {"movie_id":"7bf1935e-7d1a-437c-a71a-b4c70eb4f853","title":"Paper Heart","release_date":"2009","actor_names":["Charlyne Yi","Gill Summers","Jake Johnson","Martin Starr","Michael Cera","Seth Rogen"],"director_names":["Nicholas Jasenovec"],"producer_names":null,"genres":"['Comedy', 'Drama', 
'Romance']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"4cd0e423-acde-4bf7-bf93-f77777a4de6f","title":"Daybreakers","release_date":"2009","actor_names":["Claudia Karvan","Emma Randall","Ethan Hawke","Harriet Minto","Day","Isabel Lucas","Jay Laga'aia","Michael Dorman","Mungo McKay","Sam Neill","Tiffany Lamb","Vince Colosimo","Willem Dafoe"],"director_names":["Michael Spierig","Peter Spierig"],"producer_names":["Bryan Furst","Chris Brown","Sean Furst","Todd Fellman"],"genres":"['Action', 'Fantasy', 'Horror', 'Science Fiction']","rating_aus":"MA15+","rating_nzl":"R16","rating_usa":null} {"movie_id":"c0a84525-46a4-4977-86af-7fc8cb014683","title":"Requiem for a Dream","release_date":"2000","actor_names":["Charlotte Aronofsky","Christopher McDonald","Ellen Burstyn","Janet Sarno","Jared Leto","Jennifer Connelly","Joanne Gordon","Louise Lasser","Marcia Jean Kurtz","Mark Margolis","Marlon Wayans","Suzanne Shepherd"],"director_names":["Darren Aronofsky"],"producer_names":["Eric Watson","Palmer West"],"genres":"['Crime', 'Drama']","rating_aus":null,"rating_nzl":"R18","rating_usa":null} {"movie_id":"fe5367fe-b558-4fbe-872f-ab041ef58213","title":"Grizzly Man","release_date":"2005","actor_names":["David Letterman","Jewel Palovak","Kathleen Parker","Sam Egli","Timothy Treadwell","Warren Queeney","Werner Herzog","Willy Fulton"],"director_names":["Werner Herzog"],"producer_names":["Erik Nelson"],"genres":"['Documentary']","rating_aus":"M","rating_nzl":null,"rating_usa":null} {"movie_id":"6bd66f6b-2834-4013-884f-7eaf257e09fb","title":"The Great Buck Howard","release_date":"2008","actor_names":["Adam Scott","Colin Hanks","Debra Monk","Emily Blunt","Griffin Dunne","John Malkovich","Jonathan Ames","Patrick Fischler","Ricky Jay","Steve Zahn","Tom Hanks","Wallace Langham"],"director_names":["Sean McGinly"],"producer_names":["Gary Goetzman","Tom Hanks"],"genres":"['Comedy', 'Drama']","rating_aus":"G","rating_nzl":"G","rating_usa":null} 
{"movie_id":"78b22fc2-5069-40da-b4ac-790ec3902a32","title":"Boo! A Madea Halloween","release_date":"2016","actor_names":["Andre Hall","Bella Thorne","Brock O'Hurn","Cassi Davis","Diamond White","Jimmy Tatro","Kian Lawley","Lexy Panterra","Liza Koshy","Patrice Lovely","Tyler Perry","Yousef Erakat"],"director_names":["Tyler Perry"],"producer_names":null,"genres":"['Comedy', 'Drama', 'Horror']","rating_aus":"M","rating_nzl":null,"rating_usa":null} {"movie_id":"bc5c3635-bbfb-4dd7-b0ed-408787fd5f43","title":"Fantastic Beasts and Where to Find Them","release_date":"2016","actor_names":["Alison Sudol","Carmen Ejogo","Colin Farrell","Dan Fogler","Eddie Redmayne","Ezra Miller","Johnny Depp","Jon Voight","Katherine Waterston","Ron Perlman","Samantha Morton","Zo\u00eb Kravitz"],"director_names":["David Yates"],"producer_names":["David Heyman","J.K. Rowling","Lionel Wigram","Steve Kloves"],"genres":"['Adventure', 'Family', 'Fantasy']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"1ec3a043-a6c4-44a0-9fb1-c948eb07cf85","title":"Silver Linings Playbook","release_date":"2012","actor_names":["Anupam Kher","Bonnie Aarons","Bradley Cooper","Brea Bee","Chris Tucker","Dash Mihok","Jacki Weaver","Jennifer Lawrence","John Ortiz","Julia Stiles","Robert De Niro","Shea Whigham"],"director_names":["David O. Russell"],"producer_names":["Bruce Cohen","Donna Gigliotti","Jonathan Gordon","Mark Kamine"],"genres":"['Comedy', 'Drama', 'Romance']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"4c0bbde5-7a34-4556-a481-3357ef69b651","title":"The Equalizer","release_date":"2014","actor_names":["Alex Veadov","Bill Pullman","Chlo\u00eb Grace Moretz","David Harbour","David Meunier","Denzel Washington","E. 
Roger Mitchell","Haley Bennett","Johnny Skourtis","Marton Csokas","Melissa Leo","Vladimir Kulich"],"director_names":["Antoine Fuqua"],"producer_names":["Alex Siskin","Denzel Washington","Jason Blumenthal","Mace Neufeld","Michael Sloan","Richard Wenk","Steve Tisch","Todd Black","Tony Eldridge"],"genres":"['Action', 'Crime', 'Thriller']","rating_aus":"MA15+","rating_nzl":"R18","rating_usa":null} {"movie_id":"809f3131-7445-4ffb-8b77-6de6699c85c4","title":"The Notebook","release_date":"2004","actor_names":["David Thornton","Ed Grady","Gena Rowlands","James Garner","James Marsden","Jennifer Echols","Joan Allen","Kevin Connolly","Rachel McAdams","Ryan Gosling","Sam Shepard","Starletta DuPois"],"director_names":["Nick Cassavetes"],"producer_names":["Lynn Harris","Mark Johnson"],"genres":"['Drama', 'Romance']","rating_aus":"PG","rating_nzl":"PG","rating_usa":null} {"movie_id":"49652b6d-b818-4fc4-9c57-9ec7b5c346cc","title":"The Matrix","release_date":"1999","actor_names":["Anthony Ray Parker","Belinda McClory","Carrie","Anne Moss","Gloria Foster","Hugo Weaving","Joe Pantoliano","Julian Arahanga","Keanu Reeves","Laurence Fishburne","Marcus Chong","Paul Goddard","Robert Taylor"],"director_names":["Lana Wachowski","Lilly Wachowski"],"producer_names":["Joel Silver"],"genres":"['Action', 'Science Fiction']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"d356a087-4e89-420b-867c-618544969302","title":"The Hunger Games","release_date":"2012","actor_names":["Alexander Ludwig","Donald Sutherland","Elizabeth Banks","Isabelle Fuhrman","Jennifer Lawrence","Josh Hutcherson","Lenny Kravitz","Liam Hemsworth","Stanley Tucci","Toby Jones","Wes Bentley","Woody Harrelson"],"director_names":["Gary Ross"],"producer_names":["Jon Kilik","Nina Jacobson"],"genres":"['Adventure', 'Fantasy', 'Science Fiction']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"0ef84c8c-ffc5-4896-9e3b-5303acba0ff3","title":"The Wolf of Wall 
Street","release_date":"2013","actor_names":["Brian Sacca","Henry Zebrowski","Jon Bernthal","Jon Favreau","Jonah Hill","Kenneth Choi","Kyle Chandler","Leonardo DiCaprio","Margot Robbie","Matthew McConaughey","P. J. Byrne","Rob Reiner"],"director_names":["Martin Scorsese"],"producer_names":["Emma Tillinger Koskoff","Joey McFarland","Leonardo DiCaprio","Martin Scorsese","Riza Aziz"],"genres":"['Comedy', 'Crime', 'Drama']","rating_aus":"R18+","rating_nzl":"R18","rating_usa":null}
What I want next is to calculate similarity based on multiple columns. Can you please guide me on how to achieve that?

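You can use cosine_similarity from sklearn.metrics.pairwise on a numeric feature matrix df (one row per movie, built from whichever columns you select):
sim_df = cosine_similarity(df)
Then you can get the top 10 most similar movies for a particular movie (the movie itself normally ranks first) like this:
sim_df[movie_index].argsort()[-10:][::-1]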
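
One way to build such a matrix from several metadata columns is to join the selected columns into a single text field per movie and vectorise that. The following is only a sketch under assumptions: the helper names to_tokens and most_similar are hypothetical, the three rows are taken from the sample data above, and genres is assumed to be already parsed into a list (in the sample it is stored as a string).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three movies from the sample data, with genres as real lists (assumption).
movies = pd.DataFrame({
    'title': ['Beirut', 'Mile 22', 'The Notebook'],
    'genres': [['Action', 'Thriller'], ['Action'], ['Drama', 'Romance']],
    'director_names': [['Brad Anderson'], ['Peter Berg'], ['Nick Cassavetes']],
})


def to_tokens(values):
    """Turn a list of names into single tokens; missing values give no tokens."""
    if not isinstance(values, list):
        return []
    return [value.replace(' ', '_').lower() for value in values]


# Columns that should contribute to the similarity score.
feature_cols = ['genres', 'director_names']

# One whitespace-separated string of tokens per movie.
soup = movies[feature_cols].apply(
    lambda row: ' '.join(tok for col in feature_cols for tok in to_tokens(row[col])),
    axis=1,
)

# Treat every whitespace-separated token as one term.
tfidf = TfidfVectorizer(token_pattern=r'\S+')
matrix = tfidf.fit_transform(soup)
sim = cosine_similarity(matrix)


def most_similar(idx, n=2):
    """Titles of the n movies most similar to row idx, excluding itself."""
    order = sim[idx].argsort()[::-1]
    return [movies['title'].iloc[i] for i in order if i != idx][:n]


print(most_similar(0, n=1))
```

Joining names with underscores keeps a multi-word name as one term, so two movies only match on a director or actor when the full name is shared. Columns could also be weighted, for example by repeating their tokens, but that is a design choice beyond this sketch.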

pie doesn't allow negative values

I am trying to draw a pie chart using Matplotlib. Although there are no negative values present, I keep getting the error "pie doesn't allow negative values"!
contrib = sales_data.groupby('Region')['Sales2016'].sum().round().reset_index()
contrib["Percentage"] = (contrib.Sales2016/sum(contrib.Sales2016))*100
contrib = contrib.drop(columns = ["Sales2016"])
contrib.plot(kind = "pie", subplots = True).plot(kind = "pie",subplots=True,legend=False,figsize=(12,5),autopct="%.2f%%")
plt.show()
Is it possible to point out where I am going wrong? The following is the output for contrib:
    Region  Percentage
0  Central   32.994771
1     East   42.701319
2     West   24.303911

Pass the column to plot as the y argument of the pie plot, and take the labels from the Region column:
contrib.plot(kind = "pie",y="Percentage",labels=contrib["Region"],legend=False,figsize=(12,5),autopct="%.2f%%")
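
As a self-contained illustration, the sketch below rebuilds contrib from the three rows printed in the question and plots it. Setting Region as the index is an alternative way to get the wedge labels, since pandas labels pie wedges with the index by default; the Agg backend line is only there so the snippet also runs without a display and is an assumption about the environment.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# The contrib frame as printed in the question.
contrib = pd.DataFrame({
    "Region": ["Central", "East", "West"],
    "Percentage": [32.994771, 42.701319, 24.303911],
})

# With Region as the index, the wedges are labelled by region name.
ax = contrib.set_index("Region").plot(
    kind="pie", y="Percentage", legend=False, figsize=(12, 5), autopct="%.2f%%"
)
ax.set_ylabel("")  # hide the default "Percentage" axis label

wedge_labels = [text.get_text() for text in ax.texts]
print(wedge_labels)
```

In a notebook, plt.show() after the plot call displays the chart.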

In Stata, how can I group coefplot's means across categorical variable?

I'm working with the coefplot command (source, docs) in Stata, plotting means of a continuous variable over categories.
Small reproducible example:
sysuse auto, clear
drop if rep78 < 3
la de rep78 3 "Three" 4 "Four" 5 "Five"
la val rep78 rep78
mean mpg if foreign == 0, over(rep78)
eststo Domestic
mean mpg if foreign == 1, over(rep78)
eststo Foreign
su mpg, mean
coefplot Domestic Foreign , xtitle(Mpg) xline(`r(mean)')
This gives me the following result (plot not shown).
What I'd like to add is an extra 'group' label for the Y axis. Trying the options from the regression examples doesn't seem to do the job:
coefplot Domestic Foreign , headings(0.rep78 = "Repair Record 1978")
coefplot Domestic Foreign , groups(?.rep78 = "Repair Record 1978")
Any other possibilities?

This seems to do the job:
coefplot Domestic Foreign , xtitle(Mpg) xline(`r(mean)') ///
    groups(Three Four Five = "Repair Record 1978")
However, I don't know how it will handle several categorical variables that share the same value labels.
