In Stata, how can I group coefplot's means across a categorical variable?

I'm working with the coefplot command (source, docs) in Stata, plotting means of a continuous variable over categories.
A small reproducible example:
sysuse auto, clear
drop if rep78 < 3
la de rep78 3 "Three" 4 "Four" 5 "Five"
la val rep78 rep78
mean mpg if foreign == 0, over(rep78)
eststo Domestic
mean mpg if foreign == 1, over(rep78)
eststo Foreign
su mpg, mean
coefplot Domestic Foreign , xtitle(Mpg) xline(`r(mean)')
This gives me the following result:
What I'd like to add is an extra 'group' label for the Y axis. Trying options from the regression examples doesn't seem to do the job:
coefplot Domestic Foreign , headings(0.rep78 = "Repair Record 1978")
coefplot Domestic Foreign , groups(?.rep78 = "Repair Record 1978")
Any other possibilities?

This seems to do the job:
coefplot Domestic Foreign , xtitle(Mpg) xline(`r(mean)') ///
groups(Three Four Five = "Repair Record 1978")
However, I don't know how this will handle situations where categorical variables share the same value labels.

How do I analyze goodness of fit between two contingency tables?

I have two contingency tables that I performed a chi-square test on. I would like to know if these two tables have similar distributions/frequency of the data using a goodness of fit test. I'm not sure how to do this and how to format the data. Thanks in advance for your help!
discharge = data.frame(decreasing.sign = c(0, 0, 9, 7, 1, 1),
                       decreasing.trend = c(2, 3, 35, 27, 8, 6),
                       increase.trend = c(8, 27, 34, 16, 4, 3),
                       increase.sign = c(0, 2, 7, 0, 0, 0),
                       row.names = c("Ridge and Valley", "Blue Ridge", "Piedmont", "Southeastern Plains", "Middle Atlantic Coastal Plain", "Southern Coastal Plain"))
groundwater = data.frame(decreasing.sign = c(0, 1, 6, 45, 6, 16),
                         decreasing.trend = c(0, 1, 3, 28, 5, 5),
                         increase.trend = c(1, 5, 6, 32, 9, 5),
                         increase.sign = c(1, 0, 0, 4, 2, 20),
                         row.names = c("Ridge and Valley", "Blue Ridge", "Piedmont", "Southeastern Plains", "Middle Atlantic Coastal Plain", "Southern Coastal Plain"))
chisq = chisq.test(discharge)    # add simulate.p.value = TRUE inside the parentheses if there are zero counts
chisq
chisq2 = chisq.test(groundwater) # add simulate.p.value = TRUE inside the parentheses if there are zero counts
chisq2
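One possible approach (my suggestion, not from the thread): treat the two tables as two samples and run a chi-square test of homogeneity on their column totals. A minimal pure-Python sketch of the statistic follows; in practice scipy.stats.chi2_contingency (or R's chisq.test on the stacked totals) would also give you the p-value directly.

```python
# Chi-square test of homogeneity on the column totals of the two tables.
# Column sums (decreasing.sign, decreasing.trend, increase.trend, increase.sign)
# computed from the data frames above:
discharge_totals = [18, 81, 92, 9]
groundwater_totals = [74, 42, 58, 27]

def chi_square_statistic(table):
    """Pearson chi-square statistic for a 2-D contingency table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi_square_statistic([discharge_totals, groundwater_totals])
# degrees of freedom = (rows - 1) * (cols - 1) = 3
```

Compare the statistic against the chi-square distribution with 3 degrees of freedom to get a p-value; a large statistic indicates the two tables' trend distributions differ.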

Normalising units/Replace substrings based on lists using Python

I am trying to normalize weight units in a string.
E.g.:
1. SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre -> SUCO MARACUJA COM GENGIBRE PCS 300 ML
2. OVOS CAIPIRAS ANA MARIA BRAGA 10UN -> OVOS CAIPIRAS ANA MARIA BRAGA 10U
3. SUCO MARACUJA MAMAO PCS 300 Gram -> SUCO MARACUJA MAMAO PCS 300 G
4. SUCO ABACAXI COM MACA PCS 300Milli litre -> SUCO ABACAXI COM MACA PCS 300ML
The keyword table is :
unit = ['Kilo', 'Kilogram', 'Gram', 'Milligram', 'Millilitre', 'Milli litre',
        'Dozen', 'Litre', 'Un', 'Und', 'Unid', 'Unidad', 'Unidade', 'Unidades']
norm_unit = ['KG', 'KG', 'G', 'MG', 'ML', 'ML', 'DZ', 'L', 'U', 'U', 'U', 'U', 'U', 'U']
I tried to use these lists as a lookup table, but I am having difficulty comparing two dataframes or tables in Python.
I tried the below code.
import re

unit = ['Kilo', 'Kilogram', 'Gram', 'Milligram', 'Millilitre', 'Milli litre',
        'Dozen', 'Litre', 'Un', 'Und', 'Unid', 'Unidad', 'Unidade', 'Unidades']
norm_unit = ['KG', 'KG', 'G', 'MG', 'ML', 'ML', 'DZ', 'L', 'U', 'U', 'U', 'U', 'U', 'U']
z = 'SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'
# for row in mongo_docs:
#     z = row['clean_hntproductname']
for x, y in zip(unit, norm_unit):
    if re.search(r'\s' + x + r'$', z, re.I):
        pass
        # clean_hntproductname = z.lower().replace(x.lower(), y.lower())
        # myquery3 = { "_id": row['_id'] }
        # newvalues3 = { "$set": {"clean_hntproductname": clean_hntproductname} }
        # ds_hnt_prod_data.update_one(myquery3, newvalues3)
I'm using Python(Jupyter) with MongoDb(Compass). Fetching data from Mongo and writing back to it.
From my understanding, you want to update all the rows in a table that contain the words in the unit array, replacing them with the corresponding ones in norm_unit.
(Disclaimer: I'm not familiar with MongoDB or Python.)
What you want is to create a mapping (using a hash) of the words you want to change.
Here's a trivial solution (i.e. not the best solution, but it should point you in the right direction):
unit_conversions = {
    'Kilo': 'KG',
    'Kilogram': 'KG',
    'Gram': 'G'
}
# pseudo-code
for each row that you want to update:
    item_description = get the value of the string in the column
    for each key in unit_conversions (e.g. 'Kilo'):
        see if the item_description contains the key
        if it does, replace it with unit_conversions[key] (e.g. 'KG')
    update the row
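The pseudo-code above, translated into a runnable Python sketch using the full unit list from the question (the Mongo field and collection names are left out; this shows only the string-normalization step):

```python
import re

# Map each unit spelling to its normalized form (zip pairs the two lists).
unit = ['Kilo', 'Kilogram', 'Gram', 'Milligram', 'Millilitre', 'Milli litre',
        'Dozen', 'Litre', 'Un', 'Und', 'Unid', 'Unidad', 'Unidade', 'Unidades']
norm_unit = ['KG', 'KG', 'G', 'MG', 'ML', 'ML', 'DZ', 'L',
             'U', 'U', 'U', 'U', 'U', 'U']
unit_conversions = dict(zip(unit, norm_unit))

def normalize_units(text):
    """Replace a trailing unit word with its normalized abbreviation."""
    for word, abbrev in unit_conversions.items():
        # Match the unit only at the end of the string, preceded by a digit
        # or whitespace (covers '300 Millilitre' and '300Milli litre').
        pattern = r'(?<=[\d\s])' + re.escape(word) + r'$'
        new_text, count = re.subn(pattern, abbrev, text, flags=re.I)
        if count:
            return new_text
    return text
```

Anchoring the pattern at the end of the string keeps 'Kilo' from matching inside 'Kilogram'; note that 'Millilitre' is listed before the shorter unit words, so longer spellings are tried first.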

Reformat csv file using python?

I have this csv file with only two entries. Here it is:
Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']
The first is a title and the second is a list of business headings.
The problem lies with entry two.
Here is my code:
import csv

with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)
    for row in reader:
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print(listt[0])
it prints:
'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'
What I need to do is create a list containing these words and then print each item separately with a for loop, like this:
Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers
Actually, I am trying to reformat and clean my current csv file so it can be more precise and understandable.
The complete first line of the csv is this:
Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"
First column is Name
Second column is Number
Third column is Owner
Fourth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification
There is no restriction on using csv.reader; I am open to any technique available to clean my file.
Think of it in terms of two separate tasks:
Collect some data items from a ‘dirty’ source (this CSV file)
Store that data somewhere so that it’s easy to access and manipulate programmatically (according to what you want to do with it)
Processing dirty CSV
One way to do this is to have a function deserialize_business() to distill structured business information from each incoming line in your CSV. This function can be complex because that’s the nature of the task, but it is still advisable to split it into self-contained smaller functions (such as get_outlets(), get_headings(), and so on). This function can return a dictionary, but depending on what you want it can be a [named] tuple, a custom object, etc.
This function would be an ‘adapter’ for this particular CSV data source.
Example of deserialization function:
def deserialize_business(csv_line):
    """
    Distills structured business information from a given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """
    pieces = [piece.strip("[[\"\']] ") for piece in csv_line.strip().split(',')]
    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]
    # headings begin after yoe and run until the substring "Outlets Address"
    headings = pieces[5:pieces.index("Outlets Address")]
    # outlets go from the substring "Outlets Address" until the category
    outlet_pieces = pieces[pieces.index("Outlets Address"):-1]
    # combine each individual outlet's information into a string
    # and let ``deserialize_outlet()`` deal with that
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets]
    # category is the last piece
    category = pieces[-1]
    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
Example of calling it:
with open("phonebookCOMPK-Directory.csv") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            business = deserialize_business(line)
        except Exception:
            # Bad line formatting?
            log.exception(u"Failed to deserialize line #%s!", lineno)
        else:
            # All is well
            store_business(business)
Storing the data
You’ll have the store_business() function take your data structure and write it somewhere. Maybe it’ll be another CSV that’s better structured, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities since Python has it built-in.
It all depends on what you want to do later.
Relational example
In this case your data would be split across multiple tables. (I’m using the word “table” but it can be a CSV file, although you can as well make use of an SQLite DB since Python has that built-in.)
Table identifying all possible business headings:
business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers
Table identifying all possible categories:
category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"
Table identifying businesses:
business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3
Table describing their outlets:
business ID, city, address, landmarks, phone
1, Karachi UAN, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi UAN, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281
Table describing their headings:
business ID, business heading ID
1, 1
1, 2
1, 3
…
Handling all this would require a complex store_business() function. It may be worth looking into SQLite and some ORM framework, if going with relational way of keeping the data.
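If going the SQLite route, here is a minimal sketch of what store_business() might look like with Python's built-in sqlite3 module. The table and column names are my own illustration of the layout above, not from the question, and outlets and categories are omitted for brevity:

```python
import sqlite3

# Illustrative schema mirroring the "businesses", "headings" and
# "business headings" tables sketched above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE headings (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE businesses (id INTEGER PRIMARY KEY, name TEXT, phone TEXT,
                             owner TEXT, btype TEXT, yoe TEXT);
    CREATE TABLE business_headings (business_id INTEGER, heading_id INTEGER);
""")

def store_business(business):
    """Persist one deserialized business dict; returns its new row id."""
    cur = conn.execute(
        "INSERT INTO businesses (name, phone, owner, btype, yoe) "
        "VALUES (?, ?, ?, ?, ?)",
        (business['name'], business['phone'], business['owner'],
         business['btype'], business['yoe']))
    business_id = cur.lastrowid
    for heading in business['headings']:
        # Insert the heading once, then link it to this business.
        conn.execute("INSERT OR IGNORE INTO headings (name) VALUES (?)",
                     (heading,))
        heading_id = conn.execute(
            "SELECT id FROM headings WHERE name = ?", (heading,)).fetchone()[0]
        conn.execute("INSERT INTO business_headings VALUES (?, ?)",
                     (business_id, heading_id))
    conn.commit()
    return business_id
```

The UNIQUE constraint plus INSERT OR IGNORE keeps the headings table deduplicated, so each heading gets one stable id that many businesses can reference.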
You can just replace the line:
print(listt[0])
with:
print(*listt[0].split(', '), sep='\n')
(Splatting the string itself would print one character per line; splitting on ', ' first yields the individual headings.)
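Alternatively, since the field is already formatted like a Python list literal, ast.literal_eval can parse it directly into a real list. A sketch, assuming the bracketed field is well-formed:

```python
import ast

# The bracket-stripped field from the question, as it appears in the CSV.
row5 = ("['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', "
        "'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']")
headings = ast.literal_eval(row5)  # parse the string as a Python list
for heading in headings:
    print(heading)
```

Unlike eval, literal_eval only accepts Python literals, so it is safe to run on untrusted CSV fields.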

Dodging error bars in marginsplot in Stata

I am using marginsplot to draw some error bars between two different groups. The error bars overlap, though, so I'm trying to dodge them slightly left or right of one another.
Here is an example slightly edited from the marginsplot help that illustrates the problem:
use http://www.stata-press.com/data/r13/nhanes2
quietly regress bpsystol agegrp##sex
quietly margins agegrp#sex
marginsplot, recast(scatter) ciopts(recast(rspike))
Is there any easy way to dodge the blue Male points and bars slightly to the left, and the red Female points and bars slightly to the right (or vice versa), like what is done in dodged bar charts?
Here it would work out fine to recast the confidence intervals to an area and make it slightly transparent, as in the help example further down. However, for my actual use case I would like to keep the points and spikes.
Here is an approach using the community-contributed commands parmest and eclplot.
The trick is to adjust the values of the group variable by a small amount, for example 0.1, and then to use the subby option of eclplot:
** a short version
use http://www.stata-press.com/data/r13/nhanes2
local cilevel = 95
qui reg bpsystol agegrp##sex
qui margins agegrp#sex
qui parmest , bmat(r(b)) vmat(r(V)) level( `cilevel' ) fast
qui split parm, parse( . # )
qui destring parm*, replace
replace parm1 = parm1 - ( 0.05 )
eclplot estimate min95 max95 parm1, eplot(sc) rplottype(rspike) supby(parm3, spaceby(0.1))
However, the problem with this approach is that all the labels get lost, and I do not know of a good way to retrieve them other than by brute force.
The following is an extended version of the code where I tried to automate re-application of all the value labels by a brute force method:
use http://www.stata-press.com/data/r13/nhanes2, clear
** specify parameters and variables
local cilevel = 95
local groupvar agegrp
local typevar sex
local ytitle "Linear Prediction"
local title "Adjust Predictions of `groupvar'#`typevar' with `cilevel'% CIs"
local eplot scatter
local rplottype rspike
local spaceby 0.1 /* use this param to control the dodge */
** store labels of groupvar ("agegrp") and typevar ("sex")
local varlist `groupvar' `typevar'
foreach vv of var `varlist' {
local `vv'_varlab : var lab `vv'
qui levelsof `vv', local( `vv'_vals )
foreach vl of local `vv'_vals {
local `vv'_`vl'lab : lab `vv' `vl'
lab def `vv'_vallab `vl' "``vv'_`vl'lab'", add
}
}
** run analysis
qui reg bpsystol `groupvar'##`typevar'
margins `groupvar'#`typevar'
** use parmest to store estimates
preserve
parmest , bmat(r(b)) vmat(r(V)) level( `cilevel' ) fast
lab var estimate "`ytitle'"
split parm, parse( . # )
qui destring parm*, replace
rename parm1 `groupvar'
rename parm3 `typevar'
** reapply stored labels
foreach vv of var `varlist' {
lab var `vv' "``vv'_varlab'"
lab val `vv' `vv'_vallab
}
** dodge and plot
replace agegrp = agegrp - ( `spaceby' / 2 )
eclplot estimate min95 max95 agegrp ///
, eplot( `eplot' ) rplottype( `rplottype' ) ///
supby( sex, spaceby( `spaceby' ) ) ///
estopts1( mcolor( navy ) ) estopts2( mcolor( maroon ) ) ///
ciopts1( lcolor( navy ) ) ciopts2( lcolor( maroon ) ) ///
title( "`title'" )
restore

Concatenating strings in a column according to values in another column in a dataframe

I have a data.frame with two columns of strings as follows.
nos <- c("JM1", "JM2", "JM3", "JM1", "JM5", "JM45", "JM3", "JM45")
ren <- c("book, vend, spent", "marigold, fortune", "smoke, parchment, smell, book", "mental, past, create", "key, fortune, mask, federal", "tell, warn, slip", "wire, dg333, uv12", "tell, warn, slip, furniture")
d <- data.frame(nos, ren, stringsAsFactors=FALSE)
d
nos ren
1 JM1 book, vend, spent
2 JM2 marigold, fortune
3 JM3 smoke, parchment, smell, book
4 JM1 mental, past, create
5 JM5 key, fortune, mask, federal
6 JM45 tell, warn, slip
7 JM3 wire, dg333, uv12
8 JM45 tell, warn, slip, furniture
I want to concatenate the elements of ren column according to the strings in nos column.
For example, in the sample data, the elements associated with JM1, which occurs twice, should be merged ("book, vend, spent, mental, past, create").
Also, the elements associated with JM45 should be merged keeping only unique words ("tell, warn, slip, furniture").
The output that I am trying to get is like below.
nos1 <- c("JM1", "JM2", "JM3", "JM5", "JM45")
ren1 <- c("book, vend, spent, mental, past, create", "marigold, fortune", "smoke, parchment, smell, book, wire, dg333, uv12", "key, fortune, mask, federal", "tell, warn, slip, furniture")
out <- data.frame(nos1, ren1, stringsAsFactors=FALSE)
out
nos1 ren1
1 JM1 book, vend, spent, mental, past, create
2 JM2 marigold, fortune
3 JM3 smoke, parchment, smell, book, wire, dg333, uv12
4 JM5 key, fortune, mask, federal
5 JM45 tell, warn, slip, furniture
How can I do this in R? My original data set has thousands of such rows in a data.frame.
Using the plyr package, you could do it like this:
ddply(d, .(nos), summarise, ren1 = paste0(ren, collapse = ", "))
or, if you want unique values in ren1, like this:
ddply(d, .(nos), summarise,
      ren1 = paste0(unique(unlist(strsplit(ren, split = ", "))), collapse = ", "))
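For comparison, the same group-and-concatenate logic (the unique-words variant) can be sketched outside R in plain Python, accumulating first occurrences of each word per key:

```python
# Group the comma-separated strings by key, keeping only the first
# occurrence of each word within a group (mirrors the unique() variant).
nos = ["JM1", "JM2", "JM3", "JM1", "JM5", "JM45", "JM3", "JM45"]
ren = ["book, vend, spent", "marigold, fortune",
       "smoke, parchment, smell, book", "mental, past, create",
       "key, fortune, mask, federal", "tell, warn, slip",
       "wire, dg333, uv12", "tell, warn, slip, furniture"]

merged = {}
for key, value in zip(nos, ren):
    words = merged.setdefault(key, [])
    for word in value.split(", "):
        if word not in words:  # drop duplicates within a group
            words.append(word)

out = {key: ", ".join(words) for key, words in merged.items()}
```

Using a list rather than a set preserves the original word order, matching the expected output in the question.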
