What 's' represents in cells in this dataset? - statistics

I need to use this public dataset:
https://catalog.data.gov/dataset/2015-2016-demographic-data-grades-k-8-school
You can view data in this table viewer:
https://data.cityofnewyork.us/Education/2015-2016-Demographic-Data-Grades-K-8-School/7yc5-fec2
Many cells have No data in them, which is clear for me. However many others have s which is unclear. I didn't find any explanation.
I guess maybe it is some standard way of telling why data is absent in that cell. Or maybe not and only authors of data know it.
Please tell me what s mean or may mean.

There is no "universal" meaning behind this.
It could be a number of things:
Data not present
Not applicable
Use case specific information
Error in the dataset itself
...
If you want to create value out of this data, don't make assumptions, look for documentation or description of dataset, which describes its columns, types and expected content. The owner or creator of this dataset should of course also be able to inform you of what it represents.

Related

What is the role of feature type in AzureML?

I want to know what is the difference between feature numeric and numeric columns in Azure Machine Learning Studio.
The documentation site states:
Because all columns are initially treated as features, for modules
that perform mathematical operations, you might need to use this
option to prevent numeric columns from being treated as variables.
But nothing more. Not what a feature is, in which modules you need features. Nothing.
I specifically would like to understand if the clear feature dropdown option in the fields in the edit metadata-module has any effect. Can somebody give me a szenario where this clear feature-operation changes the ML outcome? Thank you
According to the documentation in ought to have an effect:
Use the Fields option if you want to change the way that Azure Machine
Learning uses the data in a model.
But what can this effect be? Any example might help
As you suspect, setting a column as feature does have an effect, and it's actually quite important - when training a model, the algorithms will only take into account columns with the feature flag, effectively ignoring the others.
For example, if you have a dataset with columns Feature1, Feature2, and Label and you want to try out just Feature1, you would apply clear feature to the Feature2 column (while making sure that Feature1 has the feature label set, of course).

n columns of data frame discarded

I am using spatstat package in R to read my road network shapefile which also has some additional attributes.
When i am reading my shapefiles and converting them to as.psp(before I make them an object of linnet), I am getting n columns of data frame discarded. I do not understand why? The columns being discarded are my covariates for a linear network, so I am not able to bring them into my analysis.
Could someone give me an idea why this happens and how to correct it?
Why it happens:
I would guess that we (spatstat authors) need to spend a bit of time discussing with the maptools guys how to handle the additional info in the SpatialLinesDataFrame object, and we haven't done that yet.
How to correct it:
You have to write some code on your own at the moment. You can extract the data from SpatialLinesDataFrame object by accessing the #data slot. Please provide specific data and how you need to use the additional data (what format do you need it in) if you need more help. You can find a few helpful commands here: https://cran.r-project.org/web/packages/spatstat/vignettes/shapefiles.pdf

Freebase: Format search result to list all properties of object of unknown type(s)

I'm trying to write a MQL query to format a search result in freebase (the "output" parameter in the search API). I essentially want to find the (simple) values of all the properties of a given search result (without knowing anything about the types of the result a priori). By "simple", I mean only the default properties if the values are complex objects.
E.g., if I search for "Yo La Tengo" and this takes me to the result for "/en/yo_la_tengo", I want to be able to get the group's members (I just need names, not instruments or dates started), albums (again, just names), films contributed to (again, just names), etc.
Is there a simple way to do this with a search output query, given that I know nothing about the types? I imagine there's some sort of reflection magic I can use, and I've tried mucking about with "/type/reflect", but I'm not getting anywhere. I'm brand-new to MQL (though I have extensive SQL experience), so this is a little daunting. Any ideas?
Edit: So to clarify, I think the problem I'm seeing is due to mediator types like "performance" (an actor in a film) or "marriage". E.g., with a query about Yo La Tengo, I can see most (all?) information that I'm interested in, but a similar query about [The Muppet Movie]( freebase.com/api/service/search?limit=1&mql_output=%5B%7B%22%2Ftype%2Freflect%2Fany_reverse%22%3A%5B%7B%7D%5D%2C%22%2Ftype%2Freflect%2Fany_master%22%3A%5B%7B%7D%5D%2C%22%2Ftype%2Freflect%2Fany_value%22%3A%5B%7B%7D%5D%7D%5D&query=The%20Muppet%20Movie -- sorry, SO thinks I'm a spammer so I can't make this a link), I don't see Frank Oz reference at all (probably because his performance is referenced instead). Is there a generic way for me to "follow" mediator types to get all their properties? E.g., is there a single output MQL that would allow me to get the actor in a performance (when linked form a film search result) and give the the spouse in a marriage (when linked from a person)?
Querying not only every property, but then following those properties another ply deep in the graph for all search results is going to be an incredibly expensive operation. What is the use case for this? Do you really have a UI where the user can see and effectively absorb all this information? To answer the question directly though, it's not possible to unpack mediator types automatically using mql_output on the search API.
I'd suggest combining a basic set of information on the search query with a deeper set of information on a topic that the user has expressed interest (e.g. by hovering over). This UI experience would be similar to that of Freebase Suggest.
In the years since the question was originally asked there have been some additional useful things added such as the "notable" pseudo-property which lets you see what the topic is notable for.
Of course everyone also needs to be moving to the new API, so the queries would be:
https://www.googleapis.com/freebase/v1/search?query=%22the%20muppet%20movie%22&limit=1&indent=true
https://www.googleapis.com/freebase/v1/topic/en/the_muppet_movie
AFAIK there is no way to do this in outright MQL, but you can:
Get all the properties of an object or type of object, then
Programmatically construct another MQL query to get those objects you want to know more about.
Look at this example:
[{
"type|=": [
"/film/actor",
"/tv/tv_actor",
"/celebrities/celebrity"
],
"*": [{}]
}]​
It grabs all the properties of all objects that have the type actor, tv_actor, or celebrity. When you run it, you'll see all the possible "follow" points you can explore.
This is not exactly what you want, but it should get you closer.

Organizing Lots of Data in Search Results

I'm working on a pretty basic web app (not much more than CRUD stuff). However, the requirements call for a bunch of data to be displayed with each item in the search results - IDs, dates, email addresses, long descriptions... too much to fit neatly into a simple grid, and too dissimilar to make them flow together (like the natural language example from this article.)
Is there a design pattern for attractively displaying many descriptive fields with each search result?
(Please don't tell me to just remove some fields from the results; that's not an option for this project.)
Obviously there are many ways you can handle this, and to a degree it's a factor of your information design abilities and preferences.
Natural Data Groupings
What I would do is try to organize your data into a small number of "buckets." You state that the data are too dissimilar to be arranged into a sentence, but it's likely you can create a few logical groups. Since we can't see all your data, I'll guess that you have information about a person (email, name, ID?), about some sort of event (dates? type?), or maybe about some kind of object related to the person (orders? classes?). Whatever they are, some of the data will be more closely related to each other than others.
Designing in Chunks
Take each loose "bucket" and design a kind of "plate" -- a grouping just for the information in that bucket. The design problem within this constrained chunk is easier to tackle: maybe it's a little table-like layout, maybe it's something non-tabular, like the stackoverflow user "nameplate". Maybe long textual data have their own plates, or maybe they're grouped into a single plate, but with a preview/detail click-for-more arrangement.
Using a Grid
Now that you have a small number of "plates," go back to a grid-like approach for your overall search result row design. Arrange the plates as units within the row, and be sure to keep them aligned. Following an overall grid (HTML table or otherwise) for the plates will avoid an "information soup" problem. You'll have clean columns that scan well, and a readable, natural information hierarchy. The natural language example you cite would indeed be difficult to parse if it were one of many rows displayed in a search results grid.
Consistency
Be sure to use a common "design vocabulary" when you're working on the chunks -- consistent styling of labels, consistent spacing... so when everything's displayed, despite the bulk of information, it all feels like it's part of the same family.
It's an interesting design exercise. Many comps, lots of iteration, and some brainstorming should get you where you need to be.
It probably depends on the content you're displaying. Look at the StackOverflow layout for this question. It has Votes, Title, Description, Tags, Author, etc. The content wouldn't work well in a grid for sure, nor does it flow nicely on it's own.
I think it's time to get creative ;)
No one ever thinks about what this is going to look like on their screen, do they?
One thing you can do is truncate the displayed text, and then display the expanded version in a tooltip on hover, or after the user clicks on it.
For example, display only the two-letter state abbreviation but show the full state name on hover.
Or, to save even more space, only display the state abbreviation, and put the entire address in the tooltip.
For long descriptions, you can display only the first few characters, followed by an ellipsis or the word "More". Then, show the full text either on hover or on click.
One disadvantage of the hover approach is that you can't sort the column on that text. There's nothing for the user to click to request the sort.

Cross Referencing Databases on Fuzzy Data

I am currently working on project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contains numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
First, I'd add to your list the techniques discussed at Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at a combination using testing to establish the set of coefficients for combining the results of the various techniques.

Resources