Word usage database? - statistics

Is there any free database/place out there with commonality/usage ratios of English words? (British or U.S. English, doesn't matter)
I don't care about the exact numbers, only the values relative to each other. Something like:
the | 0.2
car | 0.08
chroma | 0.005
overspread | 0.0000007
Edit:
I have found http://en.wiktionary.org/wiki/Wiktionary%3aFrequency_lists, which I can scrape for data. However, I would prefer an SQL format, which is easier to work with.

The term you want to google is "word frequency". One of the top hits is http://www.wordfrequency.info/
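If you do end up scraping the Wiktionary lists, here is a minimal sketch for loading word/frequency pairs into SQLite, assuming you have already produced a simple tab-separated file (the file name and table layout here are made up):

import csv
import sqlite3

conn = sqlite3.connect("word_freq.db")
conn.execute("CREATE TABLE IF NOT EXISTS freq (word TEXT PRIMARY KEY, ratio REAL)")

# words.tsv: one "word<TAB>relative_frequency" pair per line (hypothetical file)
with open("words.tsv", newline="") as f:
    rows = ((word, float(ratio)) for word, ratio in csv.reader(f, delimiter="\t"))
    conn.executemany("INSERT OR REPLACE INTO freq VALUES (?, ?)", rows)
conn.commit()

# Relative comparisons are then a simple query away
print(conn.execute("SELECT word, ratio FROM freq ORDER BY ratio DESC LIMIT 10").fetchall())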

Adding a regex condition to a listcomp in python

Hi everyone or anyone,
The aim: divide the text into chunks of 50 characters, BUT if there is a . in the chunk, break there and start the next chunk from the point after the .
I have this code
txt = "Greatly cottage thought fortune no mention he. Of mr certainty arranging am smallness by conveying. Him plate you allow built grave. Sigh sang nay sex high yet door game. She dissimilar was favourable unreserved nay expression contrasted saw. Past her find she like bore pain open. Shy lose need eyes son not shot. Jennings removing are his eat dashwood. Middleton as pretended listening he smallness perceived. Now his but two green spoil drift."
n = 50
some_text = [txt[i:i+n] for i in range(0, len(txt), n)]
This code divides the text into a list of 50-character strings, perfect, but I also need to add the condition for the . character: if a . occurs, the current string must end there, and the next one must continue from that point.
What it looks like:
print(some_text)
['Greatly cottage thought fortune no mention he. Of ', 'mr certainty arranging am smallness by conveying. ', 'Him plate you allow built grave. Sigh sang nay sex', ' high yet door game. She dissimilar was favourable', ' unreserved nay expression contrasted saw. Past he', 'r find she like bore pain open. Shy lose need eyes', ' son not shot. Jennings removing are his eat dashw', 'ood. Middleton as pretended listening he smallness', ' perceived. Now his but two green spoil drift.']
What I want it to look like:
['Greatly cottage thought fortune no mention he.',
 'Of mr certainty arranging am smallness by',
 'conveying.',
 'Him plate you allow built grave.',
 ...]
and so on...
Maybe you can use txt.split('.') with textwrap.wrap?
from textwrap import wrap

txt = "Greatly cottage thought fortune no mention he. Of mr certainty arranging am smallness by conveying. Him plate you allow built grave. Sigh sang nay sex high yet door game. She dissimilar was favourable unreserved nay expression contrasted saw. Past her find she like bore pain open. Shy lose need eyes son not shot. Jennings removing are his eat dashwood. Middleton as pretended listening he smallness perceived. Now his but two green spoil drift."

lines = []
for sentence in txt.split('.'):
    if not sentence.strip():
        continue
    for line in wrap(sentence + '.', 50):
        lines.append(line.strip())

# print lines:
print(*lines, sep='\n')
Prints:
Greatly cottage thought fortune no mention he.
Of mr certainty arranging am smallness by
conveying.
Him plate you allow built grave.
Sigh sang nay sex high yet door game.
She dissimilar was favourable unreserved nay
expression contrasted saw.
Past her find she like bore pain open.
Shy lose need eyes son not shot.
Jennings removing are his eat dashwood.
Middleton as pretended listening he smallness
perceived.
Now his but two green spoil drift.
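A small variant in case the real text can also end sentences with '!' or '?' (an assumption; the question only mentions '.'): re.split with a capturing group keeps the terminators, so each sentence can be stitched back together before wrapping.

import re
from textwrap import wrap

def chunk_sentences(text, width=50):
    # With a capturing group, re.split keeps the terminators in the result:
    # text piece, terminator, text piece, terminator, ...
    parts = re.split(r'([.!?])', text)
    sentences = [piece + term for piece, term in zip(parts[0::2], parts[1::2])]
    lines = []
    for sentence in sentences:
        if sentence.strip():
            lines.extend(line.strip() for line in wrap(sentence, width))
    return lines

print(*chunk_sentences(txt), sep='\n')  # same output as above for this txt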

How to compare different groups with different sample size?

I am plotting students' data from different schools to see the difference between male and female student numbers in some majors. I am using Python. I have already plotted the data for some schools, and as I expected the male numbers are generally higher. Then I realized that each school has a different total number of students. Does my work make any sense when the sample sizes are different? If not, may I have some suggestions for changes?
Look at it this way: you have two classes, the first with 2 men, the second with 20, and their marks. The 2 men both score 90/100; the 20 marks in the second class range from 40 to 80. Would it be correct to say "well, the first class did much better on the test than the second"? Of course not.
As a sanity check, take min(sample sizes). If it looks too small, throw the program away, because you don't have enough data to say anything. And show the total sample size of each group via a legend, a text annotation, or the plot title; either way it shows the reliability of your results.
This question is not about programming but rather about statistics; still, I will try to answer.
An important question you didn't address: what are you doing this for? If your question is "Hmm... are there more men than women in the population (in this case, population = all persons in the major)?", then the individual schools don't matter, and you can pool the samples and work with them as one (but don't forget to actually pool them).
But you may instead ask: "Are there any differences between the schools in my samples?" In that case pooling is not correct. For this purpose I highly recommend a barh plot with stacked=True, one bar per school, and for normalization just use percentages. Then the difference between sample sizes won't be a problem.
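A minimal sketch of that normalized stacked plot, assuming pandas and matplotlib (the school names and counts are made up):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts of male/female students per school for one major
df = pd.DataFrame(
    {"male": [120, 45, 300], "female": [80, 55, 150]},
    index=["School A", "School B", "School C"],
)

# Convert each row to percentages so different totals don't distort the bars
pct = df.div(df.sum(axis=1), axis=0) * 100

ax = pct.plot.barh(stacked=True)
ax.set_xlabel("share of students (%)")
# Annotate the real sample size so reliability stays visible
for i, total in enumerate(df.sum(axis=1)):
    ax.text(101, i, f"n={total}", va="center")
plt.tight_layout()
plt.show()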
Please, when you ask a question, include some code. Three rows of data and one plot from a sample would be very helpful...

Outline detection from patterns in a list of textual articles

Are there NLP algorithms for detecting repeating patterns in a list of texts, from which topic keywords and other associated keywords can be derived?
I will show it as an example:
You have a search query "vegan food for something health"
(where something is a part of the body you need advice about).
The search engine will return a list of articles.
The algorithm will search for patterns in these articles.
E.g. it notices that 80% of them have a paragraph with
at least 4 instances of the word orange, and similarly
for carrot, apples, cucumbers.
So it will give you an outline (textual mindmap)
orange
carrot --> vitamin A
apple
banana --> vitamin B
           run a lot
I once watched a YouTube video about the semantic web where Tim Berners-Lee talked about something similar, but I have lost the link. Could you point me in that direction with the right keywords?
Probably you are looking for word2vec: the patterns you describe can be expressed in terms of distances between word vectors.
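A minimal sketch of that idea with gensim's Word2Vec (the toy corpus is made up; in practice you would tokenize the articles returned by the search):

from gensim.models import Word2Vec

# Hypothetical corpus: each article tokenized into lowercase words
articles = [
    ["vegan", "food", "orange", "carrot", "vitamin", "a"],
    ["healthy", "diet", "apple", "banana", "vitamin", "b"],
    # ... many more tokenized articles from the search results
]

model = Word2Vec(sentences=articles, vector_size=100, window=5, min_count=1)

# Words that occur in similar contexts end up close in vector space,
# which is one way to group "orange", "carrot", ... under a common theme
print(model.wv.most_similar("carrot", topn=5))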

Improving a search engine

I'm working on a search engine. For the most part I'm simply using Apache Lucene, which is working great so far, but I also want to improve the search results by building good "heuristics" into the search. (For example, if someone searches for 'couch' and I have all of the couches cataloged as type 'sofa', I want the search algorithm to make the connection.)
I know this sounds a bit vague, but I don't know where to look for further reading on this subject. (I Googled terms like 'heuristic search' and 'heuristic function', but they don't refer to the same thing I mean.) So I wanted to know whether any of you have worked on similar problems in search engines, and whether you would recommend anything.
I had to build something similar for my Artificial Intelligence class. I built a web crawler that associated synonyms of words, similar to what you're looking to do. When a user searches for a term such as 'couch', I grabbed all of the synonyms of couch and stored them in a database with a reference to the original word. When the engine runs again and 'sofa' gets searched, the application again grabs the synonyms of 'sofa' (which is a synonym of couch), so you should be able to match that association.
There are plenty of free APIs for getting the synonyms of a word. Try changing your Google searches to 'topic-specific web crawlers' or 'topic-specific search engines'; you will get better results.
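One free source of synonyms is WordNet; here is a minimal sketch of query expansion using NLTK's WordNet corpus (assumes nltk is installed and the corpus has been downloaded):

from nltk.corpus import wordnet  # first run: nltk.download('wordnet')

def synonyms(word):
    # Collect lemma names from every WordNet synset the word belongs to
    syns = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            syns.add(lemma.name().replace('_', ' '))
    syns.discard(word)
    return syns

# Expand the query before handing it to the index
query = "couch"
expanded = {query} | synonyms(query)
print(expanded)  # includes 'sofa', 'lounge', ...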
One of the "quick n' dirty" hack which is popped in my mind can be to implement a dictionary which holds similarities in context. e.g. make sofa and couch group similar. Or much better approach could be to build a square matrix to hold "similarity score" for each word pairs. Here is random matrix about what I mean:
      | couch | sofa | chair
couch |   100 |   95 |    75
sofa  |    95 |  100 |    65
chair |    75 |   65 |   100
Another approach could be to adaptively update that matrix based on user selections. E.g. if a user searches for couch and then clicks chair, you can increase the couch-chair score by a defined increment (of course, you should also renormalize all scores after each update).
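A minimal sketch of that adaptive update (the word list, the bump size, and the renormalization rule are all assumptions):

import numpy as np

words = ["couch", "sofa", "chair"]
idx = {w: i for i, w in enumerate(words)}

# Start from the hand-made scores above
sim = np.array([
    [100.0,  95.0,  75.0],
    [ 95.0, 100.0,  65.0],
    [ 75.0,  65.0, 100.0],
])

def record_click(searched, clicked, bump=5.0):
    # The user searched one term and clicked a result for another:
    # raise their pairwise score, keeping the matrix symmetric
    i, j = idx[searched], idx[clicked]
    sim[i, j] += bump
    sim[j, i] += bump
    # Rescale so the largest score stays at 100
    sim[:] = sim / sim.max() * 100

record_click("couch", "chair")
print(sim.round(1))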

How can I represent a road system in software?

I'm looking to do some traffic simulation as a side project, but I'm having trouble coming up with ideas for how to represent the road itself. I like the idea of using a series of waypoints (straight lines using lat/long coords), but it seems difficult to represent different lanes of traffic this way. I've also been looking at some of the other traffic-simulation questions, and one of them mentions using a bitmap, but I'm having trouble seeing how that would let me easily assign real-world lengths to road segments, lane widths, etc. Does anyone have any helpful hints or other ideas that would allow a car to exist at a specific point on a road, be able to switch lanes, etc.?
I would start with a grid of connected nodes. A node would represent a change in the road conditions, like a crossing, a lane beginning or ending, a widening of the road itself, etc. Either you use complex connections that store all the information (lanes in both directions? how many lanes per direction? lane properties, etc.) or you store one connection per lane. To make sure that two connections on different sides of a node relate to the same lane, you can use lane IDs on a per-node basis.
This way you have a graph you can run calculations on, and you have all the data to visualize the whole network.
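A minimal sketch of that node/connection structure (all names here are hypothetical):

from dataclasses import dataclass, field

@dataclass
class Node:
    # A point where the road changes: a crossing, a lane starting or ending, ...
    node_id: int
    lat: float
    lon: float

@dataclass
class LaneConnection:
    # One lane between two nodes; lane_id ties the segments of a lane together
    start: Node
    end: Node
    lane_id: int
    width_m: float = 3.5

@dataclass
class RoadNetwork:
    nodes: dict = field(default_factory=dict)        # node_id -> Node
    connections: list = field(default_factory=list)  # LaneConnection objects

    def lanes_from(self, node):
        # All lane connections leaving a given node
        return [c for c in self.connections if c.start is node]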
It really depends on what you want to do with your model, so it's hard to come up with "the correct" answer here.
If you want to model congestions, you might not need a network at all. You can simulate that on a circular road.
And do you really need the concept of lanes? If you do, you could model them as separate lines between nodes, or maybe it's sufficient to just store the number of lanes per road.
Anyway, what I'm getting at is that you should first think a bit deeper about what you want to achieve before you start thinking about the exact data model.
In a previous job I was the lead developer on a driving simulator, in particular the road network modelling. I built what I called the Logical Road Network: an abstract description of the road network used for tracking vehicles along the road.
A lane was simply a path that followed the road but was offset by a positive or negative distance from the central path. Each road was either a straight or a curved section, and was essentially a path of central vertices with one or more offset lanes on either side. The autonomous cars could then follow a lane path.
In short, the polygons that made up the road were built around the central path along the road, e.g.
*------*------*
|\     |\     |
| \    | \    |
|  \   |  \   |
|   \  |   \  |
|    \ |    \ |
|     \|     \|
*------*------*
where * is a vertex, creating 4 polygons for this simple straight road segment.
Interpolation between 2 vertices along a path provided a simple way to move a vehicle in a given direction. On top of this simple path, we then introduced some fuzziness for the autonomous vehicles so that small deviations in the path emerged (creating more realistic traffic). Logically, vehicles were added to and removed from a road segment and vehicles could inspect the segment to see other vehicles in front, behind or on a different lane. This allowed some degree of AI within each vehicle, so that they could slow down behind another vehicle or wait for oncoming traffic to pass before making a turn.
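A minimal sketch of that interpolation-plus-offset idea, with 2D points and made-up numbers (the real simulator used its own structures):

import math

def lerp(p, q, t):
    # Linear interpolation between two (x, y) points, 0 <= t <= 1
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)

def lane_point(path, segment, t, offset):
    # Interpolate along the central path, then shift sideways by `offset`
    # metres; the sign of the offset selects the side of the road
    p, q = path[segment], path[segment + 1]
    x, y = lerp(p, q, t)
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length  # unit normal = sideways direction
    return (x + nx * offset, y + ny * offset)

# A car halfway along the first segment, in a lane 1.75 m off the centre line
path = [(0.0, 0.0), (50.0, 0.0), (100.0, 10.0)]
print(lane_point(path, 0, 0.5, 1.75))  # -> (25.0, 1.75)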
Not sure if this is exactly what you are after, but I hope it helps nonetheless :-)
