Where to find the full documentation for tcp.h? - linux

The Kernel has some nice TCP statistics capabilities in tcp.h. For example, one can obtain the round trip time (RTT) of the previous TCP communication by reading from tcpi_rtt.
However, not all of these metrics are documented in tcp.h. Where can I find the full documentation for all the metrics from tcp.h?
For example, what exactly does tcpi_snd_ssthresh record?

Where can I find the full documentation for all the metrics from tcp.h?
I believe you have found the "full documentation" already in the source code. More documentation, if any, will be in the Documentation/ directory inside the kernel source tree, but the best way is to ask the maintainer.
what exactly does tcpi_snd_ssthresh record?
Let's first search for tcpi_snd_ssthresh on Elixir; we find a single assignment of the member. We then go to struct tcp::snd_ssthresh, and I also see a comment above this assignment mentioning draft-stevens-tcpca-spec-01. Searching Google for draft-stevens-tcpca-spec-01, I found a GitHub repo whose rfc2001.txt explains the "slow start" algorithm - ssthresh (the slow start threshold) is one of the parameters of that algorithm.
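To make that concrete, here is a small, purely illustrative Java simulation of the behaviour RFC 2001 describes: the congestion window (cwnd) grows by one segment per ACK while it is at or below the slow start threshold, and by roughly 1/cwnd segments per ACK (congestion avoidance) once it is above it. This is a toy model, not kernel code; tcpi_snd_ssthresh simply reports the connection's current value of that threshold.
// Toy simulation of slow start vs. congestion avoidance (RFC 2001), in segments.
// Illustrative only; the kernel tracks these per socket and exposes the current
// threshold through tcpi_snd_ssthresh.
public class SlowStartDemo {
    public static void main(String[] args) {
        double cwnd = 1.0;     // congestion window, in segments
        int ssthresh = 16;     // slow start threshold, in segments

        for (int ack = 1; ack <= 40; ack++) {
            if (cwnd <= ssthresh) {
                cwnd += 1.0;            // slow start: +1 segment per ACK (doubles every RTT)
            } else {
                cwnd += 1.0 / cwnd;     // congestion avoidance: roughly +1 segment per RTT
            }
            System.out.printf("after ACK %2d: cwnd=%6.2f  ssthresh=%d%n", ack, cwnd, ssthresh);
        }
    }
}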


Viewing Linux TCP TCB

I need to find out information that should be kept in the TCP transmission control block (TCB); specifically, I need to find out what sequence numbers are used for any particular session.
I have posted to other forums, looked through the procfs, searched Google, sent myself links from lmgtfy (dot) com :) No luck.
If there is no tool or hints in procfs, would it be possible to somehow find out where this sort of information lives in memory and gather it from there, for example by using dd to copy /dev/mem?
Thanks for any help on this in advance!!!!!
Well, I guess you first need to know what a sequence number is and why it is used; then you can look at a particular implementation of sequence number generation.
The sequence number is a 32-bit field used to mark every segment uniquely so that it can be acknowledged, and being acknowledged is an important feature of TCP for maintaining connection reliability. Full details can be found in the TCP RFC (http://www.ietf.org/rfc/rfc793.txt - section 3.3).
Now, if you need to find out how Linux does it, look at net/ipv4/tcp_ipv4.c::tcp_v4_init_sequence(); it is used for generating the ISN (Initial Sequence Number) whenever a new connection is to be established, while how later sequence numbers are generated is explained in the RFC. So look at the implementation of tcp_v4_init_sequence() together with the RFC; that will help you understand the use and implementation of sequence numbers. Hope this helps!
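As a very rough illustration of the general idea behind ISN generation (a keyed hash over the connection 4-tuple combined with a clock component, so that ISNs are hard to predict yet keep increasing), here is a toy Java sketch. It is not the kernel's code, and the secret, hash and field widths are just assumptions for the example; for the real thing, read tcp_v4_init_sequence() and the RFC as suggested above.
import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class ToyIsn {
    // Per-boot random secret; purely a placeholder in this sketch.
    private static final byte[] SECRET = "replace-with-a-random-boot-time-secret".getBytes();

    // Toy ISN generator: hash(saddr, daddr, sport, dport, secret) plus a timer
    // component, truncated to 32 bits. Illustrative only, not the kernel's algorithm.
    public static int initialSequenceNumber(int saddr, int daddr, short sport, short dport)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        ByteBuffer buf = ByteBuffer.allocate(12 + SECRET.length);
        buf.putInt(saddr).putInt(daddr).putShort(sport).putShort(dport).put(SECRET);
        int hash = ByteBuffer.wrap(md.digest(buf.array())).getInt();
        // RFC 793 describes an ISN clock that ticks roughly every 4 microseconds.
        int clock = (int) (System.nanoTime() / 4_000L);
        return hash + clock;  // wraps around naturally at 32 bits
    }

    public static void main(String[] args) throws Exception {
        int isn = initialSequenceNumber(0x0A000001, 0x0A000002, (short) 43210, (short) 80);
        System.out.printf("ISN: %d%n", isn & 0xFFFFFFFFL);  // printed as an unsigned 32-bit value
    }
}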

what algorithm does freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps. I'm wondering what algorithm they use to match names. As an example, if you go to freebase.com and type in "Hiking" you get:
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google, the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
Probably they use an inverted index over selected fields, such as the English name, the aliases and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
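To make the inverted-index idea concrete, here is a tiny, purely illustrative Java sketch (the documents and tokenization are made up): each term maps to the set of document ids containing it, and a multi-term query is answered by intersecting those sets. A real engine like Lucene adds analysis, ranking, compression and much more on top of this.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class TinyInvertedIndex {
    // term -> sorted set of ids of documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index one document: record its id under every lower-cased token of the text.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of documents containing all of the query terms (an AND query).
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            SortedSet<Integer> docs = postings.get(token);
            if (docs == null) return Collections.emptySet();  // unknown term: no match
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Apo Hiking Society");
        index.add(2, "Hiking Georgia");
        index.add(3, "Hiking trail");
        System.out.println(index.search("hiking"));        // [1, 2, 3]
        System.out.println(index.search("hiking trail"));  // [3]
    }
}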
Most likely it's a trie with lexicographical order.
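If it were a plain prefix trie, a lookup might look like this illustrative Java sketch (a TreeMap keeps the children in lexicographical order):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixTrie {
    // TreeMap keeps children ordered, so results come back lexicographically.
    private final Map<Character, PrefixTrie> children = new TreeMap<>();
    private String name;  // set on nodes that end a complete stored name

    public void insert(String fullName) {
        PrefixTrie node = this;
        for (char c : fullName.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new PrefixTrie());
        }
        node.name = fullName;
    }

    // Collect every stored name that starts with the given prefix.
    public List<String> complete(String prefix) {
        PrefixTrie node = this;
        for (char c : prefix.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        List<String> results = new ArrayList<>();
        node.collect(results);
        return results;
    }

    private void collect(List<String> results) {
        if (name != null) results.add(name);
        for (PrefixTrie child : children.values()) child.collect(results);
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        for (String s : new String[] {"Hiking", "Hiking trail", "Hiking Georgia", "Apo Hiking Society"}) {
            trie.insert(s);
        }
        System.out.println(trie.complete("hik"));  // [Hiking, Hiking Georgia, Hiking trail]
    }
}
Note that a pure prefix trie would not return "Apo Hiking Society" for the query "Hiking", yet the real service does, which suggests it also indexes individual tokens and aliases rather than matching on whole-name prefixes only.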
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to read up on edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the SimMetrics library by the University of Sheffield.
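As an illustration of the edit distance idea, the Levenshtein distance can be computed with the standard dynamic-programming recurrence; a minimal Java version:
public class Levenshtein {
    // Classic dynamic-programming edit distance: the cost of turning a into b
    // using single-character insertions, deletions and substitutions.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("hikng", "hiking"));  // 1
    }
}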

Open source projects for email scrubbing generating structured data from unstructured source?

Don't know where to start on this one, so hopefully you guys can clear up my question. I have a project where email will be searched for specific words/patterns and the results stored in a structured manner. Something like what is done with TripIt.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction, but it was just too broad a definition and didn't help me understand how one would go about solving such a problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content, I'll not answer here. Basically, I'll assume that you have access to the text of the emails; if you don't, there are libraries that allow you to connect Java to a mailbox, such as Camel (http://camel.apache.org/mail.html).
So now you've got the email so then what?
A handy thing that could help is that LingPipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). Inside the LingPipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
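Based on that dictionary-extractor tutorial, a rough sketch of the approach might look like the following (the phrases and labels here are invented for the example, and exact class names and constructor arguments can differ between LingPipe versions, so treat this as an outline rather than code written against a specific release):
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

// Build a dictionary of the phrases you care about, each tagged with a label
// (the phrases and labels below are made-up examples).
MapDictionary<String> dictionary = new MapDictionary<String>();
dictionary.addEntry(new DictionaryEntry<String>("confirmation number", "BOOKING_FIELD", 1.0));
dictionary.addEntry(new DictionaryEntry<String>("departure", "FLIGHT_FIELD", 1.0));

ExactDictionaryChunker chunker = new ExactDictionaryChunker(
        dictionary, IndoEuropeanTokenizerFactory.INSTANCE,
        true,    // return all matches
        false);  // case-insensitive matching

String emailText = "...";  // the message body you pulled from the mailbox
Chunking chunking = chunker.chunk(emailText);
for (Chunk chunk : chunking.chunkSet()) {
    String phrase = emailText.substring(chunk.start(), chunk.end());
    System.out.println(chunk.type() + ": " + phrase);
}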
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like parts of speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
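As a starting point for that rule-based approach, a few deliberately simple regular expressions in Java might look like this (the patterns are illustrative examples only; real email, phone and date formats are far more varied):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleFieldExtractor {
    // Deliberately simple example patterns; production rules would need to cover
    // many more formats and locales.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE = Pattern.compile("\\(?\\d{3}\\)?[ .-]?\\d{3}[ .-]?\\d{4}");
    private static final Pattern DATE  = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}");

    public static void extract(String label, Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            System.out.println(label + ": " + m.group());
        }
    }

    public static void main(String[] args) {
        String body = "Your flight departs 09/14/2012. Questions? Call (555) 123-4567 "
                    + "or write to support@example.com.";
        extract("email", EMAIL, body);
        extract("phone", PHONE, body);
        extract("date",  DATE,  body);
    }
}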
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.

Searching for information about the Go language

I was trying to look for information about generic database drivers for Google's Go language when I noticed I was having a hard time getting any results.
Go SQL returns nothing related to the Go language, and golang SQL only returns useful results from the mailing list (and not, say, from GitHub).
Is there any smarter way I can look for information about the Go language?
One of the creators said that search engines will recognize the context of the overloaded word "go" and that will eliminate my problem, but I say: why bother? go issue9!
Take a look at the dashboard: http://godashboard.appspot.com/project. It's the best overview of current Go related projects. Another one is http://go-lang.cat-v.org/.
mue

"Did you mean?" feature in Lucene.net

Can someone please let me know how to implement the "Did you mean" feature in Lucene.net?
Thanks!
You should look into the SpellChecker module in the contrib dir. It's a port of Java Lucene's SpellChecker module, so its documentation should be helpful.
(From the javadocs:)
Example Usage:
import java.io.File;

import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
AFAIK Lucene supports fuzzy search, meaning that if you use something like:
field:stirng~0.5
(that is a tilde sign)
it will match "string". The float is how "tolerant" the search is, where 1.0 is an exact match and 0.0 matches everything (sort of).
Different parsers will, however, implement this differently.
A fuzzy search is much slower than a prefix/wildcard search (stri*), so use it with caution. In your case, one would assume that if you find no matches on a regular search, you try a fuzzy search to see what you find, and present "did you mean" based on the result somehow.
It might be useful to cache these sorts of lookups for very common misspellings, for performance reasons.
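A sketch of that fallback, shown with the Java Lucene classes for illustration (Lucene.net mirrors them closely; note that recent Java Lucene expresses fuzziness as a maximum edit distance, while older versions and Lucene.net take a similarity float as in the syntax above, so adjust accordingly):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// searcher is assumed to be an IndexSearcher over your index; "title" is just an example field name.
Query exact = new TermQuery(new Term("title", "stirng"));
TopDocs hits = searcher.search(exact, 10);
if (hits.scoreDocs.length == 0) {
    // No exact matches: retry with a fuzzy query (here max edit distance 2)
    // and present its top results as "did you mean" candidates.
    Query fuzzy = new FuzzyQuery(new Term("title", "stirng"), 2);
    hits = searcher.search(fuzzy, 10);
}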
Google's "Did you mean?" is (probably; they're secretive, of course) implemented by consulting their query log. Look to see if people who searched for the query you're processing searched for something very similar soon after; if so, it indicates they made a mistake, and realized what they ought to be searching for.
Since you probably don't have a huge query log, you could approximate it. Take the query, split up the terms, see if there are any similar terms in the database (by edit distance, whatever); replace your terms with those nearby terms, and rerun the query. If you get more hits, that was probably a better query. Suggest it to the user. (And since you've already got the hits, and most people only look at the top 2 results, show them those.)
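A rough sketch of that approximation, reusing the SpellChecker from the earlier snippet (Java Lucene for illustration; spellchecker, searcher, originalQuery and the "title" field are assumed to already exist, and BooleanQuery.Builder is the recent-Lucene way of composing the corrected query):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spell.SpellChecker;

// For each user term, keep it if the index already knows it, otherwise substitute
// the spell checker's best suggestion; then rerun and compare hit counts.
String[] userTerms = "hikng trail".split("\\s+");
BooleanQuery.Builder corrected = new BooleanQuery.Builder();
boolean changed = false;
for (String term : userTerms) {
    String replacement = term;
    if (!spellchecker.exist(term)) {
        String[] suggestions = spellchecker.suggestSimilar(term, 1);
        if (suggestions.length > 0) {
            replacement = suggestions[0];
            changed = true;
        }
    }
    corrected.add(new TermQuery(new Term("title", replacement)), BooleanClause.Occur.MUST);
}
if (changed) {
    int originalHits = searcher.search(originalQuery, 10).scoreDocs.length;
    int correctedHits = searcher.search(corrected.build(), 10).scoreDocs.length;
    if (correctedHits > originalHits) {
        // Offer: Did you mean "<the corrected terms>"?
    }
}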
Take a look at the Google Code project called semanticvectors.
There's a decent amount of discussion on the Lucene mailing lists about building functionality like what you're after with it; however, it is written in Java.
You will probably have to parse your search logs and apply some machine learning algorithms to them to build a feature like this!

Resources