Counting the frequency of a word in Wikipedia - nlp

I need to extract information from Wikipedia, but I have no idea how to proceed. What I have to do is the following:
Given a word 'w', how can I count the number of times 'w' appears in the whole English Wikipedia? Is there a list already available online? If not, how could I do such a thing? I am new to coding and I'm trying to run some experiments on NLP-related tasks.

First, download the Wikipedia dump (in XML format, for example).
If you are using a UNIX-based OS (e.g. Linux or Mac OS X), you can use grep (see here).
Python can also be used to count occurrences of a given string in a file (see here).
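If you go the Python route, here is a minimal sketch, assuming the dump has already been extracted to plain text (for example with a tool like WikiExtractor) and that you want whole-word, case-insensitive counts; the file name and word below are just placeholders:

import re
import sys

def count_word(path, word):
    # Count whole-word, case-insensitive occurrences of `word` in a large text file,
    # streaming line by line so the whole dump never has to fit in memory.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    total = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            total += len(pattern.findall(line))
    return total

if __name__ == "__main__":
    print(count_word(sys.argv[1], sys.argv[2]))  # e.g. python count_word.py enwiki.txt linguistics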

Related

GNU Assembly: split a string of integers into integers

I'm working on a project for school.
The assignment is as follows:
Implement a sorting algorithm of your choosing in assembly (we are using the GNU Assembler). The input is a text file with a series of numbers separated by newlines.
I'm trying to implement insertion sort.
I have already opened and read the file, and I'm able to print the content to the terminal.
My problem now is how to split out each number from the file in order to compare and sort them.
I believe Google is glowing at the moment due to my efforts to find an answer (maybe I don't know what I need to type or where to look).
I have tried to get each character from the string, which I'm able to do, BUT I don't know how to put them together again as integers (we only have integers).
If anybody could help with some keywords to search for, it would be much appreciated.
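The keyword to search for is essentially atoi (string-to-integer conversion): walk the characters and, for each digit, accumulate value = value * 10 + (ch - '0'). A minimal sketch of that loop in Python, purely to show the logic you would translate into assembly:

def parse_integers(text):
    # Turn a newline-separated string of digits into a list of integers by
    # accumulating one decimal digit at a time -- the same loop works in assembly.
    numbers, value, in_number = [], 0, False
    for ch in text:
        if "0" <= ch <= "9":
            value = value * 10 + (ord(ch) - ord("0"))  # shift left one decimal place, add digit
            in_number = True
        else:
            if in_number:                              # a separator ends the current number
                numbers.append(value)
            value, in_number = 0, False
    if in_number:                                      # input may not end with a newline
        numbers.append(value)
    return numbers

print(parse_integers("42\n7\n1005\n"))  # [42, 7, 1005]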

A new idea on how to beat the 32,767 text limit in Excel

So, as many others have asked in the past: is there a way to beat the 32k limit per cell in Excel?
I have found ways to do it by splitting the workload into two different .txt files and then merging them, but it is a giant PITA, and more often than not I end up just using Excel up to its limits, because I no longer have time to validate the data after the .txt file merges; it is a long and tedious process, IMO.
However, I think the limitation is there because it was hard-coded when Microsoft developed Excel, and they have yet to raise it (in the 2013 version the limit is still the same, so it would do no good to upgrade).
I also know that many will say that if you need information of that length in a single cell then you should use Access. Well, I have no idea how to use Access or how to import a tab-delimited file into Access the way you would into Excel, and even if I could figure that out, I would still have to learn all the new commands and their Excel equivalents, if there even is such a thing.
So I was browsing some blog posts the other day about how to beat limitations imposed by software, and I read something about reverse engineering.
Would it be possible to load Excel into a hex editor and change every instance of 32767 to something greater?
While 32767 may seem like an arbitrary number, it's actually the upper limit of a 16-bit signed integer (called a short in C). The range of a short goes from -32768 to 32767.
A 16-bit integer can also be unsigned, in which case its range is 0 to 65535.
Since it's impossible for a cell to have a negative number of characters, it seems odd that Microsoft would limit a cell's length based on a signed rather than unsigned 16-bit integer. When they wrote the original program, they probably couldn't imagine anyone storing so much information in a single cell. Using shorts may have simplified the code. (My first computer had only 4K of memory, so it's still amazing to me that Excel can store 8 times that much information in a single cell.)
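The ranges fall straight out of the bit width, which is easy to sanity-check in Python:

bits = 16
print(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)  # signed (two's complement): -32768 32767
print(0, 2 ** bits - 1)                         # unsigned: 0 65535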
Microsoft may have kept the 32767 limit to maintain backward compatibility with previous versions of Excel. However, that doesn't really make sense, because the row and column counts greatly increased in recent versions of Excel, making large spreadsheets incompatible with previous versions.
Now to your question of reverse-engineering Excel. It would be a gargantuan task, but not impossible. In the early '90s, I reverse-engineered and wrote vaccines for a few small computer viruses (several hundred bytes). In the '80s, I reverse-engineered an 8KB computer chess program.
When reverse-engineering an executable, you'll need a good disassembler or decompiler. Depending on what you use, you may get assembly-language or C code as the output. But note that this will not be commented code, and you will not see meaningful variable or function names. You'll have to read every line of code to determine what it does. And you'll quickly discover that the executable is the least of your worries. Excel's executable links in a number of DLL files, which would also need reverse-engineering.
To be successful, you will need an extensive knowledge of Windows programming in addition to C or Intel assembly code – not to mention a large amount of patience. Learning Access would be a much simpler task.
I'd be interested in why 32767 is insufficient for your needs. A database may make more sense, and it wouldn't necessarily need to duplicate the functionality of Excel. I store information in a database for output to Web pages, in which case I use HTML+JavaScript for anything that needs to be interactive.
In case anyone is still having this issue:
I had the same problem when generating a pipe-separated file of longitudinal research data: the header row exceeded the 32767 limit. It's not an issue unless the end user opens the file in Excel. The workaround is to have the end user open the file in Google Sheets, perform the text-to-columns transformation, then download the file and open it in Excel.
https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Length-limit-of-cell-contents-in-Excel-when-opening-exported-bibliographic-data?language=en_US
Jack Straw from Wichita (https://stackoverflow.com/users/10327211/jack-straw-from-wichita), surely you can import a pipe-separated file directly into Excel using Data > Get Data? For me it finds the pipe and treats the piped file in the same way as a CSV. Even if for you it does not, you have an option on import to specify the separator used in your text file.
Kind regards
Sefton Hall

How to compare text files character by character using Linux

I am pretty new to Linux, and I would like to know how I can compare two text files in Linux in order to find the first character at which they differ. Any help will be appreciated.
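On Linux itself, cmp file1 file2 will report the first differing byte and line. If you would rather script it, here is a minimal Python sketch (assuming both files are small enough to read into memory) that prints the index of the first character at which the files differ:

import sys

def first_difference(path_a, path_b):
    # Return (index, char_a, char_b) for the first differing character, or None if identical.
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        a, b = fa.read(), fb.read()
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):               # one file is a prefix of the other
        return min(len(a), len(b)), None, None
    return None

if __name__ == "__main__":
    print(first_difference(sys.argv[1], sys.argv[2]))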

Build a dictionary from two sentenced-aligned texts?

I have a text corpus which is already aligned at sentence level by construction - it is a list of pairs of English strings and their translation in another language. I have about 10 000 strings of 5 - 20 words each and their translations. My goal is to try to build a metric of the quality of the translation - automatically of course, because I'm dealing with languages I know nothing about :)
I'd like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect, but I'm hoping I can have something good enough to flag when a word is not consistently translated. For example, if my dictionary says "Store" is to be translated into French as "Magasin", then if I spot some place where "Store" is translated as "Boutique", I can suspect that something is wrong.
So I'd need to:
build a dictionary from my corpus
align the words inside the string/translation pairs
Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level...
Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!
Thanks in advance.
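As a very rough baseline before reaching for a proper alignment tool, you could simply count which target-language words co-occur most often with each English word across your sentence pairs. A minimal sketch, assuming whitespace-tokenisable text and a list of (english, foreign) string pairs; note that frequent function words tend to dominate this naive count, which is exactly the weakness the alignment models discussed below address:

from collections import Counter, defaultdict

def cooccurrence_dictionary(pairs):
    # pairs: iterable of (english_sentence, foreign_sentence) strings.
    counts = defaultdict(Counter)
    for en, fr in pairs:
        fr_tokens = fr.lower().split()
        for e in set(en.lower().split()):
            counts[e].update(fr_tokens)          # count every co-occurring target word
    # Keep the most frequently co-occurring target word for each English word
    return {e: c.most_common(1)[0][0] for e, c in counts.items()}

pairs = [("the store is open", "le magasin est ouvert"),
         ("the store is closed", "le magasin est ferme")]
print(cooccurrence_dictionary(pairs)["store"])   # likely "le": function words swamp the counts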
A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. It trains the well-known IBM models mentioned in other answers, as well as other statistical models.
You can download it from the GIZA++ site at Google Code, and there is a brief introduction to its usage on the GIZA++ page of the Apertium wiki. It boils down to this procedure:
Create your parallel corpus, sentence-aligned (you seem to have this already)
Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
(Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.
Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
A few final notes:
A word can have more than one valid translation; you shouldn't assume that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;
One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;
GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
This is a pretty standard statistical machine translation problem called 'word alignment'.
There are a bunch of EM-based models developed by researchers at IBM, which I think are the basis for most of the other, cooler models being developed today.
Google for 'ibm word alignment models' to find out about IBM Models 1 to 5.
This presentation - http://www.stanford.edu/class/cs224n/handouts/cs224n-lecture-05-2011-MT.pdf seems like a good place to start.
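For reference, IBM Model 1 itself is only a few lines of EM. Here is a minimal sketch in Python (it ignores the NULL word and other refinements, so treat it as illustrative rather than production-ready; the toy sentence pairs are just placeholders):

from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    # pairs: list of (source_tokens, target_tokens); returns t[(e, f)] ~ P(f | e).
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)                 # translation probabilities
    for _ in range(iterations):
        count = defaultdict(float)                   # expected counts
        total = defaultdict(float)                   # per-source-word normalisers
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(e, f)] for e in src)      # how much the source words jointly "explain" f
                for e in src:
                    delta = t[(e, f)] / z
                    count[(e, f)] += delta
                    total[e] += delta
        t = defaultdict(lambda: uniform,
                        {ef: count[ef] / total[ef[0]] for ef in count})
    return t

pairs = [("the store".split(), "le magasin".split()),
         ("the house".split(), "la maison".split()),
         ("a store".split(), "un magasin".split())]
t = ibm_model1(pairs, iterations=20)
print(max((f for (e, f) in t if e == "store"), key=lambda f: t[("store", f)]))  # "magasin" on this toy data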
Are you using spaces between words? Whatever character you are using as a delimiter, you might check out the cut command in Linux. It gives you the ability to filter out words between spaces and other delimiters.

3d cloud viewer?

I have a file which contains a bunch of points with their x, y, z locations. I am looking for a simple viewer in which I can load this point data and view it. Rotation is essential for me to check the depth of the cloud being generated. Can someone point out a lightweight viewer with minimal installation overhead for this?
I have used MeshLab and it worked well for me. IIRC it uses your average Windows installer.
You could also try CyArk viewer (a Java applet), or Leica Cyclone -- I haven't used either one.
Of course if your data format is not a standard one, they may not be able to read it.
R is an open-source statistical program that I have used for this exact purpose. It can be accomplished in only a few lines of code.
First install the rgl and plotrix packages, then enter the following code:
library(rgl)       # provides plot3d()
library(plotrix)   # provides color.scale()
pcd <- read.table(file.choose(), sep="", skip=10)   # skip the 10-line header
names(pcd) <- c("x", "y", "z")
plot3d(pcd$x, pcd$y, pcd$z, col=color.scale(pcd$z, c(0,1,1), c(1,1,0), c(1,0,1)))   # color by depth
Here pcd (named after the point-cloud data file type, if I remember correctly) is the data frame holding the points; the first ten lines are skipped because they are a header (skip=10), and sep="" means whitespace is the delimiter used in the file. The last line of code plots the points and sets the color based upon depth.
I vote for Paraview. I am shocked that no one mentioned it before I did.
No matter which OS you use (Windows, Mac OS or Linux), you can use it without any (big) problems. (You know, software always has bugs.)
MeshLab is good too. In fact, you can convert file formats easily to make sure they can be used in different software, if you learn a bit of Python.
I do believe someone has already done it, e.g. .off (MeshLab format) to .vtk (ParaView format), like this one.
Update1:
Most visualization software is user-friendly, so maybe your problem is mainly about how to convert your source data into a format that these viewers can read. It would help if you could show an example of the data you have.
MATLAB is quite a good tool to visualize your point cloud, especially if you want to do further analysis.
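If you already have Python around and just want a quick, rotatable view, matplotlib can also do the job for small-to-medium clouds. A minimal sketch, assuming a plain whitespace-separated x y z file (the file name here is just a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older matplotlib)

pts = np.loadtxt("points.xyz")                 # N x 3 array of x, y, z coordinates
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")     # drag with the mouse to rotate
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=pts[:, 2], s=1)  # color by depth (z)
ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
plt.show()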

Resources