How to split comma-separated words? - text

How can I split my words onto new lines (I have a lot of them)? They are currently separated with commas.
Example: my file contains the words in a single line:
Viktor, Vajt, Adios, Test, Line, Word1, Word2, etc...
The output file should look like:
Viktor
Vajt
Adios
Test
...

If you are using Notepad++, this can easily be done with Find & Replace: switch the search mode to "Extended" and replace ", " with "\n".

If you – for some reason – want to stick with the .doc format (to keep formatting, etc.), you could use LibreOffice (http://de.libreoffice.org/) and do the same kind of search-and-replace there.
I agree installing LibreOffice just for this replacement would be overkill though.

Not sure what language you are working in, but you could use an explode/split function, which would create an array of values split at ','. Then you could loop through the array and append the newline character "\n". You would wind up with something like:
$fileContentsAsString = file_get_contents('words.txt'); // read file into a string ('words.txt' is a placeholder name)
$valuesArray = explode(',', $fileContentsAsString);     // split at the commas
$outputString = '';
foreach ($valuesArray as $item) {
    $outputString .= trim($item) . "\n";                // trim() drops the space after each comma
}
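If you happen to be working in Python instead, a minimal sketch of the same split-and-rejoin idea might look like this (the file names are placeholders):
# Minimal sketch: read a comma-separated file and write one word per line.
# 'words.txt' and 'words_out.txt' are placeholder file names.
with open('words.txt') as infile:
    contents = infile.read()
words = [word.strip() for word in contents.split(',')]  # split on commas, drop surrounding spaces
with open('words_out.txt', 'w') as outfile:
    outfile.write('\n'.join(words) + '\n')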

For quick text editing I use an online tool (http://regexptool.org/). You can also do it step by step there.

Related

How to detect what kind of line break is used in a text file in Python?

My problem is the following. I have a text file with a bunch of lines in it. The problem is this text might have been created by Windows or Unix or Mac.
I want to open this text in Python (as a single string) and split it on line breaks so I end up with an array of all the lines. The problem is I have only tested this with a Windows-created file, so I can split the string easily on \n. But if I understand correctly, other environments use \r or \r\n, etc.
I want a general solution where I can detect what kind of line break is used in a file before I start splitting in order to split it correctly. Is that possible to do?
thanks;
UNIX_NEWLINE = '\n'
WINDOWS_NEWLINE = '\r\n'
MAC_NEWLINE = '\r'
This is how the different operating systems write line breaks in a file and how Python sees them.
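As a sketch of one way to sidestep the detection step entirely (the file name below is a placeholder): Python's str.splitlines() already recognizes \n, \r\n and \r, and opening a file in text mode translates them all to \n via universal newlines.
# Minimal sketch: split a text block into lines regardless of the newline convention.
# 'some_file.txt' is a placeholder name.
with open('some_file.txt', newline='') as f:   # newline='' keeps the original \r\n / \r / \n intact
    block = f.read()
lines = block.splitlines()                     # handles \n, \r\n and \r uniformly
# If you really want to know which convention the file uses, you can still check:
if '\r\n' in block:
    detected = 'Windows (\\r\\n)'
elif '\r' in block:
    detected = 'old Mac (\\r)'
else:
    detected = 'Unix (\\n)'
print(detected, len(lines), 'lines')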

Python - How do I separate data into multiple lines

I have two strings that I want to put into a txt file, but when I try to write them, everything ends up on the first line. I want the strings to be on separate lines; how do I do that?
Here is the writing part of my code btw:
saveFile = open('points.txt', 'w')
saveFile.write(str(jakesPoints))
saveFile.write(str(alexsPoints))
saveFile.close()
If jakesPoints was 10 and alexsPoints was 12, then the text file would be
1012
but I want it to be
10
12
You can use a newline character (\n) to move to a new line. For your example:
with open('points.txt', 'w') as saveFile:
    saveFile.write("{}\n".format(jakesPoints))
    saveFile.write("{}\n".format(alexsPoints))
The other things to note:
It is helpful to open files using with - this will take care of opening and closing the file automatically (which is typically preferred over trying to remember to .close()).
The {}.format() part is used to convert your numbers to strings and add the newline character. I found that https://pyformat.info/ explains the string formatters pretty well and highlights all the main advantages.
with open('points.txt', 'w') as saveFile:
    saveFile.write(str(jakesPoints))
    saveFile.write("\n")
    saveFile.write(str(alexsPoints))
See the difference between the 'w' and 'a' modes used in open(). Also see join().
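For completeness, a small sketch of that join() variant (variable names taken from the question; the scores are just example values):
# Minimal sketch of the join() approach mentioned above.
jakesPoints, alexsPoints = 10, 12                     # example values
with open('points.txt', 'w') as saveFile:             # 'w' overwrites; 'a' would append instead
    saveFile.write('\n'.join([str(jakesPoints), str(alexsPoints)]) + '\n')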

Using multiple delimiters with scanner - Java

I'm trying to use both tabs and newlines as delimiters to read from a .txt file. What I have at the moment is:
Scanner fileScanner = new Scanner(new FileReader("propertys.txt"));
fileScanner.useDelimiter("[\\t\\n]");
I've tried:
fileScanner.useDelimiter("\\t|\\n");
and
fileScanner.useDelimiter("[\\t|\\n]");
I've got no idea what's going wrong, I've searched around a lot and it looks like one of those should be working. Clearly I'm doing something wrong.
fileScanner.useDelimiter("\t|\n");
should work.
If you have two backslashes ("\\n"), the first acts as an escape and it won't work right.
For the regular expression used as a parameter in the useDelimiter method, you should use newline as \n instead of \\n and tab as \t instead of \\t. See the Java Pattern class: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html.
Apart from that, I think you should define your regular expression like this, for example:
fileScanner.useDelimiter("\\s*[\t\n]\\s*");
so that any surrounding whitespace (\\s*) around the newline or tab characters is consumed as well.

Embedding text in AS2, like HEREDOC or CDATA

I'm loading a text file into a string variable using LoadVars(). For the final version of the code I want to be able to put that text as part of the actionscript code and assign it to the string, instead of loading it from an external file.
Something along the lines of HEREDOC syntax in PHP, or CDATA in AS3 ( http://dougmccune.com/blog/2007/05/15/multi-line-strings-in-actionscript-3/ )
A quick and dirty solution I've found is to put the text into a text object in a movie clip and then get its value, but I don't like it.
Btw: the text is multiline, and can include single quotes and double quotes.
Thanks!
I think in AS2 the only way is to do it the dirty way. In AS3 you can embed resources with the Embed tag, but as far as I know not in AS2.
If it's a final version and it means you don't want to edit the text anymore, you could escape the characters and use \n as a line break.
var str = "\'one\' \"two\"\nthree";
trace(str);
outputs:
'one' "two"
three
Now just copy the text into your favourite text editor and change every ' and " to \' and \", also the line breaks to \n.
Cristian, anemgyenge's solution works when you realize it's a single line. It can be selected and replaced in a simple operation.
Don't edit the doc in the code editor. Edit the doc in a doc editor and create a process that converts it to a long string (say running it through a quick PHP script). Take the converted string and paste it in over the old string. Repeat as necessary.
It's way less than ideal from a long-term management perspective, especially if code maintenance gets handed off without you handing off the parser, but it gets around some of your maintenance issues.
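As an illustration of such a conversion step (the original suggestion was a quick PHP script; this is just a hypothetical Python equivalent, and the file name is a placeholder), a tiny helper could read the doc as plain text and print it as a single escaped one-line string you can paste into the AS2 source:
# Hypothetical helper: turn a plain-text file into a single escaped string literal
# suitable for pasting into ActionScript 2 source. 'mytext.txt' is a placeholder name.
with open('mytext.txt') as f:
    text = f.read()
escaped = (text.replace('\\', '\\\\')   # escape backslashes first
               .replace('"', '\\"')     # escape double quotes
               .replace("'", "\\'")     # escape single quotes
               .replace('\r\n', '\\n')  # normalize Windows line breaks
               .replace('\n', '\\n')
               .replace('\r', '\\n'))
print('var str = "%s";' % escaped)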
Use a new pair of quotes on each line and add a space as the word delimiter:
var foo = "Example of string " +
"spanning multiple lines " +
"using heredoc syntax."
There is a project which may help that adds partial E4X support to ActionScript 2:
as24x
As well as a project which adds E4X support to Haxe, which can compile to a JavaScript target:
E4X Macro for Haxe

Determining Word Frequency of Specific Terms

I'm a non-computer-science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to determine changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100) and for consistency's sake would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a "stop list").
This should be kept very simple. At the end all I need to have is the frequencies of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file) so that I can then create graphs and visualizations using that data.
I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.
I see two ways to solve this problem:
Find a way to strip out all the words in a text file EXCEPT for the pre-defined list and then do the frequency count from there, or:
Find a way to do a frequency count using just the terms from the pre-defined list.
Any ideas?
I would go with the second idea. Here is a simple Perl program that will read a list of words from the first file provided and print a count of each word in the list from the second file provided in tab-separated format. The list of words in the first file should be provided one per line.
#!/usr/bin/perl
use strict;
use warnings;
my $word_list_file = shift;
my $process_file = shift;
my %word_counts;
# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
open(WORDS, $word_list_file) or die "Failed to open list file: $!\n";
while (<WORDS>) {
    chomp;
    # Store words in lowercase for case-insensitive match
    $word_counts{lc($_)} = 0;
}
close(WORDS);
# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(FILE, $process_file) or die "Failed to open process file: $!\n";
while (<FILE>) {
    chomp;
    while ( /-$/ ) {
        # If the line ends in a hyphen, remove the hyphen and
        # continue reading lines until we find one that doesn't
        chop;
        my $next_line = <FILE>;
        defined($next_line) ? $_ .= $next_line : last;
    }
    my @words = split /\b/, lc; # Split the lower-cased version of the string
    foreach my $word (@words) {
        $word_counts{$word}++ if exists $word_counts{$word};
    }
}
close(FILE);
# Print each word in the hash in alphabetical order along with the
# number of time encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts)
{
    print "$word\t$word_counts{$word}\n";
}
If the file words.txt contains:
linux
frequencies
science
words
And the file text.txt contains the text of your post, the following command:
perl analyze.pl words.txt text.txt
will print:
frequencies 3
linux 1
science 1
words 3
Note that breaking on word boundaries using \b may not work the way you want in all cases, for example, if your text files contain words that are hyphenated across lines you will need to do something a little more intelligent to match these. In this case you could check to see if the last character in a line is a hyphen and, if it is, just remove the hyphen and read another line before splitting the line into words.
Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.
Note that if there are hyphenated words, some of which are broken across lines and some that are not, this won't find them all because it only removed hyphens at the end of a line. In this case you may want to just remove all hyphens and match words after the hyphens are removed. You can do this by simply adding the following line right before the split function:
s/-//g;
I do this sort of thing with a script like the following (in bash syntax):
for file in *.txt
do
sed -r 's/([^ ]+) +/\1\n/g' "$file" \
| grep -F -f 'go-words' \
| sort | uniq -c > "${file}.frq"
done
You can tweak the regex you use to delimit individual words; in the example I just treat whitespace as the delimiter. The -f argument to grep is a file that contains your words of interest, one per line.
First familiarize yourself with lexical analysis and how to write a scanner generator specification. Read the introductions to using tools like YACC, Lex, Bison, or my personal favorite, JFlex. Here you define what constitutes a token. This is where you learn about how to create a tokenizer.
Next you have what is called a seed list. The opposite of the stop list is usually referred to as the start list or limited lexicon. Lexicon would also be a good thing to learn about. Part of the app needs to load the start list into memory so it can be quickly queried. The typical way to store it is a file with one word per line; read this in once at the start of the app into something like a map. You might want to learn about the concept of hashing.
From here you want to think about the basic algorithm and the data structures necessary to store the result. A distribution is easily represented as a two dimensional sparse array. Learn the basics of a sparse matrix. You don't need 6 months of linear algebra to understand what it does.
Because you are working with larger files, I would advocate a stream-based approach. Don't read the whole file into memory. Read it as a stream into the tokenizer that produces a stream of tokens.
In the next part of the algorithm think about how to transform the token list into a list containing only the words you want. If you think about it, the list is in memory and can be very large, so it is better to filter out non-start-words at the start. So at the critical point where you get a new token from the tokenizer and before adding it to the token list, do a lookup in the in-memory start-words-list to see if the word is a start word. If so, keep it in the output token list. Otherwise ignore it and move to the next token until the whole file is read.
Now you have a list of tokens only of interest. The thing is, you are not looking at other indexing metrics like position and case and context. Therefore, you really don't need a list of all tokens. You really just want a sparse matrix of distinct tokens with associated counts.
So, first create an empty sparse matrix. Then think about inserting the newly found token during parsing. When it occurs, increment its count if it's in the list, or otherwise insert a new token with a count of 1. Then, at the end of parsing the file, you have a list of distinct tokens, each with a frequency of at least 1.
That list is now in-mem and you can do whatever you want. Dumping it to a CSV file would be a trivial process of iterating over the entries and writing each entry per line with its count.
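To make that flow concrete, here is a minimal Python sketch of the same idea (not a definitive implementation; the file names are placeholders): stream the file, tokenize, keep only words from the start list, and count them.
# Minimal sketch of the stream-tokenize-filter-count approach described above.
# 'WordList' holds one start word per line; 'input.txt' is the text to analyze.
# Both file names are placeholders.
import re
from collections import Counter
with open('WordList') as f:
    start_words = {line.strip().lower() for line in f if line.strip()}  # start list as a hash set
counts = Counter()
with open('input.txt') as f:
    for line in f:                                            # stream line by line, never the whole file
        for token in re.findall(r"[a-z']+", line.lower()):    # crude tokenizer
            if token in start_words:                          # filter against the start list
                counts[token] += 1
# Dump the distinct tokens and their frequencies as CSV
for word in sorted(start_words):
    print('%s,%d' % (word, counts[word]))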
For that matter, take a look at the non-commercial product called "GATE" or a commercial product like TextAnalyst or products listed at http://textanalysis.info
I'm guessing that new files get introduced over time, and that's how things change?
I reckon your best bet would be to go with something like your option 2. There's not much point pre-processing the files, if all you want to do is count occurrences of keywords. I'd just go through each file once, counting each time a word in your list appears. Personally I'd do it in Ruby, but a language like perl or python would also make this task pretty straightforward. E.g., you could use an associative array with the keywords as the keys, and a count of occurrences as the values. (But this might be too simplistic if you need to store more information about the occurrences).
I'm not sure if you want to store information per file, or about the whole dataset? I guess that wouldn't be too hard to incorporate.
I'm not sure what to do with the data once you've got it -- exporting it to a spreadsheet would be fine, if that gives you what you need. Or you might find it easier in the long run just to write a bit of extra code that displays the data nicely for you. It depends on what you want to do with the data (e.g. if you want to produce just a few charts at the end of the exercise and put them into a report, then exporting to CSV would probably make the most sense, whereas if you want to generate a new set of data every day for a year, then building a tool to do that automatically is almost certainly the best idea).
Edit: I just figured out that since you're studying history, the chances are your documents are not changing over time, but rather reflect a set of changes that happened already. Sorry for misunderstanding that. Anyway, I think pretty much everything I said above still applies, but I guess you'll lean towards going with exporting to CSV or what have you rather than an automated display.
Sounds like a fun project -- good luck!
Ben
I'd do a "grep" on the files to find all the lines that contain your key words (grep -f can be used to specify an input file of words to search for; pipe the output of grep to a file). That will give you a list of lines which contain instances of your words. Then do a "sed" to replace your word separators (most likely spaces) with newlines, to give you a file of separate words (one word per line). Now run through grep again, with your same word list, except this time specify -c (to get a count of the lines with the specified words, i.e. a count of the occurrences of the word in the original file).
The two-pass method simply makes life easier for "sed"; the first grep should eliminate a lot of lines.
You can do this all with basic Linux command-line commands. Once you're comfortable with the process, you can put it all into a shell script pretty easily.
Another Perl attempt:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Tie::File;
# Usage:
#
# $ perl WordCount.pl <Files>
#
# Example:
#
# $ perl WordCount.pl *.text
#
# Counts words in all files given as arguments.
# The words are taken from the file "WordList".
# The output is appended to the file "WordCount.out" in the format implied in the
# following example:
#
# File,Word1,Word2,Word3,...
# File1,0,5,3,...
# File2,6,3,4,...
# .
# .
# .
#
### Configuration
my $CaseSensitive = 1; # 0 or 1
my $OutputSeparator = ","; # another option might be "\t" (TAB)
my $RemoveHyphenation = 0; # 0 or 1. Careful, may be too greedy.
###
my @WordList = read_file("WordList");
chomp @WordList;
tie (my @Output, 'Tie::File', "WordCount.out");
push (@Output, join ($OutputSeparator, "File", @WordList));
for my $InFile (@ARGV)
{ my $Text = read_file($InFile);
  if ($RemoveHyphenation) { $Text =~ s/-\n//g; };
  my %Count;
  for my $Word (@WordList)
  { if ($CaseSensitive)
    { $Count{$Word} = ($Text =~ s/(\b$Word\b)/$1/g); }
    else
    { $Count{$Word} = ($Text =~ s/(\b$Word\b)/$1/gi); }; };
  my $OutputLine = "$InFile";
  for my $Word (@WordList)
  { if ($Count{$Word})
    { $OutputLine .= $OutputSeparator . $Count{$Word}; }
    else
    { $OutputLine .= $OutputSeparator . "0"; }; };
  push (@Output, $OutputLine); };
untie @Output;
When I put your question in the file wc-test and Robert Gamble's answer into wc-ans-test, the Output file looks like this:
File,linux,frequencies,science,words
wc-ans-test,2,2,2,12
wc-test,1,3,1,3
This is a comma separated value (csv) file (but you can change the separator in the script). It should be readable for any spreadsheet application. For plotting graphs, I would recommend gnuplot, which is fully scriptable, so you can tweak your output independently of the input data.
To hell with big scripts. If you're willing to grab all words, try this shell fu:
cat *.txt | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c | sort -rn |
sed 's/[0-9] /&, /'
That (tested) will give you a list of all words sorted by frequency in CSV format, easily imported by your favorite spreadsheet. If you must have the stop words then try inserting grep -w -F -f stopwords.txt into the pipeline (not tested).
