Labelling text using Notepad++ or any other tool - python-3.x

I have several .dat, containing information about hotel reviews as below
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4`enter code here`
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1
*/
I want to classify the review into two pos and neg , i.e. have two folder pos and neg containing several files with reviews above 3 classified as positive and below 3 classified as negative.
How can I quickly and efficiently automate this process?

You could write up a python script to read the overall score. Do this by looping over the the lines using readline() See here. Find the "Overall" Score using some string parsing. Then move the file into the right directory. All very simple things to do in Python, just break it down into steps and search for answers to those steps.

Notepad++ can do replacements with regular expressions. And allows the definition of macros. Use them to convert the file to an XML file. Check out the help file.
Then you can read it with any scripting language and do what you want.
Alternatively you could change the file to a form where you can load it into Excel and do the analysis there.

Related

Excel: Export to XML - With XML in cells

I'm trying to export a spreadsheet that has some XML in some of the cells of the table.
ID (column A): 23455
FACT (column B) (this code is copied & pasted from a sample cell - they don't all have this simplicity or structure):
"<div class=""fact"">
<p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p>
</div>
"
I'd like to have XML like the following:
<record>
<ID>23455</ID>
<FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div></FACT>
</record>
This is complex enough that I doubt that Excel's native XML schema export will work (that thing is persnickety enough that I can't get it to work with simplest of data values).
My current thought is to write a Perl script, to read this as a CSV file and export XML. However, I've noticed that CSV does a poor job handling XML that's been "embedded" like this.
I'm hoping someone else might have a better suggestion for how to pull this information out.
Edit: Finally figured out the mistake I made with export. Can export and get the following:
<record>
<ID>23455</ID>
<FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div&gt
</FACT>
</record>
I think I can work with this...some regex and it might be good enough (looking for all < might put me at risk of killing a true less-than sign).
So I'm still open to suggestions
Just posting this as the answer...
If you export the column as text you can get the following:
<record>
<ID>23455</ID>
<FACT><div class="fact"><p><strong>FACT.</strong> The closest star to our solar system is Alpha Centauri.</p></div&gt
</FACT>
</record>
In an XML editor I did a find and replace to get all the tags using the following regex: s/<(\/?[\w\s="-_]+?)>/<$1>/
It's a bit dangerous if there are actual signs in the document, but you'd need a case where it was < /maybe and text with common tag symbols ="-_ > - possible but most equations are of the form X < Y < Z. Our content doesn't use <> all that much, so I can be fairly confident it won't catch the edge case.
I also "fixed" all the HTML (s/<b>/<b/>/ and s/<img (.*?)>/<img $1/>/) and checked parsing (theoretically an edge case would cause a parsing error).
And yes, I now have a doc in mixed DTD that will make all true XML peeps quake with horror, but I can work with it.

batch file extract numbers from text file with little information

So This is related to my other two posts. Im dealing with extracting text from a text file and analyzing it and I've run into some problems. For A while I've been using a method that sets all the text between two other strings as a variable, but here is the situation I have. I need to extract the speed (numbers) from the below string: "etc...,query":{"ping":47855},"cmts":...etc. The problem is that the text cmts sometimes changes to something else so really I need to extract all the numbers from this:
,query":{"ping":47855},"
One more thing that makes this difficult is that the characters }," Are all over the file. Thank you for helping me! -Lucas EDG Programmer.
Here's the full file:
{"_id":53291,"ip":"158.69.22.95","domain":"jectile.com","port":25565,"url":"","date_add":1453897770,"status":1,"scan":1,"uptime":99.53,"last_update":1485436105,"geo":{"country":"US","country_name":"United States","city":"Lake Forest"},"info":{"name":" Jectile | jectile.com [1.8-1.11]\n Shoota (Call of Duty) \/ Zambies (Zombie Survival)","type":"FML","version":"1.10","plugins":[],"players":18,"max_players":420,"players_list":[],"map":"world","software":"BungeeCord 1.8.x, 1.9.x, 1.10.x, 1.11.x","avg_player_day":24.458333,"avg_load_day":5.8234,"platform":"MINECRAFT","icon":true},"counter":{"online":47871,"offline":228,"players":{"date":"2017-01-26","total":0},"last_offline":0,"query":{"ping":47855},"cmts":1},"rating":{"main":19.24,"difference":-0.64,"content_up":0.15,"K":0},"last":{"offline":1485415702,"online":1485436105},"chart":{"14:30":14,"14:40":16,"14:50":15,"15:00":18,"15:10":12,"15:20":13,"15:30":9,"15:40":9,"15:50":11,"16:00":12,"16:10":11,"16:20":11,"16:30":18,"16:40":25,"16:50":23,"17:00":27,"17:10":27,"17:20":23,"17:30":24,"17:40":26,"17:50":33,"18:00":31,"18:10":31,"18:20":32,"18:30":37,"18:40":38,"18:50":39,"19:00":38,"19:10":34,"19:20":33,"19:30":40,"19:40":36,"19:50":37,"20:00":38,"20:10":36,"20:20":38,"20:30":37,"20:40":37,"20:50":37,"21:00":34,"21:10":32,"21:20":33,"21:30":33,"21:40":29,"21:50":28,"22:00":26,"22:10":21,"22:20":24,"22:30":29,"22:40":22,"22:50":23,"23:00":27,"23:10":24,"23:20":26,"23:30":25,"23:40":28,"23:50":27,"00:00":32,"00:10":29,"00:20":33,"00:30":32,"00:40":31,"00:50":33,"01:00":40,"01:10":40,"01:20":40,"01:30":41,"01:40":45,"01:50":48,"02:00":43,"02:10":45,"02:20":46,"02:30":46,"02:40":43,"02:50":42,"03:00":39,"03:10":36,"03:20":44,"03:30":34,"03:40":0,"03:50":32,"04:00":35,"04:10":35,"04:20":33,"04:30":43,"04:40":37,"04:50":26,"05:00":31,"05:10":31,"05:20":27,"05:30":25,"05:40":26,"05:50":18,"06:00":13,"06:10":15,"06:20":17,"06:30":18,"06:40":17,"06:50":15,"07:00":16,"07:10":17,"07:20":16,"07:30":16,"07:40":18,"07:50":19,"08:00":14,"08:10":12,"08:20":12,"08:30":13,"08:40":17,"08:50":20,"09:00":18,"09:10":0,"09:20":0,"09:30":27,"09:40":18,"09:50":20,"10:00":15,"10:10":13,"10:20":12,"10:30":10,"10:40":10,"10:50":11,"11:00":13,"11:10":13,"11:20":16,"11:30":19,"11:40":17,"11:50":13,"12:00":10,"12:10":11,"12:20":12,"12:30":16,"12:40":15,"12:50":16,"13:00":14,"13:10":10,"13:20":13,"13:30":16,"13:40":16,"13:50":17,"14:00":20,"14:10":16,"14:20":16},"query":"ping","max_stat":{"max_online":{"date":1470764061,"players":129}},"status_query":"ok"}
By the way, the reason things change is because it looks at info from different servers
Very similar to ther answer I gave you to your first question:
#Echo Off
Set/P var=<some.json
Set var=%var:*:{"ping":=%
Set var=%var:},=&:%
Echo=%var%
Timeout -1

Perl: Find duplicate in excel/csv file and write an output file with them

First of all I am a completely newbie (for now) with Perl, and I would like to ask you a quick advice.
I have to deal with a some lists of journals and publishers in different Excel/CSV files. I would like to find a way to cross the data in order to have the list of the titles & publishers in common between two files, and a list with the publisher and the number of journal published.
I would like to ask you if it is possible to do it with Perl (it should be the best method for what I understood, but I would like a confirmation!), and how advance it is.
Sorry for the strange request but I am writing my thesis and I would not like to spend time on something and discover that is not possible!
Thanks!
Yes.
To parse a CSV file:
Text::CSV_XS
To parse an Excel file:
Spreadsheet::ParseExcel
To find the common elements of two lists:
my %list1 = map { $_ => 1 } #list1;
my #common = grep $list1{$_}, #list2;

How to concatenate SVG files lengthwise from linux command line?

I have a series of square SVG files that I would like to arrange lengthwise into one super long SVG file.
I attempted to use imagemagick to combine them. Based on this page:
http://linux.about.com/library/cmd/blcmdl1_ImageMagick.htm
and this
http://www.imagemagick.org/Usage/compose/
I tried this command
composite 'file1.svg' 'file2.svg' +adjoin 'outputfile.svg'
However, I received the following error message:
composite: unrecognized option '+adjoin' # error/composite.c/CompositeImageCommand/565.
I tried several other imagemagick commands (convert, display), but had no success. How can I combine these files on the command line? Is there an inkscape command that does this?
There's currently no convenient way to do this with only the command line and no custom scripting.
Closest pre-written thing I could find currently (4-16-2012) is https://github.com/astraw/svg_stack, which lets you write commands of the form:
svg_stack.py --direction=h --margin=100 red_ball.svg blue_triangle.svg > shapes.svg
to concatenate.
It should be pretty easy if you're willing to use a scripting language. For each file, just add a prefix to all id tags; so in file 1, id="circle" becomes id="file1_circle", and in file 2, id="circle" becomes id="file2_circle".
In most cases you would get away with a trivial search and replace (find id=" and replace it with id="fileX_) although it is possible to have cases where this won't work (specifically if that find string appears in an item of text, for example).
If you want to do this 'the proper way', you'll need an XML parser (such as XMLReader in PHP).

Creating a tag cloud from text:nsp output

I am using Text::NSP which creates n-gram from text files. Is it possible to create tag clouds from an output file of Text-NSP? I have used and liked IBM Word Cloud Generator which only gives a tag cloud output from the frequency of each word within a file. However, I am working with 2-grams and 3-grams. In short, I need a tag cloud generator which will accept an input file with words and their occurrence number. I am running on Debian.
Thanks all.
I started to use R snippets package which is what I was searching for.
The output of text::nsp should be changed with some bash scripts in order to obtain a dataframe acceptable by the R.

Resources